Thread: Confronting the maximum column limitation

Confronting the maximum column limitation

From: Jeff Gentry
Hi there ...

I recently discovered that there is a hard cap on the number of columns,
which sits at 1600.  I also understand that it is generally unfathomable
that anyone would ever feel limited by that number ... however I've
managed to bump into it myself and was looking to see if anyone had
advice on how to manage the situation.

As a bit of background, we have a Postgres database to manage information
revolving around genomic datasets, including the datasets themselves.
The actual data is treated in other applications as a matrix, and while
it has made the DB design sub-optimal, the model of just stashing the
entire matrix in the DB has worked (the rest of the DB design is proper;
it's the storage of these matrices as-is that is unorthodox ... for the
convenience of having everything in the same storage unit as all of the
other information, it has been worth the extra headache and the potential
performance dings).

In these matrices, columns represent biological samples, rows represent
fragments of the genome and the cells are populated with values.  There
are a variety of row configurations (depending on what chip the samples
were handled on) which range in number from a few thousand to a few
hundred thousand (currently, it is constantly expanding upwards).  The
real problem lies with the columns (biological samples) in that it is
rarely the case that we'll have multiple matrices with overlap in columns
- and even in the cases where that happens, it is almost never a good idea
to treat them as the same thing.

Mind you, this is a world where having a set with a few hundred samples is
still considered pretty grandiose - I just happened to have one of the
very few out there which would come anywhere close to breaking the 1600
barrier and it is unlikely to really be an issue for at least a few (if
not more) years ... but looking down the road it'd be better to nip this
in the bud now than punt it until it becomes a real issue.

So I've seen the header file where the 1600 column limit is defined, and
I know the arguments that no one should ever want to come anywhere close
to that limit.  I'm willing to accept that these matrices could be stored
in some alternate configuration, although I don't really know what that
would be.  It's possible that the right answer is "pgsql just isn't
the right tool for this job", or even that punting it down the road is
the correct choice.  I was just hoping that some folks here might be able
to share their thoughts.


Re: Confronting the maximum column limitation

From: Steve Atkins
On Aug 12, 2008, at 1:15 PM, Jeff Gentry wrote:

> Hi there ...
>
> I recently discovered that there is a hard cap on the number of columns,
> which sits at 1600.  I also understand that it is generally unfathomable
> that anyone would ever feel limited by that number ... however I've
> managed to bump into it myself and was looking to see if anyone had
> advice on how to manage the situation.
>
> As a bit of background, we have a Postgres database to manage information
> revolving around genomic datasets, including the datasets themselves.
> The actual data is treated in other applications as a matrix, and while
> it has made the DB design sub-optimal, the model of just stashing the
> entire matrix in the DB has worked (the rest of the DB design is proper;
> it's the storage of these matrices as-is that is unorthodox ... for the
> convenience of having everything in the same storage unit as all of the
> other information, it has been worth the extra headache and the potential
> performance dings).

What operations do you perform on the data? If it's just store and
retrieve, can you serialize them into a bytea (or xml) field?
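
Something along these lines, say (purely a sketch -- the table and column
names here are invented): each matrix goes into one row, and the client
packs the cells into a single bytea blob.

    -- One row per matrix; the client packs the cell values into a single
    -- binary blob (e.g. raw float8s in row-major order).
    CREATE TABLE matrix_blob (
        matrix_id  serial  PRIMARY KEY,
        name       text    NOT NULL UNIQUE,
        n_rows     integer NOT NULL,   -- genome fragments
        n_cols     integer NOT NULL,   -- biological samples
        cells      bytea   NOT NULL    -- serialized matrix contents
    );

    -- Fetching a matrix is then a single-row lookup; any subsetting
    -- happens after the client has unpacked the blob.
    SELECT n_rows, n_cols, cells FROM matrix_blob WHERE name = 'chip_A_run_7';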

Cheers,
   Steve


Re: Confronting the maximum column limitation

From: Scott Marlowe
On Tue, Aug 12, 2008 at 2:15 PM, Jeff Gentry <jgentry@jimmy.harvard.edu> wrote:
> Hi there ...
>
> I recently discovered that there is a hard cap on the number of columns,
> which sits at 1600.  I also understand that it is generally unfathomable
> that anyone would ever feel limited by that number ... however I've
> managed to bump into it myself and was looking to see if anyone had
> advice on how to manage the situation.

The generic solution, without too much extra work, is to store similar
data types in an array type in the db.
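
Roughly like this (a sketch only; table and column names are made up):
one row per genome fragment per matrix, with the values for all samples
held in a float8 array.  That sidesteps the 1600-column limit and still
lets you subset on the server with array subscripts.

    CREATE TABLE matrix_row (
        matrix_id  integer  NOT NULL,
        fragment   text     NOT NULL,   -- genome fragment identifier
        vals       float8[] NOT NULL,   -- one element per biological sample
        PRIMARY KEY (matrix_id, fragment)
    );

    -- Grab samples 10 through 20 for a couple of fragments:
    SELECT fragment, vals[10:20] AS subset
    FROM matrix_row
    WHERE matrix_id = 1
      AND fragment IN ('frag_0001', 'frag_0002');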

Re: Confronting the maximum column limitation

From: Jeff Gentry
On Tue, 12 Aug 2008, Steve Atkins wrote:
> What operations do you perform on the data? If it's just store and
> retrieve, can you serialize them into a bytea (or xml) field?

Store & retrieve, although we take advantage of the fact that it's in a
DB to allow for subsetting (done at the postgres level), which cuts down
on client-side overhead as well as network traffic.

The DB is accessed by a variety of clients (including a webapp) which
could all perform that sort of work if necessary, although it's been nice
to subset at the DB level.  I'm not very familiar w/ the serialization
methods you're talking about - would that mean needing to do full
retrieval and subsetting on the client side? (definitely not a deal
breaker, I'm just trying to gather as many ideas w/ related info as
possible before bringing this whole issue up with the powers that be).


Re: Confronting the maximum column limitation

From: Jeff Gentry
On Tue, 12 Aug 2008, Scott Marlowe wrote:
> The generic solution, without too much extra work, is to store similar
> data types in an array type in the db.

That's a good idea.  I'll have to play w/ this one.  Thanks.


Re: Confronting the maximum column limitation

From: Steve Atkins
On Aug 12, 2008, at 2:11 PM, Jeff Gentry wrote:

> On Tue, 12 Aug 2008, Steve Atkins wrote:
>> What operations do you perform on the data? If it's just store and
>> retrieve, can you serialize them into a bytea (or xml) field?
>
> Store & retrieve, although we take advantage of the fact that it's in a
> DB to allow for subsetting (done at the postgres level), which cuts down
> on client-side overhead as well as network traffic.
>
> The DB is accessed by a variety of clients (including a webapp) which
> could all perform that sort of work if necessary, although it's been nice
> to subset at the DB level.  I'm not very familiar w/ the serialization
> methods you're talking about

I wasn't thinking of anything specific, more just some convenient way
of mapping the data structure into one or more larger chunks of data,
rather than one database column per matrix column.

It may well be possible to do some of the serialization in stored
functions in the database, to move that away from having to implement it
in each client.
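
For example (a rough sketch that assumes the array-per-row layout
suggested earlier in the thread; all the names are made up), a simple
SQL function can hand clients just the slice they ask for:

    -- Return only the requested range of samples for one matrix.
    CREATE OR REPLACE FUNCTION matrix_subset(p_matrix integer,
                                             p_from   integer,
                                             p_to     integer)
    RETURNS TABLE (fragment text, vals float8[]) AS $$
        SELECT m.fragment, m.vals[$2:$3]
        FROM matrix_row m
        WHERE m.matrix_id = $1;
    $$ LANGUAGE sql STABLE;

    -- Clients then call:
    --   SELECT * FROM matrix_subset(1, 10, 20);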

> - would that mean needing to do full retrieval and subsetting on the
> client side? (definitely not a deal breaker, I'm just trying to gather
> as many ideas w/ related info as possible before bringing this whole
> issue up with the powers that be).


Maybe, maybe not. It would depend on how you serialized it, what
your typical subsets were and so on.

Serialization isn't the only solution to storing this sort of data (EAV
of some sort would be another), but it's something worth looking at.
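
An EAV layout would look something like this (again just a sketch with
invented names): one row per cell, which is maximally flexible but a lot
bulkier and slower for big matrices.

    CREATE TABLE matrix_cell (
        matrix_id  integer NOT NULL,
        fragment   text    NOT NULL,   -- the "entity": genome fragment
        sample     text    NOT NULL,   -- the "attribute": biological sample
        value      float8,             -- the "value"
        PRIMARY KEY (matrix_id, fragment, sample)
    );

    -- Subsetting becomes ordinary filtering:
    SELECT fragment, sample, value
    FROM matrix_cell
    WHERE matrix_id = 1
      AND sample IN ('sample_007', 'sample_104');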

I think that what's sensible to do is going to depend on the details
of the data and (more so) the ways you access it.

Cheers,
   Steve


Re: Confronting the maximum column limitation

From: Scott Ribe
> The
> real problem lies with the columns (biological samples) in that it is
> rarely the case that we'll have multiple matrices with overlap in columns

Should each configuration have its own table, while inheriting from a common
base table?
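
Roughly like this (a sketch, names invented): the common identifying
columns live in the base table, and each configuration gets its own
child table with the sample columns it actually needs.

    -- Columns every matrix table shares:
    CREATE TABLE matrix_base (
        matrix_id  integer NOT NULL,
        fragment   text    NOT NULL
    );

    -- One child table per configuration, adding its own sample columns:
    CREATE TABLE matrix_chip_a (
        sample_001 float8,
        sample_002 float8
        -- ... further samples for this particular configuration
    ) INHERITS (matrix_base);

    -- Queries against the base table see the rows of every child:
    SELECT * FROM matrix_base WHERE fragment = 'frag_0001';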

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Confronting the maximum column limitation

From: Decibel!
On Aug 12, 2008, at 3:15 PM, Jeff Gentry wrote:
> So I've seen the header file where the 1600 column limit is defined


IIRC, that limit is directly related to block size in the header, so
one possible fix is to increase block size. AFAIK anything up to 64K
blocks should be safe.
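
(You can check what a running server was built with via a read-only
setting; actually changing it means recompiling PostgreSQL with a
different block size, it can't be altered on an existing cluster.)

    -- block_size is a read-only, compile-time setting (8192 by default):
    SHOW block_size;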

BTW, keep in mind that if you're storing anything that's a varlena
(anything that's variable length, including NUMERIC) where you have
that many columns, every single varlena is going to end up toasted.
That's bound to have a *serious* performance impact.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Confronting the maximum column limitation

From: Scott Marlowe
On Sat, Aug 16, 2008 at 1:28 PM, Decibel! <decibel@decibel.org> wrote:
> On Aug 12, 2008, at 3:15 PM, Jeff Gentry wrote:
>>
>> So I've seen the header file where the 1600 column limit is defined
>
>
> IIRC, that limit is directly related to block size in the header, so one
> possible fix is to increase block size. AFAIK anything up to 64K blocks
> should be safe.

Unless something's changed, I'm pretty sure things start breaking
after 32k blocks.

> BTW, keep in mind that if you're storing anything that's a varlena (anything
> that's variable length, including NUMERIC) where you have that many columns,
> every single varlena is going to end up toasted. That's bound to have a
> *serious* performance impact.

Yeah, usually you're better off using array types than 1600+ columns.