Thread: Confronting the maximum column limitation
Hi there ...

I recently discovered that there is a hard cap on the number of columns, at 1600. I also understand that it is generally unfathomable that anyone would ever feel limited by that number ... however, I've managed to bump into it myself and was looking to see if anyone had advice on how to manage the situation.

As a bit of background, we have a Postgres database to manage information revolving around genomic datasets, including the datasets themselves. The actual data is treated in other applications as a matrix, and while it has made the DB design sub-optimal, the model has been to just stash the entire matrix in the DB (the rest of the design is proper, but storing these matrices straight up is unorthodox ... for the convenience of having everything in the same place as all of the other information, it has been worth the extra headache and potential performance dings).

In these matrices, columns represent biological samples, rows represent fragments of the genome, and the cells are populated with values. There are a variety of row configurations (depending on what chip the samples were handled on), which currently range from a few thousand rows to a few hundred thousand and are constantly expanding upwards. The real problem lies with the columns (biological samples), in that it is rarely the case that we'll have multiple matrices with overlap in columns - and even where that happens, it is almost never a good idea to treat them as the same thing.

Mind you, this is a world where a set with a few hundred samples is still considered pretty grandiose - I just happen to have one of the very few out there that would come anywhere close to breaking the 1600 barrier, and it is unlikely to really be an issue for at least a few (if not more) years ... but looking down the road, it'd be better to nip this in the bud now than punt it until it becomes a real issue.

So I've seen the header file where the 1600 column limit is defined, and I know the arguments that no one should ever want to come anywhere close to that limit. I'm willing to accept that these matrices could be stored in some alternate configuration, although I don't really know what that would be. It's possible that the right answer is "pgsql just isn't the right tool for this job", or even that punting it down the road is the correct choice. I was just hoping that some folks here might be able to give their thoughts.
On Aug 12, 2008, at 1:15 PM, Jeff Gentry wrote:

> Hi there ...
>
> I recently discovered that there is a hard cap on the number of
> columns, at 1600. I also understand that it is generally unfathomable
> that anyone would ever feel limited by that number ... however, I've
> managed to bump into it myself and was looking to see if anyone had
> advice on how to manage the situation.
>
> As a bit of background, we have a Postgres database to manage
> information revolving around genomic datasets, including the datasets
> themselves. The actual data is treated in other applications as a
> matrix, and while it has made the DB design sub-optimal, the model has
> been to just stash the entire matrix in the DB (the rest of the design
> is proper, but storing these matrices straight up is unorthodox ... for
> the convenience of having everything in the same place as all of the
> other information, it has been worth the extra headache and potential
> performance dings).

What operations do you perform on the data? If it's just store and retrieve, can you serialize them into a bytea (or xml) field?

Cheers,
  Steve
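To make the serialization idea concrete, a minimal sketch of what it could look like - table and column names are invented here, and the packing format is left entirely to the clients (or a stored function):

    -- one opaque blob per dataset; clients serialize/deserialize the matrix
    CREATE TABLE dataset_blob (
        dataset_id integer PRIMARY KEY,
        matrix     bytea NOT NULL   -- e.g. packed doubles, row-major
    );

    -- store and retrieve whole; any subsetting happens after deserialization
    -- INSERT INTO dataset_blob (dataset_id, matrix) VALUES (1, $1);
    -- SELECT matrix FROM dataset_blob WHERE dataset_id = 1;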
On Tue, Aug 12, 2008 at 2:15 PM, Jeff Gentry <jgentry@jimmy.harvard.edu> wrote:
> Hi there ...
>
> I recently discovered that there is a hard cap on the number of columns,
> at 1600. I also understand that it is generally unfathomable that anyone
> would ever feel limited by that number ... however, I've managed to bump
> into it myself and was looking to see if anyone had advice on how to
> manage the situation.

The generic solution, without making too much work, is to store similar data types in an arrayed type in the db.
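As a rough sketch of what the array approach might look like (all names here are invented for illustration): one row per genome fragment, with one array element per sample, and the sample names kept once per dataset in the same order as the array.

    CREATE TABLE expression_matrix (
        dataset_id    integer NOT NULL,
        fragment_id   integer NOT NULL,
        sample_values double precision[] NOT NULL,  -- one element per sample
        PRIMARY KEY (dataset_id, fragment_id)
    );

    CREATE TABLE dataset_sample (
        dataset_id   integer NOT NULL,
        sample_index integer NOT NULL,  -- 1-based, matches the array subscript
        sample_name  text    NOT NULL,
        PRIMARY KEY (dataset_id, sample_index)
    );

This sidesteps the column limit entirely, since the number of samples is just the array length rather than the table width.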
On Tue, 12 Aug 2008, Steve Atkins wrote:
> What operations do you perform on the data? If it's just store and
> retrieve, can you serialize them into a bytea (or xml) field?

Store & retrieve, although we take advantage of the fact that it's in a DB to allow for subsetting (done at the Postgres level), which cuts down on client-side overhead as well as network traffic.

The DB is accessed by a variety of clients (including a webapp) which could all perform that sort of work if necessary, although it's been nice to subset at the DB level. I'm not very familiar w/ the serialization methods you're talking about - would that have me needing to do full retrieval and subsetting on the client side? (Definitely not a deal breaker; I'm just trying to get as many ideas w/ related info as possible before bringing this whole issue up with the powers that be.)
On Tue, 12 Aug 2008, Scott Marlowe wrote:
> The generic solution, without making too much work, is to store similar
> data types in an arrayed type in the db.

That's a good idea. I'll have to play w/ this one. Thanks.
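Worth noting that arrays also keep the DB-side subsetting mentioned above: PostgreSQL supports slice subscripts on arrays, so (using the hypothetical expression_matrix layout sketched earlier) something like this pulls only samples 5 through 20 for a block of fragments, without shipping the whole matrix to the client:

    SELECT fragment_id, sample_values[5:20]
    FROM expression_matrix
    WHERE dataset_id = 42
      AND fragment_id BETWEEN 1000 AND 1999;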
On Aug 12, 2008, at 2:11 PM, Jeff Gentry wrote:
> On Tue, 12 Aug 2008, Steve Atkins wrote:
>> What operations do you perform on the data? If it's just store and
>> retrieve, can you serialize them into a bytea (or xml) field?
>
> Store & retrieve, although we take advantage of the fact that it's in a
> DB to allow for subsetting (done at the Postgres level), which cuts down
> on client-side overhead as well as network traffic.
>
> The DB is accessed by a variety of clients (including a webapp) which
> could all perform that sort of work if necessary, although it's been
> nice to subset at the DB level. I'm not very familiar w/ the
> serialization methods you're talking about

I wasn't thinking of anything specific, more just some convenient way of mapping the data structure into one or more larger chunks of data, rather than one column per cell. It may well be possible to do some of the serialization in stored functions in the database, to move that away from having to implement it in each client.

> - would that have me needing to do full retrieval and subsetting on the
> client side? (Definitely not a deal breaker; I'm just trying to get as
> many ideas w/ related info as possible before bringing this whole issue
> up with the powers that be.)

Maybe, maybe not. It would depend on how you serialized it, what your typical subsets are, and so on. Serialization isn't the only solution to storing this sort of data (EAV of some sort would be another), but it's something worth looking at.

I think that what's sensible to do is going to depend on the details of the data and (more so) the ways you access it.

Cheers,
  Steve
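For completeness, the EAV-style variant mentioned above would be one long, narrow table - names invented again - which keeps all subsetting in plain SQL at the cost of a very large row count (rows x columns cells become rows):

    CREATE TABLE matrix_cell (
        dataset_id  integer NOT NULL,
        fragment_id integer NOT NULL,
        sample_id   integer NOT NULL,
        value       double precision,
        PRIMARY KEY (dataset_id, fragment_id, sample_id)
    );

    -- subset by sample and/or fragment entirely on the server
    SELECT fragment_id, sample_id, value
    FROM matrix_cell
    WHERE dataset_id = 42
      AND sample_id IN (7, 12, 31);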
> The real problem lies with the columns (biological samples), in that it
> is rarely the case that we'll have multiple matrices with overlap in
> columns

Should each configuration have its own table, while inheriting from a common base table?

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
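A rough sketch of that suggestion (hypothetical names), with one child table per chip configuration inheriting from a shared base:

    CREATE TABLE matrix_base (
        dataset_id  integer NOT NULL,
        fragment_id integer NOT NULL
    );

    -- one table per configuration; sample columns specific to that chip
    CREATE TABLE matrix_chip_a (
        sample_001 double precision,
        sample_002 double precision
        -- ... more sample columns ...
    ) INHERITS (matrix_base);

Note this keeps the one-column-per-sample layout within each child table, so any single configuration that grows past 1600 samples would still hit the same cap.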
On Aug 12, 2008, at 3:15 PM, Jeff Gentry wrote:
> So I've seen the header file where the 1600 column limit is defined

IIRC, that limit is directly related to block size in the header, so one possible fix is to increase block size. AFAIK anything up to 64K blocks should be safe.

BTW, keep in mind that if you're storing anything that's a varlena (anything that's variable length, including numeric) where you have that many columns, every single varlena is going to end up toasted. That's bound to have a *serious* performance impact.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
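For reference, the block size a server was compiled with can be checked from SQL; the column cap itself is, if I recall the source layout of that era correctly, MaxHeapAttributeNumber in src/include/access/htup.h, and changing either means rebuilding from source.

    -- reports the compiled-in block size (8192 by default)
    SHOW block_size;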
On Sat, Aug 16, 2008 at 1:28 PM, Decibel! <decibel@decibel.org> wrote:
> On Aug 12, 2008, at 3:15 PM, Jeff Gentry wrote:
>> So I've seen the header file where the 1600 column limit is defined
>
> IIRC, that limit is directly related to block size in the header, so one
> possible fix is to increase block size. AFAIK anything up to 64K blocks
> should be safe.

Unless something's changed, I'm pretty sure things start breaking after 32k blocks.

> BTW, keep in mind that if you're storing anything that's a varlena
> (anything that's variable length, including numeric) where you have that
> many columns, every single varlena is going to end up toasted. That's
> bound to have a *serious* performance impact.

Yeah, usually you're better off using arrayed types than 1600+ columns.