Thread: maximum "target list" (maximum columns)
The opinion of database developers is that using more than 1664 columns is bad design, so that the current maximum number of columns is not onerous.

When I simply built applications on top of databases, I was of the same opinion.

However, now my job is to perform some sort of analysis on data from elsewhere. Over the course of a couple of weeks I need to do something with data that is often structured into tables of many thousands of columns. Figuring out what the appropriate form is is part of manipulating the data, something for which I’d like to use postgres. My applications, such as they are, handle metadata, not data.

Operations such as crosstab or json_populate_record can easily create rows with many more than 1664 fields. I also use data such as census data, with many different survey fields all applying to geographic areas. For a given application only some are relevant, but ideally I’d create a materialized view for a given project with the relevant data. In the best case, the “library” version of the dataset would have just a few very wide tables. Storing in json is possible, but it is inconvenient, slower, and means that type info must often be stored separately.

In short, IMHO the database shouldn’t force me to structure my data before I understand it, and being able to query it is how I come to understand it. At some scale, practicality dictates that one needs to take “special measures” to handle large volumes of data, large in breadth as well as depth. But this boundary should be on the order of millions of columns, not thousands.

Is there a reason besides “it’s bad design” to disallow tables with many columns?
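To illustrate the json_populate_record case (the type, table, and column names below are invented for this sketch): the function only yields a typed row if a composite type with one attribute per field exists, so a dataset with thousands of survey fields needs a correspondingly wide type, and that is exactly where the 1664-attribute cap gets in the way.

    -- Hypothetical sketch: one composite attribute per survey field.
    CREATE TYPE survey_row AS (
        area_id   bigint,
        field_001 numeric,
        field_002 numeric
        -- ... one attribute per survey field, potentially thousands,
        -- which runs into the same column limit as an ordinary table
    );

    -- Expand one json document per geographic area into typed columns.
    SELECT (json_populate_record(NULL::survey_row, doc)).*
    FROM raw_survey_docs;   -- hypothetical table with one json document per row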
Shaun Cutts <shauncutts@factfiber.com> writes:
> Is there a reason besides “it’s bad design” to disallow tables with many
> columns?

There are numerous implementation problems you'd have to overcome.
There are illuminating comments in htup_details.h:

 * MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
 * The key limit on this value is that the size of the fixed overhead for
 * a tuple, plus the size of the null-values bitmap (at 1 bit per column),
 * plus MAXALIGN alignment, must fit into t_hoff which is uint8.  On most
 * machines the upper limit without making t_hoff wider would be a little
 * over 1700.  We use round numbers here and for MaxHeapAttributeNumber
 * so that alterations in HeapTupleHeaderData layout won't change the
 * supported max number of columns.

and

 * In any case, depending on column data types you will likely be running
 * into the disk-block-based limit on overall tuple size if you have more
 * than a thousand or so columns.  TOAST won't help.

Also, in the past we've noted planner and executor behaviors that are
O(N^2) in the number of columns in a query result.  I'm not sure those
are all gone, but if not you'd be in for some pain with very wide queries.

If someone were to throw lots of effort at the problem, and not care
about preserving on-disk database compatibility, no doubt all these
things could be dealt with.  But I don't see it getting to the top of
the community's TODO list.

            regards, tom lane
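For anyone wondering where the "a little over 1700" figure comes from, here is a rough back-of-the-envelope sketch. It assumes a 23-byte fixed HeapTupleHeaderData and a MAXALIGN of 8, which are typical but not universal values; the header comment's figure is somewhat lower because optional header fields compete for the same space.

    -- t_hoff is a uint8, so the aligned tuple header can be at most 248 bytes
    -- (255 rounded down to a multiple of MAXALIGN = 8). What remains after the
    -- ~23-byte fixed part is available for the null bitmap, at 1 bit per column.
    SELECT ((255 / 8) * 8 - 23) * 8 AS approx_column_ceiling;   -- 1800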
On 16/06/16 16:39, Shaun Cutts wrote:
> The opinion of database developers is that using more than 1664 columns is bad design, so that the current maximum number of columns is not onerous.
>
> When I simply built applications on top of databases, I was of the same opinion.
>
> However, now my job is to perform some sort of analysis on data from elsewhere. Over the course of a couple of weeks I need to do something with data that is often structured into tables of many thousands of columns. Figuring out what the appropriate form is is part of manipulating the data, something for which I’d like to use postgres. My applications, such as they are, handle metadata, not data.
>
> Operations such as crosstab or json_populate_record can easily create rows with many more than 1664 fields. I also use data such as census data, with many different survey fields all applying to geographic areas. For a given application only some are relevant, but ideally I’d create a materialized view for a given project with the relevant data. In the best case, the “library” version of the dataset would have just a few very wide tables. Storing in json is possible, but it is inconvenient, slower, and means that type info must often be stored separately.
>
> In short, IMHO the database shouldn’t force me to structure my data before I understand it, and being able to query it is how I come to understand it. At some scale, practicality dictates that one needs to take “special measures” to handle large volumes of data, large in breadth as well as depth. But this boundary should be on the order of millions of columns, not thousands.
>
> Is there a reason besides “it’s bad design” to disallow tables with many columns?

I've not had to deal with extremely fat database tables. However, back when I was a COBOL programmer, one system had a very big record type, more than 10 times bigger than anything else I'd ever dealt with. It was an extreme pain to debug problems with such a large record (try poring over hex dumps on lineflow!), so I expect it would be painful for a large database table.

Could you spread the huge number of columns into a number of database tables with some sort of logical grouping?

Cheers,
Gavin
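A minimal sketch of that kind of logical grouping, using invented census-style table and column names: the wide dataset is split vertically into group tables that share a key, and a project-specific materialized view joins back only the groups it needs.

    -- Hypothetical vertical split of one very wide dataset.
    CREATE TABLE geo_area (
        area_id bigint PRIMARY KEY,
        name    text
    );

    CREATE TABLE census_population (
        area_id     bigint PRIMARY KEY REFERENCES geo_area,
        pop_total   integer,
        pop_under18 integer
        -- ... more population columns
    );

    CREATE TABLE census_housing (
        area_id     bigint PRIMARY KEY REFERENCES geo_area,
        units_total integer,
        units_owned integer
        -- ... more housing columns
    );

    -- A per-project materialized view pulls in only the relevant groups.
    CREATE MATERIALIZED VIEW my_project AS
    SELECT g.area_id, g.name, p.pop_total, h.units_total
    FROM geo_area g
    JOIN census_population p USING (area_id)
    JOIN census_housing h USING (area_id);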
On 6/15/2016 9:39 PM, Shaun Cutts wrote:
> Is there a reason besides “it’s bad design” to disallow tables with many columns?

it would require significantly different internal architecture optimized for said huge rows.

-- 
john r pierce, recycling bits in santa cruz