Thread: JOIN column maximum

JOIN column maximum

From: Lee Hachadoorian
How is the number of columns in a join determined? When I combine somewhere around 90 tables in a JOIN, the query returns:

ERROR: joins can have at most 32767 columns
SQL state: 54000

I'm sure most people will say "Why the hell are you joining 90 tables." I've asked this list before for advice on how to work with the approximately 23,000 column American Community Survey dataset, and based on previous responses I am trying to combine 117 "sequences" (basically vertical partitions of the dataset) into one table using array columns. Of course, I can build this up by joining a few tables at a time, so the question is mostly curiosity, but I haven't been able to find this documented anywhere. Moreover, the 32767 limit doesn't map to any immediately intuitive transformation of 90, like squaring (which is much too low) or factorial (which is much too high).
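For concreteness, the failing query is essentially a long chain of joins like this (the table and column names here are simplified placeholders, not the real ACS identifiers):

    SELECT *
    FROM seq0001
    JOIN seq0002 USING (geoid)
    JOIN seq0003 USING (geoid)
    -- ...continuing through roughly 90 sequence tables...
    JOIN seq0090 USING (geoid);
    -- ERROR:  joins can have at most 32767 columns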

Any insight?

Regards,
--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/

Re: JOIN column maximum

From: Tom Lane
Lee Hachadoorian <lee.hachadoorian@gmail.com> writes:
> How is the number of columns in a join determined? When I combine somewhere
> around 90 tables in a JOIN, the query returns:

> ERROR: joins can have at most 32767 columns

It's the sum of the number of columns in the base tables.
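(32767 is just the largest value a signed 16-bit integer can hold, which is how
column numbers are represented internally. Ninety base tables averaging a few
hundred columns apiece sail past it; for example, 90 x 365 = 32850.)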

> I'm sure most people will say "Why the hell are you joining 90 tables."

Not only that, but why are you working with over-300-column tables?
Seems like your schema design needs rethinking.

> I've asked this list before for advice on how to work with the
> approximately 23,000 column American Community Survey dataset,

Are there really 23000 populated values in each row?  I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction.  You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
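Just to sketch the shape of what I mean (table and column names here are
invented, nothing ACS-specific):

    -- One row per (geography, variable) pair instead of one column per variable.
    CREATE TABLE acs_value (
        geoid    text NOT NULL,      -- geography identifier
        variable text NOT NULL,      -- Census variable code (placeholder values below)
        estimate numeric,
        moe      numeric,            -- margin of error
        PRIMARY KEY (geoid, variable)
    );

    -- Fetching a handful of variables is then a filter, not a 90-way join:
    SELECT geoid, variable, estimate
    FROM acs_value
    WHERE variable IN ('B19001_001', 'B19001_002');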

            regards, tom lane

Re: JOIN column maximum

From: Lee Hachadoorian
On 01/05/2012 06:18 PM, Tom Lane wrote:
>
>> ERROR: joins can have at most 32767 columns
> It's the sum of the number of columns in the base tables.
>
That makes sense. I totally misunderstood the message to be referring to
the number of joined columns rather than table columns.
>
>> I've asked this list before for advice on how to work with the
>> approximately 23,000 column American Community Survey dataset,
> Are there really 23000 populated values in each row?  I hesitate to
> suggest an EAV approach, but it kinda seems like you need to go in that
> direction.  You're never going to get decent performance out of a schema
> that requires 100-way joins, even if you avoid bumping up against hard
> limits.
Many of the smaller geographies, e.g. census tracts, do in fact have
data for the vast majority of the columns. I am trying to combine it all
into one table to avoid the slowness of multiple JOINs (even though in
practice I'm never joining all the tables at once). EAV sounds correct
in terms of normalization, but isn't it usually better performance-wise
to store write-once/read-many data in a denormalized (i.e. flattened)
fashion? One of these days I'll have to try to benchmark some different
approaches, but for now I'm planning on using array columns, with each
"sequence" (in the Census sense, not the Postgres sense) of 200+
variables in its own array rather than its own table.
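Roughly what I have in mind, with made-up names (a single flat table isn't an
option anyway, since Postgres caps a table at 1600 columns):

    -- One row per geography, one array column per Census sequence.
    CREATE TABLE acs_by_sequence (
        geoid   text PRIMARY KEY,
        seq0001 numeric[],   -- the 200+ estimates of sequence 1, in documented order
        seq0002 numeric[],
        seq0003 numeric[]
        -- ...one array column per sequence, 117 in all
    );

    -- A single variable is then pulled out by position (arrays are 1-based),
    -- e.g. the 5th variable of sequence 2:
    SELECT geoid, seq0002[5] FROM acs_by_sequence;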

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu


Re: JOIN column maximum

From: Scott Marlowe
On Thu, Jan 5, 2012 at 6:10 PM, Lee Hachadoorian
<lee.hachadoorian@gmail.com> wrote:

>
> Many of the smaller geographies, e.g. census tracts, do in fact have data
> for the vast majority of the columns. I am trying to combine it all into one
> table to avoid the slowness of multiple JOINs (even though in practice I'm
> never joining all the tables at once). EAV sounds correct in terms of
> normalization, but isn't it usually better performance-wise to store
> write-once/read-many data in a denormalized (i.e. flattened) fashion? One of
> these days I'll have to try to benchmark some different approaches, but for
> now planning on using array columns, with each "sequence" (in the Census
> sense, not the Postgres sense) of 200+ variables in its own array rather
> than its own table.

Are you using arrays or hstore?
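If it's hstore, the variable codes travel with the values instead of being
implied by array position; something like this (names made up):

    CREATE EXTENSION hstore;

    CREATE TABLE acs_by_sequence_hstore (
        geoid   text PRIMARY KEY,
        seq0001 hstore   -- e.g. '"B19001_001"=>"1234", "B19001_002"=>"567"'
    );

    -- Look a value up by its code (hstore values come back as text):
    SELECT geoid, seq0001 -> 'B19001_001' AS total_households
    FROM acs_by_sequence_hstore;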

Re: JOIN column maximum

From: Darren Duncan
Lee Hachadoorian wrote:
> On 01/05/2012 06:18 PM, Tom Lane wrote:
>> Are there really 23000 populated values in each row?  I hesitate to
>> suggest an EAV approach, but it kinda seems like you need to go in that
>> direction.  You're never going to get decent performance out of a schema
>> that requires 100-way joins, even if you avoid bumping up against hard
>> limits.
> Many of the smaller geographies, e.g. census tracts, do in fact have
> data for the vast majority of the columns. I am trying to combine it all
> into one table to avoid the slowness of multiple JOINs (even though in
> practice I'm never joining all the tables at once). EAV sounds correct
> in terms of normalization, but isn't it usually better performance-wise
> to store write-once/read-many data in a denormalized (i.e. flattened)
> fashion? One of these days I'll have to try to benchmark some different
> approaches, but for now planning on using array columns, with each
> "sequence" (in the Census sense, not the Postgres sense) of 200+
> variables in its own array rather than its own table.

EAV is not necessarily more correct than what you're doing.

The most correct solution is one where your database schema defines, and the
DBMS enforces, all of the constraints or business rules on your data, so that
you can not put something in the database that violates the business rules.

Traditional EAV, if you're talking about the common binary table of
unconstrained field-name,field-value pairs, is not an improvement.

A more correct solution is to use different columns for things with different
business rules or data types.  If the DBMS can't handle this then that is
grounds for improving the DBMS.

There's no reason that joins *have* to be slow, and in some DBMS designs you can
join in linear time; it's all about how you implement it.

This all being said, 23K values per row just sounds wrong, and I can't imagine
any census forms having that many details.

Do you, by chance, have multiple values of the same type that are in different
fields, e.g. telephone_1, telephone_2 or child_1, child_2 etc?  You should take
any of those and collect them into array-typed fields, or separate tables with
just telephone or child columns.  Or do you, say, have a set of coordinates in
separate fields?  Or you may have other kinds of redundancy within single rows
that are best normalized into separate rows.
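For example (names made up), instead of repeating columns like this:

    CREATE TABLE person (
        person_id   integer PRIMARY KEY,
        name        text,
        telephone_1 text,
        telephone_2 text
    );

you would move the repeats into rows of a child table:

    CREATE TABLE person_telephone (
        person_id integer NOT NULL REFERENCES person (person_id),
        telephone text NOT NULL,
        PRIMARY KEY (person_id, telephone)
    );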

With 23K values, these probably have many mutual associations, and you could
split that table into a bunch of other ones where columns that relate more
closely together are collected.

What I said in the last couple of paragraphs is probably the best place to start,
so that you end up with both a better design and better performance on the DBMS
you have.

-- Darren Duncan

Re: JOIN column maximum

From: Lee Hachadoorian
On Thu, Jan 5, 2012 at 10:19 PM, Darren Duncan <darren@darrenduncan.net> wrote:

> This all being said, 23K values per row just sounds wrong, and I can't imagine any census forms having that many details.
>
> Do you, by chance, have multiple values of the same type that are in different fields, e.g. telephone_1, telephone_2 or child_1, child_2 etc?  You should take any of those and collect them into array-typed fields, or separate tables with just telephone or child columns.  Or do you, say, have a set of coordinates in separate fields?  Or you may have other kinds of redundancy within single rows that are best normalized into separate rows.
>
> With 23K values, these probably have many mutual associations, and you could split that table into a bunch of other ones where columns that relate more closely together are collected.

My question is already answered, so this is mostly for anyone curious about Census data, and if anyone wants to add to this, feel free. 

You're right that no census form has that many details. The American Community Survey replaced the old Census long form, so it does have more details than the form sent out for the 2010 Census, but not 23,000 questions. It might ask about, e.g., income, sex, and how you travel to work. But the forms are private, so that data (the so-called microdata) is not released in its entirety.

What I am working with is called the summary file, which presents the data in aggregate. That means you have an income table with 16 income classes, plus a total column. Then you have 9 more tables which show the same income classes by 9 racial and ethnic categories, for a total of 153 more columns. Then you also have a table which crosses 9 income classes by 5 mobility statuses (living in same house, moved within county, moved within state, etc.) for a total of 55 columns. Then you have income classes crossed with sex, income classes crossed with mode of transportation to work, sex crossed with mode of transportation to work, etc. When all is said and done, you have 23,000 columns.

Believe me, I would love to be working directly with the microdata. All of these different ways of slicing and dicing the categories are basically how the Census Bureau provides as much detail as possible without providing so much detail that privacy would be compromised (i.e., enough information is available that specific individuals could be identified). That inevitably leads to a great deal of redundancy in the data, since the same individuals are just being counted in different groups in different tables.

Given all that, one could still take the data that came from Census and try to normalize it and organize it, but my main goal given the size of the dataset is to keep it as similar as possible to the source, so that a researcher familiar with the Census data can work with our database using the Census documentation without our having to produce a new set of documentation. The way I had done that was to store the sequences (which are merely a data dissemination convenience, and have no relationship to the logic of the data) in Postgres tables, and make the subject tables (which do represent logical groupings of data) into views. I'm thinking about making the sequences into array columns. The subject tables would still be views.
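Concretely, a subject table would become a view that pulls its variables out of the sequence arrays by position, something like this (the names and positions below are placeholders, not the real ACS codes):

    CREATE VIEW subject_household_income AS
    SELECT geoid,
           seq0049[1] AS total_households,
           seq0049[2] AS income_under_10k,
           seq0049[3] AS income_10k_to_15k
           -- ...and so on through the remaining income classes
    FROM acs_by_sequence;

That way a researcher still sees the subject tables they know from the Census documentation, while the storage underneath stays one row per geography.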

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/

Re: JOIN column maximum

From: Alban Hertroys
On 6 Jan 2012, at 22:56, Lee Hachadoorian wrote:

> What I am working with is called the summary file, which presents the data in aggregate. That means you have an income table with 16 income classes, plus a total column.

That could be a table: income class and income.

It may make sense to add a record for a "virtual" class 'TOTAL' there, that
incorporates the totals column, although I expect those same numbers could be
obtained by adding up the numbers in the other categories. If those numbers are
rounded though, then your totals can be off a bit that way (due to lack of precision).
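If you do store the 'TOTAL' row, you can at least cross-check it against the
sum of the other classes, e.g. (column names invented):

    SELECT geoid,
           sum(CASE WHEN income_class <> 'TOTAL' THEN households ELSE 0 END) AS summed_classes,
           sum(CASE WHEN income_class =  'TOTAL' THEN households ELSE 0 END) AS published_total
    FROM income_by_class
    GROUP BY geoid;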

> Then you have 9 more tables which show the same income classes by 9 racial and ethnic categories, for a total of 153 more columns.

That could be a table: income class (FK), ethnic category and income.

You could turn these into 9 separate views with one ethnic category pre-selected in each.

> Then you also have a table which crosses 9 income classes by 5 mobility statuses (living in same house, moved within county, moved within state, etc.) for a total of 55 columns.

Another table: income class (FK), mobility status, income.

> Then you have income classes crossed with sex, income classes crossed with mode of transportation to work, sex crossed with mode of transportation to work, etc.

...etc.

> When all is said and done, you have 23,000 columns.

You can definitely normalise that data much more than you are doing. I also think
you may be underestimating your researchers if you think they can't handle mapping
these Census sheets (which I'm unfamiliar with) to normalised tables.

You may even find that they can find out more from those normalised tables, as they
have the freedom to combine tables that the Census Bureau didn't combine for them.
Things like "how much do female Asians make compared to their mobility".
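Sketching that last example with invented names (and assuming the bureau
publishes a crossing at that level of detail):

    -- One row per geography/category combination instead of one column per crossing.
    CREATE TABLE income_by_sex_ethnicity_mobility (
        geoid           text,
        income_class    text,
        sex             text,
        ethnic_category text,
        mobility_status text,
        households      integer
    );

    -- "How do female Asian households break down by income class and mobility?"
    SELECT income_class, mobility_status, sum(households) AS households
    FROM income_by_sex_ethnicity_mobility
    WHERE sex = 'Female' AND ethnic_category = 'Asian'
    GROUP BY income_class, mobility_status;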

Alban Hertroys

--
The scale of a problem often equals the size of an ego.