Re: Selecting across Multiple Tables - Mailing list pgsql-general

From Tielman J de Villiers
Subject Re: Selecting across Multiple Tables
Date
Msg-id E1F206EC93DCD4119A05009027A41359064AD3@sbssvr.bondnet.co.za
Whole thread Raw
In response to Selecting across Multiple Tables  (Tielman J de Villiers <tjdevil@bondnet.co.za>)
List pgsql-general
Thank you Antonio,

I have played with a couple of possibilities, but none really gives good
enough performance.

The main change was creating indices on the unique_id columns of all 10
tables. That decreased the select time somewhat.
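For the record, the indexing looked roughly like this (a sketch; the table
and column names here are made up, since the real schema isn't shown):

```sql
-- Hypothetical names: table_1 .. table_10, each with a shared unique_id.
-- One index per table lets the planner use index scans for the joins
-- instead of sequential scans.
CREATE INDEX idx_table_1_unique_id ON table_1 (unique_id);
CREATE INDEX idx_table_2_unique_id ON table_2 (unique_id);
-- ... and so on for table_3 through table_10
```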

The main issue, however, is why postgres 7.1 works fine (and fast) on "WHERE"
clauses such as

>Where 1.id = (select last_value from sequence) and
>      2.id = (select last_value from sequence) and
>      3.id = (select last_value from sequence) and
>      4.id = (select last_value from sequence) and

while postgres 6.5.3 drops dead on it, and is also slow on a "WHERE" such as
>Where 1.id = (select last_value from sequence) and
>      2.id = 1.id and
>      3.id = 1.id and
>      4.id = 1.id and
>      5.id = 1.id and

Regards,

Tielman J de Villiers
BondNet Pty Ltd

-----Original Message-----
From: Antonio Fiol Bonnín [mailto:fiol@w3ping.com]
Sent: Wednesday, December 05, 2001 3:43 PM
To: Tielman J de Villiers
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Selecting across Multiple Tables


>
>
>I am now rewriting the structure to rather insert/update/delete the
>details in 10 little tables, all linked by a unique ID.
>
I see no reason to do so unless all 10 "attributes" (a, b, c, d, ...) are
very infrequently populated, and the tables would otherwise get far too big.

>The problem is selecting ALL details from all 10 the tables in this
>kind of
>format:
>
>Select a,b,c,
>       d,e,
>       f,g,
>       ...
>From 1,2,3,4,5,6,7,8,9,10
>Where 1.id = (select last_value from sequence) and
>      2.id = 1.id and
>      3.id = 1.id and
>      4.id = 1.id and
>      5.id = 1.id and
>      ...
>
>When I developed the new application, it was on postgres 7.1.3, and
>initially it caused a heavy load on the postmaster. This load was
>substantially reduced by changing the where statement to:
>
>      ...
>Where 1.id = (select last_value from sequence) and
>      2.id = (select last_value from sequence) and
>      3.id = (select last_value from sequence) and
>      4.id = (select last_value from sequence) and
>      ...
>
>When I now tried the same on the live server, running postgres 6.5.3,
>and any which way I try, I get an extremely heavy load on the
>postmaster -- with the last test I had a (cost=737.78 rows=11 width=40)
>when selecting only 1 column from table 1!
>
>Am I misunderstanding the "relational" data model completely or
>selecting wrongly?
>
I'm afraid performance should be substantially better with a single table
holding all the attributes, except in situations like:

Contact_name, Company_name, Phone_number, Billing_address

In this case, billing address is something "company-specific", and not
"contact-specific". So I'd see something like:

Table1: Contact_name, Phone_number, id_company
Table2: id_company, Company_name, Billing_address

That split makes sense if and only if you intend to have multiple contacts
for each company. If you only intend to have one contact per company,
performance (and file size) will be better if you keep everything in a
single table.
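In SQL, that two-table layout might look like this (a sketch only; the
column types are assumed, and I've kept to plain comma joins since the
live server is an old release):

```sql
-- Company-specific data lives once per company.
CREATE TABLE company (
    id_company      integer,
    company_name    text,
    billing_address text
);

-- Contact-specific data points back at the company via id_company.
CREATE TABLE contact (
    contact_name text,
    phone_number text,
    id_company   integer
);

-- Re-joining the data when you need a full contact record:
SELECT c.contact_name, c.phone_number, co.company_name, co.billing_address
FROM contact c, company co
WHERE c.id_company = co.id_company;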

Another situation where splitting tables could give good results is the
following:

Original Table: Report_number, magnitudeA, magnitudeB, magnitudeC

If reports usually only contain one magnitude, either A, B, or C, you
may think of splitting that into three tables:

Table A: Report_number, magnitudeA
Table B: Report_number, magnitudeB
Table C: Report_number, magnitudeC

You may substantially improve disk usage, but that depends mostly on how
your data is organised. In general, however, I would not expect splitting
tables to improve performance, since you will (again, in general) need to
re-join the information to access it.

Each case is different, and it depends much on your numbers.

As general advice:

EXPLAIN SELECT ...

and then analyse precisely what is happening. A long (high-cost) sequential
scan is bad, but an index scan can also be bad if it is not on the best
possible index. I recently optimised a query that used to take about 50
seconds down to sub-second, just by finding (and creating) the missing
index.

In your case, it should go without saying that if you keep the split
tables, you should index each of them on the ID you use to access the
data.
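A concrete way to apply this (the table and column names here are
hypothetical, just to show the workflow):

```sql
-- Check the plan before and after adding an index.
EXPLAIN SELECT a FROM table_1 WHERE id = 42;
-- "Seq Scan on table_1 (cost=...)" means every row is being read.

CREATE INDEX idx_table_1_id ON table_1 (id);

EXPLAIN SELECT a FROM table_1 WHERE id = 42;
-- The plan should now show an index scan using idx_table_1_id,
-- with a much lower cost estimate.
```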

I hope that helps, but I could be completely wrong.

Good luck!

Antonio

