Thread: Selecting across Multiple Tables

Selecting across Multiple Tables

From
Tielman J de Villiers
Date:
Hi,

This might be a general SQL question rather than a Postgres-specific one.

Nevertheless, I hope someone can point me in the right direction.

We run a consumer site with one main table capturing all the client's data.
This works fine.

I am now rewriting the structure to instead insert/update/delete the details
in 10 small tables, all linked by a unique ID.

The problem is selecting ALL the details from all 10 tables in this kind of
format:

Select a,b,c,
       d,e,
       f,g,
       ...
From 1,2,3,4,5,6,7,8,9,10
Where 1.id = (select last_value from sequence) and
      2.id = 1.id and
      3.id = 1.id and
      4.id = 1.id and
      5.id = 1.id and
      ...
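A runnable sketch of that shape of query, with made-up table and column names (SQLite standing in for Postgres, and three tables standing in for the ten):

```python
import sqlite3

# Toy version of the split-table layout: three small tables (the real
# schema has ten) sharing a unique ID.  All names here are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, a TEXT, b TEXT);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, c TEXT);
    CREATE TABLE t3 (id INTEGER PRIMARY KEY, d TEXT);
    INSERT INTO t1 VALUES (42, 'a42', 'b42');
    INSERT INTO t2 VALUES (42, 'c42');
    INSERT INTO t3 VALUES (42, 'd42');
""")

# Explicit JOIN ... ON syntax says the same thing as the comma-separated
# FROM list with id equalities in the WHERE clause, but makes the
# linking condition easier to read.
row = conn.execute("""
    SELECT t1.a, t1.b, t2.c, t3.d
    FROM t1
    JOIN t2 ON t2.id = t1.id
    JOIN t3 ON t3.id = t1.id
    WHERE t1.id = ?
""", (42,)).fetchone()
print(row)  # ('a42', 'b42', 'c42', 'd42')
```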

When I developed the new application it was on postgres 7.1.3, and
initially the query caused a heavy load on the postmaster. The load was
substantially reduced by changing the WHERE clause to:

      ...
Where 1.id = (select last_value from sequence) and
      2.id = (select last_value from sequence) and
      3.id = (select last_value from sequence) and
      4.id = (select last_value from sequence) and
      ...
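Another way to get the same effect is to evaluate the subselect once in the application and bind the result as a plain constant in every condition. A sketch of the idea (SQLite again, with MAX(id) standing in for the sequence's last_value, since SQLite has no sequences):

```python
import sqlite3

# Fetch the "last value" once, then bind it as a parameter; every
# equality in the WHERE clause then compares against a constant
# instead of a repeated subselect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, a TEXT);
    CREATE TABLE t2 (id INTEGER, b TEXT);
    INSERT INTO t1 VALUES (7, 'a7');
    INSERT INTO t2 VALUES (7, 'b7');
""")

(last_id,) = conn.execute("SELECT MAX(id) FROM t1").fetchone()
row = conn.execute(
    "SELECT t1.a, t2.b FROM t1, t2 WHERE t1.id = ? AND t2.id = ?",
    (last_id, last_id),
).fetchone()
print(row)  # ('a7', 'b7')
```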

When I now tried the same on the live server, running postgres 6.5.3,
whichever way I try it I get an extremely heavy load on the postmaster -- in
the last test I got (cost=737.78 rows=11 width=40) when selecting only 1
column from table 1!

Am I misunderstanding the "relational" data model completely or selecting
wrongly?

Thank You

Tielman J de Villiers
BondNet Pty Ltd

Re: Selecting across Multiple Tables

From
Antonio Fiol Bonnín
Date:
>
>
>I am now rewriting the structure to rather insert/update/delete the details
>in 10 little tables, all linked by a unique ID.
>
I see no reason to do so unless all 10 "attributes" (a, b, c, d, ...) are
populated very infrequently and the tables would otherwise get far too big.

>The problem is selecting ALL the details from all 10 tables in this kind of
>format:
>
>Select a,b,c,
>       d,e,
>       f,g,
>       ...
>From 1,2,3,4,5,6,7,8,9,10
>Where 1.id = (select last_value from sequence) and
>      2.id = 1.id and
>      3.id = 1.id and
>      4.id = 1.id and
>      5.id = 1.id and
>      ...
>
>When I developed the new application, it was on postgres 7.1.3, and
>initially it caused a heavy load on the postmaster. This load was
>substantially reduced by changing the where statement to:
>
>      ...
>Where 1.id = (select last_value from sequence) and
>      2.id = (select last_value from sequence) and
>      3.id = (select last_value from sequence) and
>      4.id = (select last_value from sequence) and
>      ...
>
>When I now tried the same on the live server, running postgres 6.5.3, and
>any which way I try, I get an extremely heavy load on the postmaster -- with
>the last test I had a (cost=737.78 rows=11 width=40) when selecting only 1
>column from table 1!
>
>Am I misunderstanding the "relational" data model completely or selecting
>wrongly?
>
I'm afraid performance would be substantially better if you had a single
table with all the attributes, except in situations like:

Contact_name, Company_name, Phone_number, Billing_address

In this case, billing address is something "company-specific", and not
"contact-specific". So I'd see something like:

Table1: Contact_name, Phone_number, id_company
Table2: id_company, Company_name, Billing_address

That is, if and only if you intend to have multiple contacts for each
company. If you only intend to have one contact per company, performance
(and file size) will be better if you stick everything in the same table.
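The multiple-contacts case can be sketched like this (SQLite standing in for Postgres; the data is invented, and the names follow the example above):

```python
import sqlite3

# Two contacts share one company row, so the billing address is stored
# once and re-joined per contact -- the case where splitting pays off.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id_company INTEGER PRIMARY KEY,
                          company_name TEXT, billing_address TEXT);
    CREATE TABLE contact (contact_name TEXT, phone_number TEXT,
                          id_company INTEGER REFERENCES company);
    INSERT INTO company VALUES (1, 'Acme', '1 Main St');
    INSERT INTO contact VALUES ('Ann', '555-0001', 1);
    INSERT INTO contact VALUES ('Bob', '555-0002', 1);
""")
rows = conn.execute("""
    SELECT contact_name, company_name, billing_address
    FROM contact JOIN company USING (id_company)
    ORDER BY contact_name
""").fetchall()
print(rows)
# [('Ann', 'Acme', '1 Main St'), ('Bob', 'Acme', '1 Main St')]
```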

Another situation that could lead to good results by splitting tables is
the following:

Original Table: Report_number, magnitudeA, magnitudeB, magnitudeC

If reports usually only contain one magnitude, either A, B, or C, you
may think of splitting that into three tables:

Table A: Report_number, magnitudeA
Table B: Report_number, magnitudeB
Table C: Report_number, magnitudeC

You may substantially improve disk usage, but that depends mostly on how
your data is organised. In general, however, I don't expect performance to
improve from splitting tables, since you will (again, in general) need to
re-join the information to access it.
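A sketch of the sparse-report split, showing the LEFT JOINs needed to re-assemble a full report (SQLite standing in for Postgres; the data is invented):

```python
import sqlite3

# Most reports carry only one magnitude, so the split tables avoid
# storing NULLs -- but reading a whole report back means re-joining,
# with LEFT JOINs since any magnitude may be absent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE report (report_number INTEGER PRIMARY KEY);
    CREATE TABLE mag_a (report_number INTEGER, magnitudeA REAL);
    CREATE TABLE mag_b (report_number INTEGER, magnitudeB REAL);
    CREATE TABLE mag_c (report_number INTEGER, magnitudeC REAL);
    INSERT INTO report VALUES (1);
    INSERT INTO mag_a VALUES (1, 3.5);
""")
row = conn.execute("""
    SELECT r.report_number, a.magnitudeA, b.magnitudeB, c.magnitudeC
    FROM report r
    LEFT JOIN mag_a a USING (report_number)
    LEFT JOIN mag_b b USING (report_number)
    LEFT JOIN mag_c c USING (report_number)
""").fetchone()
print(row)  # (1, 3.5, None, None)
```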

Each case is different, and it depends much on your numbers.

As general advice: run

EXPLAIN SELECT ...

and then analyse precisely what is happening. A long (high-cost) sequential
scan is bad, but an index scan can also be bad if it is not on the best
possible index. I recently optimised a query that used to take about 50
seconds; it is now sub-second, just from finding (and creating) the missing
index.
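The workflow can be sketched with SQLite's EXPLAIN QUERY PLAN (PostgreSQL's EXPLAIN output looks different, but the check-plan, add-index, re-check habit is the same; exact plan strings vary by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE details (id INTEGER, payload TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan text in their last column.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT payload FROM details WHERE id = 42"
before = plan(q)   # a full-table scan, e.g. "SCAN details"
conn.execute("CREATE INDEX details_id_idx ON details (id)")
after = plan(q)    # an index lookup, e.g. "SEARCH details USING INDEX ..."
print(before)
print(after)
```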

In your case, it should go without saying that if you keep the split
tables, you should index all of them on the ID you use to access the data.

I hope that helps, but I could be completely wrong.

Good luck!

Antonio



Re: Selecting across Multiple Tables

From
Tielman J de Villiers
Date:
Thank you Antonio,

I have played with a couple of possibilities, but none really gives good
enough performance.

The main change was creating indexes on all the unique_id columns of the 10
tables; that decreased the select time somewhat.

The main issue, however, is why postgres 7.1 works fine (and quickly) with
a "WHERE" clause such as

>Where 1.id = (select last_value from sequence) and
>      2.id = (select last_value from sequence) and
>      3.id = (select last_value from sequence) and
>      4.id = (select last_value from sequence) and

While postgres 6.5.3 drops dead on it, and is slow even on a "WHERE" clause such as
>Where 1.id = (select last_value from sequence) and
>      2.id = 1.id and
>      3.id = 1.id and
>      4.id = 1.id and
>      5.id = 1.id and

Regards,

Tielman J de Villiers
BondNet Pty Ltd

-----Original Message-----
From: Antonio Fiol Bonnín [mailto:fiol@w3ping.com]
Sent: Wednesday, December 05, 2001 3:43 PM
To: Tielman J de Villiers
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Selecting across Multiple Tables




Re: Selecting across Multiple Tables

From
Doug McNaught
Date:
Tielman J de Villiers <tjdevil@bondnet.co.za> writes:

> The main issue however is why postgres 7.1 works fine (and quick) on "WHERE"
> statements such as
>
> >Where 1.id = (select last_value from sequence) and
> >      2.id = (select last_value from sequence) and
> >      3.id = (select last_value from sequence) and
> >      4.id = (select last_value from sequence) and
>
> While postgres 6.5.3 drops dead on it, and is slow even on a "WHERE" clause such as
> >Where 1.id = (select last_value from sequence) and
> >      2.id = 1.id and
> >      3.id = 1.id and
> >      4.id = 1.id and
> >      5.id = 1.id and

I would say the answer is that 7.1 has at least 2 years' worth of
improvement and optimization over 6.5.2.  There is no reason to be
running 6.X in this day and age, and a lot of reasons not to be.

-Doug
--
Let us cross over the river, and rest under the shade of the trees.
   --T. J. Jackson, 1863

Re: Selecting across Multiple Tables

From
will trillich
Date:
On Wed, Dec 05, 2001 at 02:43:07PM +0100, Antonio Fiol Bonnín wrote:
> As general advice: run
>
> EXPLAIN SELECT ...
>
> and then analyse precisely what is happening. A long (high-cost)
> sequential scan is bad, but an index scan can also be bad if it is
> not on the best possible index. I recently optimised a query that
> used to take about 50 seconds; it is now sub-second, just from
> finding (and creating) the missing index.
>
> In your case, it should go without saying that if you keep the
> split tables, you should index all of them on the ID you use to
> access the data.

very helpful nudgings...  nice general advice!  we need more like
this. nice work, Antonio!

> I hope that helps, but I could be completely wrong.

:)

--
DEBIAN NEWBIE TIP #99 from Greg Wiley <greg@orthogony.com>
:
Wondering WHERE TO SPECIFY BASH DEFAULTS? In case you are in X
and launching a term, .bashrc is automatically loaded (instead
of .bash_profile, .profile, .login) when bash is not a login
shell.

Also see http://newbieDoc.sourceForge.net/ ...