Thread: Selecting across Multiple Tables
Hi,

This might be a general SQL question rather than a Postgres-specific one; nevertheless, I hope someone can point me in the right direction.

We run a consumer site with one main table capturing all the clients' data. This works fine. I am now rewriting the structure to instead insert/update/delete the details in 10 small tables, all linked by a unique ID. The problem is selecting ALL details from all 10 tables in this kind of format:

    SELECT a, b, c,
           d, e,
           f, g,
           ...
    FROM 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    WHERE 1.id = (SELECT last_value FROM sequence) AND
          2.id = 1.id AND
          3.id = 1.id AND
          4.id = 1.id AND
          5.id = 1.id AND
          ...

When I developed the new application, it was on Postgres 7.1.3, and initially it caused a heavy load on the postmaster. This load was substantially reduced by changing the WHERE clause to:

    ...
    WHERE 1.id = (SELECT last_value FROM sequence) AND
          2.id = (SELECT last_value FROM sequence) AND
          3.id = (SELECT last_value FROM sequence) AND
          4.id = (SELECT last_value FROM sequence) AND
          ...

When I now tried the same on the live server, running Postgres 6.5.3, whichever way I try it I get an extremely heavy load on the postmaster -- with the last test I had a (cost=737.78 rows=11 width=40) when selecting only one column from table 1!

Am I misunderstanding the "relational" data model completely, or am I selecting wrongly?

Thank you,
Tielman J de Villiers
BondNet Pty Ltd
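For concreteness, here is a minimal sketch of the query shape described above, using placeholder names (t1..t3 stand in for the 10 split tables, client_seq for the sequence; none of these are the poster's real object names):

    -- Placeholder schema: each split table keyed by the shared client id.
    CREATE TABLE t1 (id integer, a text, b text);
    CREATE TABLE t2 (id integer, c text);
    CREATE TABLE t3 (id integer, d text);
    CREATE SEQUENCE client_seq;

    -- The query shape in question: join every split table on the same id,
    -- taken from the sequence's last_value.
    SELECT t1.a, t1.b, t2.c, t3.d
    FROM t1, t2, t3
    WHERE t1.id = (SELECT last_value FROM client_seq)
      AND t2.id = t1.id
      AND t3.id = t1.id;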
> I am now rewriting the structure to rather insert/update/delete the
> details in 10 little tables, all linked by a unique ID.

I see no reason to do so unless all 10 "attributes" (a, b, c, d, ...) are populated very infrequently and a single table would otherwise get far too big.

> The problem is selecting ALL details from all 10 tables [...]
> When I now tried the same on the live server, running postgres 6.5.3,
> any which way I try I get an extremely heavy load on the postmaster [...]

I'm afraid performance should be substantially better with a single table holding all the attributes, except for situations like:

    Contact_name, Company_name, Phone_number, Billing_address

Here the billing address is "company-specific", not "contact-specific", so I'd see something like:

    Table1: Contact_name, Phone_number, id_company
    Table2: id_company, Company_name, Billing_address

...but if and only if you intend to have multiple contacts for each company. If you only intend to have one contact per company, performance (and file size) will be better if you stick everything in the same table.

Another situation where splitting tables can give good results is the following:

    Original table: Report_number, magnitudeA, magnitudeB, magnitudeC

If reports usually contain only one magnitude -- either A, B, or C -- you may think of splitting that into three tables:

    Table A: Report_number, magnitudeA
    Table B: Report_number, magnitudeB
    Table C: Report_number, magnitudeC

You may substantially improve disk usage, but it depends mostly on how your data is organised. In general, though, I don't expect performance to improve from splitting tables, because you will (in general, again) need to re-join the information to access it. Each case is different, and much depends on your numbers.

As general advice:

    EXPLAIN SELECT ...

and then analyse precisely what is happening. A long (high-cost) seq scan is bad, but an index scan can also be bad if it is not on the best possible index. I recently optimised a query that used to take about 50 seconds and now runs sub-second, just by finding (and creating) the missing index.

In your case, I probably need not say that if you keep the split tables, you should index all of them on the ID you use to access the data.

I hope that helps, but I could be completely wrong. Good luck!

Antonio
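As a concrete illustration of the indexing-plus-EXPLAIN advice above, here is a sketch using the same placeholder names (t1..t3, not the poster's real schema):

    -- Index each split table on the shared id so the joins can use
    -- index scans instead of sequential scans.
    CREATE INDEX t1_id_idx ON t1 (id);
    CREATE INDEX t2_id_idx ON t2 (id);
    CREATE INDEX t3_id_idx ON t3 (id);
    -- ...and so on for each split table.

    -- Ask the planner what it will actually do before running the query.
    EXPLAIN SELECT t1.a, t2.c
    FROM t1, t2
    WHERE t1.id = 42
      AND t2.id = t1.id;
    -- "Index Scan using t2_id_idx" is the hoped-for result; a "Seq Scan"
    -- on a large table suggests a missing or unusable index.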
Thank you Antonio,

I have played with a couple of possibilities, but none really gives good enough performance. The main change was creating indices on all the unique-ID columns of the 10 tables; that decreased the select time somewhat.

The main issue, however, is why Postgres 7.1 works fine (and quickly) with WHERE clauses such as:

    WHERE 1.id = (SELECT last_value FROM sequence) AND
          2.id = (SELECT last_value FROM sequence) AND
          3.id = (SELECT last_value FROM sequence) AND
          4.id = (SELECT last_value FROM sequence) AND
          ...

while Postgres 6.5.3 drops dead on it, and is also slow with:

    WHERE 1.id = (SELECT last_value FROM sequence) AND
          2.id = 1.id AND
          3.id = 1.id AND
          4.id = 1.id AND
          5.id = 1.id AND
          ...

Regards,
Tielman J de Villiers
BondNet Pty Ltd
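One workaround sketch that sidesteps the subselect entirely, assuming the application can issue two statements (placeholder names again, and untested on 6.5): fetch the sequence value first, then join on a plain literal, so the old planner never sees a subselect in the WHERE clause.

    -- Step 1: the application fetches the current sequence value itself.
    SELECT last_value FROM client_seq;   -- suppose it returns 42
    -- (currval('client_seq') is session-local and safer under concurrency,
    --  if the same session performed the nextval.)

    -- Step 2: join on the literal; even an old planner can turn this into
    -- straightforward index lookups on each split table.
    SELECT t1.a, t1.b, t2.c, t3.d
    FROM t1, t2, t3
    WHERE t1.id = 42
      AND t2.id = 42
      AND t3.id = 42;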
Tielman J de Villiers <tjdevil@bondnet.co.za> writes:

> The main issue, however, is why Postgres 7.1 works fine (and quickly)
> with WHERE clauses such as:
>
>     WHERE 1.id = (SELECT last_value FROM sequence) AND
>           2.id = (SELECT last_value FROM sequence) AND
>           ...
>
> while Postgres 6.5.3 drops dead on it, and is also slow with:
>
>     WHERE 1.id = (SELECT last_value FROM sequence) AND
>           2.id = 1.id AND
>           3.id = 1.id AND
>           ...

I would say the answer is that 7.1 has at least two years' worth of improvement and optimization over 6.5.3. There is no reason to be running 6.x in this day and age, and a lot of reasons not to be.

-Doug

--
Let us cross over the river, and rest under the shade of the trees.
 --T. J. Jackson, 1863
On Wed, Dec 05, 2001 at 02:43:07PM +0100, Antonio Fiol Bonnín wrote:

> As general advice: EXPLAIN SELECT ... and then analyse precisely what
> is happening. A long (high-cost) seq scan is bad, but an index scan
> can also be bad if it is not on the best possible index. [...]
> In your case, I probably need not say that if you keep the split
> tables, you should index all of them on the ID you use to access the
> data.

Very helpful nudgings... nice general advice! We need more like this. Nice work, Antonio!

> I hope that helps, but I could be completely wrong.

:)