Re: 500rows = 1min/2.5k rows=20min/6K rows 2 hours and still running - Mailing list pgsql-performance
From | Ow Mun Heng |
---|---|
Subject | Re: 500rows = 1min/2.5k rows=20min/6K rows 2 hours and still running |
Date | |
Msg-id | 1189751427.22514.78.camel@neuromancer.home.net Whole thread Raw |
In response to | Re: 500rows = 1min/2.5k rows=20min/6K rows 2 hours and still running (Nis Jørgensen <nis@superlativ.dk>) |
List | pgsql-performance |
On Wed, 2007-09-12 at 15:14 +0200, Nis Jørgensen wrote: > El-Lotso skrev: > > > I'm on the verge of giving up... the schema seems simple and yet there's > > so much issues with it. Perhaps it's the layout of the data, I don't > > know. But based on the ordering/normalisation of the data and the one to > > many relationship of some tables, this is giving the planner a headache > > (and me a bulge on the head from knockin it against the wall) > > I think you should look more at the db design, and less on rewriting the > query. Here are some observations: I can't help much with the design per-se. So.. > > - Your table structure is quite hard to understand (possibly because you > have changed the names) - if you want help on improving it, you will > need to explain the data to us, and possibly post some sample data. If anyone is willing, I can send some sample data to you off-list. on the trh table, hid is a subset of data for a particular id. eg: PARENT : CHILD 1 PARENT : CHILD 2 PARENT : CHILD 3 PARENT : CHILD 4 uniqueid = merged fields from id / index1 / index2 / start_timestamp(IN EPOCH) / phase_id / ttype which is unique on each table (but not across ALL the tables) > - You seem to be lacking constraints on the tables. My guess is that > (id,ttype,start_timestamp) is unique in both trh and ts - but I cannot > tell (and neither can the query planner). Foreign key constraints might > help as well. These would also help others to understand your data, and > suggest reformulations of your queries. AFAICT, there are no foreign constraints in the original DB design. (and I'm not even sure how to begin the FK design based on this org design) the unique_id is as above. TRH/TRD uniqueid = merged fields from id / index1 / index2 / start_timestamp(IN EPOCH) / phase_id / ttype TS uniqueid = merged fields from id / start_timestamp(IN EPOCH) / ttype Problem with this is that the fields in which they are unique is different across the different tables, so the unique_id is only unique for that table alone and acts as a primary key so that no dupes exists in that one table. > - Another guess is that the ttype sets (177,197,705,742,758,766), > (69,178,198,704,757,741,765) are actually indicating some other property > a common "type" of record, and that only one of each will be present for > an id,start_timestamp combination. This may be related to the repeatingd > fields issue - if a certain ttype indicates that we are interested in a > certain pber_x field (and possibly that the others are empty). yes.. eg: id | hid |ttype | start_timestamp | pber_2 | pber 3 |pber_4 PARENT | 0 |764 | 2007-07-01 00:00 | 4000 | null | null PARENT | 0 |765 | 2007-07-01 00:00 | null | 9000 | null PARENT | 0 |766 | 2007-07-01 00:00 | null | null | 7999 PARENT | 1 |764 | 2007-07-01 00:00 | 4550 | null | null PARENT | 1 |765 | 2007-07-01 00:00 | null | 9220 | null PARENT | 1 |766 | 2007-07-01 00:00 | null | null | 6669 the subqueries are just to take out the fields with the value and leave the nulls so that we end-up with id |hid| start_timestamp |pber_2 | pber 3 | pber_4 PARENT | 0 | 2007-07-01 00:00 | 4000 | 9000 | 7999 PARENT | 1 | 2007-07-01 00:00 | 4550 | 9220 | 6669 which is basically just joining a table by itself, but there is a caveat whereby pber_3 and pber_4 is/can only be joined together based on the seq_date/seq_time in the ts table hence the query.. JOIN1.id = join2.id and join1.seq_date = join2.seq_date etc.. but the problem is confounded by the fact that there is numerous hid values for head id > - You have what looks like repeating fields - pber_x, fval_x, index_x - > in your tables. Fixing this might not improve your query, but might be a > good idea for other reasons. it's being looked at by some other team to collapse this to something like this ttype | pber 764 | 500 765 | 600 766 | 700 so that there are lesser # of columns and no null fields. But the query will remain the same > - seq_date and seq_time seems like they may be redundant - are they > different casts of the same data? No. They're used to join together the pber_2/3/4 fields as one may happen between a few hours to days between each other, but each will be uniquely identified by the seq_date/time eg : id | pber_2 | seq_date | seq time PARENT | 400 | 2007-07-01 00:00:00 | 1980-01-01 20:00:00 PARENT | 410 | 2007-07-10 00:00:00 | 1980-01-01 22:00:00 id | pber_3 | seq_date | seq time PARENT | 900 | 2007-07-01 00:00:00 | 1980-01-01 20:00:00 PARENT | 100 | 2007-07-10 00:00:00 | 1980-01-01 22:00:00 id | pber_4 | seq_date | seq time PARENT | 10000 | 2007-07-01 00:00:00 | 1980-01-01 20:00:00 PARENT | 999 | 2007-07-10 00:00:00 | 1980-01-01 22:00:00 so, the correct value for the fields when joined together will be of the form id |start_timestamp |seq_date | seq_time |pber_2 | pber 3 | pber_4 PARENT |2007-07-01 00:00 |2007-07-01 00:00:00 | 1980-01-01 20:00:00| 400 | 900 | 10000 PARENT |2007-07-01 00:00 |2007-07-10 00:00:00 | 1980-01-01 22:00:00| 410 | 100 | 999 (repeating for each hid subset value) > All speculation. Hope it helps anything would help.. I'm more or less willing to try anything to make things faster else this project is going to the toilet.
pgsql-performance by date: