Thread: Inner join question
Greetings all,

I am trying to do what should be a simple join, but the tables are large and it is taking a long, long time. I have the feeling that I have stuffed up something in the syntax. Here is what I have:

telemetry=> select (tq1.timestamp = tq2.timestamp) as timestamp,
telemetry->        tq1.value as q1, tq2.value as q2
telemetry->   from cal_quat_1 tq1
telemetry->   inner join cal_quat_2 as tq2 using (timestamp)
telemetry->  where timestamp > '2004-01-12 09:47:56.0000 +0'
telemetry->    and timestamp < '2004-01-12 09:50:44.7187 +0'
telemetry->  order by timestamp;

telemetry=> \d cal_quat_1
       Table "cal_quat_1"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 timestamp | timestamp with time zone |
 value     | double precision         |

telemetry=> \d cal_quat_2
       Table "cal_quat_2"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 timestamp | timestamp with time zone |
 value     | double precision         |

My understanding of an inner join is that the query above will restrict to finding tq1.timestamp, tq1.value, and then move on to tq2.value to search the subset. I have tried this with and without the '=' sign and it isn't clear whether it makes any difference at all (the timestamps are identical in the range of interest). I have not allowed the query to finish, as it seems to take more than 10 minutes. Both timestamps are indexed and I expect about 150 rows to be returned. At the end of the day, I have four identical tables of quaternions (timestamp, value) and I need to extract them all for a range of timestamps.

Cheers,
Randall
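[Editor's note: one quirk worth flagging in the query above is the select list itself. `(tq1.timestamp = tq2.timestamp) as timestamp` returns the *boolean result* of the comparison, not the timestamp; with `using (timestamp)` the merged join column can be selected directly. A minimal sketch of the difference, using SQLite from Python since it is self-contained (the tables and data here are invented stand-ins for the telemetry schema):]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE cal_quat_1 (ts TEXT, value REAL);
    CREATE TABLE cal_quat_2 (ts TEXT, value REAL);
    INSERT INTO cal_quat_1 VALUES ('2004-01-12 09:48:00', 0.1);
    INSERT INTO cal_quat_2 VALUES ('2004-01-12 09:48:00', 0.2);
""")

# (tq1.ts = tq2.ts) AS ts yields the comparison result (1/true), not a timestamp.
bad = cur.execute("""
    SELECT (tq1.ts = tq2.ts) AS ts, tq1.value, tq2.value
      FROM cal_quat_1 tq1 JOIN cal_quat_2 tq2 ON tq1.ts = tq2.ts
""").fetchone()
print(bad)   # (1, 0.1, 0.2)

# Selecting the join column itself returns the actual timestamp.
good = cur.execute("""
    SELECT tq1.ts AS ts, tq1.value, tq2.value
      FROM cal_quat_1 tq1 JOIN cal_quat_2 tq2 ON tq1.ts = tq2.ts
""").fetchone()
print(good)  # ('2004-01-12 09:48:00', 0.1, 0.2)
```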
Randall Skelton wrote:
> Greetings all,
>
> I am trying to do what should be a simple join but the tables are
> large and it is taking a long, long time. I have the feeling that I
> have stuffed up something in the syntax. [query and table definitions
> snipped]

We need more information to be able to help further. Can you supply:

1. Total number of rows in each table.
2. Results from "explain analyze <your query>".
3. Key configuration values from postgresql.conf.
4. Basic hardware config (CPU type and number, total RAM, HDD type, size and speed).

But in the meantime, can you try the following query instead:

select tq1.timestamp as timestamp, tq1.value as q1, tq2.value as q2
  from cal_quat_1 tq1, cal_quat_2 tq2
 where tq1.timestamp = tq2.timestamp
   and tq1.timestamp between '2004-01-12 09:47:56.0000 +0'::timestamp with time zone
                         and '2004-01-12 09:50:44.7187 +0'::timestamp with time zone
 order by tq1.timestamp;

As far as I know (and someone please correct me), this allows the planner the most flexibility when figuring out the optimum plan.

Thanks,
Nick
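[Editor's note: the implicit comma-style join with a WHERE clause and the explicit INNER JOIN ... ON/USING form produce the same rows; the difference Nick is after is only in how much freedom the planner has. A small sanity check of the equivalence, sketched in SQLite with invented data:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE cal_quat_1 (ts TEXT, value REAL);
    CREATE TABLE cal_quat_2 (ts TEXT, value REAL);
    INSERT INTO cal_quat_1 VALUES ('2004-01-12 09:48:00', 0.1), ('2004-01-12 09:49:00', 0.3);
    INSERT INTO cal_quat_2 VALUES ('2004-01-12 09:48:00', 0.2), ('2004-01-12 09:49:00', 0.4);
""")

# Explicit join syntax, as in the original post.
explicit = cur.execute("""
    SELECT tq1.ts, tq1.value, tq2.value
      FROM cal_quat_1 tq1 INNER JOIN cal_quat_2 tq2 ON tq1.ts = tq2.ts
     WHERE tq1.ts BETWEEN '2004-01-12 09:47:56' AND '2004-01-12 09:50:44'
     ORDER BY tq1.ts
""").fetchall()

# Implicit comma join with the condition in WHERE, as Nick suggests.
implicit = cur.execute("""
    SELECT tq1.ts, tq1.value, tq2.value
      FROM cal_quat_1 tq1, cal_quat_2 tq2
     WHERE tq1.ts = tq2.ts
       AND tq1.ts BETWEEN '2004-01-12 09:47:56' AND '2004-01-12 09:50:44'
     ORDER BY tq1.ts
""").fetchall()

print(explicit == implicit)  # True: same rows either way
```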
Hi,

Try this on the psql console:

explain analyze select tq1.*, tq2.*
  from cal_quat_1 tq1, cal_quat_2 tq2
 where tq1.timestamp = tq2.timestamp
   and tq1.timestamp > '2004-01-12 09:47:56.0000 +0'::timestamp with time zone
   and tq1.timestamp < '2004-01-12 09:50:44.7187 +0'::timestamp with time zone
 order by tq1.timestamp;

... and examine the generated query plan (or post it).

regards,
pajout

P.S. And what about VACUUM FULL ANALYZE? :)

Randall Skelton wrote:
> I am trying to do what should be a simple join but the tables are
> large and it is taking a long, long time. I have the feeling that I
> have stuffed up something in the syntax. [query, table definitions,
> and details snipped]
The main problem was that there wasn't an index on cal_quat_1. The other indexes were fine, so I don't know what happened to the first one... Nevertheless, it still takes longer than I would like. As requested:

telemetry=> explain analyze select tq1.*, tq2.* from
telemetry-> cal_quat_1 tq1, cal_quat_2 tq2
telemetry-> where tq1.timestamp = tq2.timestamp
telemetry-> and tq1.timestamp > '2004-01-12 09:47:56.0000 +0'::timestamp with time zone
telemetry-> and tq1.timestamp < '2004-01-12 09:50:44.7187 +0'::timestamp with time zone
telemetry-> order by tq1.timestamp;

NOTICE:  QUERY PLAN:

Merge Join  (cost=517417.89..2795472.80 rows=177664640 width=32) (actual time=64878.04..64936.41 rows=142 loops=1)
  ->  Index Scan using cal_quat_1_timestamp on cal_quat_1 tq1  (cost=0.00..50549.03 rows=13329 width=16) (actual time=73.29..129.66 rows=142 loops=1)
  ->  Sort  (cost=517417.89..517417.89 rows=2665818 width=16) (actual time=62310.53..63727.33 rows=1020155 loops=1)
        ->  Seq Scan on cal_quat_2 tq2  (cost=0.00..43638.18 rows=2665818 width=16) (actual time=14.12..13462.19 rows=2665818 loops=1)
Total runtime: 65424.79 msec

Each table currently has 2665818 rows and grows by 86400 rows/day (one row per second). With regards to hardware, the machine is a Sunfire 3600 (4 x 750 MHz CPUs, 4 GB RAM; the DB is on a fibre channel disk array). We are using PostgreSQL 7.2.1.

Cheers,
Randall
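[Editor's note: the plan above still sorts a full sequential scan of cal_quat_2 to feed the merge join, which is where nearly all of the 65 seconds goes. The general before/after effect of an index on the join column can be sketched outside Postgres; here it is in SQLite, whose EXPLAIN QUERY PLAN output format differs from Postgres (table names here are invented):]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE t1 (ts TEXT, value REAL);
    CREATE TABLE t2 (ts TEXT, value REAL);
""")

def plan(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row holds the plan text.
    return " | ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = """
    SELECT t1.value, t2.value
      FROM t1 JOIN t2 ON t1.ts = t2.ts
     WHERE t1.ts > '2004-01-12 09:47:56'
"""

before = plan(query)   # no index on t2.ts: the join side cannot use one
cur.execute("CREATE INDEX t2_ts ON t2 (ts)")
after = plan(query)    # join side becomes an index search on t2_ts

print(before)
print(after)
```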
On Thu, Feb 19, 2004 at 01:23:34PM -0500, Randall Skelton wrote:
> The main problem was that there wasn't an index on cal_quat_1. The
> other indexes were fine so I don't know what happened to the first
> one...

How about an index on cal_quat_2.timestamp?

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
There are indexes on all of the timestamps now. Originally, the index for cal_quat_1.timestamp was missing, but cal_quat_2.timestamp was created with the table.

On 19 Feb 2004, at 15:04, Martijn van Oosterhout wrote:
> On Thu, Feb 19, 2004 at 01:23:34PM -0500, Randall Skelton wrote:
>> The main problem was that there wasn't an index on cal_quat_1. The
>> other indexes were fine so I don't know what happened to the first
>> one...
>
> How about an index on cal_quat_2.timestamp?
>
> Hope this helps,
Randall Skelton <skelton@brutus.uwaterloo.ca> writes:
> Nevertheless, it still takes longer than I would like. As requested:
> [EXPLAIN ANALYZE output snipped]

I think the problem is the gross misestimation of the number of rows involved --- first within the timestamp interval (13329 estimated vs. 142 actual), and then for the join result (177664640 is just silly). With more accurate estimates you would probably have gotten the double-indexscan plan that you really want.

The estimates look remarkably default-ish, however --- if I'm doing the math correctly, the selectivity is being estimated as 0.005 at each step, which just happens to be the default estimate in the absence of any statistics. Have you ANALYZEd these tables lately? If you have, try increasing the statistics target for the timestamp columns (see ALTER TABLE) and analyze again.

			regards, tom lane
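[Editor's note: in Postgres the commands Tom refers to are ANALYZE (or VACUUM ANALYZE) and ALTER TABLE ... ALTER COLUMN ... SET STATISTICS. The same idea can be seen in miniature with SQLite, whose ANALYZE likewise collects the row counts and selectivity figures its planner otherwise lacks; the table below mirrors the thread's schema but the data is invented:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cal_quat_1 (ts TEXT, value REAL)")
cur.executemany("INSERT INTO cal_quat_1 VALUES (?, ?)",
                [("2004-01-12 09:%02d:00" % i, i * 0.1) for i in range(60)])
cur.execute("CREATE INDEX cal_quat_1_ts ON cal_quat_1 (ts)")

# Before ANALYZE the planner has no statistics at all; after it, the
# sqlite_stat1 table records table size and index selectivity
# ("60 1" means 60 rows, ~1 row per distinct ts value).
cur.execute("ANALYZE")
stats = cur.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```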