Thread:
I would like to know if there is a more efficient way to perform a query. I have a table (see below) with an ID, a date, and a value. The date specifies when the entry was added. I want to query the table to determine what the value was on a specific day. I have the following query, which works quite well but it is very slow (the real table I'm using has 200000 records)... I'm sure the slowness is due to the subselect... does anybody know any way to make this query faster? Thanks in advance, Scott My current query: select a.* from sample a where (id,date) in (select a.id,max(date) from sample where date<='<<the date>>' and id=a.id) order by id; So with the data below, filling in '2003-02-01' for <<the date>> gives: id | date | value ----+------------+------- 1 | 2003-02-01 | 12 2 | 2003-02-01 | 9 3 | 2003-02-01 | 2 4 | 2003-02-01 | 11 And filling in '2003-02-04' for <<the date>> gives: id | date | value ----+------------+------- 1 | 2003-02-04 | 21 2 | 2003-02-01 | 9 3 | 2003-02-01 | 2 4 | 2003-02-03 | 12 Here is the table layout and the sample data I'm using: Table "sample" Column | Type | Modifiers --------+---------+----------- id | integer | not null date | date | not null value | integer | not null Primary key: sample_pkey id | date | value ----+------------+------- 1 | 2003-02-01 | 12 1 | 2003-02-02 | 16 1 | 2003-02-04 | 21 2 | 2003-02-01 | 9 3 | 2003-02-01 | 2 4 | 2003-02-01 | 11 4 | 2003-02-03 | 12 (7 rows)
On Tuesday 04 February 2003 03:35 pm, Scott Morrison wrote: > I would like to know if there is a more efficient way to perform a query. > > I have a table (see below) with an ID, a date, and a value. The date > specifies when the entry was added. I want to query the table to determine > what the value was on a specific day. > > I have the following query, which works quite well but it is very slow (the > real table I'm using has 200000 records)... I'm sure the slowness is due to > the subselect... does anybody know any way to make this query faster? > > Thanks in advance, > Scott > > My current query: > select a.* from sample a where (id,date) in (select a.id,max(date) from > sample where date<='<<the date>>' and id=a.id) order by id; > > So with the data below, filling in '2003-02-01' for <<the date>> gives: > id | date | value > ----+------------+------- > 1 | 2003-02-01 | 12 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-01 | 11 > > And filling in '2003-02-04' for <<the date>> gives: > id | date | value > ----+------------+------- > 1 | 2003-02-04 | 21 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-03 | 12 > > Here is the table layout and the sample data I'm using: > > Table "sample" > Column | Type | Modifiers > --------+---------+----------- > id | integer | not null > date | date | not null > value | integer | not null > Primary key: sample_pkey > > id | date | value > ----+------------+------- > 1 | 2003-02-01 | 12 > 1 | 2003-02-02 | 16 > 1 | 2003-02-04 | 21 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-01 | 11 > 4 | 2003-02-03 | 12 > (7 rows) > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org First, you didn't say if you've indexed the date field, which you should do if you're searching on that field more than once. That will certainly help out with the speed. Also, that query seems a bit overdone, how about this: select * from sample where date<='<<the date>>' order by id,date; You're only getting information from a single table, so I don't think the subselect was necessary, though I might be missing something since the coffee has worn off ;) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Doug Silver Urchin Software Corp. http://www.urchin.com ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Scott Morrison" <smorrison@navtechinc.com> writes: > I would like to know if there is a more efficient way to perform a query. > I have a table (see below) with an ID, a date, and a value. The date > specifies when the entry was added. I want to query the table to determine > what the value was on a specific day. Try something like SELECT ... FROM table WHERE date <= 'target date' ORDER BY date DESC LIMIT 1; Given an index on the date column, this should be very fast. If you need additional constraints (like the 'id' in your example), you can try SELECT ... FROM table WHERE id = 'target id' AND date <= 'target date' ORDER BY id DESC, date DESC LIMIT 1; where now you need an index declared on (id, date) (in that order). The basic idea is to make sure that the row you want is the first one reached when scanning from a given boundary in the index. regards, tom lane
Thanks to everybody for your replies. For some reason that has went through as two different threads. Anyways, I do have indices on both the date and id fields. The queries suggested by Tom Lane and Doug Silver do not do exactly what I want to do... but I believe that's because I did not form what it was correctly. I want a query which lists all of the ids from the table with the date which is closest to (but not past) a given date. For example, the statement select * from sample where date<='2003-02-04' order by id,date; will return: id | date | value ----+------------+------- 1 | 2003-02-01 | 12 1 | 2003-02-02 | 16 1 | 2003-02-04 | 21 ** 2 | 2003-02-01 | 9 ** 3 | 2003-02-01 | 2 ** 4 | 2003-02-01 | 11 4 | 2003-02-03 | 12 ** whereas I only want the fields marked with ** (the most recent date on or before the one specified in the query) Greg Sabino Mullane (in the other thread) suggested the following alternate, but it ran slightly slower on my machine than the original. SELECT id, date, value FROM sample a WHERE date <='2003-01-01' AND oid = (SELECT oid FROM sample WHERE id = a.id AND DATE <='2003-01-01' ORDER BY date DESC LIMIT 1) ORDER BY id; Thanks, Scott -----Original Message----- From: pgsql-novice-owner@postgresql.org [mailto:pgsql-novice-owner@postgresql.org]On Behalf Of Doug Silver Sent: Thursday, February 06, 2003 3:03 PM To: Scott Morrison; pgsql-novice@postgresql.org Subject: Re: [NOVICE] On Tuesday 04 February 2003 03:35 pm, Scott Morrison wrote: > I would like to know if there is a more efficient way to perform a query. > > I have a table (see below) with an ID, a date, and a value. The date > specifies when the entry was added. I want to query the table to determine > what the value was on a specific day. > > I have the following query, which works quite well but it is very slow (the > real table I'm using has 200000 records)... I'm sure the slowness is due to > the subselect... does anybody know any way to make this query faster? > > Thanks in advance, > Scott > > My current query: > select a.* from sample a where (id,date) in (select a.id,max(date) from > sample where date<='<<the date>>' and id=a.id) order by id; > > So with the data below, filling in '2003-02-01' for <<the date>> gives: > id | date | value > ----+------------+------- > 1 | 2003-02-01 | 12 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-01 | 11 > > And filling in '2003-02-04' for <<the date>> gives: > id | date | value > ----+------------+------- > 1 | 2003-02-04 | 21 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-03 | 12 > > Here is the table layout and the sample data I'm using: > > Table "sample" > Column | Type | Modifiers > --------+---------+----------- > id | integer | not null > date | date | not null > value | integer | not null > Primary key: sample_pkey > > id | date | value > ----+------------+------- > 1 | 2003-02-01 | 12 > 1 | 2003-02-02 | 16 > 1 | 2003-02-04 | 21 > 2 | 2003-02-01 | 9 > 3 | 2003-02-01 | 2 > 4 | 2003-02-01 | 11 > 4 | 2003-02-03 | 12 > (7 rows) > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org First, you didn't say if you've indexed the date field, which you should do if you're searching on that field more than once. That will certainly help out with the speed. Also, that query seems a bit overdone, how about this: select * from sample where date<='<<the date>>' order by id,date; You're only getting information from a single table, so I don't think the subselect was necessary, though I might be missing something since the coffee has worn off ;) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Doug Silver Urchin Software Corp. http://www.urchin.com ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Greg Sabino Mullane (in the other thread) suggested the following alternate, > but it ran slightly slower on my machine than the original. > > SELECT id, date, value FROM sample a WHERE date <='2003-01-01' AND oid = > (SELECT oid FROM sample WHERE id = a.id AND DATE <='2003-01-01' ORDER BY > date DESC LIMIT 1) > ORDER BY id; You should check this again. There is no way that my query is slower, as long as you have an index on the "date" column and have run an ANALYZE on the "sample" table. Try running these: ANALYZE sample; CREATE INDEX sample_date on sample(date); EXPLAIN <the query above>; ...and make sure that it is hitting the sample_date index. I also just realized that if you have pre-existing indexes, they may be slowing my query down (e.g. one on "date" *and* "id"). Make sure that the explain above only uses the sample_date index. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200302071047 -----BEGIN PGP SIGNATURE----- Comment: http://www.turnstep.com/pgp.html iD8DBQE+Q9VdvJuQZxSWSsgRAjHVAJwP6Qw4L+gOpfCC6yYkgIgJjzkcZQCggZ1y 2ckvZhgeUpQYOByPxHNKOmk= =ix4n -----END PGP SIGNATURE-----
> in one place you are counting(*), in another you are counting something else -count(distinct(l.full_phone). > > Not surprising they are different? > > > Ken > > Thanks Ken. I realized that as well just after I hit send. ;-) Thanks Chad
On Sat, 2003-02-08 at 04:48, greg@turnstep.com wrote: > > Greg Sabino Mullane (in the other thread) suggested the following alternate, > > but it ran slightly slower on my machine than the original. > > > > SELECT id, date, value FROM sample a WHERE date <='2003-01-01' AND oid = > > (SELECT oid FROM sample WHERE id = a.id AND DATE <='2003-01-01' ORDER BY > > date DESC LIMIT 1) > > ORDER BY id; > > You should check this again. There is no way that my query is slower, as > long as you have an index on the "date" column and have run an ANALYZE on > the "sample" table. Wouldn't there need to be an index on OID for your query to be fast? Cheers, Andrew. -- --------------------------------------------------------------------- Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267 Survey for nothing with http://survey.net.nz/ ---------------------------------------------------------------------
I am having a little problem with STATEMENT_TIMEOUT, i dont know why this is happening.... there are 2 users, first user have selected a record using FOR UPDATE. second user have selected 5 records using FOR UPDATE. but when the second user tries to select the record which is locked by the first user. Pgsql gives the error "ERROR: query was cancelled." and rollbacks the transaction which unlocks all the records locked by the second user. Is there any way that pgsql only gives the error and keep the status of the transction as it is? thanx in advance. __________________________________________________ Do You Yahoo!? Everything you'll ever need on one web page from News and Sport to Email and Music Charts http://uk.my.yahoo.com
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Wouldn't there need to be an index on OID for your query to be fast? No, the single index on the "date" field is enough. See the EXPLAIN output from my first post in the original thread: http://archives.postgresql.org/pgsql-novice/2003-02/msg00075.php - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200302101404 -----BEGIN PGP SIGNATURE----- Comment: http://www.turnstep.com/pgp.html iD8DBQE+R/jEvJuQZxSWSsgRAsaYAKDOzKuujdaqkvImutA0nl1/YqVNvgCg+Qk2 lNTYcT/cGyoFKZzd3bVqef4= =n1Iq -----END PGP SIGNATURE-----
On Tue, 2003-02-11 at 08:10, greg@turnstep.com wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > > Wouldn't there need to be an index on OID for your query to be fast? > > No, the single index on the "date" field is enough. See the EXPLAIN > output from my first post in the original thread: However the original poster suggested there would ultimately be 200,000 or more rows in the table. Cheers, Andrew. > > http://archives.postgresql.org/pgsql-novice/2003-02/msg00075.php > > - -- > Greg Sabino Mullane greg@turnstep.com > PGP Key: 0x14964AC8 200302101404 > -----BEGIN PGP SIGNATURE----- > Comment: http://www.turnstep.com/pgp.html > > iD8DBQE+R/jEvJuQZxSWSsgRAsaYAKDOzKuujdaqkvImutA0nl1/YqVNvgCg+Qk2 > lNTYcT/cGyoFKZzd3bVqef4= > =n1Iq > -----END PGP SIGNATURE----- > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- --------------------------------------------------------------------- Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267 Survey for nothing with http://survey.net.nz/ ---------------------------------------------------------------------
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>> Wouldn't there need to be an index on OID for your query to be fast? >>> >> No, the single index on the "date" field is enough. See the EXPLAIN >> output from my first post in the original thread: >> > However the original poster suggested there would ultimately be 200,000 > or more rows in the table. Yes, and my EXPLAIN is on a table with 200,000 rows in it. Hence the fact that it took over 5 seconds - which is still very fast compared to the original query. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200302101447 -----BEGIN PGP SIGNATURE----- Comment: http://www.turnstep.com/pgp.html iD8DBQE+SAHxvJuQZxSWSsgRAkxTAJ9TnEAjypMGbv8ZRn55yqe/8AZyQgCcCUPm /dU6kHroxm1XpC0lUrvdY+Y= =GwmT -----END PGP SIGNATURE-----
Well I ran the queries through some more tests and the original query is still faster than any of the suggested queries. Query 1 (Original): ANALYZE sample; CREATE INDEX sample_id ON sample(id); CREATE INDEX sample_date ON sample(date); ANALYZE sample; EXPLAIN ANALYZE SELECT * FROM sample a WHERE (id,date) IN (SELECT id,max(date) FROM sample WHERE id=a.id AND date<='2003-02-07' GROUP BY id); DROP INDEX sample_id; DROP INDEX sample_date; Query 2 (Greg): ANALYZE sample; CREATE INDEX sample_id ON sample(id); CREATE INDEX sample_date ON sample(date); ANALYZE sample; EXPLAIN ANALYZE SELECT * FROM sample a WHERE date<='2003-02-07' AND oid= (SELECT oid FROM sample WHERE id=a.id AND date<='2003-02-07' ORDER BY date DESC LIMIT 1); DROP INDEX sample_id; DROP INDEX sample_date; As for Greg's query, I tried it without an id index and it took a long long time (I broke it after about 5 minutes, so that's why the id index is added in there). I also took out the ORDER BY id because that slowed it down by another 3 seconds or so, and I don't need it for what I'm querying out. Below are the analysis reports for both queries. Query 1 Analysis (Average time 21779.98ms): Seq Scan on sample a (cost=0.00..13702.91 rows=3126 width=12) (actual time=58.67..21770.83 rows=99999 loops=1) SubPlan -> Aggregate (cost=0.00..3.74 rows=1 width=8) (actual time=0.11..0.11 rows=1 loops=180566) -> Group (cost=0.00..3.74 rows=1 width=8) (actual time=0.06..0.10 rows=2 loops=180566) -> Index Scan using sample_id on sample (cost=0.00..3.73 rows=1 width=8) (actual time=0.05..0.08 rows=2 loops=180566) Query 2 Analysis (Average time 28358.92ms): Index Scan using sample_date on sample a (cost=0.00..24294.52 rows=1 width=12) (actual time=0.42..27831.20 rows=99999 loops=1) SubPlan -> Limit (cost=3.84..3.84 rows=1 width=8) (actual time=0.15..0.16 rows=1 loops=160446) -> Sort (cost=3.84..3.84 rows=1 width=8) (actual time=0.15..0.15 rows=2 loops=160446) -> Index Scan using sample_id on sample (cost=0.00..3.83 rows=1 width=8) (actual time=0.05..0.07 rows=2 loops=160446) So in conclusion, it doesn't look like I'll be able to get the query any faster than my original query. I was really hoping for something that would run in 5-10% of the time, but I guess that's not possible. Thanks for the suggestions. -- Scott -----Original Message----- From: pgsql-novice-owner@postgresql.org [mailto:pgsql-novice-owner@postgresql.org]On Behalf Of greg@turnstep.com Sent: Monday, February 10, 2003 2:50 PM To: pgsql-novice@postgresql.org Subject: Re: [NOVICE] -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>> Wouldn't there need to be an index on OID for your query to be fast? >>> >> No, the single index on the "date" field is enough. See the EXPLAIN >> output from my first post in the original thread: >> > However the original poster suggested there would ultimately be 200,000 > or more rows in the table. Yes, and my EXPLAIN is on a table with 200,000 rows in it. Hence the fact that it took over 5 seconds - which is still very fast compared to the original query. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200302101447 -----BEGIN PGP SIGNATURE----- Comment: http://www.turnstep.com/pgp.html iD8DBQE+SAHxvJuQZxSWSsgRAkxTAJ9TnEAjypMGbv8ZRn55yqe/8AZyQgCcCUPm /dU6kHroxm1XpC0lUrvdY+Y= =GwmT -----END PGP SIGNATURE----- ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster
"Scott Morrison" <smorrison@navtechinc.com> writes: > Query 1 (Original): > ANALYZE sample; > CREATE INDEX sample_id ON sample(id); > CREATE INDEX sample_date ON sample(date); > ANALYZE sample; > EXPLAIN ANALYZE > SELECT * FROM sample a WHERE (id,date) IN > (SELECT id,max(date) FROM sample > WHERE id=a.id AND date<='2003-02-07' > GROUP BY id); It's got to be possible to improve on that. The big problem is the IN, which is a very inefficient construct (and the upcoming 7.4 improvements won't help any, because they only address cases where the subselect is uncorrelated, ie, doesn't use variables from the outer query). But it's easy to get rid of the IN. First off, because the sub-select contains WHERE id = a.id, there is no need to return id from the sub-SELECT; it must be equal to the outer a.id, no? SELECT * FROM sample a WHERE date IN (SELECT max(date) FROM sample WHERE id=a.id AND date<='2003-02-07' GROUP BY id); For the same reason, we do not need GROUP BY in the sub-select, because only one group can be formed. This gives us SELECT * FROM sample a WHERE date IN (SELECT max(date) FROM sample WHERE id=a.id AND date<='2003-02-07'); and now we can see that the sub-select returns exactly one row, which means we can reduce IN to "=" SELECT * FROM sample a WHERE date = (SELECT max(date) FROM sample WHERE id=a.id AND date<='2003-02-07'); and now the sub-select is amenable to the LIMIT-and-index trick for implementing MAX() quickly: CREATE INDEX fooi on sample (id, date); SELECT * FROM sample a WHERE date = (SELECT date FROM sample WHERE id=a.id AND date<='2003-02-07' ORDER BY id DESC, date DESC LIMIT 1); Now this is a pretty decent subplan --- if you try it you should see a plan like Seq Scan on sample a (cost=0.00..4330.81 rows=5 width=8) Filter: (date = (subplan)) SubPlan -> Limit (cost=0.00..4.31 rows=1 width=8) -> Index Scan Backward using fooi on sample (cost=0.00..7.18 rows=2 width=8) Index Cond: ((id = $0) AND (date <= '2003-02-07'::date)) and there isn't any way to make it quicker: the index is taking us directly to the single value that we want. But the overall performance will still suck, because we are invoking the subplan once for every row of the table. And here's where you need to consider rearranging your schema. The only reason we need to scan all of the 'sample' table is that we are using the table rows as the source of potential values of 'id' to plug into the subquery. Are you willing to make a side table showing the allowed values of 'id'? (You could enforce that it's correct with a foreign key constraint on 'sample'.) If you are, then this would work: SELECT a.* FROM sample a, possible_ids p WHERE p.id = a.id AND date = (SELECT date FROM sample WHERE id = p.id AND date <= '2003-02-07' ORDER BY id DESC, date DESC LIMIT 1); Experimenting, I get a plan like Nested Loop (cost=0.00..13477.14 rows=25 width=12) -> Seq Scan on possible_ids p (cost=0.00..20.00 rows=1000 width=4) -> Index Scan using fooi on sample a (cost=0.00..9.13 rows=1 width=8) Index Cond: (("outer".id = a.id) AND (a.date = (subplan))) SubPlan -> Limit (cost=0.00..4.31 rows=1 width=8) -> Index Scan Backward using fooi on sample (cost=0.00..7.18 rows=2 width=8) Index Cond: ((id = $0) AND (date <= '2003-02-07'::date)) -> Limit (cost=0.00..4.31 rows=1 width=8) -> Index Scan Backward using fooi on sample (cost=0.00..7.18 rows=2 width=8) Index Cond: ((id = $0) AND (date <= '2003-02-07'::date)) which looks pretty promising. This should require just one subplan execution (one index probe) followed by an index probe into the 'sample' table, for each possible 'id' value. Assuming you have many fewer distinct 'id' values than you do sample rows, this should win big. (Note: the reason we see the subplan listed twice is an arcane implementation detail; in practice only one of the copies will get executed in this query.) regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > Well I ran the queries through some more tests and the original query is > still faster than any of the suggested queries. I finally have it figured out - it all depends on the distribution of the data. I was assuming that there were relatively few unique ids in each of the 200,000 rows. When this is true, my query runs fast. When there are few duplicate ids, and the number approaches the total number of rows, the original query (and Tom's rewrite) works much better. I was getting the opposite results as you: my query was fast, and yours would basically time out. When I rearranged the data to provide for more unique ids, this was reversed. Ohwell :) One minor tweak's to Tom's final query: EXPLAIN ANALYZE SELECT * FROM sample a WHERE date = (SELECT date FROM sample WHERE id=a.id AND date<='2003-02-07' ORDER BY id DESC, date DESC LIMIT 1); If you know that you are grabbing a certain percentage of the data, you can add another WHERE clause to speed things up: EXPLAIN ANALYZE SELECT * FROM sample a WHERE date <='2003-02-07' AND date = (SELECT date FROM sample WHERE id=a.id AND date<='2003-02-07' ORDER BY id DESC, date DESC LIMIT 1); This hits the sample_date index instead of doing a sequential scan of sample: and is probably faster for more cases. In my tests the two even out only when we start grabbing most of the rows: all dates earlier than 02-25, with a even sitribution of dates throughout 2/2003. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200302111237 -----BEGIN PGP SIGNATURE----- Comment: http://www.turnstep.com/pgp.html iD8DBQE+STTuvJuQZxSWSsgRAgcyAJ4zuW6G/j2cvvjaPynKRyV7rsih6ACeNpkl UXerayY3r02qttNs6tTUMiw= =dlwt -----END PGP SIGNATURE-----