Re: speed up query with max() and odd estimates - Mailing list pgsql-performance
From | John A Meinel |
---|---|
Subject | Re: speed up query with max() and odd estimates |
Date | |
Msg-id | 426EB7CD.2020406@arbash-meinel.com Whole thread Raw |
In response to | speed up query with max() and odd estimates (Matthew Nuzum <mattnuzum@gmail.com>) |
List | pgsql-performance |
Matthew Nuzum wrote: > I have this query that takes a little over 8 min to run: > select client,max(atime) as atime from usage_access where atime >= > (select atime - '1 hour'::interval from usage_access order by atime > desc limit 1) group by client; > > I think it can go a lot faster. Any suggestions on improving this? DB > is 7.3.4 I think. (There is no index on client because it is very big > and this data is used infrequently.) Switch to Postgres 8.0.2 :) Actually, I think one problem that you are running into is that postgres (at least used to) has problems with selectivity of date fields when using a non-constant parameter. So it isn't switching over to using an index, even though you are restricting the access time. I would guess that creating a multi-column index on (client, atime) *might* get you the best performance. Try adding the index, and then doing this query: select atime from usage_access where client = <client_id> order by atime desc limit 1; If you can get that query to use an index, then you can put it in a loop. Something like: CREATE FUNCTION last_client_access() RETURNS SETOF time AS ' DECLARE client_id INT; client_time TIME; BEGIN FOR client_id IN SELECT id FROM <client_list> LOOP SELECT INTO client_time atime FROM usage_access WHERE client = client_id ORDER BY atime DESC LIMIT 1; RETURN NEXT client_time; END LOOP; END; ' LANGUAGE plpgsql; If you really need high speed, you could create a partial index for each client id, something like: CREATE INDEX usage_access_atime_client1_idx ON usage_access(atime) WHERE client = client1; But that is a lot of indexes to maintain. I'm hoping that the multi-column index would be enough. You might also try something like: SELECT client, max(atime) FROM usage_access WHERE atime > now - '1 hour'::interval GROUP BY client; now is more of a constant, so postgres might have a better time figuring out the selectivity. I don't know your table, but I assume you are constantly inserting new rows, and the largest atime value will be close to now(). Remember, in this query (and in your original query) clients with their last access time > then 1 hour since the max time (of all clients) will not be shown. (Example, client 1 accessed yesterday, client 2 accessed right now your original last atime would be today, which would hide client 1). Also, if it is simply a problem of the planner mis-estimating the selectivity of the row, you can alter the statistics for atime. ALTER TABLE usage_access ALTER COLUMN atime SET STATISTICS 1000; I'm not really sure what else to try, but you might start there. Also, I still recommend upgrading to postgres 8, as I think it handles a lot of these things better. (7.3 is pretty old). John =:-> > > explain ANALYZE select client,max(atime) as atime from usage_access > where atime >= (select atime - '1 hour'::interval from usage_access > order by atime desc limit 1) group by client; > > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=3525096.28..3620450.16 rows=1271385 width=20) > (actual time=482676.95..482693.69 rows=126 loops=1) > InitPlan > -> Limit (cost=0.00..0.59 rows=1 width=8) (actual > time=0.40..0.41 rows=1 loops=1) > -> Index Scan Backward using usage_access_atime on > usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual > time=0.39..0.40 rows=2 loops=1) > -> Group (cost=3525096.28..3588665.53 rows=12713851 width=20) > (actual time=482676.81..482689.29 rows=3343 loops=1) > -> Sort (cost=3525096.28..3556880.90 rows=12713851 > width=20) (actual time=482676.79..482679.16 rows=3343 loops=1) > Sort Key: client > -> Seq Scan on usage_access (cost=0.00..1183396.40 > rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343 > loops=1) > Filter: (atime >= $0) > Total runtime: 482694.65 msec > > > I'm starting to understand this, which is quite frightening to me. I > thought that maybe if I shrink the number of rows down I could improve > things a bit, but my first attempt didn't work. I thought I'd replace > the "from usage_access" with this query instead: > select * from usage_access where atime >= (select atime - '1 > hour'::interval from usage_access order by atime desc limit 1); > > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) > Filter: (atime >= $0) > InitPlan > -> Limit (cost=0.00..0.59 rows=1 width=8) (actual > time=0.41..0.42 rows=1 loops=1) > -> Index Scan Backward using usage_access_atime on > usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual > time=0.40..0.41 rows=2 loops=1) > Total runtime: 481842.47 msec > > It doesn't look like this will help at all. > > This table is primarily append, however I just recently deleted a few > million rows from the table, if that helps anyone. >
Attachment
pgsql-performance by date: