Thread: select max(field) from table much faster with a group by clause?
Hi,

I have a table "login" with approx 600,000 tuples, and a person table with
approx 100,000 tuples.

When running

  select max("when") from login where userid='userid'

it takes a second or two, but when adding "group by userid" the planner
decides on using another plan, and it gets *much* faster. See example below.

The number of tuples per user varies from zero to a couple of thousand. It
seems to be slowest when there are no tuples at all, but it is always slow.
This only happens for max() and min(); for count(), the plan is the same and
it always uses "Aggregate".

Any ideas about this? Do we need to add "group by userid" throughout our code
base to optimize, or is there another way? Upgrading PostgreSQL to 8.2 is a
long-term option, but I'd like a short-term option as well...

Regards,
Palle

pp=# select version();
                                             version
---------------------------------------------------------------------------------------------------
 PostgreSQL 8.1.8 on amd64-portbld-freebsd6.1, compiled by GCC cc (GCC) 3.4.4 [FreeBSD] 20050518
(1 row)

Time: 0,530 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' ;
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.80..323.81 rows=1 width=0) (actual time=3478.781..3478.785 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.80 rows=1 width=8) (actual time=3478.768..3478.768 rows=0 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131461.90 rows=406 width=8) (actual time=3478.759..3478.759 rows=0 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'sarah.gilliam1'::text))
 Total runtime: 3478.868 ms
(6 rows)

Time: 3480,442 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' group by userid;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=0.191..0.191 rows=0 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.183..0.183 rows=0 loops=1)
         Index Cond: (userid = 'sarah.gilliam1'::text)
 Total runtime: 0.243 ms
(4 rows)

Time: 0,938 ms

pp=# \d login
              Table "public.login"
 Column |           Type           |     Modifiers
--------+--------------------------+--------------------
 userid | text                     |
 kursid | integer                  |
 when   | timestamp with time zone |
 mode   | text                     | default 'pm'::text
Indexes:
    "login_kurs_user_idx" btree (kursid, userid)
    "login_userid_idx" btree (userid)
    "login_when_idx" btree ("when")
Foreign-key constraints:
    "pp_fk1" FOREIGN KEY (userid) REFERENCES person(userid) ON UPDATE CASCADE ON DELETE CASCADE
    "pp_fk2" FOREIGN KEY (kursid) REFERENCES course(id) ON UPDATE CASCADE ON DELETE CASCADE
Palle Girgensohn <girgen@pingpong.net> writes:
> When running
> select max("when") from login where userid='userid'
> it takes a second or two, but when adding "group by userid" the planner
> decides on using another plan, and it gets *much* faster. See example below.

It's only faster for cases where there are few or no rows for the
particular userid ...

> Number of tuples per user varies from zero to a couple of thousands.

The planner is using an intermediate estimate of 406 rows.  You might be
well advised to increase the statistics target for login.userid --- with
luck that would help it to choose the right plan type for both common
and uncommon userids.

			regards, tom lane
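Concretely, that suggestion translates into something like the following (a
minimal sketch, assuming a target of 1000 for login.userid; the right value
is a judgment call for the data at hand):

  -- raise the per-column statistics target, then refresh the statistics
  ALTER TABLE login ALTER COLUMN userid SET STATISTICS 1000;
  ANALYZE login;

  -- re-check what the planner now estimates and chooses for one userid
  EXPLAIN ANALYZE SELECT max("when") FROM login WHERE userid = 'girgen';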
--On torsdag, november 01, 2007 09.43.39 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> When running
>> select max("when") from login where userid='userid'
>> it takes a second or two, but when adding "group by userid" the planner
>> decides on using another plan, and it gets *much* faster. See example
>> below.
>
> It's only faster for cases where there are few or no rows for the
> particular userid ...

Well, no, not really. See below. OTOH, it is sometimes a bit slower. It
seems to depend on how far off the estimated number of rows is. Weird?

>> Number of tuples per user varies from zero to a couple of thousands.
>
> The planner is using an intermediate estimate of 406 rows.  You might be
> well advised to increase the statistics target for login.userid --- with
> luck that would help it to choose the right plan type for both common
> and uncommon userids.

I'll try that, thanks!

--

pp=# SELECT max("when") FROM login WHERE userid='kudo' group by userid;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 6,194 ms

pp=# SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 992,391 ms

pp=# SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 779,582 ms

pp=# SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 818,667 ms

pp=# SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 640,242 ms

pp=# SELECT max("when") FROM login WHERE userid='kudo' group by userid;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 18,384 ms

pp=# SELECT count(*) FROM login WHERE userid='kudo' group by userid;
 count
-------
  1998
(1 row)

Time: 12,762 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='kudo' group by userid;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=24.700..24.703 rows=1 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.140..16.931 rows=1998 loops=1)
         Index Cond: (userid = 'kudo'::text)
 Total runtime: 24.779 ms
(4 rows)

Time: 25,633 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='kudo' ;
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.93..323.94 rows=1 width=0) (actual time=1400.994..1400.997 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.93 rows=1 width=8) (actual time=1400.975..1400.979 rows=1 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131515.87 rows=406 width=8) (actual time=1400.968..1400.968 rows=1 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'kudo'::text))
 Total runtime: 1401.057 ms
(6 rows)

Time: 1401,881 ms

pp=# SELECT userid, count("when") FROM login WHERE userid in ('girgen', 'kudo') group by userid;
 userid | count
--------+-------
 kudo   |  1998
 girgen |  1120
(2 rows)

pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' group by userid;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=25.137..25.141 rows=1 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.121..20.712 rows=1120 loops=1)
         Index Cond: (userid = 'girgen'::text)
 Total runtime: 25.209 ms
(4 rows)

Time: 25,986 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' ;
                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.93..323.94 rows=1 width=0) (actual time=6.695..6.698 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.93 rows=1 width=8) (actual time=6.669..6.675 rows=1 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131515.87 rows=406 width=8) (actual time=6.660..6.660 rows=1 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'girgen'::text))
 Total runtime: 6.785 ms
(6 rows)

Time: 7,776 ms
--On torsdag, november 01, 2007 09.43.39 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> When running
>> select max("when") from login where userid='userid'
>> it takes a second or two, but when adding "group by userid" the planner
>> decides on using another plan, and it gets *much* faster. See example
>> below.
>
> It's only faster for cases where there are few or no rows for the
> particular userid ...
>
>> Number of tuples per user varies from zero to a couple of thousands.
>
> The planner is using an intermediate estimate of 406 rows.  You might be
> well advised to increase the statistics target for login.userid --- with
> luck that would help it to choose the right plan type for both common
> and uncommon userids.

Unfortunately, altering statistics doesn't help. I see no difference when
altering the value from 10 (the default) to 100, 1000 or 100000. :-(

Are there any other things I can modify?

Oh, by the way, maybe something in postgresql.conf sucks?

max_connections = 100
shared_buffers = 30000                  # min 16 or max_connections*2, 8KB each
temp_buffers = 2500                     # min 100, 8KB each
max_prepared_transactions = 100         # can be 0 or more
work_mem = 16384                        # min 64, size in KB
maintenance_work_mem = 16384            # min 1024, size in KB
max_stack_depth = 32768                 # min 100, size in KB
max_fsm_pages = 500000
max_fsm_relations = 20000
max_files_per_process = 2000
fsync = off
checkpoint_segments = 50                # in logfile segments, min 1, 16MB each
effective_cache_size = 10000            # typically 8KB each
random_page_cost = 1.8
geqo = on
geqo_threshold = 10
from_collapse_limit = 8
join_collapse_limit = 8                 # 1 disables collapsing of explicit
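Before tuning further, it can be worth confirming what the running server has
actually loaded, for instance with a generic query against pg_settings (just
a sketch; substitute whatever parameters are of interest):

  SELECT name, setting
  FROM pg_settings
  WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
                 'random_page_cost', 'default_statistics_target');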
Palle Girgensohn <girgen@pingpong.net> writes:
> Unfortunately, altering statistics doesn't help. I see no difference when
> altering the value from 10 (default) to 100, 1000 or 100000. :-(

Um, you did re-ANALYZE the table after changing the setting?

			regards, tom lane
--On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> Unfortunately, altering statistics doesn't help. I see no difference
>> when altering the value from 10 (default) to 100, 1000 or 100000. :-(
>
> Um, you did re-ANALYZE the table after changing the setting?

alter table login alter userid SET statistics 1000;
vacuum analyze login;
On 11/1/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Palle Girgensohn <girgen@pingpong.net> writes:
> > Unfortunately, altering statistics doesn't help. I see no difference when
> > altering the value from 10 (default) to 100, 1000 or 100000. :-(
>
> Um, you did re-ANALYZE the table after changing the setting?

And he changed it with

ALTER TABLE name ALTER [ COLUMN ] column SET STATISTICS integer

right?
Palle Girgensohn <girgen@pingpong.net> writes:
> --On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane <tgl@sss.pgh.pa.us>
> wrote:
>> Um, you did re-ANALYZE the table after changing the setting?

> alter table login alter userid SET statistics 1000;
> vacuum analyze login;

Hm, that's the approved procedure all right.  But the plans didn't
change at all?  Not even the estimated number of rows?

			regards, tom lane
"Palle Girgensohn" <girgen@pingpong.net> writes: > Are there any other things I can modify? You might consider an index on <userid,when>. Keep in mind that every new index imposes an incremental cost on every update and insert and increases the time for vacuum. > max_prepared_transactions = 100 # can be 0 or more Are you actually using prepared transactions (are you synchronising multiple databases using a transaction manager)? If not then set this to 0 as it takes some resources. > maintenance_work_mem = 16384 # min 1024, size in KB Raising this might decrease vacuum times if that's a problem. > fsync = off You realize that this means if the system loses power or the kernel crashes you could have data corruption? Do you take very frequent backups or can you reconstruct your data? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
From: hubert depesz lubaczewski
On Thu, Nov 01, 2007 at 02:07:55PM +0100, Palle Girgensohn wrote:
> I have a table "login" with approx 600,000 tuples, a person table with
> approx 100000 tuples.
> When running
> select max("when") from login where userid='userid'
> it takes a second or two, but when adding "group by userid" the planner
> decides on using another plan, and it gets *much* faster. See example below.
> pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' ;

just do:

create index q on login (userid, "when");

and you should be fine. if it will not help, rewrite the query as:

select "when"
from login
where userid = 'girgen'
order by userid desc, "when" desc
limit 1;

depesz

-- 
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA. here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)
--On torsdag, november 01, 2007 11.34.42 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> --On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane
>> <tgl@sss.pgh.pa.us> wrote:
>>> Um, you did re-ANALYZE the table after changing the setting?
>
>> alter table login alter userid SET statistics 1000;
>> vacuum analyze login;
>
> Hm, that's the approved procedure all right.  But the plans didn't
> change at all?  Not even the estimated number of rows?

The estimated number of rows did change, from ~400 to ~1900, but the timing
stayed the same.

It seems the problem is that the plain max() query uses the index on "when".
Removing that index (login_when_idx) changes the plan and makes the query
equally fast whether or not the group by is there. I may still need that
index, though, in which case one more index, on (userid, "when"), will fix
the problem. I'd rather get rid of an index than create another one, but
either way I now have two workable solutions.

I still think it is strange that the group by clause so radically changes
the behaviour and the query time.

Cheers,
Palle

pp=# \d login
              Table "public.login"
 Column |           Type           |     Modifiers
--------+--------------------------+--------------------
 userid | text                     |
 kursid | integer                  |
 when   | timestamp with time zone |
 mode   | text                     | default 'pm'::text
Indexes:
    "login_kurs_user_idx" btree (kursid, userid)
    "login_userid_idx" btree (userid)
    "login_when_idx" btree ("when")
Foreign-key constraints:
    "pp_fk1" FOREIGN KEY (userid) REFERENCES person(userid) ON UPDATE CASCADE ON DELETE CASCADE
    "pp_fk2" FOREIGN KEY (kursid) REFERENCES course(id) ON UPDATE CASCADE ON DELETE CASCADE
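For the record, the two alternatives above look roughly like this (a sketch;
the composite index name is just an example, and which option is right
depends on whether anything else needs login_when_idx):

  -- option 1: drop the single-column index on "when" (only if nothing else
  -- needs it), so the planner stops choosing the backward scan on it
  DROP INDEX login_when_idx;

  -- option 2: keep login_when_idx and add a composite index as well
  CREATE INDEX login_userid_when_idx ON login (userid, "when");

  -- either way, re-check the plan afterwards
  EXPLAIN ANALYZE SELECT max("when") FROM login WHERE userid = 'girgen';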