Thread: select max(field) from table much faster with a group by clause?

select max(field) from table much faster with a group by clause?

From
Palle Girgensohn
Date:
Hi,

I have a table "login" with approx 600,000 tuples and a person table with
approx 100,000 tuples.

When running
 select max("when") from login where userid='userid'

it takes a second or two, but when adding "group by userid" the planner
decides on using another plan, and it gets *much* faster. See example below.

Number of tuples per user varies from zero to a couple of thousand. It
seems to be slower when there are no tuples at all, but it is always slow.

This is only for max() and min(). For count(), the plan is the same, it
always uses "Aggregate".

Any ideas about this? Do we need to add "group by userid" throughout our
code base to optimize, or is there another way? Upgrading PostgreSQL to 8.2
is a long-term option, but I'd like a short-term option as well...

Regards,
Palle


pp=# select version();
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 8.1.8 on amd64-portbld-freebsd6.1, compiled by GCC cc (GCC) 3.4.4 [FreeBSD] 20050518
(1 row)

Time: 0,530 ms
pp=# explain analyze SELECT max("when") FROM login WHERE userid='sarah.gilliam1' ;
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.80..323.81 rows=1 width=0) (actual time=3478.781..3478.785 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.80 rows=1 width=8) (actual time=3478.768..3478.768 rows=0 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131461.90 rows=406 width=8) (actual time=3478.759..3478.759 rows=0 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'sarah.gilliam1'::text))
 Total runtime: 3478.868 ms
(6 rows)

Time: 3480,442 ms
pp=# explain analyze SELECT max("when") FROM login WHERE userid='sarah.gilliam1' group by userid;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=0.191..0.191 rows=0 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.183..0.183 rows=0 loops=1)
         Index Cond: (userid = 'sarah.gilliam1'::text)
 Total runtime: 0.243 ms
(4 rows)

Time: 0,938 ms
pp=# \d login
                  Table "public.login"
 Column |           Type           |     Modifiers
--------+--------------------------+--------------------
 userid | text                     |
 kursid | integer                  |
 when   | timestamp with time zone |
 mode   | text                     | default 'pm'::text
Indexes:
    "login_kurs_user_idx" btree (kursid, userid)
    "login_userid_idx" btree (userid)
    "login_when_idx" btree ("when")
Foreign-key constraints:
    "pp_fk1" FOREIGN KEY (userid) REFERENCES person(userid) ON UPDATE CASCADE ON DELETE CASCADE
    "pp_fk2" FOREIGN KEY (kursid) REFERENCES course(id) ON UPDATE CASCADE ON DELETE CASCADE



Re: select max(field) from table much faster with a group by clause?

From
Tom Lane
Date:
Palle Girgensohn <girgen@pingpong.net> writes:
> When running
>  select max("when") from login where userid='userid'
> it takes a second or two, but when adding "group by userid" the planner
> decides on using another plan, and it gets *much* faster. See example below.

It's only faster for cases where there are few or no rows for the
particular userid ...

> Number of tuples per user varies from zero to a couple of thousands.

The planner is using an intermediate estimate of 406 rows.  You might be
well advised to increase the statistics target for login.userid --- with
luck that would help it to choose the right plan type for both common
and uncommon userids.
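
For example (the target value 100 here is just illustrative; re-running
ANALYZE is what makes the new target take effect):

    alter table login alter column userid set statistics 100;
    analyze login;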

            regards, tom lane

Re: select max(field) from table much faster with a group by clause?

From
Palle Girgensohn
Date:

--On torsdag, november 01, 2007 09.43.39 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> When running
>>  select max("when") from login where userid='userid'
>> it takes a second or two, but when adding "group by userid" the planner
>> decides on using another plan, and it gets *much* faster. See example
>> below.
>
> It's only faster for cases where there are few or no rows for the
> particular userid ...

Well, no, not really. See below. OTOH, it is sometimes a bit slower. It
seems to depend on how far the actual row count is from the planner's
estimate? Weird.

>> Number of tuples per user varies from zero to a couple of thousands.
>
> The planner is using an intermediate estimate of 406 rows.  You might be
> well advised to increase the statistics target for login.userid --- with
> luck that would help it to choose the right plan type for both common
> and uncommon userids.

I'll try that, thanks!

--

pp=#  SELECT max("when") FROM login WHERE userid='kudo' group by userid;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 6,194 ms
pp=#  SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 992,391 ms
pp=#  SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 779,582 ms
pp=#  SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 818,667 ms
pp=#  SELECT max("when") FROM login WHERE userid='kudo' ;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 640,242 ms
pp=#  SELECT max("when") FROM login WHERE userid='kudo' group by userid;
              max
-------------------------------
 2007-01-04 15:31:46.863325+01
(1 row)

Time: 18,384 ms
pp=#  SELECT count(*) FROM login WHERE userid='kudo' group by userid;
 count
-------
  1998
(1 row)

Time: 12,762 ms
pp=# explain analyze SELECT max("when") FROM login WHERE userid='kudo' group by userid;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=24.700..24.703 rows=1 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.140..16.931 rows=1998 loops=1)
         Index Cond: (userid = 'kudo'::text)
 Total runtime: 24.779 ms
(4 rows)

Time: 25,633 ms
pp=# explain analyze SELECT max("when") FROM login WHERE userid='kudo' ;
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.93..323.94 rows=1 width=0) (actual time=1400.994..1400.997 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.93 rows=1 width=8) (actual time=1400.975..1400.979 rows=1 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131515.87 rows=406 width=8) (actual time=1400.968..1400.968 rows=1 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'kudo'::text))
 Total runtime: 1401.057 ms
(6 rows)

Time: 1401,881 ms


pp=#  SELECT userid, count("when") FROM login WHERE userid in ('girgen', 'kudo') group by userid;
 userid | count
--------+-------
 kudo   |  1998
 girgen |  1120
(2 rows)


pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' group by userid;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..648.44 rows=1 width=25) (actual time=25.137..25.141 rows=1 loops=1)
   ->  Index Scan using login_userid_idx on "login"  (cost=0.00..646.40 rows=406 width=25) (actual time=0.121..20.712 rows=1120 loops=1)
         Index Cond: (userid = 'girgen'::text)
 Total runtime: 25.209 ms
(4 rows)

Time: 25,986 ms

pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' ;
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=323.93..323.94 rows=1 width=0) (actual time=6.695..6.698 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..323.93 rows=1 width=8) (actual time=6.669..6.675 rows=1 loops=1)
           ->  Index Scan Backward using login_when_idx on "login"  (cost=0.00..131515.87 rows=406 width=8) (actual time=6.660..6.660 rows=1 loops=1)
                 Filter: (("when" IS NOT NULL) AND (userid = 'girgen'::text))
 Total runtime: 6.785 ms
(6 rows)

Time: 7,776 ms


Re: select max(field) from table much faster with a group by clause?

From
Palle Girgensohn
Date:

--On torsdag, november 01, 2007 09.43.39 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> When running
>>  select max("when") from login where userid='userid'
>> it takes a second or two, but when adding "group by userid" the planner
>> decides on using another plan, and it gets *much* faster. See example
>> below.
>
> It's only faster for cases where there are few or no rows for the
> particular userid ...
>
>> Number of tuples per user varies from zero to a couple of thousands.
>
> The planner is using an intermediate estimate of 406 rows.  You might be
> well advised to increase the statistics target for login.userid --- with
> luck that would help it to choose the right plan type for both common
> and uncommon userids.

Unfortunately, altering the statistics target doesn't help. I see no
difference when altering the value from 10 (the default) to 100, 1000 or
100000. :-(

Are there any other things I can modify?

Oh, btw, maybe something in the postgresql.conf sucks?

max_connections = 100
shared_buffers = 30000                  # min 16 or max_connections*2, 8KB each
temp_buffers = 2500                     # min 100, 8KB each
max_prepared_transactions = 100         # can be 0 or more
work_mem = 16384                        # min 64, size in KB
maintenance_work_mem = 16384            # min 1024, size in KB
max_stack_depth = 32768                 # min 100, size in KB
max_fsm_pages = 500000
max_fsm_relations = 20000
max_files_per_process = 2000
fsync = off
checkpoint_segments = 50                # in logfile segments, min 1, 16MB each
effective_cache_size = 10000            # typically 8KB each
random_page_cost = 1.8
geqo = on
geqo_threshold = 10
from_collapse_limit = 8
join_collapse_limit = 8         # 1 disables collapsing of explicit


Re: select max(field) from table much faster with a group by clause?

From
Tom Lane
Date:
Palle Girgensohn <girgen@pingpong.net> writes:
> Unfortunately, altering statistics doesn't help. I see no difference when
> altering the value from 10 (default) to 100, 1000 or 100000. :-(

Um, you did re-ANALYZE the table after changing the setting?

            regards, tom lane

Re: select max(field) from table much faster with a group by clause?

From
Palle Girgensohn
Date:

--On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> Unfortunately, altering statistics doesn't help. I see no difference
>> when  altering the value from 10 (default) to 100, 1000 or 100000. :-(
>
> Um, you did re-ANALYZE the table after changing the setting?

alter table login alter userid SET statistics 1000;
vacuum analyze login;


Re: select max(field) from table much faster with a group by clause?

From
"Scott Marlowe"
Date:
On 11/1/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Palle Girgensohn <girgen@pingpong.net> writes:
> > Unfortunately, altering statistics doesn't help. I see no difference when
> > altering the value from 10 (default) to 100, 1000 or 100000. :-(
>
> Um, you did re-ANALYZE the table after changing the setting?

And he changed it with

ALTER TABLE name ALTER [ COLUMN ] column SET STATISTICS integer

right?

Re: select max(field) from table much faster with a group by clause?

From
Tom Lane
Date:
Palle Girgensohn <girgen@pingpong.net> writes:
> --On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane <tgl@sss.pgh.pa.us>
> wrote:
>> Um, you did re-ANALYZE the table after changing the setting?

> alter table login alter userid SET statistics 1000;
> vacuum analyze login;

Hm, that's the approved procedure all right.  But the plans didn't
change at all?  Not even the estimated number of rows?

            regards, tom lane

Re: select max(field) from table much faster with a group by clause?

From
Gregory Stark
Date:
"Palle Girgensohn" <girgen@pingpong.net> writes:

> Are there any other things I can modify?

You might consider an index on <userid,when>. Keep in mind that every new
index imposes an incremental cost on every update and insert and increases the
time for vacuum.
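
For example, something along these lines (the index name is just
illustrative):

    create index login_userid_when_idx on login (userid, "when");

With a matching (userid, "when") index, the max("when")-for-one-userid
lookup can be satisfied by a short scan of that index rather than a
backward scan of login_when_idx that filters out every other user's rows.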

> max_prepared_transactions = 100         # can be 0 or more

Are you actually using prepared transactions (are you synchronising multiple
databases using a transaction manager)? If not then set this to 0 as it takes
some resources.

> maintenance_work_mem = 16384            # min 1024, size in KB

Raising this might decrease vacuum times if that's a problem.

> fsync = off

You realize that this means if the system loses power or the kernel crashes
you could have data corruption? Do you take very frequent backups or can you
reconstruct your data?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: select max(field) from table much faster with a group by clause?

From
hubert depesz lubaczewski
Date:
On Thu, Nov 01, 2007 at 02:07:55PM +0100, Palle Girgensohn wrote:
> I have a table "login" with approx 600,000 tuples, a person table with
> approx 100000 tuples.
> When running
> select max("when") from login where userid='userid'
> it takes a second or two, but when adding "group by userid" the planner
> decides on using another plan, and it gets *much* faster. See example below.
> pp=# explain analyze SELECT max("when") FROM login WHERE userid='girgen' ;

just do:
create index q on login (userid, "when"); and you should be fine.
if it will not help, rewrite the query as:
select "when"
from login
where userid = 'girgen'
order by userid desc, "when" desc limit 1;

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: select max(field) from table much faster with a group by clause?

From
Palle Girgensohn
Date:

--On torsdag, november 01, 2007 11.34.42 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

> Palle Girgensohn <girgen@pingpong.net> writes:
>> --On torsdag, november 01, 2007 11.06.57 -0400 Tom Lane
>> <tgl@sss.pgh.pa.us>  wrote:
>>> Um, you did re-ANALYZE the table after changing the setting?
>
>> alter table login alter userid SET statistics 1000;
>> vacuum analyze login;
>
> Hm, that's the approved procedure all right.  But the plans didn't
> change at all?  Not even the estimated number of rows?

Estimated number of rows did change from ~400 to ~1900, but the timing was
the same.

Seems that the problem is that it is using the index on "when". Removing
that index (login_when_idx) changes the plan, and makes the query equally
fast whether the group by is there or not. I may need that index, though,
in which case one more index, on (userid, "when"), will fix the problem.
I'd rather get rid of an index than create another one, though.

Anyway, I think I have two possible solutions that will work for me. I
still think it is strange that the group by clause so radically changes
the behaviour and the query time.

Cheers,
Palle
