Thread: speeding up COUNT and DISTINCT queries

speeding up COUNT and DISTINCT queries

From
Max Baker
Date:
I'm looking for a general method to
speed up DISTINCT and COUNT queries.


mydatabase=> EXPLAIN ANALYZE  select distinct(mac) from node;
NOTICE:  QUERY PLAN:

Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual
time=45289.78..45598.62 rows=25334 loops=1)
  ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual
  time=45289.77..45411.53 rows=34597 loops=1)
          ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
          width=6) (actual time=6.73..44383.57 rows=34597 loops=1)

          Total runtime: 45673.19 msec
                         ouch.

I run VACUUM ANALYZE once a day.

Thanks,
max

Re: speeding up COUNT and DISTINCT queries

From
Rod Taylor
Date:
On Wed, 2003-03-12 at 17:38, Max Baker wrote:
> I'm looking for a general method to
> speed up DISTINCT and COUNT queries.
>
>
> mydatabase=> EXPLAIN ANALYZE  select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual
> time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual
>   time=45289.77..45411.53 rows=34597 loops=1)
>           ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
>           width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
>           Total runtime: 45673.19 msec
>                          ouch.
>
> I run VACUUM ANALYZE once a day.

That's not going to do anything for that query, as there is only one
possible plan at the moment :)

I don't think you can do much about that query, other than buy a faster
hard disk or more RAM.  Nearly all the time seems to be spent pulling the
data off the disk (in the Seq Scan).

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc


Re: speeding up COUNT and DISTINCT queries

From
"Chad Thompson"
Date:
I've found that GROUP BY works faster than DISTINCT.

Try
EXPLAIN ANALYZE  select mac from node group by mac;

HTH
Chad

----- Original Message -----
From: "Max Baker" <max@warped.org>
To: "PostgreSQL Performance Mailing List" <pgsql-performance@postgresql.org>
Sent: Wednesday, March 12, 2003 3:38 PM
Subject: [PERFORM] speeding up COUNT and DISTINCT queries


> I'm looking for a general method to
> speed up DISTINCT and COUNT queries.
>
>
> mydatabase=> EXPLAIN ANALYZE  select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual
> time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual
>   time=45289.77..45411.53 rows=34597 loops=1)
>           ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
>           width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
>           Total runtime: 45673.19 msec
>                          ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
>


Re: speeding up COUNT and DISTINCT queries

From
Jean-Luc Lachance
Date:
Do you have an index on mac?
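If not, something like this would add one (the index name is just an
example; run ANALYZE afterwards so the planner knows about it):

CREATE INDEX idx_node_mac ON node (mac);
ANALYZE node;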


Max Baker wrote:
>
> I'm looking for a general method to
> speed up DISTINCT and COUNT queries.
>
> mydatabase=> EXPLAIN ANALYZE  select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual
> time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual
>   time=45289.77..45411.53 rows=34597 loops=1)
>           ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
>           width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
>           Total runtime: 45673.19 msec
>                          ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
>

Re: speeding up COUNT and DISTINCT queries

From
Manfred Koizar
Date:
On Wed, 12 Mar 2003 14:38:11 -0800, Max Baker <max@warped.org> wrote:
>          ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
>          width=6) (actual time=6.73..44383.57 rows=34597 loops=1)

35000 tuples in 100000 pages?

>I run VACUUM ANALYZE once a day.

Try VACUUM FULL VERBOSE ANALYZE;  this should bring your table back down
to a reasonable size.  If the table starts growing again, VACUUM more
often.

Servus
 Manfred

Re: speeding up COUNT and DISTINCT queries

From
Max Baker
Date:
On Wed, Mar 12, 2003 at 03:55:09PM -0700, Chad Thompson wrote:
> I've found that GROUP BY works faster than DISTINCT.
>
> Try
> EXPLAIN ANALYZE  select mac from node group by mac;

This was about 25% faster, thanks!

That will work for DISTINCT-only calls, but I'm still looking for a
way to speed up the COUNT() command.  Maybe an internal counter of rows,
maintained by triggers?
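Something along these lines, maybe (completely untested sketch -- the side
table and trigger names are made up, and it assumes plpgsql is installed
with createlang):

-- keep a running row count for node in a one-row side table
CREATE TABLE node_rowcount (n bigint NOT NULL);
INSERT INTO node_rowcount SELECT count(*) FROM node;

CREATE FUNCTION node_rowcount_trig() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE node_rowcount SET n = n + 1;
    ELSIF TG_OP = ''DELETE'' THEN
        UPDATE node_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER node_rowcount_maint AFTER INSERT OR DELETE ON node
    FOR EACH ROW EXECUTE PROCEDURE node_rowcount_trig();

-- then SELECT n FROM node_rowcount; stands in for SELECT count(*) FROM node;

No idea how well that single-row counter table would hold up with lots of
concurrent writers, though.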

-m

Re: speeding up COUNT and DISTINCT queries

From
Max Baker
Date:
On Thu, Mar 13, 2003 at 12:48:27AM +0100, Manfred Koizar wrote:
> On Wed, 12 Mar 2003 14:38:11 -0800, Max Baker <max@warped.org> wrote:
> >          ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
> >          width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> 35000 tuples in 100000 pages?
>
> >I run VACUUM ANALYZE once a day.
>
> Try VACUUM FULL VERBOSE ANALYZE;  this should bring your table back down
> to a reasonable size.  If the table starts growing again, VACUUM more
> often.

Manfred,

Thanks for the help.  I guess I'm not clear on why there is so much
extra cruft.  Does postgres leave a little bit behind every time it does
an update?  Because this table is updated constantly.

Check out the results, 1.5 seconds compared to 46 seconds:

mydb=> vacuum full verbose analyze node;
NOTICE:  --Relation node--
NOTICE:  Pages 107589: Changed 0, reaped 107588, Empty 0, New 0; Tup 34846: Vac 186847, Keep/VTL 0/0, UnUsed 9450103,
MinLen 88, MaxLen 104; Re-using: Free/Avail. Space 837449444/837449368; EndEmpty/Avail. Pages 0/107588.
        CPU 15.32s/0.51u sec elapsed 30.89 sec.
NOTICE:  Index node_pkey: Pages 10412; Tuples 34846: Deleted 186847.
        CPU 3.67s/2.48u sec elapsed 77.06 sec.
NOTICE:  Index idx_node_switch_port: Pages 54588; Tuples 34846: Deleted 186847.
        CPU 9.59s/2.42u sec elapsed 273.50 sec.
NOTICE:  Index idx_node_switch: Pages 50069; Tuples 34846: Deleted 186847.
        CPU 8.46s/2.08u sec elapsed 258.62 sec.
NOTICE:  Index idx_node_mac: Pages 6749; Tuples 34846: Deleted 186847.
        CPU 2.19s/1.59u sec elapsed 56.05 sec.
NOTICE:  Index idx_node_switch_port_active: Pages 51138; Tuples 34846: Deleted 186847.
        CPU 8.58s/2.99u sec elapsed 273.03 sec.
NOTICE:  Index idx_node_mac_active: Pages 6526; Tuples 34846: Deleted 186847.
        CPU 1.75s/1.90u sec elapsed 46.70 sec.
NOTICE:  Rel node: Pages: 107589 --> 399; Tuple(s) moved: 34303.
        CPU 83.49s/51.73u sec elapsed 1252.35 sec.
NOTICE:  Index node_pkey: Pages 10412; Tuples 34846: Deleted 34303.
        CPU 3.65s/1.64u sec elapsed 72.99 sec.
NOTICE:  Index idx_node_switch_port: Pages 54650; Tuples 34846: Deleted 34303.
        CPU 10.77s/2.05u sec elapsed 278.46 sec.
NOTICE:  Index idx_node_switch: Pages 50114; Tuples 34846: Deleted 34303.
        CPU 9.95s/1.65u sec elapsed 266.55 sec.
NOTICE:  Index idx_node_mac: Pages 6749; Tuples 34846: Deleted 34303.
        CPU 1.75s/1.13u sec elapsed 52.78 sec.
NOTICE:  Index idx_node_switch_port_active: Pages 51197; Tuples 34846: Deleted 34303.
        CPU 10.48s/1.89u sec elapsed 287.46 sec.
NOTICE:  Index idx_node_mac_active: Pages 6526; Tuples 34846: Deleted 34303.
        CPU 2.16s/0.96u sec elapsed 48.67 sec.
NOTICE:  --Relation pg_toast_64458--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0;
Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE:  Index pg_toast_64458_idx: Pages 1; Tuples 0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE:  Analyzing node
VACUUM

mydb=> EXPLAIN ANALYZE  select distinct(mac) from node;
NOTICE:  QUERY PLAN:

Unique  (cost=3376.37..3463.48 rows=3485 width=6) (actual time=1049.09..1400.45 rows=25340 loops=1)
  ->  Sort  (cost=3376.37..3376.37 rows=34846 width=6) (actual time=1049.07..1190.58 rows=34846 loops=1)
        ->  Seq Scan on node  (cost=0.00..747.46 rows=34846 width=6) (actual time=0.14..221.18 rows=34846 loops=1)
Total runtime: 1491.56 msec

EXPLAIN

now that's results =]
-m

Re: speeding up COUNT and DISTINCT queries

From
Joe Conway
Date:
Max Baker wrote:
> Thanks for the help.  I guess I'm not clear on why there is so much
> extra cruft.  Does postgres leave a little bit behind every time it does
> an update?  Because this table is updated constantly.
>

Yes. See:
http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html

Joe


Re: speeding up COUNT and DISTINCT queries

From
Max Baker
Date:
On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> Max Baker wrote:
> >Thanks for the help.  I guess I'm not clear on why there is so much
> >extra cruft.  Does postgres leave a little bit behind every time it does
> >an update?  Because this table is updated constantly.
> >
>
> Yes. See:
> http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html

That would explain why once a night isn't enough.  Thanks.
The contents of this table get refreshed every 4 hours.  I'll add a
vacuum after every refresh and compare the results in a couple days.

-m

Re: speeding up COUNT and DISTINCT queries

From
"Christopher Kings-Lynne"
Date:
Try setting up a trigger to maintain a separate table containing only the
distinct values...
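Roughly like this, maybe (untested sketch -- it assumes mac is a macaddr
column and that macs are never deleted; adjust names and types to your
schema):

-- side table holding one row per distinct mac
CREATE TABLE node_macs (mac macaddr PRIMARY KEY);
INSERT INTO node_macs SELECT DISTINCT mac FROM node WHERE mac IS NOT NULL;

CREATE FUNCTION node_macs_trig() RETURNS trigger AS '
BEGIN
    IF NEW.mac IS NOT NULL AND NOT EXISTS
       (SELECT 1 FROM node_macs WHERE mac = NEW.mac) THEN
        INSERT INTO node_macs (mac) VALUES (NEW.mac);
    END IF;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER node_macs_maint BEFORE INSERT OR UPDATE ON node
    FOR EACH ROW EXECUTE PROCEDURE node_macs_trig();

-- the expensive DISTINCT then becomes:  SELECT mac FROM node_macs;

Handling DELETEs (dropping a mac once the last row using it is gone) is
messier, and two concurrent inserts of a brand-new mac can still race on the
primary key, so treat this as a starting point only.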

Chris

----- Original Message -----
From: "Max Baker" <max@warped.org>
To: "PostgreSQL Performance Mailing List" <pgsql-performance@postgresql.org>
Sent: Thursday, March 13, 2003 6:38 AM
Subject: [PERFORM] speeding up COUNT and DISTINCT queries


> I'm looking for a general method to
> speed up DISTINCT and COUNT queries.
>
>
> mydatabase=> EXPLAIN ANALYZE  select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual
> time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual
>   time=45289.77..45411.53 rows=34597 loops=1)
>           ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561
>           width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
>           Total runtime: 45673.19 msec
>                          ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
>


Re: speeding up COUNT and DISTINCT queries

From
Greg Stark
Date:
Max Baker <max@warped.org> writes:

> On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > Max Baker wrote:
> > >Thanks for the help.  I guess I'm not clear on why there is so much
> > >extra cruft.  Does postgres leave a little bit behind every time it does
> > >an update?  Because this table is updated constantly.
> > >
> >
> > Yes. See:
> > http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html
>
> That would explain why once a night isn't enough.  Thanks.
> The contents of this table get refreshed every 4 hours.  I'll add a
> vacuum after every refresh and compare the results in a couple days.

If it gets completely refreshed, i.e., every tuple is updated or deleted and
re-inserted in a big batch job, then VACUUM might never be enough without
boosting some config values a lot. You might need to do a VACUUM FULL after
the refresh. VACUUM FULL locks the table, though, which might be unfortunate.

VACUUM FULL should be sufficient, but you might want to consider instead
TRUNCATE-ing the table and then reinserting records rather than deleting, if
that's what you're doing. Or, alternatively, building the new data in a new
table and then doing a switcheroo with ALTER TABLE RENAME. However, ALTER TABLE
(and possibly TRUNCATE as well?) will invalidate functions and other objects
that refer to the table.
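The switcheroo would be something along these lines (sketch only -- table
names are made up, and you'd also recreate node's indexes on the new table
before the rename):

BEGIN;
-- empty copy of node's structure (indexes and constraints are not copied)
CREATE TABLE node_new AS SELECT * FROM node LIMIT 0;
-- ... load the fresh data into node_new here ...
ALTER TABLE node RENAME TO node_old;
ALTER TABLE node_new RENAME TO node;
COMMIT;
DROP TABLE node_old;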

Regarding the original question:

. 7.4 will probably be faster than 7.3 at least if you stick with GROUP BY.

. You could try building an index on mac, but I suspect even then it'll choose
  the sequential scan. But try it with an index and enable_seqscan = off (see
  the sketch after this list) to see if it's even worth trying to get it to
  use the index. If so, you'll have to lower random_page_cost and/or play
  with cpu_tuple_cost and other variables to get it to do so.

. You might also want to cluster the table on that index. You would have to
  recluster it every time you do your refresh and it's not clear how much it
  would help if any. But it might be worth trying.
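For the last two points, the experiment might look something like this
(rough sketch; judging by the VACUUM VERBOSE output you posted, an
idx_node_mac index may already exist, and the CLUSTER syntax here is the
7.3-era form):

-- force an index scan to see whether it can ever win
CREATE INDEX idx_node_mac ON node (mac);    -- skip if it already exists
ANALYZE node;
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT mac FROM node GROUP BY mac;
SET enable_seqscan = on;

-- and to cluster the table on that index (redo after each refresh)
CLUSTER idx_node_mac ON node;
ANALYZE node;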

--
greg

Re: speeding up COUNT and DISTINCT queries

From
Robert Treat
Date:
On Thu, 2003-03-13 at 10:42, Greg Stark wrote:
> Max Baker <max@warped.org> writes:
> > On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > That would explain why once a night isn't enough.  Thanks.
> > The contents of this table get refreshed every 4 hours.  I'll add a
> > vacuum after every refresh and compare the results in a couple days.
>
> If it gets completely refreshed, i.e., every tuple is updated or deleted and
> re-inserted in a big batch job, then VACUUM might never be enough without
> boosting some config values a lot. You might need to do a VACUUM FULL after
> the refresh. VACUUM FULL locks the table, though, which might be unfortunate.
>

hmm... approx 35,000 records, getting updated every 4 hours. so..

35000 / (4*60) =~ 145 tuples per minute.

Let's assume we want to keep any overhead at 10% or less, so we need to
lazy vacuum every 3500 updates. so...

3500 tuples / 145 tpm =~ 25 minutes.

So, set up a cron job to lazy vacuum every 20 minutes and see how that
works for you.
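Something like this in the postgres user's crontab, for example (the
database and table names here are just placeholders -- adjust to taste):

# lazy (non-FULL) VACUUM ANALYZE of the node table every 20 minutes
*/20 * * * *   psql -d mydb -c "VACUUM ANALYZE node;"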

Robert Treat



Re: speeding up COUNT and DISTINCT queries

From
Max Baker
Date:
On Thu, Mar 13, 2003 at 03:05:30PM -0500, Robert Treat wrote:
> On Thu, 2003-03-13 at 10:42, Greg Stark wrote:
> > Max Baker <max@warped.org> writes:
> > > On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > > That would explain why once a night isn't enough.  Thanks.
> > > The contents of this table get refreshed every 4 hours.  I'll add a
> > > vacuum after every refresh and compare the results in a couple days.
> >
> > If it gets completely refreshed, i.e., every tuple is updated or deleted and
> > re-inserted in a big batch job, then VACUUM might never be enough without
> > boosting some config values a lot. You might need to do a VACUUM FULL after
> > the refresh. VACUUM FULL locks the table, though, which might be unfortunate.

I'm not starting with fresh data every time, I'm usually checking for
an existing record, then setting a timestamp and a boolean flag.

I've run some profiling and it's about 8000-10,000 UPDATEs every 4
hours.  These are accompanied by about 800-1000 INSERTs.

> hmm... approx 35,000 records, getting updated every 4 hours. so..
>
> 35000 / (4*60) =~ 145 tuples per minute.
>
> Let's assume we want to keep any overhead at 10% or less, so we need to
> lazy vacuum every 3500 updates. so...
>
> 3500 tuples / 145 tpm =~ 25 minutes.
>
> So, set up a cron job to lazy vacuum every 20 minutes and see how that
> works for you.

I'm now having VACUUM ANALYZE run after each of these updates.  The data
comes in in spurts -- a 90-minute batch job that runs every 4 hours.


thanks folks,
-m

Re: speeding up COUNT and DISTINCT queries

From
Josh Berkus
Date:
Max,

> I'm not starting with fresh data every time, I'm usually checking for
> an existing record, then setting a timestamp and a boolean flag.
>
> I've run some profiling and it's about 8000-10,000 UPDATEs every 4
> hours.  These are accompanied by about 800-1000 INSERTs.

If these are wide records (i.e., large text fields or lots of columns) you may
want to consider raising your max_fsm_pages in postgresql.conf slightly,
to about 15,000.

You can get a better idea of a good FSM setting by running VACUUM FULL VERBOSE
after your next batch (this will lock the database temporarily) and seeing
how many data pages are "reclaimed", in total, by the vacuum.  Then set your
FSM to at least that level.
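In postgresql.conf that would be something like this (the numbers are only
illustrative -- size them from your own VACUUM VERBOSE output; I believe the
postmaster has to be restarted for these to take effect):

max_fsm_relations = 1000     # tables + indexes tracked by the free space map
max_fsm_pages = 15000        # total pages with free space to remember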

And has anyone mentioned REINDEX on this thread?

--
Josh Berkus
Aglio Database Solutions
San Francisco