Thread: speeding up COUNT and DISTINCT queries
I'm looking for a general method to speed up DISTINCT and COUNT queries.

mydatabase=> EXPLAIN ANALYZE select distinct(mac) from node;
NOTICE:  QUERY PLAN:

Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual time=45289.78..45598.62 rows=25334 loops=1)
  ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual time=45289.77..45411.53 rows=34597 loops=1)
        ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)

Total runtime: 45673.19 msec

Ouch.

I run VACUUM ANALYZE once a day.

Thanks,
max
On Wed, 2003-03-12 at 17:38, Max Baker wrote:
> I'm looking for a general method to speed up DISTINCT and COUNT queries.
>
> mydatabase=> EXPLAIN ANALYZE select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual time=45289.77..45411.53 rows=34597 loops=1)
>         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> Total runtime: 45673.19 msec
>
> Ouch.
>
> I run VACUUM ANALYZE once a day.

That's not going to do anything for that query, as there is only one possible
plan at the moment :)

I don't think you can do much about that query, other than buying a faster
hard disk or more RAM. Nearly all the time seems to be spent pulling the data
off the disk (in the Seq Scan).

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
I've found that GROUP BY works faster than DISTINCT.

Try:

EXPLAIN ANALYZE select mac from node group by mac;

HTH
Chad

----- Original Message -----
From: "Max Baker" <max@warped.org>
To: "PostgreSQL Performance Mailing List" <pgsql-performance@postgresql.org>
Sent: Wednesday, March 12, 2003 3:38 PM
Subject: [PERFORM] speeding up COUNT and DISTINCT queries

> I'm looking for a general method to speed up DISTINCT and COUNT queries.
>
> mydatabase=> EXPLAIN ANALYZE select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual time=45289.77..45411.53 rows=34597 loops=1)
>         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> Total runtime: 45673.19 msec
>
> Ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
Do you have an index on mac?

Max Baker wrote:
> I'm looking for a general method to speed up DISTINCT and COUNT queries.
>
> mydatabase=> EXPLAIN ANALYZE select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual time=45289.77..45411.53 rows=34597 loops=1)
>         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> Total runtime: 45673.19 msec
>
> Ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
On Wed, 12 Mar 2003 14:38:11 -0800, Max Baker <max@warped.org> wrote:
>         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)

35000 tuples in 100000 pages?

> I run VACUUM ANALYZE once a day.

Try VACUUM FULL VERBOSE ANALYZE; this should bring your table back down to a
reasonable size. If the table starts growing again, VACUUM more often.

Servus
 Manfred
On Wed, Mar 12, 2003 at 03:55:09PM -0700, Chad Thompson wrote:
> I've found that GROUP BY works faster than DISTINCT.
>
> Try:
>
> EXPLAIN ANALYZE select mac from node group by mac;

This was about 25% faster, thanks!

That will work for DISTINCT-only calls, but I'm still looking for a way to
speed up the COUNT() command. Maybe an internal counter of rows, maintained
by triggers?

-m
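For what it's worth, a trigger-maintained row counter along the lines Max is wondering about could look roughly like the sketch below. This is only an illustration: the node_rowcount table and the function/trigger names are invented, and plpgsql has to be installed in the database (e.g. with createlang).

CREATE TABLE node_rowcount (n bigint NOT NULL);
INSERT INTO node_rowcount SELECT count(*) FROM node;   -- seed it once

CREATE FUNCTION node_rowcount_trigfn() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE node_rowcount SET n = n + 1;
    ELSIF TG_OP = ''DELETE'' THEN
        UPDATE node_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;   -- result is ignored for AFTER triggers
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER node_rowcount_trigger AFTER INSERT OR DELETE ON node
    FOR EACH ROW EXECUTE PROCEDURE node_rowcount_trigfn();

-- an unqualified count(*) then becomes a one-row lookup:
SELECT n FROM node_rowcount;

Note that the single counter row turns every INSERT or DELETE on node into an extra UPDATE and becomes a serialization point for concurrent writers, so this trades write throughput for a cheap count.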
On Thu, Mar 13, 2003 at 12:48:27AM +0100, Manfred Koizar wrote:
> On Wed, 12 Mar 2003 14:38:11 -0800, Max Baker <max@warped.org> wrote:
> >         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> 35000 tuples in 100000 pages?
>
> > I run VACUUM ANALYZE once a day.
>
> Try VACUUM FULL VERBOSE ANALYZE; this should bring your table back down to a
> reasonable size. If the table starts growing again, VACUUM more often.

Manfred,

Thanks for the help. I guess I'm not clear on why there is so much extra
cruft. Does postgres leave a little bit behind every time it does an update?
Because this table is updated constantly.

Check out the results, 1.5 seconds compared to 46 seconds:

mydb=> vacuum full verbose analyze node;
NOTICE:  --Relation node--
NOTICE:  Pages 107589: Changed 0, reaped 107588, Empty 0, New 0; Tup 34846: Vac 186847, Keep/VTL 0/0, UnUsed 9450103, MinLen 88, MaxLen 104; Re-using: Free/Avail. Space 837449444/837449368; EndEmpty/Avail. Pages 0/107588.
        CPU 15.32s/0.51u sec elapsed 30.89 sec.
NOTICE:  Index node_pkey: Pages 10412; Tuples 34846: Deleted 186847.
        CPU 3.67s/2.48u sec elapsed 77.06 sec.
NOTICE:  Index idx_node_switch_port: Pages 54588; Tuples 34846: Deleted 186847.
        CPU 9.59s/2.42u sec elapsed 273.50 sec.
NOTICE:  Index idx_node_switch: Pages 50069; Tuples 34846: Deleted 186847.
        CPU 8.46s/2.08u sec elapsed 258.62 sec.
NOTICE:  Index idx_node_mac: Pages 6749; Tuples 34846: Deleted 186847.
        CPU 2.19s/1.59u sec elapsed 56.05 sec.
NOTICE:  Index idx_node_switch_port_active: Pages 51138; Tuples 34846: Deleted 186847.
        CPU 8.58s/2.99u sec elapsed 273.03 sec.
NOTICE:  Index idx_node_mac_active: Pages 6526; Tuples 34846: Deleted 186847.
        CPU 1.75s/1.90u sec elapsed 46.70 sec.
NOTICE:  Rel node: Pages: 107589 --> 399; Tuple(s) moved: 34303.
        CPU 83.49s/51.73u sec elapsed 1252.35 sec.
NOTICE:  Index node_pkey: Pages 10412; Tuples 34846: Deleted 34303.
        CPU 3.65s/1.64u sec elapsed 72.99 sec.
NOTICE:  Index idx_node_switch_port: Pages 54650; Tuples 34846: Deleted 34303.
        CPU 10.77s/2.05u sec elapsed 278.46 sec.
NOTICE:  Index idx_node_switch: Pages 50114; Tuples 34846: Deleted 34303.
        CPU 9.95s/1.65u sec elapsed 266.55 sec.
NOTICE:  Index idx_node_mac: Pages 6749; Tuples 34846: Deleted 34303.
        CPU 1.75s/1.13u sec elapsed 52.78 sec.
NOTICE:  Index idx_node_switch_port_active: Pages 51197; Tuples 34846: Deleted 34303.
        CPU 10.48s/1.89u sec elapsed 287.46 sec.
NOTICE:  Index idx_node_mac_active: Pages 6526; Tuples 34846: Deleted 34303.
        CPU 2.16s/0.96u sec elapsed 48.67 sec.
NOTICE:  --Relation pg_toast_64458--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL 0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE:  Index pg_toast_64458_idx: Pages 1; Tuples 0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE:  Analyzing node
VACUUM

mydb=> EXPLAIN ANALYZE select distinct(mac) from node;
NOTICE:  QUERY PLAN:

Unique  (cost=3376.37..3463.48 rows=3485 width=6) (actual time=1049.09..1400.45 rows=25340 loops=1)
  ->  Sort  (cost=3376.37..3376.37 rows=34846 width=6) (actual time=1049.07..1190.58 rows=34846 loops=1)
        ->  Seq Scan on node  (cost=0.00..747.46 rows=34846 width=6) (actual time=0.14..221.18 rows=34846 loops=1)

Total runtime: 1491.56 msec
EXPLAIN

Now that's results =]

-m
Max Baker wrote:
> Thanks for the help. I guess I'm not clear on why there is so much extra
> cruft. Does postgres leave a little bit behind every time it does an
> update? Because this table is updated constantly.

Yes. See:
http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html

Joe
On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> Max Baker wrote:
> > Thanks for the help. I guess I'm not clear on why there is so much extra
> > cruft. Does postgres leave a little bit behind every time it does an
> > update? Because this table is updated constantly.
>
> Yes. See:
> http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html

That would explain why once a night isn't enough. Thanks.

The contents of this table get refreshed every 4 hours. I'll add a vacuum
after every refresh and compare the results in a couple of days.

-m
Try setting up a trigger to maintain a separate table containing only the
distinct values...

Chris

----- Original Message -----
From: "Max Baker" <max@warped.org>
To: "PostgreSQL Performance Mailing List" <pgsql-performance@postgresql.org>
Sent: Thursday, March 13, 2003 6:38 AM
Subject: [PERFORM] speeding up COUNT and DISTINCT queries

> I'm looking for a general method to speed up DISTINCT and COUNT queries.
>
> mydatabase=> EXPLAIN ANALYZE select distinct(mac) from node;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=110425.67..110514.57 rows=3556 width=6) (actual time=45289.78..45598.62 rows=25334 loops=1)
>   ->  Sort  (cost=110425.67..110425.67 rows=35561 width=6) (actual time=45289.77..45411.53 rows=34597 loops=1)
>         ->  Seq Scan on node  (cost=0.00..107737.61 rows=35561 width=6) (actual time=6.73..44383.57 rows=34597 loops=1)
>
> Total runtime: 45673.19 msec
>
> Ouch.
>
> I run VACUUM ANALYZE once a day.
>
> Thanks,
> max
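A rough sketch of what Chris is suggesting, handling only the INSERT path; every name here is invented, the macaddr column type is a guess, and UPDATEs or DELETEs of mac values would need similar handling (or a periodic rebuild of the side table):

CREATE TABLE node_macs (mac macaddr PRIMARY KEY);
INSERT INTO node_macs SELECT DISTINCT mac FROM node;   -- seed it once

CREATE FUNCTION node_macs_trigfn() RETURNS trigger AS '
BEGIN
    -- record each mac the first time it appears
    PERFORM 1 FROM node_macs WHERE mac = NEW.mac;
    IF NOT FOUND THEN
        INSERT INTO node_macs VALUES (NEW.mac);
    END IF;
    RETURN NULL;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER node_macs_trigger AFTER INSERT ON node
    FOR EACH ROW EXECUTE PROCEDURE node_macs_trigfn();

-- then, instead of SELECT DISTINCT mac FROM node:
SELECT mac FROM node_macs;
SELECT count(*) FROM node_macs;    -- number of distinct macs

Two concurrent inserts of the same brand-new mac can still race, and one of them will then fail on the primary key, so the application (or a more careful trigger) has to tolerate that.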
Max Baker <max@warped.org> writes:

> On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > Max Baker wrote:
> > > Thanks for the help. I guess I'm not clear on why there is so much extra
> > > cruft. Does postgres leave a little bit behind every time it does an
> > > update? Because this table is updated constantly.
> >
> > Yes. See:
> > http://www.us.postgresql.org/users-lounge/docs/7.3/postgres/routine-vacuuming.html
>
> That would explain why once a night isn't enough. Thanks.
> The contents of this table get refreshed every 4 hours. I'll add a vacuum
> after every refresh and compare the results in a couple of days.

If it gets completely refreshed, i.e. every tuple is updated or deleted and
re-inserted in a big batch job, then VACUUM might never be enough without
boosting some config values a lot. You might need to do a VACUUM FULL after
the refresh. VACUUM FULL locks the table, though, which might be unfortunate.

VACUUM FULL should be sufficient, but you might want to consider instead
TRUNCATE-ing the table and then reinserting records rather than deleting, if
that's what you're doing. Or alternatively build the new data in a new table
and then do a switcheroo with ALTER TABLE RENAME. However, ALTER TABLE (and
possibly TRUNCATE as well?) will invalidate functions and other objects that
refer to the table.

Regarding the original question:

. 7.4 will probably be faster than 7.3, at least if you stick with GROUP BY.

. You could try building an index on mac, but I suspect even then it'll
  choose the sequential scan. But try it with an index and
  enable_seqscan = off to see if it's even worth trying to get it to use the
  index. If so, you'll have to lower random_page_cost and/or play with
  cpu_tuple_cost and other variables to get it to do so.

. You might also want to cluster the table on that index. You would have to
  recluster it every time you do your refresh, and it's not clear how much it
  would help, if any. But it might be worth trying.

--
greg
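A quick way to try Greg's index and clustering suggestions; the VACUUM FULL output earlier in the thread shows an idx_node_mac index already exists, so this only tests whether the planner would ever prefer it, and the specific settings are illustrative rather than recommendations:

SET enable_seqscan = off;
EXPLAIN ANALYZE select mac from node group by mac;
SET enable_seqscan = on;

-- if the index plan wins, nudge the planner toward it:
SET random_page_cost = 2;          -- default is 4; lower values favor index scans

-- and/or physically order the table by that index (takes a lock; 7.3 syntax):
CLUSTER idx_node_mac ON node;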
On Thu, 2003-03-13 at 10:42, Greg Stark wrote:
> Max Baker <max@warped.org> writes:
> > On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > That would explain why once a night isn't enough. Thanks.
> > The contents of this table get refreshed every 4 hours. I'll add a vacuum
> > after every refresh and compare the results in a couple of days.
>
> If it gets completely refreshed, i.e. every tuple is updated or deleted and
> re-inserted in a big batch job, then VACUUM might never be enough without
> boosting some config values a lot. You might need to do a VACUUM FULL after
> the refresh. VACUUM FULL locks the table, though, which might be unfortunate.

Hmm... approx. 35,000 records, getting updated every 4 hours, so:

35000 / (4*60) =~ 145 tuples per minute.

Let's assume we want to keep any overhead at 10% or less, so we need to lazy
vacuum every 3500 updates. So:

3500 tuples / 145 tpm =~ 25 minutes.

So, set up a cron job to lazy vacuum every 20 minutes and see how that works
for you.

Robert Treat
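One way to wire that up, sketched under a few assumptions (a Vixie-style cron that understands */20, the job living in the postgres user's crontab, passwordless local access, and the mydb database name from earlier in the thread):

# lazy VACUUM ANALYZE of the node table every 20 minutes
*/20 * * * * vacuumdb --analyze --table node mydb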
On Thu, Mar 13, 2003 at 03:05:30PM -0500, Robert Treat wrote:
> On Thu, 2003-03-13 at 10:42, Greg Stark wrote:
> > Max Baker <max@warped.org> writes:
> > > On Wed, Mar 12, 2003 at 05:57:50PM -0800, Joe Conway wrote:
> > > That would explain why once a night isn't enough. Thanks.
> > > The contents of this table get refreshed every 4 hours. I'll add a vacuum
> > > after every refresh and compare the results in a couple of days.
> >
> > If it gets completely refreshed, i.e. every tuple is updated or deleted and
> > re-inserted in a big batch job, then VACUUM might never be enough without
> > boosting some config values a lot. You might need to do a VACUUM FULL after
> > the refresh. VACUUM FULL locks the table, though, which might be unfortunate.

I'm not starting with fresh data every time; I'm usually checking for an
existing record, then setting a timestamp and a boolean flag.

I've run some profiling and it's about 8,000-10,000 UPDATEs every 4 hours.
These are accompanied by about 800-1,000 INSERTs.

> Hmm... approx. 35,000 records, getting updated every 4 hours, so:
>
> 35000 / (4*60) =~ 145 tuples per minute.
>
> Let's assume we want to keep any overhead at 10% or less, so we need to lazy
> vacuum every 3500 updates. So:
>
> 3500 tuples / 145 tpm =~ 25 minutes.
>
> So, set up a cron job to lazy vacuum every 20 minutes and see how that works
> for you.

I'm now having VACUUM ANALYZE run after each of these updates. The data comes
in in spurts -- a 90-minute batch job that runs every 4 hours.

Thanks, folks,
-m
Max,

> I'm not starting with fresh data every time; I'm usually checking for an
> existing record, then setting a timestamp and a boolean flag.
>
> I've run some profiling and it's about 8,000-10,000 UPDATEs every 4 hours.
> These are accompanied by about 800-1,000 INSERTs.

If these are wide records (i.e. large text fields or lots of columns) you may
want to consider raising max_fsm_pages in postgresql.conf slightly, to about
15,000.

You can get a better idea of a good FSM setting by running VACUUM FULL VERBOSE
after your next batch (this will lock the database temporarily) and seeing how
many data pages are "reclaimed", in total, by the vacuum. Then set your FSM to
at least that level.

And has anyone mentioned REINDEX on this thread?

--
Josh Berkus
Aglio Database Solutions
San Francisco
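In postgresql.conf terms that might look something like the lines below (7.3-era free space map parameters; the values are only illustrative, and FSM settings take effect at server start), along with the REINDEX Josh mentions:

# postgresql.conf
max_fsm_pages     = 15000    # at least the total pages reclaimed per vacuum cycle
max_fsm_relations = 1000     # tables and indexes tracked by the free space map

REINDEX TABLE node;    -- rebuilds all of node's indexes; takes a lock, so run it off-peak

Given the index bloat visible in the VACUUM FULL output above (idx_node_switch_port still at over 54,000 pages for ~35,000 rows), a periodic REINDEX, or dropping and recreating the indexes after the batch load, could win back much of that space.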