Thread: much slower query in production
Dear all,

I am facing a much, much slower query in production than on my
development computer using a restored production backup, and I don't
understand why, nor do I see what I could do to speed up the query in
production :/

Short data model explanation: one table stores tickets, one table
stores multicards; tickets can belong to a multicard through a
many-to-one relationship, or be independent of multicards. The query I
am interested in will be used to populate a new column in the
multicards table, containing the count of related tickets.

On production:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on multicards  (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.178..1694987.355 rows=204548 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=7.07..7.08 rows=1 width=8) (actual time=8.283..8.283 rows=1 loops=204548)
           ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..7.05 rows=9 width=0) (actual time=1.350..8.280 rows=6 loops=204548)
                 Index Cond: (multicard_uid = multicards.uid)
                 Heap Fetches: 1174940
 Planning Time: 1.220 ms
 Execution Time: 1695029.673 ms

vmstat in the middle of this execution:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1 479548 143652  79412 5609292    0    0  1986   116   999  1115  2  2 71 25
 3  1 479548 134312  79412 5614564    0    0  2386   128  2832  3973 20  2 49 28
 0  1 479548 142520  79428 5617708    0    0  1584   232  3440  3716 11  3 58 29
 0  1 479548 161184  79024 5597756    0    0  1922   144  1249  1562  1  2 70 27
 0  1 479548 161244  79048 5600804    0    0  1556   117  2138  3035  6  2 68 25
 0  2 479548 158384  79048 5604008    0    0  1388   402  2970  4320  6  2 66 27

On my development computer:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on multicards  (cost=0.00..538974.69 rows=204548 width=12) (actual time=0.055..451.691 rows=204548 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=2.57..2.58 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=204548)
           ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..2.56 rows=7 width=0) (actual time=0.001..0.001 rows=6 loops=204548)
                 Index Cond: (multicard_uid = multicards.uid)
                 Heap Fetches: 0
 Planning Time: 0.296 ms
 Execution Time: 456.677 ms

The execution time ratio is a huge 3700x. I guess the Heap Fetches
difference is the most meaningful here; my understanding would be that
the index should easily fit in shared_buffers after a few subselects,
as shared_buffers is configured at 2GB and here's the on-disk size of
the index:

# SELECT relname AS index, reltuples AS "rows estimate",
         pg_size_pretty(pg_table_size(quote_ident(relname))) AS "on disk size"
  FROM pg_class, pg_namespace
  WHERE pg_namespace.oid = pg_class.relnamespace
    AND relkind = 'i' AND nspname = 'public'
    AND relname = 'tickets_multicard_uid';
         index         | rows estimate | on disk size
-----------------------+---------------+--------------
 tickets_multicard_uid |    7.2136e+06 | 161 MB

Though it's not entirely clear to me what "heap fetches" are. It seems
they might be fetches of actual table data (i.e. not index fetches),
but I don't really know why so many of them are needed here, nor how
to reduce that (if that's the explanation for the longer timing).

It's true that production has constant activity, but not that much of
it: load average is typically about 0.5, most queries run very
quickly, and pg_stat_activity frequently reports no currently running
query. Here's vmstat during normal production activity, for reference:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0 480712 151112  10144 5664784    1    1   310    69     0     0  9  1 88  3
 0  0 480712 151312  10152 5664876    0    0     0   248  4911  7072  9  3 87  0
 0  0 480720 165488  10072 5642564    0    4    76   210  2019  2685  6  1 92  1
 0  0 480720 165332  10088 5642508    0    0     0   221  3535 17545 25  4 70  1
 0  0 480720 144772  10108 5643324    0    0    84   378  3833  5096 11  2 80  7
 0  0 480720 143300  10116 5644144    0    0    42   298  3446  4784  6  1 92  1
 0  0 480720 143300  10136 5644256    0    0    10  2340  1073  1496  1  1 96  2

Here's also a second, subsequent run on production, for comparison:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on multicards  (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.101..834176.389 rows=204548 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=7.07..7.08 rows=1 width=8) (actual time=4.075..4.075 rows=1 loops=204548)
           ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..7.05 rows=9 width=0) (actual time=0.624..4.072 rows=6 loops=204548)
                 Index Cond: (multicard_uid = multicards.uid)
                 Heap Fetches: 1174941
 Planning Time: 0.273 ms
 Execution Time: 834209.323 ms

Heap fetches are still almost the same, albeit with the timing divided
by two.

Both environments are running PostgreSQL 11.5 with 2GB shared_buffers.
Differences I can think of: production uses ext4 on DRBD on SATA disks
with Linux 3.2; dev uses ext4 (no DRBD) on SSD with Linux 4.15. I
can't believe the SSD alone would explain the difference? If it would,
then I know what we should do on production..

Thanks for any hints/help!

-- 
Guillaume Cottenceau
On Wed, Feb 26, 2020 at 05:17:21PM +0100, Guillaume Cottenceau wrote:
> On production:
>
> # EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
>  Seq Scan on multicards  (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.178..1694987.355 rows=204548 loops=1)
>    SubPlan 1
>      ->  Aggregate  (cost=7.07..7.08 rows=1 width=8) (actual time=8.283..8.283 rows=1 loops=204548)
>            ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..7.05 rows=9 width=0) (actual time=1.350..8.280 rows=6 loops=204548)
>                  Index Cond: (multicard_uid = multicards.uid)
>                  Heap Fetches: 1174940
>  Planning Time: 1.220 ms
>  Execution Time: 1695029.673 ms
>
> The execution time ratio is a huge 3700x. I guess the Heap Fetches
> difference is the most meaningful here;

Yes, it's doing an "index only" scan, but not very effectively. Vacuum
the tickets table to set relallvisible and see if that helps.

If so, try to keep it better vacuumed with something like:

ALTER TABLE tickets SET (AUTOVACUUM_VACUUM_SCALE_FACTOR=0.005);

-- 
Justin
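For reference, a quick way to see how much of the tickets table the
visibility map already considers all-visible (which is what
relallvisible tracks) is a sketch like this against pg_class; the
table name is taken from the thread:

SELECT relname, relpages, relallvisible,
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'tickets';

A low percentage means most index-only-scan probes will fall back to
heap fetches.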
By the way, I expect the time is cut in half while heap fetches stay similar because the index is now in the OS cache on the second run and didn't need to be fetched from disk. Definitely check on vacuuming as Justin says. If you have a fairly active system, you would need to run this query many times in order to push other stuff out of shared_buffers and get this query to perform more like it does on dev.
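If you want to check cache residency directly, the pg_buffercache
extension (assuming it is available on the server) can show how much
of the index currently sits in shared_buffers; note it cannot see the
OS page cache. A sketch, assuming the default 8kB block size:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS bytes_cached
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('tickets_multicard_uid')
  AND reldatabase = (SELECT oid FROM pg_database
                     WHERE datname = current_database());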
Do you have the option to re-write the query, or is this generated by an ORM? As I read this query, you are forcing the looping. If you aggregate before you join, then the system should be able to do a single scan of the index, aggregate, then join those relatively few rows to the multicards table records:
SELECT transaction_uid, COALESCE( sub.count, 0 ) AS count
FROM multicards
LEFT JOIN ( SELECT multicard_uid, COUNT(*) AS count
            FROM tickets
            GROUP BY multicard_uid ) AS sub
  ON sub.multicard_uid = multicards.uid;
Justin Pryzby <pryzby 'at' telsasoft.com> writes:

> On Wed, Feb 26, 2020 at 05:17:21PM +0100, Guillaume Cottenceau wrote:
>> On production:
>>
>> # EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
>>  Seq Scan on multicards  (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.178..1694987.355 rows=204548 loops=1)
>>    SubPlan 1
>>      ->  Aggregate  (cost=7.07..7.08 rows=1 width=8) (actual time=8.283..8.283 rows=1 loops=204548)
>>            ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..7.05 rows=9 width=0) (actual time=1.350..8.280 rows=6 loops=204548)
>>                  Index Cond: (multicard_uid = multicards.uid)
>>                  Heap Fetches: 1174940
>>  Planning Time: 1.220 ms
>>  Execution Time: 1695029.673 ms
>
>> The execution time ratio is a huge 3700x. I guess the Heap Fetches
>> difference is the most meaningful here;
>
> Yes, it's doing an "index only" scan, but not very effectively.
> Vacuum the tickets table to set relallvisible and see if that helps.
>
> If so, try to keep it better vacuumed with something like
> ALTER TABLE tickets SET (AUTOVACUUM_VACUUM_SCALE_FACTOR=0.005);

Thanks for your reply! The effect is huge:

# vacuum analyze tickets;
VACUUM
Time: 182850.756 ms (03:02.851)

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) FROM multicards;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on multicards  (cost=0.00..947739.65 rows=204548 width=12) (actual time=15.579..5210.869 rows=204548 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=4.59..4.60 rows=1 width=8) (actual time=0.025..0.025 rows=1 loops=204548)
           ->  Index Only Scan using tickets_multicard_uid on tickets  (cost=0.43..4.57 rows=8 width=0) (actual time=0.022..0.024 rows=6 loops=204548)
                 Index Cond: (multicard_uid = multicards.uid)
                 Heap Fetches: 8
 Planning Time: 71.469 ms
 Execution Time: 5223.408 ms
(8 rows)

Time: 5332.361 ms (00:05.332)

(and subsequent executions are below 1 second)

It is actually consistent with using a restored backup on the dev
computer, as my understanding is this comes out without any garbage
and like a perfectly vacuumed database. Btw do you have any hint as to
how to perform timings using production data which are consistent with
production? Backup/restore is maybe not the way to go, but rather a
block device level copy?

Since PostgreSQL 8, I have to say I rely entirely on autovacuum, and
did not notice it could really run too infrequently for the workload
and create such a difference. I see in the documentation a default
autovacuum_vacuum_scale_factor = 0.2; is that something that is
typically lowered globally, e.g. on a fairly active system? I am
worried that changing that configuration for that table to 0.005 would
fix this query and similar ones, but later I might face the same
situation on other tables. Or how would you select tables for a
lowered setting?

Thanks!

-- 
Guillaume Cottenceau
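One rough way to shortlist tables that might deserve a lowered
per-table setting is to order pg_stat_user_tables by dead tuples; a
sketch only, with no authoritative thresholds implied:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 1) AS pct_dead,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;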
Michael Lewis <mlewis 'at' entrata.com> writes:

> By the way, I expect the time is cut in half while heap fetches stay
> similar because the index is now in the OS cache on the second run
> and didn't need to be fetched from disk. Definitely check on
> vacuuming as Justin says. If you have a fairly active system, you
> would need to run this query many times in order to push other stuff
> out of shared_buffers and get this query to perform more like it
> does on dev.
>
> Do you have the option to re-write the query, or is this generated
> by an ORM? As I read this query, you are forcing the looping. If you
> aggregate before you join, then the system should be able to do a
> single scan of the index, aggregate, then join those relatively few
> rows to the multicards table records:
>
> SELECT transaction_uid, COALESCE( sub.count, 0 ) AS count
> FROM multicards
> LEFT JOIN ( SELECT multicard_uid, COUNT(*) AS count
>             FROM tickets
>             GROUP BY multicard_uid ) AS sub
>   ON sub.multicard_uid = multicards.uid;

Thanks for this hint! I keep hitting the fact that I don't write good
queries using explicit joins :/ Execution time (before vacuuming the
table as advised by Justin) is down 38x, to 44509 ms, using this
query :)

The real query is an UPDATE of the multicards table to set the count
value. I rewrote it using your approach, but I think I lose what
COALESCE did in your query: this obviously updates only the rows where
count >= 1:

UPDATE multicards
SET defacements = count
FROM ( SELECT multicard_uid, COUNT(*) AS count
       FROM tickets
       GROUP BY multicard_uid ) AS sub
WHERE uid = multicard_uid;

Any hinted solution to do that in one pass? I could do a first pass
setting defacements = 0, but that would produce more garbage :/

Thanks!

-- 
Guillaume Cottenceau
UPDATE multicards
SET defacements = COALESCE( count, 0 )
FROM ( SELECT multicard_uid, COUNT(*) AS count FROM tickets GROUP BY multicard_uid ) AS sub
WHERE uid = multicard_uid OR multicard_uid is null;
I expect this should work. Not sure of performance of course.
Vacuum everything that you restored
Michael Lewis <mlewis 'at' entrata.com> writes:

> UPDATE multicards
> SET defacements = COALESCE( count, 0 )
> FROM ( SELECT multicard_uid, COUNT(*) AS count FROM tickets GROUP BY multicard_uid ) AS sub
> WHERE uid = multicard_uid OR multicard_uid is null;
>
> I expect this should work. Not sure of performance of course.

This looked great, but as it seems you suspected, it's very slow :/ I
interrupted it after a 5-minute run on my dev computer.

-- 
Guillaume Cottenceau
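A pattern that is sometimes used to keep the pre-aggregation while
still touching every multicards row in a single statement is a
self-join in the FROM clause; a sketch only, untested against this
schema (table and column names taken from the thread):

UPDATE multicards m
SET defacements = COALESCE(sub.count, 0)
FROM multicards m2
LEFT JOIN ( SELECT multicard_uid, COUNT(*) AS count
            FROM tickets
            GROUP BY multicard_uid ) AS sub
  ON sub.multicard_uid = m2.uid
WHERE m.uid = m2.uid;

Note this still rewrites every row, including those that end up at 0,
so it trades the OR-condition's bad plan for extra write volume.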
On Wed, Feb 26, 2020 at 11:17 AM Guillaume Cottenceau <gc@mnc.ch> wrote:
> Dear all,
>
> I am facing a much, much slower query in production than on my
> development computer using a restored production backup, and I
> don't understand why, nor do I see what I could do to speed up the
> query in production :/
You've already seen that a VACUUM fixed it.

This must have been a logical restore (from pg_dump), not a physical restore (from pg_basebackup, for example), correct?
A physical restore should have resulted in a database in the same state of vacuuming as its source was. A logical restore will not, and since it is an append-only process it will not trigger autovacuum to run, either.
If you do a logical restore, probably the first thing you should do afterwards is a VACUUM ANALYZE.
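For example, from the shell (database name hypothetical; --jobs
parallelizes across tables and requires PostgreSQL 9.5 or later):

vacuumdb --analyze --jobs=4 mydb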
Cheers,
Jeff
On Wed, Feb 26, 2020 at 1:02 PM Guillaume Cottenceau <gc@mnc.ch> wrote:
> It is actually consistent with using a restored backup on the dev
> computer, as my understanding is this comes out without any
> garbage and like a perfectly vacuumed database.
I think I got that backwards in my previous email. It is the dev that was restored, not the prod? But unless you went out of your way to vacuum dev, it would not be perfectly vacuumed. If it were a logical restore, it would be perfectly unvacuumed, and if a physical restore, it would be in the same state of vacuuming as the database it was cloned from.
> Btw do you have any hint as to how to perform timings using
> production data which are consistent with production? Backup/restore
> is maybe not the way to go, but rather a block device level copy?
Block device copy seems like overkill; just using pg_basebackup should be good enough.
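For instance, something like this (host, user, and target directory
hypothetical; the user needs replication privileges, and -X stream
includes the WAL needed for a consistent copy):

pg_basebackup -h prod.example.com -U replicator -D /var/lib/postgresql/11/devclone -P -X stream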
> Since PostgreSQL 8, I have to say I rely entirely on autovacuum, and
> did not notice it could really run too infrequently for the workload
> and create such a difference. I see in the documentation a default
> autovacuum_vacuum_scale_factor = 0.2; is that something that is
> typically lowered globally, e.g. on a fairly active system? I am
> worried that changing that configuration for that table to 0.005
> would fix this query and similar ones, but later I might face the
> same situation on other tables. Or how would you select tables for a
> lowered setting?
The autovacuum system has never been redesigned with the needs of index-only scans in mind. If I have a table for which index-only scans are important, I'd set autovacuum_vacuum_scale_factor = 0 and set autovacuum_vacuum_threshold to about 5% of the number of blocks in the table. There is no syntax to say '5% of the number of blocks in the table', so you have to compute it yourself and hardcode the result, which makes it unsuitable for a global setting. And this still only addresses UPDATE and DELETE operations, not INSERTs. If you have an INSERT-only or INSERT-mostly table for which index-only scans are important, you might need to set up cron jobs to do vacuuming.
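A sketch of that computation (the hardcoded number below is an example
value, not from the thread; it must be recomputed as the table grows):

-- find the current block count and 5% of it
SELECT relpages, (relpages * 0.05)::int AS suggested_threshold
FROM pg_class
WHERE relname = 'tickets';

-- hardcode the result per table
ALTER TABLE tickets SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold    = 1000
);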
Cheers,
Jeff
Jeff Janes <jeff.janes 'at' gmail.com> writes:

> On Wed, Feb 26, 2020 at 1:02 PM Guillaume Cottenceau <gc@mnc.ch> wrote:
>
>> It is actually consistent with using a restored backup on the dev
>> computer, as my understanding is this comes out without any
>> garbage and like a perfectly vacuumed database.
>
> I think I got that backwards in my previous email. It is the dev
> that was restored, not the prod?

Yes (prod was also restored not so long ago though, when updating to
PostgreSQL 11.5).

> But unless you went out of your way to vacuum dev, it would not be
> perfectly vacuumed. If it were a logical restore, it would be
> perfectly unvacuumed, and if a physical restore, it would be in the
> same state of vacuuming as the database it was cloned from.
>
>> Btw do you have any hint as to how to perform timings using
>> production data which are consistent with production?
>> Backup/restore is maybe not the way to go, but rather a block
>> device level copy?
>
> Block device copy seems like overkill; just using pg_basebackup
> should be good enough.
>
>> Since PostgreSQL 8, I have to say I rely entirely on autovacuum,
>> and did not notice it could really run too infrequently for the
>> workload and create such a difference. I see in the documentation a
>> default autovacuum_vacuum_scale_factor = 0.2; is that something
>> that is typically lowered globally, e.g. on a fairly active system?
>> I am worried that changing that configuration for that table to
>> 0.005 would fix this query and similar ones, but later I might face
>> the same situation on other tables. Or how would you select tables
>> for a lowered setting?
>
> The autovacuum system has never been redesigned with the needs of
> index-only scans in mind. If I have a table for which index-only
> scans are important, I'd set autovacuum_vacuum_scale_factor = 0 and
> set autovacuum_vacuum_threshold to about 5% of the number of blocks
> in the table. There is no syntax to say '5% of the number of blocks
> in the table', so you have to compute it yourself and hardcode the
> result, which makes it unsuitable for a global setting. And this
> still only addresses UPDATE and DELETE operations, not INSERTs.

It seems also difficult for us, as this table grows over time (and is
trimmed only infrequently).

> If you have an INSERT-only or INSERT-mostly table for which
> index-only scans are important, you might need to set up cron jobs
> to do vacuuming.

Thanks!

-- 
Guillaume Cottenceau
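For the cron-job approach Jeff mentions, a minimal sketch (path,
schedule, and database name hypothetical):

# /etc/cron.d/pg-vacuum-tickets: nightly vacuum of the insert-mostly table
# m  h dom mon dow user     command
 30  3  *   *   *  postgres vacuumdb --analyze --table=tickets mydb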