Thread: does the query planner consider work_mem?
hi all - i'm having a bit of trouble with some queries that are running painfully slowly after migrating my database from one machine running PostgreSQL 8.2 to another machine running PostgreSQL 8.4.

as far as i can tell, the two *servers* (not the physical machines) are set up pretty close to identically, and as far as query planning is concerned, the only setting that seems to differ is 'default_statistics_target', which is 10 on the 8.2 server and 100 on the 8.4 server (the defaults for each version)... and presumably this should be giving the 8.4 server more data to work with for better query plans (AFAIK). (other settings, e.g. the cost estimates for page/row/etc. access, are identical between the servers.) i VACUUMed and ANALYZEd both databases (and all tables) prior to running any of the examples below.

i've narrowed some of the queries down to a few simpler test cases and found one example with a huge difference. here's a basic query reproducing the results:

explain analyze select lt.l_id from lt where lt.t_name in (select t_name from lt x where x.l_id = 91032370) group by lt.l_id;

[YES: i know i can use a join here. i extracted this from a much larger query that originally had 7 joins, and to help the query planner out a bit i managed to isolate one nested query that makes a large performance-increasing difference that the query planner was ignoring... for the purposes of exploring this behavior, though, this serves as a nice example, as it seems to force a nested query.]
here's the analysis on the 8.2 machine:

 HashAggregate  (cost=23240.77..23297.35 rows=5658 width=4) (actual time=5515.065..5600.372 rows=279785 loops=1)
   ->  Nested Loop  (cost=3461.24..23226.63 rows=5658 width=4) (actual time=134.604..5276.014 rows=304248 loops=1)
         ->  HashAggregate  (cost=3288.18..3288.19 rows=1 width=13) (actual time=0.097..0.120 rows=11 loops=1)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..3286.06 rows=851 width=13) (actual time=0.069..0.084 rows=11 loops=1)
                     Index Cond: (listing_id = 91032370)
         ->  Bitmap Heap Scan on listing__tag lt  (cost=173.06..19867.70 rows=5659 width=17) (actual time=115.275..474.135 rows=27659 loops=11)
               Recheck Cond: (lt.tag_name = x.tag_name)
               ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..171.64 rows=5659 width=0) (actual time=113.595..113.595 rows=27659 loops=11)
                     Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 5615.036 ms

same query on the 8.4 machine, trial 1 (almost exactly the same data... just (very) slightly fewer records):

 Group  (cost=782153.70..807638.31 rows=516717 width=4) (actual time=184264.479..184434.087 rows=275876 loops=1)
   ->  Sort  (cost=782153.70..794896.00 rows=5096921 width=4) (actual time=184264.476..184353.314 rows=299992 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4096kB
         ->  Nested Loop  (cost=306.17..5271.26 rows=5096921 width=4) (actual time=126.267..183408.035 rows=299992 loops=1)
               ->  HashAggregate  (cost=270.42..270.43 rows=1 width=10) (actual time=57.744..57.771 rows=11 loops=1)
                     ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..270.25 rows=68 width=10) (actual time=57.728..57.731 rows=11 loops=1)
                           Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.75..4983.92 rows=1353 width=14) (actual time=59.723..16652.856 rows=27272 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.41 rows=1353 width=0) (actual time=58.036..58.036 rows=27272 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 184455.567 ms

~6 seconds vs ~3 minutes! so i obviously first noticed that the Nested Loop was taking up all the time... even though that part of the query plan is identical between the two servers. another noticeable difference is that the row estimates are better in the 8.4 query (estimated around 500,000 vs 5,000, with the actual number of rows around 300,000). the only structural difference between the two plans is the choice of Group -> Sort vs HashAggregate, but both of those operate on the Nested Loop results.

despite seeing all the time being spent in the Nested Loop, i also noticed that the sort was happening on disk... so i upped work_mem from the 1 MB default to 8 MB. i then re-ran the 8.4 query and got these times (for the same query):

 Group  (cost=642783.70..668268.31 rows=516717 width=4) (actual time=1946.838..2102.696 rows=275876 loops=1)
   ->  Sort  (cost=642783.70..655526.00 rows=5096921 width=4) (actual time=1946.835..2023.099 rows=299992 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4096kB
         ->  Nested Loop  (cost=306.17..5271.26 rows=5096921 width=4) (actual time=4.336..1518.962 rows=299992 loops=1)
               ->  HashAggregate  (cost=270.42..270.43 rows=1 width=10) (actual time=0.069..0.089 rows=11 loops=1)
                     ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..270.25 rows=68 width=10) (actual time=0.052..0.058 rows=11 loops=1)
                           Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.75..4983.92 rows=1353 width=14) (actual time=11.585..130.841 rows=27272 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.41 rows=1353 width=0) (actual time=6.588..6.588 rows=27272 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 2123.992 ms

2 seconds! great! kind of. this really just raised a whole series of questions that are going to bother me for my future query-writing purposes:
1) obvious basic question: why, after raising work_mem to 8 MB, is the sort still happening on disk? or is it actually happening on disk, despite being reported as such?

2) the Nested Loop time dropped dramatically (basically all the time savings)... but i had no warning prior to this that the work_mem limit was being hit inside that Nested Loop. is there any way to see which steps require additional disk I/O for processing... similar to the "external merge" reported for the Sort Method?

3) here's the biggest problem/issue in my brain: work_mem on the 8.2 server was also set to the 1 MB default, yet the query ran quite speedily! the full migration will take a while, so there will still be query development/optimization happening on one system, and i'd love for those many hours of testing to be worth something when ported over to the other system. in this particular example, the Nested Loop seems to fit in the 1 MB work_mem space on the 8.2 server, but not on the 8.4 server? does this seem right to anybody?

4) i thought maybe i was reading the ANALYZE report incorrectly, and the true bottleneck in the first 8.4 query was indeed the on-disk sorting step (which would make sense), and that upping the work_mem limit would alleviate that (which would also make sense). but then why does the second 8.4 query still report a sort on disk, despite the giant speed boost?

5) and finally, if the sort really is happening on disk in the first 8.4 query, does the query planner consider work_mem as part of the planning process? i.e. will a slightly faster plan (in terms of data page reads) be downgraded if it looks like it'll require a sort that exceeds work_mem, when a slightly slower plan that's expected to fit in work_mem is available?

any insights here?

cheers!

-murat
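(for anyone reproducing this: the work_mem experiment can be done per-session, without touching postgresql.conf or restarting the server. this sketch reuses the table name and id from the plans above, so treat the identifiers as illustrative:)

```sql
-- check the current value, then raise it for this session only
SHOW work_mem;          -- 1MB default in 8.4
SET work_mem = '8MB';   -- session-local change

-- re-run the query and check the "Sort Method" line in the output:
--   "Sort Method:  external merge  Disk: ..." -> the sort spilled to disk
--   "Sort Method:  quicksort  Memory: ..."    -> the sort fit in work_mem
EXPLAIN ANALYZE
SELECT lt.listing_id
FROM listing__tag lt
WHERE lt.tag_name IN (SELECT x.tag_name
                      FROM listing__tag x
                      WHERE x.listing_id = 91032370)
GROUP BY lt.listing_id;

RESET work_mem;         -- back to the configured default
```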
On Wed, May 30, 2012 at 8:57 PM, Murat Tasan <mmuurr@gmail.com> wrote:
> any insights here?

Have you tried running the slow option multiple times? According to the explain output all of the time was accounted to the bitmap heap scan. For the second explain plan the same node was fast. It looks to me as though the first explain on 8.4 was slow because the data was still on disk. Raising work_mem doubled the speed of the sort from 800ms to 400ms.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Ants -- you're on the right track: i tried your suggestion and found that on subsequent executions the performance sometimes drops to about 6 seconds, though periodically it jumps back up to about 3 minutes, and there's no other DB server activity during these times (i.e. i'm the only one connected). i should note, however, that the 8.2 version has not once been slow with this query.

so it may be a cache issue, though i have other queries that are also slow in the 8.4 version and continue to be slow no matter how many times i re-run them. in most cases (so far) i'm able to re-write them to be faster on 8.4, but the fast formulation is different from the 8.2 one (i.e. if i take the 8.2 version and run it on 8.4 it's slow, and vice-versa). this means i need to maintain/test two versions of each query during the migration, which is a nightmare for me.

(BTW -- forgot to mention this earlier: the listing__tag table in these examples has ~35 million rows, and there's an index on each column.)
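(one way to check the cache-effect theory on 8.4, where EXPLAIN has no BUFFERS option -- that arrived in 9.0: snapshot the buffer-cache counters for the table before and after a run. note these counters are cluster-wide and cumulative, so this is only indicative on an otherwise idle server:)

```sql
-- heap_blks_read counts blocks fetched from disk (or the OS cache);
-- heap_blks_hit counts blocks found already in shared_buffers.
-- a large jump in the *_read columns between two snapshots means
-- that run had to do real I/O rather than hitting cached data.
SELECT heap_blks_read, heap_blks_hit,
       idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
WHERE relname = 'listing__tag';
```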
as an example: i re-wrote the query to use the full join version, and on the 8.4 version (after a fresh restart of the server) the plan was the same as before:

explain analyze select lt.listing_id from listing__tag lt, listing__tag x where lt.tag_name = x.tag_name and x.listing_id = 91032370 group by lt.listing_id;

 Group  (cost=485411.21..490831.04 rows=488868 width=4) (actual time=5474.662..5636.718 rows=272166 loops=1)
   ->  Sort  (cost=485411.21..488121.13 rows=1083967 width=4) (actual time=5474.658..5560.040 rows=295990 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4048kB
         ->  Nested Loop  (cost=35.44..347109.96 rows=1083967 width=4) (actual time=3.908..5090.687 rows=295990 loops=1)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..283.44 rows=71 width=10) (actual time=0.050..0.086 rows=11 loops=1)
                     Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.44..4868.36 rows=1322 width=14) (actual time=8.664..456.087 rows=26908 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.11 rows=1322 width=0) (actual time=7.065..7.065 rows=26908 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 5656.900 ms

this top-level join query on the 8.2 machine, despite there being only a single join, performs abysmally (which is why i had to coerce the 8.2 query planner into the correct nesting in the first place):

 Group  (cost=4172480.61..4232829.15 rows=37744 width=4)
   ->  Sort  (cost=4172480.61..4202654.88 rows=12069709 width=4)
         Sort Key: lt.listing_id
         ->  Hash Join  (cost=1312642.10..1927697.87 rows=12069709 width=4)
               Hash Cond: (x.tag_name = lt.tag_name)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..3682.79 rows=951 width=13)
                     Index Cond: (listing_id = 91032370)
               ->  Hash  (cost=613609.60..613609.60 rows=36151960 width=17)
                     ->  Seq Scan on listing__tag lt  (cost=0.00..613609.60 rows=36151960 width=17)

(only EXPLAIN here on this query, as i stopped the first attempt to EXPLAIN ANALYZE after about 15 minutes :-/ )

cheers,

-m

p.s. on the 8.4 version EXPLAIN ANALYZE *still* tells me that an external merge on disk is happening, despite my setting work_mem to a full 16 MB this time. does anyone know how to resolve this? or should i even worry about it?

On Wed, May 30, 2012 at 2:25 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Wed, May 30, 2012 at 8:57 PM, Murat Tasan <mmuurr@gmail.com> wrote:
>> any insights here?
>
> Have you tried running the slow option multiple times? According to
> the explain output all of the time was accounted to the bitmap heap
> scan. For the second explain plan the same node was fast. It looks to
> me as the first explain on 8.4 was slow because the data was still on
> disk. Raising work mem doubled the speed of the sort from 800ms to
> 400ms.
>
> Regards,
> Ants Aasma
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de
On 31/05/12 05:57, Murat Tasan wrote:
> hi all - i'm having a bit of trouble with some queries that are
> running painfully slowly after migrating my database from one machine
> using PostgreSQL 8.2 to another machine with PostgreSQL 8.4.
> as far as i can tell, the two *servers* (not the physical machines)
> are set up pretty close to identically, and as far as query planning
> is concerned, the only setting that seems to be different is
> 'default_statistics_target', which is 10 on the 8.2 server and 100 on
> the 8.4 server (the defaults)... and presumably this should be giving
> the 8.4 server more data to work with for better query plans (AFAIK).
> (other settings, e.g. cost estimates for page/row/etc access are
> identical between the servers.)

It would probably be useful to know what release of 8.4 you have - i.e. 8.4.x. There were some significant planner changes at 8.4.9 or thereabouts. I think it would also be useful to know all of your non-default parameters for 8.4 (SELECT name, setting FROM pg_settings WHERE source != 'default').

> 3) here's the biggest problem/issue in my brain: work_mem on the 8.2
> server was also set to the 1 MB default! but ran quite speedily!
> the full migration will take a while, so there will still be query
> development/optimization on one system, and i'd love for those many
> hours testing to be worth something when ported over to the other
> system.
> in this particular example, the Nested Loop seems to fit in the 1 MB
> work_mem space on the 8.2 server, but not the 8.4? does this seem
> right to anybody?

Well, 8.4 has 100 stats buckets to get distribution info, so it typically has a better idea about things; however, sometimes more info is just enough to tip the planner into believing that it needs more space to do something. The other possibility is that the 8.2 box is 32-bit and the 8.4 one is 64-bit, and the latter really does need more memory to hold the loop data structures.

Regards
Mark
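(Mark's suggested catalog query, spelled out as a paste-ready snippet; the version() call is an addition here, for checking the exact 8.4.x release he mentions:)

```sql
-- every setting that differs from its built-in default
SELECT name, setting
FROM pg_settings
WHERE source != 'default';

-- the exact server release, e.g. to see whether it predates
-- the ~8.4.9 planner changes
SELECT version();
```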
Murat Tasan <mmuurr@gmail.com> writes: > p.s. on the 8.4 version EXPLAIN ANALYZE *still* tells me that an > external merge on disk is happening, despite my setting of work_mem to > a full 16 MB this time. When we have to push sort data to disk, it's written in a form that's noticeably more compact than what's used in-memory. So it can be expected that an in-memory sort will need significantly more memory than what an external merge reports using on-disk. I'd try cranking up work_mem to at least 10X the on-disk size if you want to be sure of seeing an in-memory sort. However, so far as I can see the sort is taking much less time than the scan and join steps anyway, so you're not likely to get much improvement this way. The unstable performance is a result of caching effects for the table data, not of how the sort is done. regards, tom lane
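(applying tom's 10X rule of thumb to the numbers in this thread: the plans report roughly 4 MB spilled to disk, so something north of 40 MB should allow an in-memory sort; the exact value below is illustrative:)

```sql
-- the plans above show "Sort Method:  external merge  Disk: 4096kB";
-- per the 10X guideline, allow an order of magnitude more memory
-- than the reported on-disk size
SET work_mem = '40MB';

-- re-running the EXPLAIN ANALYZE should now report
--   Sort Method:  quicksort  Memory: ...
-- if the sort really fits in memory

RESET work_mem;
```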