Thread: does the query planner consider work_mem?

From: Murat Tasan
hi all - i'm having a bit of trouble with some queries that are
running painfully slowly after migrating my database from one machine
using PostgreSQL 8.2 to another machine with PostgreSQL 8.4.
as far as i can tell, the two *servers* (not the physical machines)
are set up pretty close to identically, and as far as query planning
is concerned, the only setting that seems to be different is
'default_statistics_target', which is 10 on the 8.2 server and 100 on
the 8.4 server (the defaults)... and presumably this should be giving
the 8.4 server more data to work with for better query plans (AFAIK).
(other settings, e.g. cost estimates for page/row/etc access are
identical between the servers.)
i VACUUMed and ANALYZEd on both databases (and all tables) prior to
running any of the examples below.
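(for reference, the maintenance pass on each database was along these lines:)

```sql
-- run in each database before the timing tests (sketch);
-- a plain VACUUM ANALYZE covers every table in the current database
VACUUM ANALYZE;
```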

i've narrowed some of the queries down to a few simpler test cases and
found one example of a huge difference.
here's a basic query reproducing the results:

explain analyze select lt.listing_id from listing__tag lt where
lt.tag_name in (select tag_name from listing__tag x where
x.listing_id = 91032370) group by lt.listing_id;

[YES: i know i can use a join here. i extracted this from a much
larger query that originally had 7 joins, and to help the query
planner out a bit i managed to isolate one nested query that makes a
large performance-increasing difference that the query planner was
ignoring... and for the purposes of exploring this behavior, though,
this serves as a nice example as it seems to be enforcing a nested
query.]

here's the analysis on the 8.2 machine:

HashAggregate  (cost=23240.77..23297.35 rows=5658 width=4) (actual time=5515.065..5600.372 rows=279785 loops=1)
   ->  Nested Loop  (cost=3461.24..23226.63 rows=5658 width=4) (actual time=134.604..5276.014 rows=304248 loops=1)
         ->  HashAggregate  (cost=3288.18..3288.19 rows=1 width=13) (actual time=0.097..0.120 rows=11 loops=1)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..3286.06 rows=851 width=13) (actual time=0.069..0.084 rows=11 loops=1)
                     Index Cond: (listing_id = 91032370)
         ->  Bitmap Heap Scan on listing__tag lt  (cost=173.06..19867.70 rows=5659 width=17) (actual time=115.275..474.135 rows=27659 loops=11)
               Recheck Cond: (lt.tag_name = x.tag_name)
               ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..171.64 rows=5659 width=0) (actual time=113.595..113.595 rows=27659 loops=11)
                     Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 5615.036 ms

same query on the 8.4 machine, trial 1 (almost exactly the same
data... just (very) slightly fewer records):

Group  (cost=782153.70..807638.31 rows=516717 width=4) (actual time=184264.479..184434.087 rows=275876 loops=1)
   ->  Sort  (cost=782153.70..794896.00 rows=5096921 width=4) (actual time=184264.476..184353.314 rows=299992 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4096kB
         ->  Nested Loop  (cost=306.17..5271.26 rows=5096921 width=4) (actual time=126.267..183408.035 rows=299992 loops=1)
               ->  HashAggregate  (cost=270.42..270.43 rows=1 width=10) (actual time=57.744..57.771 rows=11 loops=1)
                     ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..270.25 rows=68 width=10) (actual time=57.728..57.731 rows=11 loops=1)
                           Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.75..4983.92 rows=1353 width=14) (actual time=59.723..16652.856 rows=27272 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.41 rows=1353 width=0) (actual time=58.036..58.036 rows=27272 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 184455.567 ms

6 seconds vs 3 minutes!
so i obviously first noticed that the nested loop was taking up all
the time... though that part of the plan is identical between the two
servers.
another noticeable difference is that the estimates on number of rows
are better in the 8.4 query (estimated around 500,000 vs 5,000, and
actual number of rows was around 300,000).
the only difference in query plans is the choice to Group->Sort vs
HashAggregate, but both of those operate on the Nested Loop results.
despite seeing all the time being spent on the Nested Loop, i also
noticed that it was sorting on disk... so i upped the work_mem from
the 1 MB default to 8 MB.
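(the setting change itself was just a session-level SET, something like:)

```sql
-- session-level bump; a permanent change would go in postgresql.conf instead
SET work_mem = '8MB';
SHOW work_mem;  -- confirm the active value
```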
i then re-ran the 8.4 query, and got these times (for the same query):

Group  (cost=642783.70..668268.31 rows=516717 width=4) (actual time=1946.838..2102.696 rows=275876 loops=1)
   ->  Sort  (cost=642783.70..655526.00 rows=5096921 width=4) (actual time=1946.835..2023.099 rows=299992 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4096kB
         ->  Nested Loop  (cost=306.17..5271.26 rows=5096921 width=4) (actual time=4.336..1518.962 rows=299992 loops=1)
               ->  HashAggregate  (cost=270.42..270.43 rows=1 width=10) (actual time=0.069..0.089 rows=11 loops=1)
                     ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..270.25 rows=68 width=10) (actual time=0.052..0.058 rows=11 loops=1)
                           Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.75..4983.92 rows=1353 width=14) (actual time=11.585..130.841 rows=27272 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.41 rows=1353 width=0) (actual time=6.588..6.588 rows=27272 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 2123.992 ms

2 seconds!
great! kind of.
this really just raised a whole series of questions that are going
to bother me in my future query writing.

1) obvious basic question: why, after raising work_mem to 8MB, is the
sort still happening on disk? or is it not actually happening on disk,
despite being reported as such?

2) the Nested Loop time dropped dramatically (basically all the time
savings)... but i had no warning prior to this that the work_mem limit
was being hit inside that Nested Loop.
is there any way to see which steps are requiring additional disk IO
for processing... similar to the reported "external merge" for the
Sort Method?

3) here's the biggest problem/issue in my brain: work_mem on the 8.2
server was also set to the 1 MB default! but ran quite speedily!
the full migration will take a while, so there will still be query
development/optimization on one system, and i'd love for those many
hours testing to be worth something when ported over to the other
system.
in this particular example, the Nested Loop seems to fit in the 1 MB
work_mem space on the 8.2 server, but not the 8.4? does this seem
right to anybody?

4) i thought, maybe i'm reading the ANALYZE report incorrectly, and
the true bottleneck in the first 8.4 query was indeed the on-disk
sorting step (which would make sense). and upping the work_mem limit
would alleviate that (which would also make sense).
but then why is the second 8.4 query still reporting a sort on disk,
despite the giant speed boost?

5) and finally, if indeed the sort is happening on disk in the first
8.4 query, will the query planner consider work_mem as part of the
planning process?
i.e. will a slightly faster plan (in terms of data page reads) be
downgraded if it looks like it'll require a sort that exceeds work_mem,
when a slightly slower plan that's expected to fit in work_mem is
available?

any insights here?

cheers!

-murat

Re: does the query planner consider work_mem?

From: Ants Aasma
On Wed, May 30, 2012 at 8:57 PM, Murat Tasan <mmuurr@gmail.com> wrote:
> any insights here?

Have you tried running the slow query multiple times? According to
the explain output, all of the time was accounted to the bitmap heap
scan. For the second explain plan the same node was fast. It looks to
me like the first explain on 8.4 was slow because the data was still
on disk. Raising work_mem doubled the speed of the sort from 800ms to
400ms.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: does the query planner consider work_mem?

From: Murat Tasan
Ants -- you're on the right track: i tried your suggestion and found
that at times during subsequent executions the performance will drop
down to about 6 seconds.
though periodically it jumps back up to about 3 minutes, and there's
no other DB server usage during these times (i.e. i'm the only one
connected).
i should note, however, that the 8.2 version has not once been slow
with this query.

so it may be a cache issue, though i have other queries that are also
slow in the 8.4 version and continue to be slow, no matter how many
times i re-run them.
in most cases (so far) i'm able to re-write them to be faster in the
8.4 version, but it's a different formulation than the 8.2 version.
(i.e. if i take the 8.2 version and run it on 8.4 it's slow and
vice-versa).
this means i need to maintain/test two versions of each query during
the migration, which is a nightmare for me.
(BTW -- forgot to mention this, but the listing__tag table in the
examples has ~35 million rows, and there's an index on each column.)
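(for anyone following along, the table shape implied by the plans is
roughly the following; the names are taken from the plans, but the
column types are my assumption:)

```sql
-- sketch of the table behind the examples; ~35 million rows in practice
CREATE TABLE listing__tag (
    listing_id integer NOT NULL,  -- type assumed
    tag_name   text    NOT NULL   -- type assumed
);
CREATE INDEX listing__tag___listing_id ON listing__tag (listing_id);
CREATE INDEX listing__tag___tag_name   ON listing__tag (tag_name);
```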

as an example: i re-wrote the query to use the full join version, and
on the 8.4 version (after a fresh restart of the server) the plan was
the same as before:

explain analyze select lt.listing_id from listing__tag lt,
listing__tag x where lt.tag_name = x.tag_name
and x.listing_id = 91032370 group by lt.listing_id;

Group  (cost=485411.21..490831.04 rows=488868 width=4) (actual time=5474.662..5636.718 rows=272166 loops=1)
   ->  Sort  (cost=485411.21..488121.13 rows=1083967 width=4) (actual time=5474.658..5560.040 rows=295990 loops=1)
         Sort Key: lt.listing_id
         Sort Method:  external merge  Disk: 4048kB
         ->  Nested Loop  (cost=35.44..347109.96 rows=1083967 width=4) (actual time=3.908..5090.687 rows=295990 loops=1)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..283.44 rows=71 width=10) (actual time=0.050..0.086 rows=11 loops=1)
                     Index Cond: (listing_id = 91032370)
               ->  Bitmap Heap Scan on listing__tag lt  (cost=35.44..4868.36 rows=1322 width=14) (actual time=8.664..456.087 rows=26908 loops=11)
                     Recheck Cond: (lt.tag_name = x.tag_name)
                     ->  Bitmap Index Scan on listing__tag___tag_name  (cost=0.00..35.11 rows=1322 width=0) (actual time=7.065..7.065 rows=26908 loops=11)
                           Index Cond: (lt.tag_name = x.tag_name)
 Total runtime: 5656.900 ms

this top-level join query on the 8.2 machine, despite there being only
a single join, performs abysmally (which is why i had to coerce the
8.2 query planner to do the correct nesting in the first place):

Group  (cost=4172480.61..4232829.15 rows=37744 width=4)
   ->  Sort  (cost=4172480.61..4202654.88 rows=12069709 width=4)
         Sort Key: lt.listing_id
         ->  Hash Join  (cost=1312642.10..1927697.87 rows=12069709 width=4)
               Hash Cond: (x.tag_name = lt.tag_name)
               ->  Index Scan using listing__tag___listing_id on listing__tag x  (cost=0.00..3682.79 rows=951 width=13)
                     Index Cond: (listing_id = 91032370)
               ->  Hash  (cost=613609.60..613609.60 rows=36151960 width=17)
                     ->  Seq Scan on listing__tag lt  (cost=0.00..613609.60 rows=36151960 width=17)

(only EXPLAIN here on this query as i stopped the first attempt to
EXPLAIN ANALYZE after about 15 minutes :-/ )

cheers,

-m

p.s. on the 8.4 version EXPLAIN ANALYZE *still* tells me that an
external merge on disk is happening, despite my setting of work_mem to
a full 16 MB this time.
does anyone know how to resolve this? or should i even worry about it?


On Wed, May 30, 2012 at 2:25 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Wed, May 30, 2012 at 8:57 PM, Murat Tasan <mmuurr@gmail.com> wrote:
>> any insights here?

Re: does the query planner consider work_mem?

From: Mark Kirkwood
On 31/05/12 05:57, Murat Tasan wrote:
> hi all - i'm having a bit of trouble with some queries that are
> running painfully slowly after migrating my database from one machine
> using PostgreSQL 8.2 to another machine with PostgreSQL 8.4.
> as far as i can tell, the two *servers* (not the physical machines)
> are set up pretty close to identically, and as far as query planning
> is concerned, the only setting that seems to be different is
> 'default_statistics_target', which is 10 on the 8.2 server and 100 on
> the 8.4 server (the defaults)... and presumably this should be giving
> the 8.4 server more data to work with for better query plans (AFAIK).
> (other settings, e.g. cost estimates for page/row/etc access are
> identical between the servers.)

It would probably be useful to know what release of 8.4 you have, i.e.
8.4.x. There were some significant planner changes at 8.4.9 or thereabouts.

I think it would also be useful to know all of your non default
parameters for 8.4 (SELECT name,setting FROM pg_settings WHERE source !=
'default').
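Spelled out as a standalone statement, that check is:

```sql
-- list every parameter whose value differs from the built-in default
SELECT name, setting
FROM pg_settings
WHERE source <> 'default';
```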

> 3) here's the biggest problem/issue in my brain: work_mem on the 8.2
> server was also set to the 1 MB default! but ran quite speedily!
> the full migration will take a while, so there will still be query
> development/optimization on one system, and i'd love for those many
> hours testing to be worth something when ported over to the other
> system.
> in this particular example, the Nested Loop seems to fit in the 1 MB
> work_mem space on the 8.2 server, but not the 8.4? does this seem
> right to anybody?
>

Well, 8.4 has 100 stats buckets to get distribution info, so it
typically has a better idea about things; however, sometimes more info
is just enough to tip the planner into believing that it needs more
space to do something. The other possibility is that the 8.2 box is
32-bit and the 8.4 one is 64-bit, and really does need more memory to
hold the loop data structures.

Regards

Mark

Re: does the query planner consider work_mem?

From: Tom Lane
Murat Tasan <mmuurr@gmail.com> writes:
> p.s. on the 8.4 version EXPLAIN ANALYZE *still* tells me that an
> external merge on disk is happening, despite my setting of work_mem to
> a full 16 MB this time.

When we have to push sort data to disk, it's written in a form that's
noticeably more compact than what's used in-memory.  So it can be
expected that an in-memory sort will need significantly more memory
than what an external merge reports using on-disk.  I'd try cranking
up work_mem to at least 10X the on-disk size if you want to be sure
of seeing an in-memory sort.
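(For the ~4MB on-disk sorts shown earlier in the thread, that rule of
thumb works out to something like the following; the exact figure is
illustrative:)

```sql
-- well over 10x the ~4MB the external merge reported writing to disk
SET work_mem = '64MB';
```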

However, so far as I can see the sort is taking much less time than
the scan and join steps anyway, so you're not likely to get much
improvement this way.  The unstable performance is a result of caching
effects for the table data, not of how the sort is done.

            regards, tom lane