Thread: slow bitmap heap scans on pg 9.2

slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3)
is picking a plan involving a bitmap heap scan that turns out to be much
slower than a nested-loop plan using indexes.

The planner picks the hashjoin plan by default (see attached files)

Bitmap Heap Scan on public.table_b_2 b  (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
  Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
  Rows Removed by Index Recheck: 313195667
  Filter: (b.product_id = 2)

This is the part that seems to be causing the problem, or at least the
part taking most of the time (other than the final aggregation).

If I set enable_hashjoin=false and enable_mergejoin=false I get the
nested-loop join plan.
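
A quick way to reproduce that comparison in one session, with the overrides scoped to a single transaction (the SELECT here is just a stand-in for the actual query), might be:

    BEGIN;
    SET LOCAL enable_hashjoin = off;   -- rule out hash joins for this transaction only
    SET LOCAL enable_mergejoin = off;  -- rule out merge joins as well
    EXPLAIN ANALYZE SELECT ...;        -- the slow query goes here
    ROLLBACK;                          -- the settings revert automatically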

table_b is 137 GB, plus indexes; each one is around 43 GB
table_a is 20 GB

random_page_cost = 2.0
effective_cache_size = 3500MB
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
work_mem = 64MB
shared_buffers = 300MB  (for this output, I've also had it at 2GB)

If I bump cpu_tuple_cost into the 10-20 range it will pick the nested-loop
join for some date ranges but not all. A cpu_tuple_cost of 20 doesn't
sound like a sane value.

This database used to run 8.3, where it picked the nested-loop join. We
used pg_upgrade to migrate to 9.2.

Any ideas why the bitmap heap scan is much slower than the planner expects?

Steve

Attachment

Re: slow bitmap heap scans on pg 9.2

From
"ktm@rice.edu"
Date:
On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
> I'm encountering an issue where PG 9.2.4 (we also see this with
> 9.2.3) is picking a plan involving a bitmap heap scan that turns out
> to be much slower than a nested-loop plan using indexes.
>
> The planner picks the hashjoin plan by default (see attached files)
>
> Bitmap Heap Scan on public.table_b_2 b  (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
>   Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
>   Rows Removed by Index Recheck: 313195667
>   Filter: (b.product_id = 2)
>
> This is the part that seems to be causing the problem, or at least
> the part taking most of the time (other than the final aggregation).
>
> If I set enable_hashjoin=false and enable_mergejoin=false I get the
> nested-loop join plan.
>
> table_b is 137 GB, plus indexes; each one is around 43 GB
> table_a is 20 GB
>
> random_page_cost = 2.0
> effective_cache_size = 3500MB
> cpu_tuple_cost = 0.01
> cpu_index_tuple_cost = 0.005
> cpu_operator_cost = 0.0025
> work_mem = 64MB
> shared_buffers = 300MB  (for this output, I've also had it at 2GB)
>
> If I bump cpu_tuple_cost into the 10-20 range it will pick the nested-loop
> join for some date ranges but not all. A cpu_tuple_cost of 20
> doesn't sound like a sane value.
>
> This database used to run 8.3, where it picked the nested-loop join.
> We used pg_upgrade to migrate to 9.2.
>
> Any ideas why the bitmap heap scan is much slower than the planner expects?
>
> Steve

Hi Steve,

The one thing that stands out to me is that you are working with 200GB of
data on a machine with 4-8GB of ram and you have the random_page_cost set
to 2.0. That is almost completely uncached and I would expect a value of
10 or more to be closer to reality.
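
A cheap session-level experiment along those lines (10 is just the suggested starting point, and the SELECT stands in for the slow query) might be:

    SET random_page_cost = 10;   -- model mostly-uncached, disk-bound random I/O
    EXPLAIN ANALYZE SELECT ...;  -- re-run the query and see which plan gets chosen
    RESET random_page_cost;      -- back to the server default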

Regards,
Ken


Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-10 09:56 AM, ktm@rice.edu wrote:
> On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:

>
> Hi Steve,
>
> The one thing that stands out to me is that you are working with 200GB of
> data on a machine with 4-8GB of ram and you have the random_page_cost set
> to 2.0. That is almost completely uncached and I would expect a value of
> 10 or more to be closer to reality.

Setting random_page_cost to 15 makes the planner choose the nested-loop
plan (at least for the date range I tried).

I thought that the point of effective_cache_size was to tell the planner
how likely it is for a random page to be in cache. With 200GB of data
for this query and an effective_cache_size of 3.5GB I would have
expected that to be accounted for.



>
> Regards,
> Ken
>
>



Re: slow bitmap heap scans on pg 9.2

From
"ktm@rice.edu"
Date:
On Wed, Apr 10, 2013 at 11:56:32AM -0400, Steve Singer wrote:
> On 13-04-10 09:56 AM, ktm@rice.edu wrote:
> >On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
>
> >
> >Hi Steve,
> >
> >The one thing that stands out to me is that you are working with 200GB of
> >data on a machine with 4-8GB of ram and you have the random_page_cost set
> >to 2.0. That is almost completely uncached and I would expect a value of
> >10 or more to be closer to reality.
>
> Setting random_page_cost to 15 makes the planner choose the
> nested-loop plan (at least for the date range I tried).
>
> I thought that the point of effective_cache_size was to tell the
> planner how likely it is for a random page to be in cache.  With
> 200GB of data for this query and an effective_cache_size of 3.5GB I
> would have expected that to be accounted for.
>
For random_page_cost to be that low, the database would need to be
mostly cached. 3.5GB is almost 100X too small to do that unless your
query exhibits a large amount of locality of reference. Values for
random_page_cost between 10 and 20 are very reasonable for disk-bound
I/O scenarios.

Regards,
Ken


Re: slow bitmap heap scans on pg 9.2

From
Jeff Janes
Date:
On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3) is picking a plan involving a bitmap heap scan that turns out to be much slower than a nested-loop plan using indexes.

The planner picks the hashjoin plan by default (see attached files)

Bitmap Heap Scan on public.table_b_2 b  (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
  Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
  Rows Removed by Index Recheck: 313195667
  Filter: (b.product_id = 2)

I think the index recheck means your bitmap is overflowing (i.e. needing more space than work_mem) and so keeping only the pages which have at least one match, which means all rows in those pages need to be rechecked.  How many rows does the table have?  You might be essentially doing a seq scan, but with the additional overhead of the bitmap machinery.  Could you do "explain (analyze,buffers)", preferably with track_io_timing set to on?
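
For concreteness, one way to gather that, and to test the overflow theory at the same time, might look like the following (the SELECT is a placeholder, and 512MB is only a guess at a work_mem large enough to keep the bitmap exact):

    SET track_io_timing = on;               -- include I/O wait times in EXPLAIN output
    SET work_mem = '512MB';                 -- a larger bitmap may stay exact (non-lossy)
    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- if "Rows Removed by Index Recheck"
                                            -- drops toward zero, the bitmap was lossy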

 Cheers,

Jeff

Re: slow bitmap heap scans on pg 9.2

From
Jeff Janes
Date:
On Wed, Apr 10, 2013 at 8:56 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
On 13-04-10 09:56 AM, ktm@rice.edu wrote:
On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:


Hi Steve,

The one thing that stands out to me is that you are working with 200GB of
data on a machine with 4-8GB of ram and you have the random_page_cost set
to 2.0. That is almost completely uncached and I would expect a value of
10 or more to be closer to reality.

Setting random_page_cost to 15 makes the planner choose the nested-loop plan (at least for the date range I tried).

I thought that the point of effective_cache_size was to tell the planner how likely it is for a random page to be in cache.


e_c_s tells it how likely the page is to still be in cache the second (and subsequent) times it is visited during the *same query*.  It doesn't tell it how likely the page is to be in cache the first time it is needed in a given query.  (Also, e_c_s is irrelevant for bitmap scans, as they inherently hit every block only once.)


Also, it doesn't tell the planner how expensive it is to bring a page into cache when it is needed; that is what random_page_cost is for.  If you tell it that those fetches are going to be cheap, then it doesn't matter so much how many of them it is going to have to do.

Cheers,

Jeff

Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-10 02:06 PM, Jeff Janes wrote:
> On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>

> I think the index recheck means your bitmap is overflowing (i.e. needing
> more space than work_mem) and so keeping only the pages which have at
> least one match, which means all rows in those pages need to be
> rechecked.  How many rows does the table have?  You might be essentially
> doing a seq scan, but with the additional overhead of the bitmap
> machinery.  Could you do "explain (analyze,buffers)", preferably with
> track_io_timing set to on?
>

table_b has 1,530,710,469 rows

Attached is the output with track_io_timing and buffers.





>   Cheers,
>
> Jeff


Attachment

Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-10 07:54 PM, Steve Singer wrote:
> On 13-04-10 02:06 PM, Jeff Janes wrote:
>> On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>>
>
>> I think the index recheck means your bitmap is overflowing (i.e. needing
>> more space than work_mem) and so keeping only the pages which have at
>> least one match, which means all rows in those pages need to be
>> rechecked.  How many rows does the table have?  You might be essentially
>> doing a seq scan, but with the additional overhead of the bitmap
>> machinery.  Could you do "explain (analyze,buffers)", preferably with
>> track_io_timing set to on?
>>
>
> table_b has 1,530,710,469 rows
>
> Attached is the output with track_io_timing and buffers.
>

I've done some more testing with random_page_cost=20.

This gives me the nested-loop plan for the various date ranges I've tried.

However, table_a_2 and table_b_2 are actually partitions, and this
query only needs to look at a single partition. When I run the same
query against a different partition (a smaller one, but still bigger
than cache) it picks a hash-join plan involving a seq scan of table_b
but no bitmap index scan. On this partition the hash-join plans tend
to take 15 minutes versus 2 minutes when I disable hash-join plans.
Bumping random_page_cost higher doesn't fix this.

I think the reason it is picking the hash-join based plans is this:

Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b  (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
  Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
  Filter: (product_id = 1)

I think we are over-estimating the cost of the index scans in the inner
loop.  This seems similar to what was discussed a few months ago:
http://www.postgresql.org/message-id/092a01cdd230$ff6143c0$fe23cb40$@foo.me.uk

This version of PG should have 3e9960e9d935e7e applied.  I am trying to
get the database copied to a machine where I can easily switch PG
versions and test this against something prior to that commit and also
against a 9.3 build.

Steve


>
>
>>   Cheers,
>>
>> Jeff
>


Attachment

Re: slow bitmap heap scans on pg 9.2

From
Jeff Janes
Date:
On Thursday, April 11, 2013, Steve Singer wrote:

I think the reason it is picking the hash-join based plans is this:

Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
                    Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
                    Filter: (product_id = 1)


Trying to reason about how the planner estimates costs for the inner side of nested loops makes my head hurt.  
So before doing that, could you run explain (analyze,buffers) on both of these much simpler (but hopefully morally equivalent to this planner node) SQL statements:

select * from table_b_1 where a_id = <some plausible value> and organization_id=2 and year=2013 and month=3

select * from table_b_1 where a_id = <some plausible value> and organization_id=2 and year=2013 and month=3 and product_id=1


Of particular interest here is whether the estimate of 1 row is due to the specificity of the filter, or if the index clauses alone are specific enough to drive that estimate.  (If you get many rows without the product_id filter, that would explain the high estimate.)

Please run with the default cost parameters, or if you can't get the right plan with the defaults, specify what the used parameters were.

Cheers,

Jeff

Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-12 09:20 PM, Jeff Janes wrote:
> On Thursday, April 11, 2013, Steve Singer wrote:
>
>
>     I think the reason it is picking the hash-join based plans is
>     this:
>
>     Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b  (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
>       Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
>       Filter: (product_id = 1)
>
>
>
> Trying to reason about how the planner estimates costs for the inner
> side of nested loops makes my head hurt.
> So before doing that, could you run explain (analyze,buffers) on both of
> these much simpler (but hopefully morally equivalent to this planner
> node) sql:
>
> select * from table_b_1 where a_id = <some plausible value> and
> organization_id=2 and year=2013 and month=3
>
> select * from table_b_1 where a_id = <some plausible value> and
> organization_id=2 and year=2013 and month=3 and product_id=1
>

table_b_1 is a partition of table_b on product_id, so when querying
table_b_1 directly the two queries are equivalent.

explain (analyze,buffers) select * FROM table_b_1 where a_id=1128944 and organization_id=2 and year=2013 and month=3;

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1  (cost=0.00..50.73 rows=1 width=56) (actual time=60.328..60.330 rows=1 loops=1)
   Index Cond: ((a_id = 1128944) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
   Buffers: shared hit=1 read=5
 Total runtime: 60.378 ms
(4 rows)


The plans are the same whether or not I specify product_id in the where
clause (if I query the parent table and omit the product_id clause, it
of course scans all the other partitions as well).




>
> Of particular interest here is whether the estimate of 1 row is due to
> the specificity of the filter, or if the index clauses alone are
> specific enough to drive that estimate.  (If you get many rows without
> the product_id filter, that would explain the high estimate.)

The index clauses alone; we normally expect 1 row back from a query like
that.


>
> Please run with the default cost parameters, or if you can't get the
> right plan with the defaults, specify what the used parameters were.

indexTotalCost += index->pages * spc_random_page_cost / 100000.0;

is driving my high costs on the inner loop. The index has 2-5 million
pages depending on the partition. If I run this against 9.2.2, which
divides by 10000.0 instead, the estimate is even higher.
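
To put rough numbers on that, taking figures from this thread (about 3 million index pages and random_page_cost = 2.0):

    fudge per inner index scan   = 3,000,000 * 2.0 / 100000.0 = 60 cost units
    over the 414,249 loops above = 60 * 414,249 ≈ 24.9 million cost units

which swamps the real cost of an index probe that touches only a handful of mostly-cached pages.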

If I instead try the

indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;

from 9.3, then with some fiddling I can make it pick the plan on some
partitions with product_id=2 but not product_id=1. If I remove the
fudge-factor cost adjustment line entirely I always get the nested-loop
plan.

Breaking the index into smaller partial indexes for each year seems to
be giving me the plans I want with random_page_cost=2 (I might also try
partial indexes on the month).
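
A sketch of that approach; the column list follows the unique index named earlier, but the exact definition is a guess at the real schema:

    CREATE INDEX table_b_1_ptid_orgid_m_2013
        ON table_b_1 (a_id, organization_id, month)
        WHERE year = 2013;   -- one smaller partial index per year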

Even with the 9.3 log-based fudge-factor, the fudge-factor is big enough
that the planner picks a table scan over the index. A lot of loop
iterations can be satisfied by cached pages of the index; the
fudge-factor doesn't really account for this.




>
> Cheers,
>
> Jeff



Re: slow bitmap heap scans on pg 9.2

From
Jeff Janes
Date:
On Sat, Apr 13, 2013 at 9:14 AM, Steve Singer <ssinger@ca.afilias.info> wrote:

    
indexTotalCost += index->pages * spc_random_page_cost / 100000.0;

is driving my high costs on the inner loop. The index has 2-5 million pages depending on the partition. If I run this against 9.2.2, which divides by 10000.0 instead, the estimate is even higher.

If I instead try the

indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;

from 9.3, then with some fiddling I can make it pick the plan on some partitions with product_id=2 but not product_id=1. If I remove the fudge-factor cost adjustment line entirely I always get the nested-loop plan.

That was only temporarily the formula during 9.3dev.  Tom re-did that entire part of the code rather substantially in the current tip of 9.3 (commit 31f38f28b00cbe2b).  Now it is based on the number of tuples, and the height, rather than pages, and is multiplied by the cpu_operator_cost not the random_page_cost.

descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;

...

descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
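
For scale, plugging in the 1,530,710,469-row table from upthread and the cpu_operator_cost of 0.0025 used here, the first term works out to

    ceil(log(1530710469) / log(2.0)) * 0.0025 = 31 * 0.0025 ≈ 0.078

cost units per index descent, a tiny charge compared with the old page-based fudge factor.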

 

Breaking the index into smaller partial indexes for each year seems to be giving me the plans I want with random_page_cost=2 (I might also try partial indexes on the month).

Even with the 9.3 log-based fudge-factor, the fudge-factor is big enough that the planner picks a table scan over the index.

Have you tried it under 9.3 HEAD, rather than just back-porting the temporary
indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
code into 9.2?

If you are trying to make your own private copy of 9.2, then removing the fudge factor altogether is probably the way to go.  But if you want to help improve future versions, you probably need to test with the most up-to-date dev version.

 
A lot of loop iterations can be satisfied by cached pages of the index; the fudge-factor doesn't really account for this.


Setting random_page_cost to 2 is already telling it that most fetches are coming from the cache.  Of course for the upper blocks of an index even more than "most" are likely to be, but the latest dev code takes care of that.

Cheers,

Jeff

Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-13 04:54 PM, Jeff Janes wrote:
> On Sat, Apr 13, 2013 at 9:14 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>
>
>     indexTotalCost += index->pages * spc_random_page_cost / 100000.0;
>
>     is driving my high costs on the inner loop. The index has 2-5
>     million pages depending on the partition. If I run this against
>     9.2.2, which divides by 10000.0 instead, the estimate is even higher.
>
>     If I instead try the
>
>     indexTotalCost += log(1.0 + index->pages / 10000.0) *
>     spc_random_page_cost;
>
>     from 9.3, then with some fiddling I can make it pick the plan on
>     some partitions with product_id=2 but not product_id=1. If I
>     remove the fudge-factor cost adjustment line entirely I always get
>     the nested-loop plan.
>
>
> That was only temporarily the formula during 9.3dev.  Tom re-did that
> entire part of the code rather substantially in the current tip of 9.3
> (commit 31f38f28b00cbe2b).  Now it is based on the number of tuples, and
> the height, rather than pages, and is multiplied by the
> cpu_operator_cost not the random_page_cost.
>
> descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
>
> ...
>
> descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
>
>
>     Breaking the index into smaller partial indexes for each year seems
>     to be giving me the plans I want with random_page_cost=2 (I might
>     also try partial indexes on the month).
>
>     Even with the 9.3 log-based fudge-factor, the fudge-factor is big
>     enough that the planner picks a table scan over the index.
>
>
> Have you tried it under 9.3 HEAD, rather than just back-porting the
> temporary
> indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
> code into 9.2?
>
> If you are trying to make your own private copy of 9.2, then removing
> the fudge factor altogether is probably the way to go.  But if you want
> to help improve future versions, you probably need to test with the most
> up-to-date dev version.

I will do that in a few days.  I don't have enough disk space on this
dev server to have a 9.2 datadir and a 9.3 one for this database.  Once
I have a solution that I can use with 9.2 firmed up I can upgrade the
datadir to 9.3 and test this.  I am hoping I can get a set of partial
indexes that will give good results with an unmodified 9.2, so far that
looks promising but I still have more cases to verify (these indexes
take a while to build).

>
>     A lot of loop iterations can be satisfied by cached pages of the
>     index; the fudge-factor doesn't really account for this.
>
>
>
> Setting random_page_cost to 2 is already telling it that most fetches
> are coming from the cache.  Of course for the upper blocks of an index
> even more than "most" are likely to be, but the latest dev code takes
> care of that.
>
> Cheers,
>
> Jeff



Re: slow bitmap heap scans on pg 9.2

From
Steve Singer
Date:
On 13-04-14 08:06 PM, Steve Singer wrote:
> On 13-04-13 04:54 PM, Jeff Janes wrote:

>>
>> If you are trying to make your own private copy of 9.2, then removing
>> the fudge factor altogether is probably the way to go.  But if you want
>> to help improve future versions, you probably need to test with the most
>> up-to-date dev version.
>
> I will do that in a few days.  I don't have enough disk space on this
> dev server to have a 9.2 datadir and a 9.3 one for this database.  Once
> I have a solution that I can use with 9.2 firmed up I can upgrade the
> datadir to 9.3 and test this.  I am hoping I can get a set of partial
> indexes that will give good results with an unmodified 9.2, so far that
> looks promising but I still have more cases to verify (these indexes
> take a while to build).
>

I've run these queries against a recent master/9.3 build, and the planner
is picking the nested-loop plan using the full index. This is with
random_page_cost=2.