Thread: slow bitmap heap scans on pg 9.2
I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3) is picking a plan involving a bitmap heap scan that turns out to be much slower than a nested-loop plan using indexes.

The planner picks the hashjoin plan by default (see attached files).

Bitmap Heap Scan on public.table_b_2 b (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
  Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
  Rows Removed by Index Recheck: 313195667
  Filter: (b.product_id = 2)

This is the part that seems to be causing the problem (or at least taking most of the time, other than the final aggregation).

If I set enable_hashjoin=false and enable_mergejoin=false I get the nested-loop join plan.

table_b is 137 GB plus indexes, each of which is around 43 GB
table_a is 20 GB

random_page_cost = 2.0
effective_cache_size = 3500MB
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
work_mem = 64MB
shared_buffers = 300MB (for this output, I've also had it at 2GB)

If I bump cpu_tuple_cost to the 10-20 range it will pick the nested-loop join for some date ranges but not all. A cpu_tuple_cost of 20 doesn't sound like a sane value.

This database used to run 8.3, where it picked the nested-loop join. We used pg_upgrade to migrate to 9.2.

Any ideas why the bitmap heap scan is much slower than the planner expects?

Steve
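In other words, the faster nested-loop plan can be reproduced with nothing more than session-level overrides, roughly like this (the query text itself lives in the attached plan files, so only a placeholder is shown):

-- sketch: disable hash/merge joins for this session only to get the nested-loop plan
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE SELECT ...;   -- placeholder for the query from the attached files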
Attachment
On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
> I'm encountering an issue where PG 9.2.4 (we also see this with
> 9.2.3) is picking a plan involving a bitmap heap scan that turns out
> to be much slower than a nested-loop plan using indexes.
>
> The planner picks the hashjoin plan by default (see attached files).
>
> Bitmap Heap Scan on public.table_b_2 b (cost=172635.99..9800225.75 rows=8435754 width=10)
>   (actual time=9132.194..1785196.352 rows=9749680 loops=1)
>   Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
>   Rows Removed by Index Recheck: 313195667
>   Filter: (b.product_id = 2)
>
> This is the part that seems to be causing the problem (or at least taking
> most of the time, other than the final aggregation).
>
> If I set enable_hashjoin=false and enable_mergejoin=false I get the
> nested-loop join plan.
>
> table_b is 137 GB plus indexes, each of which is around 43 GB
> table_a is 20 GB
>
> random_page_cost = 2.0
> effective_cache_size = 3500MB
> cpu_tuple_cost = 0.01
> cpu_index_tuple_cost = 0.005
> cpu_operator_cost = 0.0025
> work_mem = 64MB
> shared_buffers = 300MB (for this output, I've also had it at 2GB)
>
> If I bump cpu_tuple_cost to the 10-20 range it will pick the nested-loop
> join for some date ranges but not all. A cpu_tuple_cost of 20 doesn't
> sound like a sane value.
>
> This database used to run 8.3, where it picked the nested-loop join.
> We used pg_upgrade to migrate to 9.2.
>
> Any ideas why the bitmap heap scan is much slower than the planner expects?
>
> Steve

Hi Steve,

The one thing that stands out to me is that you are working with 200GB of data on a machine with 4-8GB of RAM and you have the random_page_cost set to 2.0. That data is almost completely uncached and I would expect a value of 10 or more to be closer to reality.

Regards,
Ken
On 13-04-10 09:56 AM, ktm@rice.edu wrote:
> On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
>
> Hi Steve,
>
> The one thing that stands out to me is that you are working with 200GB of
> data on a machine with 4-8GB of RAM and you have the random_page_cost set
> to 2.0. That data is almost completely uncached and I would expect a value
> of 10 or more to be closer to reality.

Setting random_page_cost to 15 makes the planner choose the nested-loop plan (at least for the date range I tried).

I thought that the point of effective_cache_size was to tell the planner how likely it is for a random page to be in cache. With 200GB of data for this query and an effective_cache_size of 3.5 GB I would have expected that to be accounted for.

> Regards,
> Ken
On Wed, Apr 10, 2013 at 11:56:32AM -0400, Steve Singer wrote:
> On 13-04-10 09:56 AM, ktm@rice.edu wrote:
> > On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
> >
> > Hi Steve,
> >
> > The one thing that stands out to me is that you are working with 200GB of
> > data on a machine with 4-8GB of RAM and you have the random_page_cost set
> > to 2.0. That data is almost completely uncached and I would expect a value
> > of 10 or more to be closer to reality.
>
> Setting random_page_cost to 15 makes the planner choose the
> nested-loop plan (at least for the date range I tried).
>
> I thought that the point of effective_cache_size was to tell the
> planner how likely it is for a random page to be in cache. With
> 200GB of data for this query and an effective_cache_size of 3.5 GB I
> would have expected that to be accounted for.

For random_page_cost to be that low, the database would need to be mostly cached. 3.5GB is almost 100X too small to do that unless your query exhibits a large amount of locality of reference. Values for random_page_cost between 10 and 20 are very reasonable for disk-bound I/O scenarios.

Regards,
Ken
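A low-risk way to try values in that range, before touching postgresql.conf, is a per-session override along these lines (the problem query is elided here):

SET random_page_cost = 15;    -- session-local only; the config file is untouched
EXPLAIN ANALYZE SELECT ...;   -- placeholder for the problem query
RESET random_page_cost;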
On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3) is picking a plan involving a bitmap heap scan that turns out to be much slower than a nested-loop plan using indexes.
The planner picks the hashjoin plan by default (see attached files)
Bitmap Heap Scan on public.table_b_2 b (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
Rows Removed by Index Recheck: 313195667
Filter: (b.product_id = 2)
I think the index recheck means your bitmap is overflowing (i.e. needing more space than work_mem), so it keeps only the pages which have at least one match, which means all rows in those pages need to be rechecked. How many rows does the table have? You might be essentially doing a seq scan, but with the additional overhead of the bitmap machinery. Could you do "explain (analyze,buffers)", preferably with track_io_timing set to on?
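Concretely, that request amounts to something like the following in the session that runs the query (query text elided; track_io_timing is available from 9.2 on):

SET track_io_timing = on;               -- superuser-only when set per session; adds I/O read/write times to the plan output
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- placeholder for the problem query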
Cheers,
Jeff
On Wed, Apr 10, 2013 at 8:56 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
On 13-04-10 09:56 AM, ktm@rice.edu wrote:
> On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
>
> Hi Steve,
>
> The one thing that stands out to me is that you are working with 200GB of
> data on a machine with 4-8GB of RAM and you have the random_page_cost set
> to 2.0. That data is almost completely uncached and I would expect a value
> of 10 or more to be closer to reality.

Setting random_page_cost to 15 makes the planner choose the nested-loop plan (at least for the date range I tried).

I thought that the point of effective_cache_size was to tell the planner how likely it is for a random page to be in cache.
e_c_s tells it how likely it is to still be in cache the second (and subsequent) time the page is visited during the *same query*. It doesn't tell it how likely it is to be in cache the first time it is needed in a given query. (Also, e_c_s is irrelevant for bitmap scans, as they inherently hit every block only once.)
Also, it doesn't tell the planner how expensive it is to bring a page into cache when it is needed. That is what random_page_cost is for. If you tell it that those fetches are going to be cheap, then it doesn't matter so much how many of them it has to do.
Cheers,
Jeff
On 13-04-10 02:06 PM, Jeff Janes wrote:
> On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>
> I think the index recheck means your bitmap is overflowing (i.e. needing
> more space than work_mem), so it keeps only the pages which have at
> least one match, which means all rows in those pages need to be
> rechecked. How many rows does the table have? You might be essentially
> doing a seq scan, but with the additional overhead of the bitmap
> machinery. Could you do "explain (analyze,buffers)", preferably with
> track_io_timing set to on?

table_b has 1,530,710,469 rows

Attached is the output with track_io_timing and buffers.

> Cheers,
>
> Jeff
Attachment
On 13-04-10 07:54 PM, Steve Singer wrote:
> On 13-04-10 02:06 PM, Jeff Janes wrote:
>> On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>>
>> I think the index recheck means your bitmap is overflowing (i.e. needing
>> more space than work_mem), so it keeps only the pages which have at
>> least one match, which means all rows in those pages need to be
>> rechecked. How many rows does the table have? You might be essentially
>> doing a seq scan, but with the additional overhead of the bitmap
>> machinery. Could you do "explain (analyze,buffers)", preferably with
>> track_io_timing set to on?
>
> table_b has 1,530,710,469 rows
>
> Attached is the output with track_io_timing and buffers.

I've done some more testing with random_page_cost=20. This gives me the nested-loop plan for the various date ranges I've tried.

However, table_a_2 and table_b_2 are actually partition tables. This query only needs to look at a single partition. When I run this same query against a different partition (a smaller partition, but still bigger than cache) it picks a hash join plan involving a seq scan of table_b but no bitmap index scan. On this partition the hash-join plans tend to take 15 minutes versus 2 minutes when I disable hashjoin plans. Bumping random_page_cost higher doesn't fix this.

I think the reason why it is picking the hash join based plans is because of

Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
  Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
  Filter: (product_id = 1)

I think we are over-estimating the cost of the index scans in the inner loop. This seems similar to what was discussed a few months ago:
http://www.postgresql.org/message-id/092a01cdd230$ff6143c0$fe23cb40$@foo.me.uk

This version of PG should have 3e9960e9d935e7e applied. I am trying to get the database copied to a machine where I can easily switch PG versions and test this against something prior to that commit and also against a 9.3 build.

Steve

>> Cheers,
>>
>> Jeff
Attachment
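For repeatability, the hashjoin-disabled timing on the smaller partition can be captured without changing postgresql.conf by scoping the override to a single transaction; a rough sketch (query elided):

BEGIN;
SET LOCAL enable_hashjoin = off;         -- reverts automatically at COMMIT/ROLLBACK
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- same query, pointed at the smaller partition
ROLLBACK;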
On Thursday, April 11, 2013, Steve Singer wrote:
I think the reason why it is picking the hash join based plans is because of
Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
Filter: (product_id = 1)
Trying to reason about how the planner estimates costs for the inner side of nested loops makes my head hurt.
So before doing that, could you run explain (analyze,buffers) on both of these much simpler (but hopefully morally equivalent to this planner node) sql:
select * from table_b_1_b where a_id = <some plausible value> and organization_id=2 and year=2013 and month=3
select * from table_b_1_b where a_id = <some plausible value> and organization_id=2 and year=2013 and month=3 and product_id=1
Of particular interest here is whether the estimate of 1 row is due to the specificity of the filter, or if the index clauses alone are specific enough to drive that estimate. (If you get many rows without the product_id filter, that would explain the high estimate.)
Please run with the default cost parameters, or if you can't get the right plan with the defaults, specify what the used parameters were.
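Of the settings listed earlier in the thread, only random_page_cost (and arguably effective_cache_size) differ from the built-in defaults, so for one session that amounts to roughly the following (the a_id value is a placeholder, as above):

SET random_page_cost = 4;            -- built-in default; this setup normally runs 2.0
SET effective_cache_size = '128MB';  -- built-in default in 9.2
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table_b_1_b
 WHERE a_id = <some plausible value>
   AND organization_id = 2 AND year = 2013 AND month = 3;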
Cheers,
Jeff
On 13-04-12 09:20 PM, Jeff Janes wrote:
> On Thursday, April 11, 2013, Steve Singer wrote:
>
>     I think the reason why it is picking the hash join based plans is
>     because of
>
>     Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b
>     (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
>       Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
>       Filter: (product_id = 1)
>
> Trying to reason about how the planner estimates costs for the inner
> side of nested loops makes my head hurt.
>
> So before doing that, could you run explain (analyze,buffers) on both of
> these much simpler (but hopefully morally equivalent to this planner
> node) sql:
>
> select * from table_b_1_b where a_id = <some plausible value> and
> organization_id=2 and year=2013 and month=3
>
> select * from table_b_1_b where a_id = <some plausible value> and
> organization_id=2 and year=2013 and month=3 and product_id=1

table_b_1 is a partition of table_b on product_id, so when querying table_b_1 directly they are equivalent.

explain (analyze,buffers) select * FROM table_b_1 where a_id=1128944 and organization_id=2 and year=2013 and month=3;

                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 (cost=0.00..50.73 rows=1 width=56) (actual time=60.328..60.330 rows=1 loops=1)
   Index Cond: ((a_id = 1128944) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
   Buffers: shared hit=1 read=5
 Total runtime: 60.378 ms
(4 rows)

The plans are the same whether or not I specify the product_id in the where clause (if I query the parent table and neglect to include that clause it of course queries all the other partitions).

> Of particular interest here is whether the estimate of 1 row is due to
> the specificity of the filter, or if the index clauses alone are
> specific enough to drive that estimate. (If you get many rows without
> the product_id filter, that would explain the high estimate.)

The index clauses alone; we normally expect 1 row back for a query like that.

> Please run with the default cost parameters, or if you can't get the
> right plan with the defaults, specify what the used parameters were.

indexTotalCost += index->pages * spc_random_page_cost / 100000.0;

is driving my high costs on the inner loop. The index has 2-5 million pages depending on the partition. If I run this against 9.2.2 with / 10000.0 the estimate is even higher.

If I try this with the

*indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;

from 9.3, I can make it pick the plan on some partitions with product_id=2 but not product_id=1. If I remove the fudge-factor cost adjustment line I get the nested-loop plan always.

Breaking the index into smaller partial indexes for each year seems to be giving me the plans I want with random_page_cost=2 (I might also try partial indexes on the month).

Even with the 9.3 log-based fudge factor we are seeing the fudge factor being big enough that the planner is picking a table scan over the index. A lot of loop iterations can be satisfied by cached pages of the index; the fudge factor doesn't really account for this.

> Cheers,
>
> Jeff
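For reference, the per-year partial indexes being tested look roughly like the following; the index name and column order here just mirror the existing table_b_1_ptid_orgid_ym_unq index and are assumptions, since the real definitions are still being tuned:

-- sketch only: one partial index per year on one partition
CREATE INDEX table_b_1_aid_orgid_month_2013
    ON table_b_1 (a_id, organization_id, month)
    WHERE year = 2013;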
On Sat, Apr 13, 2013 at 9:14 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
indexTotalCost += index->pages * spc_random_page_cost / 100000.0;
is driving my high costs on the inner loop. The index has 2-5 million pages depending on the partition. If I run this against 9.2.2 with / 10000.0 the estimate is even higher.
If I try this with the
*indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
from 9.3, I can make it pick the plan on some partitions with product_id=2 but not product_id=1. If I remove the fudge-factor cost adjustment line I get the nested-loop plan always.
That formula was only in place temporarily during 9.3dev. Tom re-did that entire part of the code rather substantially in the current tip of 9.3 (commit 31f38f28b00cbe2b). Now it is based on the number of tuples and the height, rather than pages, and is multiplied by cpu_operator_cost rather than random_page_cost.
descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
...
descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
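To get a feel for the relative size of these terms, the formulas can be evaluated as plain SQL, plugging in 3 million index pages, the 1,530,710,469-row table, random_page_cost = 2.0, and cpu_operator_cost = 0.0025 (the page count and the assumed btree height of 3 are illustrative, not measured values):

SELECT
    3000000 * 2.0 / 100000.0          AS fudge_9_2_4,   -- pages * random_page_cost / 100000  (~60)
    3000000 * 2.0 / 10000.0           AS fudge_9_2_2,   -- pages * random_page_cost / 10000   (~600)
    ln(1.0 + 3000000 / 10000.0) * 2.0 AS fudge_9_3dev,  -- log(1 + pages/10000) * random_page_cost (~11)
    ceil(ln(1530710469) / ln(2.0)) * 0.0025
      + (3 + 1) * 50.0 * 0.0025       AS descent_9_3;   -- tuple- and height-based descent cost (< 1)

Spread over the ~414,249 inner-loop iterations seen above, a per-scan surcharge of around 60 is presumably enough to push the planner away from the nested loop, while the 9.3 descent cost stays negligible.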
Breaking the index into smaller partial indexes for each year seems to be giving me the plans I want with random_page_cost=2 (I might also try partial indexes on the month).
Even with the 9.3 log-based fudge factor we are seeing the fudge factor being big enough that the planner is picking a table scan over the index.
Have you tried it under 9.3 HEAD, rather than just back-porting the temporary
*indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
code into 9.2?
If you are trying to make your own private copy of 9.2, then removing the fudge factor altogether is probably the way to go. But if you want to help improve future versions, you probably need to test with the most up-to-date dev version.
A lot of loop iterations can be satisfied by cached pages of the index; the fudge factor doesn't really account for this.
Setting random_page_cost to 2 is already telling it that most fetches are coming from the cache. Of course for the upper blocks of an index even more than "most" are likely to be, but the latest dev code takes care of that.
Cheers,
Jeff
On 13-04-13 04:54 PM, Jeff Janes wrote:
> On Sat, Apr 13, 2013 at 9:14 AM, Steve Singer <ssinger@ca.afilias.info> wrote:
>
>     indexTotalCost += index->pages * spc_random_page_cost / 100000.0;
>
>     is driving my high costs on the inner loop. The index has 2-5
>     million pages depending on the partition. If I run this against
>     9.2.2 with / 10000.0 the estimate is even higher.
>
>     If I try this with the
>
>     *indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
>
>     from 9.3, I can make it pick the plan on some partitions with
>     product_id=2 but not product_id=1. If I remove the fudge-factor
>     cost adjustment line I get the nested-loop plan always.
>
> That formula was only in place temporarily during 9.3dev. Tom re-did that
> entire part of the code rather substantially in the current tip of 9.3
> (commit 31f38f28b00cbe2b). Now it is based on the number of tuples and
> the height, rather than pages, and is multiplied by cpu_operator_cost
> rather than random_page_cost.
>
> descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
>
> ...
>
> descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
>
>     Breaking the index into smaller partial indexes for each year seems
>     to be giving me the plans I want with random_page_cost=2 (I might
>     also try partial indexes on the month).
>
>     Even with the 9.3 log-based fudge factor we are seeing the fudge
>     factor being big enough that the planner is picking a table scan
>     over the index.
>
> Have you tried it under 9.3 HEAD, rather than just back-porting the
> temporary
> *indexTotalCost += log(1.0 + index->pages / 10000.0) * spc_random_page_cost;
> code into 9.2?
>
> If you are trying to make your own private copy of 9.2, then removing
> the fudge factor altogether is probably the way to go. But if you want
> to help improve future versions, you probably need to test with the most
> up-to-date dev version.

I will do that in a few days. I don't have enough disk space on this dev server to have a 9.2 datadir and a 9.3 one for this database. Once I have a solution that I can use with 9.2 firmed up I can upgrade the datadir to 9.3 and test this. I am hoping I can get a set of partial indexes that will give good results with an unmodified 9.2; so far that looks promising but I still have more cases to verify (these indexes take a while to build).

>     A lot of loop iterations can be satisfied by cached pages of the
>     index; the fudge factor doesn't really account for this.
>
> Setting random_page_cost to 2 is already telling it that most fetches
> are coming from the cache. Of course for the upper blocks of an index
> even more than "most" are likely to be, but the latest dev code takes
> care of that.
>
> Cheers,
>
> Jeff
On 13-04-14 08:06 PM, Steve Singer wrote:
> On 13-04-13 04:54 PM, Jeff Janes wrote:
>>
>> If you are trying to make your own private copy of 9.2, then removing
>> the fudge factor altogether is probably the way to go. But if you want
>> to help improve future versions, you probably need to test with the most
>> up-to-date dev version.
>
> I will do that in a few days. I don't have enough disk space on this
> dev server to have a 9.2 datadir and a 9.3 one for this database. Once
> I have a solution that I can use with 9.2 firmed up I can upgrade the
> datadir to 9.3 and test this. I am hoping I can get a set of partial
> indexes that will give good results with an unmodified 9.2; so far that
> looks promising but I still have more cases to verify (these indexes
> take a while to build).

I've run these queries against a recent master/9.3 and the planner is picking the nested-loop plan using the full index. This is with random_page_cost=2.