Thread: rows estimate in explain analyze for the BRIN index
<div class="">Hi,</div><div class=""><br class="" /></div><div class="">While experimenting with BRIN on PostgreSQL 9.5RC1I came across the following plan (which is, btw a very good example of how BRIN rocks for the clustered data, the sizeof this table is around 90GB, the size of the index is around 3MB):</div><div class=""><br class="" /></div><div class="">explain(analyze, buffers, verbose) select 1 from example where event_time = now() - interval '5 months';</div><divclass=""> QUERY PLAN</div><div class="">-------------------------------------------------------------------------------------------------------------------------------------------------------</div><div class=""> BitmapHeap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1)</div><divclass=""> Output: 1</div><div class=""> Recheck Cond: (example.event_time = (now() - '5 mons'::interval))</div><divclass=""> Rows Removed by Index Recheck: 4030</div><div class=""> Heap Blocks: lossy=128</div><divclass=""> Buffers: shared hit=629</div><div class=""> -> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)</div><div class=""> Index Cond: (example.event_time = (now() - '5 mons'::interval))</div><div class=""> Buffers: sharedhit=501</div><div class=""> Planning time: 0.125 ms</div><div class=""> Execution time: 73.943 ms</div><div class="">(11rows)</div><div class=""><br class="" /></div><div class="">Time: 74.642 ms</div><div class=""><br class="" /></div><divclass=""><br class="" /></div><div class="">Here EXPLAIN ANALYZE reports 1280 rows from the Bitmap Index Scan,but on the higher level, 4030 rows were removed by Index Recheck. </div><div class=""><br class="" /></div><div class="">Thequestions are:</div><div class=""><br class="" /></div><div class="">- how does it get 1280 rows from the BRINindex scan, given that BRIN only stores pointers to the heap blocks, not individual rows. Does it calculate the numberof rows in the blocks returned?</div><div class=""><br class="" /></div><div class="">- how comes that the subsequentrecheck filters out 4030 rows, out of supposedly 1280 returned?</div><div class=""><br class="" /></div><div class="">Kindregards,</div><div class=""><div class="" style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto;text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width:0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">--</div><divclass="" style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start;text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width:0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Oleksii</div></div><brclass="" />
Oleksii Kliukin <alexk@hintbits.com> writes: > Bitmap Heap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1) > Output: 1 > Recheck Cond: (example.event_time = (now() - '5 mons'::interval)) > Rows Removed by Index Recheck: 4030 > Heap Blocks: lossy=128 > Buffers: shared hit=629 > -> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280loops=1) > Index Cond: (example.event_time = (now() - '5 mons'::interval)) > Buffers: shared hit=501 > - how does it get 1280 rows from the BRIN index scan, given that BRIN only stores pointers to the heap blocks, not individualrows. Does it calculate the number of rows in the blocks returned? It evidently returned 128 block IDs to the heapscan logic. I have not looked at the code, but a reasonable bet is that it's just guessing that there are 10 rows per block. That seems like an awfully low number, though; it equates to assuming that rows are 800 bytes wide on average. If we're going to use a fixed number, 100 rows per block would probably be more nearly the correct order of magnitude. Another idea would be to use the heap's row density as calculated by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100 if relpages=0. This'd only be convenient if the bitmap scan node has the parent heap rel open, which it might not. regards, tom lane
On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:Oleksii Kliukin <alexk@hintbits.com> writes:Bitmap Heap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1)
Output: 1
Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
Rows Removed by Index Recheck: 4030
Heap Blocks: lossy=128
Buffers: shared hit=629
-> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)
Index Cond: (example.event_time = (now() - '5 mons'::interval))
Buffers: shared hit=501- how does it get 1280 rows from the BRIN index scan, given that BRIN only stores pointers to the heap blocks, not individual rows. Does it calculate the number of rows in the blocks returned?
It evidently returned 128 block IDs to the heapscan logic. I have not
looked at the code, but a reasonable bet is that it's just guessing that
there are 10 rows per block.
You are right, this is at the end of bringetbitmap in brin.c
/*
* XXX We have an approximation of the number of *pages* that our scan
* returns, but we don't have a precise idea of the number of heap tuples
* involved.
*/
PG_RETURN_INT64(totalpages * 10);
That seems like an awfully low number, though; it equates to assuming
that rows are 800 bytes wide on average. If we're going to use a fixed
number, 100 rows per block would probably be more nearly the correct
order of magnitude.
Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.
+1
--
Oleksii
Oleksii Kliukin <alexk@hintbits.com> writes: >> On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Another idea would be to use the heap's row density as calculated >> by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100 >> if relpages=0. This'd only be convenient if the bitmap scan node has >> the parent heap rel open, which it might not. > +1 Any objections to the attached? regards, tom lane diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 2622a7e..64bf788 100644 *** a/src/backend/access/brin/brin.c --- b/src/backend/access/brin/brin.c *************** bringetbitmap(PG_FUNCTION_ARGS) *** 279,286 **** Relation heapRel; BrinOpaque *opaque; BlockNumber nblocks; BlockNumber heapBlk; ! int totalpages = 0; FmgrInfo *consistentFn; MemoryContext oldcxt; MemoryContext perRangeCxt; --- 279,287 ---- Relation heapRel; BrinOpaque *opaque; BlockNumber nblocks; + int heap_tuples_per_page; BlockNumber heapBlk; ! int64 totalpages = 0; FmgrInfo *consistentFn; MemoryContext oldcxt; MemoryContext perRangeCxt; *************** bringetbitmap(PG_FUNCTION_ARGS) *** 291,301 **** /* * We need to know the size of the table so that we know how long to ! * iterate on the revmap. */ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false); heapRel = heap_open(heapOid, AccessShareLock); nblocks = RelationGetNumberOfBlocks(heapRel); heap_close(heapRel, AccessShareLock); /* --- 292,309 ---- /* * We need to know the size of the table so that we know how long to ! * iterate on the revmap. While we have it open, estimate the number of ! * tuples per heap page for use later. */ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false); heapRel = heap_open(heapOid, AccessShareLock); nblocks = RelationGetNumberOfBlocks(heapRel); + if (heapRel->rd_rel->relpages != 0 && heapRel->rd_rel->reltuples > 0) + heap_tuples_per_page = (int) + ((double) heapRel->rd_rel->reltuples / + (BlockNumber) heapRel->rd_rel->relpages); + else /* if no info, assume 100-byte tuples */ + heap_tuples_per_page = BLCKSZ / 100; heap_close(heapRel, AccessShareLock); /* *************** bringetbitmap(PG_FUNCTION_ARGS) *** 447,457 **** ReleaseBuffer(buf); /* ! * XXX We have an approximation of the number of *pages* that our scan ! * returns, but we don't have a precise idea of the number of heap tuples ! * involved. */ ! PG_RETURN_INT64(totalpages * 10); } /* --- 455,465 ---- ReleaseBuffer(buf); /* ! * We have an approximation of the number of pages that our scan returns, ! * but we don't have a precise idea of the number of heap tuples involved. ! * We have to estimate based on average tuple density. */ ! PG_RETURN_INT64(totalpages * heap_tuples_per_page); } /*
On 30 Dec 2015, at 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:Oleksii Kliukin <alexk@hintbits.com> writes:On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.+1
Any objections to the attached?
Looks good to me. On my sample system with 100K rows, the new version gives me:
— CREATE TABLE test AS SELECT id FROM generate_series(1,100000) id;
— CREATE INDEX ON test USING brin(id);
postgres=# explain analyze select 1 from test where id = 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=12.01..16.02 rows=1 width=0) (actual time=0.199..4.220 rows=1 loops=1)
Recheck Cond: (id = 500)
Rows Removed by Index Recheck: 28927
Heap Blocks: lossy=128
-> Bitmap Index Scan on test_id_idx (cost=0.00..12.01 rows=1 width=0) (actual time=0.072..0.072 rows=28800 loops=1)
Index Cond: (id = 500)
Planning time: 0.433 ms
Execution time: 4.323 ms
(8 rows)
which is much closer to the actual number of rows removed by the index recheck + the one left.
--
Oleksii
> which is much closer to the actual number of rows removed by the index > recheck + the one left. Is it better to be closer? We are saying those are the "actual" values not the estimates. If we cannot provide the actual rows, I think it is better to provide nothing. Something closer to the reality would create more confusion. Maybe, we just just return the number of blocks, and put somewhere a note about it. The actual row count is already available on the upper part of the plan. By the way, the estimation is a bigger problem than that. Please see my patch [1] about it. [1] http://www.postgresql.org/message-id/CAE2gYzzJvzPy-1cSgZjJyH69izSa13SEgFC4i4r2z0qBQ2P8Uw@mail.gmail.com
Emre Hasegeli <emre@hasegeli.com> writes: >> which is much closer to the actual number of rows removed by the index >> recheck + the one left. > Is it better to be closer? We are saying those are the "actual" > values not the estimates. If we cannot provide the actual rows, I > think it is better to provide nothing. Return zero you mean? Good point; there's precedent for that elsewhere in EXPLAIN ANALYZE, IIRC. If we do what I propose in my patch, it would possibly just lead to more questions like Oleksii's, because people would be even more likely to believe that the number is accurate. regards, tom lane
> On 30 Dec 2015, at 18:38, Emre Hasegeli <emre@hasegeli.com> wrote: > >> which is much closer to the actual number of rows removed by the index >> recheck + the one left. > > Is it better to be closer? We are saying those are the "actual" > values not the estimates. If we cannot provide the actual rows, I > think it is better to provide nothing. Something closer to the > reality would create more confusion. Maybe, we just just return the > number of blocks, and put somewhere a note about it. The actual row > count is already available on the upper part of the plan. I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’tthink “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “besteffort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in termsof the number of tuples retrieved/thrown away) We might still reflect in the documentation that the BRIN index cannot produce the exact number of rows during the bitmapscan and point people asking similar questions there.
> I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’tthink “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “besteffort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in termsof the number of tuples retrieved/thrown away) The number of retrieved and thrown away rows are already available on the upper part of the plan. Bitmap Index Scan should provide the rows that matched the index. Another alternative would be just returning the number of matching pages (by not multiplying with 10). It might be better understood. The users who can understand the EXPLAIN ANALYZE output shouldn't be expecting BRIN to return rows.
Emre Hasegeli <emre@hasegeli.com> writes: >> I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’tthink “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “besteffort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in termsof the number of tuples retrieved/thrown away) We do already have a nearby precedent for returning zero when we don't have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes do. (This is documented btw, at the bottom of section 14.1.) > The number of retrieved and thrown away rows are already available on > the upper part of the plan. Bitmap Index Scan should provide the rows > that matched the index. It doesn't have that information. > Another alternative would be just returning > the number of matching pages (by not multiplying with 10). It might > be better understood. No, it would not, at least not unless we found a way to explicitly mark the output as being blocks not rows (which would doubtless break a lot of existing client-side code). Zero is fairly clearly an impossible value, whereas anything that's not zero is going to be taken at face value by many users. On balance I think likely the best thing to do is return zero, and document that behavior. regards, tom lane
On 30 Dec 2015, at 21:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:Emre Hasegeli <emre@hasegeli.com> writes:I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)
We do already have a nearby precedent for returning zero when we don't
have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
do. (This is documented btw, at the bottom of section 14.1.)
+1 for following a precedent.
The number of retrieved and thrown away rows are already available on
the upper part of the plan. Bitmap Index Scan should provide the rows
that matched the index.
It doesn't have that information.Another alternative would be just returning
the number of matching pages (by not multiplying with 10). It might
be better understood.
No, it would not, at least not unless we found a way to explicitly mark
the output as being blocks not rows (which would doubtless break a lot of
existing client-side code). Zero is fairly clearly an impossible value,
whereas anything that's not zero is going to be taken at face value by
many users.
But is it? Is it impossible for the BRIN bitmap index scan to return 0 rows (say, if the value being matched is outside the min/max boundary for every block range?) Granted, if we document that it always returns 0 and should be ignored, then confusing the actual 0 with the 0 as a representation of “unknown” would be less a problem.
--
Oleksii
Tom Lane wrote: > Emre Hasegeli <emre@hasegeli.com> writes: > >> I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. Idon’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “besteffort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in termsof the number of tuples retrieved/thrown away) > > We do already have a nearby precedent for returning zero when we don't > have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes > do. (This is documented btw, at the bottom of section 14.1.) Hmm, but amgetbitmap is documented thusly: The number of tuples fetched is returned (this might be just anapproximate count, for instance some AMs do not detectduplicates).http://www.postgresql.org/docs/9.5/static/index-functions.html so I'm not sure we're actually violating an expectation that the number of rows is exact. I mean, if we zero out this one, shouldn't we set it to zero for these other documented cases too? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> We do already have a nearby precedent for returning zero when we don't >> have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes >> do. (This is documented btw, at the bottom of section 14.1.) > Hmm, but amgetbitmap is documented thusly: > The number of tuples fetched is returned (this might be just an > approximate count, for instance some AMs do not detect > duplicates). > http://www.postgresql.org/docs/9.5/static/index-functions.html > so I'm not sure we're actually violating an expectation that the number > of rows is exact. I mean, if we zero out this one, shouldn't we set it > to zero for these other documented cases too? Well, the code comments may say that, but the user-facing docs don't ... More generally, people are accustomed to the idea that the planner's estimated rowcounts are just estimates, but they're not accustomed to the idea that the runtime "actual" rowcounts might be just estimates. That seems like it's moving EXPLAIN ANALYZE's contract quite a bit, even if there are some documents for internal APIs that say something else. regards, tom lane
> But is it? Is it impossible for the BRIN bitmap index scan to return 0 rows > (say, if the value being matched is outside the min/max boundary for every > block range?) Granted, if we document that it always returns 0 and should be > ignored, then confusing the actual 0 with the 0 as a representation of > “unknown” would be less a problem. How about -1 ? It is an impossible value for sure. Maybe we should change BitmapAnd and BitmapOr nodes, too. It is better to make it obvious that it is not the correct value. I don't think many people would notice the note on the documentation. On the other hand, returning -1 broke parser of explain.depesz.com [1]. [1] http://explain.depesz.com/s/tAkd