Thread: Index INCLUDE vs. Bitmap Index Scan

Index INCLUDE vs. Bitmap Index Scan

From
Markus Winand
Date:
Hi!

I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, which it doesn’t at the moment (tested on
masteras of yesterday). 

Consider this (find the setup at the bottom of this mail).

CREATE INDEX idx ON tbl (a, b) INCLUDE (c);

EXPLAIN (analyze, buffers)
SELECT *
  FROM tbl
 WHERE a = 1
   AND c = 1;

The following plan could be produced:

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.027..0.029 rows=1 loops=1)
   Recheck Cond: (a = 1)
   Filter: (c = 1)
   Rows Removed by Filter: 7
   Heap Blocks: exact=2
   Buffers: shared hit=2 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=8 loops=1)
         Index Cond: (a = 1)
         Buffers: shared read=1
 Planning Time: 0.615 ms
 Execution Time: 0.060 ms


The Bitmap Index Scan does not filter on column C, which is INCLUDEd in the index leaf nodes.

Instead, the Bitmap Heap Scan fetches unnecessary blocks and then applies the filter.

I would expect a similar execution as with this index:

DROP INDEX idx;
CREATE INDEX idx ON tbl (a, b, c);

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.021..0.021 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
         Index Cond: ((a = 1) AND (c = 1))
         Buffers: shared read=1
 Planning Time: 0.123 ms
 Execution Time: 0.037 ms

(As a side node: I also dislike it how Bitmap Index Scan mixes search conditions and filters in “Index Cond”)

Due to the not-filtered column B in the index, the use of column C is here pretty much the same as it is for the first
index,which has C in INCLUDE. 

Am I missing anything or is it just an oversight because INCLUDE was primarily done for Index Only Scan?

As a background, here is the use case I see for this scenario:
Filters that can not be used for searching the tree can be put in INCLUDE instead of the key-part of the index even if
thereis no intend to run the query as Index Only Scan (e.g. SELECT *). Columns in INCLUDE can be used for filtering
(worksfine for Index Only Scan, btw). 

The most prominent example are post-fix or in-fix LIKE ‘%term%’ filters. The benefit of putting such columns in INCLUDE
isthat it is clearly documented that the index column is neither used for searching in the tree nor for ordering. It is
eitherused for filtering, or for fetching. Knowing this makes it easier to extend existing index. 

Imagine you have this query to optimize:

SELECT *
  FROM …
 WHERE a = ?
   AND b = ?
 ORDER BY ts DESC
 LIMIT 10

And find this existing index on that table:

CREATE INDEX … ON … (a, b) INCLUDE (c);

In this case it is a rather safe option to replace the index with the following:

CREATE INDEX … ON … (a, b, ts) INCLUDE (c);

On the other hand, if the original index had all column in the key part, changing it to (A, B, TS, C) is a rather
dangerousoption. 

-markus

— Setup for the demo

-- demo table: 4 rows per block
-- the row if interest is the first one, the others spill to a second block so the problem can also seen in “Buffers"

CREATE TABLE tbl (a INTEGER, b INTEGER, c INTEGER, junk CHAR(1900));

INSERT INTO tbl VALUES (1, 1, 1, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk'),
                       (1, 1, 9, 'junk');

-- prevent unwanted plans for demo of Bitmap Index+Heap Scan
SET ENABLE_INDEXSCAN = OFF;
SET ENABLE_SEQSCAN = OFF;


Re: Index INCLUDE vs. Bitmap Index Scan

From
Tom Lane
Date:
Markus Winand <markus.winand@winand.at> writes:
> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, which it doesn’t at the moment (tested on
masteras of yesterday). 

Regular index scans don't do what you're imagining either (i.e., check
filter conditions in advance of visiting the heap).  There's a roadblock
to implementing such behavior, which is that we might end up applying
filter expressions to dead rows.  That could make users unhappy.
For example, given a filter condition like "1.0/c > 0.1", people
would complain if it still got zero-divide failures even after they'd
deleted all rows with c=0 from their table.

Generally speaking, we expect indexable operators not to throw
errors on any input values, which is why this problem isn't serious
for the index conditions proper.  But we can't make that assumption
for arbitrary filter conditions.

> (As a side node: I also dislike it how Bitmap Index Scan mixes search conditions and filters in “Index Cond”)

What do you think is being mixed exactly?

            regards, tom lane


Re: Index INCLUDE vs. Bitmap Index Scan

From
Andres Freund
Date:
Hi,

On 2019-02-26 18:22:41 -0500, Tom Lane wrote:
> Markus Winand <markus.winand@winand.at> writes:
> > I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, which it doesn’t at the moment (tested
onmaster as of yesterday).
 
> 
> Regular index scans don't do what you're imagining either (i.e., check
> filter conditions in advance of visiting the heap).  There's a roadblock
> to implementing such behavior, which is that we might end up applying
> filter expressions to dead rows.  That could make users unhappy.
> For example, given a filter condition like "1.0/c > 0.1", people
> would complain if it still got zero-divide failures even after they'd
> deleted all rows with c=0 from their table.
> 
> Generally speaking, we expect indexable operators not to throw
> errors on any input values, which is why this problem isn't serious
> for the index conditions proper.  But we can't make that assumption
> for arbitrary filter conditions.

We could probably work around that, by only doing such pre-checks on
index tuples only for tuples known to be visible according to the
VM. Would quite possibly be too hard to understand for users
though. Also not sure if it'd actually perform that well due to the
added random IO, as we'd have to do such checks before entering a tuple
into the bitmap (as obviously we don't have a tuple afterwards).

Greetings,

Andres Freund


Re: Index INCLUDE vs. Bitmap Index Scan

From
Justin Pryzby
Date:
On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
> CREATE INDEX idx ON tbl (a, b, c);
>  Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.021..0.021 rows=1 loops=1)
>    Recheck Cond: ((a = 1) AND (c = 1))
>    ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
>          Index Cond: ((a = 1) AND (c = 1))
> 
> (As a side node: I also dislike it how Bitmap Index Scan mixes search conditions and filters in “Index Cond”)

I don't think it's mixing them;  it's using index scan on leading *and*
nonleading column.  That's possible even if frequently not efficient.

Justin


Re: Index INCLUDE vs. Bitmap Index Scan

From
Markus Winand
Date:

> On 27.02.2019, at 00:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Markus Winand <markus.winand@winand.at> writes:
>> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, which it doesn’t at the moment (tested on
masteras of yesterday). 
>
> Regular index scans don't do what you're imagining either (i.e., check
> filter conditions in advance of visiting the heap).  There's a roadblock
> to implementing such behavior, which is that we might end up applying
> filter expressions to dead rows.  That could make users unhappy.
> For example, given a filter condition like "1.0/c > 0.1", people
> would complain if it still got zero-divide failures even after they'd
> deleted all rows with c=0 from their table.

Ok, but I don’t see how this case different for key columns vs. INCLUDE columns.

When I test this with the (a, b, c) index (no INCLUDE), different plans are produced for "c=1" (my original example)
vs."1.0/c > 0.1”. 

The second one postpones this condition to the Bitmap Heap Scan.

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=4) (actual time=0.023..0.028 rows=8 loops=1)
   Recheck Cond: (a = 1)
   Filter: ((1.0 / (c)::numeric) > 0.1)
   Heap Blocks: exact=2
   Buffers: shared hit=3
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.007..0.007 rows=8 loops=1)
         Index Cond: (a = 1)
         Buffers: shared hit=1
 Planning Time: 0.053 ms
 Execution Time: 0.044 ms

I’ve never noticed that behaviour before, but if it is there to prevent the exception-on-dead-tuple problem, the same
couldbe applied to INCLUDE columns? 

I realise that this will not cover all use cases I can imagine but it would be consistent for key and non-key columns.

-markus



Re: Index INCLUDE vs. Bitmap Index Scan

From
Markus Winand
Date:

> On 27.02.2019, at 02:00, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
>> CREATE INDEX idx ON tbl (a, b, c);
>> Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.021..0.021 rows=1 loops=1)
>>   Recheck Cond: ((a = 1) AND (c = 1))
>>   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
>>         Index Cond: ((a = 1) AND (c = 1))
>>
>> (As a side node: I also dislike it how Bitmap Index Scan mixes search conditions and filters in “Index Cond”)
>
> I don't think it's mixing them;  it's using index scan on leading *and*
> nonleading column.  That's possible even if frequently not efficient.

The distinction leading / non-leading is very important for performance. Other database products use different names in
theexecution plan so that it is immediately visible (without knowing the index definition). 

- Oracle: access vs. filter predicates
- SQL Server: “seek predicates” vs. “predicates”
- Db2: START/STOP vs. SARG
- MySQL/MariaDB show how many leading columns of the index are used — the rest is just “filtering"

PostgreSQL: no difference visible in the execution plan.

CREATE INDEX idx ON tbl (a,b,c);

EXPLAIN (analyze, buffers)
SELECT *
 FROM tbl
WHERE a = 1
  AND c = 1;

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.017..0.018 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: ((a = 1) AND (c = 1))
         Buffers: shared read=1
 Planning Time: 0.149 ms
 Execution Time: 0.035 ms


DROP INDEX idx;
CREATE INDEX idx ON tbl (a, c, b); -- NOTE: column “c” is second


                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=4.14..8.16 rows=1 width=7616) (actual time=0.013..0.013 rows=1 loops=1)
   Recheck Cond: ((a = 1) AND (c = 1))
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=1
   ->  Bitmap Index Scan on idx  (cost=0.00..4.14 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
         Index Cond: ((a = 1) AND (c = 1))
         Buffers: shared read=1
 Planning Time: 0.262 ms
 Execution Time: 0.036 ms
(9 rows)

-markus



Re: Index INCLUDE vs. Bitmap Index Scan

From
Tom Lane
Date:
Markus Winand <markus.winand@winand.at> writes:
>> On 27.02.2019, at 02:00, Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
>>> (As a side node: I also dislike it how Bitmap Index Scan mixes search conditions and filters in “Index Cond”)

>> I don't think it's mixing them;  it's using index scan on leading *and*
>> nonleading column.  That's possible even if frequently not efficient.

> The distinction leading / non-leading is very important for performance. Other database products use different names
inthe execution plan so that it is immediately visible (without knowing the index definition). 

Other database products don't have the wide range of index types that
we do.  The concepts you propose using are pretty much meaningless
for non-btree indexes.  EXPLAIN doesn't really know which of the index
conditions will usefully cut down the index search space for the
particular type, so it just lists everything that has the right form
to be passed to the index AM.

Note that passing a condition to the AM, rather than executing it as
a filter, is generally a win when possible even if it fails to cut
the portion of the index searched at all.  That's because it can save
visits to the heap (tying back to the original point in this thread,
that we test index conditions, then heap liveness check, then filter
conditions).  So the planner is aggressive about pushing things into
that category when it can.

It might help to point out that to be an index condition, a WHERE
clause has to meet tighter conditions than just whether it mentions
an index column.  It generally has to be of the form "index_column
indexable_operator pseudo_constant" (though some index types support
some other cases like "index_column IS NULL" as index conditions too).
Clauses mentioning INCLUDE columns fail this test a priori, because
there are no indexable operators associated with an INCLUDE column.

            regards, tom lane


Re: Index INCLUDE vs. Bitmap Index Scan

From
Tomas Vondra
Date:
On 2/27/19 6:36 AM, Markus Winand wrote:
> 
> 
>> On 27.02.2019, at 00:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Markus Winand <markus.winand@winand.at> writes:
>>> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, which it doesn’t at the moment (tested
onmaster as of yesterday).
 
>>
>> Regular index scans don't do what you're imagining either (i.e., check
>> filter conditions in advance of visiting the heap).  There's a roadblock
>> to implementing such behavior, which is that we might end up applying
>> filter expressions to dead rows.  That could make users unhappy.
>> For example, given a filter condition like "1.0/c > 0.1", people
>> would complain if it still got zero-divide failures even after they'd
>> deleted all rows with c=0 from their table.
> 
> Ok, but I don’t see how this case different for key columns vs. INCLUDE columns.
> 

Yeah, I'm a bit puzzled by this difference too - why would it be safe
for keys and not the other included columns?

> When I test this with the (a, b, c) index (no INCLUDE), different
> plans are produced for "c=1" (my original example) vs. "1.0/c > 0.1”.

I do recall a discussion about executing expressions on index tuples
during IOS (before/without inspecting the heap tuple). I'm too lazy to
search for the thread now, but I recall it was somehow related to
leak-proof-ness. And AFAICS none of the "div" procs are marked as
leak-proof, so perhaps that's one of the reasons?

Of course, this does not explain why equality conditions and such (which
are generally leak-proof) couldn't be moved to the bitmap index scan.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Index INCLUDE vs. Bitmap Index Scan

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 2/27/19 6:36 AM, Markus Winand wrote:
> On 27.02.2019, at 00:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> For example, given a filter condition like "1.0/c > 0.1", people
>>> would complain if it still got zero-divide failures even after they'd
>>> deleted all rows with c=0 from their table.

>> Ok, but I don’t see how this case different for key columns vs. INCLUDE columns.

> Yeah, I'm a bit puzzled by this difference too - why would it be safe
> for keys and not the other included columns?

It's not about the column, it's about the operator.  We assume that
operators appearing in opclasses are safe to execute even for
index entries that correspond to dead rows.  INCLUDE columns don't
have any associated opclass, hence no assumed-usable operators.

> I do recall a discussion about executing expressions on index tuples
> during IOS (before/without inspecting the heap tuple). I'm too lazy to
> search for the thread now, but I recall it was somehow related to
> leak-proof-ness. And AFAICS none of the "div" procs are marked as
> leak-proof, so perhaps that's one of the reasons?

Leak-proof-ness is kind of related, perhaps, but it's not quite the property
we're after here --- at least not to my mind.  It might be an interesting
discussion exactly what the relationship is.  Meanwhile, we don't have any
separate concept of functions that are safe to apply to any index entry;
opclass membership is it.

You could probably argue that any clause containing only indexed variables
and operators that are members of *some* opclass could be used as a filter
in advance of heap liveness checking.  But we lack any infrastructure for
that, in either the planner or executor.

            regards, tom lane


Re: Index INCLUDE vs. Bitmap Index Scan

From
Tom Lane
Date:
I wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I do recall a discussion about executing expressions on index tuples
>> during IOS (before/without inspecting the heap tuple). I'm too lazy to
>> search for the thread now, but I recall it was somehow related to
>> leak-proof-ness. And AFAICS none of the "div" procs are marked as
>> leak-proof, so perhaps that's one of the reasons?

> Leak-proof-ness is kind of related, perhaps, but it's not quite the property
> we're after here --- at least not to my mind.  It might be an interesting
> discussion exactly what the relationship is.  Meanwhile, we don't have any
> separate concept of functions that are safe to apply to any index entry;
> opclass membership is it.

The other thread about RLS helped me to crystallize the vague feelings
I had about this.  I now think that this is what we're actually assuming:
an indexable operator must be leakproof *with respect to its index-key
input*.  If it is not, it might throw errors or otherwise reveal the
existence of index entries for dead rows, which would be a usability fail
whether or not you are excited about security as such.

On the other hand, it's okay to throw errors that only reveal information
about the non-index input.  For instance, it's not a problem for pg_trgm
to treat regex-match operators as indexable, even though those will throw
error for a malformed pattern input.

So this is indeed related to leakproof-ness, but our current definition
of "leakproof" is too simple to capture the property exactly.

Getting back to the question of this thread, I think we'd have to restrict
any filtering done in advance of heap liveness checks to fully leakproof
functions, since we don't want the filter expression to possibly throw
an error, regardless of which input(s) came from the index.

            regards, tom lane