Re: Avoid full GIN index scan when possible - Mailing list pgsql-hackers

From Nikita Glukhov
Subject Re: Avoid full GIN index scan when possible
Date
Msg-id f2889144-db1d-e3b2-db97-cfc8794cda43@postgrespro.ru
In response to Re: Avoid full GIN index scan when possible  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Avoid full GIN index scan when possible  (Michael Paquier <michael@paquier.xyz>)
Re: Avoid full GIN index scan when possible  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-hackers
Attached is the 8th version of the patches.


A brief description of the patches and their improvements/overheads:

1. Avoid full scan in "empty-ALL AND regular" case.  One EVERYTHING entry with unconditional recheck is used instead of multiple  ALL entries.  Overhead for rechecking NULLs and "empty-ALL" keys is introduced.  Overhead of merging ALL-lists for multicolumn indexes is eliminated.

2. Fix overhead of rechecking NULLs.  Returned back overhead of merging NOT_NULL-lists for multicolumn indexes.

3. Fix overhead of unnecessary recheck of "empty-ALL" keys using consistentFn().  Performance of the "empty-ALL [AND empty-ALL ...]" case is now the same as on master.

4. Avoid full scan in "non-empty-ALL AND regular" case.  New variant of list-merging logic added to scanGetItem().
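
To make the cases above concrete, here is a minimal C sketch (an illustration only, not PostgreSQL code; the mode names merely mirror the real GIN_SEARCH_MODE_* constants) of the decision that patch 0001 makes: an empty-ALL key combined with a regular key no longer forces a full index scan, at the price of a forced heap recheck.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified scan-key modes (illustration only). */
typedef enum { MODE_DEFAULT, MODE_INCLUDE_EMPTY, MODE_ALL } SearchMode;

/*
 * Sketch of patch 0001's idea: decide whether the scan keys force a full
 * index scan.  If an empty-ALL key occurs together with at least one
 * regular (DEFAULT/INCLUDE_EMPTY) key, the ALL condition is not checked
 * in the index at all; a forced recheck filters heap tuples instead.
 */
static bool
need_full_scan(const SearchMode *modes, int nkeys, bool *forced_recheck)
{
    bool have_regular = false;
    bool have_empty_all = false;

    *forced_recheck = false;
    for (int i = 0; i < nkeys; i++)
    {
        if (modes[i] == MODE_ALL)
            have_empty_all = true;
        else
            have_regular = true;
    }

    if (have_empty_all && have_regular)
    {
        *forced_recheck = true;   /* check the ALL condition on the heap */
        return false;             /* the regular keys drive the scan */
    }
    /* an empty-ALL key alone still needs an EVERYTHING-style full scan */
    return have_empty_all;
}
```

Patches 0002-0004 then attack exactly the cost of that forced recheck.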

On 07.08.2019 23:32, Tom Lane wrote:
> Nikita Glukhov <n.gluhov@postgrespro.ru> writes:
>> Attached 6th version of the patches.
> I spent a bit of time looking at these.  Attached is a proposed revision
> of the 0001 patch, with some minor changes:
>
> * I didn't adopt your move of the "Non-default modes require the index
> to have placeholders" test to after the stanza handling zero-key cases.
> I think that move would be correct for 0001 as it stands, but it's far
> less clear that it's still safe after 0002/0003 or whatever variant of
> those we end up with.  We should leave that code where it is for now,
> enforcing the v1-index requirement for all non-default search modes, and
> reconsider after the dust settles.  (Or if we never do reconsider, it
> won't be a big deal --- I doubt many v0 indexes are still out there.)
Ok.

> * Rearranged the selfuncs.c logic to match ginNewScanKey better.
>
> * Cleaned up my own sloppiness in the new gin.sql test cases.
>
> I think this would be committable as it stands, except that replacing
> an ALL scan with an EVERYTHING scan could be a performance regression
> if the index contains many null items.  We need to do something about
> that before committing.
Yes, such a performance regression does exist (see the test results at the end).
And it exists not only when there are many NULLs -- the recheck overhead is
significant even in simple cases like "array @> '{}'".  This really makes
patch 0001 non-committable.

And the only thing I can offer here to fix this is patches 0002 and 0003.

> Unfortunately I'm not sold on either 0002 or 0003 as they stand;
> they seem overly complicated, I'm not convinced they're correct,
> and you haven't really provided examples showing that all this
> extra complexity is worthwhile.
Yes, they are complicated, but I tried to simplify 0002 a bit, and even divided
it into two separate patches, 0002 and 0003.  For the performance improvements,
see the test results at the end.

> In particular:
>
> * I don't really like the whole business about detecting a constant-true
> ALL condition by applying the consistentFn at this stage.  That just
> feels wrong to me: the consistentFn should be expecting some data about
> the index contents and we don't have any to give.  If it works, it's
> accidental, and probably it's fragile.
If we have no entries, then there is nothing to pass to consistentFn(), and it
should always return the same value for a given scankey.  A similar technique
is used in startScanKey(), where a fake entryRes[] is passed to it.

> Moreover, the only gain we'd get from it is maybe not having to set
> forcedRecheck, and that doesn't look to me like it would make all that
> much difference.
The forced recheck has a non-zero cost, so avoiding it makes real sense.
> * The code seems to be assuming that a zero-key ALL query is necessarily
> precisely equivalent to a NOT NULL condition.  This seems flat out wrong.
> At the very least it's possible for such a query to be constant-false,
> rather than constant-true-for-non-null-items.  Admittedly, that would
> suggest rather stupid coding of the opclass query-extract function, as
> it could have reported a constant-false condition in an optimizable way
> rather than an unoptimizable one.  But we aren't entitled to assume that
> the opclass isn't being dumb; the API says it can do this, so it can.
> I think we have to check the scankey regardless, either in the index or
> via forcedRecheck.
Yes, empty ALL queries are equivalent to NOT NULL, with or without recheck.
Patches 0001 and 0002 use an unconditional forcedRecheck.  Patch 0003 uses a
conditional recheck depending on the result of the triConsistentFn()
invocation.  I added the missing check for GIN_FALSE to eliminate
constant-false empty ALL queries.  So the empty ALL scankey is always checked
in the index, but only once, at the initialization stage.
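
That initialization-time check can be sketched as follows (an illustration with hypothetical names, not the actual PostgreSQL code; GinTernaryValue is modeled by a small enum):

```c
#include <assert.h>

/* Three-valued logic, modeling GinTernaryValue (illustration only). */
typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } Ternary;

/* Possible plans for a zero-entry ALL scankey. */
typedef enum
{
    SCAN_EMPTY,             /* constant-false: return no rows at all */
    SCAN_NOT_NULL,          /* match all non-null items, no recheck */
    SCAN_NOT_NULL_RECHECK   /* match all non-null items, recheck on heap */
} EmptyAllPlan;

/*
 * A zero-entry ALL key has no per-entry results to feed to the consistent
 * function, so triConsistentFn() returns the same value for the whole scan
 * and can be evaluated once, at scan initialization.
 */
static EmptyAllPlan
plan_empty_all_key(Ternary tri_consistent_result)
{
    switch (tri_consistent_result)
    {
        case GIN_FALSE: return SCAN_EMPTY;        /* the added check */
        case GIN_TRUE:  return SCAN_NOT_NULL;
        default:        return SCAN_NOT_NULL_RECHECK;
    }
}
```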

> * I really dislike the refcount business in 0003.  It's not clear why we
> need that or whether it's correct, and I think it would be unmaintainable
> even if it were documented (which it isn't).
>
> ISTM we could get where we need to go in a much simpler way.  A couple
> of alternative ideas:
>
> * During ginNewScanKey, separate out ALL-mode queries and don't add them
> to the scankey list immediately.  After examining all the keys, if we
> found any normal (DEFAULT or INCLUDE_EMPTY) queries, then go ahead and
> add in the ALL-mode queries so that we can check them in the index, but
> don't cause a full scan.  Otherwise, discard all the ALL-mode queries
> and emit a NOT_NULL scan key, setting forcedRecheck so that we apply the
> filtering that way.
>
> * Or we could just discard ALL-mode queries on sight, setting
> forcedRecheck always, and then emit NOT_NULL if we had any
> of those and no normal queries.
I completely rewrote this logic in patch 0004; the reference counting is no
longer needed.

Non-empty ALL keys are still added to the list immediately, but the
initialization of their hidden ALL entries is postponed, and these full-scan
entries are added only if there are no normal keys.  If there are normal keys,
then a special list-merging logic is enabled in scanGetItem() for each ALL key,
so that items matching the normal keys are emitted to the result even if they
have none of the entries required by the ALL scankeys.
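
A rough model of that merge, with hypothetical names and sorted integer item lists standing in for posting lists (lossy page-level items are deliberately out of scope of this sketch):

```c
#include <stdbool.h>

/* Three-valued logic, modeling GinTernaryValue (illustration only). */
typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } Ternary;

/*
 * Merge the item list of a normal key with the entry list of an ALL key
 * (both sorted ascending).  Items matching the normal key drive the scan;
 * presence in the ALL key's entry list only decides how the ALL condition
 * is evaluated:
 *   - item absent from the ALL entry list: the ALL condition holds
 *     trivially, no recheck needed (new logic);
 *   - item present: keep it only if the ALL key's triConsistentFn()
 *     result is not GIN_FALSE; GIN_MAYBE means recheck (ordinary logic).
 * Returns the number of emitted items.
 */
static int
merge_normal_with_all(const int *normal, int nnormal,
                      const int *all, int nall,
                      Ternary all_consistent,
                      int *out, bool *recheck)
{
    int j = 0, nout = 0;

    for (int i = 0; i < nnormal; i++)
    {
        while (j < nall && all[j] < normal[i])
            j++;
        bool in_all = (j < nall && all[j] == normal[i]);

        if (!in_all)
        {
            out[nout] = normal[i];
            recheck[nout++] = false;                     /* new logic */
        }
        else if (all_consistent != GIN_FALSE)
        {
            out[nout] = normal[i];
            recheck[nout++] = (all_consistent == GIN_MAYBE);
        }
    }
    return nout;
}
```

For the intarray example below, the normal list would hold the items containing '1' and the ALL entry list the items containing '2'.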


For example, the following intarray query
 arr @@ '1' AND arr @@ '!2'

produces two 1-entry scankeys:
 DEFAULT('1') ALL('2')       (previously there were 2 entries: '2' and ALL)

When the item lists for the entries '1' and '2' are merged, the emitted items are:
 - items having '1' and not having '2', without a forced recheck (new logic);
 - items having both '1' and '2', if triConsistentFn(ALL('2')) returns a
   non-FALSE value (ordinary logic: each item must have at least one entry
   of each scankey).


This helps to do as much work as possible in the index, and to avoid an
unnecessary recheck.


I'm not sure that the code changes in scanGetItem() are correct (especially in
the presence of lossy items), and the whole patch 0004 has not been carefully
tested, but if the performance results look interesting, I could work further
on this optimization.

> The thing that seems hard to predict here is whether it is worth tracking
> the presence of additional keys in the index to avoid a recheck in the
> heap.  It's fairly easy to imagine that for common keys, avoiding the
> recheck would be a net loss because it requires more work in the index.
> If we had statistics about key counts, which of course we don't, you
> could imagine making this decision dynamically depending on whether an
> ALL query is asking about common or uncommon keys.
>
> BTW --- any way you slice this, it seems like we'd end up with a situation
> where we never execute an ALL query against the index in the way we do
> now, meaning that the relevant code in the scanning logic is dead and
> could be removed.  Given that, we don't really need a new NOT_NULL search
> mode; we could just redefine what ALL mode actually does at execution.
> This would be good IMO, because it's not obvious what the difference
> between ALL and NOT_NULL modes is anyway.
ALL mode is still used for non-empty ALL queries that are not accompanied by
normal queries.

Simple performance test:

create table t (a int[], b int[], c int[]);

-- 1M NULLs
insert into t select NULL, NULL, NULL 
from generate_series(0, 999999) i;

-- 1M 1-element arrays
insert into t select array[i], array[i], array[i] 
from generate_series(0, 999999) i;

-- 10k 2-element arrays with common element
insert into t select array[-1,i], array[-1,i], array[-1,i] 
from generate_series(0, 9999) i;

create extension intarray;
create index on t using gin (a gin__int_ops, b gin__int_ops, c gin__int_ops);

                                       |           Query time, ms
 WHERE condition                       | master |          patches
                                       |        |  #1  |  #2  |  #3  |  #4
---------------------------------------+--------+------+------+------+------
 a @> '{}'                             |    272 |  473 |  369 |  271 |  261
 a @> '{}' and b @> '{}'               |    374 |  548 |  523 |  368 |  353
 a @> '{}' and b @> '{}' and c @> '{}' |    479 |  602 |  665 |  461 |  446
 a @> '{}' and a @@ '1'                |   52.2 |  0.4 |  0.4 |  0.4 |  0.4
 a @> '{}' and a @@ '-1'               |   56.2 |  4.0 |  4.0 |  2.3 |  2.3
 a @@ '!-1' and a @@ '1'               |   52.8 | 53.0 | 52.7 | 52.9 |  0.3
 a @@ '!1' and a @@ '-1'               |   54.9 | 55.2 | 55.1 | 55.3 |  2.4

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company