Re: Avoid full GIN index scan when possible - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Avoid full GIN index scan when possible |
Date | |
Msg-id | 19438.1565209940@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Avoid full GIN index scan when possible (Nikita Glukhov <n.gluhov@postgrespro.ru>) |
Responses |
Re: Avoid full GIN index scan when possible
Re: Avoid full GIN index scan when possible |
List | pgsql-hackers |
Nikita Glukhov <n.gluhov@postgrespro.ru> writes: > Attached 6th version of the patches. I spent a bit of time looking at these. Attached is a proposed revision of the 0001 patch, with some minor changes: * I didn't adopt your move of the "Non-default modes require the index to have placeholders" test to after the stanza handling zero-key cases. I think that move would be correct for 0001 as it stands, but it's far less clear that it's still safe after 0002/0003 or whatever variant of those we end up with. We should leave that code where it is for now, enforcing the v1-index requirement for all non-default search modes, and reconsider after the dust settles. (Or if we never do reconsider, it won't be a big deal --- I doubt many v0 indexes are still out there.) * Rearranged the selfuncs.c logic to match ginNewScanKey better. * Cleaned up my own sloppiness in the new gin.sql test cases. I think this would be committable as it stands, except that replacing an ALL scan with an EVERYTHING scan could be a performance regression if the index contains many null items. We need to do something about that before committing. Unfortunately I'm not sold on either 0002 or 0003 as they stand; they seem overly complicated, I'm not convinced they're correct, and you haven't really provided examples showing that all this extra complexity is worthwhile. In particular: * I don't really like the whole business about detecting a constant-true ALL condition by applying the consistentFn at this stage. That just feels wrong to me: the consistentFn should be expecting some data about the index contents and we don't have any to give. If it works, it's accidental, and probably it's fragile. Moreover, the only gain we'd get from it is maybe not having to set forcedRecheck, and that doesn't look to me like it would make all that much difference. * The code seems to be assuming that a zero-key ALL query is necessarily precisely equivalent to a NOT NULL condition. This seems flat out wrong. At the very least it's possible for such a query to be constant-false, rather than constant-true-for-non-null-items. Admittedly, that would suggest rather stupid coding of the opclass query-extract function, as it could have reported a constant-false condition in an optimizable way rather than an unoptimizable one. But we aren't entitled to assume that the opclass isn't being dumb; the API says it can do this, so it can. I think we have to check the scankey regardless, either in the index or via forcedRecheck. * I really dislike the refcount business in 0003. It's not clear why we need that or whether it's correct, and I think it would be unmaintainable even if it were documented (which it isn't). ISTM we could get where we need to go in a much simpler way. A couple of alternative ideas: * During ginNewScanKey, separate out ALL-mode queries and don't add them to the scankey list immediately. After examining all the keys, if we found any normal (DEFAULT or INCLUDE_EMPTY) queries, then go ahead and add in the ALL-mode queries so that we can check them in the index, but don't cause a full scan. Otherwise, discard all the ALL-mode queries and emit a NOT_NULL scan key, setting forcedRecheck so that we apply the filtering that way. * Or we could just discard ALL-mode queries on sight, setting forcedRecheck always, and then emit NOT_NULL if we had any of those and no normal queries. The thing that seems hard to predict here is whether it is worth tracking the presence of additional keys in the index to avoid a recheck in the heap. It's fairly easy to imagine that for common keys, avoiding the recheck would be a net loss because it requires more work in the index. If we had statistics about key counts, which of course we don't, you could imagine making this decision dynamically depending on whether an ALL query is asking about common or uncommon keys. BTW --- any way you slice this, it seems like we'd end up with a situation where we never execute an ALL query against the index in the way we do now, meaning that the relevant code in the scanning logic is dead and could be removed. Given that, we don't really need a new NOT_NULL search mode; we could just redefine what ALL mode actually does at execution. This would be good IMO, because it's not obvious what the difference between ALL and NOT_NULL modes is anyway. regards, tom lane diff --git a/contrib/pg_trgm/expected/pg_trgm.out b/contrib/pg_trgm/expected/pg_trgm.out index b3e709f..3e5ba9b 100644 --- a/contrib/pg_trgm/expected/pg_trgm.out +++ b/contrib/pg_trgm/expected/pg_trgm.out @@ -3498,6 +3498,68 @@ select count(*) from test_trgm where t ~ '[qwerty]{2}-?[qwerty]{2}'; 1000 (1 row) +-- check handling of indexquals that generate no searchable conditions +explain (costs off) +select count(*) from test_trgm where t like '%99%' and t like '%qwerty%'; + QUERY PLAN +----------------------------------------------------------------------------- + Aggregate + -> Bitmap Heap Scan on test_trgm + Recheck Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qwerty%'::text)) + -> Bitmap Index Scan on trgm_idx + Index Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qwerty%'::text)) +(5 rows) + +select count(*) from test_trgm where t like '%99%' and t like '%qwerty%'; + count +------- + 19 +(1 row) + +explain (costs off) +select count(*) from test_trgm where t like '%99%' and t like '%qw%'; + QUERY PLAN +------------------------------------------------------------------------- + Aggregate + -> Bitmap Heap Scan on test_trgm + Recheck Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qw%'::text)) + -> Bitmap Index Scan on trgm_idx + Index Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qw%'::text)) +(5 rows) + +select count(*) from test_trgm where t like '%99%' and t like '%qw%'; + count +------- + 19 +(1 row) + +-- ensure that pending-list items are handled correctly, too +create temp table t_test_trgm(t text COLLATE "C"); +create index t_trgm_idx on t_test_trgm using gin (t gin_trgm_ops); +insert into t_test_trgm values ('qwerty99'), ('qwerty01'); +explain (costs off) +select count(*) from t_test_trgm where t like '%99%' and t like '%qwerty%'; + QUERY PLAN +----------------------------------------------------------------------------- + Aggregate + -> Bitmap Heap Scan on t_test_trgm + Recheck Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qwerty%'::text)) + -> Bitmap Index Scan on t_trgm_idx + Index Cond: ((t ~~ '%99%'::text) AND (t ~~ '%qwerty%'::text)) +(5 rows) + +select count(*) from t_test_trgm where t like '%99%' and t like '%qwerty%'; + count +------- + 1 +(1 row) + +select count(*) from t_test_trgm where t like '%99%' and t like '%qw%'; + count +------- + 1 +(1 row) + create table test2(t text COLLATE "C"); insert into test2 values ('abcdef'); insert into test2 values ('quark'); diff --git a/contrib/pg_trgm/sql/pg_trgm.sql b/contrib/pg_trgm/sql/pg_trgm.sql index 08459e6..dcfd3c2 100644 --- a/contrib/pg_trgm/sql/pg_trgm.sql +++ b/contrib/pg_trgm/sql/pg_trgm.sql @@ -55,6 +55,22 @@ select t,similarity(t,'gwertyu0988') as sml from test_trgm where t % 'gwertyu098 select t,similarity(t,'gwertyu1988') as sml from test_trgm where t % 'gwertyu1988' order by sml desc, t; select count(*) from test_trgm where t ~ '[qwerty]{2}-?[qwerty]{2}'; +-- check handling of indexquals that generate no searchable conditions +explain (costs off) +select count(*) from test_trgm where t like '%99%' and t like '%qwerty%'; +select count(*) from test_trgm where t like '%99%' and t like '%qwerty%'; +explain (costs off) +select count(*) from test_trgm where t like '%99%' and t like '%qw%'; +select count(*) from test_trgm where t like '%99%' and t like '%qw%'; +-- ensure that pending-list items are handled correctly, too +create temp table t_test_trgm(t text COLLATE "C"); +create index t_trgm_idx on t_test_trgm using gin (t gin_trgm_ops); +insert into t_test_trgm values ('qwerty99'), ('qwerty01'); +explain (costs off) +select count(*) from t_test_trgm where t like '%99%' and t like '%qwerty%'; +select count(*) from t_test_trgm where t like '%99%' and t like '%qwerty%'; +select count(*) from t_test_trgm where t like '%99%' and t like '%qw%'; + create table test2(t text COLLATE "C"); insert into test2 values ('abcdef'); insert into test2 values ('quark'); diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c index b18ae2b..65ed8b2 100644 --- a/src/backend/access/gin/ginget.c +++ b/src/backend/access/gin/ginget.c @@ -1814,7 +1814,7 @@ scanPendingInsert(IndexScanDesc scan, TIDBitmap *tbm, int64 *ntids) * consistent functions. */ oldCtx = MemoryContextSwitchTo(so->tempCtx); - recheck = false; + recheck = so->forcedRecheck; match = true; for (i = 0; i < so->nkeys; i++) @@ -1888,9 +1888,14 @@ gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm) { CHECK_FOR_INTERRUPTS(); + /* Get next item ... */ if (!scanGetItem(scan, iptr, &iptr, &recheck)) break; + /* ... apply forced recheck if required ... */ + recheck |= so->forcedRecheck; + + /* ... and transfer it into bitmap */ if (ItemPointerIsLossyPage(&iptr)) tbm_add_page(tbm, ItemPointerGetBlockNumber(&iptr)); else diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c index 74d9821..7b8de10 100644 --- a/src/backend/access/gin/ginscan.c +++ b/src/backend/access/gin/ginscan.c @@ -286,6 +286,7 @@ ginNewScanKey(IndexScanDesc scan) palloc(so->allocentries * sizeof(GinScanEntry)); so->isVoidRes = false; + so->forcedRecheck = false; for (i = 0; i < scan->numberOfKeys; i++) { @@ -333,16 +334,28 @@ ginNewScanKey(IndexScanDesc scan) if (searchMode != GIN_SEARCH_MODE_DEFAULT) hasNullQuery = true; - /* - * In default mode, no keys means an unsatisfiable query. - */ + /* Special cases for queries that contain no keys */ if (queryValues == NULL || nQueryValues <= 0) { if (searchMode == GIN_SEARCH_MODE_DEFAULT) { + /* In default mode, no keys means an unsatisfiable query */ so->isVoidRes = true; break; } + else if (searchMode == GIN_SEARCH_MODE_ALL) + { + /* + * The query probably matches all non-null items, but rather + * than scanning the index in ALL mode, we use forced rechecks + * to verify matches of this scankey. This wins if there are + * any non-ALL scankeys; otherwise we end up adding an + * EVERYTHING scankey below. + */ + so->forcedRecheck = true; + continue; + } + nQueryValues = 0; /* ensure sane value */ } diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 7eba59e..5e0a922 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -6320,10 +6320,24 @@ gincost_pattern(IndexOptInfo *index, int indexcol, PointerGetDatum(&nullFlags), PointerGetDatum(&searchMode)); - if (nentries <= 0 && searchMode == GIN_SEARCH_MODE_DEFAULT) + /* Special cases for queries that contain no keys */ + if (nentries <= 0) { - /* No match is possible */ - return false; + if (searchMode == GIN_SEARCH_MODE_DEFAULT) + { + /* In default mode, no keys means an unsatisfiable query */ + return false; + } + else if (searchMode == GIN_SEARCH_MODE_ALL) + { + /* + * ginNewScanKey() does not emit scankeys for a key-less ALL + * query. Instead it will emit an EVERYTHING key, but only if + * there are no other regular keys. We don't know that yet, so + * postpone setting the haveFullScan flag. + */ + return true; + } } for (i = 0; i < nentries; i++) @@ -6485,6 +6499,10 @@ gincost_scalararrayopexpr(PlannerInfo *root, /* We ignore array elements that are unsatisfiable patterns */ numPossible++; + /* If no regular scan keys, assume an EVERYTHING scan is needed */ + if (elemcounts.searchEntries == 0) + elemcounts.haveFullScan = true; + if (elemcounts.haveFullScan) { /* @@ -6709,6 +6727,10 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, return; } + /* If no regular scan keys, assume an EVERYTHING scan is needed */ + if (counts.searchEntries == 0) + counts.haveFullScan = true; + if (counts.haveFullScan || indexQuals == NIL) { /* diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h index 78fcd82..9d2ee3a 100644 --- a/src/include/access/gin_private.h +++ b/src/include/access/gin_private.h @@ -359,6 +359,7 @@ typedef struct GinScanOpaqueData MemoryContext keyCtx; /* used to hold key and entry data */ bool isVoidRes; /* true if query is unsatisfiable */ + bool forcedRecheck; /* must recheck all returned tuples */ } GinScanOpaqueData; typedef GinScanOpaqueData *GinScanOpaque; diff --git a/src/test/regress/expected/gin.out b/src/test/regress/expected/gin.out index a3911a6..5ba96fa 100644 --- a/src/test/regress/expected/gin.out +++ b/src/test/regress/expected/gin.out @@ -1,7 +1,7 @@ -- -- Test GIN indexes. -- --- There are other tests to test different GIN opclassed. This is for testing +-- There are other tests to test different GIN opclasses. This is for testing -- GIN itself. -- Create and populate a test table with a GIN index. create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off); @@ -35,3 +35,32 @@ insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 1000) g; insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g; delete from gin_test_tbl where i @> array[2]; vacuum gin_test_tbl; +-- Test optimization of empty queries +create temp table t_gin_test_tbl(i int4[], j int4[]); +create index on t_gin_test_tbl using gin (i, j); +insert into t_gin_test_tbl select array[100,g], array[200,g] +from generate_series(1, 10) g; +insert into t_gin_test_tbl values(array[0,0], null); +set enable_seqscan = off; +explain (costs off) +select * from t_gin_test_tbl where array[0] <@ i; + QUERY PLAN +--------------------------------------------------- + Bitmap Heap Scan on t_gin_test_tbl + Recheck Cond: ('{0}'::integer[] <@ i) + -> Bitmap Index Scan on t_gin_test_tbl_i_j_idx + Index Cond: (i @> '{0}'::integer[]) +(4 rows) + +select * from t_gin_test_tbl where array[0] <@ i; + i | j +-------+--- + {0,0} | +(1 row) + +select * from t_gin_test_tbl where array[0] <@ i and '{}'::int4[] <@ j; + i | j +---+--- +(0 rows) + +reset enable_seqscan; diff --git a/src/test/regress/sql/gin.sql b/src/test/regress/sql/gin.sql index c566e9b..f98fb7e 100644 --- a/src/test/regress/sql/gin.sql +++ b/src/test/regress/sql/gin.sql @@ -1,7 +1,7 @@ -- -- Test GIN indexes. -- --- There are other tests to test different GIN opclassed. This is for testing +-- There are other tests to test different GIN opclasses. This is for testing -- GIN itself. -- Create and populate a test table with a GIN index. @@ -34,3 +34,16 @@ insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g; delete from gin_test_tbl where i @> array[2]; vacuum gin_test_tbl; + +-- Test optimization of empty queries +create temp table t_gin_test_tbl(i int4[], j int4[]); +create index on t_gin_test_tbl using gin (i, j); +insert into t_gin_test_tbl select array[100,g], array[200,g] +from generate_series(1, 10) g; +insert into t_gin_test_tbl values(array[0,0], null); +set enable_seqscan = off; +explain (costs off) +select * from t_gin_test_tbl where array[0] <@ i; +select * from t_gin_test_tbl where array[0] <@ i; +select * from t_gin_test_tbl where array[0] <@ i and '{}'::int4[] <@ j; +reset enable_seqscan;
pgsql-hackers by date: