Re: Avoid full GIN index scan when possible - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Avoid full GIN index scan when possible |
Date | |
Msg-id | 20200106152155.uuungi2p776p5lbk@development Whole thread Raw |
In response to | Re: Avoid full GIN index scan when possible (Nikita Glukhov <n.gluhov@postgrespro.ru>) |
Responses |
Re: Avoid full GIN index scan when possible
|
List | pgsql-hackers |
On Fri, Dec 27, 2019 at 04:36:14AM +0300, Nikita Glukhov wrote: >On 26.12.2019 4:59, Alexander Korotkov wrote: >> >> I've tried to add patch #4 to comparison, but I've catch assertion >>failure. >> >>TRAP: FailedAssertion("key->includeNonMatching", File: "ginget.c", >>Line: 1340) >There simply should be inverted condition in the assertion: >Assert(!key->includeNonMatching); > >I have looked at v9 patch, and here is my review: > >1. I agree with NULL-flag handling simplifications in ginNewScanKey(), >ginScanKeyAddHiddenEntry() extraction. > >2. I also agree that usage of nrequired/nadditional in keyGetItem() is a more >natural solution to implement exclusion keys than my previous attempt of doing >that in scanGetKey(). > >But there are some questions: > >Can we avoid referencing excludeOnly flag keyGetItem() by replacing these >references with !nrequired? > >Maybe it would be better to move the whole block of keyGetItem() code >starting from the first loop over required keys and ending before the loop over >additional keys inside 'if (key->nrequired) { ... }'? > >Can we avoid introducing excludeOnly flag by reusing searchMode and/or by >moving the initialization of nrequired/nadditional into ginNewScanKey()? > > >3. The following two times repeated NULL-filtering check looks too complicated >and needs to be refactored somehow: > >- res = key->triConsistentFn(key); >+ if (key->excludeOnly && >+ key->nuserentries < key->nentries && >+ key->scanEntry[key->nuserentries]->queryCategory == GIN_CAT_NULL_KEY && >+ key->entryRes[key->nuserentries] == GIN_TRUE) >+ res = GIN_FALSE; >+ else >+ res = key->triConsistentFn(key); > >For example, a special consistentFn() can be introduced for such NOT_NULL >scankeys. Or even a hidden separate one-entry scankey with a trivial >consistentFn() can be added instead of adding hidden entry. > > >4. forcedRecheck flag that was previously used for discarded empty ALL scankeys >is removed now. 0-entry exclusion keys can appear instead, and their >consistentFn() simply returns constant value. Could this lead to tangible >overhead in some cases (in comparison to forcedRecheck flag)? > > >5. A hidden GIN_CAT_EMPTY_QUERY is added only for the first empty ALL-scankey, >NULLs in other columns are filtered out with GIN_CAT_NULL_KEY. This looks like >asymmetric, and it leads to accelerations is some cases and slowdowns in others >(depending on NULL fractions and their correlations in columns). > >The following test shows a significant performance regression of v9: > >insert into t select array[i], NULL, NULL from generate_series(1, 1000000) i; > > | Query time, ms > WHERE condition | master | v8 | v9 >---------------------------------------+--------+--------+--------- > a @> '{}' | 224 | 213 | 212 > a @> '{}' and b @> '{}' | 52 | 57 | 255 > a @> '{}' and b @> '{}' and c @> '{}' | 51 | 58 | 290 > > >In the older version of the patch I tried to do the similar things (initialize >only one NOT_NULL entry for the first column), but refused to do this in v8. > >So, to avoid slowdowns relative to master, I can offer simply to add >GIN_CAT_EMPTY_QUERY entry for each column with empty ALL-keys if there are >no normal keys. > Yeah, I can confirm those results, although on my system the timings are a bit different (I haven't tested v8): | Query time, ms WHERE condition | master | v9 ---------------------------------------+--------+--------- a @> '{}' | 610 | 589 a @> '{}' and b @> '{}' | 185 | 665 a @> '{}' and b @> '{}' and c @> '{}' | 185 | 741 So that's something we probably need to address, perhaps by using the GIN_CAT_EMPTY_QUERY entries as proposed. I've also tested this on a database storing mailing lists archives with a trigram index, and in that case the performance with short values gets much better. The "messages" table has two text fields with a GIN trigram index - subject and body, and querying them with short/long values works like this: WHERE | master | v9 -------------------------------------------------------------- subject LIKE '%aa%' AND body LIKE '%xx%' | 4943 | 4052 subject LIKE '%aaa%' AND body LIKE '%xx%' | 10 | 10 subject LIKE '%aa%' AND body LIKE '%xxx%' | 380 | 13 subject LIKE '%aaa%' AND BODY LIKE '%xxx%' | 2 | 2 which seems fairly nice. I've done tests with individual columns, and that seems to be working fine too. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: