Re: Avoid full GIN index scan when possible - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoid full GIN index scan when possible
Date
Msg-id 20200106152155.uuungi2p776p5lbk@development
Whole thread Raw
In response to Re: Avoid full GIN index scan when possible  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
Responses Re: Avoid full GIN index scan when possible  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-hackers
On Fri, Dec 27, 2019 at 04:36:14AM +0300, Nikita Glukhov wrote:
>On 26.12.2019 4:59, Alexander Korotkov wrote:
>>
>> I've tried to add patch #4 to comparison, but I've catch assertion 
>>failure.
>>
>>TRAP: FailedAssertion("key->includeNonMatching", File: "ginget.c", 
>>Line: 1340)
>There simply should be inverted condition in the assertion:
>Assert(!key->includeNonMatching);
>
>I have looked at v9 patch, and here is my review:
>
>1. I agree with NULL-flag handling simplifications in ginNewScanKey(),
>ginScanKeyAddHiddenEntry() extraction.
>
>2. I also agree that usage of nrequired/nadditional in keyGetItem() is a more
>natural solution to implement exclusion keys than my previous attempt of doing
>that in scanGetKey().
>
>But there are some questions:
>
>Can we avoid referencing excludeOnly flag keyGetItem() by replacing these
>references with !nrequired?
>
>Maybe it would be better to move the whole block of keyGetItem() code
>starting from the first loop over required keys and ending before the loop over
>additional keys inside 'if (key->nrequired) { ... }'?
>
>Can we avoid introducing excludeOnly flag by reusing searchMode and/or by
>moving the initialization of nrequired/nadditional into ginNewScanKey()?
>
>
>3. The following two times repeated NULL-filtering check looks too complicated
>and needs to be refactored somehow:
>
>-    res = key->triConsistentFn(key);
>+    if (key->excludeOnly &&
>+        key->nuserentries < key->nentries &&
>+        key->scanEntry[key->nuserentries]->queryCategory == GIN_CAT_NULL_KEY &&
>+        key->entryRes[key->nuserentries] == GIN_TRUE)
>+        res = GIN_FALSE;
>+    else
>+        res = key->triConsistentFn(key);
>
>For example, a special consistentFn() can be introduced for such NOT_NULL
>scankeys.  Or even a hidden separate one-entry scankey with a trivial
>consistentFn() can be added instead of adding hidden entry.
>
>
>4. forcedRecheck flag that was previously used for discarded empty ALL scankeys
>is removed now.  0-entry exclusion keys can appear instead, and their
>consistentFn() simply returns constant value.  Could this lead to tangible
>overhead in some cases (in comparison to forcedRecheck flag)?
>
>
>5. A hidden GIN_CAT_EMPTY_QUERY is added only for the first empty ALL-scankey,
>NULLs in other columns are filtered out with GIN_CAT_NULL_KEY.  This looks like
>asymmetric, and it leads to accelerations is some cases and slowdowns in others
>(depending on NULL fractions and their correlations in columns).
>
>The following test shows a significant performance regression of v9:
>
>insert into t select array[i], NULL, NULL from generate_series(1, 1000000) i;
>
>                                       |        Query time, ms
>            WHERE condition            | master |   v8   |    v9
>---------------------------------------+--------+--------+---------
> a @> '{}'                             |    224 |    213 |    212
> a @> '{}' and b @> '{}'               |     52 |     57 |    255
> a @> '{}' and b @> '{}' and c @> '{}' |     51 |     58 |    290
>
>
>In the older version of the patch I tried to do the similar things (initialize
>only one NOT_NULL entry for the first column), but refused to do this in v8.
>
>So, to avoid slowdowns relative to master, I can offer simply to add
>GIN_CAT_EMPTY_QUERY entry for each column with empty ALL-keys if there are
>no normal keys.
>

Yeah, I can confirm those results, although on my system the timings are
a bit different (I haven't tested v8):

                                        |        Query time, ms
             WHERE condition            | master |    v9
---------------------------------------+--------+---------
  a @> '{}'                             |    610 |    589
  a @> '{}' and b @> '{}'               |    185 |    665
  a @> '{}' and b @> '{}' and c @> '{}' |    185 |    741

So that's something we probably need to address, perhaps by using the
GIN_CAT_EMPTY_QUERY entries as proposed.

I've also tested this on a database storing mailing lists archives with
a trigram index, and in that case the performance with short values gets
much better. The "messages" table has two text fields with a GIN trigram
index - subject and body, and querying them with short/long values works
like this:

                    WHERE                    |  master  |    v9
  --------------------------------------------------------------
  subject LIKE '%aa%' AND body LIKE '%xx%'   |    4943  |  4052
  subject LIKE '%aaa%' AND body LIKE '%xx%'  |      10  |    10
  subject LIKE '%aa%' AND body LIKE '%xxx%'  |     380  |    13
  subject LIKE '%aaa%' AND BODY LIKE '%xxx%' |       2  |     2

which seems fairly nice. I've done tests with individual columns, and
that seems to be working fine too.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pgsql: Add basic TAP tests for psql's tab-completion logic.
Next
From: "Daniel Verite"
Date:
Subject: Re: Unicode normalization SQL functions