Re: regression coverage gaps for gist and hash indexes - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: regression coverage gaps for gist and hash indexes
Msg-id 69ee05f7-8168-cce5-7f2d-d887f0aaf8bd@gmail.com
In response to Re: regression coverage gaps for gist and hash indexes  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
Hi,

31.03.2023 17:00, Alexander Lakhin wrote:
> 31.03.2023 15:55, Tom Lane wrote:
>> See also the thread about bug #16329 [1]. Alexander promised to look
>> into improving the test coverage in this area, maybe he can keep an
>> eye on the WAL logic coverage too.
>
> Yes, I'm going to analyze that area too. Maybe it'll take more time
> (a week or two) if I encounter some bugs there (for now I observe anomalies
> with gist__int_ops), but I will definitely try to improve the gist testing.

After 2+ weeks of research, I'd like to summarize my findings.
1) The checking query proposed in [1] could be improved by adding
the restriction "tgk.v = brute.v" to the condition:
WHERE tgk.k >> point(brute.min - 1, 0) AND tgk.k << point(brute.max + 1, 0)
Otherwise that query gives a false positive after
insert into test_gist_killtuples values(point(505, 0));

A similarly improved condition could be used in hash_index_killtuples.sql.

Another improvement to the checking query would be the replacement:
min(k <-> point(0, 0)), max(k <-> point(0, 0)) ->
min(k <-> point(0, k[1])), max(k <-> point(0, k[1])) ...

It doesn't change the query plan dramatically, but the query becomes more
universal (it would work for points with any non-negative integer x).
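
To illustrate why the replacement makes the query more general: k <-> point(0, k[1]) is the distance from k to the point on the y axis at the same height, which is simply |x|, whereas k <-> point(0, 0) equals x only for points lying on the x axis. Below is a minimal Python sketch of that arithmetic. It assumes <-> is the Euclidean distance and that k[1] is the y coordinate; the helper names are illustrative and are not part of the proposed test.

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) tuples, like point <-> point."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def dist_to_origin(k):
    # Original expression: k <-> point(0, 0)
    return dist(k, (0.0, 0.0))


def dist_to_y_axis(k):
    # Proposed expression: k <-> point(0, k[1])
    return dist(k, (0.0, k[1]))


on_axis = (505.0, 0.0)   # y = 0: both expressions agree
off_axis = (3.0, 4.0)    # y != 0: only the proposed one yields x

print(dist_to_origin(on_axis), dist_to_y_axis(on_axis))    # 505.0 505.0
print(dist_to_origin(off_axis), dist_to_y_axis(off_axis))  # 5.0 3.0
```

For points with y = 0 the two forms coincide, so the existing test data is unaffected.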

2) I've checked clang's scan-build reports related to gist, as I planned [2],
namely:
Logic error    Branch condition evaluates to a garbage value    src/backend/access/gist/gistutil.c    gistCompressValues    606
Logic error    Dereference of null pointer    src/backend/access/gist/gist.c    gistFindCorrectParent    1099
Logic error    Dereference of null pointer    src/backend/access/gist/gist.c    gistdoinsert    671
Logic error    Dereference of null pointer    src/backend/access/gist/gist.c    gistfinishsplit    1339
Logic error    Dereference of null pointer    src/backend/access/gist/gist.c    gistplacetopage    340
Logic error    Dereference of null pointer    src/backend/access/gist/gistbuildbuffers.c    gistPushItupToNodeBuffer    366
Logic error    Result of operation is garbage or undefined    src/backend/access/gist/gistbuildbuffers.c    gistRelocateBuildBuffersOnSplit    677
Logic error    Result of operation is garbage or undefined    src/backend/access/gist/gistutil.c    gistchoose    463
Unused code    Dead assignment    src/backend/access/gist/gist.c    gistdoinsert    843

I found that all of them (except for the last one, which isn't worth
fixing, IMO) are false positives (I can present detailed explanations if
anyone is interested). So I see no grounds here for building new tests.

3) So far I have found other anomalies more or less related to gist:
- fillfactor is ignored in sorted index build mode, which is effectively the default now [3];
- amcheck verification for gist is not yet ready to use [4] (and the latest patch doesn't apply to the current HEAD);
- bug #17888: Incorrect memory access in gist__int_ops for an input array with many elements [5].

4) I've constructed some tests that provide full coverage for
gistFindCorrectParent(), reach the "very rare situation" code, and cover
gistfixsplit(), but all these tests execute concurrent queries, so they
can't be implemented as simple regression tests. Moreover, I could not find
any explicit problems when reaching those places (I used the checking query
from [1] in the absence of other means to check gist indexes), so I see no
value in developing (let alone committing) these tests for now. I'm going to
explore the gist behavior in those dark corners further, but it looks like
a long-term task, so I think it shouldn't delay the gist coverage improvement
already proposed.

5)
02.04.2023 20:50, Andres Freund wrote:
> Looks like the test in [1] could be made a lot cheaper by changing effective_cache_size
> for just that test:
effective_cache_size is taken into account only when buffering = auto, but
that test uses buffering = on, so changing it wouldn't help there.

While looking at gist-related tests, I've noticed an incorrect comment
in index_including_gist.sql:
  * 1.1. test CREATE INDEX with buffered build

It's incorrect because with the default effective_cache_size the
buffered build mode is not enabled for an index of that size (I've confirmed
this with an elog(LOG,..) placed inside gistInitBuffering()).

So I'd like to propose the attached patch, which:
a) demonstrates bug #16329:
With 8e5eef50c reverted, I get:
**00:00:00:11.179 1587838** Valgrind detected 1 error(s) during execution of "CREATE INDEX tbl_gist_idx ON tbl_gist using gist (c4) INCLUDE (c1,c2,c3) WITH (buffering = on);"
b) makes the comment in index_including_gist.sql correct
c) slightly increases the visible test coverage, in particular:
  Function 'gistBuffersReleaseBlock'
-Lines executed:66.67% of 9
+Lines executed:100.00% of 9
d) doesn't increase the test duration significantly:
without valgrind I see 84 ms -> 93 ms; under valgrind, 13513 ms -> 14511 ms
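
As a rough check of item d), the relative slowdown implied by these timings can be computed directly. The helper below is purely illustrative (its name is an assumption, not something from the patch) and uses only the four numbers quoted above.

```python
def relative_increase(before_ms, after_ms):
    """Return the slowdown as a percentage of the original duration."""
    return (after_ms - before_ms) / before_ms * 100.0


plain = relative_increase(84, 93)            # timings without valgrind
valgrind = relative_increase(13513, 14511)   # timings under valgrind

print(f"without valgrind: +{plain:.1f}%")     # +10.7%
print(f"under valgrind:   +{valgrind:.1f}%")  # +7.4%
```

Both figures come from single runs, so they should be read as an order-of-magnitude estimate rather than a precise measurement.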

Thus, I think it's worth splitting the gist testing work into
finalizing/accepting the already-emerging patches and
background research/anomaly hunting, which could inspire further
enhancements in this area.

[1] https://www.postgresql.org/message-id/20230331231300.4kkrl44usvy2pmkv%40awork3.anarazel.de
[2] https://www.postgresql.org/message-id/cad7055f-0d76-cc31-71d5-f8b600ebb116%40gmail.com
[3] https://www.postgresql.org/message-id/fbbfe5dc-3dfa-d54a-3a94-e2bee37b85d8%40gmail.com
[4] https://www.postgresql.org/message-id/885cfb61-26e9-e7c1-49a8-02b3fb12b497%40gmail.com
[5] https://www.postgresql.org/message-id/17888-f72930e6b5ce8c14@postgresql.org

Best regards,
Alexander