Thread: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering

BUG #16329: Valgrind detects an invalid read when building a gist index with buffering

From: PG Bug reporting form

The following bug has been logged on the website:

Bug reference:      16329
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 12.2
Operating system:   Ubuntu 18.04
Description:

Running the following query under valgrind:
create table gist_point_tbl(id int4, p point);
insert into gist_point_tbl (id, p)
select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g;
create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering
= on, fillfactor=50);

leads to a memory access error:
==00:00:00:08.755 18585== Invalid read of size 2
==00:00:00:08.756 18585==    at 0x1F94CA: gistBuildCallback
(gistbuild.c:495)
==00:00:00:08.756 18585==    by 0x21F556: heapam_index_build_range_scan
(heapam_handler.c:1664)
==00:00:00:08.756 18585==    by 0x1F972F: table_index_build_scan
(tableam.h:1522)
==00:00:00:08.756 18585==    by 0x1F972F: gistbuild (gistbuild.c:196)
==00:00:00:08.756 18585==    by 0x29D1FE: index_build (index.c:2909)
==00:00:00:08.756 18585==    by 0x29EAC8: index_create (index.c:1221)
==00:00:00:08.756 18585==    by 0x348C60: DefineIndex (indexcmds.c:1005)
==00:00:00:08.756 18585==    by 0x52B518: ProcessUtilitySlow
(utility.c:1372)
==00:00:00:08.756 18585==    by 0x52AAC5: standard_ProcessUtility
(utility.c:927)
==00:00:00:08.756 18585==    by 0x52AB73: ProcessUtility (utility.c:360)
==00:00:00:08.756 18585==    by 0x527025: PortalRunUtility (pquery.c:1175)
==00:00:00:08.756 18585==    by 0x527C6F: PortalRunMulti (pquery.c:1321)
==00:00:00:08.756 18585==    by 0x5289DE: PortalRun (pquery.c:796)
==00:00:00:08.756 18585==  Address 0x8a4ca1e is 398 bytes inside a block of
size 8,192 alloc'd
==00:00:00:08.756 18585==    at 0x4C2FBFF: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:08.756 18585==    by 0x6781E6: AllocSetContextCreateInternal
(aset.c:477)
==00:00:00:08.756 18585==    by 0x3BF065: ExecInitFunctionScan
(nodeFunctionscan.c:508)
==00:00:00:08.756 18585==    by 0x3ACE6F: ExecInitNode
(execProcnode.c:247)
==00:00:00:08.756 18585==    by 0x3D03E1: ExecInitModifyTable
(nodeModifyTable.c:2363)
==00:00:00:08.756 18585==    by 0x3ACD4C: ExecInitNode
(execProcnode.c:174)
==00:00:00:08.756 18585==    by 0x3A70F3: InitPlan (execMain.c:1020)
==00:00:00:08.756 18585==    by 0x3A7299: standard_ExecutorStart
(execMain.c:266)
==00:00:00:08.756 18585==    by 0x3A739B: ExecutorStart (execMain.c:148)
==00:00:00:08.756 18585==    by 0x527936: ProcessQuery (pquery.c:156)
==00:00:00:08.756 18585==    by 0x527B34: PortalRunMulti (pquery.c:1283)
==00:00:00:08.756 18585==    by 0x5289DE: PortalRun (pquery.c:796)

Reproduced at least on REL_10_STABLE..master.


30.03.2020 00:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      16329
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 12.2
> Operating system:   Ubuntu 18.04
> Description:        
>
> Running the following query under valgrind:
> create table gist_point_tbl(id int4, p point);
> insert into gist_point_tbl (id, p)
> select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g;
> create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering
> = on, fillfactor=50);
>
> leads to a memory access error:
> ==00:00:00:08.755 18585== Invalid read of size 2
> ==00:00:00:08.756 18585==    at 0x1F94CA: gistBuildCallback
> (gistbuild.c:495)
I've found that the memory context where the itup is created:

    oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);

    /* form an index tuple and point it at the heap tuple */
    itup = gistFormTuple(buildstate->giststate, index, values, isnull, true);

can be reset in gistBufferingBuildInsert -> gistProcessEmptyingQueue:

    /* Free all the memory allocated during index tuple processing */
    MemoryContextReset(buildstate->giststate->tempCxt);

Probably, the offending commit is d22a09dc.
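
To make the hazard concrete, here is a reduced sketch of the pattern
(condensed from gistbuild.c; not the literal source):

    oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);

    /* itup is allocated in tempCxt */
    itup = gistFormTuple(buildstate->giststate, index, values, isnull, true);
    itup->t_tid = *tid;

    /*
     * This can reach gistProcessEmptyingQueue(), which may do
     * MemoryContextReset(buildstate->giststate->tempCxt), freeing itup.
     */
    gistBufferingBuildInsert(buildstate, itup);

    /*
     * Invalid read: IndexTupleSize() fetches the 2-byte t_info field of a
     * tuple that may already be gone -- consistent with the "Invalid read
     * of size 2" reported above.
     */
    buildstate->indtuples += 1;
    buildstate->indtuplesSize += IndexTupleSize(itup);

    MemoryContextSwitchTo(oldCtx);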




Hello hackers,
30.03.2020 08:00, Alexander Lakhin wrote:
> 30.03.2020 00:00, PG Bug reporting form wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      16329
>> Logged by:          Alexander Lakhin
>> Email address:      exclusion@gmail.com
>> PostgreSQL version: 12.2
>> Operating system:   Ubuntu 18.04
>> Description:        
>>
>> Running the following query under valgrind:
>> create table gist_point_tbl(id int4, p point);
>> insert into gist_point_tbl (id, p)
>> select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g;
>> create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering
>> = on, fillfactor=50);
>>
>> leads to a memory access error:
>> ==00:00:00:08.755 18585== Invalid read of size 2
>> ==00:00:00:08.756 18585==    at 0x1F94CA: gistBuildCallback
>> (gistbuild.c:495)
>>
Please look at the patch that modifies the gist regression test to make
the issue visible and fixes it by avoiding access to the memory context
that can be reset in gistProcessEmptyingQueue().

Best regards,
Alexander

Attachment
06.04.2020 09:00, Alexander Lakhin wrote:
> 30.03.2020 00:00, PG Bug reporting form wrote:
>>> Running the following query under valgrind:
>>> create table gist_point_tbl(id int4, p point);
>>> insert into gist_point_tbl (id, p)
>>> select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g;
>>> create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering
>>> = on, fillfactor=50);
>>>
>>> leads to a memory access error:
>>> ==00:00:00:08.755 18585== Invalid read of size 2
>>> ==00:00:00:08.756 18585==    at 0x1F94CA: gistBuildCallback
>>> (gistbuild.c:495)
> Please look at the patch that modifies the gist regression test to make
> the issue visible and fixes it by avoiding access to the memory context
> that can be reset in gistProcessEmptyingQueue().
Could you please let me know whether the fix is incorrect (or not
elaborated enough to be included in the upcoming releases) or whether
this issue is a false positive?

Best regards,
Alexander





Alexander Lakhin <exclusion@gmail.com> writes:
>> Please look at the patch that modifies the gist regression test to make
>> the issue visible and fixes it by avoiding access to the memory context
>> that can be reset in gistProcessEmptyingQueue().

> Could you please let me know whether the fix is incorrect (or not
> elaborated enough to be included in the upcoming releases) or whether
> this issue is a false positive?

I took a look at this.  I see the hazard, but I'm not sure that
I like your proposed quick-fix.  Essentially, the problem is that
gistBuildCallback is supposing that the tuple it creates will survive
the execution of its subroutines, which in fact it won't always.
But that means that at some point the tuple pointer it's passed down to
those subroutines becomes a dangling pointer.  What guarantee would we
have that the subroutines don't access the tuple after that point?

I think the real issue here is confusion about which level of function
"owns" the tempCxt and gets to decide when to reset it.  We can't have
both gistBuildCallback and gistProcessEmptyingQueue doing that.  Maybe
we need more than one temporary context, or maybe we could just skip
the context reset in gistProcessEmptyingQueue and let the storage
accumulate till we get back to gistBuildCallback.  But in any case
it's going to require more analysis.
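
For concreteness, the multiple-context idea could look roughly like
this (a hypothetical sketch only; GISTBuildState has no such fields):

    /*
     * Hypothetical split of tempCxt into two contexts with one owner each,
     * so that no function resets storage a caller still relies on.
     */
    MemoryContext callbackCxt;  /* owns the tuple built in gistBuildCallback;
                                 * only gistBuildCallback resets it */
    MemoryContext emptyingCxt;  /* scratch for buffer emptying;
                                 * gistProcessEmptyingQueue may reset it
                                 * at will */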

Or, if we do just hack it as you suggest, there had better be a couple
of fat comments in gistBufferingBuildInsert warning about the hazards.

I was somewhat dismayed to realize from looking at the code coverage
report that we have exactly zero test coverage of the gist buffering
build code paths today.  That's just awful; how did the feature get
committed with no test coverage?  Your proposed test additions raise the
coverage in gistbuild.c and gistbuildbuffers.c to something respectable,
at least.  But it looks like there are still some potentially-delicate
paths that aren't tested, notably the "have to stop buffering" business.
Can we do better at reasonable cost, or are those paths just never
reached without huge data volume?  (Could we make them more reachable
by turning down maintenance_work_mem or some other setting?)

            regards, tom lane



Hello Tom,
05.05.2020 21:59, Tom Lane wrote:
>> Could you please let me know whether the fix is incorrect (or not
>> elaborated enough to be included in the upcoming releases) or whether
>> this issue is a false positive?
> I took a look at this.  I see the hazard, but I'm not sure that
> I like your proposed quick-fix.  Essentially, the problem is that
> gistBuildCallback is supposing that the tuple it creates will survive
> the execution of its subroutines, which in fact it won't always.
> But that means that at some point the tuple pointer it's passed down to
> those subroutines becomes a dangling pointer.  What guarantee would we
> have that the subroutines don't access the tuple after that point?
Thanks for your review!
Yes, I'm not excited by this fix either, so I'll try to find another,
not-so-quick way to fix it.
> Or, if we do just hack it as you suggest, there had better be a couple
> of fat comments in gistBufferingBuildInsert warning about the hazards.
>
> I was somewhat dismayed to realize from looking at the code coverage
> report that we have exactly zero test coverage of the gist buffering
> build code paths today.  That's just awful; how did the feature get
> committed with no test coverage?  Your proposed test additions raise the
> coverage in gistbuild.c and gistbuildbuffers.c to something respectable,
> at least.  But it looks like there are still some potentially-delicate
> paths that aren't tested, notably the "have to stop buffering" business.
> Can we do better at reasonable cost, or are those paths just never
> reached without huge data volume?  (Could we make them more reachable
> by turning down maintenance_work_mem or some other setting?)
Please look at the improved test that makes the code coverage for
gistbuildbuffers.c almost 100%.

Best regards,
Alexander
Attachment
11.05.2020 23:00, Alexander Lakhin wrote:
> Please look at the improved test that makes the code coverage for
> gistbuildbuffers.c almost 100%.
That improved test exposes another bug (when running under valgrind):

==00:00:00:34.861 1439277== Conditional jump or move depends on uninitialised value(s)
==00:00:00:34.861 1439277==    at 0x1FFE6A: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==  Uninitialised value was created by a heap allocation
==00:00:00:34.861 1439277==    at 0x6B646A: palloc (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F5A8A: initGISTstate (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7913: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x361677: DefineIndex (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x563FE3: ProcessUtilitySlow (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x56358F: standard_ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x56366E: ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x55FAED: PortalRunUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x560640: PortalRunMulti (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x5612F8: PortalRun (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:gistpenalty
   fun:gistRelocateBuildBuffersOnSplit
   fun:gistbufferinginserttuples
   fun:gistbufferinginserttuples
   fun:gistProcessItup
   fun:gistProcessEmptyingQueue
   fun:gistBufferingBuildInsert
   fun:gistBuildCallback
   fun:heapam_index_build_range_scan
   fun:gistbuild
   fun:index_build
   fun:index_create
}
==00:00:00:34.861 1439277== Use of uninitialised value of size 8
==00:00:00:34.861 1439277==    at 0x68E92D: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==  Uninitialised value was created by a heap allocation
==00:00:00:34.861 1439277==    at 0x6B646A: palloc (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F5A8A: initGISTstate (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7913: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x361677: DefineIndex (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x563FE3: ProcessUtilitySlow (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x56358F: standard_ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x56366E: ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x55FAED: PortalRunUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x560640: PortalRunMulti (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x5612F8: PortalRun (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==
{
   <insert_a_suppression_name_here>
   Memcheck:Value8
   fun:FunctionCall3Coll
}
==00:00:00:34.861 1439277== Jump to the invalid address stated on the next line
==00:00:00:34.861 1439277==    at 0x0: ???
==00:00:00:34.861 1439277==    by 0x68E92E: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1FFE9F: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==00:00:00:34.861 1439277==
{
   <insert_a_suppression_name_here>
   Memcheck:Jump
   obj:*
   fun:FunctionCall3Coll
   fun:gistpenalty
   fun:gistRelocateBuildBuffersOnSplit
   fun:gistbufferinginserttuples
   fun:gistbufferinginserttuples
   fun:gistProcessItup
   fun:gistProcessEmptyingQueue
   fun:gistBufferingBuildInsert
   fun:gistBuildCallback
   fun:heapam_index_build_range_scan
   fun:gistbuild
}
==00:00:00:34.861 1439277==
==00:00:00:34.861 1439277== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==00:00:00:34.861 1439277==  Bad permissions for mapped region at address 0x0
==00:00:00:34.861 1439277==    at 0x0: ???
==00:00:00:34.861 1439277==    by 0x68E92E: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1FFE9F: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==    by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
2020-05-12 06:51:30.764 MSK postmaster[1439231] LOG:  server process (PID 1439277) was terminated by signal 11: Segmentation fault
2020-05-12 06:51:30.764 MSK postmaster[1439231] DETAIL:  Failed process was running: create index gist_pointidx6 on gist_point_tbl using gist(p) include(t) with (buffering = on, fillfactor=50);

Best regards,
Alexander


I investigated this issue under Valgrind and found that in v13 it is no
longer detected, so the patch in the discussion above is no longer needed.
But I think that increasing the test coverage is useful anyway, and it
would be good to commit the tests from this patch (the re-formatted version
is below).
Also, there is a minor correction for the case of a covering index where
only some of the columns are keys and the others are just INCLUDE columns.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachment
> I investigated this issue under Valgrind and found that in v13 it is no
> longer detected, so the patch in the discussion above is no longer needed.
I want to correct myself: the issue is no longer detected on master, but it
is still detected on REL_13_STABLE.  Everything else is unchanged.
Pavel Borisov <pashkin.elfe@gmail.com> writes:
>> I investigated this issue under Valgrind and found that in v13 it is no
>> longer detected, so the patch in the discussion above is no longer needed.

> I want to correct myself: the issue is no longer detected on master, but
> it is still detected on REL_13_STABLE.  Everything else is unchanged.

I went to try to evaluate this, and found that with or without your
proposed testing addition, the code coverage for gistbuildbuffers.c
remains at 0.0%.  The previous testing patch is the same.  So that's
why the issue is "no longer detectable": the code is not reached in
HEAD.

After some investigation, I found that commit 16fa9b2b3 broke this:
if the opclass(es) have GIST_SORTSUPPORT_PROC, then the buffering
code will not be used, no matter what the index's options say.
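
For reference, the decision in gistbuild() now looks roughly like this
(paraphrased; not the literal committed code):

    /* Paraphrased sketch of the build-method choice after 16fa9b2b3 */
    if (index_getprocid(index, 1, GIST_SORTSUPPORT_PROC) != InvalidOid)
        buildstate.buildMode = GIST_SORTED_BUILD;   /* wins unconditionally */
    else if (options && options->buffering_mode == GIST_OPTION_BUFFERING_ON)
        buildstate.buildMode = GIST_BUFFERING_STATS;
    /* otherwise GIST_BUFFERING_AUTO or GIST_BUFFERING_DISABLED */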

This seems like a spectacularly bad idea from a testing standpoint,
even if it's the right thing for most users.  Basically, it is now
impossible to test buffering builds at all, unless you find a gist
opclass that lacks GIST_SORTSUPPORT_PROC.  Although there are a
few candidates to pick from, someone could at any time add such
a support proc and silently break your testing plan, just as
16fa9b2b3 did by adding such a proc to point_ops.

So I think 16fa9b2b3 has to be reconsidered: if we have a
buffering=on index parameter, we must go with that independently
of the availability of sort support procs.  Unless I hear very
loud squawks very quickly, I'll go make that happen.

(I observe that 16fa9b2b3 also failed to update the relevant
documentation in 65.4.1...)

            regards, tom lane



Pavel Borisov <pashkin.elfe@gmail.com> writes:
> Also, there is a minor correction for the case of covering index where only
> part of the columns are keys, others are just INCLUDE columns.

Yeah, that's clearly a bug, and it reinforces the point about wanting
more thorough test coverage of this area.

I pushed the bug fix, but not yet the test addition, because I'm not
very happy about the latter:

1. It nearly triples the runtime of gist.sql, from ~650 ms to ~1700 ms
on my machine.  That's a pretty bad increase for something we're
proposing to drop into the core regression tests.  Given that this is
hardly the only index build in that test, I wonder why it's so much
(but I did not look for the reason).

2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about
one time in ten for me.  I suppose this is due to the random split
choices gistchoose makes for equally-good tuples, so it's not terribly
surprising; but it is problematic for a test that we're hoping to use
to provide reliable code coverage.

I'm not really sure what we could do about #2.  Perhaps, instead of
relying on random(), we could make gistchoose() use our own PRNG and
then invent a debugging function that forces the seed to a known value.
(GEQO already does something similar, although I'd not go as far as
exposing the seed as a GUC.  Resetting it via some quick-hack C
function in regress.c would be enough IMO.)  Or perhaps gist.sql could
be adjusted so that the test data is less full of equally-good tuples.
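
A minimal sketch of that idea, modeled on GEQO's use of pg_erand48()
(the function and variable names here are invented for illustration):

    /* Private PRNG state for gistchoose()'s tie-breaking. */
    static unsigned short gist_choose_seed[3];

    /* Quick-hack reseed entry point, callable from a C function in regress.c. */
    void
    gist_choose_srand(unsigned int seed)
    {
        gist_choose_seed[0] = 0x330e;   /* same fixed word pg_srand48 uses */
        gist_choose_seed[1] = (unsigned short) seed;
        gist_choose_seed[2] = (unsigned short) (seed >> 16);
    }

    /* Replacement for random() in gistchoose(): uniform double in [0,1). */
    static double
    gist_choose_rand(void)
    {
        return pg_erand48(gist_choose_seed);
    }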

            regards, tom lane







> On 13 Oct 2020, at 03:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I pushed the bug fix, but not yet the test addition, because I'm not
> very happy about the latter:
>
> 1. It nearly triples the runtime of gist.sql, from ~650 ms to ~1700 ms
> on my machine.  That's a pretty bad increase for something we're
> proposing to drop into the core regression tests.  Given that this is
> hardly the only index build in that test, I wonder why it's so much
> (but I did not look for the reason).
>
> 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about
> one time in ten for me.  I suppose this is due to the random split
> choices gistchoose makes for equally-good tuples, so it's not terribly
> surprising; but it is problematic for a test that we're hoping to use
> to provide reliable code coverage.
>
> I'm not really sure what we could do about #2.  Perhaps, instead of
> relying on random(), we could make gistchoose() use our own PRNG and
> then invent a debugging function that forces the seed to a known value.
> (GEQO already does something similar, although I'd not go as far as
> exposing the seed as a GUC.  Resetting it via some quick-hack C
> function in regress.c would be enough IMO.)  Or perhaps gist.sql could
> be adjusted so that the test data is less full of equally-good tuples.

I think we should not use an entropy-based tie-breaker in GiST; we could
extract some pseudo-randomness from the tuples themselves using a hash.
I'd be much happier if GiST behaviour were deterministic.

> This seems like a spectacularly bad idea from a testing standpoint,
> even if it's the right thing for most users.  Basically, it is now
> impossible to test buffering builds at all, unless you find a gist
> opclass that lacks GIST_SORTSUPPORT_PROC.  Although there are a
> few candidates to pick from, someone could at any time add such
> a support proc and silently break your testing plan, just as
> 16fa9b2b3 did by adding such a proc to point_ops.
>
> So I think 16fa9b2b3 has to be reconsidered: if we have a
> buffering=on index parameter, we must go with that independently
> of the availability of sort support procs.  Unless I hear very
> loud squawks very quickly, I'll go make that happen.
FWIW I think forcing a buffered build on buffering=on is a good idea.  Not
just from a testing point of view: it simply does what the user asked for.

Thanks!

Best regards, Andrey Borodin.




Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> On 13 Oct 2020, at 03:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about
>> one time in ten for me.  I suppose this is due to the random split
>> choices gistchoose makes for equally-good tuples, so it's not terribly
>> surprising; but it is problematic for a test that we're hoping to use
>> to provide reliable code coverage.
>>
>> I'm not really sure what we could do about #2.  Perhaps, instead of
>> relying on random(), we could make gistchoose() use our own PRNG and
>> then invent a debugging function that forces the seed to a known value.
>> (GEQO already does something similar, although I'd not go as far as
>> exposing the seed as a GUC.  Resetting it via some quick-hack C
>> function in regress.c would be enough IMO.)  Or perhaps gist.sql could
>> be adjusted so that the test data is less full of equally-good tuples.

> I think we should not use an entropy-based tie-breaker in GiST; we could
> extract some pseudo-randomness from the tuples themselves using a hash.
> I'd be much happier if GiST behaviour were deterministic.

If we started using our own PRNG, we could very easily make the "random"
sequence be the same in every GiST build --- again, see GEQO for prior
art.  I'm a little suspicious of trying to pull some entropy out of the
tuples themselves: that seems like it'd be tremendously prone to cross-
platform and cross-version behavioral differences.

            regards, tom lane




> On 22 Oct 2020, at 19:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrey Borodin <x4mmm@yandex-team.ru> writes:
>>> On 13 Oct 2020, at 03:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about
>>> one time in ten for me.  I suppose this is due to the random split
>>> choices gistchoose makes for equally-good tuples, so it's not terribly
>>> surprising; but it is problematic for a test that we're hoping to use
>>> to provide reliable code coverage.
>>>
>>> I'm not really sure what we could do about #2.  Perhaps, instead of
>>> relying on random(), we could make gistchoose() use our own PRNG and
>>> then invent a debugging function that forces the seed to a known value.
>>> (GEQO already does something similar, although I'd not go as far as
>>> exposing the seed as a GUC.  Resetting it via some quick-hack C
>>> function in regress.c would be enough IMO.)  Or perhaps gist.sql could
>>> be adjusted so that the test data is less full of equally-good tuples.
>
>> I think we should not use an entropy-based tie-breaker in GiST; we could
>> extract some pseudo-randomness from the tuples themselves using a hash.
>> I'd be much happier if GiST behaviour were deterministic.
>
> If we started using our own PRNG, we could very easily make the "random"
> sequence be the same in every GiST build --- again, see GEQO for prior
> art.  I'm a little suspicious of trying to pull some entropy out of the
> tuples themselves: that seems like it'd be tremendously prone to cross-
> platform and cross-version behavioral differences.

PFA a copy of the GEQO approach applied to GiST.  I haven't found a proper
place to document this: the current random tie-breaker seems undocumented.
Would describing the GUC here [0] be enough?

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/docs/13/runtime-config-developer.html

Attachment


On Mon, 2 Nov 2020 at 20:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>>> Please look at the patch that modifies the gist regression test to make
>>> the issue visible and fixes it by avoiding access to the memory context
>>> that can be reset in gistProcessEmptyingQueue().
>
>> Could you please let me know whether the fix is incorrect (or not
>> elaborated enough to be included in the upcoming releases) or whether
>> this issue is a false positive?
>
> I took a look at this.  I see the hazard, but I'm not sure that
> I like your proposed quick-fix.  Essentially, the problem is that
> gistBuildCallback is supposing that the tuple it creates will survive
> the execution of its subroutines, which in fact it won't always.
> But that means that at some point the tuple pointer it's passed down to
> those subroutines becomes a dangling pointer.  What guarantee would we
> have that the subroutines don't access the tuple after that point?
>
> I think the real issue here is confusion about which level of function
> "owns" the tempCxt and gets to decide when to reset it.  We can't have
> both gistBuildCallback and gistProcessEmptyingQueue doing that.  Maybe
> we need more than one temporary context, or maybe we could just skip
> the context reset in gistProcessEmptyingQueue and let the storage
> accumulate till we get back to gistBuildCallback.  But in any case
> it's going to require more analysis.

I investigated the code and realized that there are two ways to reach the
problematic gistProcessEmptyingQueue() function:

1. gistEmptyAllBuffers /tempCxt owner/ -> gistProcessEmptyingQueue
The potentially dangerous operation -- resetting the memory context from
inside a sub-function of the 'context owner' function -- happens at the
very end of gistProcessEmptyingQueue().  Just after it we return to
gistEmptyAllBuffers and immediately switch back to the old context.  So
resetting tempCxt is harmless in this case, as no pointers into it are
used afterwards.

2. gistBuildCallback /tempCxt owner/ -> gistBufferingBuildInsert -> gistProcessEmptyingQueue
gistProcessEmptyingQueue() is called at the very end of
gistBufferingBuildInsert().  After the reset, the only tempCxt pointer
still used at the 'owner' level of gistBuildCallback is itup.  So moving
the sole use of the itup pointer to before the gistBufferingBuildInsert()
call will fix the bug.

(Of course, the alternative of not resetting tempCxt inside
gistProcessEmptyingQueue() and doing it one level up would also fix the
issue.)
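
In sketch form, the reordering in path 2 looks like this (simplified from
the proposed patch; not the literal diff):

    itup = gistFormTuple(buildstate->giststate, index, values, isnull, true);
    itup->t_tid = *tid;

    /* Update the tuple totals while itup is certainly still alive ... */
    buildstate->indtuples += 1;
    buildstate->indtuplesSize += IndexTupleSize(itup);

    /* ... so it no longer matters if this resets tempCxt and frees itup. */
    gistBufferingBuildInsert(buildstate, itup);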

So I think that the patch by Alexander will do the job.  I also added some
comments to the code and removed the extra context reset in
gistBuildCallback (which is already done one level down) to make things
clearer.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachment
Pavel Borisov <pashkin.elfe@gmail.com> writes:
> So I think that the patch by Alexander will do the job.  I also added some
> comments to the code and removed the extra context reset in
> gistBuildCallback (which is already done one level down) to make things
> clearer.

My apologies for having let this slip through the cracks.  I think
I'd wanted to understand why the committed version of the GiST
index tests doesn't expose the problem, and I never got time to
study that.  I still haven't tracked it down, but the proposed patch
seems clearly safe so I've gone ahead and pushed it.

I did use Alexander's original code change that just moved the
totals-incrementing.  It looks to me like gistProcessEmptyingQueue
is not guaranteed to reach the context reset, and I'd rather have
an extra reset than miss one.  (No-op context resets are cheap.)

            regards, tom lane



I wrote:
> My apologies for having let this slip through the cracks.  I think
> I'd wanted to understand why the committed version of the GiST
> index tests doesn't expose the problem, and I never got time to
> study that.  I still haven't tracked it down, but the proposed patch
> seems clearly safe so I've gone ahead and pushed it.

I couldn't quite let this go, and after some testing I see the issue.
The code coverage report correctly shows that we reach almost all of
gistProcessEmptyingQueue in the regression tests, but what it fails to
expose is that the MemoryContextReset call is only reached in calls
from gistEmptyAllBuffers, which occur only at the end of the index
build.  We never get there from gistBufferingBuildInsert, which is
the troublesome path.  I haven't grokked gistProcessEmptyingQueue
fully, but if the comment in gistBufferingBuildInsert is correct:

    /* If we filled up (half of a) buffer, process buffer emptying. */

then this is just a matter of not having enough data volume --- as
indeed Egor's test case proves, if you break it down a little: the
failure doesn't happen in the index builds with 10k or 20k tuples,
you need 30k.  The same was true in Alexander's original report.

So we could fix the lack of test coverage with something about
like this:

diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index 6f1fc65..a42d3ee 100644
*** a/src/test/regress/sql/gist.sql
--- b/src/test/regress/sql/gist.sql
*************** select p from gist_tbl order by circle(p
*** 170,175 ****
--- 170,180 ----
  select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
  
  -- Force an index build using buffering.
+ insert into gist_tbl
+ select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)),
+        point(0.05*i, 0.05*i),
+        circle(point(0.05*i, 0.05*i), 1.0)
+ from generate_series(10001,30000) as i;
  create index gist_tbl_box_index_forcing_buffering on gist_tbl using gist (p)
    with (buffering=on, fillfactor=50);
  
The question is whether it's worth the extra test cycles forevermore.
On my machine, the time for the gist test script goes from ~400ms to
~600ms.  That's still less than some of the concurrent scripts (brin,
privileges, and join_hash all take more than that for me), so maybe
it's fine.  But it seems like a lot for a change that doesn't move the
raw code coverage results by even one line.

Thoughts?

            regards, tom lane



On Wed, Mar 29, 2023 at 06:21:38PM -0400, Tom Lane wrote:
> *** a/src/test/regress/sql/gist.sql
> --- b/src/test/regress/sql/gist.sql
> *************** select p from gist_tbl order by circle(p
> *** 170,175 ****
> --- 170,180 ----
>   select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
>   
>   -- Force an index build using buffering.
> + insert into gist_tbl
> + select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)),
> +        point(0.05*i, 0.05*i),
> +        circle(point(0.05*i, 0.05*i), 1.0)
> + from generate_series(10001,30000) as i;
>   create index gist_tbl_box_index_forcing_buffering on gist_tbl using gist (p)
>     with (buffering=on, fillfactor=50);
>   
> The question is whether it's worth the extra test cycles forevermore.
> On my machine, the time for the gist test script goes from ~400ms to
> ~600ms.  That's still less than some of the concurrent scripts (brin,
> privileges, and join_hash all take more than that for me), so maybe
> it's fine.  But it seems like a lot for a change that doesn't move the
> raw code coverage results by even one line.

I say +200ms is fine.  This thread is a reminder of the limits of raw code
coverage as a measure of tests.  (I expect this will add tens of minutes to
the soft-float buildfarm animal runs, but they deserve it.)



30.03.2023 01:21, Tom Lane wrote:
>    
> The question is whether it's worth the extra test cycles forevermore.
> On my machine, the time for the gist test script goes from ~400ms to
> ~600ms.  That's still less than some of the concurrent scripts (brin,
> privileges, and join_hash all take more than that for me), so maybe
> it's fine.  But it seems like a lot for a change that doesn't move the
> raw code coverage results by even one line.
>
> Thoughts?

I would say that providing behavior coverage is the next thing to do after
the code coverage has reached 100%.  I can easily imagine bugs hiding in
fully covered code (#17858 is a recent example).

As to that concrete place, I would be inclined to improve the test as long
as it stays in line with the other tests, but I lack a whole picture of the
gist testing.  It doesn't seem to me that regress/sql/gist.sql is very
comprehensive.
I've just looked at scan-build's results and found 9 warnings related to
gist.  A quick glance at the code confirms that there are indeed
interesting cases which are not exercised for now (e.g., calling
gistRelocateBuildBuffersOnSplit() with
IndexRelationGetNumberOfKeyAttributes(r) > 1).
So if I can prove during the next week that 2-3 of them are really bugs,
maybe we can then assess that test change alongside the others (if any).
(Maybe we'll want to build that index over different columns, for example.)

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> As to that concrete place, I would be inclined to improve the test as long
> as it stays in line with the other tests, but I lack a whole picture of the
> gist testing.  It doesn't seem to me that regress/sql/gist.sql is very
> comprehensive.
> I've just looked at scan-build's results and found 9 warnings related to
> gist.  A quick glance at the code confirms that there are indeed
> interesting cases which are not exercised for now (e.g., calling
> gistRelocateBuildBuffersOnSplit() with
> IndexRelationGetNumberOfKeyAttributes(r) > 1).
> So if I can prove during the next week that 2-3 of them are really bugs,
> maybe we can then assess that test change alongside the others (if any).
> (Maybe we'll want to build that index over different columns, for example.)

OK, there's certainly no urgency.

            regards, tom lane