Thread: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 16329 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 12.2 Operating system: Ubuntu 18.04 Description: Running the following query under valgrind: create table gist_point_tbl(id int4, p point); insert into gist_point_tbl (id, p) select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g; create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering = on, fillfactor=50); leads to a memory access error: ==00:00:00:08.755 18585== Invalid read of size 2 ==00:00:00:08.756 18585== at 0x1F94CA: gistBuildCallback (gistbuild.c:495) ==00:00:00:08.756 18585== by 0x21F556: heapam_index_build_range_scan (heapam_handler.c:1664) ==00:00:00:08.756 18585== by 0x1F972F: table_index_build_scan (tableam.h:1522) ==00:00:00:08.756 18585== by 0x1F972F: gistbuild (gistbuild.c:196) ==00:00:00:08.756 18585== by 0x29D1FE: index_build (index.c:2909) ==00:00:00:08.756 18585== by 0x29EAC8: index_create (index.c:1221) ==00:00:00:08.756 18585== by 0x348C60: DefineIndex (indexcmds.c:1005) ==00:00:00:08.756 18585== by 0x52B518: ProcessUtilitySlow (utility.c:1372) ==00:00:00:08.756 18585== by 0x52AAC5: standard_ProcessUtility (utility.c:927) ==00:00:00:08.756 18585== by 0x52AB73: ProcessUtility (utility.c:360) ==00:00:00:08.756 18585== by 0x527025: PortalRunUtility (pquery.c:1175) ==00:00:00:08.756 18585== by 0x527C6F: PortalRunMulti (pquery.c:1321) ==00:00:00:08.756 18585== by 0x5289DE: PortalRun (pquery.c:796) ==00:00:00:08.756 18585== Address 0x8a4ca1e is 398 bytes inside a block of size 8,192 alloc'd ==00:00:00:08.756 18585== at 0x4C2FBFF: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==00:00:00:08.756 18585== by 0x6781E6: AllocSetContextCreateInternal (aset.c:477) ==00:00:00:08.756 18585== by 0x3BF065: ExecInitFunctionScan (nodeFunctionscan.c:508) ==00:00:00:08.756 18585== by 0x3ACE6F: ExecInitNode (execProcnode.c:247) ==00:00:00:08.756 18585== by 0x3D03E1: ExecInitModifyTable (nodeModifyTable.c:2363) ==00:00:00:08.756 18585== by 0x3ACD4C: ExecInitNode (execProcnode.c:174) ==00:00:00:08.756 18585== by 0x3A70F3: InitPlan (execMain.c:1020) ==00:00:00:08.756 18585== by 0x3A7299: standard_ExecutorStart (execMain.c:266) ==00:00:00:08.756 18585== by 0x3A739B: ExecutorStart (execMain.c:148) ==00:00:00:08.756 18585== by 0x527936: ProcessQuery (pquery.c:156) ==00:00:00:08.756 18585== by 0x527B34: PortalRunMulti (pquery.c:1283) ==00:00:00:08.756 18585== by 0x5289DE: PortalRun (pquery.c:796) Reproduced at least on REL_10_STABLE..master.
Re: BUG #16329: Valgrind detects an invalid read when building a gistindex with buffering
From
Alexander Lakhin
Date:
30.03.2020 00:00, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 16329 > Logged by: Alexander Lakhin > Email address: exclusion@gmail.com > PostgreSQL version: 12.2 > Operating system: Ubuntu 18.04 > Description: > > Running the following query under valgrind: > create table gist_point_tbl(id int4, p point); > insert into gist_point_tbl (id, p) > select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g; > create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering > = on, fillfactor=50); > > leads to a memory access error: > ==00:00:00:08.755 18585== Invalid read of size 2 > ==00:00:00:08.756 18585== at 0x1F94CA: gistBuildCallback > (gistbuild.c:495) I've found that the memory context where the itup is created: oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt); /* form an index tuple and point it at the heap tuple */ itup = gistFormTuple(buildstate->giststate, index, values, isnull, true); can be reset in gistBufferingBuildInsert->gistProcessEmptyingQueue: /* Free all the memory allocated during index tuple processing */ MemoryContextReset(buildstate->giststate->tempCxt); Probably, the offending commit is d22a09dc.
Re: BUG #16329: Valgrind detects an invalid read when building a gistindex with buffering
From
Alexander Lakhin
Date:
Hello hackers, 30.03.2020 08:00, Alexander Lakhin wrote: > 30.03.2020 00:00, PG Bug reporting form wrote: >> The following bug has been logged on the website: >> >> Bug reference: 16329 >> Logged by: Alexander Lakhin >> Email address: exclusion@gmail.com >> PostgreSQL version: 12.2 >> Operating system: Ubuntu 18.04 >> Description: >> >> Running the following query under valgrind: >> create table gist_point_tbl(id int4, p point); >> insert into gist_point_tbl (id, p) >> select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g; >> create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering >> = on, fillfactor=50); >> >> leads to a memory access error: >> ==00:00:00:08.755 18585== Invalid read of size 2 >> ==00:00:00:08.756 18585== at 0x1F94CA: gistBuildCallback >> (gistbuild.c:495) >> Please look at the patch that modifies the gist regression test to make the issue visible and fixes it by avoiding access to the memory context that can be reset in gistProcessEmptyingQueue(). Best regards, Alexander
Attachment
Re: BUG #16329: Valgrind detects an invalid read when building a gistindex with buffering
From
Alexander Lakhin
Date:
06.04.2020 09:00, Alexander Lakhin wrote: > 30.03.2020 00:00, PG Bug reporting form wrote: >>> Running the following query under valgrind: >>> create table gist_point_tbl(id int4, p point); >>> insert into gist_point_tbl (id, p) >>> select g, point(g*10+1, g*10+1) from generate_series(1, 30000) g; >>> create index gist_pointidx2 on gist_point_tbl using gist(p) with (buffering >>> = on, fillfactor=50); >>> >>> leads to a memory access error: >>> ==00:00:00:08.755 18585== Invalid read of size 2 >>> ==00:00:00:08.756 18585== at 0x1F94CA: gistBuildCallback >>> (gistbuild.c:495) > Please look at the patch that modifies the gist regression test to make > the issue visible and fixes it by avoiding access to the memory context > that can be reset in gistProcessEmptyingQueue(). Could you please let me know if the fix is incorrect (or not elaborated enough to be included in the upcoming releases) or this issue is false positive? Best regards, Alexander
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes: >> Please look at the patch that modifies the gist regression test to make >> the issue visible and fixes it by avoiding access to the memory context >> that can be reset in gistProcessEmptyingQueue(). > Could you please let me know if the fix is incorrect (or not elaborated > enough to be included in the upcoming releases) or this issue is false > positive? I took a look at this. I see the hazard, but I'm not sure that I like your proposed quick-fix. Essentially, the problem is that gistBuildCallback is supposing that the tuple it creates will survive the execution of its subroutines, which in fact it won't always. But that means that at some point the tuple pointer it's passed down to those subroutines becomes a dangling pointer. What guarantee would we have that the subroutines don't access the tuple after that point? I think the real issue here is confusion about which level of function "owns" the tempCxt and gets to decide when to reset it. We can't have both gistBuildCallback and gistProcessEmptyingQueue doing that. Maybe we need more than one temporary context, or maybe we could just skip the context reset in gistProcessEmptyingQueue and let the storage accumulate till we get back to gistBuildCallback. But in any case it's going to require more analysis. Or, if we do just hack it as you suggest, there had better be a couple of fat comments in gistBufferingBuildInsert warning about the hazards. I was somewhat dismayed to realize from looking at the code coverage report that we have exactly zero test coverage of the gist buffering build code paths today. That's just awful; how did the feature get committed with no test coverage? Your proposed test additions raise the coverage in gistbuild.c and gistbuildbuffers.c to something respectable, at least. But it looks like there are still some potentially-delicate paths that aren't tested, notably the "have to stop buffering" business. Can we do better at reasonable cost, or are those paths just never reached without huge data volume? (Could we make them more reachable by turning down maintenance_work_mem or some other setting?) regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gistindex with buffering
From
Alexander Lakhin
Date:
Hello Tom,
05.05.2020 21:59, Tom Lane wrote:
05.05.2020 21:59, Tom Lane wrote:
Thanks for your review!Could you please let me know if the fix is incorrect (or not elaborated enough to be included in the upcoming releases) or this issue is false positive?I took a look at this. I see the hazard, but I'm not sure that I like your proposed quick-fix. Essentially, the problem is that gistBuildCallback is supposing that the tuple it creates will survive the execution of its subroutines, which in fact it won't always. But that means that at some point the tuple pointer it's passed down to those subroutines becomes a dangling pointer. What guarantee would we have that the subroutines don't access the tuple after that point?
Yes, I'm not excited by this fix too, so I'll try to find another not so quick way to fix it.
Please look at the improved test that makes the code coverage for gistbuildbuffers.c almost 100%.Or, if we do just hack it as you suggest, there had better be a couple of fat comments in gistBufferingBuildInsert warning about the hazards. I was somewhat dismayed to realize from looking at the code coverage report that we have exactly zero test coverage of the gist buffering build code paths today. That's just awful; how did the feature get committed with no test coverage? Your proposed test additions raise the coverage in gistbuild.c and gistbuildbuffers.c to something respectable, at least. But it looks like there are still some potentially-delicate paths that aren't tested, notably the "have to stop buffering" business. Can we do better at reasonable cost, or are those paths just never reached without huge data volume? (Could we make them more reachable by turning down maintenance_work_mem or some other setting?)
Best regards,
Alexander
Attachment
Re: BUG #16329: Valgrind detects an invalid read when building a gistindex with buffering
From
Alexander Lakhin
Date:
11.05.2020 23:00, Alexander Lakhin wrote:
Please look at the improved test that makes the code coverage for gistbuildbuffers.c almost 100%.That improved test exposes another bug (when running under valgrind):
==00:00:00:34.861 1439277== Conditional jump or move depends on uninitialised value(s)
==00:00:00:34.861 1439277== at 0x1FFE6A: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== Uninitialised value was created by a heap allocation
==00:00:00:34.861 1439277== at 0x6B646A: palloc (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F5A8A: initGISTstate (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7913: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x361677: DefineIndex (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x563FE3: ProcessUtilitySlow (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x56358F: standard_ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x56366E: ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x55FAED: PortalRunUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x560640: PortalRunMulti (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x5612F8: PortalRun (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==
{
<insert_a_suppression_name_here>
Memcheck:Cond
fun:gistpenalty
fun:gistRelocateBuildBuffersOnSplit
fun:gistbufferinginserttuples
fun:gistbufferinginserttuples
fun:gistProcessItup
fun:gistProcessEmptyingQueue
fun:gistBufferingBuildInsert
fun:gistBuildCallback
fun:heapam_index_build_range_scan
fun:gistbuild
fun:index_build
fun:index_create
}
==00:00:00:34.861 1439277== Use of uninitialised value of size 8
==00:00:00:34.861 1439277== at 0x68E92D: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== Uninitialised value was created by a heap allocation
==00:00:00:34.861 1439277== at 0x6B646A: palloc (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F5A8A: initGISTstate (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7913: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B8372: index_build (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2B9C8D: index_create (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x361677: DefineIndex (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x563FE3: ProcessUtilitySlow (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x56358F: standard_ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x56366E: ProcessUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x55FAED: PortalRunUtility (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x560640: PortalRunMulti (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x5612F8: PortalRun (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277==
{
<insert_a_suppression_name_here>
Memcheck:Value8
fun:FunctionCall3Coll
}
==00:00:00:34.861 1439277== Jump to the invalid address stated on the next line
==00:00:00:34.861 1439277== at 0x0: ???
==00:00:00:34.861 1439277== by 0x68E92E: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1FFE9F: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==00:00:00:34.861 1439277==
{
<insert_a_suppression_name_here>
Memcheck:Jump
obj:*
fun:FunctionCall3Coll
fun:gistpenalty
fun:gistRelocateBuildBuffersOnSplit
fun:gistbufferinginserttuples
fun:gistbufferinginserttuples
fun:gistProcessItup
fun:gistProcessEmptyingQueue
fun:gistBufferingBuildInsert
fun:gistBuildCallback
fun:heapam_index_build_range_scan
fun:gistbuild
}
==00:00:00:34.861 1439277==
==00:00:00:34.861 1439277== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==00:00:00:34.861 1439277== Bad permissions for mapped region at address 0x0
==00:00:00:34.861 1439277== at 0x0: ???
==00:00:00:34.861 1439277== by 0x68E92E: FunctionCall3Coll (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1FFE9F: gistpenalty (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F86E0: gistRelocateBuildBuffersOnSplit (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F67B7: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6B6F: gistbufferinginserttuples (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F6E0E: gistProcessItup (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F701F: gistProcessEmptyingQueue (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F706D: gistBufferingBuildInsert (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7817: gistBuildCallback (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x2293F3: heapam_index_build_range_scan (in .../tmp_install/usr/local/pgsql/bin/postgres)
==00:00:00:34.861 1439277== by 0x1F7A23: gistbuild (in .../tmp_install/usr/local/pgsql/bin/postgres)
2020-05-12 06:51:30.764 MSK postmaster[1439231] LOG: server process (PID 1439277) was terminated by signal 11: Segmentation fault
2020-05-12 06:51:30.764 MSK postmaster[1439231] DETAIL: Failed process was running: create index gist_pointidx6 on gist_point_tbl using gist(p) include(t) with (buffering = on, fillfactor=50);
Best regards,
Alexander
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Pavel Borisov
Date:
2020-05-12 06:51:30.764 MSK postmaster[1439231] LOG: server process (PID 1439277) was terminated by signal 11: Segmentation fault
2020-05-12 06:51:30.764 MSK postmaster[1439231] DETAIL: Failed process was running: create index gist_pointidx6 on gist_point_tbl using gist(p) include(t) with (buffering = on, fillfactor=50);
Best regards,
Alexander
I investigated this issue under Valgrind and found that in v13 it is no longer detected and the patch in the discussion above is no longer needed.
But I think that increasing test coverage is useful anyway, and it's good to commit tests from this patch (the re-formatted version is below).
Also, there is a minor correction for the case of covering index where only part of the columns are keys, others are just INCLUDE columns.
Attachment
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Pavel Borisov
Date:
I investigated this issue under Valgrind and found that in v13 it is no longer detected and the patch in the discussion above is no longer needed.
I want to correct myself that issue is no longer detected in master, but not in 13stable. Everything else is without changes.
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Pavel Borisov <pashkin.elfe@gmail.com> writes: >> I investigated this issue under Valgrind and found that in v13 it is no >> longer detected and the patch in the discussion above is no longer needed. > I want to correct myself that issue is no longer detected in master, but > not in 13stable. Everything else is without changes. I went to try to evaluate this, and found that with or without your proposed testing addition, the code coverage for gistbuildbuffers.c remains at 0.0%. The previous testing patch is the same. So that's why the issue is "no longer detectable": the code is not reached in HEAD. After some investigation, I found that commit 16fa9b2b3 broke this: if the opclass(es) have GIST_SORTSUPPORT_PROC, then the buffering code will not be used, no matter what the index's options say. This seems like a spectacularly bad idea from a testing standpoint, even if it's the right thing for most users. Basically, it is now impossible to test buffering builds at all, unless you find a gist opclass that lacks GIST_SORTSUPPORT_PROC. Although there are a few candidates to pick from, someone could at any time add such a support proc and silently break your testing plan, just as 16fa9b2b3 did by adding such a proc to point_ops. So I think 16fa9b2b3 has to be reconsidered: if we have a buffering=on index parameter, we must go with that independently of the availability of sort support procs. Unless I hear very loud squawks very quickly, I'll go make that happen. (I observe that 16fa9b2b3 also failed to update the relevant documentation in 65.4.1...) regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Pavel Borisov <pashkin.elfe@gmail.com> writes: > Also, there is a minor correction for the case of covering index where only > part of the columns are keys, others are just INCLUDE columns. Yeah, that's clearly a bug, and it reinforces the point about wanting more thorough test coverage of this area. I pushed the bug fix, but not yet the test addition, because I'm not very happy about the latter: 1. It nearly triples the runtime of gist.sql, from ~650 ms to ~1700 ms on my machine. That's a pretty bad increase for something we're proposing to drop into the core regression tests. Given that this is hardly the only index build in that test, I wonder why it's so much (but I did not look for the reason). 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about one time in ten for me. I suppose this is due to the random split choices gistchoose makes for equally-good tuples, so it's not terribly surprising; but it is problematic for a test that we're hoping to use to provide reliable code coverage. I'm not really sure what we could do about #2. Perhaps, instead of relying on random(), we could make gistchoose() use our own PRNG and then invent a debugging function that forces the seed to a known value. (GEQO already does something similar, although I'd not go as far as exposing the seed as a GUC. Resetting it via some quick-hack C function in regress.c would be enough IMO.) Or perhaps gist.sql could be adjusted so that the test data is less full of equally-good tuples. regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Andrey Borodin
Date:
> 13 окт. 2020 г., в 03:16, Tom Lane <tgl@sss.pgh.pa.us> написал(а): > > I pushed the bug fix, but not yet the test addition, because I'm not > very happy about the latter: > > 1. It nearly triples the runtime of gist.sql, from ~650 ms to ~1700 ms > on my machine. That's a pretty bad increase for something we're > proposing to drop into the core regression tests. Given that this is > hardly the only index build in that test, I wonder why it's so much > (but I did not look for the reason). > > 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about > one time in ten for me. I suppose this is due to the random split > choices gistchoose makes for equally-good tuples, so it's not terribly > surprising; but it is problematic for a test that we're hoping to use > to provide reliable code coverage. > > I'm not really sure what we could do about #2. Perhaps, instead of > relying on random(), we could make gistchoose() use our own PRNG and > then invent a debugging function that forces the seed to a known value. > (GEQO already does something similar, although I'd not go as far as > exposing the seed as a GUC. Resetting it via some quick-hack C > function in regress.c would be enough IMO.) Or perhaps gist.sql could > be adjusted so that the test data is less full of equally-good tuples. I think we should use not entropy-based tie breaker in GiST. We can extract some randomness from tuples using hash. I'd be much happier if GiST behaviour was deterministic. > This seems like a spectacularly bad idea from a testing standpoint, > even if it's the right thing for most users. Basically, it is now > impossible to test buffering builds at all, unless you find a gist > opclass that lacks GIST_SORTSUPPORT_PROC. Although there are a > few candidates to pick from, someone could at any time add such > a support proc and silently break your testing plan, just as > 16fa9b2b3 did by adding such a proc to point_ops. > > So I think 16fa9b2b3 has to be reconsidered: if we have a > buffering=on index parameter, we must go with that independently > of the availability of sort support procs. Unless I hear very > loud squawks very quickly, I'll go make that happen. FWIW I think forcing buffered build on buffering=on is a good idea. Not just from testing point of view, it simply does whatuser asked for. Thanks! Best regards, Andrey Borodin.
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Andrey Borodin <x4mmm@yandex-team.ru> writes: >> 13 окт. 2020 г., в 03:16, Tom Lane <tgl@sss.pgh.pa.us> написал(а): >> 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about >> one time in ten for me. I suppose this is due to the random split >> choices gistchoose makes for equally-good tuples, so it's not terribly >> surprising; but it is problematic for a test that we're hoping to use >> to provide reliable code coverage. >> >> I'm not really sure what we could do about #2. Perhaps, instead of >> relying on random(), we could make gistchoose() use our own PRNG and >> then invent a debugging function that forces the seed to a known value. >> (GEQO already does something similar, although I'd not go as far as >> exposing the seed as a GUC. Resetting it via some quick-hack C >> function in regress.c would be enough IMO.) Or perhaps gist.sql could >> be adjusted so that the test data is less full of equally-good tuples. > I think we should use not entropy-based tie breaker in GiST. We can extract some randomness from tuples using hash. > I'd be much happier if GiST behaviour was deterministic. If we started using our own PRNG, we could very easily make the "random" sequence be the same in every GiST build --- again, see GEQO for prior art. I'm a little suspicious of trying to pull some entropy out of the tuples themselves: that seems like it'd be tremendously prone to cross- platform and cross-version behavioral differences. regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Andrey Borodin
Date:
> 22 окт. 2020 г., в 19:01, Tom Lane <tgl@sss.pgh.pa.us> написал(а): > > Andrey Borodin <x4mmm@yandex-team.ru> writes: >>> 13 окт. 2020 г., в 03:16, Tom Lane <tgl@sss.pgh.pa.us> написал(а): >>> 2. The test exposed the gistRelocateBuildBuffersOnSplit bug only about >>> one time in ten for me. I suppose this is due to the random split >>> choices gistchoose makes for equally-good tuples, so it's not terribly >>> surprising; but it is problematic for a test that we're hoping to use >>> to provide reliable code coverage. >>> >>> I'm not really sure what we could do about #2. Perhaps, instead of >>> relying on random(), we could make gistchoose() use our own PRNG and >>> then invent a debugging function that forces the seed to a known value. >>> (GEQO already does something similar, although I'd not go as far as >>> exposing the seed as a GUC. Resetting it via some quick-hack C >>> function in regress.c would be enough IMO.) Or perhaps gist.sql could >>> be adjusted so that the test data is less full of equally-good tuples. > >> I think we should use not entropy-based tie breaker in GiST. We can extract some randomness from tuples using hash. >> I'd be much happier if GiST behaviour was deterministic. > > If we started using our own PRNG, we could very easily make the "random" > sequence be the same in every GiST build --- again, see GEQO for prior > art. I'm a little suspicious of trying to pull some entropy out of the > tuples themselves: that seems like it'd be tremendously prone to cross- > platform and cross-version behavioral differences. PFA copy of GEQO approach to GiST. I haven't found proper place to document this: current random tie breaker seems undocumented. Will describing GUC here [0] be enough? Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/docs/13/runtime-config-developer.html
Attachment
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Pavel Borisov
Date:
пн, 2 нояб. 2020 г. в 20:05, Tom Lane <tgl@sss.pgh.pa.us>:
Alexander Lakhin <exclusion@gmail.com> writes:
>> Please look at the patch that modifies the gist regression test to make
>> the issue visible and fixes it by avoiding access to the memory context
>> that can be reset in gistProcessEmptyingQueue().
> Could you please let me know if the fix is incorrect (or not elaborated
> enough to be included in the upcoming releases) or this issue is false
> positive?
I took a look at this. I see the hazard, but I'm not sure that
I like your proposed quick-fix. Essentially, the problem is that
gistBuildCallback is supposing that the tuple it creates will survive
the execution of its subroutines, which in fact it won't always.
But that means that at some point the tuple pointer it's passed down to
those subroutines becomes a dangling pointer. What guarantee would we
have that the subroutines don't access the tuple after that point?
I think the real issue here is confusion about which level of function
"owns" the tempCxt and gets to decide when to reset it. We can't have
both gistBuildCallback and gistProcessEmptyingQueue doing that. Maybe
we need more than one temporary context, or maybe we could just skip
the context reset in gistProcessEmptyingQueue and let the storage
accumulate till we get back to gistBuildCallback. But in any case
it's going to require more analysis.
I investigated the code and realized that there are two ways to call the problematic gistProcessEmptyingQueue() function:
1. gistEmptyAllBuffers /tempCxt owner/ -> gistProcessEmptyingQueue
A potentially dangerous operation of resetting memory context inside sub-function calls of a 'context owner' function is done at the very ending of gistProcessEmptyingQueue(). Just after it we return back to gistEmptyAllBuffers and immediately have a switch to old context. So resetting tempCxt has no danger in this case as no pointers are used after.
2. gistBuildCallback /tempCxt owner/ -> gistBufferingBuildInsert -> gistProcessEmptyingQueue
gistProcessEmptyingQueue() is at the very ending of gistBufferingBuildInsert(). After the reset no pointers from tempCxt context are used up to the 'owner' level of gistBuildCallback. So moving the only itup pointer operation before gistBufferingBuildInsert() call will fix the bug.
(Of course, the alternative of not re-setting tempCxt inside gistProcessEmptyingQueue() and doing this level up will also fix the issue)
So I think that the patch by Alexander will do the thing. I also added some comments to the code and removed extra context reset in gistBuildCallback (which is already done level down) to make things clear.
Attachment
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Pavel Borisov <pashkin.elfe@gmail.com> writes: > So I think that the patch by Alexander will do the thing. I also added some > comments to the code and removed extra context reset in gistBuildCallback > (which is already done level down) to make things clear. My apologies for having let this slip through the cracks. I think I'd wanted to understand why the committed version of the GiST index tests doesn't expose the problem, and I never got time to study that. I still haven't tracked it down, but the proposed patch seems clearly safe so I've gone ahead and pushed it. I did use Alexander's original code change that just moved the totals-incrementing. It looks to me like gistProcessEmptyingQueue is not guaranteed to reach the context reset, and I'd rather have an extra reset than miss one. (No-op context resets are cheap.) regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
I wrote: > My apologies for having let this slip through the cracks. I think > I'd wanted to understand why the committed version of the GiST > index tests doesn't expose the problem, and I never got time to > study that. I still haven't tracked it down, but the proposed patch > seems clearly safe so I've gone ahead and pushed it. I couldn't quite let this go, and after some testing I see the issue. The code coverage report correctly shows that we reach almost all of gistProcessEmptyingQueue in the regression tests, but what it fails to expose is that the MemoryContextReset call is only reached in calls from gistEmptyAllBuffers, which occur only at the end of the index build. We never get there from gistBufferingBuildInsert, which is the troublesome path. I haven't grokked gistProcessEmptyingQueue fully, but if the comment in gistBufferingBuildInsert is correct: /* If we filled up (half of a) buffer, process buffer emptying. */ then this is just a matter of not having enough data volume --- as indeed Egor's test case proves, if you break it down a little: the failure doesn't happen in the index builds with 10k or 20k tuples, you need 30k. The same was true in Alexander's original report. So we could fix the lack of test coverage with something about like this: diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql index 6f1fc65..a42d3ee 100644 *** a/src/test/regress/sql/gist.sql --- b/src/test/regress/sql/gist.sql *************** select p from gist_tbl order by circle(p *** 170,175 **** --- 170,180 ---- select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1; -- Force an index build using buffering. + insert into gist_tbl + select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), + point(0.05*i, 0.05*i), + circle(point(0.05*i, 0.05*i), 1.0) + from generate_series(10001,30000) as i; create index gist_tbl_box_index_forcing_buffering on gist_tbl using gist (p) with (buffering=on, fillfactor=50); The question is whether it's worth the extra test cycles forevermore. On my machine, the time for the gist test script goes from ~400ms to ~600ms. That's still less than some of the concurrent scripts (brin, privileges, and join_hash all take more than that for me), so maybe it's fine. But it seems like a lot for a change that doesn't move the raw code coverage results by even one line. Thoughts? regards, tom lane
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Noah Misch
Date:
On Wed, Mar 29, 2023 at 06:21:38PM -0400, Tom Lane wrote: > *** a/src/test/regress/sql/gist.sql > --- b/src/test/regress/sql/gist.sql > *************** select p from gist_tbl order by circle(p > *** 170,175 **** > --- 170,180 ---- > select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1; > > -- Force an index build using buffering. > + insert into gist_tbl > + select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), > + point(0.05*i, 0.05*i), > + circle(point(0.05*i, 0.05*i), 1.0) > + from generate_series(10001,30000) as i; > create index gist_tbl_box_index_forcing_buffering on gist_tbl using gist (p) > with (buffering=on, fillfactor=50); > > The question is whether it's worth the extra test cycles forevermore. > On my machine, the time for the gist test script goes from ~400ms to > ~600ms. That's still less than some of the concurrent scripts (brin, > privileges, and join_hash all take more than that for me), so maybe > it's fine. But it seems like a lot for a change that doesn't move the > raw code coverage results by even one line. I say +200ms is fine. This thread is a reminder of the limits of raw code coverage as a measure of tests. (I expect this will add tens of minutes to the soft-float buildfarm animal runs, but they deserve it.)
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Alexander Lakhin
Date:
30.03.2023 01:21, Tom Lane wrote: > > The question is whether it's worth the extra test cycles forevermore. > On my machine, the time for the gist test script goes from ~400ms to > ~600ms. That's still less than some of the concurrent scripts (brin, > privileges, and join_hash all take more than that for me), so maybe > it's fine. But it seems like a lot for a change that doesn't move the > raw code coverage results by even one line. > > Thoughts? I would say that providing the behavior coverage is the next thing to do after the code coverage reached 100%. I can easily imagine bugs hiding in a fully covered code (#17858 is the recent example). As to that concrete place, I would incline to improve the test while it stays in line with other tests, but I lack a whole picture of the gist testing. It doesn't seem to me that the regress/sql/gist.sql is too comprehensive. I've just looked at the scan-build's results and found 9 warnings related to gist. A quick glance at the code confirms that there are indeed interesting cases, which are not exercised for now (e.g., call gistRelocateBuildBuffersOnSplit() with IndexRelationGetNumberOfKeyAttributes(r) > 1). So if I could prove that 2-3 of them are really bugs during the next week, maybe then assess that test change alongside with others (if any). (Maybe we'll want to build that index over different columns, for example.) Best regards, Alexander
Re: BUG #16329: Valgrind detects an invalid read when building a gist index with buffering
From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes: > As to that concrete place, I would incline to improve the test while it stays > in line with other tests, but I lack a whole picture of the gist testing. > It doesn't seem to me that the regress/sql/gist.sql is too comprehensive. > I've just looked at the scan-build's results and found 9 warnings related to > gist. A quick glance at the code confirms that there are indeed > interesting cases, which are not exercised for now (e.g., call > gistRelocateBuildBuffersOnSplit() with > IndexRelationGetNumberOfKeyAttributes(r) > 1). > So if I could prove that 2-3 of them are really bugs during the next week, > maybe then assess that test change alongside with others (if any). > (Maybe we'll want to build that index over different columns, for example.) OK, there's certainly no urgency. regards, tom lane