Thread: BUG #17950: Incorrect memory access in gtsvector_picksplit()
The following bug has been logged on the website: Bug reference: 17950 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 16beta1 Operating system: Ubuntu 22.04 Description: The following script: CREATE TABLE test_tsvector(t text, a tsvector); SELECT 'COPY test_tsvector FROM ''.../src/test/regress/data/tsearch.data'';' FROM generate_series(1, 19) \gexec CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON test_tsvector FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(a, 'pg_catalog.english', t); SELECT 'COPY test_tsvector FROM ''.../src/test/regress/data/tsearch.data'';' FROM generate_series(1, 38) \gexec CREATE INDEX gistidx ON test_tsvector USING gist (a tsvector_ops(siglen=1)); -- I believe it's not the only way to get a data pattern needed, so -- probably the repro could be simplified, if desired. triggers a Valgrind-detected memory access error: ==00:00:00:53.414 342514== Invalid read of size 1 ==00:00:00:53.414 342514== at 0x79787D: pg_popcount (pg_bitutils.c:332) ==00:00:00:53.414 342514== by 0x6F93C6: sizebitvec (tsgistidx.c:488) ==00:00:00:53.414 342514== by 0x6FA24A: gtsvector_picksplit (tsgistidx.c:731) ==00:00:00:53.414 342514== by 0x74146D: FunctionCall2Coll (fmgr.c:1132) ==00:00:00:53.414 342514== by 0x217E65: gistUserPicksplit (gistsplit.c:433) ==00:00:00:53.414 342514== by 0x2184A5: gistSplitByKey (gistsplit.c:697) ==00:00:00:53.414 342514== by 0x20C237: gistSplit (gist.c:1450) ==00:00:00:53.414 342514== by 0x20CA32: gistplacetopage (gist.c:309) ==00:00:00:53.414 342514== by 0x20DBFB: gistinserttuples (gist.c:1278) ==00:00:00:53.414 342514== by 0x20DF85: gistfinishsplit (gist.c:1376) ==00:00:00:53.414 342514== by 0x20DC79: gistinserttuples (gist.c:1305) ==00:00:00:53.414 342514== by 0x20E39E: gistinserttuple (gist.c:1231) ==00:00:00:53.414 342514== Address 0x7386c68 is 16,024 bytes inside a block of size 16,384 alloc'd ==00:00:00:53.414 342514== at 0x4848899: malloc (vg_replace_malloc.c:381) ==00:00:00:53.414 342514== by 0x7602C7: AllocSetAlloc (aset.c:924) ==00:00:00:53.414 342514== by 0x76C26A: palloc (mcxt.c:1240) ==00:00:00:53.414 342514== by 0x20C24A: gistSplit (gist.c:1453) ==00:00:00:53.414 342514== by 0x20CA32: gistplacetopage (gist.c:309) ==00:00:00:53.414 342514== by 0x20DBFB: gistinserttuples (gist.c:1278) ==00:00:00:53.414 342514== by 0x20E39E: gistinserttuple (gist.c:1231) ==00:00:00:53.414 342514== by 0x20E940: gistdoinsert (gist.c:886) ==00:00:00:53.414 342514== by 0x21152B: gistBuildCallback (gistbuild.c:929) ==00:00:00:53.414 342514== by 0x24420D: heapam_index_build_range_scan (heapam_handler.c:1708) ==00:00:00:53.414 342514== by 0x2119E4: table_index_build_scan (tableam.h:1781) ==00:00:00:53.414 342514== by 0x2119E4: gistbuild (gistbuild.c:317) ==00:00:00:53.414 342514== by 0x2E06F5: index_build (index.c:3032) ==00:00:00:53.414 342514== (Several runs might be required for the issue reproduction.) With the additional debug logging in gtsvector_picksplit(): @@ -722,6 +722,11 @@ gtsvector_picksplit(PG_FUNCTION_ARGS) continue; } +if (!cache[j].allistrue) { +elog(LOG, "!!!gtsvector_picksplit| j: %d, cache[j].sign: %p, GETSIGN(cache[j].sign): %p", j, cache[j].sign, GETSIGN(cache[j].sign)); +VALGRIND_CHECK_MEM_IS_DEFINED(GETSIGN(cache[j].sign), siglen); +} + if (ISALLTRUE(datum_l) || cache[j].allistrue) { if (ISALLTRUE(datum_l) && cache[j].allistrue) (and #include "utils/memdebug.h") I see: cache[j].sign: 0x723a999, GETSIGN(cache[j].sign): 0x723a9a1 ==00:00:00:18.356 351519== Unaddressable byte(s) found during client check request ==00:00:00:18.356 351519== at 0x6FA2BE: gtsvector_picksplit (tsgistidx.c:727) ... ==00:00:00:18.356 351519== Address 0x723a9a1 is 4,977 bytes inside a block of size 8,192 alloc'd Reproduced starting from 911e70207. But with the slight variation: CREATE INDEX gistidx ON test_tsvector USING gist (a tsvector_ops); and the debugging patch: @@ -711,6 +712,9 @@ gtsvector_picksplit(PG_FUNCTION_ARGS) else size_alpha = hemdistsign(cache[j].sign, GETSIGN(datum_l)); +if (!cache[j].allistrue) +VALGRIND_CHECK_MEM_IS_DEFINED(GETSIGN(cache[j].sign), SIGLEN); + if (ISALLTRUE(datum_r) || cache[j].allistrue) { if (ISALLTRUE(datum_r) && cache[j].allistrue) it's reproduced even on 911e70207~1: ==00:00:00:15.858 370963== Unaddressable byte(s) found during client check request ==00:00:00:15.858 370963== at 0x636E0E: gtsvector_picksplit (tsgistidx.c:716) ==00:00:00:15.858 370963== by 0x67B6D3: FunctionCall2Coll (fmgr.c:1162) ==00:00:00:15.858 370963== by 0x1FB662: gistUserPicksplit (gistsplit.c:433) ==00:00:00:15.858 370963== by 0x1FBCCF: gistSplitByKey (gistsplit.c:697) ==00:00:00:15.858 370963== by 0x1F088A: gistSplit (gist.c:1441) ==00:00:00:15.858 370963== by 0x1F0F3C: gistplacetopage (gist.c:302) ==00:00:00:15.858 370963== by 0x1F202F: gistinserttuples (gist.c:1270) ==00:00:00:15.858 370963== by 0x1F273A: gistinserttuple (gist.c:1223) ==00:00:00:15.858 370963== by 0x1F2BCE: gistdoinsert (gist.c:879) ==00:00:00:15.858 370963== by 0x1F4E8F: gistBuildCallback (gistbuild.c:470) ==00:00:00:15.858 370963== by 0x2262BD: heapam_index_build_range_scan (heapam_handler.c:1659) ==00:00:00:15.858 370963== by 0x1F50E7: table_index_build_scan (tableam.h:1540) ==00:00:00:15.858 370963== by 0x1F50E7: gistbuild (gistbuild.c:196) ==00:00:00:15.858 370963== Address 0x11a3a68b is 4,939 bytes inside a block of size 16,384 alloc'd ==00:00:00:15.858 370963== at 0x4848899: malloc (vg_replace_malloc.c:381) ==00:00:00:15.858 370963== by 0x699954: AllocSetAlloc (aset.c:941) ==00:00:00:15.858 370963== by 0x6A26CC: palloc (mcxt.c:963) ==00:00:00:15.858 370963== by 0x636990: gtsvector_picksplit (tsgistidx.c:613) ... So it looks like this defect exists in core since 140d4ebcb. IIUC, using the GETSIGN macro with cache[j].sign is a mistake -- it erroneously adds 8 to an address of the sign field, so for the last j it leads to an out-of-bounds memory read.
29.05.2023 23:00, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 17950 I managed to reduce the reproducer to the following: CREATE TABLE tst(t tsvector); INSERT INTO tst SELECT array_to_string(array(SELECT 'a' || x::text FROM generate_series(1, 125) x), ' ')::tsvector FROM generate_series(1, 3000); INSERT INTO tst SELECT '' FROM generate_series(1, 100); CREATE INDEX gistidx ON tst USING gist (t tsvector_ops(siglen=1)); (Sorry for the previous messy script.) A trivial fix for the issue is attached. BTW, when looking at the index contents (page 0) using pageinspect, I saw: itemoffset | ctid | itemlen | dead | keys ------------+-------------+---------+------+----------------------------------- 1 | (367,65535) | 16 | f | (a)=("0 true bits, 0 false bits") 2 | (368,65535) | 16 | f | (a)=("0 true bits, 0 false bits") The text describing keys looks confusing, just as if siglen was 0, but it's not the case. This is explained by the code: int siglen = GETSIGLEN(key); int cnttrue = (ISALLTRUE(key)) ? SIGLENBIT(siglen) : sizebitvec(GETSIGN(key), siglen); sprintf(outbuf, SINGOUTSTR, cnttrue, (int) SIGLENBIT(siglen) - cnttrue); When ISALLTRUE, the code tries to calculate bit count from siglen, but siglen is 0 in this case. So maybe fix it in passing too... Best regards, Alexander
Attachment
17.06.2023 17:00, Alexander Lakhin wrote: > 29.05.2023 23:00, PG Bug reporting form wrote: >> The following bug has been logged on the website: >> >> Bug reference: 17950 > > I managed to reduce the reproducer to the following: > CREATE TABLE tst(t tsvector); > INSERT INTO tst SELECT array_to_string(array(SELECT 'a' || x::text FROM generate_series(1, 125) x), ' ')::tsvector > FROM generate_series(1, 3000); > INSERT INTO tst SELECT '' FROM generate_series(1, 100); > CREATE INDEX gistidx ON tst USING gist (t tsvector_ops(siglen=1)); > > > A trivial fix for the issue is attached. > I can also propose a regression test addition that demonstrates the valgrind complaint and also the output of gtsvectorout() for the case ISALLTRUE and the opposite. This addition increases the duration of `make check -C contrib/pageinspect` under valgrind by 7-8 seconds for me: ok 5 - gist 2496 ms -> ok 5 - gist 9890 ms In absence of any objections or other propositions, I'm inclined to register this bugfix on the commitfest. Best regards, Alexander
Attachment
Alexander Lakhin <exclusion@gmail.com> writes: > I can also propose a regression test addition that demonstrates the valgrind > complaint and also the output of gtsvectorout() for the case ISALLTRUE and > the opposite. OK, but ... > This addition increases the duration of `make check -C contrib/pageinspect` > under valgrind by 7-8 seconds for me: [ ie, more than triple its previous runtime ] ... that seems completely unacceptable cost-wise. I'd be inclined to commit the fix without a supporting test case, instead of that. Given that the misapplication of GETSIGN is causing an incorrect pointer to be passed to sizebitvec(), how come the error is not leading to outright wrong answers? I guess because it's in picksplit, the worst outcome normally is a poor choice of split, so maybe exhibiting wrong behavior in a detectable way is hard. > In absence of any objections or other propositions, I'm inclined to register > this bugfix on the commitfest. Please do that in any case, so we don't forget about it. regards, tom lane
Hello Tom, 13.08.2023 17:35, Tom Lane wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> I can also propose a regression test addition that demonstrates the valgrind >> complaint and also the output of gtsvectorout() for the case ISALLTRUE and >> the opposite. > OK, but ... > >> This addition increases the duration of `make check -C contrib/pageinspect` >> under valgrind by 7-8 seconds for me: > [ ie, more than triple its previous runtime ] Yes, unfortunately. > ... that seems completely unacceptable cost-wise. I'd be inclined > to commit the fix without a supporting test case, instead of that. I've tried to make that addition as valuable as possible in the context of this issue, but as it is too expensive (to be honest, I was bolstered by the gin test duration (~10 secs too)), then let's leave it aside. > Given that the misapplication of GETSIGN is causing an incorrect > pointer to be passed to sizebitvec(), how come the error is not > leading to outright wrong answers? I guess because it's in > picksplit, the worst outcome normally is a poor choice of split, > so maybe exhibiting wrong behavior in a detectable way is hard. (In fact, I stopped my previous research when I had come to the conclusion that generating data pattern needed to demonstrate wrong answers or at least an inefficient split require many more data rows than we can afford in a regression test.) >> In absence of any objections or other propositions, I'm inclined to register >> this bugfix on the commitfest. > Please do that in any case, so we don't forget about it. Thank you! Done: https://commitfest.postgresql.org/44/4498/ Best regards, Alexander
On Sun, Aug 13, 2023 at 09:00:01PM +0300, Alexander Lakhin wrote: > I've tried to make that addition as valuable as possible in the context of > this issue, but as it is too expensive (to be honest, I was bolstered by > the gin test duration (~10 secs too)), then let's leave it aside. I have applied 0001 down to 11 to get the basic fix in place, but the regression tests are really too expensive compared to the value they bring. Regarding the changes in gtsvectorout(), the output produced is indeed confusing when ISALLTRUE is set. - int siglen = GETSIGLEN(key); - int cnttrue = (ISALLTRUE(key)) ? SIGLENBIT(siglen) : sizebitvec(GETSIGN(key), siglen); + if (ISALLTRUE(key)) + sprintf(outbuf, "all true bits"); + else + { + int siglen = GETSIGLEN(key); + int cnttrue = (ISALLTRUE(key)) ? SIGLENBIT(siglen) : sizebitvec(GETSIGN(key), siglen); - sprintf(outbuf, SINGOUTSTR, cnttrue, (int) SIGLENBIT(siglen) - cnttrue); + sprintf(outbuf, SINGOUTSTR, cnttrue, (int) SIGLENBIT(siglen) - cnttrue); + } In the false branch of ISALLTRUE(key), why isn't cnttrue always calculated with sizebitvec()? It's also not something I would backpatch. That's confusing, for sure, but there is also the argument of keeping a consistent output in the stable branches. -- Michael
Attachment
Hi Mchael, Thank you for committing the fix! 04.09.2023 09:35, Michael Paquier wrote: > Regarding the changes in gtsvectorout(), the output produced is indeed > confusing when ISALLTRUE is set. > > - int siglen = GETSIGLEN(key); > - int cnttrue = (ISALLTRUE(key)) ? SIGLENBIT(siglen) : sizebitvec(GETSIGN(key), siglen); > + if (ISALLTRUE(key)) > + sprintf(outbuf, "all true bits"); > + else > + { > + int siglen = GETSIGLEN(key); > + int cnttrue = (ISALLTRUE(key)) ? SIGLENBIT(siglen) : sizebitvec(GETSIGN(key), siglen); > > - sprintf(outbuf, SINGOUTSTR, cnttrue, (int) SIGLENBIT(siglen) - cnttrue); > + sprintf(outbuf, SINGOUTSTR, cnttrue, (int) SIGLENBIT(siglen) - cnttrue); > + } > > In the false branch of ISALLTRUE(key), why isn't cnttrue always > calculated with sizebitvec()? Yes, that expression should be simpler there. Please look at the v2 attached. > It's also not something I would > backpatch. That's confusing, for sure, but there is also the argument > of keeping a consistent output in the stable branches. I agree. Let's not backpatch it. Best regards, Alexander
Attachment
On Mon, Sep 04, 2023 at 05:00:00PM +0300, Alexander Lakhin wrote: > 04.09.2023 09:35, Michael Paquier wrote: >> It's also not something I would >> backpatch. That's confusing, for sure, but there is also the argument >> of keeping a consistent output in the stable branches. > > I agree. Let's not backpatch it. Applied on HEAD, then. This item is now completely closed. -- Michael