Thread: PANIC: wrong buffer passed to visibilitymap_clear

PANIC: wrong buffer passed to visibilitymap_clear

From
"王伟(学弈)"
Date:
Hi,
I recently find this problem while testing PG14 with sysbench.
Then I look through the emails from pgsql-hackers and find a previous similary bug which is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us. But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is already patched to PG14.

The following is the stack of coredump.
#0  0x00007f2ba9bfa277 in raise () from /lib64/libc.so.6
#1  0x00007f2ba9bfb968 in abort () from /lib64/libc.so.6
#2  0x00000000009416bb in errfinish () at elog.c:717
#3  0x000000000049518b in visibilitymap_clear (rel=<optimized out>, heapBlk=<optimized out>, buf=<optimized out>, flags=<optimized out>, polar_record=<optimized out>) at visibilitymap.c:142
#4  0x000000000054c2df in heap_update () at heapam.c:3948
#5  0x0000000000555538 in heapam_tuple_update (relation=0x7f2b909cfef8, otid=0x7fff930c577a, slot=0x2612528, cid=0, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7fff930c5690, lockmode=0x7fff930c5684, update_indexes=0x7fff930c5681)
    at heapam_handler.c:327
#6  0x00000000006e04f3 in table_tuple_update (update_indexes=0x7fff930c5681, lockmode=0x7fff930c5684, tmfd=0x7fff930c5690, wait=true, crosscheck=<optimized out>, snapshot=<optimized out>, cid=<optimized out>, slot=0x2612528, otid=0x26126d0,
    rel=0x7f2b909cfef8) at ../../../src/include/access/tableam.h:1509
#7  ExecUpdate () at nodeModifyTable.c:1785
#8  0x00000000006e0f2a in ExecModifyTable () at nodeModifyTable.c:2592
#9  0x00000000006b909c in ExecProcNode (node=0x24d4eb0) at ../../../src/include/executor/executor.h:257
#10 ExecutePlan (execute_once=<optimized out>, dest=0xaa52e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_UPDATE, use_parallel_mode=<optimized out>, planstate=0x24d4eb0, estate=0x24d4c08)
    at execMain.c:1553
#11 standard_ExecutorRun () at execMain.c:363
#12 0x00007f2babd893dd in pgss_ExecutorRun (queryDesc=0x25dcab8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1018
#13 0x00007f2babd816fa in explain_ExecutorRun (queryDesc=0x25dcab8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at auto_explain.c:334
#14 0x0000000000828ac8 in ProcessQuery (plan=0x25c9e88, sourceText=0x25dc9a8 "UPDATE sbtest8 SET k=k+1 WHERE id=$1", params=0x25dca28, queryEnv=0x0, dest=<optimized out>, qc=0x7fff930c5e40) at pquery.c:160
#15 0x00000000008294d8 in PortalRunMulti (portal=portal@entry=0x25697c8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xaa52e0 <donothingDR>, dest@entry=0x24a6568, altdest=0xaa52e0 <donothingDR>,
    altdest@entry=0x24a6568, qc=qc@entry=0x7fff930c5e40) at pquery.c:1277
#16 0x0000000000829929 in PortalRun () at pquery.c:797
#17 0x0000000000826f57 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x24a6158 "") at postgres.c:2306
#18 PostgresMain () at postgres.c:4826
#19 0x00000000007a1e8a in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4910
#20 BackendStartup (port=<optimized out>) at postmaster.c:4621
#21 ServerLoop () at postmaster.c:1823
#22 0x00000000007a2c4b in PostmasterMain () at postmaster.c:1488
#23 0x000000000050c5d0 in main (argc=3, argv=0x24a0f50) at main.c:209

I'm wondering whether there's another code path to lead this problem happened. Since, I take a deep dig via gdb which turns out that newbuffer is not euqal to buffer. In other words, the function RelationGetBufferForTuple must have been called just now.
Besides, why didn't we re-check the flag after RelationGetBufferForTuple was called?
But I'm confused about the heap_update and RelationGetBufferForTuple functions which are too long to understand for me. Can anyone give me a hand?

--
Best regards,
rogers.ww

Re: PANIC: wrong buffer passed to visibilitymap_clear

From
Tomas Vondra
Date:

On 7/22/22 10:22, 王伟(学弈) wrote:
> Hi,
> I recently find this problem while testing PG14 with sysbench.
> Then I look through the emails from pgsql-hackers and find a previous
> similary bug which
> is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us
> <https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us>.
> But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is
> already patched to PG14.
> 

Which PG14 version / commit is this, exactly? What sysbench parameters
did you use, how likely is hitting the issue?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



回复:Re: PANIC: wrong buffer passed to visibilitymap_clear

From
"王伟(学弈)"
Date:
On 7/22/22 18:06, Tomas Vondra wrote:
> Which PG14 version / commit is this, exactly? What sysbench parameters
> did you use, how likely is hitting the issue?
PG_VERSION is '14beta2'.
The head commit id is 'e1c1c30f635390b6a3ae4993e8cac213a33e6e3f'.
I have run these sysbench commands for  couple of days, but only two times to hit the issue.
These sysbench commands are:
prepare:
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_insert.lua prepare
parallel execution: 
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_insert.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*--pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_delete.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_read_write.lua run
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=* --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_update_index.lua run

--
regards,
rogers.ww
------------------------------------------------------------------
发件人:Tomas Vondra<tomas.vondra@enterprisedb.com>
日 期:2022年07月22日 18:06:21
收件人:王伟(学弈)<rogers.ww@alibaba-inc.com>; pgsql-hackers<pgsql-hackers@lists.postgresql.org>
主 题:Re: PANIC: wrong buffer passed to visibilitymap_clear



On 7/22/22 10:22, 王伟(学弈) wrote:
> Hi,
> I recently find this problem while testing PG14 with sysbench.
> Then I look through the emails from pgsql-hackers and find a previous
> similary bug which
> is https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us
> <https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us>.
> But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is
> already patched to PG14.


Which PG14 version / commit is this, exactly? What sysbench parameters
did you use, how likely is hitting the issue?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 回复:Re: PANIC: wrong buffer passed to visibilitymap_clear

From
Tomas Vondra
Date:
On 7/22/22 14:17, 王伟(学弈) wrote:
> On 7/22/22 18:06, Tomas Vondra wrote:
>> Which PG14 version / commit is this, exactly? What sysbench parameters
>> did you use, how likely is hitting the issue?
> PG_VERSION is '14beta2'.
> The head commit id is 'e1c1c30f635390b6a3ae4993e8cac213a33e6e3f'.

Why not current REL_14_STABLE? 14beta2 is pretty old, and while I
haven't checked, perhaps this was already fixed since then.

> I have run these sysbench commands for  couple of days, but only two times to hit the issue.
> These sysbench commands are:
> prepare:
>
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=*
>
--threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=* --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01"
> ./src/lua/oltp_insert.lua prepare
> parallel execution: 
>
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=*
> --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*
> --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01"
> ./src/lua/oltp_insert.lua run
>
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=*
>
--threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*--pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01"
> ./src/lua/oltp_delete.lua run
>
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=*
> --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*
> --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_read_write.lua run
>
sysbench --tables=10 --table-size=1000000 --db-ps-mode=auto --pgsql-password=* --time=72000 --db-driver=pgsql --pgsql-port=*
> --threads=50 --thread-init-timeout=3000 --report-interval=5 --pgsql-user=* --pgsql-host=*
> --pgsql-db=* --events=0 --pgsql-ignore-errors="PX000,58M01" ./src/lua/oltp_update_index.lua run
> 

Thanks. Not sure I'll be able to do such long sysbench runs, though. Can
you try reproducing this with current REL_14_STABLE? I wonder if dumping
the WAL (using pg_waldump) might tell us more about what happened.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PANIC: wrong buffer passed to visibilitymap_clear

From
Peter Geoghegan
Date:
On Fri, Jul 22, 2022 at 1:22 AM 王伟(学弈) <rogers.ww@alibaba-inc.com> wrote:
> I recently find this problem while testing PG14 with sysbench.

The line numbers from your stack trace don't match up with
REL_14_STABLE. Is this actually a fork of Postgres 14? (Oh, looks like
it's an old beta release.)

> Then I look through the emails from pgsql-hackers and find a previous similary bug which is
https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us.But the bugfix
commit(34f581c39e97e2ea237255cf75cccebccc02d477)is already patched to PG14. 

It does seem possible that there is another similar bug somewhere --
another case where we were protected by the fact that VACUUM acquired
a full cleanup lock (not just an exclusive buffer lock) during its
second heap pass. That changed in Postgres 14 (commit 8523492d4e). But
I really don't know -- almost anything is possible.

> I'm wondering whether there's another code path to lead this problem happened. Since, I take a deep dig via gdb which
turnsout that newbuffer is not euqal to buffer. In other words, the function RelationGetBufferForTuple must have been
calledjust now. 
> Besides, why didn't we re-check the flag after RelationGetBufferForTuple was called?

Recheck what flag? And at what point? It's not easy to figure this out
from your stack trace, because of the line number issues.

It would also be helpful if you told us about the specific table
involved. Though the important thing (the essential thing) is to test
today's REL_14_STABLE. There have been *lots* of bug fixes since
Postgres 14 beta2 was current.

--
Peter Geoghegan



Re: PANIC: wrong buffer passed to visibilitymap_clear

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> It would also be helpful if you told us about the specific table
> involved. Though the important thing (the essential thing) is to test
> today's REL_14_STABLE. There have been *lots* of bug fixes since
> Postgres 14 beta2 was current.

Yeah.  To be blunt, you're wasting your time and ours by testing
a year-old beta version.  The odds are respectable that the problem
is already fixed.  Even if it's not, the version skew between
what you are looking at and what we are looking at creates lots of
confusion.

            regards, tom lane