Re: VM corruption on standby - Mailing list pgsql-hackers
From | Alexander Korotkov |
---|---|
Subject | Re: VM corruption on standby |
Date | |
Msg-id | CAPpHfduMpS42=kOcLf7uaQr+oypLGjVpHo01enX=zfGNVYo8Rg@mail.gmail.com |
In response to | Re: VM corruption on standby (Kirill Reshke <reshkekirill@gmail.com>) |
Responses | Re: VM corruption on standby |
List | pgsql-hackers |
On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers!
> >
> > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
> > At Yandex we had several incidents with a corrupted VM, and at pgconf.dev colleagues from AWS confirmed that they had seen something similar too.
> > So I toyed around and accidentally wrote a test that reproduces $subj.
> >
> > I think the corruption happens as follows:
> > 0. we create a table with one frozen tuple
> > 1. the next heap_insert() clears the VM bit and hangs immediately; nothing has been logged yet
> > 2. the VM buffer is flushed to disk by the checkpointer or bgwriter
> > 3. the primary is killed with -9
> > now we have a page that is ALL_VISIBLE\ALL_FROZEN on standby, but has clear VM bits on primary
> > 4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> > 5. pg_visibility detects corruption
> >
> > Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:
> >
> >     /* Clear only the all-frozen bit on visibility map if needed */
> >     if (PageIsAllVisible(page) &&
> >         visibilitymap_clear(relation, block, vmbuffer,
> >                             VISIBILITYMAP_ALL_FROZEN))
> >         cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
> >
> > Anyway, the test reproduces corruption of both bits. It also reproduces selecting deleted data on the standby.
> >
> > The test is not intended to be committed once we fix the problem, so some waits are simulated with sleep(1), and the test is placed in modules/test_slru, where it was easier to write. But if we ever want something like this, I can design a less hacky and probably more generic version.
> >
> > Thanks!
> >
> >
> > Best regards, Andrey Borodin.
> >
> >
> >
>
> The attached patch reproduces the same problem without any standby node.
> CHECKPOINT somehow manages to flush the heap page when the instance is
> killed with kill -9. As a result, the heap and VM pages are inconsistent:
>
> ```
> reshke=# select * from pg_visibility('x');
> blkno | all_visible | all_frozen | pd_all_visible
> -------+-------------+------------+----------------
> 0 | t | t | f
> (1 row)
> ```
>
> Notice that I moved the INJECTION point one line above visibilitymap_clear.
> Without this change, the behaviour still reproduces, but much less
> frequently.
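
For readers following the quoted ALL_FROZEN sequence, here is a toy Python model of it. This is a sketch under stated assumptions, not PostgreSQL code: `Node`, `vm_clear_all_frozen`, `lock_tuple`, `replay` and `simulate` are hypothetical names invented for illustration, and each node is reduced to one heap page flag plus two VM bits. The only behaviour taken from the thread is that the "all frozen cleared" WAL flag is set only when clearing the VM bit actually changed something.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy on-disk state of one heap block and its visibility map bits."""
    pd_all_visible: bool = True   # heap page header flag
    vm_all_visible: bool = True   # visibility map bit
    vm_all_frozen: bool = True    # visibility map bit


def vm_clear_all_frozen(node):
    """Clear the all-frozen bit; report True only if it was set before."""
    was_set = node.vm_all_frozen
    node.vm_all_frozen = False
    return was_set


def lock_tuple(primary, wal):
    """Mirror the quoted check: the WAL flag depends on the VM state."""
    cleared_all_frozen = primary.pd_all_visible and vm_clear_all_frozen(primary)
    wal.append({"op": "lock", "all_frozen_cleared": cleared_all_frozen})


def replay(standby, wal):
    """The standby clears its bit only when a WAL record tells it to."""
    for record in wal:
        if record["all_frozen_cleared"]:
            standby.vm_all_frozen = False


def simulate():
    # Step 0: one frozen tuple, fully replicated; all bits set on both nodes.
    primary, standby = Node(), Node()
    wal = []

    # Steps 1-3: an interrupted operation clears the VM bits, the VM buffer
    # reaches disk, and the primary is killed before any WAL is written.
    # Assumption of this model: the heap page change is lost, so
    # pd_all_visible is still set after restart.
    primary.vm_all_visible = False
    primary.vm_all_frozen = False

    # Step 4: a later operation finds the VM bit already clear, so the
    # "cleared" flag never makes it into WAL.
    lock_tuple(primary, wal)
    replay(standby, wal)
    return primary, standby


if __name__ == "__main__":
    primary, standby = simulate()
    print("primary:", primary)
    print("standby:", standby)
```

In this model the primary ends up with the all-frozen bit clear while the standby still has it set, which is the divergence that pg_visibility reports in step 5.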
BTW, I've tried this patch on the current master, where bc22dc0e0d was reverted, and it fails for me:
t/001_multixact.pl .. 1/?
# Failed test 'pg_check_frozen() observes corruption'
# at t/001_multixact.pl line 102.
# got: '(0,2)
# (0,3)
# (0,4)'
# expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at t/001_multixact.pl line 103.
# got: '(0,2)
# (0,4)'
# expected: ''
# Failed test 'deleted data returned by select'
# at t/001_multixact.pl line 104.
# got: '2'
# expected: ''
# Looks like you failed 3 tests of 3.
t/001_multixact.pl .. Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/3 subtests
Test Summary Report
-------------------
t/001_multixact.pl (Wstat: 768 Tests: 3 Failed: 3)
Failed tests: 1-3
Non-zero exit status: 3
Files=1, Tests=3, 2 wallclock secs ( 0.01 usr 0.00 sys + 0.09 cusr 0.27 csys = 0.37 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:452: check] Error 1
Could you please recheck?
------
Regards,
Alexander Korotkov
Supabase