Thread: MarkBufferDirtyHint() and LSN update
Please consider this scenario (race conditions): 1. FlushBuffer() has written the buffer but hasn't yet managed to clear the BM_DIRTY flag (however BM_JUST_DIRTIED could be cleared by now). 2. Another backend modified a hint bit and called MarkBufferDirtyHint(). 3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true (e.g. due to checksums enabled), new LSN is computed, however it's not assigned to the page because the buffer is still dirty: if (!(buf_state & BM_DIRTY)) { ... if (!XLogRecPtrIsInvalid(lsn)) PageSetLSN(page, lsn); } 4. MarkBufferDirtyHint() completes. 5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear BM_DIRTY because MarkBufferDirtyHint() has eventually set BM_JUST_DIRTIED. Thus the hint bit change itself will be written by the next call of FlushBuffer(). However page LSN is hasn't been updated so the requirement that WAL must be flushed first is not met. I think that PageSetLSN() should be called regardless BM_DIRTY. Do I miss any subtle detail? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Oct 30, 2019 at 02:44:18PM +0100, Antonin Houska wrote: >Please consider this scenario (race conditions): > >1. FlushBuffer() has written the buffer but hasn't yet managed to clear the >BM_DIRTY flag (however BM_JUST_DIRTIED could be cleared by now). > >2. Another backend modified a hint bit and called MarkBufferDirtyHint(). > >3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true >(e.g. due to checksums enabled), new LSN is computed, however it's not >assigned to the page because the buffer is still dirty: > > if (!(buf_state & BM_DIRTY)) > { > ... > > if (!XLogRecPtrIsInvalid(lsn)) > PageSetLSN(page, lsn); > } > >4. MarkBufferDirtyHint() completes. > >5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear >BM_DIRTY because MarkBufferDirtyHint() has eventually set >BM_JUST_DIRTIED. Thus the hint bit change itself will be written by the next >call of FlushBuffer(). However page LSN is hasn't been updated so the >requirement that WAL must be flushed first is not met. > >I think that PageSetLSN() should be called regardless BM_DIRTY. Do I miss any >subtle detail? > Isn't this prevented by locking of the buffer header? Both FlushBuffer and MarkBufferDirtyHint do obtain that lock. I see MarkBufferDirtyHint does a bit of work before, but that's related to BM_PERMANENT. If there really is a race condition, it shouldn't be that difficult to trigger it by adding a sleep or a breakpoint, I guess. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Wed, Oct 30, 2019 at 02:44:18PM +0100, Antonin Houska wrote: > >Please consider this scenario (race conditions): > > > >1. FlushBuffer() has written the buffer but hasn't yet managed to clear the > >BM_DIRTY flag (however BM_JUST_DIRTIED could be cleared by now). > > > >2. Another backend modified a hint bit and called MarkBufferDirtyHint(). > > > >3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true > >(e.g. due to checksums enabled), new LSN is computed, however it's not > >assigned to the page because the buffer is still dirty: > > > > if (!(buf_state & BM_DIRTY)) > > { > > ... > > > > if (!XLogRecPtrIsInvalid(lsn)) > > PageSetLSN(page, lsn); > > } > > > >4. MarkBufferDirtyHint() completes. > > > >5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear > >BM_DIRTY because MarkBufferDirtyHint() has eventually set > >BM_JUST_DIRTIED. Thus the hint bit change itself will be written by the next > >call of FlushBuffer(). However page LSN is hasn't been updated so the > >requirement that WAL must be flushed first is not met. > > > >I think that PageSetLSN() should be called regardless BM_DIRTY. Do I miss any > >subtle detail? > > > > Isn't this prevented by locking of the buffer header? Both FlushBuffer > and MarkBufferDirtyHint do obtain that lock. I see MarkBufferDirtyHint > does a bit of work before, but that's related to BM_PERMANENT. > > If there really is a race condition, it shouldn't be that difficult to > trigger it by adding a sleep or a breakpoint, I guess. Yes, I had verified it using gdb: inserted a row into a table, then attached gdb to checkpointer and stopped it when FlushBuffer() was about to call TerminateBufferIO(). Then, when scanning the table, I saw that MarkBufferDirtyHint() skipped the call of PageSetLSN(). Finally, before checkpointer unlocked the buffer header in TerminateBufferIO(), buf_state was 3553624066 ~ 0b11010011110100000000000000000010. With BM_DIRTY defined as #define BM_DIRTY (1U << 23) the flag appeared to be set. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Oct 30, 2019 at 9:43 AM Antonin Houska <ah@cybertec.at> wrote: > 5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear > BM_DIRTY because MarkBufferDirtyHint() has eventually set > BM_JUST_DIRTIED. Thus the hint bit change itself will be written by the next > call of FlushBuffer(). However page LSN is hasn't been updated so the > requirement that WAL must be flushed first is not met. This part confuses me. Are you saying that MarkBufferDirtyHint() can set BM_JUST_DIRTIED without setting BM_DIRTY? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 30, 2019 at 9:43 AM Antonin Houska <ah@cybertec.at> wrote: > > 5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear > > BM_DIRTY because MarkBufferDirtyHint() has eventually set > > BM_JUST_DIRTIED. Thus the hint bit change itself will be written by the next > > call of FlushBuffer(). However page LSN is hasn't been updated so the > > requirement that WAL must be flushed first is not met. > > This part confuses me. Are you saying that MarkBufferDirtyHint() can > set BM_JUST_DIRTIED without setting BM_DIRTY? No, I'm saying that MarkBufferDirtyHint() leaves BM_DIRTY set, as expected. However, if things happen in the order I described, then LSN returned by XLogSaveBufferForHint() is not assigned to the page. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Thu, Oct 31, 2019 at 09:43:47AM +0100, Antonin Houska wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Isn't this prevented by locking of the buffer header? Both FlushBuffer >> and MarkBufferDirtyHint do obtain that lock. I see MarkBufferDirtyHint >> does a bit of work before, but that's related to BM_PERMANENT. In FlushBuffer() you have a window after fetching the buffer state and the header is once unlocked until TerminateBufferIO() gets called (this would take again a lock on the buffer header), so it is logically possible for the buffer to be marked as dirty once again causing the update of the LSN on the page to be missed even if a backup block has been written in WAL. > Yes, I had verified it using gdb: inserted a row into a table, then attached > gdb to checkpointer and stopped it when FlushBuffer() was about to call > TerminateBufferIO(). Then, when scanning the table, I saw that > MarkBufferDirtyHint() skipped the call of PageSetLSN(). Finally, before > checkpointer unlocked the buffer header in TerminateBufferIO(), buf_state was > 3553624066 ~ 0b11010011110100000000000000000010. Small typo here: 11010011110100000000000000000010... > With BM_DIRTY defined as > > #define BM_DIRTY (1U << 23) > > the flag appeared to be set. ... Still, the bit is set here. Does something like the attached patch make sense? Reviews are welcome. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> wrote: > Does something like the attached patch make sense? Reviews are > welcome. This looks good to me. -- Antonin Houska Web: https://www.cybertec-postgresql.com
At Mon, 11 Nov 2019 10:03:14 +0100, Antonin Houska <ah@cybertec.at> wrote in > Michael Paquier <michael@paquier.xyz> wrote: > > Does something like the attached patch make sense? Reviews are > > welcome. > > This looks good to me. I have a qustion. The current code assumes that !BM_DIRTY means that the function is dirtying the page. But if !BM_JUST_DIRTIED, the function actually is going to re-dirty the page even if BM_DIRTY. If this is correct, the trigger for stats update is not !BM_DIRTY but !BM_JUST_DIRTIED, or the fact that we passed the line of XLogSaveBufferForHint() could be the trigger, regardless whether the LSN is valid or not. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Mon, 11 Nov 2019 10:03:14 +0100, Antonin Houska <ah@cybertec.at> wrote in > > Michael Paquier <michael@paquier.xyz> wrote: > > > Does something like the attached patch make sense? Reviews are > > > welcome. > > > > This looks good to me. > > I have a qustion. > > The current code assumes that !BM_DIRTY means that the function is > dirtying the page. But if !BM_JUST_DIRTIED, the function actually is > going to re-dirty the page even if BM_DIRTY. It makes sense to me. I can imagine the following: 1. FlushBuffer() cleared BM_JUST_DIRTIED, wrote the page to disk but hasn't yet cleared BM_DIRTY. 2. Another backend changed a hint bit in shared memory and called MarkBufferDirtyHint(). Thus FlushBuffer() missed the current hint bit change, so we need to dirty the page again. > If this is correct, the trigger for stats update is not !BM_DIRTY but > !BM_JUST_DIRTIED, or the fact that we passed the line of > XLogSaveBufferForHint() could be the trigger, regardless whether the > LSN is valid or not. I agree. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Mon, Nov 11, 2019 at 10:03:14AM +0100, Antonin Houska wrote: > This looks good to me. Actually, no, this is not good. I have been studying more the patch, and after stressing more this code path with a cluster having checksums enabled and shared_buffers at 1MB, I have been able to make a couple of page's LSNs go backwards with pgbench -s 100. The cause was simply that the page got flushed with a newer LSN than what was returned by XLogSaveBufferForHint() before taking the buffer header lock, so updating only the LSN for a non-dirty page was simply guarding against that. -- Michael
Attachment
On Wed, Nov 13, 2019 at 09:17:03PM +0900, Michael Paquier wrote: > Actually, no, this is not good. I have been studying more the patch, > and after stressing more this code path with a cluster having > checksums enabled and shared_buffers at 1MB, I have been able to make > a couple of page's LSNs go backwards with pgbench -s 100. The cause > was simply that the page got flushed with a newer LSN than what was > returned by XLogSaveBufferForHint() before taking the buffer header > lock, so updating only the LSN for a non-dirty page was simply > guarding against that. for the reference attached is the trick I have used, adding an extra assertion check in PageSetLSN(). -- Michael
Attachment
At Thu, 14 Nov 2019 12:01:29 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Wed, Nov 13, 2019 at 09:17:03PM +0900, Michael Paquier wrote: > > Actually, no, this is not good. I have been studying more the patch, > > and after stressing more this code path with a cluster having > > checksums enabled and shared_buffers at 1MB, I have been able to make > > a couple of page's LSNs go backwards with pgbench -s 100. The cause > > was simply that the page got flushed with a newer LSN than what was > > returned by XLogSaveBufferForHint() before taking the buffer header > > lock, so updating only the LSN for a non-dirty page was simply > > guarding against that. I thought of something like that but forgot to mention. But after thinking of that, couldn't the current code can do the same think even though with a far small probability? Still a session with smaller hint LSN but didn't entered the header lock section yet can be cut-in by another session with larger hint LSN. > for the reference attached is the trick I have used, adding an extra > assertion check in PageSetLSN(). I believe that all locations where Page-LSN is set is in the same buffer-ex-lock section with XLogInsert.. but not sure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Nov 11, 2019 at 10:03:14AM +0100, Antonin Houska wrote: > > This looks good to me. > > Actually, no, this is not good. I have been studying more the patch, > and after stressing more this code path with a cluster having > checksums enabled and shared_buffers at 1MB, I have been able to make > a couple of page's LSNs go backwards with pgbench -s 100. The cause > was simply that the page got flushed with a newer LSN than what was > returned by XLogSaveBufferForHint() before taking the buffer header > lock, so updating only the LSN for a non-dirty page was simply > guarding against that. Interesting. Now that I know about the problem, I could have reproduced it using gdb: MarkBufferDirtyHint() was called by 2 backends concurrently in such a way that the first backend generates the LSN, but before it manages to assign it to the page, another backend generates another LSN and sets it. Can't we just apply the attached diff on the top of your patch? Also I wonder how checksums helped you to discover the problem? Although I could simulate the setting of lower LSN, I could not see any broken checksum. And I wouldn't even expect that since FlushBuffer() copies the buffer into backend local memory before it calculates the checksum. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Antonin Houska <ah@cybertec.at> wrote: > Michael Paquier <michael@paquier.xyz> wrote: > > > On Mon, Nov 11, 2019 at 10:03:14AM +0100, Antonin Houska wrote: > > > This looks good to me. > > > > Actually, no, this is not good. I have been studying more the patch, > > and after stressing more this code path with a cluster having > > checksums enabled and shared_buffers at 1MB, I have been able to make > > a couple of page's LSNs go backwards with pgbench -s 100. The cause > > was simply that the page got flushed with a newer LSN than what was > > returned by XLogSaveBufferForHint() before taking the buffer header > > lock, so updating only the LSN for a non-dirty page was simply > > guarding against that. > > Interesting. Now that I know about the problem, I could have reproduced it > using gdb: MarkBufferDirtyHint() was called by 2 backends concurrently in such > a way that the first backend generates the LSN, but before it manages to > assign it to the page, another backend generates another LSN and sets it. > > Can't we just apply the attached diff on the top of your patch? I wanted to register the patch for the next CF so it's not forgotten, but see it's already there. Why have you set the status to "withdrawn"? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Fri, Dec 20, 2019 at 04:30:38PM +0100, Antonin Houska wrote: > I wanted to register the patch for the next CF so it's not forgotten, but see > it's already there. Why have you set the status to "withdrawn"? Because my patch was incorrect, and I did not make enough bandwidth to think more on the matter. I am actually not sure that what you are proposing is better.. If you wish to get that reviewed, could you add a new CF entry? -- Michael