Thread: False "pg_serial": apparent wraparound” in logs
Hi,
I Recently encountered a situation on the field in which the message
“could not truncate directory "pg_serial": apparent wraparound”
was logged even through there was no danger of wraparound. This
was on a brand new cluster and only took a few minutes to see
the message in the logs.
Reading on some history of this error message, it appears that there
was work done to improve SLRU truncation and associated wraparound
log messages [1]. The attached repro on master still shows that this message
can be logged incorrectly.
The repro runs updates with 90 threads in serializable mode and kicks
off a “long running” select on the same table in serializable mode.
As soon as the long running select commits, the next checkpoint fails
to truncate the SLRU and logs the error message.
Besides the confusing log message, there may also also be risk with
pg_serial getting unnecessarily bloated and depleting the disk space.
Is this a bug?
[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
Regards,
Sami Imseih
Amazon Web Services (AWS)
Attachment
Hi,
I dug a bit into this and what looks to be happening is the comparison
of the page containing the latest cutoff xid could falsely be reported
as in the future of the last page number because the latest
page number of the Serial slru is only set when the page is
initialized [1].
So under the correct conditions, such as in the repro, the serializable
XID has moved past the last page number, therefore to the next checkpoint
which triggers a CheckPointPredicate, it will appear that the slru
has wrapped around.
It seems what may be needed here is to advance the
latest_page_number during SerialSetActiveSerXmin and if
we are using the SLRU. See below:
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..6946ed21b4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -992,6 +992,9 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
+ if (serialControl->headPage > 0)
+ SerialSlruCtl->shared->latest_page_number = SerialPage(xid);
+
LWLockRelease(SerialSLRULock);
}
[1] https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L306
Regards,
Sami
From: "Imseih (AWS), Sami" <simseih@amazon.com>
Date: Tuesday, August 22, 2023 at 7:56 PM
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: False "pg_serial": apparent wraparound” in logs
Hi,
I Recently encountered a situation on the field in which the message
“could not truncate directory "pg_serial": apparent wraparound”
was logged even through there was no danger of wraparound. This
was on a brand new cluster and only took a few minutes to see
the message in the logs.
Reading on some history of this error message, it appears that there
was work done to improve SLRU truncation and associated wraparound
log messages [1]. The attached repro on master still shows that this message
can be logged incorrectly.
The repro runs updates with 90 threads in serializable mode and kicks
off a “long running” select on the same table in serializable mode.
As soon as the long running select commits, the next checkpoint fails
to truncate the SLRU and logs the error message.
Besides the confusing log message, there may also also be risk with
pg_serial getting unnecessarily bloated and depleting the disk space.
Is this a bug?
[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
Regards,
Sami Imseih
Amazon Web Services (AWS)
Attached a patch with a new CF entry: https://commitfest.postgresql.org/44/4516/
Regards,
Sami Imseih
Amazon Web Services (AWS)
Attachment
On 25/08/2023 07:29, Imseih (AWS), Sami wrote: > diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c > index 1af41213b4..7e7be3b885 100644 > --- a/src/backend/storage/lmgr/predicate.c > +++ b/src/backend/storage/lmgr/predicate.c > @@ -992,6 +992,13 @@ SerialSetActiveSerXmin(TransactionId xid) > > serialControl->tailXid = xid; > > + /* > + * If the SLRU is being used, set the latest page number to > + * the current tail xid. > + */ > + if (serialControl->headPage > 0) > + SerialSlruCtl->shared->latest_page_number = SerialPage(serialControl->tailXid); > + > LWLockRelease(SerialSLRULock); > } I don't really understand what exactly the problem is, or how this fixes it. But this doesn't feel right: Firstly, isn't headPage == 0 also a valid value? We initialize headPage to -1 when it's not in use. Secondly, shouldn't we set it to the page corresponding to headXid rather than tailXid. Thirdly, I don't think this code should have any business setting latest_page_number directly. latest_page_number is set in SimpleLruZeroPage(). Are we missing a call to SimpleLruZeroPage() somewhere? -- Heikki Linnakangas Neon (https://neon.tech)
> I don't really understand what exactly the problem is, or how this fixes > it. But this doesn't feel right: As the repro show, false reports of "pg_serial": apparent wraparound” messages are possible. For a very busy system which checkpoints frequently and heavy usage of serializable isolation, this will flood the error logs, and falsely cause alarm to the user. It also prevents the SLRU from being truncated. In my repro, I end up seeing, even though the SLRU does not wraparound. " LOG: could not truncate directory "pg_serial": apparent wraparound" > Firstly, isn't headPage == 0 also a valid value? We initialize headPage > to -1 when it's not in use. Yes. You are correct. This is wrong. > Secondly, shouldn't we set it to the page corresponding to headXid > rather than tailXid. > Thirdly, I don't think this code should have any business setting > latest_page_number directly. latest_page_number is set in > SimpleLruZeroPage(). Correct, after checking again, I do realize the patch is wrong. > Are we missing a call to SimpleLruZeroPage() somewhere? That is a good point. The initial idea was to advance the latest_page_number during SerialSetActiveSerXmin, but the initial approach is obviously wrong. When SerialSetActiveSerXmin is called for a new active serializable xmin, and at that point we don't need to keep any any earlier transactions, should SimpleLruZeroPage be called to ensure there is a target page for the xid? I tried something like below, which fixes my repro, by calling SimpleLruZeroPage at the end of SerialSetActiveSerXmin. @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid) static void SerialSetActiveSerXmin(TransactionId xid) { + int targetPage = SerialPage(xid); + LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE); /* @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid) serialControl->tailXid = xid; + if (serialControl->headPage != targetPage) + SimpleLruZeroPage(SerialSlruCtl, targetPage); + LWLockRelease(SerialSLRULock); } Regards, Sami
On 30/09/2023 02:16, Imseih (AWS), Sami wrote: > The initial idea was to advance the latest_page_number > during SerialSetActiveSerXmin, but the initial approach is > obviously wrong. That approach at high level could work, a > When SerialSetActiveSerXmin is called for a new active > serializable xmin, and at that point we don't need to keep any > any earlier transactions, should SimpleLruZeroPage be called > to ensure there is a target page for the xid? > > I tried something like below, which fixes my repro, by calling > SimpleLruZeroPage at the end of SerialSetActiveSerXmin. > > @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid) > static void > SerialSetActiveSerXmin(TransactionId xid) > { > + int targetPage = SerialPage(xid); > + > LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE); > > /* > @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid) > > serialControl->tailXid = xid; > > + if (serialControl->headPage != targetPage) > + SimpleLruZeroPage(SerialSlruCtl, targetPage); > + > LWLockRelease(SerialSLRULock); > } No, that's very wrong too. You are zeroing the page containing the oldest XID that's still needed. That page still contains important information. It might work if you zero the previous page, but I think you need to do a little more than that. (I wish we had tests that would catch that.) The crux of the problem is that 'tailXid' can advance past 'headXid'. I was bit surprised by that, but I think it's by design. I wish it was called out explicitly in a comment though. The code mostly handles that fine, except that it confuses the "apparent wraparound" check. 'tailXid' is the oldest XID that we might still need to look up in the SLRU, based on the transactions that are still active, and 'headXid' is the newest XID that has been written out to the SLRU. But we only write an XID out to the SLRU and advance headXid if the shared memory data structure fills up. So it's possible that as old transactions age out, we advance 'tailXid' past 'headXid'. SerialAdd() tolerates tailXid > headXid. It will zero out all the pages between the old headXid and tailXid, even though no lookups can occur on those pages. That's unnecessary but harmless. I think the smallest fix here would be to change CheckPointPredicate() so that if tailPage > headPage, pass headPage to SimpleLruTruncate() instead of tailPage. Or perhaps it should go into the "The SLRU is no longer needed" codepath in that case. If tailPage > headPage, the SLRU isn't needed at the moment. In addition to that, we could change SerialAdd() to not zero out the pages between old headXid and tailXid unnecessarily, but that's more of an optimization than bug fix. -- Heikki Linnakangas Neon (https://neon.tech)
On Sun, Oct 01, 2023 at 09:43:21PM +0300, Heikki Linnakangas wrote: > I think the smallest fix here would be to change CheckPointPredicate() so > that if tailPage > headPage, pass headPage to SimpleLruTruncate() instead of > tailPage. Or perhaps it should go into the "The SLRU is no longer needed" > codepath in that case. If tailPage > headPage, the SLRU isn't needed at the > moment. Good idea. Indeed that should be good and simple enough for the back-branches, at quick glance. -- Michael
Attachment
> I think the smallest fix here would be to change CheckPointPredicate() > so that if tailPage > headPage, pass headPage to SimpleLruTruncate() > instead of tailPage. Or perhaps it should go into the "The SLRU is no > longer needed" codepath in that case. If tailPage > headPage, the SLRU > isn't needed at the moment. I spent sometime studying this and it appears to be a good approach. Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to SimpleLruTruncate is already being done when the tailXid is not a valid XID. I added an additional condition to make sure that the tailPage proceeds the headPage as well. Attached is v2 of the patch. > In addition to that, we could change SerialAdd() to not zero out the > pages between old headXid and tailXid unnecessarily, but that's more of > an optimization than bug fix. Yes, I did notice that in my debugging, but will not address this in the current patch. Regards, Sami
Attachment
Correct a typo in my last message: Instead of: “ I added an additional condition to make sure that the tailPage proceeds the headPage as well. “ It should be: “ I added an additional condition to make sure that the tailPage precedes the headPage as well. ” Regards, Sami > On Oct 5, 2023, at 6:29 PM, Imseih (AWS), Sami <simseih@amazon.com> wrote: > > I added an additional condition to make sure that the tailPage proceeds the headPage > as well.
On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote: > I spent sometime studying this and it appears to be a good approach. > > Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to > SimpleLruTruncate is already being done when the tailXid is not a valid XID. > I added an additional condition to make sure that the tailPage proceeds the headPage > as well. > > Attached is v2 of the patch. Thanks for the updated patch. I have begun looking at what you have here. -- Michael
Attachment
On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote: > I spent sometime studying this and it appears to be a good approach. > > Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to > SimpleLruTruncate is already being done when the tailXid is not a valid XID. > I added an additional condition to make sure that the tailPage proceeds the headPage > as well. I have been studying the whole area, and these threads from 2011 have come to me, with two separate attempts: https://www.postgresql.org/message-id/4DF105ED.1050800@enterprisedb.com https://www.postgresql.org/message-id/4D8F54E6020000250003BD16%40gw.wicourts.gov Back then, we were pretty much OK with the amount of space that could be wasted even in this case. Actually, how much space are we talking about here when a failed truncation happens? As this is basically harmless, still leads to a confusing message, do we really need a backpatch here? Anyway, it looks like you're right, we don't really need the SLRU once the tail is ahead of the tail because the SLRU has wrapped around due to the effect of transactions aging out, so making the truncation a bit smarter should be OK. + /* + * Check if the tailXid is valid and that the tailPage is not ahead of + * the headPage, otherwise the SLRU is no longer needed. + */ Hmm. This doesn't seem enough. Shouldn't we explain at least in which scenarios the tail can get ahead of the head (aka at least with long running transactions that make the SLRU wrap-around)? Except if I am missing something, there is no explanation of that in predicate.c. -- Michael
Attachment
Sorry for the delay in response. > Back then, we were pretty much OK with the amount of space that could > be wasted even in this case. Actually, how much space are we talking > about here when a failed truncation happens? It is a transient waste in space as it will eventually clean up. > As this is basically > harmless, still leads to a confusing message, Correct, and especially because the message has "wraparound" in the text. > do we really need a backpatch here? No, I don't think a backpatch is necessary. > Anyway, it looks like you're right, we don't really need the SLRU once > the tail is ahead of the tail because the SLRU has wrapped around due > to the effect of transactions aging out, so making the truncation a > bit smarter should be OK. I assume you meant " the tail is ahead of the head". SummarizeOldestCommittedSxact advances the headXid, but if we checkpoint before this is called, then the tail could be ahead. The tail is advanced by SerialSetActiveSerXmin whenever there is a new serializable transaction. > Hmm. This doesn't seem enough. Shouldn't we explain at least in > which scenarios the tail can get ahead of the head (aka at least > with long running transactions that make the SLRU wrap-around)? > Except if I am missing something, there is no explanation of that in > predicate.c. After looking at this a bit more, I don't think the previous rev is correct. We should not fall through to the " The SLRU is no longer needed." Which also sets the headPage to invalid. We should only truncate up to the head page. Please see attached v3. Regards, Sami
Attachment
On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote: >> Anyway, it looks like you're right, we don't really need the SLRU once >> the tail is ahead of the tail because the SLRU has wrapped around due >> to the effect of transactions aging out, so making the truncation a >> bit smarter should be OK. > > I assume you meant " the tail is ahead of the head". Damn fingers on a keyboard who don't know how to type. >> Hmm. This doesn't seem enough. Shouldn't we explain at least in >> which scenarios the tail can get ahead of the head (aka at least >> with long running transactions that make the SLRU wrap-around)? >> Except if I am missing something, there is no explanation of that in >> predicate.c. > > After looking at this a bit more, I don't think the previous rev is correct. > We should not fall through to the " The SLRU is no longer needed." Which > also sets the headPage to invalid. We should only truncate up to the > head page. Seems correct to me. Or this would count as if the SLRU is not in use, but it's being used. -- Michael
Attachment
On Mon, Oct 16, 2023 at 04:58:31PM +0900, Michael Paquier wrote: > On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote: >> After looking at this a bit more, I don't think the previous rev is correct. >> We should not fall through to the " The SLRU is no longer needed." Which >> also sets the headPage to invalid. We should only truncate up to the >> head page. > > Seems correct to me. Or this would count as if the SLRU is not in > use, but it's being used. So, I've spent more time on that and applied the simplification today, doing as you have suggested to use the head page rather than the tail page when the tail XID is ahead of the head XID, but without disabling the whole. I've simplified a bit the code and the comments, though, while on it (some renames and a slight refactoring of tailPage, for example). -- Michael
Attachment
> So, I've spent more time on that and applied the simplification today, > doing as you have suggested to use the head page rather than the tail > page when the tail XID is ahead of the head XID, but without disabling > the whole. I've simplified a bit the code and the comments, though, > while on it (some renames and a slight refactoring of tailPage, for > example). > -- > Michael Thank you! Regards, Sami