Thread: False "pg_serial": apparent wraparound” in logs

False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:

Hi,

 

I Recently encountered a situation on the field in which the message

“could not truncate directory "pg_serial": apparent wraparound”

was logged even through there was no danger of wraparound. This

was on a brand new cluster and only took a few minutes to see

the message in the logs.

 

Reading on some history of this error message, it appears that there

was work done to improve SLRU truncation and associated wraparound

log messages [1]. The attached repro on master still shows that this message

can be logged incorrectly.

 

The repro runs updates with 90 threads in serializable mode and kicks

off a “long running” select on the same table in serializable mode.

 

As soon as the long running select commits, the next checkpoint fails

to truncate the SLRU and logs the error message.

 

Besides the confusing log message, there may also also be risk with

pg_serial getting unnecessarily bloated and depleting the disk space.

 

Is this a bug?

 

[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

 

Regards,

 

Sami Imseih

Amazon Web Services (AWS)

 

 

 

 

 

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:

Hi,

 

I dug a bit into this and what looks to be happening is the comparison

of the page containing the latest cutoff xid could falsely be reported

as in the future of the last page number because the latest

page number of the Serial slru is only set when the page is

initialized [1].

 

So under the correct conditions, such as in the repro, the serializable

XID has moved past the last page number, therefore to the next checkpoint

which triggers a CheckPointPredicate, it will appear that the slru

has wrapped around.

 

It seems what may be needed here is to advance the

latest_page_number during SerialSetActiveSerXmin and if

we are using the SLRU. See below:

 

 

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c

index 1af41213b4..6946ed21b4 100644

--- a/src/backend/storage/lmgr/predicate.c

+++ b/src/backend/storage/lmgr/predicate.c

@@ -992,6 +992,9 @@ SerialSetActiveSerXmin(TransactionId xid)

 

        serialControl->tailXid = xid;

 

+       if (serialControl->headPage > 0)

+               SerialSlruCtl->shared->latest_page_number = SerialPage(xid);

+

        LWLockRelease(SerialSLRULock);

}

 

[1] https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L306

 

Regards,

 

Sami

 

From: "Imseih (AWS), Sami" <simseih@amazon.com>
Date: Tuesday, August 22, 2023 at 7:56 PM
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Subject: False "pg_serial": apparent wraparound” in logs

 

Hi,

 

I Recently encountered a situation on the field in which the message

“could not truncate directory "pg_serial": apparent wraparound”

was logged even through there was no danger of wraparound. This

was on a brand new cluster and only took a few minutes to see

the message in the logs.

 

Reading on some history of this error message, it appears that there

was work done to improve SLRU truncation and associated wraparound

log messages [1]. The attached repro on master still shows that this message

can be logged incorrectly.

 

The repro runs updates with 90 threads in serializable mode and kicks

off a “long running” select on the same table in serializable mode.

 

As soon as the long running select commits, the next checkpoint fails

to truncate the SLRU and logs the error message.

 

Besides the confusing log message, there may also also be risk with

pg_serial getting unnecessarily bloated and depleting the disk space.

 

Is this a bug?

 

[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

 

Regards,

 

Sami Imseih

Amazon Web Services (AWS)

 

 

 

 

 

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:

Attached a patch with a new CF entry: https://commitfest.postgresql.org/44/4516/

 

Regards,

 

Sami Imseih

Amazon Web Services (AWS)

 

 

 

 

 

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
Heikki Linnakangas
Date:
On 25/08/2023 07:29, Imseih (AWS), Sami wrote:
> diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
> index 1af41213b4..7e7be3b885 100644
> --- a/src/backend/storage/lmgr/predicate.c
> +++ b/src/backend/storage/lmgr/predicate.c
> @@ -992,6 +992,13 @@ SerialSetActiveSerXmin(TransactionId xid)
>  
>      serialControl->tailXid = xid;
>  
> +    /*
> +     * If the SLRU is being used, set the latest page number to
> +     * the current tail xid.
> +     */
> +    if (serialControl->headPage > 0)
> +        SerialSlruCtl->shared->latest_page_number = SerialPage(serialControl->tailXid);
> +
>      LWLockRelease(SerialSLRULock);
>  }

I don't really understand what exactly the problem is, or how this fixes 
it. But this doesn't feel right:

Firstly, isn't headPage == 0 also a valid value? We initialize headPage 
to -1 when it's not in use.

Secondly, shouldn't we set it to the page corresponding to headXid 
rather than tailXid.

Thirdly, I don't think this code should have any business setting 
latest_page_number directly. latest_page_number is set in 
SimpleLruZeroPage(). Are we missing a call to SimpleLruZeroPage() somewhere?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:
> I don't really understand what exactly the problem is, or how this fixes
> it. But this doesn't feel right:

As the repro show, false reports of "pg_serial": apparent wraparound”
messages are possible. For a very busy system which checkpoints frequently
and heavy usage of serializable isolation, this will flood the error logs, and 
falsely cause alarm to the user. It also prevents the SLRU from being
truncated.

In my repro, I end up seeing, even though the SLRU does not wraparound.
" LOG: could not truncate directory "pg_serial": apparent wraparound"

> Firstly, isn't headPage == 0 also a valid value? We initialize headPage
> to -1 when it's not in use.

Yes. You are correct. This is wrong.

> Secondly, shouldn't we set it to the page corresponding to headXid
> rather than tailXid.

> Thirdly, I don't think this code should have any business setting
> latest_page_number directly. latest_page_number is set in
> SimpleLruZeroPage(). 

Correct, after checking again, I do realize the patch is wrong.

> Are we missing a call to SimpleLruZeroPage() somewhere?

That is a good point.

The initial idea was to advance the latest_page_number 
during SerialSetActiveSerXmin, but the initial approach is 
obviously wrong.

When SerialSetActiveSerXmin is called for a new active
serializable xmin, and at that point we don't need to keep any
any earlier transactions, should SimpleLruZeroPage be called
to ensure there is a target page for the xid?

I tried something like below, which fixes my repro, by calling
SimpleLruZeroPage at the end of SerialSetActiveSerXmin.

@@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
 static void
 SerialSetActiveSerXmin(TransactionId xid)
 {
+       int targetPage = SerialPage(xid);
+
        LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
 
        /*
@@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)
 
        serialControl->tailXid = xid;
 
+       if (serialControl->headPage != targetPage)
+               SimpleLruZeroPage(SerialSlruCtl, targetPage);
+
        LWLockRelease(SerialSLRULock);
 }

Regards,

Sami








Re: False "pg_serial": apparent wraparound” in logs

From
Heikki Linnakangas
Date:
On 30/09/2023 02:16, Imseih (AWS), Sami wrote:
> The initial idea was to advance the latest_page_number
> during SerialSetActiveSerXmin, but the initial approach is
> obviously wrong.

That approach at high level could work, a

> When SerialSetActiveSerXmin is called for a new active
> serializable xmin, and at that point we don't need to keep any
> any earlier transactions, should SimpleLruZeroPage be called
> to ensure there is a target page for the xid?
> 
> I tried something like below, which fixes my repro, by calling
> SimpleLruZeroPage at the end of SerialSetActiveSerXmin.
> 
> @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
>   static void
>   SerialSetActiveSerXmin(TransactionId xid)
>   {
> +       int targetPage = SerialPage(xid);
> +
>          LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
>   
>          /*
> @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)
>   
>          serialControl->tailXid = xid;
>   
> +       if (serialControl->headPage != targetPage)
> +               SimpleLruZeroPage(SerialSlruCtl, targetPage);
> +
>          LWLockRelease(SerialSLRULock);
>   }

No, that's very wrong too. You are zeroing the page containing the 
oldest XID that's still needed. That page still contains important 
information. It might work if you zero the previous page, but I think 
you need to do a little more than that. (I wish we had tests that would 
catch that.)

The crux of the problem is that 'tailXid' can advance past 'headXid'. I 
was bit surprised by that, but I think it's by design. I wish it was 
called out explicitly in a comment though. The code mostly handles that 
fine, except that it confuses the "apparent wraparound" check.

'tailXid' is the oldest XID that we might still need to look up in the 
SLRU, based on the transactions that are still active, and 'headXid' is 
the newest XID that has been written out to the SLRU. But we only write 
an XID out to the SLRU and advance headXid if the shared memory data 
structure fills up. So it's possible that as old transactions age out, 
we advance 'tailXid' past 'headXid'.

SerialAdd() tolerates tailXid > headXid. It will zero out all the pages 
between the old headXid and tailXid, even though no lookups can occur on 
those pages. That's unnecessary but harmless.

I think the smallest fix here would be to change CheckPointPredicate() 
so that if tailPage > headPage, pass headPage to SimpleLruTruncate() 
instead of tailPage. Or perhaps it should go into the "The SLRU is no 
longer needed" codepath in that case. If tailPage > headPage, the SLRU 
isn't needed at the moment.

In addition to that, we could change SerialAdd() to not zero out the 
pages between old headXid and tailXid unnecessarily, but that's more of 
an optimization than bug fix.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: False "pg_serial": apparent wraparound” in logs

From
Michael Paquier
Date:
On Sun, Oct 01, 2023 at 09:43:21PM +0300, Heikki Linnakangas wrote:
> I think the smallest fix here would be to change CheckPointPredicate() so
> that if tailPage > headPage, pass headPage to SimpleLruTruncate() instead of
> tailPage. Or perhaps it should go into the "The SLRU is no longer needed"
> codepath in that case. If tailPage > headPage, the SLRU isn't needed at the
> moment.

Good idea.  Indeed that should be good and simple enough for the
back-branches, at quick glance.
--
Michael

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:
> I think the smallest fix here would be to change CheckPointPredicate()
> so that if tailPage > headPage, pass headPage to SimpleLruTruncate()
> instead of tailPage. Or perhaps it should go into the "The SLRU is no
> longer needed" codepath in that case. If tailPage > headPage, the SLRU
> isn't needed at the moment.

I spent sometime studying this and it appears to be a good approach. 

Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to 
SimpleLruTruncate is already being done when the tailXid is not a valid XID. 
I added an additional condition to make sure that the tailPage proceeds the headPage
as well. 

Attached is v2 of the patch.

> In addition to that, we could change SerialAdd() to not zero out the
> pages between old headXid and tailXid unnecessarily, but that's more of
> an optimization than bug fix.

Yes, I did notice that in my debugging, but will not address this in the current patch.


Regards,

Sami


Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:
Correct a typo in my last message: 

Instead of:
“ I added an additional condition to make sure that the tailPage proceeds the headPage
as well. “

It should be:
“ I added an additional condition to make sure that the tailPage precedes the headPage
as well. ”


Regards,

Sami

> On Oct 5, 2023, at 6:29 PM, Imseih (AWS), Sami <simseih@amazon.com> wrote:
> 
> I added an additional condition to make sure that the tailPage proceeds the headPage
> as well.

Re: False "pg_serial": apparent wraparound” in logs

From
Michael Paquier
Date:
On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote:
> I spent sometime studying this and it appears to be a good approach.
>
> Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to
> SimpleLruTruncate is already being done when the tailXid is not a valid XID.
> I added an additional condition to make sure that the tailPage proceeds the headPage
> as well.
>
> Attached is v2 of the patch.

Thanks for the updated patch.  I have begun looking at what you have
here.
--
Michael

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
Michael Paquier
Date:
On Thu, Oct 05, 2023 at 11:28:02PM +0000, Imseih (AWS), Sami wrote:
> I spent sometime studying this and it appears to be a good approach.
>
> Passing the cutoff page as headPage (SLRU not needed code path ) instead of the tailPage to
> SimpleLruTruncate is already being done when the tailXid is not a valid XID.
> I added an additional condition to make sure that the tailPage proceeds the headPage
> as well.

I have been studying the whole area, and these threads from 2011 have
come to me, with two separate attempts:
https://www.postgresql.org/message-id/4DF105ED.1050800@enterprisedb.com
https://www.postgresql.org/message-id/4D8F54E6020000250003BD16%40gw.wicourts.gov

Back then, we were pretty much OK with the amount of space that could
be wasted even in this case.  Actually, how much space are we talking
about here when a failed truncation happens?  As this is basically
harmless, still leads to a confusing message, do we really need a
backpatch here?

Anyway, it looks like you're right, we don't really need the SLRU once
the tail is ahead of the tail because the SLRU has wrapped around due
to the effect of transactions aging out, so making the truncation a
bit smarter should be OK.

+    /*
+     * Check if the tailXid is valid and that the tailPage is not ahead of
+     * the headPage, otherwise the SLRU is no longer needed.
+     */

Hmm.  This doesn't seem enough.  Shouldn't we explain at least in
which scenarios the tail can get ahead of the head (aka at least
with long running transactions that make the SLRU wrap-around)?
Except if I am missing something, there is no explanation of that in
predicate.c.
--
Michael

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:
Sorry for the delay in response.

> Back then, we were pretty much OK with the amount of space that could
> be wasted even in this case. Actually, how much space are we talking
> about here when a failed truncation happens? 

It is a transient waste in space as it will eventually clean up.

> As this is basically
> harmless, still leads to a confusing message, 

Correct, and especially because the message has
"wraparound" in the text.

> do we really need a backpatch here?

No, I don't think a backpatch is necessary.


> Anyway, it looks like you're right, we don't really need the SLRU once
> the tail is ahead of the tail because the SLRU has wrapped around due
> to the effect of transactions aging out, so making the truncation a
> bit smarter should be OK.

I assume you meant " the tail is ahead of the head".

SummarizeOldestCommittedSxact advances the headXid, but if we
checkpoint before this is called, then the tail could be ahead. The tail is
advanced by SerialSetActiveSerXmin whenever there is a new serializable
transaction.


> Hmm. This doesn't seem enough. Shouldn't we explain at least in
> which scenarios the tail can get ahead of the head (aka at least
> with long running transactions that make the SLRU wrap-around)?
> Except if I am missing something, there is no explanation of that in
> predicate.c.

After looking at this a bit more, I don't think the previous rev is correct.
We should not fall through to the " The SLRU is no longer needed." Which
also sets the headPage to invalid. We should only truncate up to the
head page.

Please see attached v3.


Regards,

Sami




Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
Michael Paquier
Date:
On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote:
>> Anyway, it looks like you're right, we don't really need the SLRU once
>> the tail is ahead of the tail because the SLRU has wrapped around due
>> to the effect of transactions aging out, so making the truncation a
>> bit smarter should be OK.
>
> I assume you meant " the tail is ahead of the head".

Damn fingers on a keyboard who don't know how to type.

>> Hmm. This doesn't seem enough. Shouldn't we explain at least in
>> which scenarios the tail can get ahead of the head (aka at least
>> with long running transactions that make the SLRU wrap-around)?
>> Except if I am missing something, there is no explanation of that in
>> predicate.c.
>
> After looking at this a bit more, I don't think the previous rev is correct.
> We should not fall through to the " The SLRU is no longer needed." Which
> also sets the headPage to invalid. We should only truncate up to the
> head page.

Seems correct to me.  Or this would count as if the SLRU is not in
use, but it's being used.
--
Michael

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
Michael Paquier
Date:
On Mon, Oct 16, 2023 at 04:58:31PM +0900, Michael Paquier wrote:
> On Sat, Oct 14, 2023 at 07:29:54PM +0000, Imseih (AWS), Sami wrote:
>> After looking at this a bit more, I don't think the previous rev is correct.
>> We should not fall through to the " The SLRU is no longer needed." Which
>> also sets the headPage to invalid. We should only truncate up to the
>> head page.
>
> Seems correct to me.  Or this would count as if the SLRU is not in
> use, but it's being used.

So, I've spent more time on that and applied the simplification today,
doing as you have suggested to use the head page rather than the tail
page when the tail XID is ahead of the head XID, but without disabling
the whole.  I've simplified a bit the code and the comments, though,
while on it (some renames and a slight refactoring of tailPage, for
example).
--
Michael

Attachment

Re: False "pg_serial": apparent wraparound” in logs

From
"Imseih (AWS), Sami"
Date:
> So, I've spent more time on that and applied the simplification today,
> doing as you have suggested to use the head page rather than the tail
> page when the tail XID is ahead of the head XID, but without disabling
> the whole. I've simplified a bit the code and the comments, though,
> while on it (some renames and a slight refactoring of tailPage, for
> example).
> --
> Michael

Thank you!

Regards,

Sami