Re: False "pg_serial": apparent wraparound” in logs - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: False "pg_serial": apparent wraparound” in logs
Date
Msg-id 8c9567a5-13cc-cefa-acec-228ae1db23dc@iki.fi
Whole thread Raw
In response to Re: False "pg_serial": apparent wraparound” in logs  ("Imseih (AWS), Sami" <simseih@amazon.com>)
Responses Re: False "pg_serial": apparent wraparound” in logs
Re: False "pg_serial": apparent wraparound” in logs
List pgsql-hackers
On 30/09/2023 02:16, Imseih (AWS), Sami wrote:
> The initial idea was to advance the latest_page_number
> during SerialSetActiveSerXmin, but the initial approach is
> obviously wrong.

That approach at high level could work, a

> When SerialSetActiveSerXmin is called for a new active
> serializable xmin, and at that point we don't need to keep any
> any earlier transactions, should SimpleLruZeroPage be called
> to ensure there is a target page for the xid?
> 
> I tried something like below, which fixes my repro, by calling
> SimpleLruZeroPage at the end of SerialSetActiveSerXmin.
> 
> @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
>   static void
>   SerialSetActiveSerXmin(TransactionId xid)
>   {
> +       int targetPage = SerialPage(xid);
> +
>          LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
>   
>          /*
> @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)
>   
>          serialControl->tailXid = xid;
>   
> +       if (serialControl->headPage != targetPage)
> +               SimpleLruZeroPage(SerialSlruCtl, targetPage);
> +
>          LWLockRelease(SerialSLRULock);
>   }

No, that's very wrong too. You are zeroing the page containing the 
oldest XID that's still needed. That page still contains important 
information. It might work if you zero the previous page, but I think 
you need to do a little more than that. (I wish we had tests that would 
catch that.)

The crux of the problem is that 'tailXid' can advance past 'headXid'. I 
was bit surprised by that, but I think it's by design. I wish it was 
called out explicitly in a comment though. The code mostly handles that 
fine, except that it confuses the "apparent wraparound" check.

'tailXid' is the oldest XID that we might still need to look up in the 
SLRU, based on the transactions that are still active, and 'headXid' is 
the newest XID that has been written out to the SLRU. But we only write 
an XID out to the SLRU and advance headXid if the shared memory data 
structure fills up. So it's possible that as old transactions age out, 
we advance 'tailXid' past 'headXid'.

SerialAdd() tolerates tailXid > headXid. It will zero out all the pages 
between the old headXid and tailXid, even though no lookups can occur on 
those pages. That's unnecessary but harmless.

I think the smallest fix here would be to change CheckPointPredicate() 
so that if tailPage > headPage, pass headPage to SimpleLruTruncate() 
instead of tailPage. Or perhaps it should go into the "The SLRU is no 
longer needed" codepath in that case. If tailPage > headPage, the SLRU 
isn't needed at the moment.

In addition to that, we could change SerialAdd() to not zero out the 
pages between old headXid and tailXid unnecessarily, but that's more of 
an optimization than bug fix.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Evaluate arguments of correlated SubPlans in the referencing ExprState
Next
From: Peter Eisentraut
Date:
Subject: Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound