Thread: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Pavel Suderevsky
Date:
Hi,

PG: 9.6.11
OS: CentOS
Env: AWS EC2

I've faced the following exceptions in PostgreSQL server log:
> could not truncate directory "pg_serial": apparent wraparound
Sometimes it repeats every 5 min and the longest period was 40 min.

In fact, I can't find any suspicious events happening that periods. 
pg_wait_sampling didn't catch any events, no long queries (more than 60s), Autovacuum workers or transactions in "idle in transaction" state were in action at this time.

>You should not see the errors you are reporting nor
>the warning I mentioned unless a serializable transaction remains
>active long enough for about 1 billion transaction IDs to be
>consumed.

Database age now is just 18.5 millions of transactions.

Server has two standbys (sync and async), hot_standby_feedback is off.

Please advice what I can do to find a reason of these exceptions.

Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Pavel Suderevsky
Date:
Guys, still need your help.

Previous night:
2019-04-05 00:35:04 UTC  LOG:  could not truncate directory "pg_serial": apparent wraparound
2019-04-05 00:40:04 UTC  LOG:  could not truncate directory "pg_serial": apparent wraparound
(2 checkpoints)

It turned that I have some problem with performance related to predicate locking on this platform.
A lot of long prepared statements with the SerializableXactHashLock and predicate_lock_manager wait_events followed by high CPU usage happened during 00:30 and 00:45. During this period there were 55k pred locks granted at max and 30k in average. Probably because of high CPU usage some statements were spending a lot of time in bind/parse steps.

Probably if you advise me what could cause "pg_serial": apparent wraparound messages I would have more chances to handle all the performance issues.
 
Thank you!
--
Pavel Suderevsky

пн, 11 мар. 2019 г. в 19:09, Pavel Suderevsky <psuderevsky@gmail.com>:
Hi,

PG: 9.6.11
OS: CentOS
Env: AWS EC2

I've faced the following exceptions in PostgreSQL server log:
> could not truncate directory "pg_serial": apparent wraparound
Sometimes it repeats every 5 min and the longest period was 40 min.

In fact, I can't find any suspicious events happening that periods. 
pg_wait_sampling didn't catch any events, no long queries (more than 60s), Autovacuum workers or transactions in "idle in transaction" state were in action at this time.

>You should not see the errors you are reporting nor
>the warning I mentioned unless a serializable transaction remains
>active long enough for about 1 billion transaction IDs to be
>consumed.

Database age now is just 18.5 millions of transactions.

Server has two standbys (sync and async), hot_standby_feedback is off.

Please advice what I can do to find a reason of these exceptions.

Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Thomas Munro
Date:
On Sun, Apr 7, 2019 at 2:31 AM Pavel Suderevsky <psuderevsky@gmail.com> wrote:
> Probably if you advise me what could cause "pg_serial": apparent wraparound messages I would have more chances to
handleall the performance issues.
 

9.6 has this code:

        /*
         * Give a warning if we're about to run out of SLRU pages.
         *
         * slru.c has a maximum of 64k segments, with 32
(SLRU_PAGES_PER_SEGMENT)
         * pages each. We need to store a 64-bit integer for each Xid, and with
         * default 8k block size, 65536*32 pages is only enough to cover 2^30
         * XIDs. If we're about to hit that limit and wrap around,
warn the user.
         *
         * To avoid spamming the user, we only give one warning when
we've used 1
         * billion XIDs, and stay silent until the situation is fixed and the
         * number of XIDs used falls below 800 million again.
         *
         * XXX: We have no safeguard to actually *prevent* the wrap-around,
         * though. All you get is a warning.
         */
        if (oldSerXidControl->warningIssued)
        {
                TransactionId lowWatermark;

                lowWatermark = tailXid + 800000000;
                if (lowWatermark < FirstNormalTransactionId)
                        lowWatermark = FirstNormalTransactionId;
                if (TransactionIdPrecedes(xid, lowWatermark))
                        oldSerXidControl->warningIssued = false;
        }
        else
        {
                TransactionId highWatermark;

                highWatermark = tailXid + 1000000000;
                if (highWatermark < FirstNormalTransactionId)
                        highWatermark = FirstNormalTransactionId;
                if (TransactionIdFollows(xid, highWatermark))
                {
                        oldSerXidControl->warningIssued = true;
                        ereport(WARNING,
                                        (errmsg("memory for
serializable conflict tracking is nearly exhausted"),
                                         errhint("There might be an
idle transaction or a forgotten prepared transaction causing
this.")));
                }
        }

Did you see that warning at some point before the later error?

I think if you saw that warning, and then later the error you
reported, it's probably just being prudent and avoiding the truncation
because it detects a potential wraparound.  If it actually does wrap
around, then there is a potential for
OldSerXidGetMinConflictCommitSeqNo() to report a too-recent minimum
conflict CSN for a given XID, and I'm not sure what consequence that
would have (without drinking a lot more coffee), but potentially some
kind of incorrect answer.  On server restart the problem fixes itself
because pg_serial is only used to spill state relating to transactions
running in this server lifetime.

I wonder if this condition required you to have a serializable
transaction running (or prepared) while you consume 2^30 AKA ~1
billion xids.  I think it is unreachable in v11+ because commit
e5eb4fa8 allowed for more SLRU pages to avoid this artificially early
wrap.

Gee, it'd be nice to use FullTransactionId for SERIALIZABLEXACT and
pg_serial in v13 and not to even have to think about wraparound here.
It's more doable here than elsewhere because the data on disk isn't
persistent across server restart, let alone pg_upgrade.  Let's see...
each segment file is 256kb and we need to be able to address 2^64 *
sizeof(SerCommitSequenceNumber), so you'd have segment files numbered
from 0 up to 1ffffffffffff (so you'd need slru.c to support 13 char
segment names and 64 bit segment numbers, whereas it currently has a
limit of 6 in SlruScanDirectory and uses int for segment number).
You'd be addressing them by FullTransactionId, but that's just the
index used to find entries -- the actual amount of data stored
wouldn't change, you'd just start seeing wider filenames, and all the
fragile modulo comparison truncation stuff would disappear from the
tree.

-- 
Thomas Munro
https://enterprisedb.com



Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Thomas Munro
Date:
On Tue, Apr 9, 2019 at 12:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> It's more doable here than elsewhere because the data on disk isn't
> persistent across server restart, let alone pg_upgrade.  Let's see...
> each segment file is 256kb and we need to be able to address 2^64 *
> sizeof(SerCommitSequenceNumber), so you'd have segment files numbered
> from 0 up to 1ffffffffffff (so you'd need slru.c to support 13 char
> segment names and 64 bit segment numbers, whereas it currently has a
> limit of 6 in SlruScanDirectory and uses int for segment number).

Come to think of it, even for the persistent ones, pg_upgrade could
rename them to a new scheme anyway as long as the segment size remains
the same.  I'd be inclined to ditch the current segment numbering
scheme and switch to one where the FullTransactionId of the first
entry in the segment is used for its name, so that admins can more
easily understand what the files under there correspond to.

-- 
Thomas Munro
https://enterprisedb.com



Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Pavel Suderevsky
Date:
On Sun, Apr 7, 2019 at 2:31 AM Pavel Suderevsky <psuderevsky@gmail.com> wrote:
> Probably if you advise me what could cause "pg_serial": apparent wraparound messages I would have more chances to handle all the performance issues.

Did you see that warning at some point before the later error?
Thomas,

Thank you for your reply!

No, there have never been such warnings.

I wonder if this condition required you to have a serializable
transaction running (or prepared) while you consume 2^30 AKA ~1
billion xids.  I think it is unreachable in v11+ because commit
e5eb4fa8 allowed for more SLRU pages to avoid this artificially early
wrap.  
 
Do I understand right that this is about Virtual txids? Have no idea how even a something close to a billion of transaction ids could be consumed on this system.
 
--
Pavel Suderevsky

Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

From
Pavel Suderevsky
Date:
Hi,

Got this issue again.
Settings on the platform (PG 9.6.11):
max_pred_locks_per_transaction = 3000
max_connections = 800

Despite the fact that documentation says: 
> with the exception of fast-path locks, each lock manager will deliver a consistent set of results
I've noticed the following:
1. pg_locks showed 2 million SIReadLocks for the pid that has been in the "idle" state for a dozen of seconds already.
2. pg_locks showed count of 5 million SIReadLock granted although I expected a limit of 2.4 million SIReadLocks with settings provided above.

And still no any signs of serializable transactions that could live for a 1 billion xids. 
--
Pavel Suderevsky
E: psuderevsky(at)gmail(dot)com  

вт, 9 апр. 2019 г. в 16:00, Pavel Suderevsky <psuderevsky@gmail.com>:
On Sun, Apr 7, 2019 at 2:31 AM Pavel Suderevsky <psuderevsky@gmail.com> wrote:
> Probably if you advise me what could cause "pg_serial": apparent wraparound messages I would have more chances to handle all the performance issues.

Did you see that warning at some point before the later error?
Thomas,

Thank you for your reply!

No, there have never been such warnings.

I wonder if this condition required you to have a serializable
transaction running (or prepared) while you consume 2^30 AKA ~1
billion xids.  I think it is unreachable in v11+ because commit
e5eb4fa8 allowed for more SLRU pages to avoid this artificially early
wrap.  
 
Do I understand right that this is about Virtual txids? Have no idea how even a something close to a billion of transaction ids could be consumed on this system.
 
--
Pavel Suderevsky