Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: archive status ".ready" files may be created too early
Date
Msg-id 20201215.193257.1534297881230796788.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: archive status ".ready" files may be created too early  ("Bossart, Nathan" <bossartn@amazon.com>)
Responses Re: archive status ".ready" files may be created too early
List pgsql-hackers
At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> Apologies for the long delay.
> 
> I've spent a good amount of time thinking about this bug and trying
> out a few different approaches for fixing it.  I've attached a work-
> in-progress patch for my latest attempt.
> 
> On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >           F0        F1
> >         AAAAA  F  BBBBB
> > |---------|---------|---------|
> >    seg X    seg X+1   seg X+2
> >
> > Matsumura-san has a concern about the case where there are two (or
> > more) partially-flushed segment-spanning records at the same time.
> >
> > This patch remembers only the last cross-segment record. If we were
> > going to flush up to F0 after Record-B had been written, we would fail
> > to hold-off archiving seg-X. This patch is based on a assumption that
> > that case cannot happen because we don't leave a pending page at the
> > time of segment switch and no records don't span over three or more
> > segments.
> 
> I wonder if these are safe assumptions to make.  For your example, if
> we've written record B to the WAL buffers, but neither record A nor B
> have been written to disk or flushed, aren't we still in trouble?

You're right in that regard. There's a window where partial record is
written when write location passes F0 after insertion location passes
F1. However, remembering all spanning records seems overkilling to me.

I modifed the previous patch so that it remembers the start LSN of the
*oldest* corss-segment continuation record in the last consecutive
bonded segments, and the end LSN of the latest cross-segmetn
continuation record. This doesn't foreget past segment boundaries.
The region is cleard when WAL-write LSN goes beyond the remembered end
LSN.  So the region may contain several wal-segments that are not
connected to the next one, but that doesn't matter so much.


> Also, is there actually any limit on WAL record length that means that
> it is impossible for a record to span over three or more segments?

Even though it is not a hard limit, AFAICS as mentioned before the
longest possible record is what log_newpages() emits. that is up to
about 500kBytes for now. I think we don't want to make the length
longer. If we set the wal_segment_size to 1MB and set the block size
to 16kB or more, we would have a recrod spanning over three or more
segments but I don't think that is a sane configuration and that kind
of issue could happen elsewhere.

> Perhaps these assumptions are true, but it doesn't seem obvious to me
> that they are, and they might be pretty fragile.

I added an assertion that a record must be shorter than a wal segment
to XLogRecordAssemble(). This guarantees the assumption to be true?
(The condition is tentative, would need to be adjusted.)

> The attached patch doesn't make use of these assumptions.  Instead, we
> track the positions of the records that cross segment boundaries in a
> small hash map, and we use that to determine when it is safe to mark a
> segment as ready for archival.  I think this approach resembles
> Matsumura-san's patch from June.
> 
> As before, I'm not handling replication, archive_timeout, and
> persisting latest-marked-ready through crashes yet.  For persisting
> the latest-marked-ready segment through crashes, I was thinking of
> using a new file that stores the segment number.


Also, the attached is a PoC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5714bd064d61135c41a64ecc39aeff74c25a0e74 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 15 Dec 2020 16:24:13 +0900
Subject: [PATCH v3] PoC: Avoid archiving a WAL segment that continues to the
 next segment

If the last record of a finshed segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c       | 187 +++++++++++++++++++++++-
 src/backend/access/transam/xloginsert.c |   3 +
 src/backend/replication/walsender.c     |  14 +-
 src/include/access/xlog.h               |   1 +
 4 files changed, 193 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..98da521601 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the oldest and newest known segment that ends with a
+     * continuation record.
+     */
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+    
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,19 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of segments end with a continuation recrod */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                XLogCtl->firstSegContRecStart = StartPos;
+            XLogCtl->lastSegContRecEnd = EndPos;
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2400,6 +2426,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go back in the case someone else has made it go further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2423,6 +2486,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        extended = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2579,15 +2643,73 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr firstSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                firstSegContRecStart = XLogCtl->firstSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * If we're on a continuation record spans over segments, don't
+                 * expose flush location until the next record is written. This
+                 * prevents expose a flush location at middle of a
+                 * cross-segment WAL recrod.
+                 */
+                if (LogwrtResult.Write < firstSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    extended = false;
+                }
+                else
+                {
+                    /*
+                     * There's a case where we are told to flush up not to the
+                     * end of a record but to WALbuffer page boundary. We
+                     * advance the request LSN to the end of the record in the
+                     * case the record at the requested LSN continues to the
+                     * next segment.
+                     */
+                    XLogSegNo oldseg;
+                    XLogSegNo currseg;
+
+                    XLByteToPrevSeg(WriteRqst.Write, currseg, wal_segment_size);
+                    XLByteToPrevSeg(lastSegContRecEnd, oldseg,
+                                    wal_segment_size);
+
+                    if (oldseg == currseg &&
+                        WriteRqst.Write <= lastSegContRecEnd)
+                        WriteRqst.Write = lastSegContRecEnd;
+
+                    /*
+                     * We need to finish writing at least the current record in
+                     * order to the old segment can be safely archived.
+                     */
+                    extended = true;
+                }
+
+                if (lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    /*
+                     * We got out from the continuation region, reset the
+                     * locations.
+                     */
+                    SpinLockAcquire(&XLogCtl->info_lck);
+                    XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
+                    XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
+                    SpinLockRelease(&XLogCtl->info_lck);
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2616,11 +2738,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if the last record in the just
+         * finished segment needs to be finished.
+         */
+        if (flexible && !extended && npages == 0)
             break;
     }
 
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case
+     * need to notify the last segment here.
+     */
+    if (extended)
+        NotifySegmentsUpTo(openLogSegNo - 1);
+
     Assert(npages == 0);
 
     /*
@@ -7705,6 +7839,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8429,6 +8575,37 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    static XLogRecPtr    lastTargetRecPtr = InvalidXLogRecPtr;
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    firstSegContRecStart = XLogCtl->firstSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Don't move forward if the current flush position may be within a
+     * continuation record that spans over segments.
+     */
+    if (lastTargetRecPtr == InvalidXLogRecPtr ||
+        firstSegContRecStart == InvalidXLogRecPtr ||
+        XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 ||
+        LogwrtResult.Flush < firstSegContRecStart ||
+        lastSegContRecEnd <= LogwrtResult.Flush)
+        lastTargetRecPtr = LogwrtResult.Flush;
+
+    return lastTargetRecPtr;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e6..af53d1f514 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -815,6 +815,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
     rechdr->xl_prev = InvalidXLogRecPtr;
     rechdr->xl_crc = rdata_crc;
 
+    /* we shouldn't have a record longer than a segment */
+    Assert(total_len < wal_segment_size);
+
     return &hdr_rdt;
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2eb19ad293..00d8701a60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2637,14 +2637,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that's can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Single transaction in the tablesync worker?
Next
From: Alexey Kondratov
Date:
Subject: Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly