Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: archive status ".ready" files may be created too early
Date:
Msg-id: 20210831074338.mt7hpaveuyvb723d@alap3.anarazel.de
In response to: Re: archive status ".ready" files may be created too early ("Bossart, Nathan" <bossartn@amazon.com>)
Responses: Re: archive status ".ready" files may be created too early
List: pgsql-hackers
Hi,

On 2021-08-31 06:45:06 +0000, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease(). I attached a patch for
> discussion.

I think it's a bad idea to move more code to before WALInsertLockRelease().
There is a very limited number of xlog insert slots, and WAL flushes (e.g.
commits) need to wait for insertions to finish.

> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll AFAICT happily create .ready files
> >> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < s4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent. If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either!
Consider a scenario where there are small records, each spanning just a bit
into the next segment, and initially all the data is in wal_buffers:

RegisterSegmentBoundary(s1, s1+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s3, s3+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s2
  latestSegBoundaryEndPtr = s2+10

RegisterSegmentBoundary(s4, s4+10)
  earliestSegBoundary = s1
  earliestSegBoundaryEndPtr = s1+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10

If there's now a flush request including all of s3, we'll have the following
sequence of notifies:

NotifySegmentsReadyForArchive(s1)
  nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
  earliestSegBoundary = s4
  earliestSegBoundaryEndPtr = s4+10
  latestSegBoundary = s4
  latestSegBoundaryEndPtr = s4+10
  latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
  nothing happens, flush is smaller than s4

If the record ending at s4+10 isn't an async commit (and thus
XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
end up waiting effectively forever for s2 (and s3) to be archived. If all
other connections (and autovacuum etc.) are idle, there will be no
XLogFlush() calls, nor will XLogBackgroundFlush() do anything, because it
will hit the "If already known flushed" path, given that the first page in
s4 is only partially filled.

Am I missing something?

> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see? I don't
> mind running a couple.
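[Editor's note: the stuck-archive scenario above can be reproduced with a
small simulation. This is a simplified Python model of the two-slot boundary
tracking described in the thread, not the actual C code from the patch; the
segment size, the +10 record overhang, and the exact slot-update rule in
register() are stand-ins chosen so that the state after the four
registrations matches the trace above.]

```python
SEG = 1000  # pretend segment size; the boundary LSN s_n is modeled as n * SEG

class BoundaryState:
    """Simplified model of the patch's two-slot segment boundary tracking."""
    def __init__(self):
        self.earliest = None      # (segno, end_lsn) of the earliest pending boundary
        self.latest = None        # (segno, end_lsn) of the latest pending boundary
        self.ready_through = []   # boundary segments for which .ready was created

    def register(self, segno, end_lsn):
        # Only two slots exist: the earliest boundary is kept, the latest
        # slot gets overwritten, and anything in between is simply dropped.
        if self.earliest is None:
            self.earliest = (segno, end_lsn)
        elif self.latest is None or segno > self.latest[0]:
            self.latest = (segno, end_lsn)

    def notify(self, flush_lsn):
        # Model of NotifySegmentsReadyForArchive(): act only once the flush
        # pointer has passed a recorded boundary's end position.
        if self.latest is not None and flush_lsn >= self.latest[1]:
            self.ready_through.append(self.latest[0])
            self.earliest = self.latest = None
        elif self.earliest is not None and flush_lsn >= self.earliest[1]:
            self.ready_through.append(self.earliest[0])
            # the "keep_latest" case: latest slides into the earliest slot
            self.earliest, self.latest = self.latest, None

state = BoundaryState()
for n in (1, 2, 3, 4):
    # four records, each ending 10 bytes past the segment boundary s_n
    state.register(n, n * SEG + 10)

state.notify(1 * SEG)  # flush to s1: below s1+10, nothing happens
state.notify(2 * SEG)  # flush to s2: boundary s1 satisfied, earliest becomes s4
state.notify(3 * SEG)  # flush to s3: below s4+10, nothing happens

# Only the boundary at s1 has been notified; segments 2 and 3 stay
# unarchived until something flushes past s4+10.
print(state.ready_through)
```

Running the three notifies prints `[1]`: the flush that covered all of s3
produces no .ready files for s2 or s3, which is the hang described above.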
- Parallel COPY with > 8 processes

- Parallel non-transactional insertion of small-to-medium records
  (simulates inserting rows within a transaction)

- Parallel transactional insertion of small-to-medium sized records, with
  fsync=on (plain OLTP writes)

- Parallel transactional insertion of small-to-medium sized records, with
  fsync=off (fsync=off to simulate a fast server-class SSD, where fsync is
  near-instantaneous - of course, if you have one of those, you can also
  use that)

For the OLTP ones I've had good experience simulating workloads with
pg_logical_emit_message(). That just hits the WAL path, *drastically*
reducing the variance / shortening the required test duration.

Greetings,

Andres Freund
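[Editor's note: the pg_logical_emit_message() tip above can be turned into a
pgbench custom script. A sketch follows; the script file name, message
prefix, and payload size are arbitrary choices, and the pgbench invocation
in the comment is only one plausible configuration. The Python here just
generates the script file.]

```python
# Build a pgbench custom script that exercises only the WAL insert/flush
# path: each transaction emits one transactional logical message via
# pg_logical_emit_message(transactional, prefix, message) instead of
# touching any table. 'walbench' and the 100-byte payload are arbitrary.
payload_len = 100
script = (
    "BEGIN;\n"
    "SELECT pg_logical_emit_message(true, 'walbench', "
    f"repeat('x', {payload_len}));\n"
    "COMMIT;\n"
)

with open("emit_message.sql", "w") as f:
    f.write(script)

# Run it with something like (16 clients, 60 seconds, no vacuum):
#   pgbench -n -c 16 -j 16 -T 60 -f emit_message.sql
print(script)
```

Because no heap or index pages are dirtied, run-to-run variance comes
almost entirely from the WAL path itself, which is the point of the trick.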