Apologies for the long delay.
I've spent a good amount of time thinking about this bug and trying
out a few different approaches for fixing it. I've attached a work-
in-progress patch for my latest attempt.
On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> F0 F1
> AAAAA F BBBBB
> |---------|---------|---------|
> seg X seg X+1 seg X+2
>
> Matsumura-san has a concern about the case where there are two (or
> more) partially-flushed segment-spanning records at the same time.
>
> This patch remembers only the last cross-segment record. If we were
> going to flush up to F0 after Record-B had been written, we would fail
> to hold-off archiving seg-X. This patch is based on a assumption that
> that case cannot happen because we don't leave a pending page at the
> time of segment switch and no records don't span over three or more
> segments.
I wonder if these are safe assumptions to make. For your example, if
we've written record B to the WAL buffers, but neither record A nor
record B has been written to disk or flushed, aren't we still in trouble?
Also, is there actually any limit on WAL record length that means that
it is impossible for a record to span over three or more segments?
Perhaps these assumptions are true, but it doesn't seem obvious to me
that they are, and they might be pretty fragile.
The attached patch doesn't make use of these assumptions. Instead, we
track the positions of the records that cross segment boundaries in a
small hash map, and we use that to determine when it is safe to mark a
segment as ready for archival. I think this approach resembles
Matsumura-san's patch from June.
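To make the idea concrete, here is a rough sketch in Python (not the C
of the attached patch, and not a copy of its logic). All names here
(BoundaryTracker, register_record, segments_ready) are hypothetical,
positions are plain byte offsets, and the 16 MB segment size is only an
assumed default. A segment is reported as ready only once it is fully
flushed and any record crossing its end boundary is flushed as well.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed segment size in bytes


class BoundaryTracker:
    """Hypothetical tracker for records that cross segment boundaries."""

    def __init__(self, segment_size=SEGMENT_SIZE):
        self.segment_size = segment_size
        # segment number -> end position of the record that crosses
        # the boundary at the end of that segment
        self.boundaries = {}
        self.last_ready = -1  # highest segment already reported ready

    def register_record(self, start, end):
        """Note a record occupying byte positions [start, end)."""
        first_seg = start // self.segment_size
        last_seg = (end - 1) // self.segment_size
        # Every segment the record leaves must wait for the record's end.
        # A record spanning three or more segments adds several entries.
        for seg in range(first_seg, last_seg):
            prev = self.boundaries.get(seg, 0)
            self.boundaries[seg] = max(prev, end)

    def segments_ready(self, flushed):
        """Return segments that became safe to archive at this flush position."""
        ready = []
        seg = self.last_ready + 1
        while (seg + 1) * self.segment_size <= flushed:
            seg_end = (seg + 1) * self.segment_size
            required = self.boundaries.get(seg, seg_end)
            if flushed < required:
                break  # a crossing record is only partially flushed
            ready.append(seg)
            self.boundaries.pop(seg, None)
            seg += 1
        self.last_ready = seg - 1
        return ready
```

With a segment size of 100, a record at [80, 120) keeps segment 0 from
being reported until the flush position reaches 120, regardless of how
many other boundary-crossing records are pending at the same time.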
As before, I'm not yet handling replication, archive_timeout, or
persisting the latest-marked-ready segment through crashes. For the
last of these, I was thinking of using a new file that stores the
segment number.
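As a sketch of what such a file could look like (again in Python, with
hypothetical function names and an assumed 8-byte little-endian
layout, none of which is in the patch), the usual
write-temp-file, fsync, rename sequence would keep the stored segment
number intact across a crash. The directory fsync assumes a POSIX
system.

```python
import os
import struct


def save_last_ready(path, segno):
    """Durably store the latest-marked-ready segment number (hypothetical format)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(struct.pack("<Q", segno))  # assumed: unsigned 64-bit, little-endian
        f.flush()
        os.fsync(f.fileno())
    # Atomically swap the new file into place.
    os.replace(tmp, path)
    # Flush the directory entry so the rename itself survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def load_last_ready(path):
    """Return the stored segment number, or None if missing or malformed."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) != 8:
        return None
    return struct.unpack("<Q", data)[0]
```

On startup, a None result would mean falling back to whatever the
current behavior is for deciding which segments are ready.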
Nathan