Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: archive status ".ready" files may be created too early
Msg-id: CA+TgmoaaOA0pnJ3=j2Ao7PO7Obo6ShYyqXvtM8+daGmnq401zg@mail.gmail.com
In response to: Re: archive status ".ready" files may be created too early ("alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org>)
Responses: Re: archive status ".ready" files may be created too early ("Bossart, Nathan" <bossartn@amazon.com>)
List: pgsql-hackers
On Fri, Aug 20, 2021 at 10:50 AM alvherre@alvh.no-ip.org
<alvherre@alvh.no-ip.org> wrote:
> 1. We use a hash table in shared memory.  That's great.  The part that's
>    not so great is that in both places where we read items from it, we
>    have to iterate in some way.  This seems a bit silly.  An array would
>    serve us better, if only we could expand it as needed.  However, in
>    shared memory we can't do that.  (I think the list of elements we
>    need to memoize is arbitrarily long, if enough processes can be writing
>    WAL at the same time.)

We can't expand the hash table either. It has an initial and maximum
size of 16 elements, which means it's basically an expensive array,
and which also means that it imposes a new limit of 16 *
wal_segment_size on the size of WAL records. If you exceed that limit,
I think things just go boom... which I think is not acceptable. I
think we can have records in the multi-GB range if wal_level=logical
and someone chooses a stupid replica identity setting.
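To make that concrete, here is a minimal sketch of the pattern in
question: a shared-memory hash table created via ShmemInitHash() with
equal initial and maximum sizes, which is what makes it effectively a
fixed-size array. The entry layout and names below are hypothetical,
not the patch's actual code:

    #define MAX_BOUNDARY_ENTRIES 16     /* fixed; cannot grow later */

    /* hypothetical entry tracking one cross-segment record */
    typedef struct SegBoundaryEntry
    {
        XLogSegNo   seg;        /* hash key: segment the record ends in */
        XLogRecPtr  endpos;     /* where the crossing record ends */
    } SegBoundaryEntry;

    HASHCTL     ctl;
    HTAB       *boundary_table;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(XLogSegNo);
    ctl.entrysize = sizeof(SegBoundaryEntry);

    /*
     * init_size == max_size: a hash table in shared memory cannot be
     * expanded, so a record spanning more than 16 segments has nowhere
     * to put its extra boundary entries.
     */
    boundary_table = ShmemInitHash("cross-segment boundary table",
                                   MAX_BOUNDARY_ENTRIES,
                                   MAX_BOUNDARY_ENTRIES,
                                   &ctl,
                                   HASH_ELEM | HASH_BLOBS);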

It's actually not clear to me why we need to track multiple entries
anyway. The scenario postulated by Horiguchi-san in
https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
seems to require that the write position be multiple segments ahead of
the flush position, but that seems impossible with the present code,
because XLogWrite() calls issue_xlog_fsync() as soon as a segment is
filled. So I think, at least with the present code, any record that
isn't completely flushed to disk has to be at least partially in the
current segment. And there can be only one record that starts in some
earlier segment and ends in this one.
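For reference, the end-of-segment path in XLogWrite() looks roughly
like this (condensed from xlog.c, with surrounding bookkeeping
elided): the fsync and the archive notification happen together, right
when the segment fills.

    /* If we just wrote the whole last page of a segment... */
    if (finishing_seg)
    {
        issue_xlog_fsync(openLogFile, openLogSegNo);

        /* signal that we need to wake up walsenders later */
        WalSndWakeupRequest();

        LogwrtResult.Flush = LogwrtResult.Write;    /* end of segment */

        if (XLogArchivingActive())
            XLogArchiveNotifySeg(openLogSegNo);
    }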

I will be the first to admit that the forced end-of-segment syncs
suck. They often stall every backend in the entire system at the same
time. Everyone fills up the xlog segment really fast and then stalls
HARD while waiting for that sync to happen. So it's arguably better
not to do more things that depend on that being how it works, but I
think needing a variable-size amount of shared memory is even worse.
If we're going to track multiple entries here, we need some rule that
bounds how many of them we might need to track. If the number of entries
is defined by the number of segment boundaries that a particular
record crosses, it's effectively unbounded, because right now WAL
records can be pretty much arbitrarily big.
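Just to quantify "arbitrarily big": the number of boundary entries a
single record can need grows linearly with its size. A
back-of-the-envelope example, with made-up but plausible numbers:

    /*
     * With the default 16 MB segments, one 4 GB record (well within
     * what logical decoding can produce) crosses roughly 256 segment
     * boundaries -- sixteen times a 16-entry table's capacity.
     */
    uint64      seg_size = (uint64) 16 * 1024 * 1024;           /* 16 MB */
    uint64      record_size = (uint64) 4 * 1024 * 1024 * 1024;  /* 4 GB */
    uint64      boundaries = record_size / seg_size;            /* = 256 */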

-- 
Robert Haas
EDB: http://www.enterprisedb.com


