Re: archive status ".ready" files may be created too early - Mailing list pgsql-hackers

From: Bossart, Nathan
Subject: Re: archive status ".ready" files may be created too early
Msg-id: 5FA1ACA0-ACD0-46C1-BBF5-36F1D02534B0@amazon.com
In response to: Re: archive status ".ready" files may be created too early (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> >> > You're right in that regard. There's a window where a partial
>> >> > record is written when the write location passes F0 after the
>> >> > insertion location passes F1. However, remembering all spanning
>> >> > records seems like overkill to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill.  IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words were not enough. Remembering all spanning records
>> > in *shared memory* seems like overkill, all the more so if they are
>> > stored in a shared hash table. Even though it is rarely the case,
>> > it can fail in a hard way when the limit is reached. If we could do
>> > well by remembering just two locations, we wouldn't need to worry
>> > about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>>     * max_size is the estimated maximum number of hashtable entries.  This is
>>     * not a hard limit, but the access efficiency will degrade if it is
>>     * exceeded substantially (since it's used to compute directory size and
>>     * the hash table buckets will get overfull).
>
> That description means that a shared hash has a fixed-size directory,
> so there may be synonyms, which causes degradation. Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, at least 256 buckets end up allocated in the
> long run. The minimum on-the-fly allocation size is 32. I haven't
> calculated it precisely, but I'm worried about the amount of spare
> shared memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, the most
entries I've been able to get into this table at once is 6, and that
required turning off synchronous_commit, adding a long sleep at the
end of XLogWrite(), and increasing wal_buffers substantially.  This
leads me to think that a max_size of 16 elements is typically
sufficient.  (I may have also accidentally demonstrated that storing
only two record boundaries could be insufficient.)
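
For reference, here is a minimal sketch of that sizing arithmetic; the
entry layout, names, and constant below are illustrative assumptions,
not the actual definitions from the patch:

    /*
     * Hypothetical sketch of sizing and creating the record-boundary
     * table.  Entry layout and names are illustrative only.
     */
    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/shmem.h"
    #include "utils/hsearch.h"

    #define MAX_RECORD_BOUNDARIES 16    /* the max_size tested above */

    typedef struct RecordBoundaryEntry
    {
        XLogSegNo   seg;        /* key: segment the record spans into */
        XLogRecPtr  endptr;     /* end LSN of the cross-segment record */
    } RecordBoundaryEntry;

    /* Amount of shared memory to reserve for the table. */
    Size
    RecordBoundaryShmemSize(void)
    {
        return hash_estimate_size(MAX_RECORD_BOUNDARIES,
                                  sizeof(RecordBoundaryEntry));
    }

    /* Create or attach to the table during shared memory setup. */
    HTAB *
    RecordBoundaryShmemInit(void)
    {
        HASHCTL     info;

        MemSet(&info, 0, sizeof(info));
        info.keysize = sizeof(XLogSegNo);
        info.entrysize = sizeof(RecordBoundaryEntry);

        return ShmemInitHash("Record Boundary Table",
                             MAX_RECORD_BOUNDARIES, MAX_RECORD_BOUNDARIES,
                             &info, HASH_ELEM | HASH_BLOBS);
    }

Since the estimate scales with both the entry count and the entry
size, a max_size of 16 keeps the overhead in the low kilobytes, which
is consistent with the 5,968-byte figure above.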

>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring
>> > an LWLock, every time XLogWrite is called while segment archiving
>> > is being held off.  I don't think that is acceptable, and I think
>> > it could be a problem when many backends are competing for WAL.
>>
>> This is a fair point.  I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact.  My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no
>> more than once per segment to record new record boundaries.
>
> Thanks. I agree that the reader-reader contention is not a problem
> due to the existing serialization by WALWriteLock. Adding an entry
> happens only at a segment boundary, so ArchNotifyLock doesn't seem to
> be a problem.
>
> However, the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive() in
> XLogWrite() out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer.  Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.
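
To illustrate the shape of that alternative, here is a rough, purely
hypothetical fragment of what a caller could look like; the argument
to NotifySegmentsReadyForArchive() is an assumption:

    /* Write and flush WAL under the usual lock. */
    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    XLogWrite(WriteRqst, false);
    LWLockRelease(WALWriteLock);

    /*
     * Only after releasing WALWriteLock, check whether any segments
     * became fully flushed and can be marked .ready, so the notify
     * step no longer extends the lock hold time.
     */
    NotifySegmentsReadyForArchive(LogwrtResult.Flush);

The cost is that every caller of XLogWrite() would need this extra
call, which is the duplication I'm weighing against any measurable
benefit.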

Nathan

