From: Bossart, Nathan
Subject: Re: archive status ".ready" files may be created too early
Msg-id: 5FA1ACA0-ACD0-46C1-BBF5-36F1D02534B0@amazon.com
In response to: Re: archive status ".ready" files may be created too early (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> >> > You're right in that regard.  There's a window where a partial record
>> >> > is written when the write location passes F0 after the insertion
>> >> > location passes F1.  However, remembering all spanning records seems
>> >> > like overkill to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill.  IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words were not enough.  Remembering all spanning records in
>> > *shared memory* seems to be overkill.  Much more so if it is stored in
>> > a shared hash table.  Even though it's rarely the case, it can fail in
>> > a hard way when reaching the limit.  If we could do well by remembering
>> > just two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>>  * max_size is the estimated maximum number of hashtable entries.  This is
>>  * not a hard limit, but the access efficiency will degrade if it is
>>  * exceeded substantially (since it's used to compute directory size and
>>  * the hash table buckets will get overfull).
>
> That description means that a shared hash has a fixed-size directory,
> so there may be synonyms, which causes degradation.  Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, at least 256 buckets end up allocated in the
> long run.  The minimum on-the-fly allocation size is 32.  I haven't
> calculated it further precisely, but I'm worried about the amount of
> spare shared memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, I've been able
to need up to 6 elements in this table, but that required turning off
synchronous_commit, adding a long sleep at the end of XLogWrite(), and
increasing wal_buffers substantially.  This leads me to think that a
max_size of 16 elements is typically sufficient.  (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)
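For the record, here is a rough sketch of how such a table might be
created and sized (illustrative only -- the entry layout and the names
RecordBoundaryEntry, RECORD_BOUNDARY_MAX, and RecordBoundaryTable are
placeholders for this example, not the patch's actual code):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* Hypothetical entry: one cross-segment record boundary per segment. */
typedef struct RecordBoundaryEntry
{
    XLogSegNo   seg;        /* hash key: segment the record crosses */
    XLogRecPtr  endptr;     /* end LSN of the spanning record */
} RecordBoundaryEntry;

#define RECORD_BOUNDARY_MAX 16  /* the max_size discussed above */

static HTAB *RecordBoundaryTable;

/* Shared memory to reserve for the table. */
Size
RecordBoundaryShmemSize(void)
{
    return hash_estimate_size(RECORD_BOUNDARY_MAX,
                              sizeof(RecordBoundaryEntry));
}

/* Create or attach to the table during shared memory initialization. */
void
RecordBoundaryShmemInit(void)
{
    HASHCTL     info;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(XLogSegNo);
    info.entrysize = sizeof(RecordBoundaryEntry);

    /*
     * Per the comment quoted above, max_size is a sizing hint, not a
     * hard limit; exceeding it degrades access efficiency rather than
     * failing outright.
     */
    RecordBoundaryTable = ShmemInitHash("Record Boundary Table",
                                        RECORD_BOUNDARY_MAX,
                                        RECORD_BOUNDARY_MAX,
                                        &info,
                                        HASH_ELEM | HASH_BLOBS);
}

With a 16-byte entry and a max_size of 16, hash_estimate_size() should
land in the same ballpark as the 5,968 bytes above, though the patch's
real entry layout may differ.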
>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring a
>> > LWLock, every time XLogWrite is called while segment archiving is
>> > being held off.  I don't think that is acceptable, and I think it
>> > could be a problem when many backends are competing on WAL.
>>
>> This is a fair point.  I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact.  My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no more
>> than once per segment to record new record boundaries.
>
> Thanks.  I agree that the reader-reader contention is not a problem
> due to the existing serialization by WALWriteLock.  Adding an entry
> happens only at a segment boundary, so the ArchNotifyLock doesn't seem
> to be a problem.
>
> However, the function prolongs the WALWriteLock section.  Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive() in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer.  Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.
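If we did decide it was worth it, I imagine the shape at each call
site would be something like the following (a sketch only -- the
wrapper name and the argument passed to NotifySegmentsReadyForArchive()
are guesses, not the patch as posted):

/*
 * Hypothetical wrapper around XLogWrite() as it exists in xlog.c.
 */
static void
XLogWriteAndNotify(XLogwrtRqst WriteRqst, bool flexible)
{
    Assert(!LWLockHeldByMe(WALWriteLock));

    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    XLogWrite(WriteRqst, flexible);
    LWLockRelease(WALWriteLock);

    /*
     * The shared hash scan now runs outside the critical section, so
     * it no longer prolongs WALWriteLock -- but every call site has to
     * remember this extra step.
     */
    NotifySegmentsReadyForArchive(WriteRqst.Flush);
}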
Nathan