Hi,
While thinking about the fsync mess, I started looking at the
fsync request queue. I was primarily wondering whether we can keep FDs
open long enough (by forwarding them to the checkpointer) to guarantee
that we see the error. But that's mostly irrelevant for what I'm
wondering about here.
The fsync request queue is often fairly large: 20 bytes per
shared_buffers entry is not a negligible overhead. One reason it needs
to be fairly large is that we do not deduplicate on insert; we just add
an entry on every single write.
ISTM that using a hashtable would be saner, because we'd deduplicate on
insert. While that would require locking, we can relatively easily
reduce the overhead by keeping track of something like mdsync_cycle_ctr
in MdfdVec, and only inserting again if the cycle has been incremented
since the last insert.
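To make that concrete, here's a minimal self-contained sketch of the
idea (not actual PostgreSQL code; FsyncRequest, SegState and friends
are made-up names, the table size is arbitrary, and locking is
deliberately omitted):

#include <stdbool.h>
#include <stdint.h>

#define FSYNC_HASH_SIZE 1024            /* illustrative; would scale with shared_buffers */

typedef struct FsyncRequest
{
    uint32_t    rel;                    /* stand-in for a RelFileNode */
    uint32_t    segno;
    bool        in_use;
} FsyncRequest;

static FsyncRequest fsync_hash[FSYNC_HASH_SIZE];
static uint32_t sync_cycle_ctr;         /* bumped once per mdsync()/checkpoint */

/* Per-open-segment state; think of a field added to MdfdVec. */
typedef struct SegState
{
    uint32_t    rel;
    uint32_t    segno;
    uint32_t    sync_cycle;             /* cycle in which we last enqueued */
} SegState;

/*
 * Insert with linear probing.  Identical (rel, segno) pairs collapse
 * into one entry, so the table deduplicates on insert.  In shared
 * memory this would need a lock (or partitioned locks).
 */
static bool
fsync_hash_insert(uint32_t rel, uint32_t segno)
{
    uint32_t    h = (rel * 2654435761u ^ segno) % FSYNC_HASH_SIZE;

    for (uint32_t i = 0; i < FSYNC_HASH_SIZE; i++)
    {
        FsyncRequest *e = &fsync_hash[(h + i) % FSYNC_HASH_SIZE];

        if (e->in_use && e->rel == rel && e->segno == segno)
            return true;                /* already queued */
        if (!e->in_use)
        {
            e->rel = rel;
            e->segno = segno;
            e->in_use = true;
            return true;
        }
    }
    return false;                       /* table full; see below */
}

/*
 * Called on every write.  The cycle check lets us skip the hash table
 * (and its lock) entirely when we already enqueued this segment during
 * the current sync cycle, which is the common case.
 */
static void
remember_fsync_request(SegState *seg)
{
    if (seg->sync_cycle == sync_cycle_ctr)
        return;                         /* already requested this cycle */
    if (fsync_hash_insert(seg->rel, seg->segno))
        seg->sync_cycle = sync_cycle_ctr;
    /* on failure the overflow path (sketched further down) kicks in */
}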
Right now, if the queue is full and can't be compacted, we end up
fsync()ing on every single write rather than once per checkpoint,
afaict. That's fairly horrible.
For the case where there's no space in the map, I'd suggest just doing
10% or so of the fsyncs in the poor sod of a process that finds no
space. That's surely better than constantly fsyncing on every single
write. We can also make bgwriter check the size of the hashtable on a
regular basis and absorb some of the requests if it gets too full.
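Continuing the sketch above, the overflow path could look roughly like
this; open_segment() is a hypothetical helper, and the 10% fraction is
as arbitrary as in the prose:

#include <unistd.h>                     /* fsync() */

extern int open_segment(uint32_t rel, uint32_t segno);  /* hypothetical helper */

/*
 * Fallback when fsync_hash_insert() finds no free slot: the unlucky
 * process absorbs ~10% of the pending requests itself, fsync()ing them
 * and freeing their slots, instead of degrading to an fsync() on every
 * subsequent write.  Bgwriter could call the same function when it
 * notices the table filling up.  (A real linear-probing table would
 * need tombstones when deleting entries.)
 */
static void
absorb_some_fsync_requests(void)
{
    uint32_t    target = FSYNC_HASH_SIZE / 10;  /* ~10%, arbitrary */
    uint32_t    done = 0;

    for (uint32_t i = 0; i < FSYNC_HASH_SIZE && done < target; i++)
    {
        FsyncRequest *e = &fsync_hash[i];

        if (!e->in_use)
            continue;

        fsync(open_segment(e->rel, e->segno));
        e->in_use = false;
        done++;
    }
}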
I also think the hashtable has some advantages for the future: I've
introduced something very similar in my radix-tree-based buffer mapping.
Greetings,
Andres Freund