On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I've rebased the attached v10 over top of the changes to
> > lazy_scan_heap() Heikki just committed and over the v6 streaming read
> > patch set. I started testing them and see that you are right, we no
> > longer pin too many buffers. However, the uncached example below is
> > now slower with streaming read than on master -- it looks to be
> > because it is doing twice as many WAL writes and syncs. I'm still
> > investigating why that is.
--snip--
> 4. For learning/exploration only, I rebased my experimental vectored
> FlushBuffers() patch, which teaches the checkpointer to write relation
> data out using smgrwritev(). The checkpointer explicitly sorts
> blocks, but I think ring buffers should naturally often contain
> consecutive blocks in ring order. Highly experimental POC code pushed
> to a public branch[2], but I am not proposing anything here, just
> trying to understand things. The nicest looking system call trace was
> with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads
> and WAL writes 128kB at a time:
>
> pwrite(32,...,131072,0xfc6000) = 131072 (0x20000)
> fdatasync(32) = 0 (0x0)
> pwrite(27,...,131072,0x6c0000) = 131072 (0x20000)
> pread(27,...,131072,0x73e000) = 131072 (0x20000)
> pwrite(27,...,131072,0x6e0000) = 131072 (0x20000)
> pread(27,...,131072,0x75e000) = 131072 (0x20000)
> pwritev(27,[...],3,0x77e000) = 131072 (0x20000)
> preadv(27,[...],3,0x77e000) = 131072 (0x20000)
>
> That was a fun experiment, but... I recognise that efficient cleaning
> of ring buffers is a Hard Problem requiring more concurrency: it's
> just too late to be flushing that WAL. But we also don't want to
> start writing back data immediately after dirtying pages (cf. OS
> write-behind for big sequential writes in traditional Unixes), because
> we're not allowed to write data out without writing the WAL first and
> we currently need to build up bigger WAL writes to do so efficiently
> (cf. some other systems that can write out fragments of WAL
> concurrently so the latency-vs-throughput trade-off doesn't have to be
> so extreme). So we want to defer writing it, but not too long. We
> need something cleaning our buffers (or at least flushing the
> associated WAL, but preferably also writing the data) not too late and
> not too early, and more in sync with our scan than the WAL writer is.
> What that machinery should look like I don't know (but I believe
> Andres has ideas).
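
To make the 128kB figure in that trace concrete: assuming the default
8kB block size, 128kB is 16 blocks, and the two pwrite() calls at
0x6c0000 and 0x6e0000 cover adjacent 16-block ranges (blocks 864-879
and 880-895). Below is a toy Python model of how blocks that are
consecutive in ring order could be merged into I/Os of that size. It is
only a sketch: coalesce_blocks() and to_io_requests() are hypothetical
names, not functions from the patch set or from PostgreSQL, and the
model ignores how the buffers are laid out in memory.

```python
BLCKSZ = 8192                      # assumed: PostgreSQL's default block size
MAX_IO_BYTES = 128 * 1024          # the 128kB I/O size seen in the trace
MAX_RUN = MAX_IO_BYTES // BLCKSZ   # 16 blocks per I/O


def coalesce_blocks(blocks, max_run=MAX_RUN):
    """Group block numbers into (start_block, nblocks) runs.

    A run grows only while the next block is physically consecutive and
    the run is still shorter than max_run. Input order is preserved, so
    this models "ring order" rather than a sorted checkpoint write.
    """
    runs = []
    for blk in blocks:
        if runs:
            start, n = runs[-1]
            if blk == start + n and n < max_run:
                runs[-1] = (start, n + 1)
                continue
        runs.append((blk, 1))
    return runs


def to_io_requests(runs):
    """Translate (start_block, nblocks) runs into (offset, length) in bytes."""
    return [(start * BLCKSZ, n * BLCKSZ) for start, n in runs]
```

For example, coalesce_blocks(range(864, 896)) yields two 16-block runs,
which to_io_requests() maps to the same offsets and lengths as the two
pwrite() calls above. A non-consecutive block simply starts a new run.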

I've attached a WIP v11 streaming vacuum patch set here that is
rebased over master (by Thomas), so that I could add a CF entry for
it. It still has the problem with the extra WAL write and fsync calls
investigated by Thomas above. Thomas has some work in progress doing
streaming write-behind to alleviate the issues with the buffer access
strategy and streaming reads. When he gets a version of that ready to
share, he will start a new "Streaming Vacuum" thread.
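
For anyone new to the thread, here is a toy Python model of why the
WAL-before-data rule interacts badly with cleaning one dirty buffer at
a time. It is purely illustrative and makes loud assumptions: ToyWal
and both write_pages_* functions are hypothetical names, LSNs are plain
integers, and each call that advances the flush point stands in for one
WAL write plus sync. It is not how the patch set or PostgreSQL
implements any of this.

```python
class ToyWal:
    """Tracks how far WAL is durable and how many flushes were needed."""

    def __init__(self):
        self.flushed_upto = 0
        self.flush_calls = 0

    def flush(self, lsn):
        # Only do work if the requested LSN is not already durable.
        if lsn > self.flushed_upto:
            self.flushed_upto = lsn
            self.flush_calls += 1


def write_pages_one_by_one(wal, page_lsns):
    """Clean pages individually: WAL is flushed up to each page's LSN
    right before that page's data write."""
    for lsn in page_lsns:
        wal.flush(lsn)
        # (the data page write would happen here)
    return wal.flush_calls


def write_pages_batched(wal, page_lsns):
    """Clean pages as a batch: flush WAL once, up to the highest page
    LSN, then all data writes are allowed."""
    if page_lsns:
        wal.flush(max(page_lsns))
    # (the data page writes would happen here)
    return wal.flush_calls
```

With four dirty pages whose LSNs increase (say 100, 200, 300, 400), the
one-by-one variant needs four flushes and the batched variant needs one.
That is the "defer, but not too long" trade-off from Thomas's mail in
miniature.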

- Melanie