WIP: Vectored writeback - Mailing list pgsql-hackers

From Thomas Munro
Subject WIP: Vectored writeback
Msg-id CA+hUKGK1in4FiWtisXZ+Jo-cNSbWjmBcPww3w3DBM+whJTABXA@mail.gmail.com
List pgsql-hackers
Hi,

Here are some vectored writeback patches I worked on in the 17 cycle
and posted as part of various patch sets, but didn't get into a good
enough shape to take further.  They "push" vectored writes out, but I
think what they need is to be turned inside out and converted into
users of a new hypothetical write_stream.c, so that we have a model
that will survive contact with asynchronous I/O and would "pull"
writes from a stream that controls I/O concurrency.  That all seemed a
lot less urgent to work on than reads, hence leaving them on ice for now.
There is a lot of code that reads, and only a small, finite amount that
writes.  I think the patches show some aspects of the problem-space
though, and they certainly make checkpointing faster.  They cover 2
out of 5ish ways we write relation data: checkpointing, and strategies
AKA ring buffers.
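
To make the "pull" idea concrete, here is a minimal sketch of a stream
that asks a callback for the next block to write and hands back runs of
consecutive block numbers, capped at a combine limit.  Every name in it
(WriteStream, write_stream_next_run, ArraySource, array_next) is
hypothetical and made up for illustration; it is not the design of any
actual write_stream.c, and it ignores I/O concurrency entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

/* Hypothetical callback: returns the next block to write, or INVALID_BLOCK. */
typedef uint32_t (*next_block_fn) (void *arg);

typedef struct WriteStream
{
    next_block_fn next;         /* where blocks are pulled from */
    void       *arg;            /* private state for the callback */
    uint32_t    combine_limit;  /* max blocks per combined write */
    uint32_t    pending;        /* block pulled but not yet returned */
    bool        has_pending;
    bool        done;           /* the callback has reported the end */
} WriteStream;

/*
 * Pull blocks until the run would stop being consecutive or would exceed
 * combine_limit.  Returns false once the stream is exhausted.
 */
bool
write_stream_next_run(WriteStream *ws, uint32_t *start, uint32_t *nblocks)
{
    uint32_t    blk;

    assert(ws->combine_limit >= 1);
    if (ws->has_pending)
    {
        blk = ws->pending;
        ws->has_pending = false;
    }
    else if (ws->done)
        return false;
    else
        blk = ws->next(ws->arg);

    if (blk == INVALID_BLOCK)
    {
        ws->done = true;
        return false;
    }

    *start = blk;
    *nblocks = 1;
    while (*nblocks < ws->combine_limit)
    {
        blk = ws->next(ws->arg);
        if (blk == INVALID_BLOCK)
        {
            ws->done = true;
            break;
        }
        if (blk != *start + *nblocks)
        {
            /* Not adjacent: keep it as the start of the next run. */
            ws->pending = blk;
            ws->has_pending = true;
            break;
        }
        (*nblocks)++;
    }
    return true;
}

/* Toy block source for demonstration: walks a sorted array. */
typedef struct ArraySource
{
    const uint32_t *blocks;
    int         n;
    int         pos;
} ArraySource;

uint32_t
array_next(void *arg)
{
    ArraySource *src = arg;

    return src->pos < src->n ? src->blocks[src->pos++] : INVALID_BLOCK;
}
```

The point of the inversion is that the consumer decides when to ask for
more work, which is where a real stream could limit how many writes are
in flight.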

They make checkpoints look like this, respecting io_combine_limit,
instead of lots of 8kB writes:

pwritev(9,[...],2,0x0) = 131072 (0x20000)
pwrite(9,...,131072,0x20000) = 131072 (0x20000)
pwrite(9,...,131072,0x40000) = 131072 (0x20000)
pwrite(9,...,131072,0x60000) = 131072 (0x20000)
pwrite(9,...,131072,0x80000) = 131072 (0x20000)
...
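
As a rough illustration of how such calls can be formed, the sketch
below gathers buffers destined for consecutive block numbers into one
pwritev() call, merging buffers that happen to be neighbours in memory
into a single iovec.  It assumes 8kB blocks, so 16 of them make up the
131072 bytes seen in the trace.  The function names and MAX_COMBINE are
hypothetical, and this is not the code from the patches.

```c
#define _DEFAULT_SOURCE         /* for pwritev() on glibc */
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed block size */
#define MAX_COMBINE 16          /* 16 x 8kB = 128kB per call (assumption) */

/*
 * Fill iov[] from an array of page pointers, extending the previous iovec
 * whenever the next page directly follows it in memory.  Returns the
 * number of iovecs used.
 */
int
build_iovecs(char *const *pages, int nblocks, struct iovec *iov)
{
    int         iovcnt = 0;

    assert(nblocks > 0 && nblocks <= MAX_COMBINE);
    for (int i = 0; i < nblocks; i++)
    {
        if (iovcnt > 0 &&
            (char *) iov[iovcnt - 1].iov_base + iov[iovcnt - 1].iov_len == pages[i])
            iov[iovcnt - 1].iov_len += BLOCK_SIZE;
        else
        {
            iov[iovcnt].iov_base = pages[i];
            iov[iovcnt].iov_len = BLOCK_SIZE;
            iovcnt++;
        }
    }
    return iovcnt;
}

/*
 * Write nblocks pages to consecutive blocks starting at first_block with
 * one system call.  Returns what pwritev() returns; a real caller would
 * also have to retry after a short write.
 */
ssize_t
write_block_run(int fd, char *const *pages, int nblocks, off_t first_block)
{
    struct iovec iov[MAX_COMBINE];
    int         iovcnt = build_iovecs(pages, nblocks, iov);

    return pwritev(fd, iov, iovcnt, first_block * BLOCK_SIZE);
}
```

When all the pages are adjacent in memory this collapses to a single
iovec, which may be why most lines in the trace are plain pwrite()
calls; that reading of the trace is a guess.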

Two more ways data gets written back are: bgwriter and regular
BAS_NORMAL buffer eviction, but they are not such natural candidates
for write combining.  Well, if you know you're going to write out a
buffer, *maybe* it's worth probing the buffer pool to see if adjacent
block numbers are also present and dirty, both before and after it?  I
don't know.  Or maybe it's better to wait for the tree-based mapping table
of legend first so it becomes cheaper to navigate in block number
order.
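
A sketch of what such probing could look like, purely as an
illustration: starting from a block that is about to be written, extend
the run backwards and then forwards while neighbouring blocks are
reported as present and dirty, up to some maximum run length.  The
lookup is abstracted behind a callback because the cost of that probe
is exactly the open question; find_dirty_run, BlockRun and
array_is_dirty are hypothetical names, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probe: is this block cached and dirty? */
typedef bool (*is_dirty_fn) (uint32_t blockno, void *arg);

typedef struct BlockRun
{
    uint32_t    start;          /* first block of the run */
    uint32_t    nblocks;        /* length of the run, at least 1 */
} BlockRun;

/*
 * Grow a run around 'target' (assumed dirty already), first towards lower
 * block numbers and then towards higher ones, never exceeding max_run
 * blocks or the end of the relation.
 */
BlockRun
find_dirty_run(uint32_t target, uint32_t nblocks_in_rel, uint32_t max_run,
               is_dirty_fn is_dirty, void *arg)
{
    BlockRun    run = {target, 1};

    assert(max_run >= 1 && target < nblocks_in_rel);

    /* Probe "before". */
    while (run.nblocks < max_run && run.start > 0 &&
           is_dirty(run.start - 1, arg))
    {
        run.start--;
        run.nblocks++;
    }

    /* Probe "after". */
    while (run.nblocks < max_run &&
           run.start + run.nblocks < nblocks_in_rel &&
           is_dirty(run.start + run.nblocks, arg))
        run.nblocks++;

    return run;
}

/* Toy stand-in for a buffer mapping lookup: a plain array of flags. */
bool
array_is_dirty(uint32_t blockno, void *arg)
{
    const bool *dirty = arg;

    return dirty[blockno];
}
```

Each probe here stands for one mapping table lookup, so the worst case
costs up to max_run lookups per evicted buffer even when nothing can be
combined.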

The 5th way is raw file copy that doesn't go through the buffer pool,
such as CREATE DATABASE ... STRATEGY=FILE_COPY, which already works
with big writes, and CREATE INDEX via bulk_write.c which is easily
converted to vectored writes, and I plan to push the patches for that
shortly.  I think those should ultimately become stream-based too.

Anyway, I wanted to share these uncommitfest patches, having rebased
them over relevant recent commits, so I could leave them in working
state in case anyone is interested in this file I/O-level stuff...
