Thread: Vectored IO in XLogWrite()

Vectored IO in XLogWrite()

From
Melih Mutlu
Date:
Hi hackers,

I was looking into XLogWrite() and saw the comment below. It cannot wrap around to the start of the WAL buffers without first calling pg_pwrite(), since the next pages wouldn't be contiguous in memory. So it has to write whenever it reaches the end of the WAL buffers.

  /*
   * Dump the set if this will be the last loop iteration, or if we are
   * at the last page of the cache area (since the next page won't be
   * contiguous in memory), or if we are at the end of the logfile
   * segment.
   */

I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers.
Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx.
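
To illustrate the idea, here is a minimal sketch (not the attached patch) of how a run of pages that wraps past the end of a ring of WAL pages could be described with at most two iovec entries. PAGE_SZ, ring_to_iovec() and its parameter names are hypothetical stand-ins for XLOG_BLCKSZ and the XLogCtl fields. The sketch also ignores segment boundaries, which the real XLogWrite() still has to respect.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define PAGE_SZ 8192			/* hypothetical stand-in for XLOG_BLCKSZ */

/*
 * Describe "npages" pages of a ring of "nbuffers" pages, starting at slot
 * "startidx", using at most two iovec entries.  Returns the number of
 * entries filled in (1 if the run is contiguous, 2 if it wraps around).
 */
static int
ring_to_iovec(char *pages, int nbuffers, int startidx, int npages,
			  struct iovec iov[2])
{
	int			first;

	assert(npages > 0 && npages <= nbuffers);
	assert(startidx >= 0 && startidx < nbuffers);

	/* Pages available before we hit the end of the ring. */
	first = nbuffers - startidx;
	if (first > npages)
		first = npages;

	iov[0].iov_base = pages + (size_t) startidx * PAGE_SZ;
	iov[0].iov_len = (size_t) first * PAGE_SZ;
	if (first == npages)
		return 1;

	/* The rest of the run continues at the start of the ring. */
	iov[1].iov_base = pages;
	iov[1].iov_len = (size_t) (npages - first) * PAGE_SZ;
	return 2;
}
```

The resulting array could then be handed to a single pwritev(fd, iov, iovcnt, offset) call instead of issuing one pg_pwrite() per contiguous chunk.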

After quickly experimenting with the patch and comparing the number of write calls, the patch's effect is more visible when wal_buffers is quite low, as it's more likely to wrap around to the beginning. When wal_buffers is set to a decent amount, the patch only saves a few write calls. But I wouldn't expect the patch to introduce any regression (I may be wrong here), so I thought it may be worth considering.

I'd appreciate any feedback on the proposed change. I'd also be glad to benchmark the patch if you want to see how it performs in some specific cases, since I've been struggling to come up with a good test case.

Regards,
--
Melih Mutlu
Microsoft
Attachment

Re: Vectored IO in XLogWrite()

From
Robert Haas
Date:
On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers.
> Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx.

Here are a few notes on this patch:

- It's not pgindent-clean. In fact, it doesn't even pass git diff --check.

- You added a new comment (/* Reaching the buffer... */) in the middle
of a chunk of lines that were already covered by an existing comment
(/* Dump the set ... */). This makes it look like the /* Dump the
set... */ comment only covers the 3 lines of code that immediately
follow it rather than everything in the "if" statement. You could fix
this in a variety of ways, but in this case the easiest solution, to
me, looks like just skipping the new comment. It seems like the point
is pretty self-explanatory.

- The patch removes the initialization of "from" but not the variable
itself. You still increment the variable you haven't initialized.

- I don't think the logic is correct after a partial write. Pre-patch,
"from" advances while "nleft" goes down, but post-patch, what gets
written is dependent on the contents of "iov" which is initialized
outside the loop and never updated. Perhaps compute_remaining_iovec
would be useful here?

- I assume this is probably a good idea in principle, because fewer
system calls are presumably better than more. The impact will probably
be very small, though.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Vectored IO in XLogWrite()

From
Melih Mutlu
Date:
Hi Robert,

Thanks for reviewing.

On Tue, Aug 6, 2024 at 8:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> > I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers.
> > Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx.
>
> Here are a few notes on this patch:
>
> - It's not pgindent-clean. In fact, it doesn't even pass git diff --check.

Fixed.
 
> - You added a new comment (/* Reaching the buffer... */) in the middle
> of a chunk of lines that were already covered by an existing comment
> (/* Dump the set ... */). This makes it look like the /* Dump the
> set... */ comment only covers the 3 lines of code that immediately
> follow it rather than everything in the "if" statement. You could fix
> this in a variety of ways, but in this case the easiest solution, to
> me, looks like just skipping the new comment. It seems like the point
> is pretty self-explanatory.

Removed the new comment. I'm only keeping the updated version of the /* Dump the set... */ comment.
 
> - The patch removes the initialization of "from" but not the variable
> itself. You still increment the variable you haven't initialized.
>
> - I don't think the logic is correct after a partial write. Pre-patch,
> "from" advances while "nleft" goes down, but post-patch, what gets
> written is dependent on the contents of "iov" which is initialized
> outside the loop and never updated. Perhaps compute_remaining_iovec
> would be useful here?

You're right. I should have thought about the partial write case. I have now fixed it by looping and retrying the write until compute_remaining_iovec() returns 0.
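
For readers following along, the general shape of such a retry loop might look like the sketch below. This is not the patch itself: advance_iovec() and pwritev_all() are hypothetical helpers written for illustration, and advance_iovec() only approximates what compute_remaining_iovec() is used for here (dropping the bytes that were already written). Treating a zero-byte write as ENOSPC is also an assumption of this sketch.

```c
#define _GNU_SOURCE				/* for pwritev() on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not PostgreSQL's compute_remaining_iovec): drop the
 * first "done" bytes from iov[0..iovcnt) in place and return how many
 * entries still have data left to write.
 */
static int
advance_iovec(struct iovec *iov, int iovcnt, size_t done)
{
	int			skip = 0;

	assert(iov != NULL && iovcnt >= 0);

	/* Skip entries that were written completely. */
	while (skip < iovcnt && done >= iov[skip].iov_len)
	{
		done -= iov[skip].iov_len;
		skip++;
	}

	/* Shift the remaining entries to the front of the array. */
	for (int i = skip; i < iovcnt; i++)
		iov[i - skip] = iov[i];
	iovcnt -= skip;

	/* Trim the entry that was only partially written, if any. */
	if (iovcnt > 0)
	{
		iov[0].iov_base = (char *) iov[0].iov_base + done;
		iov[0].iov_len -= done;
	}
	return iovcnt;
}

/*
 * Write everything described by iov at "offset", retrying after partial
 * writes.  Modifies iov.  Returns 0 on success, -1 on error (errno set).
 */
static int
pwritev_all(int fd, struct iovec *iov, int iovcnt, off_t offset)
{
	while (iovcnt > 0)
	{
		ssize_t		n = pwritev(fd, iov, iovcnt, offset);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, just retry */
			return -1;
		}
		if (n == 0)
		{
			errno = ENOSPC;		/* assumption: no progress means no space */
			return -1;
		}
		offset += n;
		iovcnt = advance_iovec(iov, iovcnt, (size_t) n);
	}
	return 0;
}
```

The key point from the review is that both the file offset and the iovec array have to advance after each partial write; reusing the original iov would rewrite data from the start of the first entry.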

Thanks,
--
Melih Mutlu
Microsoft
Attachment

Re: Vectored IO in XLogWrite()

From
Robert Haas
Date:
On Tue, Aug 6, 2024 at 6:30 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Fixed.

+                iov[0].iov_base = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;;

Double semicolon.

Aside from that, this looks correct now, so the next question is
whether we want it. To me, it seems like this isn't likely to buy very
much, but it also doesn't really seem to have any kind of downside, so
I'd be somewhat inclined to go ahead with it. On the other hand, one
could argue that it's better not to change working code without a good
reason.

I wondered whether the regression tests actually hit the iovcnt == 2
case, and it turns out that they do, rather frequently actually.
Making that case a FATAL causes ~260 regression test failures. However,
on larger systems, we'll often end up with wal_segment_size=16MB and
wal_buffers=16MB and then it seems like we don't hit the iovcnt==2
case. Which I guess just reinforces the point that this is
theoretically better but practically not much different.
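
A rough sketch of one plausible explanation for that observation, under two assumptions that are not stated in this thread: WAL page number p lives in buffer slot p % nbuffers, and XLOG_BLCKSZ is the default 8kB. Under those assumptions, a write can only wrap around mid-segment if the number of buffer pages is not a multiple of the number of pages per segment. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default XLOG_BLCKSZ */

/*
 * Returns true if the end of the buffer ring can fall in the middle of a
 * WAL segment, i.e. if a single write could need two iovec entries.
 *
 * Assumption: page number p is stored in slot p % nbuffers.  The ring wraps
 * after page p when (p + 1) is a multiple of nbuffers, and a segment ends
 * after page p when (p + 1) is a multiple of pages_per_seg.  Every wrap
 * coincides with a segment end exactly when pages_per_seg divides nbuffers.
 */
static bool
wrap_can_split_write(uint64_t wal_buffers_bytes, uint64_t wal_segment_bytes)
{
	uint64_t	nbuffers = wal_buffers_bytes / BLCKSZ;
	uint64_t	pages_per_seg = wal_segment_bytes / BLCKSZ;

	assert(nbuffers > 0 && pages_per_seg > 0);
	return nbuffers % pages_per_seg != 0;
}
```

With wal_buffers=16MB and wal_segment_size=16MB, both are 2048 pages, so the ring always ends exactly where a segment ends and the write is dumped there anyway. With a smaller setting such as 4MB of WAL buffers, the ring ends mid-segment and the two-entry case becomes reachable.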

Any other votes on what to do here?

--
Robert Haas
EDB: http://www.enterprisedb.com