Thread: Vectored IO in XLogWrite()
Hi hackers,
I was looking into XLogWrite() and saw the below comment. It cannot really circle back in wal buffers without needing to call pg_write() since next pages wouldn't be contiguous in memory. So it needs to write whenever it reaches the end of wal buffers.
/*
* Dump the set if this will be the last loop iteration, or if we are
* at the last page of the cache area (since the next page won't be
* contiguous in memory), or if we are at the end of the logfile
* segment.
*/
I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers.
Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx.
After quickly experimenting the patch and comparing the number of write calls, the patch's affect can be more visible when wal_buffers is quite low as it's more likely to circle back to the beginning. When wal_buffers is set to a decent amount, the patch only saves a few write calls. But I wouldn't expect any regression introduced by the patch (I may be wrong here), so I thought it may be worth to consider.
I appreciate any feedback on the proposed change. I'd also be glad to benchmark the patch if you want to see how it performs in some specific cases since I've been struggling with coming up a good test case.
Regards,
-- Melih Mutlu
Microsoft
Attachment
On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessaryto write just because XLogWrite() is at the end of wal buffers. > Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguousin memory, until it reaches the page at startidx. Here are a few notes on this patch: - It's not pgindent-clean. In fact, it doesn't even pass git diff --check. - You added a new comment (/* Reaching the buffer... */) in the middle of a chunk of lines that were already covered by an existing comment (/* Dump the set ... */). This makes it look like the /* Dump the set... */ comment only covers the 3 lines of code that immediately follow it rather than everything in the "if" statement. You could fix this in a variety of ways, but in this case the easiest solution, to me, looks like just skipping the new comment. It seems like the point is pretty self-explanatory. - The patch removes the initialization of "from" but not the variable itself. You still increment the variable you haven't initialized. - I don't think the logic is correct after a partial write. Pre-patch, "from" advances while "nleft" goes down, but post-patch, what gets written is dependent on the contents of "iov" which is initialized outside the loop and never updated. Perhaps compute_remaining_iovec would be useful here? - I assume this is probably a good idea in principle, because fewer system calls are presumably better than more. The impact will probably be very small, though. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Robert,
Thanks for reviewing.
Robert Haas <robertmhaas@gmail.com>, 6 Ağu 2024 Sal, 20:43 tarihinde şunu yazdı:
On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers.
> Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx.
Here are a few notes on this patch:
- It's not pgindent-clean. In fact, it doesn't even pass git diff --check.
Fixed.
- You added a new comment (/* Reaching the buffer... */) in the middle
of a chunk of lines that were already covered by an existing comment
(/* Dump the set ... */). This makes it look like the /* Dump the
set... */ comment only covers the 3 lines of code that immediately
follow it rather than everything in the "if" statement. You could fix
this in a variety of ways, but in this case the easiest solution, to
me, looks like just skipping the new comment. It seems like the point
is pretty self-explanatory.
Removed the new comment. Only keeping the updated version of the /* Dump the set... */ comment.
- The patch removes the initialization of "from" but not the variable
itself. You still increment the variable you haven't initialized.
- I don't think the logic is correct after a partial write. Pre-patch,
"from" advances while "nleft" goes down, but post-patch, what gets
written is dependent on the contents of "iov" which is initialized
outside the loop and never updated. Perhaps compute_remaining_iovec
would be useful here?
You're right. I should have thought about the partial write case. I now fixed it by looping and trying to write until compute_remaining_iovec() returns 0.
Thanks,
Melih Mutlu
Microsoft
Attachment
On Tue, Aug 6, 2024 at 6:30 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote: > Fixed. + iov[0].iov_base = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;; Double semicolon. Aside from that, this looks correct now, so the next question is whether we want it. To me, it seems like this isn't likely to buy very much, but it also doesn't really seem to have any kind of downside, so I'd be somewhat inclined to go ahead with it. On the other hand, one could argue that it's better not to change working code without a good reason. I wondered whether the regression tests actually hit the iovcnt == 2 case, and it turns out that they do, rather frequently actually. Making that case a FATAL causes ~260 regression test failure. However, on larger systems, we'll often end up with wal_segment_size=16MB and wal_buffers=16MB and then it seems like we don't hit the iovcnt==2 case. Which I guess just reinforces the point that this is theoretically better but practically not much different. Any other votes on what to do here? -- Robert Haas EDB: http://www.enterprisedb.com