Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date
Msg-id CALj2ACU9cfAcfVsGwUqXMace_7rfSBJ7+hXVJfVV1jnspTDGHQ@mail.gmail.com
Whole thread Raw
In response to Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Jeff Davis <pgsql@j-davis.com>)
Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > This adds copying of the whole page (at least) at every WAL *record*
> > > read,
> >
> > In the worst case yes, but that may not always be true. On a typical
> > production server with decent write traffic, it happens that the
> > callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> > MAX_SEND_SIZE bytes.
>
> I agree with this.
>
> > > This patch copies the bleeding edge WAL page without recording the
> > > (next) insertion point nor checking whether all in-progress insertion
> > > behind the target LSN have finished. Thus the copied page may have
> > > holes.  That being said, the sequential-reading nature and the fact
> > > that WAL buffers are zero-initialized may make it work for recovery,
> > > but I don't think this also works for replication.
> >
> > WALRead() callers are smart enough to take the flushed bytes only.
> > Although they read the whole WAL page, they calculate the valid bytes.
>
> Right
>
> On first read the patch looks good, although it needs some more
> thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+     * XXX: Perhaps, measuring the immediate lock availability and its impact
+     * on concurrent WAL writers is a good idea here.

It was shown in my testng upthread [1] that the patch does no harm in
this regard. It will be great if other members try testing in their
respective environments and use cases.

2.
+     * XXX: Perhaps, returning if lock is not immediately available a good idea
+     * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more on this, ISTM that doing the above is right
to not cause any contention when the lock is busy. I've done so in the
v2 patch.

3.
+     * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+     * a good idea here. This avoids unnecessary lock acquire-release cycles.
+     * One way to do that is by maintaining oldest WAL record that's currently
+     * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention. Because shared variables to track min and max available
LSNs in the WAL buffers will need to be protected against all the
concurrent writers. Also, with the change that's done in (2) above,
that is, quickly exiting if the lock was busy, this comment seems
unnecessary to worry about. Hence, I decided to leave it there.

4.
+         * XXX: Perhaps, we can further go and validate the found page header,
+         * record header and record at least in assert builds, something like
+         * the xlogreader.c does and return if any of those validity checks
+         * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough for us
to ensure that we're not reading a page that got just initialized. The
callers will anyway perform extensive checks on page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug IMO. Hence, I don't think
we need to do the above.

> And also I do not like that XLogReadFromBuffers() is using 3 bools
> hit/partial hit/miss, instead of this we can use an enum or some
> tristate variable, I think that will be cleaner.

Yeah, that seems more verbose, all that information can be deduced
from requested bytes and read bytes, I've done so in the v2 patch.

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) herewith for
testing that basically adds WAL read stats -
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt - apply on HEAD and
monitor pg_stat_replication for per-walsender WAL read from WAL file
stats. USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt -
apply on v2 patch and monitor pg_stat_replication for per-walsender
WAL read from WAL buffers and WAL file stats.

[1] https://www.postgresql.org/message-id/CALj2ACXUbvON86vgwTkum8ab3bf1%3DHkMxQ5hZJZS3ZcJn8NEXQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Time delayed LR (WAS Re: logical replication restrictions)
Next
From: Amit Kapila
Date:
Subject: Re: Time delayed LR (WAS Re: logical replication restrictions)