Thread: Improve WALRead() to suck data directly from WAL buffers when possible

Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy
Hi,

WALRead() currently reads WAL from the WAL file on disk, which means the
walsenders serving streaming and logical replication (callers of
WALRead()) have to go to the disk or the OS page cache for every WAL
read. This may increase the amount of read IO required by all the
walsenders put together, as one typically maintains many
standbys/subscribers on production servers for high availability,
disaster recovery, read replicas and so on. It may also increase
replication lag if the WAL reads keep hitting the disk.

It may happen that the WAL buffers contain the requested WAL; if so,
WALRead() can attempt to read from the WAL buffers before reading from
the file. If the read hits the WAL buffers, reading from the file on
disk is avoided entirely. This mainly reduces the read IO/read system
calls. It also enables other features specified elsewhere [1].

I'm attaching a patch that implements the idea, which is also noted
elsewhere [2]. I've run some tests [3]. The WAL buffers hit ratio with
the patch stood at 95%; in other words, the walsenders avoided reading
from the file 95% of the time. Measured in terms of data volume, 79%
(13.5GB out of a total of 17GB) of the requested WAL was read from the
WAL buffers, as opposed to 21% from the file. Note that the WAL buffers
hit ratio can be very low for write-heavy workloads, in which case file
reads are inevitable.

The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal work: it checks whether the requested WAL page
is present in the WAL buffers and, if so, copies the page and releases
the lock. I think taking just WALBufMappingLock is enough here, as the
concurrent writers depend on it to initialize and replace a page in the
WAL buffers.

I'll add this to the next commitfest.

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com

[2]
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

[3]
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900 --username=ubuntu postgres

PATCHED:
-[ RECORD 1 ]----------+----------------
application_name       | assb1
wal_read               | 31005
wal_read_bytes         | 3800607104
wal_read_time          | 779.402
wal_read_buffers       | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers  | 3033.309
sync_state             | async
-[ RECORD 2 ]----------+----------------
application_name       | ssb1
wal_read               | 31027
wal_read_bytes         | 3800932712
wal_read_time          | 696.365
wal_read_buffers       | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers  | 2989.507
sync_state             | sync

HEAD:
-[ RECORD 1 ]----+----------------
application_name | assb1
wal_read         | 705627
wal_read_bytes   | 18343480640
wal_read_time    | 7607.783
sync_state       | async
-[ RECORD 2 ]----+------------
application_name | ssb1
wal_read         | 705625
wal_read_bytes   | 18343480640
wal_read_time    | 4539.058
sync_state       | sync

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Kyotaro Horiguchi
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> The patch introduces concurrent readers for the WAL buffers, so far
> only there are concurrent writers. In the patch, WALRead() takes just
> one lock (WALBufMappingLock) in shared mode to enable concurrent
> readers and does minimal things - checks if the requested WAL page is
> present in WAL buffers, if so, copies the page and releases the lock.
> I think taking just WALBufMappingLock is enough here as the concurrent
> writers depend on it to initialize and replace a page in WAL buffers.
> 
> I'll add this to the next commitfest.
> 
> Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page during the copy. I'm a bit doubtful that it results in an
overall improvement. It seems to me that almost all the pread()s here
are served from the file system cache, so it is unclear to me that
copying a whole WAL page (then copying the target record again) wins
over a pread() call that copies only the record to be read. Do you have
actual numbers on how frequently WAL reads go to disk, or on the
performance gain or real I/O reduction this patch offers?

This patch copies the bleeding-edge WAL page without recording the
(next) insertion point or checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes.  That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

I remember that one of the advantages of reading the on-memory WAL
records is that it allows walsender to pre-send the unwritten records.
So perhaps we should manage how far the buffer is filled with valid
content (or how far we can pre-send) as part of this feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Kyotaro Horiguchi
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes.  That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Kyotaro Horiguchi
Sorry for the confusion.

At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > This patch copies the bleeding edge WAL page without recording the
> > (next) insertion point nor checking whether all in-progress insertion
> > behind the target LSN have finished. Thus the copied page may have
> > holes.  That being said, the sequential-reading nature and the fact
> > that WAL buffers are zero-initialized may make it work for recovery,
> > but I don't think this also works for replication.
> 
> Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
> records. Please forget about recovery.

NO... Logical walsenders do that. So, please forget about this...

> > I remember that the one of the advantage of reading the on-memory WAL
> > records is that that allows walsender to presend the unwritten
> > records. So perhaps we should manage how far the buffer is filled with
> > valid content (or how far we can presend) in this feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy
On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>

Thanks for providing thoughts.

> At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > The patch introduces concurrent readers for the WAL buffers, so far
> > only there are concurrent writers. In the patch, WALRead() takes just
> > one lock (WALBufMappingLock) in shared mode to enable concurrent
> > readers and does minimal things - checks if the requested WAL page is
> > present in WAL buffers, if so, copies the page and releases the lock.
> > I think taking just WALBufMappingLock is enough here as the concurrent
> > writers depend on it to initialize and replace a page in WAL buffers.
> >
> > I'll add this to the next commitfest.
> >
> > Thoughts?
>
> This adds copying of the whole page (at least) at every WAL *record*
> read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, the callers of WALRead()
often read a full WAL page of XLOG_BLCKSZ bytes, or up to MAX_SEND_SIZE
bytes, at a time.

> fighting all WAL writers by taking WALBufMappingLock on a very
> busy page while the copying. I'm a bit doubtful that it results in an
> overall improvement.

Well, the tests don't reflect that [1]; I've run an insert workload
[2]. The WAL is being read from the WAL buffers 99% of the time, which
is pretty cool. If you have any use-cases in mind, please share them
and/or feel free to run them at your end.

> It seems to me almost all pread()s here happens
> on file buffer so it is unclear to me that copying a whole WAL page
> (then copying the target record again) wins over a pread() call that
> copies only the record to read.

That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fill the OS
page cache with the table pages those queries access): the WAL pread()
calls then turn into IOPS. Despite the WAL being present in the WAL
buffers, customers will be paying unnecessarily for those IOPS too.
With the patch, we basically avoid the pread() system calls that may
turn into IOPS on production servers (99% of the time for the insert
use case [1][2], 95% of the time for the pgbench use case specified
upthread). With the patch, the WAL buffers can act as an L1 cache, if
one thinks of the OS page cache as an L2 cache (of course this
illustration is not related to the typical processor L1 and L2 caches).

> Do you have an actual number of how
> frequent WAL reads go to disk, or the actual number of performance
> gain or real I/O reduction this patch offers?

It might be a bit tough to generate such heavy traffic. One idea is to
ensure the WAL page/file is evicted from the OS page cache before
WALRead(); the 0002 patch from
https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com
and the tool https://github.com/klando/pgfincore can help here.

> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes.  That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

WALRead() callers are smart enough to consume the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

Yes, the non-flushed WAL can be read and sent across if one wishes to
make replication faster and to flush in parallel on the primary and the
standbys, at the cost of a bit of extra crash handling; that's
mentioned here
https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com.
However, this can be a separate discussion.

I also want to reiterate that the patch implemented a TODO item:

 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

[1]
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507

-[ RECORD 1 ]----------+-----------
application_name       | assb1
sent_lsn               | 0/1B8106A8
write_lsn              | 0/1B8106A8
flush_lsn              | 0/1B8106A8
replay_lsn             | 0/1B8106A8
write_lag              |
flush_lag              |
replay_lag             |
wal_read               | 104
wal_read_bytes         | 10733008
wal_read_time          | 1.845
wal_read_buffers       | 76662
wal_read_bytes_buffers | 383598808
wal_read_time_buffers  | 205.418
sync_state             | async

HEAD:
1 1312.054496
2 1449.429321
4 2717.496207
8 5913.361540
16 10762.978907
32 19653.449728
64 41086.124269
128 68548.061171
256 104468.415361
512 114328.943598
768 91751.279309
1024 96403.736757
2048 82155.140270
4096 66160.659511

-[ RECORD 1 ]----+-----------
application_name | assb1
sent_lsn         | 0/1AB5BCB8
write_lsn        | 0/1AB5BCB8
flush_lsn        | 0/1AB5BCB8
replay_lsn       | 0/1AB5BCB8
write_lag        |
flush_lag        |
replay_lag       |
wal_read         | 71967
wal_read_bytes   | 381009080
wal_read_time    | 243.616
sync_state       | async

[2] Test details:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j8 install > install.log 2>&1 &
1 primary, 1 async standby
cd inst/bin
./pg_ctl -D data -l logfile stop
./pg_ctl -D assbdata -l logfile1 stop
rm -rf data assbdata
rm logfile logfile1
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
rm -rf /home/ubuntu/archived_wal
mkdir /home/ubuntu/archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
track_wal_io_timing = 'on'
EOF
./pg_ctl -D data -l logfile start
./psql -c "select pg_create_physical_replication_slot('assb1_repl_slot', true, false)" postgres
./pg_ctl -D data -l logfile restart
./pg_basebackup -D assbdata
./pg_ctl -D data -l logfile stop
cat << EOF >> assbdata/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu application_name=assb1'
primary_slot_name='assb1_repl_slot'
restore_command='cp /home/ubuntu/archived_wal/%f %p'
EOF
touch assbdata/standby.signal
./pg_ctl -D data -l logfile start
./pg_ctl -D assbdata -l logfile1 start
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c "; ./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c -j$c -T5 2>&1 | grep '^tps' | awk '{print $3}'; done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Dilip Kumar
On Fri, Dec 23, 2022 at 3:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
>
> Thanks for providing thoughts.
>
> > At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > The patch introduces concurrent readers for the WAL buffers, so far
> > > only there are concurrent writers. In the patch, WALRead() takes just
> > > one lock (WALBufMappingLock) in shared mode to enable concurrent
> > > readers and does minimal things - checks if the requested WAL page is
> > > present in WAL buffers, if so, copies the page and releases the lock.
> > > I think taking just WALBufMappingLock is enough here as the concurrent
> > > writers depend on it to initialize and replace a page in WAL buffers.
> > >
> > > I'll add this to the next commitfest.
> > >
> > > Thoughts?
> >
> > This adds copying of the whole page (at least) at every WAL *record*
> > read,
>
> In the worst case yes, but that may not always be true. On a typical
> production server with decent write traffic, it happens that the
> callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> MAX_SEND_SIZE bytes.

I agree with this.

> > This patch copies the bleeding edge WAL page without recording the
> > (next) insertion point nor checking whether all in-progress insertion
> > behind the target LSN have finished. Thus the copied page may have
> > holes.  That being said, the sequential-reading nature and the fact
> > that WAL buffers are zero-initialized may make it work for recovery,
> > but I don't think this also works for replication.
>
> WALRead() callers are smart enough to take the flushed bytes only.
> Although they read the whole WAL page, they calculate the valid bytes.

Right

On first read the patch looks good, although it needs some more
thought on the 'XXX' comments in the patch.

Also, I do not like that XLogReadFromBuffers() uses three bools for
hit/partial hit/miss; instead of this we could use an enum or some
tristate variable, which I think would be cleaner.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy
On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > This adds copying of the whole page (at least) at every WAL *record*
> > > read,
> >
> > In the worst case yes, but that may not always be true. On a typical
> > production server with decent write traffic, it happens that the
> > callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> > MAX_SEND_SIZE bytes.
>
> I agree with this.
>
> > > This patch copies the bleeding edge WAL page without recording the
> > > (next) insertion point nor checking whether all in-progress insertion
> > > behind the target LSN have finished. Thus the copied page may have
> > > holes.  That being said, the sequential-reading nature and the fact
> > > that WAL buffers are zero-initialized may make it work for recovery,
> > > but I don't think this also works for replication.
> >
> > WALRead() callers are smart enough to take the flushed bytes only.
> > Although they read the whole WAL page, they calculate the valid bytes.
>
> Right
>
> On first read the patch looks good, although it needs some more
> thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+     * XXX: Perhaps, measuring the immediate lock availability and its impact
+     * on concurrent WAL writers is a good idea here.

It was shown in my testing upthread [1] that the patch does no harm in
this regard. It would be great if other members tried testing in their
respective environments and use cases.

2.
+     * XXX: Perhaps, returning if lock is not immediately available a good idea
+     * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more about this, ISTM that doing the above is the
right thing so as not to cause any contention when the lock is busy.
I've done so in the v2 patch.

3.
+     * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+     * a good idea here. This avoids unnecessary lock acquire-release cycles.
+     * One way to do that is by maintaining oldest WAL record that's currently
+     * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention, because shared variables tracking the min and max LSNs
currently available in the WAL buffers would need to be protected
against all the concurrent writers. Also, with the change made in (2)
above, that is, quickly exiting if the lock is busy, this comment seems
unnecessary to worry about. Hence, I decided to leave it at that.

4.
+         * XXX: Perhaps, we can further go and validate the found page header,
+         * record header and record at least in assert builds, something like
+         * the xlogreader.c does and return if any of those validity checks
+         * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for its next
use in AdvanceXLInsertBuffer(). The checks we have on the page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough to ensure
that we're not reading a page that has just been initialized. The
callers will anyway perform extensive checks on the page and the record
in XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from the WAL
buffers, that must be treated as a bug IMO. Hence, I don't think we
need to do the above.

> And also I do not like that XLogReadFromBuffers() is using 3 bools
> hit/partial hit/miss, instead of this we can use an enum or some
> tristate variable, I think that will be cleaner.

Yeah, that seemed overly verbose; all that information can be deduced
from the requested bytes and the read bytes. I've done so in the v2
patch.

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) for testing,
which basically add WAL read stats: apply
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt on HEAD and monitor
pg_stat_replication for per-walsender WAL-read-from-file stats; apply
USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt on the v2
patch and monitor pg_stat_replication for per-walsender
WAL-read-from-buffers and WAL-read-from-file stats.

[1] https://www.postgresql.org/message-id/CALj2ACXUbvON86vgwTkum8ab3bf1%3DHkMxQ5hZJZS3ZcJn8NEXQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Jeff Davis
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Andres Freund
Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > Please review the attached v2 patch further.
> 
> I'm still unclear on the performance goals of this patch. I see that it
> will reduce syscalls, which sounds good, but to what end?
> 
> Does it allow a greater number of walsenders? Lower replication
> latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: SATYANARAYANA NARLAPURAM


On Sat, Jan 14, 2023 at 12:34 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > Please review the attached v2 patch further.
>
> I'm still unclear on the performance goals of this patch. I see that it
> will reduce syscalls, which sounds good, but to what end?
>
> Does it allow a greater number of walsenders? Lower replication
> latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

+1. Archive modules rely on reading the WAL files for PITR. Direct IO for WAL requires reading these files from disk anyway for archival. However, archiving using utilities like pg_receivewal can take advantage of this patch together with direct IO for WAL.

Thanks,
Satya

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Andres Freund
Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > Please review the attached v2 patch further.
> > 
> > I'm still unclear on the performance goals of this patch. I see that it
> > will reduce syscalls, which sounds good, but to what end?
> > 
> > Does it allow a greater number of walsenders? Lower replication
> > latency? Less IO bandwidth? All of the above?
> 
> One benefit would be that it'd make it more realistic to use direct IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
   accumulate a few TCP packets worth of data even in a single small OLTP
   transaction
b) The remote side can start to write out data earlier


Of course this would require additional infrastructure on the receiver
side, e.g. some persistent state indicating up to where WAL is allowed
to be applied, to avoid the standby getting ahead of the primary in
case the primary crash-restarts (or has more severe issues).


With a bit of work we could perform WAL replay on the standby without
waiting for the fdatasync of the received WAL - that only needs to
happen when a) we need to confirm a flush position to the primary or
b) we need to write back pages from the buffer pool (and some other
things).


Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy
On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > Please review the attached v2 patch further.
> > >
> > > I'm still unclear on the performance goals of this patch. I see that it
> > > will reduce syscalls, which sounds good, but to what end?
> > >
> > > Does it allow a greater number of walsenders? Lower replication
> > > latency? Less IO bandwidth? All of the above?
> >
> > One benefit would be that it'd make it more realistic to use direct IO for WAL
> > - for which I have seen significant performance benefits. But when we
> > afterwards have to re-read it from disk to replicate, it's less clearly a win.
>
> Satya's email just now reminded me of another important reason:
>
> Eventually we should add the ability to stream out WAL *before* it has locally
> been written out and flushed. Obviously the relevant positions would have to
> be noted in the relevant message in the streaming protocol, and we couldn't
> generally allow standbys to apply that data yet.
>
> That'd allow us to significantly reduce the overhead of synchronous
> replication, because instead of commonly needing to send out all the pending
> WAL at commit, we'd just need to send out the updated flush position. The
> reason this would lower the overhead is that:
>
> a) The reduced amount of data to be transferred reduces latency - it's easy to
>    accumulate a few TCP packets worth of data even in a single small OLTP
>    transaction
> b) The remote side can start to write out data earlier
>
>
> Of course this would require additional infrastructure on the receiver
> side. E.g. some persistent state indicating up to where WAL is allowed to be
> applied, to avoid the standby getting ahead of the primary, in case the
> primary crash-restarts (or has more severe issues).
>
>
> With a bit of work we could perform WAL replay on standby without waiting for
> the fdatasync of the received WAL - that only needs to happen when a) we need
> to confirm a flush position to the primary b) when we need to write back pages
> from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right: the eventual plan is to do a bunch of other things as
described above, which we've discussed in another thread (see below). I
would like to once again clarify the motivation behind this feature:

1. It enables WAL readers (callers of WALRead() - walsenders,
pg_walinspect etc.) to use the WAL buffers as a first-level cache,
which might reduce the number of IOPS at peak load, especially when a
pread() results in a disk read (i.e. the WAL isn't available in the OS
page cache). In the first email of this thread I presented the buffer
hit ratio/number of pread() system calls avoided by the walsenders (95%
of the time the walsenders were able to read from the WAL buffers
without impacting anybody). Now, here are the results with the WAL DIO
patch [1] - where a WAL pread() turns into a disk read - see the
results [2] and the attached graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using the WAL buffers as a read cache helps a lot. That is
clearly evident from my experiment with the WAL DIO patch [1]; see the
results [2] and the attached graph. As expected, WAL DIO brings down
the TPS, whereas reading from the WAL buffers, i.e. this patch, brings
it back up.

3. As Andres rightly mentioned above, it enables flushing WAL in
parallel on the primary and all standbys [3]. I haven't yet started
work on this; I will aim for PG 17.

4. It makes possible the work on disallowing async standbys or
subscribers from getting ahead of the sync standbys [3]. I haven't yet
started work on this either; I will aim for PG 17.

5. It implements the following TODO item specified near WALRead():
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

That said, this feature is separately reviewable and perhaps can go
separately as it has its own benefits.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com

[2] Test case is an insert pgbench workload.
clients    HEAD    WAL DIO    WAL DIO & WAL BUFFERS READ    WAL BUFFERS READ
1    1404    1070    1424    1375
2    1487    796    1454    1517
4    3064    1743    3011    3019
8    6114    3556    6026    5954
16    11560    7051    12216    12132
32    23181    13079    23449    23561
64    43607    26983    43997    45636
128    80723    45169    81515    81911
256    110925    90185    107332    114046
512    119354    109817    110287    117506
768    112435    105795    106853    111605
1024    107554    105541    105942    109370
2048    88552    79024    80699    90555
4096    61323    54814    58704    61743

[3]
https://www.postgresql.org/message-id/20220309020123.sneaoijlg3rszvst@alap3.anarazel.de
https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Masahiko Sawada
Date:
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > > Please review the attached v2 patch further.
> > > >
> > > > I'm still unclear on the performance goals of this patch. I see that it
> > > > will reduce syscalls, which sounds good, but to what end?
> > > >
> > > > Does it allow a greater number of walsenders? Lower replication
> > > > latency? Less IO bandwidth? All of the above?
> > >
> > > One benefit would be that it'd make it more realistic to use direct IO for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less clearly a win.
> >
> > Satya's email just now reminded me of another important reason:
> >
> > Eventually we should add the ability to stream out WAL *before* it has locally
> > been written out and flushed. Obviously the relevant positions would have to
> > be noted in the relevant message in the streaming protocol, and we couldn't
> > generally allow standbys to apply that data yet.
> >
> > That'd allow us to significantly reduce the overhead of synchronous
> > replication, because instead of commonly needing to send out all the pending
> > WAL at commit, we'd just need to send out the updated flush position. The
> > reason this would lower the overhead is that:
> >
> > a) The reduced amount of data to be transferred reduces latency - it's easy to
> >    accumulate a few TCP packets worth of data even in a single small OLTP
> >    transaction
> > b) The remote side can start to write out data earlier
> >
> >
> > Of course this would require additional infrastructure on the receiver
> > side. E.g. some persistent state indicating up to where WAL is allowed to be
> > applied, to avoid the standby getting ahead of the primary, in case the
> > primary crash-restarts (or has more severe issues).
> >
> >
> > With a bit of work we could perform WAL replay on standby without waiting for
> > the fdatasync of the received WAL - that only needs to happen when a) we need
> > to confirm a flush position to the primary b) when we need to write back pages
> > from the buffer pool (and some other things).
>
> Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
> is right, the eventual plan is to do a bunch of other stuff as
> described above and we've discussed this in another thread (see
> below). I would like to once again clarify motivation behind this
> feature:
>
> 1. It enables WAL readers (callers of WALRead() - walsenders,
> pg_walinspect etc.) to use WAL buffers as a first-level cache, which
> might reduce the number of IOPS at peak load, especially when the
> pread() results in a disk read (i.e. the WAL isn't available in the OS
> page cache). I had earlier presented the buffer hit ratio and the
> number of pread() system calls saved for walsenders in the first email
> of this thread (95% of the time walsenders are able to read from WAL
> buffers without impacting anybody). Now, here are the results with the
> WAL DIO patch [1] - where every WAL pread() turns into a disk read -
> see the results [2] and the attached graph.
>
> 2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
> page cache, using WAL buffers as a read cache helps a lot. This is
> clearly evident from my experiment with the WAL DIO patch [1]; see the
> results [2] and the attached graph. As expected, WAL DIO alone brings
> down the TPS, whereas reading from WAL buffers (i.e. this patch)
> brings it back up.
>
> [2] Test case is an insert pgbench workload.
> clients    HEAD    WAL DIO    WAL DIO & WAL BUFFERS READ    WAL BUFFERS READ
> 1    1404    1070    1424    1375
> 2    1487    796    1454    1517
> 4    3064    1743    3011    3019
> 8    6114    3556    6026    5954
> 16    11560    7051    12216    12132
> 32    23181    13079    23449    23561
> 64    43607    26983    43997    45636
> 128    80723    45169    81515    81911
> 256    110925    90185    107332    114046
> 512    119354    109817    110287    117506
> 768    112435    105795    106853    111605
> 1024    107554    105541    105942    109370
> 2048    88552    79024    80699    90555
> 4096    61323    54814    58704    61743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WAL DIO vs. WAL DIO & WAL
BUFFERS READ), but there seems to be no visible performance gain with
only your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that
your patch should be included in the WAL DIO patch rather than applied
alone. Am I missing something?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Andres Freund
Date:
Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> If I'm understanding this result correctly, it seems to me that your
> patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> BUFFERS READ), but there seems no visible performance gain with only
> your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> patch should be included in the WAL DIO patch rather than applying it
> alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

    /*
     * Optimize writes by bypassing kernel cache with O_DIRECT when using
     * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
     * otherwise the archive command or walsender process will read the WAL
     * soon after writing it, which is guaranteed to cause a physical read if
     * we bypassed the kernel cache. We also skip the
     * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
     * reason.
     *
     * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
     * written by walreceiver is normally read by the startup process soon
     * after it's written. Also, walreceiver performs unaligned writes, which
     * don't work with O_DIRECT, so it is required for correctness too.
     */
    if (!XLogIsNeeded() && !AmWalReceiverProcess())
        o_direct_flag = PG_O_DIRECT;


Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Masahiko Sawada
Date:
On Fri, Jan 27, 2023 at 3:17 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> > If I'm understanding this result correctly, it seems to me that your
> > patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> > BUFFERS READ), but there seems no visible performance gain with only
> > your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> > patch should be included in the WAL DIO patch rather than applying it
> > alone. Am I missing something?
>
> We already support using DIO for WAL - it's just restricted in a way that
> makes it practically not usable. And the reason for that is precisely that
> walsenders need to read the WAL. See get_sync_bit():
>
>         /*
>          * Optimize writes by bypassing kernel cache with O_DIRECT when using
>          * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
>          * otherwise the archive command or walsender process will read the WAL
>          * soon after writing it, which is guaranteed to cause a physical read if
>          * we bypassed the kernel cache. We also skip the
>          * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
>          * reason.
>          *
>          * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
>          * written by walreceiver is normally read by the startup process soon
>          * after it's written. Also, walreceiver performs unaligned writes, which
>          * don't work with O_DIRECT, so it is required for correctness too.
>          */
>         if (!XLogIsNeeded() && !AmWalReceiverProcess())
>                 o_direct_flag = PG_O_DIRECT;
>
>
> Even if that weren't the case, splitting up bigger commits in incrementally
> committable chunks is a good idea.

Agreed. I was wondering because the test results don't demonstrate
the first stated motivation of this patch, which is to improve
performance by reducing disk I/O and system calls regardless of the
DIO patch. But it makes sense to me that this patch is a part of the
DIO patch series.

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Jan 27, 2023 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'd like to confirm whether there is any performance regression caused
> by this patch in some cases, especially when not using DIO.

Thanks. I ran three pgbench tests (insert-only, tpcb-like, and
simple-update) with a primary and 1 async standby. Please see the
numbers below and the attached graphs. I've not noticed a regression
as such; in fact, with the patch, there's a slight improvement. Note
that there's no WAL DIO involved here.

test-case 1:
clients    HEAD    PATCHED
1    139    156
2    624    599
4    3113    3410
8    6194    6433
16    11255    11722
32    22455    21658
64    46072    47103
128    80255    85970
256    110067    111488
512    114043    118094
768    109588    111892
1024    106144    109361
2048    85808    90745
4096    55911    53755

test-case 2:
clients    HEAD    PATCHED
1    177    128
2    186    425
4    2114    2946
8    5835    5840
16    10654    11199
32    14071    13959
64    18092    17519
128    27298    28274
256    24600    24843
512    17139    19450
768    16778    20473
1024    18294    20209
2048    12898    13920
4096    6399    6815

test-case 3:
clients    HEAD    PATCHED
1    148    191
2    302    317
4    3415    3243
8    5864    6193
16    9573    10267
32    14069    15819
64    17424    18453
128    24493    29192
256    33180    38250
512    35568    36551
768    29731    30317
1024    32291    32124
2048    27964    28933
4096    13702    15034

[1]
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
EOF

test-case 1:
./pgbench -i -s 300 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 2:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 3:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b simple-update
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Dilip Kumar
Date:
On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>

I have gone through this patch and have some comments (mostly
cosmetic ones and comment fixes).

1.
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a
+ * pointer to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+    (Size) XLOG_BLCKSZ);


From the above comment, it appears that we read from the exact
pointer we are interested in, but actually we read the complete page.
I think this comment needs to be fixed, and we can also explain why we
read the complete page here.

2.
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char    *recptr = NULL;

Generally, we use the name 'recptr' to represent XLogRecPtr type of
variable, but in your case, it is actually data at that recptr, so
better use some other name like 'buf' or 'buffer'.


3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are not in one page. */
+ Size bytes_remaining;

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? Why is it not simply 'else',
without a condition?
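To make the suggestion concrete, the boundary handling boils down to
a small helper like the standalone sketch below (hypothetical name
copy_from_page; it follows the shape of the quoted hunk, with the
redundant second condition collapsed to a plain 'else'):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XLOG_BLCKSZ 8192

/*
 * Copy up to nbytes from a single cached page into dst, starting at
 * 'data' (a pointer somewhere inside page[0..XLOG_BLCKSZ)).  Returns
 * the number of bytes actually copied; when the request crosses the
 * page boundary, the caller loops to fetch the remainder from the
 * next page.
 */
static size_t
copy_from_page(const char *page, const char *data, char *dst, size_t nbytes)
{
    size_t avail = (size_t) (page + XLOG_BLCKSZ - data);

    if (nbytes <= avail)
    {
        /* All the requested bytes fit in this page. */
        memcpy(dst, data, nbytes);
        return nbytes;
    }
    else
    {
        /* Page boundary: copy what this page has; caller continues. */
        memcpy(dst, data, avail);
        return avail;
    }
}
```

Since the two branches are exhaustive, the second condition in the
original hunk can never be false when reached, which is why a bare
'else' (or an assertion) is enough.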

4.
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes)

I think we do not need 2 separate variables 'count' and '*read_bytes',
just one variable for input/output is sufficient.  The original value
can always be stored in some temp variable
instead of the function argument.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Feb 7, 2023 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I have gone through this patch, I have some comments (mostly cosmetic
> and comments)

Thanks a lot for reviewing.

> From the above comments, it appears that we are reading from the exact
> pointer we are interested to read, but actually, we are reading
> the complete page.  I think this comment needs to be fixed and we can
> also explain why we read the complete page here.

I modified it. Please see the attached v3 patch.

> Generally, we use the name 'recptr' to represent XLogRecPtr type of
> variable, but in your case, it is actually data at that recptr, so
> better use some other name like 'buf' or 'buffer'.

Changed it to use 'data' as it seemed more appropriate than just a
buffer, to avoid confusion with the WAL buffer page.

> 3.
> + if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
> + {
> + }
> + else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
> + {
>
> Why do you have this 'else if ((recptr + nbytes) > (page +
> XLOG_BLCKSZ))' check in the else part? why it is not directly else
> without a condition in 'if'?

Changed.

> I think we do not need 2 separate variables 'count' and '*read_bytes',
> just one variable for input/output is sufficient.  The original value
> can always be stored in some temp variable
> instead of the function argument.

We could do that, but for the sake of readability and to avoid
cluttering the API, I kept it as-is.

Besides addressing the above review comments, I've made some more
changes: 1) I optimized the patch a bit by removing an extra memcpy.
Up until the v2 patch, the entire WAL buffer page was returned and the
caller took what it wanted from it. This adds an extra memcpy, so I
changed it to copy only what is wanted. 2) I improved the comments.

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and from WAL files. Given that
we won't wait for a lock or do a time-consuming task while reading
from buffers, it seems unnecessary.
2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.
3. Use the oldest initialized WAL buffer page to quickly tell if the
given LSN is present in WAL buffers without taking any lock - right
now, WALBufMappingLock is acquired to do so. While this doesn't seem
to impact much, it's good to optimize it away. But, the oldest
initialized WAL buffer page isn't tracked, so I've put up a patch and
sent in another thread [2]. Irrespective of [2], we are still good
with what we have in this patch.

[1]
recovery tests:
PATCHED: WAL buffers hit - 14759, misses - 3371

subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

pg_walinspect tests:
PATCHED: WAL buffers hit - 8, misses - 8

[2] https://www.postgresql.org/message-id/CALj2ACVgi6LirgLDZh%3DFdfdvGvKAD%3D%3DWTOSWcQy%3DAtNgPDVnKw%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Dilip Kumar
Date:
On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I can also do a few other things, but before working on them, I'd like
> to hear from others:
> 1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
> reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
> used both for reading from WAL buffers and WAL files. Given the fact
> that we won't wait for a lock or do a time-taking task while reading
> from buffers, it seems unnecessary.

Yes, we do not need this separate wait event, and we also don't need
the WAIT_EVENT_WAL_READ wait event while reading from the buffer.
Because we are not performing any IO, no specific wait event is
needed, and since reading from the WAL buffer acquires
WALBufMappingLock, any wait will be tracked under that lock's lwlock
event.

> 2. A separate TAP test for verifying that the WAL is actually read
> from WAL buffers - right now, existing tests for recovery,
> subscription, pg_walinspect already cover the code, see [1]. However,
> if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL buffer? If so, it would be good to have such a test to avoid
any future breakage of that logic. But if it just exercises the code
with no way for the test to validate whether it hit the WAL buffer or
not, then I think the existing cases are sufficient.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Feb 8, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I can also do a few other things, but before working on them, I'd like
> > to hear from others:
> > 1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
> > reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
> > used both for reading from WAL buffers and WAL files. Given the fact
> > that we won't wait for a lock or do a time-taking task while reading
> > from buffers, it seems unnecessary.
>
> Yes, we do not need this separate wait event and we also don't need
> WAIT_EVENT_WAL_READ wait event while reading from the buffer.  Because
> we are not performing any IO so no specific wait event is needed and
> for reading from the WAL buffer we are acquiring WALBufMappingLock so
> that lwlock event will be tracked under that lock.

Nope, LWLockConditionalAcquire doesn't wait, so there's no lock wait
event (no LWLockReportWaitStart) there. I agree not to have any wait
event for reading from WAL buffers, as no IO is involved. I removed it
in the attached v4 patch.

> > 2. A separate TAP test for verifying that the WAL is actually read
> > from WAL buffers - right now, existing tests for recovery,
> > subscription, pg_walinspect already cover the code, see [1]. However,
> > if needed, I can add a separate TAP test.
>
> Can we write a test that can actually validate that we have read from
> a WAL Buffer? If so then it would be good to have such a test to avoid
> any future breakage in that logic.  But if it is just for hitting the
> code but no guarantee that whether we can validate as part of the test
> whether it has hit the WAL buffer or not then I think the existing
> cases are sufficient.

We could set up a standby, a logical replication subscriber, or the
pg_walinspect extension, and verify that the code got hit with the
help of the server log (DEBUG1) message added by the patch. However,
this can make the test volatile.

Therefore, I came up with a simple and small test module/extension
named test_wal_read_from_buffers under src/test/modules. It basically
exposes a SQL function: given an LSN, it calls XLogReadFromBuffers()
and returns true if it hits the WAL buffers, otherwise false. The
simple TAP test of this module verifies that the function returns
true. I attached the test module as v4-0002 here. The test module is
admittedly specific, but it also serves as a demonstration of how one
can use the new XLogReadFromBuffers().

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nathan Bossart
Date:
On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:
> +            /*
> +             * We read some of the requested bytes. Continue to read remaining
> +             * bytes.
> +             */
> +            ptr += nread;
> +            nbytes -= nread;
> +            dst += nread;
> +            *read_bytes += nread;

Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
preventing us from copying all the data we need in one go?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:
> > +                     /*
> > +                      * We read some of the requested bytes. Continue to read remaining
> > +                      * bytes.
> > +                      */
> > +                     ptr += nread;
> > +                     nbytes -= nread;
> > +                     dst += nread;
> > +                     *read_bytes += nread;
>
> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
> preventing us from copying all the data we need in one go?

Note that most of the WALRead() callers request a single page of
XLOG_BLCKSZ bytes, even if the server has fewer or more WAL pages
available. It's the streaming replication walsender that can request
less than XLOG_BLCKSZ bytes and up to MAX_SEND_SIZE (16 *
XLOG_BLCKSZ). And, if we read, say, MAX_SEND_SIZE at once while
holding WALBufMappingLock, that might impact concurrent inserters (at
least in theory) - one of the main intentions of this patch is not to
impact inserters much.

Therefore, I feel reading one WAL buffer page at a time, which works
for most of the cases, without impacting concurrent inserters much is
better - https://www.postgresql.org/message-id/CALj2ACWXHP6Ha1BfDB14txm%3DXP272wCbOV00mcPg9c6EXbnp5A%40mail.gmail.com.
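To spell out the page-at-a-time strategy in standalone form (all
names here are hypothetical; read_one_page_locked stands in for the
cache lookup done under WALBufMappingLock, acquired and released once
per page, and always hits in this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XLOG_BLCKSZ 8192

typedef uint64_t XLogRecPtr;

/* Counts simulated lock acquisitions, one per page read. */
static int lock_acquisitions;

static bool
read_one_page_locked(XLogRecPtr page_start, char *page_buf)
{
    (void) page_start;          /* a real lookup would use this */
    lock_acquisitions++;        /* LWLockAcquire(WALBufMappingLock, LW_SHARED) */
    memset(page_buf, 'w', XLOG_BLCKSZ);    /* pretend the page is cached */
    return true;                /* LWLockRelease() would happen here */
}

/*
 * Read 'count' bytes starting at 'ptr', one WAL buffer page per lock
 * acquisition.  Returns the number of bytes actually read; a cache miss
 * midway stops the loop, and the caller falls back to the WAL file for
 * the rest.
 */
static size_t
read_from_buffers(XLogRecPtr ptr, char *dst, size_t count)
{
    size_t nread = 0;
    char   page[XLOG_BLCKSZ];

    while (nread < count)
    {
        XLogRecPtr page_start = ptr - (ptr % XLOG_BLCKSZ);
        size_t     off = (size_t) (ptr - page_start);
        size_t     n = XLOG_BLCKSZ - off;

        if (n > count - nread)
            n = count - nread;

        if (!read_one_page_locked(page_start, page))
            break;              /* requested page no longer in buffers */

        memcpy(dst + nread, page + off, n);
        nread += n;
        ptr += n;
    }
    return nread;
}
```

The design choice under discussion is exactly the placement of the
acquire/release relative to this loop: per iteration (as sketched),
or once around the whole loop.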

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Kuntal Ghosh
Date:
On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
s/fetched WAL buffers/fetched from WAL buffers


+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'if' condition should always be true. You can replace the same
with an assertion instead.
s/Continue to read remaining/Continue to read the remaining

The good thing about this patch is that it reduces read IO calls
without impacting write performance (at least not noticeably). It also
takes us one step forward towards the enhancements mentioned in the
thread. If performance is a concern, we can introduce a GUC to
enable/disable this feature.

--
Thanks & Regards,
Kuntal Ghosh



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nathan Bossart
Date:
On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
>> preventing us from copying all the data we need in one go?
> 
> Note that most of the WALRead() callers request a single page of
> XLOG_BLCKSZ bytes even if the server has less or more available WAL
> pages. It's the streaming replication wal sender that can request less
> than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> if we read, say, MAX_SEND_SIZE at once while holding
> WALBufMappingLock, that might impact concurrent inserters (at least, I
> can say it in theory) - one of the main intentions of this patch is
> not to impact inserters much.

Perhaps we should test both approaches to see if there is a noticeable
difference.  It might not be great for concurrent inserts to repeatedly
take the lock, either.  If there's no real difference, we might be able to
simplify the code a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Mar 1, 2023 at 12:06 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> +/*
> + * Guts of XLogReadFromBuffers().
> + *
> + * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
> + * fetched WAL buffers on timeline 'tli' and return the read bytes.
> + */
> s/fetched WAL buffers/fetched from WAL buffers

Modified that comment a bit and moved it to the XLogReadFromBuffers.

> + else if (nread < nbytes)
> + {
> + /*
> + * We read some of the requested bytes. Continue to read remaining
> + * bytes.
> + */
> + ptr += nread;
> + nbytes -= nread;
> + dst += nread;
> + *read_bytes += nread;
> + }
>
> The 'if' condition should always be true. You can replace the same
> with an assertion instead.

Yeah, added an assert and changed that 'else if (nread < nbytes)' to
a plain 'else'.

> s/Continue to read remaining/Continue to read the remaining

Done.

> The good thing about this patch is that it reduces read IO calls
> without impacting the write performance (at least not that
> noticeable). It also takes us one step forward towards the
> enhancements mentioned in the thread.

Right.

> If performance is a concern, we
> can introduce a GUC to enable/disable this feature.

I didn't see any performance issues from my testing so far with 3
different pgbench cases
https://www.postgresql.org/message-id/CALj2ACWXHP6Ha1BfDB14txm%3DXP272wCbOV00mcPg9c6EXbnp5A%40mail.gmail.com.

While adding a GUC to enable/disable a feature sounds useful, IMHO it
isn't good for the user, because we already have too many GUCs, and we
may not want every feature to be defensive and add its own GUC. If any
bugs arise due to some corner case we missed, we can surely help fix
them. Having said this, I'm open to suggestions here.

Please find the attached v5 patch set for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Mar 1, 2023 at 2:39 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please find the attached v5 patch set for further review.

I simplified the code largely by moving the logic of reading the WAL
buffer page from a separate function into the main while loop. This
let me get rid of the XLogReadFromBuffersGuts() that v5 and earlier
patches had.

Please find the attached v6 patch set for further review. Meanwhile,
I'll continue to work on the review comment raised upthread -
https://www.postgresql.org/message-id/20230301041523.GA1453450%40nathanxps13.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Mar 1, 2023 at 9:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> > On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
> >> preventing us from copying all the data we need in one go?
> >
> > Note that most of the WALRead() callers request a single page of
> > XLOG_BLCKSZ bytes even if the server has less or more available WAL
> > pages. It's the streaming replication wal sender that can request less
> > than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> > if we read, say, MAX_SEND_SIZE at once while holding
> > WALBufMappingLock, that might impact concurrent inserters (at least, I
> > can say it in theory) - one of the main intentions of this patch is
> > not to impact inserters much.
>
> Perhaps we should test both approaches to see if there is a noticeable
> difference.  It might not be great for concurrent inserts to repeatedly
> take the lock, either.  If there's no real difference, we might be able to
> simplify the code a bit.

I took a stab at this - acquire WALBufMappingLock separately for each
requested WAL buffer page vs acquire WALBufMappingLock once for all
requested WAL buffer pages. I chose the pgbench tpcb-like benchmark
that has 3 UPDATE statements and 1 INSERT statement. I ran pgbench for
30min with scale factor 100 and 4096 clients with primary and 1 async
standby, see [1]. I captured wait_events to see the contention on
WALBufMappingLock. I noticed no contention on the lock and no
difference in TPS either; see [2] for results on HEAD, see [3] for
results on v6 patch which has "acquire WALBufMappingLock separately
for each requested WAL buffer page" strategy and see [4] for results
on v7 patch (attached herewith) which has "acquire WALBufMappingLock
once for all requested WAL buffer pages" strategy. Another thing to
note from the test results is the reduction in WALRead IO wait events
from 136 on HEAD to 1 with the v6 or v7 patch. So, reading from WAL
buffers is really helping here.

With these observations, I'd like to use the approach that acquires
WALBufMappingLock once for all requested WAL buffer pages unlike v6
and the previous patches.

I'm attaching the v7 patch set with this change for further review.

[1]
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
./pgbench --initialize --scale=100 postgres
./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c4096 -j4096 -T1800

[2]
HEAD:
done in 20.03 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.19 s, primary keys 4.30 s).
tps = 11654.475345 (without initial connection time)

50950253  Lock            | transactionid
16472447  Lock            | tuple
3869523  LWLock          | LockManager
 739283  IPC             | ProcArrayGroupUpdate
 718549                  |
 439877  LWLock          | WALWrite
 130737  Client          | ClientRead
 121113  LWLock          | BufferContent
  70778  LWLock          | WALInsert
  43346  IPC             | XactGroupUpdate
  18547
  18546  Activity        | LogicalLauncherMain
  18545  Activity        | AutoVacuumMain
  18272  Activity        | ArchiverMain
  17627  Activity        | WalSenderMain
  17207  Activity        | WalWriterMain
  15455  IO              | WALSync
  14963  LWLock          | ProcArray
  14747  LWLock          | XactSLRU
  13943  Timeout         | CheckpointWriteDelay
  10519  Activity        | BgWriterHibernate
   8022  Activity        | BgWriterMain
   4486  Timeout         | SpinDelay
   4443  Activity        | CheckpointerMain
   1435  Lock            | extend
    670  LWLock          | XidGen
    373  IO              | WALWrite
    283  Timeout         | VacuumDelay
    268  IPC             | ArchiveCommand
    249  Timeout         | VacuumTruncate
    136  IO              | WALRead
    115  IO              | WALInitSync
     74  IO              | DataFileWrite
     67  IO              | WALInitWrite
     36  IO              | DataFileFlush
     35  IO              | DataFileExtend
     17  IO              | DataFileRead
      4  IO              | SLRUWrite
      3  IO              | BufFileWrite
      2  IO              | DataFileImmediateSync
      1  LWLock          | SInvalWrite
      1  LWLock          | LockFastPath
      1  IO              | ControlFileSyncUpdate

[3]
done in 19.99 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.52 s, vacuum 0.18 s, primary keys 4.28 s).
tps = 11689.584538 (without initial connection time)

50678977  Lock            | transactionid
16252048  Lock            | tuple
4146827  LWLock          | LockManager
 768256                  |
 719923  IPC             | ProcArrayGroupUpdate
 432836  LWLock          | WALWrite
 140354  Client          | ClientRead
 124203  LWLock          | BufferContent
  74355  LWLock          | WALInsert
  39852  IPC             | XactGroupUpdate
  30728
  30727  Activity        | LogicalLauncherMain
  30726  Activity        | AutoVacuumMain
  30420  Activity        | ArchiverMain
  29881  Activity        | WalSenderMain
  29418  Activity        | WalWriterMain
  23428  Activity        | BgWriterHibernate
  15960  Timeout         | CheckpointWriteDelay
  15840  IO              | WALSync
  15066  LWLock          | ProcArray
  14577  Activity        | CheckpointerMain
  14377  LWLock          | XactSLRU
   7291  Activity        | BgWriterMain
   4336  Timeout         | SpinDelay
   1707  Lock            | extend
    720  LWLock          | XidGen
    362  Timeout         | VacuumTruncate
    360  IO              | WALWrite
    304  Timeout         | VacuumDelay
    301  IPC             | ArchiveCommand
    106  IO              | WALInitSync
     82  IO              | DataFileWrite
     66  IO              | WALInitWrite
     45  IO              | DataFileFlush
     25  IO              | DataFileExtend
     18  IO              | DataFileRead
      5  LWLock          | LockFastPath
      2  IO              | DataFileSync
      2  IO              | DataFileImmediateSync
      1  LWLock          | BufferMapping
      1  IO              | WALRead
      1  IO              | SLRUWrite
      1  IO              | SLRURead
      1  IO              | ReplicationSlotSync
      1  IO              | BufFileRead

[4]
done in 19.92 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.23 s, primary keys 4.16 s).
tps = 11671.869074 (without initial connection time)

50614021  Lock            | transactionid
16482561  Lock            | tuple
4086451  LWLock          | LockManager
 777507                  |
 714329  IPC             | ProcArrayGroupUpdate
 420593  LWLock          | WALWrite
 138142  Client          | ClientRead
 125381  LWLock          | BufferContent
  75283  LWLock          | WALInsert
  38759  IPC             | XactGroupUpdate
  20283
  20282  Activity        | LogicalLauncherMain
  20281  Activity        | AutoVacuumMain
  20002  Activity        | ArchiverMain
  19467  Activity        | WalSenderMain
  19036  Activity        | WalWriterMain
  15836  IO              | WALSync
  15708  Timeout         | CheckpointWriteDelay
  15346  LWLock          | ProcArray
  15095  LWLock          | XactSLRU
  11852  Activity        | BgWriterHibernate
   8424  Activity        | BgWriterMain
   4636  Timeout         | SpinDelay
   4415  Activity        | CheckpointerMain
   2048  Lock            | extend
   1457  Timeout         | VacuumTruncate
    646  LWLock          | XidGen
    402  IO              | WALWrite
    306  Timeout         | VacuumDelay
    278  IPC             | ArchiveCommand
    117  IO              | WALInitSync
     74  IO              | DataFileWrite
     66  IO              | WALInitWrite
     35  IO              | DataFileFlush
     29  IO              | DataFileExtend
     24  LWLock          | LockFastPath
     14  IO              | DataFileRead
      2  IO              | SLRUWrite
      2  IO              | DataFileImmediateSync
      2  IO              | BufFileWrite
      1  LWLock          | BufferMapping
      1  IO              | WALRead
      1  IO              | SLRURead
      1  IO              | BufFileRead

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nathan Bossart
Date:
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+                    TimeLineID tli,
+                    Size count,
+                    char *buf,
+                    Size *read_bytes)

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

+    ptr = startptr;
+    nbytes = count;
+    dst = buf;

These variables seem superfluous.

+            /*
+             * Requested WAL isn't available in WAL buffers, so return with
+             * what we have read so far.
+             */
+            break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

    if (expectedEndPtr != endptr)
        break;

    ... logic for when the data is found in the WAL buffers ...

+                /*
+                 * All the bytes are not in one page. Read available bytes on
+                 * the current page, copy them over to output buffer and
+                 * continue to read remaining bytes.
+                 */

Is it possible to memcpy more than a page at a time?

+            /*
+             * The fact that we acquire WALBufMappingLock while reading the WAL
+             * buffer page itself guarantees that no one else initializes it or
+             * makes it ready for next use in AdvanceXLInsertBuffer().
+             *
+             * However, we perform basic page header checks for ensuring that
+             * we are not reading a page that just got initialized. Callers
+             * will anyway perform extensive page-level and record-level
+             * checks.
+             */

Hm.  I wonder if we should make these assertions instead.

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

+        /*
+         * Check if we have read fully (hit), partially (partial hit) or
+         * nothing (miss) from WAL buffers. If we have read either partially or
+         * nothing, then continue to read the remaining bytes the usual way,
+         * that is, read from WAL file.
+         */
+        if (count == read_bytes)
+        {
+            /* Buffer hit, so return. */
+            return true;
+        }
+        else if (read_bytes > 0 && count > read_bytes)
+        {
+            /*
+             * Buffer partial hit, so reset the state to count the read bytes
+             * and continue.
+             */
+            buf += read_bytes;
+            startptr += read_bytes;
+            count -= read_bytes;
+        }
+
+        /* Buffer miss i.e., read_bytes = 0, so continue */

I think we can simplify this.  We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

    if (count == read_bytes)
        return true;

    buf += read_bytes;
    startptr += read_bytes;
    count -= read_bytes;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> +void
> +XLogReadFromBuffers(XLogRecPtr startptr,
>
> Since this function presently doesn't return anything, can we have it
> return the number of bytes read instead of storing it in a pointer
> variable?

Done.

> +    ptr = startptr;
> +    nbytes = count;
> +    dst = buf;
>
> These variables seem superfluous.

I needed startptr and count for the DEBUG1 message and the assertion at
the end. I removed dst and used buf directly in the new patch.

> +            /*
> +             * Requested WAL isn't available in WAL buffers, so return with
> +             * what we have read so far.
> +             */
> +            break;
>
> nitpick: I'd move this to the top so that you can save a level of
> indentation.

Done.

> +                /*
> +                 * All the bytes are not in one page. Read available bytes on
> +                 * the current page, copy them over to output buffer and
> +                 * continue to read remaining bytes.
> +                 */
>
> Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

I would like to keep it simple - read a single page from WAL buffers,
verify it and continue.

> +            /*
> +             * The fact that we acquire WALBufMappingLock while reading the WAL
> +             * buffer page itself guarantees that no one else initializes it or
> +             * makes it ready for next use in AdvanceXLInsertBuffer().
> +             *
> +             * However, we perform basic page header checks for ensuring that
> +             * we are not reading a page that just got initialized. Callers
> +             * will anyway perform extensive page-level and record-level
> +             * checks.
> +             */
>
> Hm.  I wonder if we should make these assertions instead.

Okay. I added XLogReaderValidatePageHeader for assert-only builds
which will help catch any issues there. But we can't perform record
level checks here because this function doesn't know where the record
starts from, it knows only pages. This change required us to pass in
XLogReaderState to XLogReadFromBuffers. I marked it as
PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
passed as non-null so that someone who doesn't have XLogReaderState
can still read from buffers.

> +    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
> +         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
>
> I definitely don't think we should put an elog() in this code path.
> Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

On a different note - I was recently looking at the code around
WAL_DEBUG macro and the wal_debug GUC. It looks so complex that one
needs to build source code with the WAL_DEBUG macro and enable the GUC
to see the extended logs for WAL. IMO, the best way there is either:
1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
2) unify all the code under wal_debug GUC (it is developer-only and
superuser-only so there shouldn't be a problem even if we ship it out
of the box).

If someone is concerned about the GUC being enabled on production
servers knowingly or unknowingly with option (2), we can go ahead with
option (1). I will discuss this separately to see what others think.

> I think we can simplify this.  We effectively take the same action any time
> "count" doesn't equal "read_bytes", so there's no need for the "else if".
>
>         if (count == read_bytes)
>                 return true;
>
>         buf += read_bytes;
>         startptr += read_bytes;
>         count -= read_bytes;

I wanted to avoid setting these unnecessarily for buffer misses.

Thanks a lot for reviewing. I'm attaching the v8 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nathan Bossart
Date:
On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Is it possible to memcpy more than a page at a time?
> 
> It would complicate things a lot there; the logic to figure out the
> last page bytes that may or may not fit in the whole page gets
> complicated. Also, the logic to verify each page's header gets
> complicated. We might lose out if we memcpy all the pages at once and
> start verifying each page's header in another loop.

Doesn't the complicated logic you describe already exist to some extent in
the patch?  You are copying a page at a time, which involves calculating
various addresses and byte counts.

>> +    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
>> +         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
>>
>> I definitely don't think we should put an elog() in this code path.
>> Perhaps this should be guarded behind WAL_DEBUG.
> 
> Placing it behind WAL_DEBUG doesn't help users/developers. My
> intention was to let users know that the WAL read hit the buffers,
> it'll help them report if any issue occurs and also help developers to
> debug that issue.

I still think an elog() is mighty expensive for this code path, even when
it doesn't actually produce any messages.  And when it does, I think it has
the potential to be incredibly noisy.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nitin Jadhav
Date:
> [1]
> subscription tests:
> PATCHED: WAL buffers hit - 1972, misses - 32616

Can you share more details about the test here?

I went through the v8 patch. Following are my thoughts to improve the
WAL buffer hit ratio.

Currently the no-longer-needed WAL data present in WAL buffers gets
cleared in XLogBackgroundFlush() which is called based on the
wal_writer_delay config setting. Once the data is flushed to the disk,
it is treated as no-longer-needed and it will be cleared as soon as
possible based on some config settings. I have done some testing by
tweaking the wal_writer_delay config setting to confirm the behaviour.
We can see that the WAL buffer hit ratio is good when the
wal_writer_delay is big enough [2] compared to smaller
wal_writer_delay [1]. So irrespective of the wal_writer_delay
settings, we should keep the WAL data in the WAL buffers as long as
possible so that all the readers (mainly WAL senders) can take
advantage of this. The WAL page should be evicted from the WAL buffers
only when the WAL buffer is full and we need room for the new page.
The patch attached takes care of this. We can see the improvements in
WAL buffer hit ratio even when the wal_writer_delay is set to lower
value [3].

Second, In WALRead(), we try to read the data from disk whenever we
don't find the data from WAL buffers. We don't store this data in the
WAL buffer. We just read the data, use it and leave it. If we store
this data to the WAL buffer, then we may avoid a few disk reads.

[1]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=5046
WAL buffers miss=56767

[2]:
wal_buffers=1GB
wal_writer_delay=10s
./pgbench --initialize --scale=300 postgres

WAL buffers hit=45454
WAL buffers miss=14064

[3]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=37214
WAL buffers miss=844

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 7, 2023 at 12:39 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > +void
> > +XLogReadFromBuffers(XLogRecPtr startptr,
> >
> > Since this function presently doesn't return anything, can we have it
> > return the number of bytes read instead of storing it in a pointer
> > variable?
>
> Done.
>
> > +    ptr = startptr;
> > +    nbytes = count;
> > +    dst = buf;
> >
> > These variables seem superfluous.
>
> Needed startptr and count for DEBUG1 message and assertion at the end.
> Removed dst and used buf in the new patch now.
>
> > +            /*
> > +             * Requested WAL isn't available in WAL buffers, so return with
> > +             * what we have read so far.
> > +             */
> > +            break;
> >
> > nitpick: I'd move this to the top so that you can save a level of
> > indentation.
>
> Done.
>
> > +                /*
> > +                 * All the bytes are not in one page. Read available bytes on
> > +                 * the current page, copy them over to output buffer and
> > +                 * continue to read remaining bytes.
> > +                 */
> >
> > Is it possible to memcpy more than a page at a time?
>
> It would complicate things a lot there; the logic to figure out the
> last page bytes that may or may not fit in the whole page gets
> complicated. Also, the logic to verify each page's header gets
> complicated. We might lose out if we memcpy all the pages at once and
> start verifying each page's header in another loop.
>
> I would like to keep it simple - read a single page from WAL buffers,
> verify it and continue.
>
> > +            /*
> > +             * The fact that we acquire WALBufMappingLock while reading the WAL
> > +             * buffer page itself guarantees that no one else initializes it or
> > +             * makes it ready for next use in AdvanceXLInsertBuffer().
> > +             *
> > +             * However, we perform basic page header checks for ensuring that
> > +             * we are not reading a page that just got initialized. Callers
> > +             * will anyway perform extensive page-level and record-level
> > +             * checks.
> > +             */
> >
> > Hm.  I wonder if we should make these assertions instead.
>
> Okay. I added XLogReaderValidatePageHeader for assert-only builds
> which will help catch any issues there. But we can't perform record
> level checks here because this function doesn't know where the record
> starts from, it knows only pages. This change required us to pass in
> XLogReaderState to XLogReadFromBuffers. I marked it as
> PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
> passed as non-null so that someone who doesn't have XLogReaderState
> can still read from buffers.
>
> > +    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
> > +         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >
> > I definitely don't think we should put an elog() in this code path.
> > Perhaps this should be guarded behind WAL_DEBUG.
>
> Placing it behind WAL_DEBUG doesn't help users/developers. My
> intention was to let users know that the WAL read hit the buffers,
> it'll help them report if any issue occurs and also help developers to
> debug that issue.
>
> On a different note - I was recently looking at the code around
> WAL_DEBUG macro and the wal_debug GUC. It looks so complex that one
> needs to build source code with the WAL_DEBUG macro and enable the GUC
> to see the extended logs for WAL. IMO, the best way there is either:
> 1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
> 2) unify all the code under wal_debug GUC (it is developer-only and
> superuser-only so there shouldn't be a problem even if we ship it out
> of the box).
>
> If someone is concerned about the GUC being enabled on production
> servers knowingly or unknowingly with option (2), we can go ahead with
> option (1). I will discuss this separately to see what others think.
>
> > I think we can simplify this.  We effectively take the same action any time
> > "count" doesn't equal "read_bytes", so there's no need for the "else if".
> >
> >         if (count == read_bytes)
> >                 return true;
> >
> >         buf += read_bytes;
> >         startptr += read_bytes;
> >         count -= read_bytes;
>
> I wanted to avoid setting these unnecessarily for buffer misses.
>
> Thanks a lot for reviewing. I'm attaching the v8 patch for further review.
>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sun, Mar 12, 2023 at 12:52 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> I went through the v8 patch.

Thanks for looking at it. Please post responses in-line rather than
above the entire previous message, for better readability.

> Following are my thoughts to improve the
> WAL buffer hit ratio.

Note that the motive of this patch is to read WAL from WAL buffers
*when possible* without affecting concurrent WAL writers.

> Currently the no-longer-needed WAL data present in WAL buffers gets
> cleared in XLogBackgroundFlush() which is called based on the
> wal_writer_delay config setting. Once the data is flushed to the disk,
> it is treated as no-longer-needed and it will be cleared as soon as
> possible based on some config settings.

The opportunistic pre-initialization of as many WAL buffer pages as
possible is there for a purpose; there's an illuminating comment [1],
so removing it fully is a no-go IMO. For instance, it makes WAL buffer
pages available for concurrent writers so there will be less work for
writers in GetXLogBuffer. I'm sure removing the opportunistic
pre-initialization of the WAL buffer pages will hurt performance in a
highly concurrent write-heavy workload.

    /*
     * Great, done. To take some work off the critical path, try to initialize
     * as many of the no-longer-needed WAL buffers for future use as we can.
     */
    AdvanceXLInsertBuffer(InvalidXLogRecPtr, insertTLI, true);

> Second, In WALRead(), we try to read the data from disk whenever we
> don't find the data from WAL buffers. We don't store this data in the
> WAL buffer. We just read the data, use it and leave it. If we store
> this data to the WAL buffer, then we may avoid a few disk reads.

Again, this is going to hurt concurrent writers. Note that WAL buffers
aren't used as a full cache per se; there'll be multiple writers to
them, and *when possible* readers will try to read from them without
hurting writers.

> The patch attached takes care of this.

Please post the new proposal as a text file (not a .patch file) or as
a plain text in the email itself if the change is small or attach all
the patches if the patch is over-and-above the proposed patches.
Attaching a single over-and-above patch will make CFBot unhappy and
will force authors to repost the original patches. Typically, we
follow this. Having said that, I have some review comments to fix on
v8-0001, so I'll be sending out a v9 patch set soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sun, Mar 12, 2023 at 11:00 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I have some review comments to fix on
> v8-0001, so, I'll be sending out v9 patch-set soon.

Please find the attached v9 patch set for further review. I moved the
check for just-initialized WAL buffer pages to before reading the page.
Up until now, it was the other way around, meaning, read the page and
then check the header to see if it was just initialized, which is
wrong. The attached v9 patch set corrects it.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Mar 7, 2023 at 11:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> Is it possible to memcpy more than a page at a time?
> >
> > It would complicate things a lot there; the logic to figure out the
> > last page bytes that may or may not fit in the whole page gets
> > complicated. Also, the logic to verify each page's header gets
> > complicated. We might lose out if we memcpy all the pages at once and
> > start verifying each page's header in another loop.
>
> Doesn't the complicated logic you describe already exist to some extent in
> the patch?  You are copying a page at a time, which involves calculating
> various addresses and byte counts.

Okay, here I am with the v10 patch set attached, which avoids multiple
memcpy calls; this should benefit callers that want to read more than
one WAL buffer page (the streaming replication WAL sender, for instance).

> >> +    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
> >> +         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >>
> >> I definitely don't think we should put an elog() in this code path.
> >> Perhaps this should be guarded behind WAL_DEBUG.
> >
> > Placing it behind WAL_DEBUG doesn't help users/developers. My
> > intention was to let users know that the WAL read hit the buffers,
> > it'll help them report if any issue occurs and also help developers to
> > debug that issue.
>
> I still think an elog() is mighty expensive for this code path, even when
> it doesn't actually produce any messages.  And when it does, I think it has
> the potential to be incredibly noisy.

Well, my motive was to have a way for the user to know about WAL buffer
hits and misses so they can report any issues found. However, I plan later to
add WAL buffer stats (hits/misses). I understand that even if someone
enables DEBUG1, this message can bloat server log files and make
recovery slower, especially on a standby. Hence, I agree to keep these
logs behind the WAL_DEBUG macro like others and did so in the attached
v10 patch set.

Please review the attached v10 patch set further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> One benefit would be that it'd make it more realistic to use direct
> IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less
> clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

If I recall correctly, this patch does not provide a performance benefit as-is
(correct me if things have changed) and I don't know if a reduction in
syscalls alone is enough to justify it. But if it paves the way for
direct IO for WAL, that does seem worth it.

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Andres Freund
Date:
Hi,

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > One benefit would be that it'd make it more realistic to use direct
> > IO for WAL
> > - for which I have seen significant performance benefits. But when we
> > afterwards have to re-read it from disk to replicate, it's less
> > clearly a win.
> 
> Does this patch still look like a good fit for your (or someone else's)
> plans for direct IO here? If so, would committing this soon make it
> easier to make progress on that, or should we wait until it's actually
> needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.


I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has locally
been written out and flushed to disk. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: It takes less time
to send just "write LSN: %X/%X, flush LSN: %X/%X" than also having to send
all the not-yet-durable WAL.

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size.

E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
many real-world OLTP workloads actually do more writes than
pgbench. Making the commit latency of the latter be closer to the commit
latency of the former when using syncrep would obviously be great.

Of course this patch is just a relatively small step towards that: We'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> >
> > Does this patch still look like a good fit for your (or someone else's)
> > plans for direct IO here? If so, would committing this soon make it
> > easier to make progress on that, or should we wait until it's actually
> > needed?
>
> I think it'd be quite useful to have. Even with the code as of 16, I see
> better performance in some workloads with debug_io_direct=wal,
> wal_sync_method=open_datasync compared to any other configuration. Except of
> course that it makes walsenders more problematic, as they suddenly require
> read IO. Thus having support for walsenders to send directly from wal buffers
> would be beneficial, even without further AIO infrastructure.

Right. Tests show the benefit with WAL DIO + this patch -
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com.

Also, irrespective of WAL DIO, the WAL buffers hit ratio with the
patch stood at 95% for 1 primary, 1 sync standby, 1 async standby,
pgbench --scale=300 --client=32 --time=900. In other words, the
walsenders avoided reading from the file (avoided pread() system
calls) 95% of the time -
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com.

> I also think there are other quite desirable features that are made easier by
> this patch. One of the primary problems with using synchronous replication is
> the latency increase, obviously. We can't send out WAL before it has locally
> been written out and flushed to disk. For some workloads, we could
> substantially lower synchronous commit latency if we were able to send WAL to
> remote nodes *before* WAL has been made durable locally, even if the receiving
> systems wouldn't be allowed to write that data to disk yet: It takes less time
> to send just "write LSN: %X/%X, flush LSN: %X/%X" than also having to send
> all the not-yet-durable WAL.
>
> In many OLTP workloads there won't be WAL flushes between generating WAL for
> DML and commit, which means that the amount of WAL that needs to be sent out
> at commit can be of nontrivial size.
>
> E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
> single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
> everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
> many real-world OLTP workloads actually do more writes than
> pgbench. Making the commit latency of the latter be closer to the commit
> latency of the former when using syncrep would obviously be great.
>
> Of course this patch is just a relatively small step towards that: We'd also
> need in-memory buffering on the receiving side, the replication protocol would
> need to be improved, we'd likely need an option to explicitly opt into
> receiving unflushed data. But it's still a pretty much required step.

Yes, this patch can pave the way for all of the above features in the
future. However, I'm looking forward to getting this in for now.
Later, I'll come up with more concrete thoughts on the above.

Having said above, the latest v10 patch after addressing some of the
review comments is at
https://www.postgresql.org/message-id/CALj2ACU3ZYzjOv4vZTR%2BLFk5PL4ndUnbLS6E1vG2dhDBjQGy2A%40mail.gmail.com.
Any further thoughts on the patch are welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> > On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > > One benefit would be that it'd make it more realistic to use direct
> > > IO for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less
> > > clearly a win.
> >
> > Does this patch still look like a good fit for your (or someone else's)
> > plans for direct IO here? If so, would committing this soon make it
> > easier to make progress on that, or should we wait until it's actually
> > needed?
>
> I think it'd be quite useful to have. Even with the code as of 16, I see
> better performance in some workloads with debug_io_direct=wal,
> wal_sync_method=open_datasync compared to any other configuration. Except of
> course that it makes walsenders more problematic, as they suddenly require
> read IO. Thus having support for walsenders to send directly from wal buffers
> would be beneficial, even without further AIO infrastructure.

I'm attaching the v11 patch set with the following changes:
- Improved input validation in the function that reads WAL from WAL
buffers in 0001 patch.
- Improved test module's code in 0002 patch.
- Modernized meson build file in 0002 patch.
- Added commit messages for both the patches.
- Ran pgindent on both the patches.

Any thoughts are welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Oct 20, 2023 at 10:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> > > On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > > > One benefit would be that it'd make it more realistic to use direct
> > > > IO for WAL
> > > > - for which I have seen significant performance benefits. But when we
> > > > afterwards have to re-read it from disk to replicate, it's less
> > > > clearly a win.
> > >
> > > Does this patch still look like a good fit for your (or someone else's)
> > > plans for direct IO here? If so, would committing this soon make it
> > > easier to make progress on that, or should we wait until it's actually
> > > needed?
> >
> > I think it'd be quite useful to have. Even with the code as of 16, I see
> > better performance in some workloads with debug_io_direct=wal,
> > wal_sync_method=open_datasync compared to any other configuration. Except of
> > course that it makes walsenders more problematic, as they suddenly require
> > read IO. Thus having support for walsenders to send directly from wal buffers
> > would be beneficial, even without further AIO infrastructure.
>
> I'm attaching the v11 patch set with the following changes:
> - Improved input validation in the function that reads WAL from WAL
> buffers in 0001 patch.
> - Improved test module's code in 0002 patch.
> - Modernized meson build file in 0002 patch.
> - Added commit messages for both the patches.
> - Ran pgindent on both the patches.
>
> Any thoughts are welcome.

I'm attaching v12 patch set with just pgperltidy ran on the new TAP
test added in 0002. No other changes from that of v11 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Sat, 2023-10-21 at 23:59 +0530, Bharath Rupireddy wrote:
> I'm attaching v12 patch set with just pgperltidy ran on the new TAP
> test added in 0002. No other changes from that of v11 patch set.

Thank you.

Comments:

* It would be good to document that this is partially an optimization
(read from memory first) and partially an API difference that allows
reading unflushed data. For instance, walsender may benefit
performance-wise (and perhaps later with the ability to read unflushed
data) whereas pg_walinspect benefits primarily from reading unflushed
data.

* Shouldn't there be a new method in XLogReaderRoutine (e.g.
read_unflushed_data), rather than having logic in WALRead()? The
callers can define the method if it makes sense (and that would be a
good place to document why); or leave it NULL if not.

* I'm not clear on the "partial hit" case. Wouldn't that mean you found
the earliest byte in the buffers, but not the latest byte requested? Is
that possible, and if so under what circumstances? I added an
"Assert(nread == 0 || nread == count)" in WALRead() after calling
XLogReadFromBuffers(), and it wasn't hit.

* If the partial hit case is important, wouldn't XLogReadFromBuffers()
fill in the end of the buffer rather than the beginning?

* Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
expectedEndPtr both before and after the memcpy(), with appropriate
barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
and Masahiko Sawada.

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

* Style-wise, the use of "unlikely" seems excessive, unless there's a
reason to think it matters.

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Nathan Bossart
Date:
On Tue, Oct 24, 2023 at 05:15:19PM -0700, Jeff Davis wrote:
> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

I don't recall exactly why I suggested this, but if additional memcpy()s
help in some way and don't negatively impact performance, then I retract my
previous comment.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Oct 25, 2023 at 5:45 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Comments:

Thanks for reviewing.

> * It would be good to document that this is partially an optimization
> (read from memory first) and partially an API difference that allows
> reading unflushed data. For instance, walsender may benefit
> performance-wise (and perhaps later with the ability to read unflushed
> data) whereas pg_walinspect benefits primarily from reading unflushed
> data.

The commit message has these things covered in detail. However, I think
adding some info in the code comments is a good idea; done around
the WALRead() function in the attached v13 patch set.

> * Shouldn't there be a new method in XLogReaderRoutine (e.g.
> read_unflushed_data), rather than having logic in WALRead()? The
> callers can define the method if it makes sense (and that would be a
> good place to document why); or leave it NULL if not.

I've designed the new function XLogReadFromBuffers to read from WAL
buffers in such a way that one can easily embed it in page_read
callbacks if it makes sense. Almost all the available backend
page_read callbacks (read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page), except XLogPageRead
(which is used for recovery, when WAL buffers aren't used at all), have
one thing in common, that is, WALRead(). Therefore, it seemed a
natural choice for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar that skips WALRead() and uses just the new function
XLogReadFromBuffers() to read from WAL buffers and return.

> * I'm not clear on the "partial hit" case. Wouldn't that mean you found
> the earliest byte in the buffers, but not the latest byte requested? Is
> that possible, and if so under what circumstances? I added an
> "Assert(nread == 0 || nread == count)" in WALRead() after calling
> XLogReadFromBuffers(), and it wasn't hit.
>
> * If the partial hit case is important, wouldn't XLogReadFromBuffers()
> fill in the end of the buffer rather than the beginning?

A partial hit was possible when the requested WAL pages were read one
page at a time from WAL buffers, with a WALBufMappingLock
acquisition-release for each page, because the requested page could be
replaced by the time the lock was released and reacquired. This was the
case up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com.
The approach has since been changed to read multiple pages at once
under one WALBufMappingLock acquisition-release.
We can either keep the partial hit handling (just to not miss
anything) or turn the following partial hit case into an error or an
Assert(false). I prefer to keep the partial hit handling as-is just
in case:
+        else if (count > nread)
+        {
+            /*
+             * Buffer partial hit, so reset the state to count the read bytes
+             * and continue.
+             */
+            buf += nread;
+            startptr += nread;
+            count -= nread;
+        }

> * Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
> effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
> too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
> expectedEndPtr both before and after the memcpy(), with appropriate
> barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
> and Masahiko Sawada.

Yes, I proposed that idea in another thread -
https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com.
If that looks okay, I can add it to the next version of this patch
set.

> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

Up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com,
the requested WAL was being read one page at a time from WAL buffers
into the output buffer, with one memcpy() call for each page. The
approach has since been changed to read multiple pages at once under
one WALBufMappingLock acquisition-release, with a comparatively
smaller number of memcpy() calls. I honestly haven't seen any
difference between the
two approaches -
https://www.postgresql.org/message-id/CALj2ACUpQGiwQTzmoSMOFk5%3DWiJc06FcYpxzBX0SEej4ProRzg%40mail.gmail.com.

The new approach of reading multiple pages at once under one
WALBufMappingLock acquisition-release clearly wins over reading one
page at a time with multiple lock acquisition-release cycles.

> * Style-wise, the use of "unlikely" seems excessive, unless there's a
> reason to think it matters.

Given the current use of XLogReadFromBuffers(), the input parameters
are passed as expected; in other words, these failure cases are
unlikely events. The comments [1] say that unlikely() is to be used in
hot code paths; I think reading WAL from buffers is a hot code path,
especially when called from (logical and physical) walsenders. If
there's any stronger reason than appearance/style, I'm okay with not
using them. For now, I've retained them.

FWIW, I found heapam.c using unlikely() extensively for safety checks.

[1]
 * Hints to the compiler about the likelihood of a branch. Both likely() and
 * unlikely() return the boolean value of the contained expression.
 *
 * These should only be used sparingly, in very hot code paths. It's very easy
 * to mis-estimate likelihoods.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote:
>
> Almost all the available backend
> page_read callbacks read_local_xlog_page_no_wait,
> read_local_xlog_page, logical_read_xlog_page except XLogPageRead
> (which is used for recovery when WAL buffers aren't used at all) have
> one thing in common, that is, WALRead(). Therefore, it seemed a
> natural choice for me to call XLogReadFromBuffers. In other words,
> I'd
> say it's the responsibility of page_read callback implementers to
> decide if they want to read from WAL buffers or not and hence I don't
> think we need a separate XLogReaderRoutine.

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

I suppose the question is: should reading from the WAL buffers be an
intentional thing that the caller does explicitly by specific callers?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now. I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.

> If someone wants to read unflushed WAL, the typical way to implement
> it is to write a new page_read callback
> read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> similar without WALRead() but just the new function
> XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

>
> I prefer to keep the partial hit handling as-is just
> in case:
>

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)? In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

> Yes, I proposed that idea in another thread -
> https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> .
> If that looks okay, I can add it to the next version of this patch
> set.

The code in the email above still shows a call to:

  /*
   * Requested WAL is available in WAL buffers, so recheck the existence
   * under the WALBufMappingLock and read if the page still exists,
   * otherwise return.
   */
  LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

  endptr1 = XLogCtl->xlblocks[idx];
  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr1)
      break;

  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_read_this_loop);
  ...
  pg_read_barrier();
  endptr2 = XLogCtl->xlblocks[idx];
  if (expectedEndPtr != endptr2)
      break;

  ntotal += bytes_read_this_loop;
  /* success; move on to next page */
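
For illustration, the double-check pattern sketched above could look
like the following standalone C, under a simplified model of the
shared state (BLCKSZ, xlblocks, and pages here are toy stand-ins for
the xlog.c structures, and atomic_thread_fence() stands in for
pg_read_barrier()):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ    8192
#define N_BUFFERS 4

typedef uint64_t XLogRecPtr;

/* Simulated shared state: the end-LSN of each buffer page, plus the pages. */
static XLogRecPtr xlblocks[N_BUFFERS];
static char       pages[N_BUFFERS][BLCKSZ];

/*
 * Copy one page only if its end pointer matches what we expect both
 * before and after the memcpy(); a mismatch afterwards means a
 * concurrent writer replaced the page mid-copy, so the copy is invalid.
 * Returns true if 'dst' now holds a consistent copy of the page.
 */
static bool
read_page_if_valid(int idx, XLogRecPtr expected_endptr, char *dst)
{
    if (xlblocks[idx] != expected_endptr)
        return false;           /* page not (or no longer) in buffers */

    atomic_thread_fence(memory_order_acquire);  /* pg_read_barrier() stand-in */
    memcpy(dst, pages[idx], BLCKSZ);
    atomic_thread_fence(memory_order_acquire);

    /* Re-check: did a writer evict the page while we were copying? */
    return xlblocks[idx] == expected_endptr;
}
```

The point of the second check is that a failed re-check discards the
possibly-torn copy instead of trusting it, which is what makes the
lock-free read safe.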

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)
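
As a side note on the atomics question: in plain C11 terms, declaring
the slot _Atomic (which implies suitable alignment) rules out torn
reads, and an acquire load carries the needed barrier itself. A
minimal sketch, not the actual pg_atomic_* API:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A single simulated xlblocks slot, declared atomic. */
static _Atomic uint64_t xlblock_slot;

/*
 * An acquire load guarantees the value is not torn and that subsequent
 * reads of the page contents cannot be reordered before it.
 */
static uint64_t
read_endptr(void)
{
    return atomic_load_explicit(&xlblock_slot, memory_order_acquire);
}

/* A writer would publish a new end pointer with a release store. */
static void
set_endptr(uint64_t val)
{
    atomic_store_explicit(&xlblock_slot, val, memory_order_release);
}
```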
>

>
> FWIW, I found heapam.c using unlikely() extensively for safety
> checks.

OK, I won't object to the use of unlikely(), though I typically don't
use it without a fairly strong reason to think I should override what
the compiler thinks and/or what branch predictors can handle.

In this case, I think some of those errors are not really necessary
anyway, though:

  * XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.
  * Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).
  * An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

Regards,
    Jeff Davis







Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sat, Oct 28, 2023 at 2:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> I think I see what you are saying: WALRead() is at a lower level than
> the XLogReaderRoutine callbacks, because it's used by the .page_read
> callbacks.
>
> That makes sense, but my first interpretation was that WALRead() is
> above the XLogReaderRoutine callbacks because it calls .segment_open
> and .segment_close. To me that sounds like a layering violation, but it
> exists today without your patch.

Right. WALRead() is a common function used by most if not all
page_read callbacks. Typically, the page_read callbacks code has 2
parts - first determine the target/start LSN and second read WAL (via
WALRead() for instance).

> I suppose the question is: should reading from the WAL buffers be an
> intentional thing that the caller does explicitly by specific callers?
> Or is it an optimization that should be hidden from the caller?
>
> I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

> I suspect that when we do
> some more interesting things, like replicating unflushed data, we will
> want reading from buffers to be a separate step, not combined with
> WALRead(). After things in this area settle a bit then we might want to
> refactor and combine them again.

As said previously, the new XLogReadFromBuffers() function is generic
and extensible in that anyone with a target/start LSN
(corresponding to flushed or written-but-not-yet-flushed WAL) and TLI
can call it to read from WAL buffers. It's just that the patch
currently uses it where it makes sense i.e. in WALRead(). But, it's
usable in, say, a page_read callback reading unflushed WAL from WAL
buffers.

> > If someone wants to read unflushed WAL, the typical way to implement
> > it is to write a new page_read callback
> > read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> > similar without WALRead() but just the new function
> > XLogReadFromBuffers to read from WAL buffers and return.
>
> Then why is it being called from WALRead() at all?

The patch focuses on reading flushed WAL from WAL buffers if
available, not the unflushed WAL at all; that's why it's in WALRead()
before reading from the WAL file using pg_pread().

I'm trying to make the point that XLogReadFromBuffers() enables one
to read unflushed WAL from WAL buffers (if really wanted for future
features like replicate from WAL buffers as a new opt-in feature to
improve the replication performance).

> > I prefer to keep the partial hit handling as-is just
> > in case:
>
> So a "partial hit" is essentially a narrow race condition where one
> page is read from buffers, and it's valid; and by the time it gets to
> the next page, it has already been evicted (along with the previously
> read page)?
>
> In other words, I think you are describing a case where
> eviction is happening faster than the memcpy()s in a loop, which is
> certainly possible due to scheduling or whatever, but doesn't seem like
> the common case.
>
> The case I'd expect for a partial read is when the startptr points to
> an evicted page, but some later pages in the requested range are still
> present in the buffers.
>
> I'm not really sure whether either of these cases matters, but if we
> implement one and not the other, there should be some explanation.

At any given point in time, WAL buffer pages are maintained as a
circularly sorted array in ascending order from
OldestInitializedPage to InitializedUpTo (new pages are inserted at
this end). Also, the current patch holds WALBufMappingLock while
reading the buffer pages, meaning, no one can replace the buffer pages
until reading is finished. Therefore, the above two described partial
hit cases can't happen - when reading multiple pages, if the first page
is found to exist in the buffer pages, the other pages must exist too
because of the circular, sorted nature of the WAL buffer page array.

Here's an illustration with WAL buffers circular array (idx, LSN) of
size 10 elements with contents as {(0, 160), (1, 170), (2, 180), (3,
90), (4, 100), (5, 110), (6, 120), (7, 130), (8, 140), (9, 150)} and
current InitializedUpTo pointing to page at LSN 180, idx 2.
- Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
as the page at LSN 80 doesn't exist despite other 5 pages starting
from LSN 90 exist.
- Read 6 pages starting from LSN 90. All the pages exist and are read
from WAL buffers.
- Read 6 pages starting from LSN 150. Note that WAL is currently
flushed only up to page at LSN 180 and the callers won't ask for
unflushed WAL read. If a caller asks for an unflushed WAL read
intentionally or unintentionally, XLogReadFromBuffers() reads only 4
pages starting from LSN 150 to LSN 180 and will leave the remaining 2
pages for the caller to deal with. This is the partial hit that can
happen. Therefore, retaining the partial hit code in WALRead() as-is
in the current patch is needed IMV.
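
The illustration above can be reduced to a toy model in C (a
hypothetical PAGE_SZ of 10 and the LSN bounds from the example; these
names are simplified stand-ins, not xlog.c symbols):

```c
#include <stdint.h>

#define PAGE_SZ 10              /* toy page size matching the illustration */

typedef uint64_t XLogRecPtr;

/* Example state: pages exist from OldestInitializedPage (LSN 90) up to
 * and including the InitializedUpTo page (LSN 180). */
static const XLogRecPtr oldest_lsn = 90;
static const XLogRecPtr newest_lsn = 180;

/*
 * How many of 'npages' pages starting at 'start' can be served from the
 * buffers?  Because the circular array is sorted, a hit on the first
 * page guarantees hits on every following page up to newest_lsn, while
 * a miss on the first page means nothing is read at all.
 */
static int
pages_readable_from_buffers(XLogRecPtr start, int npages)
{
    int nread = 0;

    if (start < oldest_lsn || start > newest_lsn)
        return 0;               /* first page missing: total miss */

    for (int i = 0; i < npages; i++)
    {
        if (start + (XLogRecPtr) i * PAGE_SZ > newest_lsn)
            break;              /* past InitializedUpTo: partial hit */
        nread++;
    }
    return nread;
}
```

With the example's numbers this yields 0 pages for a read starting at
LSN 80, all 6 for a read starting at LSN 90, and 4 for a read starting
at LSN 150 - the only surviving partial-hit case.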

> > Yes, I proposed that idea in another thread -
> > https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> > .
> > If that looks okay, I can add it to the next version of this patch
> > set.
>
> The code in the email above still shows a call to:
>
>   /*
>    * Requested WAL is available in WAL buffers, so recheck the existence
>    * under the WALBufMappingLock and read if the page still exists,
>    * otherwise return.
>    */
>   LWLockAcquire(WALBufMappingLock, LW_SHARED);
>
> and I don't think that's required. How about something like:
>
>   endptr1 = XLogCtl->xlblocks[idx];
>   /* Requested WAL isn't available in WAL buffers. */
>   if (expectedEndPtr != endptr1)
>       break;
>
>   pg_read_barrier();
>   ...
>   memcpy(buf, data, bytes_read_this_loop);
>   ...
>   pg_read_barrier();
>   endptr2 = XLogCtl->xlblocks[idx];
>   if (expectedEndPtr != endptr2)
>       break;
>
>   ntotal += bytes_read_this_loop;
>   /* success; move on to next page */
>
> I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
> I suppose because xlblocks are not guaranteed to be 64-bit aligned?
> Should we just align it to 64 bits so we can use atomics? (I don't
> think it matters in this case, but atomics would avoid the need to
> think about it.)

WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
I'm not sure how using the memory barrier, not WALBufMappingLock,
prevents writers from replacing WAL buffer pages while readers are reading
the pages. FWIW, GetXLogBuffer() reads the xlblocks value without the
lock but it confirms the WAL existence under the lock and gets the WAL
buffer page under the lock [3].

I'll reiterate the WALBufMappingLock thing for this patch - the idea
is to know whether or not the WAL at a given LSN exists in WAL buffers
without acquiring WALBufMappingLock; if exists acquire the lock in
shared mode, read from WAL buffers and then release. WAL buffer pages
are organized as a circular array with the InitializedUpTo as the
latest filled WAL buffer page. If there's a way to track the oldest
filled WAL buffer page (OldestInitializedPage), then at any given point
in time the elements of the circular array are sorted in ascending
order from OldestInitializedPage to InitializedUpTo. With this
approach, no lock is required to know if the WAL at given LSN exists
in WAL buffers, we can just do this if lsn >=
XLogCtl->OldestInitializedPage && lsn < XLogCtl->InitializedUpTo. I
proposed this idea here
https://www.postgresql.org/message-id/CALj2ACVgi6LirgLDZh%3DFdfdvGvKAD%3D%3DWTOSWcQy%3DAtNgPDVnKw%40mail.gmail.com.
I've pulled that patch in here as 0001 to showcase its use for this
feature.

>   * Why check that startptr is earlier than the flush pointer, but not
> startptr+count? Also, given that we intend to support reading unflushed
> data, it would be good to comment that the function still works past
> the flush pointer, and that it will be safe to remove later (right?).

That won't work; see the comment below. The actual flushed LSN may not
always be greater than startptr+count. The GetFlushRecPtr() check in
XLogReadFromBuffers() is similar to what pg_walinspect has in
GetCurrentLSN().

    /*
     * Even though we just determined how much of the page can be validly read
     * as 'count', read the whole page anyway. It's guaranteed to be
     * zero-padded up to the page boundary if it's incomplete.
     */
    if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
                 &errinfo))

>   * An "Assert(!RecoveryInProgress())" would be more appropriate than
> an error. Perhaps we will remove even that check in the future to
> achieve cascaded replication of unflushed data.
>
> In this case, I think some of those errors are not really necessary
> anyway, though:
>
>   * XLogReadFromBuffers shouldn't take a timeline argument just to
> demand that it's always equal to the wal insertion timeline.

I've changed XLogReadFromBuffers() to return as if nothing was read
(a cache miss) when the server is in recovery or the requested TLI is
not the current server's insertion TLI. This is better than failing
with ERRORs, so the callers don't need any checks for recovery or
TLI.

PSA v14 patch set.

[1]
 * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.

[2]
     * and xlblocks values certainly do.  xlblocks values are protected by
     * WALBufMappingLock.
     */
    char       *pages;            /* buffers for unwritten XLOG pages */
    XLogRecPtr *xlblocks;        /* 1st byte ptr-s + XLOG_BLCKSZ */

[3]
     * However, we don't hold a lock while we read the value. If someone has
     * just initialized the page, it's possible that we get a "torn read" of
     * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
     * that case we will see a bogus value. That's ok, we'll grab the mapping
     * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
     * the page we're looking for. But it means that when we do this unlocked
     * read, we might see a value that appears to be ahead of the page we're
     * looking for. Don't PANIC on that, until we've verified the value while
     * holding the lock.
     */
    expectedEndPtr = ptr;
    expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;

    endptr = XLogCtl->xlblocks[idx];

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > I suppose the question is: should reading from the WAL buffers be an
> > intentional thing that specific callers do explicitly?
> > Or is it an optimization that should be hidden from the caller?
> >
> > I tend toward the former, at least for now.
>
> Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

I don't think this is the most important point right now though, let's
sort out the other details.

> >
> At any given point of time, WAL buffer pages are maintained as a
> circularly sorted array in an ascending order from
> OldestInitializedPage to InitializedUpTo (new pages are inserted at
> this end).

I don't see any reference to OldestInitializedPage or anything like it,
with or without your patch. Am I missing something?

> - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> as the page at LSN 80 doesn't exist, even though the other 5 pages
> starting from LSN 90 do exist.

This is what I imagined a "partial hit" was: read the 5 pages starting
at 90. The caller would then need to figure out how to read the page at
LSN 80 from the segment files.

I am not saying we should support this case; perhaps it doesn't matter.
I'm just describing why that term was confusing for me.

> If a caller asks for an unflushed WAL read
> intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> pages for the caller to deal with. This is the partial hit that can
> happen.

To me that's more like an EOF case. "Partial hit" sounds to me like the
data exists but is not available in the cache (i.e. go to the segment
files); whereas if it encountered the end, the data is not available at
all.

> >
> WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> I'm not sure how using the memory barrier, not WALBufMappingLock,
> prevents writers from replacing WAL buffer pages while readers are
> reading the pages.

It doesn't *prevent* that case, but it does *detect* that case. We
don't want to prevent writers from replacing WAL buffers, because that
would mean we are slowing down the critical WAL writing path.

Let me explain the potential problem cases, and how the barriers
prevent them:

Potential problem 1: the page is not yet resident in the cache at the
time the memcpy begins. In this case, the first read barrier would
ensure that the page is also not yet resident at the time xlblocks[idx]
is read into endptr1, and we'd break out of the loop.

Potential problem 2: the page is evicted before the memcpy finishes. In
this case, the second read barrier would ensure that the page was also
evicted before xlblocks[idx] is read into endptr2, and again we'd
detect the problem and break out of the loop.

I assume here that, if xlblocks[idx] holds the endPtr of the desired
page, all of the bytes for that page are resident at that moment. I
don't think that's true right now: AdvanceXLInsertBuffer() zeroes the
old page before updating xlblocks[nextidx]. I think it needs something
like:

  pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
  pg_write_barrier();

before the MemSet.
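
The write-side ordering plus the two read barriers can be sketched together as a small standalone model. This is only an illustration of the protocol under discussion, not PostgreSQL code: C11 fences stand in for pg_write_barrier()/pg_read_barrier(), and WalSlot, replace_page() and read_page() are made-up names modeling a single xlblocks slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BLCKSZ 8192

/* One model WAL buffer slot: its end-LSN (as in xlblocks[idx]) plus bytes. */
typedef struct WalSlot
{
    _Atomic uint64_t endptr;        /* 0 plays the role of InvalidXLogRecPtr */
    char        page[MODEL_BLCKSZ];
} WalSlot;

/*
 * Writer side (as in AdvanceXLInsertBuffer): invalidate endptr and issue a
 * write barrier *before* clobbering the page, then issue another write
 * barrier before publishing the new endptr.
 */
static void
replace_page(WalSlot *slot, uint64_t new_endptr, const char *src)
{
    atomic_store_explicit(&slot->endptr, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* pg_write_barrier() */

    memcpy(slot->page, src, MODEL_BLCKSZ);      /* MemSet + init in PG */

    atomic_thread_fence(memory_order_release);  /* pg_write_barrier() */
    atomic_store_explicit(&slot->endptr, new_endptr, memory_order_relaxed);
}

/*
 * Reader side (as proposed for XLogReadFromBuffers): read endptr, barrier,
 * copy, barrier, re-read endptr.  Problem 1 (page not yet resident) is
 * caught by the first check; problem 2 (page evicted mid-copy) by the
 * second.  Either way we report a miss and let the caller fall back to
 * reading the WAL file.
 */
static bool
read_page(WalSlot *slot, uint64_t expected_endptr, char *dst)
{
    if (atomic_load_explicit(&slot->endptr, memory_order_relaxed) !=
        expected_endptr)
        return false;
    atomic_thread_fence(memory_order_acquire); /* pg_read_barrier() */

    memcpy(dst, slot->page, MODEL_BLCKSZ);

    atomic_thread_fence(memory_order_acquire); /* pg_read_barrier() */
    if (atomic_load_explicit(&slot->endptr, memory_order_relaxed) !=
        expected_endptr)
        return false;
    return true;
}
```

With this ordering, whenever both endptr reads in read_page() agree with the expected value, the bytes copied in between must belong to the expected page.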

I didn't review your latest v14 patch yet.

Regards,
    Jeff Davis







Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > > I suppose the question is: should reading from the WAL buffers be an
> > > intentional thing that specific callers do explicitly?
> > > Or is it an optimization that should be hidden from the caller?
> > >
> > > I tend toward the former, at least for now.
> >
> > Yes, it's an optimization that must be hidden from the caller.
>
> As I said, I tend toward the opposite: that specific callers should
> read from the buffers explicitly in the cases where it makes sense.

How about adding a bool flag (read_from_wal_buffers) to
XLogReaderState, so that callers that want this facility can set it
via XLogReaderAllocate()?

> > At any given point of time, WAL buffer pages are maintained as a
> > circularly sorted array in an ascending order from
> > OldestInitializedPage to InitializedUpTo (new pages are inserted at
> > this end).
>
> I don't see any reference to OldestInitializedPage or anything like it,
> with or without your patch. Am I missing something?

OldestInitializedPage is introduced in v14-0001 patch. Please have a look.

> > - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> > as the page at LSN 80 doesn't exist, even though the other 5 pages
> > starting from LSN 90 do exist.
>
> This is what I imagined a "partial hit" was: read the 5 pages starting
> at 90. The caller would then need to figure out how to read the page at
> LSN 80 from the segment files.
>
> I am not saying we should support this case; perhaps it doesn't matter.
> I'm just describing why that term was confusing for me.

Okay. Current patch doesn't support this case.

> > If a caller asks for an unflushed WAL read
> > intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> > pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> > pages for the caller to deal with. This is the partial hit that can
> > happen.
>
> To me that's more like an EOF case. "Partial hit" sounds to me like the
> data exists but is not available in the cache (i.e. go to the segment
> files); whereas if it encountered the end, the data is not available at
> all.

Right. We can tweak the comments around "partial hit" if required.

> > WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> > I'm not sure how using the memory barrier, not WALBufMappingLock,
> > prevents writers from replacing WAL buffer pages while readers are
> > reading the pages.
>
> It doesn't *prevent* that case, but it does *detect* that case. We
> don't want to prevent writers from replacing WAL buffers, because that
> would mean we are slowing down the critical WAL writing path.
>
> Let me explain the potential problem cases, and how the barriers
> prevent them:
>
> Potential problem 1: the page is not yet resident in the cache at the
> time the memcpy begins. In this case, the first read barrier would
> ensure that the page is also not yet resident at the time xlblocks[idx]
> is read into endptr1, and we'd break out of the loop.
>
> Potential problem 2: the page is evicted before the memcpy finishes. In
> this case, the second read barrier would ensure that the page was also
> evicted before xlblocks[idx] is read into endptr2, and again we'd
> detect the problem and break out of the loop.

Understood.

> I assume here that, if xlblocks[idx] holds the endPtr of the desired
> page, all of the bytes for that page are resident at that moment. I
> don't think that's true right now: AdvanceXLInsertBuffer() zeroes the
> old page before updating xlblocks[nextidx].

Right.

> I think it needs something like:
>
>   pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
>   pg_write_barrier();
>
> before the MemSet.

I think that works. First, xlblocks needs to be turned into an array of
64-bit atomics, and then the above change can go in. With this, everyone
who reads xlblocks, with or without WALBufMappingLock, also needs to check
whether xlblocks[idx] is InvalidXLogRecPtr and take appropriate action.

I'm sure you have seen the following. It looks like I'm leaning
towards the claim that it's safe to read xlblocks without
WALBufMappingLock. I'll put up a patch for these changes separately.

        /*
         * Make sure the initialization of the page becomes visible to others
         * before the xlblocks update. GetXLogBuffer() reads xlblocks without
         * holding a lock.
         */
        pg_write_barrier();

        *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

I think the 3 things that help reading from WAL buffers without
WALBufMappingLock are: 1) a couple of read barriers in
XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?

+         * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+         * need to ensure that we are not reading a page that just got
+         * initialized. For this, we look at the needed page header.
+         */
+        phdr = (XLogPageHeader) page;
+
+        /* Return, if WAL buffer page doesn't look valid. */
+        if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+              phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+              phdr->xlp_tli == tli))
+            break;
+

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Fri, 2023-11-03 at 20:23 +0530, Bharath Rupireddy wrote:
> >
> OldestInitializedPage is introduced in v14-0001 patch. Please have a
> look.

I don't see why that's necessary if we move to the algorithm I
suggested below that doesn't require a lock.

> >
> Okay. Current patch doesn't support this [partial hit of newer pages]
> case.

OK, no need to support it until you see a reason.
> >

> >
> > I think it needs something like:
> >
> >   pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx],
> > InvalidXLogRecPtr);
> >   pg_write_barrier();
> >
> > before the MemSet.
>
> I think it works. First, xlblocks needs to be turned to an array of
> 64-bit atomics and then the above change.

Does anyone see a reason we shouldn't move to atomics here?

>
>         pg_write_barrier();
>
>         *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) =
> NewPageEndPtr;

I am confused why the "volatile" is required on that line (not from
your patch). I sent a separate message about that:

https://www.postgresql.org/message-id/784f72ac09061fe5eaa5335cc347340c367c73ac.camel@j-davis.com

> I think the 3 things that help reading from WAL buffers without
> WALBufMappingLock are: 1) a couple of read barriers in
> XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
> InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
> the following sanity check to see if the read page is valid in
> XLogReadFromBuffers(). If it sounds sensible, I'll work towards
> coding
> it up. Thoughts?

I like it. I think it will ultimately be a fairly simple loop. And by
moving to atomics, we won't need the delicate comment in
GetXLogBuffer().


Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Andres Freund
Date:
Hi,

On 2023-11-03 20:23:30 +0530, Bharath Rupireddy wrote:
> On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > > > I suppose the question is: should reading from the WAL buffers be an
> > > > intentional thing that specific callers do explicitly?
> > > > Or is it an optimization that should be hidden from the caller?
> > > >
> > > > I tend toward the former, at least for now.
> > >
> > > Yes, it's an optimization that must be hidden from the caller.
> >
> > As I said, I tend toward the opposite: that specific callers should
> > read from the buffers explicitly in the cases where it makes sense.
> 
> How about adding a bool flag (read_from_wal_buffers) to
> XLogReaderState so that the callers can set it if they want this
> facility via XLogReaderAllocate()?

That seems wrong architecturally - why should xlogreader itself know about any
of this? What would it mean in frontend code if read_from_wal_buffers were
set? IMO this is something that should happen purely within the read function.

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Andres Freund
Date:
Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:10:51 +0000
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
>  src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
>  src/include/access/xlog.h         |   1 +
>  2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
>      XLogRecPtr *xlblocks;        /* 1st byte ptr-s + XLOG_BLCKSZ */
>      int            XLogCacheBlck;    /* highest allocated xlog buffer index */
>
> +    /*
> +     * Start address of oldest initialized page in XLog buffers.
> +     *
> +     * We mainly track oldest initialized page explicitly to quickly tell if a
> +     * given WAL record is available in XLog buffers. It also can be used for
> +     * other purposes, see notes below.
> +     *
> +     * OldestInitializedPage gives XLog buffers following properties:
> +     *
> +     * 1) At any given point of time, pages in XLog buffers array are sorted
> +     * in an ascending order from OldestInitializedPage till InitializedUpTo.
> +     * Note that we verify this property for assert-only builds, see
> +     * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).
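
To make the wraparound concrete, here is a toy model of the index mapping. The 4-buffer cache size and the lsn_to_buf_idx() name are illustrative stand-ins for XLogCtl's buffer count and XLogRecPtrToBufIdx(): while the LSNs from OldestInitializedPage to InitializedUpTo ascend, the xlblocks array itself is a rotated sequence once the range wraps.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BLCKSZ 8192
#define MODEL_NBUFS  4       /* tiny buffer cache, for illustration only */

/* Rough stand-in for XLogRecPtrToBufIdx(): page number modulo buffer count. */
static int
lsn_to_buf_idx(uint64_t lsn)
{
    return (int) ((lsn / MODEL_BLCKSZ) % MODEL_NBUFS);
}
```

With pages 2 through 5 resident, the array holds pages 4, 5, 2, 3 in index order: sorted only when walked circularly starting from OldestInitializedPage's slot.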


> +     * 2) OldestInitializedPage is monotonically increasing (by virtue of how
> +     * postgres generates WAL records), that is, its value never decreases.
> +     * This property lets someone read its value without a lock. There's no
> +     * problem even if its value is slightly stale i.e. concurrently being
> +     * updated. One can still use it for finding if a given WAL record is
> +     * available in XLog buffers. At worst, one might get false positives
> +     * (i.e. OldestInitializedPage may tell that the WAL record is available
> +     * in XLog buffers, but when one actually looks at it, it isn't really
> +     * available). This is more efficient and performant than acquiring a lock
> +     * for reading. Note that we may not need a lock to read
> +     * OldestInitializedPage but we need to update it holding
> +     * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively, for reading no lock is required.


> +     *
> +     * 3) One can start traversing XLog buffers from OldestInitializedPage
> +     * till InitializedUpTo to list out all valid WAL records and stats, and
> +     * expose them via SQL-callable functions to users.
> +     *
> +     * 4) XLog buffers array is inherently organized as a circular, sorted and
> +     * rotated array with OldestInitializedPage as pivot with the property
> +     * where LSN of previous buffer page (if valid) is greater than
> +     * OldestInitializedPage and LSN of next buffer page (if valid) is greater
> +     * than OldestInitializedPage.
> +     */
> +    XLogRecPtr    OldestInitializedPage;

It seems a bit odd to name an LSN-containing variable *Page...


>      /*
>       * InsertTimeLineID is the timeline into which new WAL is being inserted
>       * and flushed. It is zero during recovery, and does not change once set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
>  #define NextBufIdx(idx)        \
>          (((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx)        \
> +        (((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inlines and adding assertions that idx
is not bigger than XLogCtl->XLogCacheBlck?
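
A sketch of what the inline form with assertions might look like. This is a standalone model: xlog_cache_blck stands in for XLogCtl->XLogCacheBlck, and the lower-case function names are illustrative replacements for the NextBufIdx()/PreviousBufIdx() macros.

```c
#include <assert.h>

/* Stand-in for XLogCtl->XLogCacheBlck, the highest allocated buffer index. */
static int xlog_cache_blck = 7;

/* Inline replacement for NextBufIdx(), with the suggested range check. */
static inline int
next_buf_idx(int idx)
{
    assert(idx >= 0 && idx <= xlog_cache_blck);
    return (idx == xlog_cache_blck) ? 0 : idx + 1;
}

/* Inline replacement for PreviousBufIdx(), likewise asserting the range. */
static inline int
previous_buf_idx(int idx)
{
    assert(idx >= 0 && idx <= xlog_cache_blck);
    return (idx == 0) ? xlog_cache_blck : idx - 1;
}
```

Unlike the macros, the inlines evaluate their argument once and trap out-of-range indexes in assert-enabled builds.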


>  /*
>   * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
>   * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
>  static void WALInsertLockRelease(void);
>  static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
>  /*
>   * Insert an XLOG record represented by an already-constructed chain of data
>   * chunks.  This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
>          XLogCtl->InitializedUpTo = NewPageEndPtr;
>
>          npages++;
> +
> +        /*
> +         * Try updating oldest initialized XLog buffer page.
> +         *
> +         * Update it if we are initializing an XLog buffer page for the first
> +         * time or if XLog buffers are full and we are wrapping around.
> +         */
> +        if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
> +            XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
> +        {
> +            Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
> +
> +            XLogCtl->OldestInitializedPage = NewPageBeginPtr;
> +        }

Wait, isn't this too late?  At this point the buffer can already be used by
GetXLogBuffer().  I think this needs to happen at the latest just before
        *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;


Why is it legal to get here with XLogCtl->OldestInitializedPage being invalid?


> +
> +/*
> + * Returns whether or not a given WAL record is available in XLog buffers.
> + *
> + * Note that we don't read OldestInitializedPage under a lock, see description
> + * near its definition in xlog.c for more details.
> + *
> + * Note that caller needs to pass in an LSN known to the server, not a future
> + * or unwritten or unflushed LSN.
> + */
> +bool
> +IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
> +{
> +    if (!XLogRecPtrIsInvalid(lsn) &&
> +        !XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
> +        lsn >= XLogCtl->OldestInitializedPage &&
> +        lsn < XLogCtl->InitializedUpTo)
> +    {
> +        return true;
> +    }
> +
> +    return false;
> +}
> diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
> index a14126d164..35235010e6 100644
> --- a/src/include/access/xlog.h
> +++ b/src/include/access/xlog.h
> @@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
>  extern void SetInstallXLogFileSegmentActive(void);
>  extern bool IsInstallXLogFileSegmentActive(void);
>  extern void XLogShutdownWalRcv(void);
> +extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
>
>  /*
>   * Routines to start, stop, and get status of a base backup.
> --
> 2.34.1
>

> From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 2 Nov 2023 15:36:11 +0000
> Subject: [PATCH v14] Allow WAL reading from WAL buffers
>
> This commit adds WALRead() the capability to read WAL from WAL
> buffers when possible. When requested WAL isn't available in WAL
> buffers, the WAL is read from the WAL file as usual. It relies on
> WALBufMappingLock so that no one replaces the WAL buffer page that
> we're reading from. It skips reading from WAL buffers if
> WALBufMappingLock can't be acquired immediately. In other words,
> it doesn't wait for WALBufMappingLock to be available. This helps
> reduce the contention on WALBufMappingLock.
>
> This commit benefits the callers of WALRead(), which are the walsenders
> and pg_walinspect. They can now avoid reading WAL from the WAL
> file (possibly avoiding disk IO). Tests show that the WAL buffers
> hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
> standby, with pgbench --scale=300 --client=32 --time=900. In other
> words, the walsenders avoided 95% of the time reading from the
> file/avoided pread system calls:
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
>
> This commit also benefits when direct IO is enabled for WAL.
> Reading WAL from WAL buffers puts back the performance close to
> that of without direct IO for WAL:
> https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
>
> This commit also paves the way for the following features in
> future:
> - Improves synchronous replication performance by replicating
> directly from WAL buffers.
> - An opt-in way for the walreceivers to receive unflushed WAL.
> More details here:
> https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de
>
> Author: Bharath Rupireddy
> Reviewed-by: Dilip Kumar, Andres Freund
> Reviewed-by: Nathan Bossart, Kuntal Ghosh
> Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> ---
>  src/backend/access/transam/xlog.c       | 205 ++++++++++++++++++++++++
>  src/backend/access/transam/xlogreader.c |  41 ++++-
>  src/include/access/xlog.h               |   6 +
>  3 files changed, 250 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index fdf2ef310b..ff5dccaaa7 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
>      return cachedPos + ptr % XLOG_BLCKSZ;
>  }

>

> +/*
> + * Read WAL from WAL buffers.
> + *
> + * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
> + * 'startptr', on timeline 'tli' and return total read bytes.
> + *
> + * This function returns quickly in the following cases:
> + * - When passed-in timeline is different than server's current insertion
> + * timeline as WAL is always inserted into WAL buffers on insertion timeline.
> + * - When server is in recovery as WAL buffers aren't currently used in
> + * recovery.
> + *
> + * Note that this function reads as much as it can from WAL buffers, meaning,
> + * it may not read all the requested 'count' bytes. Caller must be aware of
> + * this and deal with it.
> + *
> + * Note that this function is not available for frontend code as WAL buffers is
> + * an internal mechanism to the server.
>
> + */
> +Size
> +XLogReadFromBuffers(XLogReaderState *state,
> +                    XLogRecPtr startptr,
> +                    TimeLineID tli,
> +                    Size count,
> +                    char *buf)
> +{
> +    XLogRecPtr    ptr;
> +    XLogRecPtr    cur_lsn;
> +    Size        nbytes;
> +    Size        ntotal;
> +    Size        nbatch;
> +    char       *batchstart;
> +
> +    if (RecoveryInProgress())
> +        return 0;
>
> +    if (tli != GetWALInsertionTimeLine())
> +        return 0;
> +
> +    Assert(!XLogRecPtrIsInvalid(startptr));
> +
> +    cur_lsn = GetFlushRecPtr(NULL);
> +    if (unlikely(startptr > cur_lsn))
> +        elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> +             LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.



> +    /*
> +     * Holding WALBufMappingLock ensures inserters don't overwrite this value
> +     * while we are reading it. We try to acquire it in shared mode so that
> +     * the concurrent WAL readers are also allowed. We try to do as little work
> +     * as possible while holding the lock to not impact concurrent WAL writers
> +     * much. We quickly exit to not cause any contention, if the lock isn't
> +     * immediately available.
> +     */
> +    if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> +        return 0;

That seems problematic - that lock is often heavily contended.  We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we read bogus data, but that's a bit of a wasted effort.
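
That check-copy-recheck shape can be sketched as a standalone model. The names are illustrative: range_in_buffers() plays the role of IsWALRecordAvailableInXLogBuffers() (which checks the [OldestInitializedPage, InitializedUpTo) range), and the atomic globals model the corresponding XLogCtl fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Model counterparts of XLogCtl->OldestInitializedPage/InitializedUpTo. */
static _Atomic uint64_t oldest_initialized;
static _Atomic uint64_t initialized_upto;

/* Is the whole requested byte range still resident in the buffer cache? */
static bool
range_in_buffers(uint64_t start, uint64_t end)
{
    uint64_t oldest = atomic_load(&oldest_initialized);
    uint64_t upto = atomic_load(&initialized_upto);

    return start >= oldest && end <= upto;
}

/*
 * Check availability, copy with no lock held, then re-check: if a writer
 * recycled one of the pages in the interim, the copy is bogus and we
 * discard it, at the cost of a little wasted effort.
 */
static bool
read_without_lock(uint64_t start, uint64_t end /* , char *dst */)
{
    if (!range_in_buffers(start, end))
        return false;               /* sure miss: read the WAL file instead */

    /* memcpy() the requested pages here, without WALBufMappingLock. */

    atomic_thread_fence(memory_order_acquire); /* pg_read_barrier() */
    return range_in_buffers(start, end);       /* false: discard the copy */
}
```

The critical WAL-writing path is never blocked; the reader only risks redoing a read from the file when it loses the race.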


> +    ptr = startptr;
> +    nbytes = count;                /* Total bytes requested to be read by caller. */
> +    ntotal = 0;                    /* Total bytes read. */
> +    nbatch = 0;                    /* Bytes to be read in single batch. */
> +    batchstart = NULL;            /* Location to read from for single batch. */

What does "batch" mean?


> +    while (nbytes > 0)
> +    {
> +        XLogRecPtr    expectedEndPtr;
> +        XLogRecPtr    endptr;
> +        int            idx;
> +        char       *page;
> +        char       *data;
> +        XLogPageHeader phdr;
> +
> +        idx = XLogRecPtrToBufIdx(ptr);
> +        expectedEndPtr = ptr;
> +        expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
> +        endptr = XLogCtl->xlblocks[idx];
> +
> +        /* Requested WAL isn't available in WAL buffers. */
> +        if (expectedEndPtr != endptr)
> +            break;
> +
> +        /*
> +         * We found WAL buffer page containing given XLogRecPtr. Get starting
> +         * address of the page and a pointer to the right location of given
> +         * XLogRecPtr in that page.
> +         */
> +        page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
> +        data = page + ptr % XLOG_BLCKSZ;
> +
> +        /*
> +         * The fact that we acquire WALBufMappingLock while reading the WAL
> +         * buffer page itself guarantees that no one else initializes it or
> +         * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> +         * need to ensure that we are not reading a page that just got
> +         * initialized. For this, we look at the needed page header.
> +         */
> +        phdr = (XLogPageHeader) page;
> +
> +        /* Return, if WAL buffer page doesn't look valid. */
> +        if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +              phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +              phdr->xlp_tli == tli))
> +            break;

I don't think this code should ever encounter a page where this is not the
case?  We particularly shouldn't do so silently, seems that could hide all
kinds of problems.



> +        /*
> +         * Note that we don't perform all page header checks here to avoid
> +         * extra work in production builds; callers will anyway do those
> +         * checks extensively. However, in an assert-enabled build, we perform
> +         * all the checks here and raise an error if failed.
> +         */

Why?


> +        /* Count what is wanted, not the whole page. */
> +        if ((data + nbytes) <= (page + XLOG_BLCKSZ))
> +        {
> +            /* All the bytes are in one page. */
> +            nbatch += nbytes;
> +            ntotal += nbytes;
> +            nbytes = 0;
> +        }
> +        else
> +        {
> +            Size        navailable;
> +
> +            /*
> +             * All the bytes are not in one page. Deduce available bytes on
> +             * the current page, count them and continue to look for remaining
> +             * bytes.
> +             */

s/deduce/deduct/? Perhaps better subtract?


Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sat, Nov 4, 2023 at 1:17 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> > > I think it needs something like:
> > >
> > >   pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx],
> > > InvalidXLogRecPtr);
> > >   pg_write_barrier();
> > >
> > > before the MemSet.
> >
> > I think it works. First, xlblocks needs to be turned to an array of
> > 64-bit atomics and then the above change.
>
> Does anyone see a reason we shouldn't move to atomics here?
>
> >
> >         pg_write_barrier();
> >
> >         *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) =
> > NewPageEndPtr;
>
> I am confused why the "volatile" is required on that line (not from
> your patch). I sent a separate message about that:
>
> https://www.postgresql.org/message-id/784f72ac09061fe5eaa5335cc347340c367c73ac.camel@j-davis.com
>
> > I think the 3 things that help reading from WAL buffers without
> > WALBufMappingLock are: 1) a couple of read barriers in
> > XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
> > InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
> > the following sanity check to see if the read page is valid in
> > XLogReadFromBuffers(). If it sounds sensible, I'll work towards
> > coding
> > it up. Thoughts?
>
> I like it. I think it will ultimately be a fairly simple loop. And by
> moving to atomics, we won't need the delicate comment in
> GetXLogBuffer().

I'm attaching the v15 patch set implementing the above ideas. Please
have a look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:
> +        XLogRecPtr    EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
> +
> +        /*
> +         * xlblocks value can be InvalidXLogRecPtr before the new WAL buffer
> +         * page gets initialized in AdvanceXLInsertBuffer. In such a case
> +         * re-read the xlblocks value under the lock to ensure the correct
> +         * value is read.
> +         */
> +        if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
> +        {
> +            LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
> +            EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
> +            LWLockRelease(WALBufMappingLock);
> +        }
> +
> +        Assert(!XLogRecPtrIsInvalid(EndPtr));

Can that really happen? If the EndPtr is invalid, that means the page
is in the process of being cleared, so the contents of the page are
undefined at that time, right?

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sun, Nov 5, 2023 at 2:57 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:
> > +             XLogRecPtr      EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
> > +
> > +             /*
> > +              * xlblocks value can be InvalidXLogRecPtr before the new WAL buffer
> > +              * page gets initialized in AdvanceXLInsertBuffer. In such a case
> > +              * re-read the xlblocks value under the lock to ensure the correct
> > +              * value is read.
> > +              */
> > +             if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
> > +             {
> > +                     LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
> > +                     EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
> > +                     LWLockRelease(WALBufMappingLock);
> > +             }
> > +
> > +             Assert(!XLogRecPtrIsInvalid(EndPtr));
>
> Can that really happen? If the EndPtr is invalid, that means the page
> is in the process of being cleared, so the contents of the page are
> undefined at that time, right?

My initial thoughts were this way - xlblocks is being read without
holding WALBufMappingLock in XLogWrite() and since we write
InvalidXLogRecPtr to xlblocks array elements temporarily before
MemSet-ting the page in AdvanceXLInsertBuffer(), the PANIC "xlog write
request %X/%X is past end of log %X/%X" might get hit if EndPtr read
from xlblocks is InvalidXLogRecPtr. FWIW, an Assert(false); within the
if (unlikely(XLogRecPtrIsInvalid(EndPtr))) block didn't hit in make
check-world.

It looks like my above understanding isn't correct because it can
never happen that the page that's being written to the WAL file gets
initialized in AdvanceXLInsertBuffer(). I'll remove this piece of code
in the next version of the patch unless there are any other thoughts.

[1]
    /*
     * Within the loop, curridx is the cache block index of the page to
     * consider writing.  Begin at the buffer containing the next unwritten
     * page, or last partially written page.
     */
    curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);

    while (LogwrtResult.Write < WriteRqst.Write)
    {
        /*
         * Make sure we're not ahead of the insert process.  This could happen
         * if we're passed a bogus WriteRqst.Write that is past the end of the
         * last page that's been initialized by AdvanceXLInsertBuffer.
         */
        XLogRecPtr  EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sat, Nov 4, 2023 at 6:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> > +     cur_lsn = GetFlushRecPtr(NULL);
> > +     if (unlikely(startptr > cur_lsn))
> > +             elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> > +                      LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
>
> Hm, why does this check belong here? For some tools it might be legitimate to
> read the WAL before it was fully flushed.

Agreed and removed the check.

> > +     /*
> > +      * Holding WALBufMappingLock ensures inserters don't overwrite this value
> > +      * while we are reading it. We try to acquire it in shared mode so that
> > +      * the concurrent WAL readers are also allowed. We try to do as little work
> > +      * as possible while holding the lock to not impact concurrent WAL writers
> > +      * much. We quickly exit to not cause any contention, if the lock isn't
> > +      * immediately available.
> > +      */
> > +     if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> > +             return 0;
>
> That seems problematic - that lock is often heavily contended.  We could
> instead check IsWALRecordAvailableInXLogBuffers() once before reading the
> page, then read the page contents *without* holding a lock, and then check
> IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
> interim we read bogus data, but that's a bit of a wasted effort.

In the new approach described upthread here
https://www.postgresql.org/message-id/c3455ab9da42e09ca9d059879b5c512b2d1f9681.camel%40j-davis.com,
there's no lock required for reading from WAL buffers. PSA patches for
more details.
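
Roughly, the lockless read protocol looks like this (a sketch only,
using C11 atomics in place of PostgreSQL's pg_atomic API and
pg_read_barrier(); all names here are illustrative, not the patch's
actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Illustrative stand-ins for one slot of XLogCtl's state. */
static _Atomic uint64_t xlblocks_entry;  /* end LSN of the cached page */
static char wal_buffer_page[BLCKSZ];     /* the cached page contents */

/*
 * Optimistic, lockless read of one WAL buffer page: check the mapping,
 * copy the page, then re-check the mapping.  If the mapping changed,
 * the page was concurrently replaced and the copy must be discarded.
 */
static bool
read_page_from_buffers(uint64_t expected_end_ptr, char *dst)
{
    if (atomic_load_explicit(&xlblocks_entry, memory_order_acquire) !=
        expected_end_ptr)
        return false;            /* page isn't (or is no longer) cached */

    memcpy(dst, wal_buffer_page, BLCKSZ);

    /* read barrier: don't let the re-check below pass the copy above */
    atomic_thread_fence(memory_order_acquire);

    if (atomic_load_explicit(&xlblocks_entry, memory_order_relaxed) !=
        expected_end_ptr)
        return false;            /* page was replaced while copying */

    return true;
}
```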

> > +             /*
> > +              * The fact that we acquire WALBufMappingLock while reading the WAL
> > +              * buffer page itself guarantees that no one else initializes it or
> > +              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> > +              * need to ensure that we are not reading a page that just got
> > +              * initialized. For this, we look at the needed page header.
> > +              */
> > +             phdr = (XLogPageHeader) page;
> > +
> > +             /* Return, if WAL buffer page doesn't look valid. */
> > +             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > +                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > +                       phdr->xlp_tli == tli))
> > +                     break;
>
> I don't think this code should ever encounter a page where this is not the
> case?  We particularly shouldn't do so silently, seems that could hide all
> kinds of problems.

I think it's possible to read a "just got initialized" page with the
new approach to read WAL buffer pages without WALBufMappingLock if the
page is read right after it is initialized and xlblocks is filled in
AdvanceXLInsertBuffer() but before actual WAL is written.

> > +             /*
> > +              * Note that we don't perform all page header checks here to avoid
> > +              * extra work in production builds; callers will anyway do those
> > +              * checks extensively. However, in an assert-enabled build, we perform
> > +              * all the checks here and raise an error if failed.
> > +              */
>
> Why?

Minimal page header checks are performed to ensure we don't read a
page that just got initialized, unlike the full checks that
XLogReaderValidatePageHeader() does. Are you suggesting removing the
page header checks with XLogReaderValidatePageHeader() for
assert-enabled builds? Or are you suggesting doing page header checks
with XLogReaderValidatePageHeader() for production builds too?

PSA v16 patch set. Note that the 0004 patch adds support for WAL read
stats (both from WAL file and WAL buffers) to walsenders and may not
necessarily be the best approach to capture WAL read stats in light of
https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com
which adds WAL read/write/fsync stats to pg_stat_io.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Andres Freund
Date:
Hi,

On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:
> > > +             /*
> > > +              * The fact that we acquire WALBufMappingLock while reading the WAL
> > > +              * buffer page itself guarantees that no one else initializes it or
> > > +              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> > > +              * need to ensure that we are not reading a page that just got
> > > +              * initialized. For this, we look at the needed page header.
> > > +              */
> > > +             phdr = (XLogPageHeader) page;
> > > +
> > > +             /* Return, if WAL buffer page doesn't look valid. */
> > > +             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > > +                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > > +                       phdr->xlp_tli == tli))
> > > +                     break;
> >
> > I don't think this code should ever encounter a page where this is not the
> > case?  We particularly shouldn't do so silently, seems that could hide all
> > kinds of problems.
> 
> I think it's possible to read a "just got initialized" page with the
> new approach to read WAL buffer pages without WALBufMappingLock if the
> page is read right after it is initialized and xlblocks is filled in
> AdvanceXLInsertBuffer() but before actual WAL is written.

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
  XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.
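
In other words, something like the following rough sketch (names here
are hypothetical; the real positions come from XLogCtl):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Clamp a WAL read to the current insert position, and report whether
 * the caller must first wait for in-flight insertions (i.e. call
 * WaitXLogInsertionsToFinish()) before copying the clamped range.
 */
static uint64_t
clamp_readable_upto(uint64_t read_upto, uint64_t write_pos,
                    uint64_t insert_pos, bool *must_wait)
{
    /* WAL past the current insert position does not exist yet. */
    if (read_upto > insert_pos)
        read_upto = insert_pos;

    /* Between write_pos and insert_pos, insertions may still be in flight. */
    *must_wait = (read_upto > write_pos);

    return read_upto;
}
```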

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Nov 10, 2023 at 2:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:
> > > > +             /*
> > > > +              * The fact that we acquire WALBufMappingLock while reading the WAL
> > > > +              * buffer page itself guarantees that no one else initializes it or
> > > > +              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> > > > +              * need to ensure that we are not reading a page that just got
> > > > +              * initialized. For this, we look at the needed page header.
> > > > +              */
> > > > +             phdr = (XLogPageHeader) page;
> > > > +
> > > > +             /* Return, if WAL buffer page doesn't look valid. */
> > > > +             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > > > +                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > > > +                       phdr->xlp_tli == tli))
> > > > +                     break;
> > >
> > > I don't think this code should ever encounter a page where this is not the
> > > case?  We particularly shouldn't do so silently, seems that could hide all
> > > kinds of problems.
> >
> > I think it's possible to read a "just got initialized" page with the
> > new approach to read WAL buffer pages without WALBufMappingLock if the
> > page is read right after it is initialized and xlblocks is filled in
> > AdvanceXLInsertBuffer() but before actual WAL is written.
>
> I think the code needs to make sure that *never* happens. That seems unrelated
> to holding or not holding WALBufMappingLock. Even if the page header is
> already valid, I don't think it's ok to just read/parse WAL data that's
> concurrently being modified.
>
> We can never allow WAL being read that's past
>   XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
> as it does not exist.

Agreed. Erroring out in XLogReadFromBuffers() if the passed-in WAL
position is past CurrBytePos is an option. Another, cleaner way is to
just let the caller decide what it needs to do (retry or error out):
fill in an error message in XLogReadFromBuffers() and return as if
nothing was read, or return a special negative error code like
XLogDecodeNextRecord does so that the caller can deal with it.

Also, reading CurrBytePos under the insertpos_lck spinlock can get in
the way of concurrent inserters. A possible fix is to turn both
CurrBytePos and PrevBytePos into 64-bit atomics so that
XLogReadFromBuffers() can read CurrBytePos atomically without any lock,
and leave it to the caller to deal with non-existent WAL reads.

> And if the to-be-read LSN is between
> XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
> we need to call WaitXLogInsertionsToFinish() before copying the data.

I agree we should wait for all in-flight insertions to the pages we're
about to read to finish. But reading XLogCtl->LogwrtRqst.Write requires
either the XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
XLogCtl->LogwrtRqst.Write into a 64-bit atomic and read it without any
lock, and rely on WaitXLogInsertionsToFinish()'s return value, i.e. if
WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
then go read that page from WAL buffers.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Mon, Nov 13, 2023 at 7:02 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Nov 10, 2023 at 2:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I think the code needs to make sure that *never* happens. That seems unrelated
> > to holding or not holding WALBufMappingLock. Even if the page header is
> > already valid, I don't think it's ok to just read/parse WAL data that's
> > concurrently being modified.
> >
> > We can never allow WAL being read that's past
> >   XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
> > as it does not exist.
>
> Agreed. Erroring out in XLogReadFromBuffers() if passed in WAL is past
> the CurrBytePos is an option. Another cleaner way is to just let the
> caller decide what it needs to do (retry or error out) - fill an error
> message in XLogReadFromBuffers() and return as-if nothing was read or
> return a special negative error code like XLogDecodeNextRecord so that
> the caller can deal with it.

In the attached v17 patch, I've ensured that XLogReadFromBuffers
returns early when the caller requests WAL that's past the current
insert position at that moment.

> Also, reading CurrBytePos with insertpos_lck spinlock can come in the
> way of concurrent inserters. A possible way is to turn both
> CurrBytePos and PrevBytePos 64-bit atomics so that
> XLogReadFromBuffers() can read CurrBytePos without any lock atomically
> and leave it to the caller to deal with non-existing WAL reads.
>
> > And if the to-be-read LSN is between
> > XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
> > we need to call WaitXLogInsertionsToFinish() before copying the data.
>
> Agree to wait for all in-flight insertions to the pages we're about to
> read to finish. But, reading XLogCtl->LogwrtRqst.Write requires either
> XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
> XLogCtl->LogwrtRqst.Write a 64-bit atomic and read it without any
> lock, rely on
> WaitXLogInsertionsToFinish()'s return value i.e. if
> WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
> then go read that page from WAL buffers.

In the attached v17 patch, I've ensured that the XLogReadFromBuffers
waits for all in-progress insertions to finish when the caller
requests WAL that's past the current write position and before the
current insert position.

I've also ensured that XLogReadFromBuffers returns special return
codes for various scenarios (when asked to read in recovery, read on a
different TLI, read non-existent WAL, and so on) instead of erroring
out. This gives the caller the flexibility to decide what to do.
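
For illustration, the shape of such an API could be as below; the names
here are hypothetical stand-ins, not necessarily those used in v17:

```c
/*
 * Hypothetical return codes for XLogReadFromBuffers()-style reads,
 * letting the caller choose between retrying, falling back to reading
 * the WAL file, or erroring out.
 */
typedef enum WALBufferReadResult
{
    WALBUF_READ_OK = 0,          /* requested bytes fully read */
    WALBUF_READ_IN_RECOVERY,     /* WAL buffers not used during recovery */
    WALBUF_READ_WRONG_TLI,       /* not the current insertion timeline */
    WALBUF_READ_PAST_END         /* WAL past the current insert position */
} WALBufferReadResult;
```

A caller such as a walsender could then, for example, treat
WALBUF_READ_WRONG_TLI as "fall back to reading the WAL file" while
treating WALBUF_READ_PAST_END as a retryable condition.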

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Jeff Davis
Date:
On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:
> In the attached v17 patch

0001 could impact performance in a few ways:

   * There's one additional write barrier inside
     AdvanceXLInsertBuffer()
   * AdvanceXLInsertBuffer() already holds WALBufMappingLock, so
     the atomic access inside of it is somewhat redundant
   * On some platforms, the XLogCtlData structure size will change

The patch has been out for a while and nobody seems concerned about
those things, and they look fine to me, so I assume these are not real
problems. I just wanted to highlight them.

Also, the description and the comments seem off. The patch does two
things: (a) make it possible to read a page without a lock, which means
we need to mark the page's xlblocks entry with InvalidXLogRecPtr while
it's being initialized; and (b) use 64-bit atomics to make it safer (or
at least more readable).

(a) feels like the most important thing, and it's a hard requirement
for the rest of the work, right?

(b) seems like an implementation choice, and I agree with it on
readability grounds.

Also:

+  * But it means that when we do this
+  * unlocked read, we might see a value that appears to be ahead of the
+  * page we're looking for. Don't PANIC on that, until we've verified the
+  * value while holding the lock.

Is that still true even without a torn read?

The code for 0001 itself looks good. These are minor concerns and I am
inclined to commit something like it fairly soon.

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Dec 8, 2023 at 6:04 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:
> > In the attached v17 patch
>
> The code for 0001 itself looks good. These are minor concerns and I am
> inclined to commit something like it fairly soon.

Thanks. Attaching remaining patches as v18 patch-set after commits
c3a8e2a7cb16 and 766571be1659.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> Thanks. Attaching remaining patches as v18 patch-set after commits
> c3a8e2a7cb16 and 766571be1659.

Comments:

I still think the right thing for this patch is to call
XLogReadFromBuffers() directly from the callers who need it, and not
change WALRead(). I am open to changing this later, but for now that
makes sense to me so that we can clearly identify which callers benefit
and why. I have brought this up a few times before[1][2], so there must
be some reason that I don't understand -- can you explain it?

The XLogReadFromBuffersResult is never used. I can see how it might be
useful for testing or asserts, but it's not used even in the test
module. I don't think we should clutter the API with that kind of thing
-- let's just return the nread.

I also do not like the terminology "partial hit" to be used in this
way. Perhaps "short read" or something about hitting the end of
readable WAL would be better?

I like how the callers of WALRead() are being more precise about the
bytes they are requesting.

You've added several spinlock acquisitions to the loop. Two explicitly,
and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
to read slightly further, but introduce performance risk. Was this
discussed?

The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
seems like there's a risk of getting partially-written data? And it's
not clear to me the check of the wal page headers is the right one
anyway.

It seems like all of this would be simpler if you checked first how far
you can safely read data, and then just loop and read that far. I'm not
sure that it's worth it to try to mix the validity checks with the
reading of the data.

Regards,
    Jeff Davis

[1] https://www.postgresql.org/message-id/4132fe48f831ed6f73a9eb191af5fe475384969c.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.camel@j-davis.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Jan 5, 2024 at 7:20 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> > Thanks. Attaching remaining patches as v18 patch-set after commits
> > c3a8e2a7cb16 and 766571be1659.
>
> Comments:

Thanks for reviewing.

> I still think the right thing for this patch is to call
> XLogReadFromBuffers() directly from the callers who need it, and not
> change WALRead(). I am open to changing this later, but for now that
> makes sense to me so that we can clearly identify which callers benefit
> and why. I have brought this up a few times before[1][2], so there must
> be some reason that I don't understand -- can you explain it?

IMO, WALRead() is the best place to have XLogReadFromBuffers() for 2
reasons: 1) All of the WALRead() callers (except FRONTEND tools) will
benefit if WAL is read from WAL buffers. I don't see any reason for a
caller to skip reading from WAL buffers. If there's a caller (in
future) wanting to skip reading from WAL buffers, I'm open to adding a
flag in XLogReaderState to skip.  2) The amount of code is reduced if
XLogReadFromBuffers() sits in WALRead().

> The XLogReadFromBuffersResult is never used. I can see how it might be
> useful for testing or asserts, but it's not used even in the test
> module. I don't think we should clutter the API with that kind of thing
> -- let's just return the nread.

Removed.

> I also do not like the terminology "partial hit" to be used in this
> way. Perhaps "short read" or something about hitting the end of
> readable WAL would be better?

"short read" seems good. Done that way in the new patch.

> I like how the callers of WALRead() are being more precise about the
> bytes they are requesting.
>
> You've added several spinlock acquisitions to the loop. Two explicitly,
> and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
> to read slightly further, but introduce performance risk. Was this
> discussed?

I opted to read slightly further, thinking that the loops wouldn't run
long enough for the spinlocks to appear costly. Basically, I wasn't
sure which approach was best. Now that there's an opinion to keep them
outside the loop, I agree with it. Done that way in the new patch.

> The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
> seems like there's a risk of getting partially-written data? And it's
> not clear to me the check of the wal page headers is the right one
> anyway.
>
> It seems like all of this would be simpler if you checked first how far
> you can safely read data, and then just loop and read that far.  I'm not
> sure that it's worth it to try to mix the validity checks with the
> reading of the data.

XLogReadFromBuffers needs the page header check after reading the page
from WAL buffers. Typically, we should not encounter a WAL buffer page
that just got initialized, because we waited long enough above for the
in-progress WAL insertions to finish. However, a slight window can
exist after that wait in which the buffer page being read gets
replaced, especially under high WAL generation rates. After all, we are
reading from WAL buffers without any locks here. So, let's not count
such a page in.

I've addressed the above review comments and attached v19 patch-set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, 2024-01-10 at 19:59 +0530, Bharath Rupireddy wrote:
> I've addressed the above review comments and attached v19 patch-set.

Regarding:

-       if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-                                &errinfo))
+       if (!WALRead(state, cur_page, targetPagePtr, count, tli,
&errinfo))

I'd like to understand the reason it was using XLOG_BLCKSZ before. Was
it a performance optimization? Or was it to zero the remainder of the
caller's buffer (readBuf)? Or something else?

If it was to zero the remainder of the caller's buffer, then we should
explicitly make that the caller's responsibility.

Regards,
    Jeff Davis




Hi Bharath,

Thanks for working on this. It seems like a nice improvement to have.

Here are some comments on 0001 patch.

1-  xlog.c
+ /*
+ * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+ * server is in recovery. 2) WAL is inserted into WAL buffers on current
+ * server's insertion TLI. 3) Invalid starting WAL location.
+ */

Shouldn't the comment be something like "2) WAL is *not* inserted into WAL buffers on current server's insertion TLI" since the condition to break is tli != GetWALInsertionTimeLine() 

2-
+ /*
+ * WAL being read doesn't yet exist i.e. past the current insert position.
+ */
+ if ((startptr + count) > reservedUpto)
+ return ntotal;

This question may not even make sense, but I wonder whether we can read from startptr only up to reservedUpto in case startptr + count exceeds reservedUpto?

3-
+ /* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+ if ((startptr + count) > LogwrtResult.Write &&
+ (startptr + count) <= reservedUpto)
+ WaitXLogInsertionsToFinish(startptr + count);

Do we need to check if (startptr + count) <= reservedUpto as we already verified this condition a few lines above?

4-
+ Assert(nread > 0);
+ memcpy(dst, data, nread);
+
+ /*
+ * Make sure we don't read xlblocks down below before the page
+ * contents up above.
+ */
+ pg_read_barrier();
+
+ /* Recheck if the read page still exists in WAL buffers. */
+ endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+ /* Return if the page got initialized while we were reading it. */
+ if (expectedEndPtr != endptr)
+ break;
+
+ /*
+ * Typically, we must not read a WAL buffer page that just got
+ * initialized. Because we waited enough for the in-progress WAL
+ * insertions to finish above. However, there can exist a slight
+ * window after the above wait finishes in which the read buffer page
+ * can get replaced especially under high WAL generation rates. After
+ * all, we are reading from WAL buffers without any locks here. So,
+ * let's not count such a page in.
+ */
+ phdr = (XLogPageHeader) page;
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+   phdr->xlp_tli == tli))
+ break;

I see that you recheck at the end whether the page still exists. What would you think about memcpy'ing only after being sure that we will need and use the recently read data? If we break the loop during the recheck, we simply discard the data read in the latest attempt. I guess this may not be a big deal, but the data would be unnecessarily copied into the destination in such a case.

5- xlogreader.c
+ nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+ if (nread > 0)
+ {
+ /*
+ * Check if its a full read, short read or no read from WAL buffers.
+ * For short read or no read, continue to read the remaining bytes
+ * from WAL file.
+ *
+ * XXX: It might be worth to expose WAL buffer read stats.
+ */
+ if (nread == count) /* full read */
+ return true;
+ else if (nread < count) /* short read */
+ {
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

Typo in the comment. Should be like "Check if *it's* a full read, short read or no read from WAL buffers."

Also, I don't think XLogReadFromBuffers() returns anything less than 0 or more than count. Is verifying nread > 0 necessary? I think if nread does not equal count, we can simply assume that it's a short read (or no read at all in case nread is 0, which we don't need to handle specifically).
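
A sketch of that simplification (all names here are illustrative stand-ins, not the patch's actual functions):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in: how many bytes the WAL buffers can serve. */
static size_t buffers_available = 100;

static size_t
read_from_buffers(char *buf, size_t count)
{
    size_t  n = count < buffers_available ? count : buffers_available;

    memset(buf, 'B', n);        /* pretend these bytes came from WAL buffers */
    return n;
}

static void
read_from_file(char *buf, size_t count)
{
    memset(buf, 'F', count);    /* pretend these bytes came from the WAL file */
}

/*
 * No separate "no read"/"short read" branches: whatever the buffers
 * didn't serve (possibly everything) is read from the file.
 */
static void
wal_read(char *buf, size_t count)
{
    size_t  nread = read_from_buffers(buf, count);

    if (nread < count)
        read_from_file(buf + nread, count - nread);
}
```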

6-
+ /*
+ * We determined how much of the page can be validly read as 'count', read
+ * that much only, not the entire page. Since WALRead() can read the page
+ * from WAL buffers, in which case, the page is not guaranteed to be
+ * zero-padded up to the page boundary because of the concurrent
+ * insertions.
+ */

I'm not sure about pasting this into most of the places we call WALRead(). Wouldn't it be better if we mentioned this somewhere around WALRead() only once?


Best,
--
Melih Mutlu
Microsoft
Hi,

On 2024-01-10 19:59:29 +0530, Bharath Rupireddy wrote:
> +        /*
> +         * Typically, we must not read a WAL buffer page that just got
> +         * initialized. Because we waited enough for the in-progress WAL
> +         * insertions to finish above. However, there can exist a slight
> +         * window after the above wait finishes in which the read buffer page
> +         * can get replaced especially under high WAL generation rates. After
> +         * all, we are reading from WAL buffers without any locks here. So,
> +         * let's not count such a page in.
> +         */
> +        phdr = (XLogPageHeader) page;
> +        if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +              phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +              phdr->xlp_tli == tli))
> +            break;

I still think that anything that requires such checks shouldn't be
merged. It's completely bogus to check page contents for validity when we
should have metadata telling us which range of the buffers is valid and which
not.

Greetings,

Andres Freund



On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:
> I still think that anything that requires such checks shouldn't be
> merged. It's completely bogus to check page contents for validity
> when we
> should have metadata telling us which range of the buffers is valid
> and which
> not.

The check seems entirely unnecessary, to me. A leftover from v18?

I have attached a new patch (version "19j") to illustrate some of my
previous suggestions. I didn't spend a lot of time on it so it's not
ready for commit, but I believe my suggestions are easier to understand
in code form.

Note that, right now, it only works for XLogSendPhysical(). I believe
it's best to just make it work for 1-3 callers that we understand well,
and we can generalize later if it makes sense.

I'm still not clear on why some callers are reading XLOG_BLCKSZ
(expecting zeros at the end), and if it's OK to just change them to use
the exact byte count.

Also, if we've detected that the first requested buffer has been
evicted, is there any value in continuing the loop to see if more
recent buffers are available? For example, if the requested LSNs range
over buffers 4, 5, and 6, and 4 has already been evicted, should we try
to return LSN data from 5 and 6 at the proper offset in the dest
buffer? If so, we'd need to adjust the API so the caller knows what
parts of the dest buffer were filled in.

Regards,
    Jeff Davis


Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Jan 23, 2024 at 9:37 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:
> > I still think that anything that requires such checks shouldn't be
> > merged. It's completely bogus to check page contents for validity
> > when we
> > should have metadata telling us which range of the buffers is valid
> > and which
> > not.
>
> The check seems entirely unnecessary, to me. A leftover from v18?
>
> I have attached a new patch (version "19j") to illustrate some of my
> previous suggestions. I didn't spend a lot of time on it so it's not
> ready for commit, but I believe my suggestions are easier to understand
> in code form.

> Note that, right now, it only works for XLogSendPhysical(). I believe
> it's best to just make it work for 1-3 callers that we understand well,
> and we can generalize later if it makes sense.

+1 to do it for XLogSendPhysical() first. Enabling it for others can
just be done as something like the attached v20-0003.

> I'm still not clear on why some callers are reading XLOG_BLCKSZ
> (expecting zeros at the end), and if it's OK to just change them to use
> the exact byte count.

"expecting zeros at the end" - this can't always be true as the WAL
can get flushed after determining the flush ptr before reading it from
the WAL file. FWIW, here's what I've tried previously -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call-sites isn't quite right.

    /*
     * Even though we just determined how much of the page can be validly read
     * as 'count', read the whole page anyway. It's guaranteed to be
     * zero-padded up to the page boundary if it's incomplete.
     */
    if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,

I think this needs to be discussed separately. If okay, I'll start a new thread.

> Also, if we've detected that the first requested buffer has been
> evicted, is there any value in continuing the loop to see if more
> recent buffers are available? For example, if the requested LSNs range
> over buffers 4, 5, and 6, and 4 has already been evicted, should we try
> to return LSN data from 5 and 6 at the proper offset in the dest
> buffer? If so, we'd need to adjust the API so the caller knows what
> parts of the dest buffer were filled in.

I'd skip this capability for now to keep the API simple and clear,
but we can consider adding it as needed.

I reviewed the v19j and attached v20 patch set:

1.
 * The caller must ensure that it's reasonable to read from the WAL buffers,
 * i.e. that the requested data is from the current timeline, that we're not
 * in recovery, etc.

I still think XLogReadFromBuffers can just return early in any of the
above cases instead of relying on comments. I feel we must assume the
caller may ask for WAL from a different timeline and/or while in
recovery, and design the API to deal with it. Done that way in the v20
patch.

2. Fixed some typos, reworded a few comments (i.e. used "current
insert/write position" instead of "Insert/Write pointer" like
elsewhere), ran pgindent.

3.
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.

Removed the above comment before WALRead() since we have that facility
now. Perhaps, we can say the callers can suck data directly from the
WAL buffers using XLogReadFromBuffers. But I have no strong opinion on
this.

4.
+     * Most callers will have already updated LogwrtResult when determining
+     * how far to read, but it's OK if it's out of date. (XXX: is it worth
+     * taking a spinlock to update LogwrtResult and check again before calling
+     * WaitXLogInsertionsToFinish()?)

If the callers use GetFlushRecPtr() to determine how far to read,
LogwrtResult will be *reasonably* up to date; otherwise not. If
LogwrtResult is a bit old, XLogReadFromBuffers will call
WaitXLogInsertionsToFinish, which will just loop over all insertion
locks and return.

As far as the current WAL readers are concerned, we don't need an
explicit spinlock to determine LogwrtResult because all of them use
GetFlushRecPtr() to determine how far to read. If there's any caller
that's not updating LogwrtResult at all, we can consider reading
LogwrtResult ourselves in the future.

5. I think the two requirements specified at
https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de
still hold with the v19j.

5.1 Never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos) as it does not
exist.
5.2 If the to-be-read LSN is between XLogCtl->LogwrtResult->Write and
XLogBytePosToRecPtr(Insert->CurrBytePos) we need to call
WaitXLogInsertionsToFinish() before copying the data.

+    if (upto > LogwrtResult.Write)
+    {
+        XLogRecPtr writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+        upto = Min(upto, writtenUpto);
+        nbytes = upto - startptr;
+    }

XLogReadFromBuffers ensures the above two by adjusting upto to
Min(upto, writtenUpto), as WaitXLogInsertionsToFinish returns the
position of the oldest insertion that is still in progress.

For instance, the current write LSN is 100, current insert LSN is 150
and upto is 200 - we only read upto 150 if startptr is < 150; we don't
read anything if startptr is > 150.

6. I've modified the test module in v20-0002 patch as follows:
6.1 Renamed the module to read_wal_from_buffers, stripping the
"test_" prefix, which otherwise makes the name longer. Longer names
can cause failures on some Windows BF members if the PATH/file name
is too long.
6.2 Tweaked tests to hit WaitXLogInsertionsToFinish() and upto =
Min(upto, writtenUpto); in XLogReadFromBuffers.

PSA v20 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Thu, 2024-01-25 at 14:35 +0530, Bharath Rupireddy wrote:
>
> "expecting zeros at the end" - this can't always be true as the WAL
>
...

> I think this needs to be discussed separately. If okay, I'll start a
> new thread.

Thank you for investigating. When the above issue is handled, I'll be
more comfortable expanding the call sites for XLogReadFromBuffers().

> > Also, if we've detected that the first requested buffer has been
> > evicted, is there any value in continuing the loop to see if more
> > recent buffers are available? For example, if the requested LSNs
> > range
> > over buffers 4, 5, and 6, and 4 has already been evicted, should we
> > try
> > to return LSN data from 5 and 6 at the proper offset in the dest
> > buffer? If so, we'd need to adjust the API so the caller knows what
> > parts of the dest buffer were filled in.
>
> I'd skip this capability for now to keep the API simple and clear,
> but we can consider adding it as needed.

Agreed. This case doesn't seem important; I just thought I'd ask about
it.

> If the callers use GetFlushRecPtr() to determine how far to read,
> LogwrtResult will be *reasonably* up to date

It will be up-to-date enough that we'd never go through
WaitXLogInsertionsToFinish(), which is all we care about.

> As far as the current WAL readers are concerned, we don't need an
> explicit spinlock to determine LogwrtResult because all of them use
> GetFlushRecPtr() to determine how far to read. If there's any caller
> that's not updating LogwrtResult at all, we can consider reading
> LogwrtResult ourselves in the future.

So we don't actually need that path yet, right?

> 5. I think the two requirements specified at
> https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de
> still hold with the v19j.

Agreed.

> PSA v20 patch set.

0001 is very close. I have the following suggestions:

  * Don't just return zero. If the caller is doing something we don't
expect, we want to fix the caller. I understand you'd like this to be
more like a transparent optimization, and we may do that later, but I
don't think it's a good idea to do that now.

  * There's currently no use for reading LSNs between Write and Insert,
so remove the WaitXLogInsertionsToFinish() code path. That also means
we don't need the extra emitLog parameter, so we can remove that. When
we have a use case, we can bring it all back.

If you agree, I can just make those adjustments (and do some final
checking) and commit 0001. Otherwise let me know what you think.

0002: How does the test control whether the data requested is before
the Flush pointer, the Write pointer, or the Insert pointer? What if
the walwriter comes in and moves one of those pointers before the next
statement is executed? Also, do you think a test module is required for
the basic functionality in 0001, or only when we start doing more
complex things like reading past the Flush pointer?

0003: can you explain why this is useful for wal summarizer to read
from the buffers?

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Jan 26, 2024 at 8:31 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> > PSA v20 patch set.
>
> 0001 is very close. I have the following suggestions:
>
>   * Don't just return zero. If the caller is doing something we don't
> expect, we want to fix the caller. I understand you'd like this to be
> more like a transparent optimization, and we may do that later, but I
> don't think it's a good idea to do that now.

+    if (RecoveryInProgress() ||
+        tli != GetWALInsertionTimeLine())
+        return ntotal;
+
+    Assert(!XLogRecPtrIsInvalid(startptr));

Are you suggesting we error out instead of returning 0? If yes, I
disagree. A failure to read due to unmet preconditions doesn't
necessarily have to be an error. If we error out, the immediate
failure we see is in the src/bin/psql TAP test, which calls
XLogReadFromBuffers while the server is in recovery. How about
returning a negative value instead of 0, or returning true/false just
like WALRead?

>   * There's currently no use for reading LSNs between Write and Insert,
> so remove the WaitXLogInsertionsToFinish() code path. That also means
> we don't need the extra emitLog parameter, so we can remove that. When
> we have a use case, we can bring it all back.

I disagree with this. I don't see anything wrong with
XLogReadFromBuffers having the capability to wait for in-progress
insertions to finish. In fact, it makes the function near-complete.
Imagine implementing an extension (for fun, learning, educational, or
production purposes) to read unflushed WAL directly from WAL buffers,
using XLogReadFromBuffers as the page_read callback with the
xlogreader facility. AFAICT, I don't see a problem with the
WaitXLogInsertionsToFinish logic in XLogReadFromBuffers.

FWIW, one important aspect of XLogReadFromBuffers is its ability to
read the unflushed WAL from WAL buffers. Also, see a note from Andres
here https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de.

> If you agree, I can just make those adjustments (and do some final
> checking) and commit 0001. Otherwise let me know what you think.

Thanks. Please see my responses above.

> 0002: How does the test control whether the data requested is before
> the Flush pointer, the Write pointer, or the Insert pointer? What if
> the walwriter comes in and moves one of those pointers before the next
> statement is executed?

Tried to keep the walwriter quiet with wal_writer_delay = 10000ms and
wal_writer_flush_after = 1GB so as not to flush WAL in the background.
Also disabled autovacuum, and set checkpoint_timeout to a higher
value. All of this is done to generate minimal WAL so that WAL buffers
don't get overwritten. Do you see any problems with it?

> Also, do you think a test module is required for
> the basic functionality in 0001, or only when we start doing more
> complex things like reading past the Flush pointer?

With WaitXLogInsertionsToFinish in XLogReadFromBuffers, we already
have that capability. Having a separate test module ensures the code
is tested properly.

As far as the test is concerned, it verifies 2 cases:
1. Check if WAL is successfully read from WAL buffers. For this, the
test generates minimal WAL and reads from WAL buffers from the start
LSN = current insert LSN captured before the WAL generation.
2. Check with a WAL that doesn't yet exist. For this, the test reads
from WAL buffers from the start LSN = current flush LSN+16MB (a
randomly chosen higher value).

> 0003: can you explain why this is useful for wal summarizer to read
> from the buffers?

Can the WAL summarizer ever read the WAL on the current TLI? I'm not
so sure; I haven't explored it in detail.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Fri, 2024-01-26 at 19:31 +0530, Bharath Rupireddy wrote:
> Are you suggesting to error out instead of returning 0?

We'd do neither of those things, because no caller should actually call
it while RecoveryInProgress() or on a different timeline.

>  How about
> returning a negative value instead of just 0 or returning true/false
> just like WALRead?

All of these things are functionally equivalent -- the same thing is
happening at the end. This is just a discussion about API style and how
that will interact with hypothetical callers that don't exist today.
And it can also be easily changed later, so we aren't stuck with
whatever decision happens here.

>
> Imagine, implementing an extension (may be for fun or learning or
> educational or production purposes) to read unflushed WAL directly
> from WAL buffers using XLogReadFromBuffers as page_read callback with
> xlogreader facility.

That makes sense; I didn't realize you intended to use this from an
extension. I'm fine considering that as a separate patch that could
potentially be committed soon after this one.

I'd like some more details, but can I please just commit the basic
functionality now-ish?

> Tried to keep wal_writer quiet with wal_writer_delay=10000ms and
> wal_writer_flush_after = 1GB so as not to flush WAL in the background.
> Also, disabled autovacuum, and set checkpoint_timeout to a higher
> value. All of this is done to generate minimal WAL so that WAL
> buffers
> don't get overwritten. Do you see any problems with it?

Maybe check it against pg_current_wal_lsn(), and see if the Write
pointer moved ahead? Perhaps even have a (limited) loop that tries
again to catch it at the right time?

> Can the WAL summarizer ever read the WAL on current TLI? I'm not so
> sure about it, I haven't explored it in detail.

Let's just not call XLogReadFromBuffers from there.

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sat, Jan 27, 2024 at 1:04 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> All of these things are functionally equivalent -- the same thing is
> happening at the end. This is just a discussion about API style and how
> that will interact with hypothetical callers that don't exist today.
> And it can also be easily changed later, so we aren't stuck with
> whatever decision happens here.

I'll leave that up to you. I'm okay either way: 1) ensure the caller
doesn't use XLogReadFromBuffers, or 2) have XLogReadFromBuffers return
as if nothing was read when in recovery or on a different timeline.

> > Imagine, implementing an extension (may be for fun or learning or
> > educational or production purposes) to read unflushed WAL directly
> > from WAL buffers using XLogReadFromBuffers as page_read callback with
> > xlogreader facility.
>
> That makes sense, I didn't realize you intended to use this from an
> extension. I'm fine considering that as a separate patch that could
> potentially be committed soon after this one.

Yes, I've turned that into 0002 patch.

> I'd like some more details, but can I please just commit the basic
> functionality now-ish?

+1.

> > Tried to keep wal_writer quiet with wal_writer_delay=10000ms and
> wal_writer_flush_after = 1GB so as not to flush WAL in the background.
> > Also, disabled autovacuum, and set checkpoint_timeout to a higher
> > value. All of this is done to generate minimal WAL so that WAL
> > buffers
> > don't get overwritten. Do you see any problems with it?
>
> Maybe check it against pg_current_wal_lsn(), and see if the Write
> pointer moved ahead? Perhaps even have a (limited) loop that tries
> again to catch it at the right time?

Adding a loop seems reasonable here; done in v21-0003. Also, I've
added wal_level = minimal per src/test/recovery/t/039_end_of_wal.pl,
introduced by commit bae868caf22, which also tries to keep WAL
activity to a minimum.

> > Can the WAL summarizer ever read the WAL on current TLI? I'm not so
> > sure about it, I haven't explored it in detail.
>
> Let's just not call XLogReadFromBuffers from there.

Removed.

PSA v21 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
Hmm, this looks quite nice and simple.  My only comment is that a
sequence like this

   /* Read from WAL buffers, if available. */
   rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
                                startptr, nbytes, xlogreader->seg.ws_tli);
   output_message.len += rbytes;
   startptr += rbytes;
   nbytes -= rbytes;

   if (!WALRead(xlogreader,
                &output_message.data[output_message.len],
                startptr,

leaves you wondering if WALRead() should be called at all or not, in the
case when all bytes were read by XLogReadFromBuffers.  I think in many
cases what's going to happen is that nbytes is going to be zero, and
then WALRead is going to return having done nothing in its inner loop.
I think this warrants a comment somewhere.  Alternatively, we could
short-circuit the 'if' expression so that WALRead() is not called in
that case (but I'm not sure it's worth the loss of code clarity).

Also, but this is really quite minor, it seems sad to add more functions
with the prefix XLog, when we have renamed things to use the prefix WAL,
and we have kept the old names only to avoid backpatchability issues.
I mean, if we have WALRead() already, wouldn't it make perfect sense to
name the new routine WALReadFromBuffers?

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Jan 30, 2024 at 11:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hmm, this looks quite nice and simple.

Thanks for looking at it.

> My only comment is that a
> sequence like this
>
>    /* Read from WAL buffers, if available. */
>    rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
>                                 startptr, nbytes, xlogreader->seg.ws_tli);
>    output_message.len += rbytes;
>    startptr += rbytes;
>    nbytes -= rbytes;
>
>    if (!WALRead(xlogreader,
>                 &output_message.data[output_message.len],
>                 startptr,
>
> leaves you wondering if WALRead() should be called at all or not, in the
> case when all bytes were read by XLogReadFromBuffers.  I think in many
> cases what's going to happen is that nbytes is going to be zero, and
> then WALRead is going to return having done nothing in its inner loop.
> I think this warrants a comment somewhere.  Alternatively, we could
> short-circuit the 'if' expression so that WALRead() is not called in
> that case (but I'm not sure it's worth the loss of code clarity).

It might help avoid a function call when reading from WAL buffers
fully satisfies the read. And it's not that clumsy with the change;
see the following. I've changed it in the attached v22 patch set.

if (nbytes > 0 &&
    !WALRead(xlogreader,

> Also, but this is really quite minor, it seems sad to add more functions
> with the prefix XLog, when we have renamed things to use the prefix WAL,
> and we have kept the old names only to avoid backpatchability issues.
> I mean, if we have WALRead() already, wouldn't it make perfect sense to
> name the new routine WALReadFromBuffers?

WALReadFromBuffers looks better. Used that in v22 patch.

Please see the attached v22 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
Looking at 0003, where an XXX comment is added about taking a spinlock
to read LogwrtResult, I suspect the answer is probably not, because it
is likely to slow down the other uses of LogwrtResult.  But I wonder if
a better path forward would be to base further work on my older
uncommitted patch to make LogwrtResult use atomics.  With that, you
wouldn't have to block others in order to read the value.  I last posted
that patch in [1] in case you're curious.

[1] https://postgr.es/m/20220728065920.oleu2jzsatchakfj@alvherre.pgsql

The reason I abandoned that patch is that the performance problem that I
was fixing no longer existed -- it was fixed in a different way.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Jan 31, 2024 at 3:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Looking at 0003, where an XXX comment is added about taking a spinlock
> to read LogwrtResult, I suspect the answer is probably not, because it
> is likely to slow down the other uses of LogwrtResult.

We avoided keeping LogwrtResult up to date, as the current callers of
WALReadFromBuffers() all determine the flush LSN using
GetFlushRecPtr(); see comment #4 at
https://www.postgresql.org/message-id/CALj2ACV%3DC1GZT9XQRm4iN1NV1T%3DhLA_hsGWNx2Y5-G%2BmSwdhNg%40mail.gmail.com.

> But I wonder if
> a better path forward would be to base further work on my older
> uncommitted patch to make LogwrtResult use atomics.  With that, you
> wouldn't have to block others in order to read the value.  I last posted
> that patch in [1] in case you're curious.
>
> [1] https://postgr.es/m/20220728065920.oleu2jzsatchakfj@alvherre.pgsql
>
> The reason I abandoned that patch is that the performance problem that I
> was fixing no longer existed -- it was fixed in a different way.

Nice. I'll respond in that thread. FWIW, there's been a recent
attempt at turning unloggedLSN into a 64-bit atomic -
https://commitfest.postgresql.org/46/4330/ - and that might need
pg_atomic_monotonic_advance_u64. I guess we would have to bring your
patch and the unloggedLSN one into a single thread to have a better
discussion.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:
> Please see the attached v22 patch set.

Committed 0001.

For 0002 & 0003, I'd like more clarity on how they will actually be
used by an extension.

For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
can fix that independently, as discussed here:

https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com

Regards,
    Jeff Davis




Hi,

On 2024-02-12 11:33:24 -0800, Jeff Davis wrote:
> On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:
> > Please see the attached v22 patch set.
>
> Committed 0001.

Yay, I think this is very cool. There are plenty of other improvements
that can be based on this...


One thing I'm a bit confused in the code is the following:

+    /*
+     * Don't read past the available WAL data.
+     *
+     * Check using local copy of LogwrtResult. Ordinarily it's been updated by
+     * the caller when determining how far to read; but if not, it just means
+     * we'll read less data.
+     *
+     * XXX: the available WAL could be extended to the WAL insert pointer by
+     * calling WaitXLogInsertionsToFinish().
+     */
+    upto = Min(startptr + count, LogwrtResult.Write);
+    nbytes = upto - startptr;

Shouldn't it pretty much be a bug to ever encounter this? There aren't
equivalent checks in WALRead(), so any user of WALReadFromBuffers() that then
falls back to WALRead() is just going to send unwritten data.

ISTM that this should be an assertion or error.

Greetings,

Andres Freund



On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:
> +    upto = Min(startptr + count, LogwrtResult.Write);
> +    nbytes = upto - startptr;
>
> Shouldn't it pretty much be a bug to ever encounter this?

In the current code it's impossible, though Bharath hinted at an
extension which could reach that path.

What I committed was a bit of a compromise -- earlier versions of the
patch supported reading right up to the Insert pointer (which requires
a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
code without seeing more about how that would be used, but I thought
it was reasonable to have some simple code in there to allow reading up
to the Write pointer.

It seems closer to the structure that we will ultimately need to
replicate unflushed data, right?

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/CALj2ACW65mqn6Ukv57SqDTMzAJgd1N_AdQtDgy+gMDqu6v618Q@mail.gmail.com



Hi,

On 2024-02-12 12:46:00 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:
> > +    upto = Min(startptr + count, LogwrtResult.Write);
> > +    nbytes = upto - startptr;
> >
> > Shouldn't it pretty much be a bug to ever encounter this?
>
> In the current code it's impossible, though Bharath hinted at an
> extension which could reach that path.
>
> What I committed was a bit of a compromise -- earlier versions of the
> patch supported reading right up to the Insert pointer (which requires
> a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
> code without seeing more about how that would be used, but I thought
> it was reasonable to have some simple code in there to allow reading up
> to the Write pointer.

I doubt there's a sane way to use WALRead() without *first* ensuring that the
range of data is valid. I think we're better off moving that responsibility
explicitly to the caller and adding an assertion verifying that.


> It seems closer to the structure that we will ultimately need to
> replicate unflushed data, right?

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Greetings,

Andres Freund



On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:
> For 0002 & 0003, I'd like more clarity on how they will actually be
> used by an extension.

In patch 0002, I'm concerned about calling
WaitXLogInsertionsToFinish(). It loops through all the locks, but
doesn't have any early return path or advance any state.

So if it's repeatedly called with the same or similar values it seems
like it would be doing a lot of extra work.

I'm not sure of the best fix. We could add something to LogwrtResult to
track a new LSN that represents the highest known point where all
inserters are finished (in other words, the latest return value of
WaitXLogInsertionsToFinish()). That seems invasive, though.

Regards,
    Jeff Davis




Hi,

On 2024-02-12 15:56:19 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:
> > For 0002 & 0003, I'd like more clarity on how they will actually be
> > used by an extension.
>
> In patch 0002, I'm concerned about calling
> WaitXLogInsertionsToFinish(). It loops through all the locks, but
> doesn't have any early return path or advance any state.

I doubt it'd be too bad - we call that at a much, much higher
frequency during write-heavy OLTP workloads (cf. XLogFlush()). It can
be a performance issue there, but only after increasing
NUM_XLOGINSERT_LOCKS - before that, the limited number of writers is
the limit. Compared to that, the walsender shouldn't be a significant
factor.

However, I think it's a very bad idea to call
WaitXLogInsertionsToFinish() from within WALReadFromBuffers(). This
needs to happen at the caller, not down in WALReadFromBuffers().

I don't see why we would want to weaken the error condition in
WaitXLogInsertionsToFinish() - I suspect it'd not work correctly to wait for
insertions that aren't yet in progress and it just seems like an API misuse.


> So if it's repeatedly called with the same or similar values it seems like
> it would be doing a lot of extra work.
>
> I'm not sure of the best fix. We could add something to LogwrtResult to
> track a new LSN that represents the highest known point where all
> inserters are finished (in other words, the latest return value of
> WaitXLogInsertionsToFinish()). That seems invasive, though.

FWIW, I think LogwrtResult is an anti-pattern, perhaps introduced due
to misunderstanding how cache coherency works. It's not fundamentally
faster to access non-shared memory. It'd make far more sense to allow
lock-free access to the shared LogwrtResult.

Greetings,

Andres Freund



On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:
>
> It doesn't really seem like a necessary, or even particularly useful,
> part. You couldn't just call WALRead() for that, since the caller
> would need
> to know the range up to which WAL is valid but not yet flushed as
> well. Thus
> the caller would need to first use WaitXLogInsertionsToFinish() or
> something
> like it anyway - and then there's no point in doing the WALRead()
> anymore.

I follow until the last part. Did you mean "and then there's no point
in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
anymore"?

For now, should I assert that the requested WAL data is before the
Flush pointer or assert that it's before the Write pointer?

> Note that for replicating unflushed data, we *still* might need to
> fall back
> to reading WAL data from disk. In which case not asserting in
> WALRead() would
> just make it hard to find bugs, because not using
> WaitXLogInsertionsToFinish()
> would appear to work as long as data is in wal buffers, but as soon
> as we'd
> fall back to on-disk (but unflushed) data, we'd send bogus WAL.

That makes me wonder whether my previous idea[1] might matter: when
some buffers have been evicted, should WALReadFromBuffers() keep going
through the loop and return the end portion of the requested data
rather than the beginning?

We can sort that out when we get closer to replicating unflushed WAL.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.camel@j-davis.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Feb 13, 2024 at 1:03 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
> can fix that independently, as discussed here:
>
> https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com

Thanks. I started a new thread for this -
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3%3DAEEZPJmqhpEOBGExg%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Feb 13, 2024 at 5:06 AM Andres Freund <andres@anarazel.de> wrote:
>
> I doubt there's a sane way to use WALRead() without *first* ensuring that the
> range of data is valid. I think we're better off moving that responsibility
> explicitly to the caller and adding an assertion verifying that.
>
> It doesn't really seem like a necessary, or even particularly useful,
> part. You couldn't just call WALRead() for that, since the caller would need
> to know the range up to which WAL is valid but not yet flushed as well. Thus
> the caller would need to first use WaitXLogInsertionsToFinish() or something
> like it anyway - and then there's no point in doing the WALRead() anymore.
>
> Note that for replicating unflushed data, we *still* might need to fall back
> to reading WAL data from disk. In which case not asserting in WALRead() would
> just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
> would appear to work as long as data is in wal buffers, but as soon as we'd
> fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Callers of WALRead() do a good amount of work to figure out what's
been flushed, yet they can still end up reading unflushed and/or
invalid data; see the comment [1] around the WALRead() call sites as
well as a recent thread [2] for more details.

IIUC, here's the summary of the discussion that has happened so far:
a) If only replicating flushed data, then ensure all the WALRead()
callers read only as much as is valid out of startptr+count. The fix
provided in [2] can help do that.
b) If only replicating flushed data, then ensure all the
WALReadFromBuffers() callers read only as much as is valid out of
startptr+count. Current and expected WALReadFromBuffers() callers will
anyway determine how much of it is flushed and can validly be read.
c) If planning to replicate unflushed data, then ensure all the
WALRead() callers wait until startptr+count is past the current insert
position with WaitXLogInsertionsToFinish().
d) If planning to replicate unflushed data, then ensure all the
WALReadFromBuffers() callers wait until startptr+count is past the
current insert position with WaitXLogInsertionsToFinish().

Adding an assertion or error in WALReadFromBuffers() to ensure the
callers follow the above set of rules is easy. We can just do
Assert(startptr+count <= LogwrtResult.Flush).
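To make that rule concrete, here is a toy, self-contained C model of
the bounds check combined with the modulo-mapped page lookup described
earlier in the thread. Names like xlblocks, buf_idx(), and
read_from_buffers() are simplified stand-ins for the PostgreSQL
internals (XLogCtl->xlblocks, XLogRecPtrToBufIdx(),
WALReadFromBuffers()), not the actual code:

```c
/*
 * Toy model of reading WAL from a ring of in-memory page buffers.
 * All names and sizes are invented for illustration only.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ   8              /* toy WAL page size */
#define NBUFFERS 4              /* toy number of WAL buffer pages */

typedef uint64_t XLogRecPtr;

/* xlblocks[i] holds the end+1 LSN of the page cached in slot i, 0 if unused */
static XLogRecPtr xlblocks[NBUFFERS];
static char wal_buffers[NBUFFERS][BLCKSZ];
static XLogRecPtr flushed_upto; /* model of LogwrtResult.Flush */

/* Map a page's start LSN to a buffer slot, like XLogRecPtrToBufIdx(). */
static int
buf_idx(XLogRecPtr pagestart)
{
    return (int) ((pagestart / BLCKSZ) % NBUFFERS);
}

/* Install one page into the ring, evicting whatever was in the slot. */
static void
put_page(XLogRecPtr pagestart, const char *data)
{
    int idx = buf_idx(pagestart);

    memcpy(wal_buffers[idx], data, BLCKSZ);
    xlblocks[idx] = pagestart + BLCKSZ;
}

/*
 * Copy [startptr, startptr+count) from the ring into dst.  Returns the
 * number of bytes copied; stops at the first evicted page, at which
 * point the real caller would fall back to WALRead().
 */
static size_t
read_from_buffers(char *dst, XLogRecPtr startptr, size_t count)
{
    size_t copied = 0;

    /* the rule discussed above: only flushed WAL may be requested */
    assert(startptr + count <= flushed_upto);

    while (copied < count)
    {
        XLogRecPtr ptr = startptr + copied;
        XLogRecPtr pagestart = ptr - (ptr % BLCKSZ);
        int idx = buf_idx(pagestart);
        size_t off, n;

        /* slot must still hold the page we want, else it was evicted */
        if (xlblocks[idx] != pagestart + BLCKSZ)
            break;

        off = (size_t) (ptr % BLCKSZ);
        n = BLCKSZ - off;
        if (n > count - copied)
            n = count - copied;
        memcpy(dst + copied, wal_buffers[idx] + off, n);
        copied += n;
    }
    return copied;
}
```

The early break on an evicted page is what makes a short read safe:
whatever the ring can't serve, the caller reads from the file instead.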

However, adding a similar assertion or error in WALRead() gets
trickier, as it's called from many places - walsenders, backends,
external tools, etc., even when the server is in recovery. Therefore,
determining the actual valid LSN is a bit of a challenge.

What I think is the best way:
- Try to get the fix for (a) provided at [2] in.
- Implement both (c) and (d).
- Have the assertion in WALReadFromBuffers() ensuring the callers wait
until startptr+count is past the current insert position with
WaitXLogInsertionsToFinish().
- Have a comment around WALRead() asking the callers to request only
WAL that's already written to disk, because WALRead() itself can't
easily determine what's on disk given the many scenarios it gets
called in - when the server is in recovery, for the walsummarizer, etc.
- In the new test module, demonstrate how one can read unflushed data
with WALReadFromBuffers() and/or WALRead() +
WaitXLogInsertionsToFinish().

Thoughts?

[1]
/*
 * Even though we just determined how much of the page can be validly read
 * as 'count', read the whole page anyway. It's guaranteed to be
 * zero-padded up to the page boundary if it's incomplete.
 */
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
             &errinfo))

[2] https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3%3DAEEZPJmqhpEOBGExg%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Attached 2 patches.

Per Andres's suggestion, 0001 adds an:
  Assert(startptr + count <= LogwrtResult.Write)

Though if we want to allow the caller (e.g. in an extension) to
determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
then the check is wrong. Maybe we should just get rid of that code
entirely and trust the caller to request a reasonable range?

On Mon, 2024-02-12 at 17:33 -0800, Jeff Davis wrote:
> That makes me wonder whether my previous idea[1] might matter: when
> some buffers have been evicted, should WALReadFromBuffers() keep
> going
> through the loop and return the end portion of the requested data
> rather than the beginning?
> [1]
> https://www.postgresql.org/message-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.camel@j-davis.com

0002 is to illustrate the above idea. It's a strange API so I don't
intend to commit it in this form, but I think we will ultimately need
to do something like it when we want to replicate unflushed data.

The idea is that data past the Write pointer is always (and only)
available in the WAL buffers, so WALReadFromBuffers() should always
return it. That way we can always safely fall through to ordinary
WALRead(), which can only see before the Write pointer. There's also
data before the Write pointer that could be in the WAL buffers, and we
might as well copy that, too, if it's not evicted.

If some buffers are evicted, it will fill in the *end* of the buffer,
leaving a gap at the beginning. The nice thing is that if there is any
gap, it will be before the Write pointer, so we can always fall back to
WALRead() to fill the gap and it should always succeed.
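For illustration, here is a toy, self-contained C sketch of that
"fill from the end" behavior. The names (read_tail_from_buffers,
wal_on_disk, etc.) and the flat on-disk array are invented for the
model and do not correspond to the actual patch:

```c
/*
 * Toy model of "fill in the end of the requested range from WAL
 * buffers, leave a gap at the beginning for the disk fallback".
 * All names and sizes are invented for illustration only.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ   8              /* toy WAL page size */
#define NBUFFERS 4              /* toy number of WAL buffer pages */

typedef uint64_t XLogRecPtr;

static XLogRecPtr xlblocks[NBUFFERS];   /* end+1 LSN of each cached page */
static char wal_buffers[NBUFFERS][BLCKSZ];
static char wal_on_disk[64];            /* model of the WAL segment file */

static int
buf_idx(XLogRecPtr pagestart)
{
    return (int) ((pagestart / BLCKSZ) % NBUFFERS);
}

/*
 * Copy as much of the *tail* of [startptr, startptr+count) as is still
 * in the buffer ring, scanning backwards from the end.  Returns the LSN
 * where the copied region begins; [startptr, result) is the gap the
 * caller must fill via the disk fallback.
 */
static XLogRecPtr
read_tail_from_buffers(char *dst, XLogRecPtr startptr, size_t count)
{
    XLogRecPtr readfrom = startptr + count;

    while (readfrom > startptr)
    {
        XLogRecPtr pagestart = (readfrom - 1) - ((readfrom - 1) % BLCKSZ);
        int idx = buf_idx(pagestart);
        XLogRecPtr from;

        if (xlblocks[idx] != pagestart + BLCKSZ)
            break;              /* evicted: stop; gap is [startptr, readfrom) */

        from = pagestart > startptr ? pagestart : startptr;
        memcpy(dst + (from - startptr),
               wal_buffers[idx] + (from - pagestart),
               (size_t) (readfrom - from));
        readfrom = from;
    }
    return readfrom;
}

/* Fallback for the gap, standing in for ordinary WALRead(). */
static void
read_gap_from_disk(char *dst, XLogRecPtr startptr, XLogRecPtr upto)
{
    memcpy(dst, wal_on_disk + startptr, (size_t) (upto - startptr));
}
```

Because pages are only evicted after being written out, any gap left at
the beginning lies before the Write pointer, so the disk fallback is
always able to fill it.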

Regards,
    Jeff Davis


Attachment
On Tue, 2024-02-13 at 22:47 +0530, Bharath Rupireddy wrote:
> c) If planning to replicate unflushed data, then ensure all the
> WALRead() callers wait until startptr+count is past the current
> insert
> position with WaitXLogInsertionsToFinish().

WALRead() can't read past the Write pointer, so there's no point in
calling WaitXLogInsertionsToFinish(), right?


Regards,
    Jeff Davis




Hi,

On 2024-02-12 17:33:24 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:
> >
> > It doesn't really seem like a necessary, or even particularly useful,
> > part. You couldn't just call WALRead() for that, since the caller
> > would need
> > to know the range up to which WAL is valid but not yet flushed as
> > well. Thus
> > the caller would need to first use WaitXLogInsertionsToFinish() or
> > something
> > like it anyway - and then there's no point in doing the WALRead()
> > anymore.
>
> I follow until the last part. Did you mean "and then there's no point
> in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
> anymore"?

Yes, not sure what happened in my brain there.


> For now, should I assert that the requested WAL data is before the
> Flush pointer or assert that it's before the Write pointer?

Yes, I think that'd be good.


> > Note that for replicating unflushed data, we *still* might need to
> > fall back
> > to reading WAL data from disk. In which case not asserting in
> > WALRead() would
> > just make it hard to find bugs, because not using
> > WaitXLogInsertionsToFinish()
> > would appear to work as long as data is in wal buffers, but as soon
> > as we'd
> > fall back to on-disk (but unflushed) data, we'd send bogus WAL.
>
> That makes me wonder whether my previous idea[1] might matter: when
> some buffers have been evicted, should WALReadFromBuffers() keep going
> through the loop and return the end portion of the requested data
> rather than the beginning?

I still doubt that that will help very often, but it'll take some
experimentation to figure it out, I guess.


> We can sort that out when we get closer to replicating unflushed WAL.

+1

Greetings,

Andres Freund



Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Wed, Feb 14, 2024 at 6:59 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Attached 2 patches.
>
> Per Andres's suggestion, 0001 adds an:
>   Assert(startptr + count <= LogwrtResult.Write)
>
> Though if we want to allow the caller (e.g. in an extension) to
> determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
> then the check is wrong.

Right.

> Maybe we should just get rid of that code
> entirely and trust the caller to request a reasonable range?

I'd suggest we strike a balance here - error out in assert builds if
startptr+count is past the current insert position and trust the
callers for production builds. It has a couple of advantages over
doing just Assert(startptr + count <= LogwrtResult.Write):
1) It allows the caller to read unflushed WAL directly from WAL
buffers, see the attached 0005 for an example.
2) All the existing callers where WALReadFromBuffers() is expected to
be used already ensure WAL availability by reading only up to the
flush position, so there's no problem with it.

Also, add a note before WALRead() stating that the caller must request
only WAL that's already written out (up to LogwrtResult.Write). I'm
not so sure about this; perhaps we don't need this comment at all.

Here, I'm with v23 patch set:

0001 - Adds an assertion in WALReadFromBuffers() to ensure the
requested WAL isn't beyond the current insert position.
0002 - Adds a new test module to demonstrate how one can use
WALReadFromBuffers(), using WaitXLogInsertionsToFinish() if need be.
0003 - Uses WALReadFromBuffers() in more places like logical
walsenders and backends.
0004 - Removes zero-padding related stuff as discussed in
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=AEEZPJmqhpEOBGExg@mail.gmail.com.
This is needed in this patch set; otherwise, the assertion added in
0001 fails after 0003.
0005 - Adds a page_read callback for reading from WAL buffers in the
new test module added in 0002. Also, adds tests.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Fri, 2024-02-16 at 13:08 +0530, Bharath Rupireddy wrote:
> I'd suggest we strike a balance here - error out in assert builds if
> startptr+count is past the current insert position and trust the
> callers for production builds.

It's not reasonable to have divergent behavior between assert-enabled
builds and production. I think for now I will just commit the Assert as
Andres suggested until we work out a few more details.

One idea is to use Álvaro's work to eliminate the spinlock, and then
add a variable to represent the last known point returned by
WaitXLogInsertionsToFinish(). Then we can cheaply Assert that the
caller requested something before that point.

> Here, I'm with v23 patch set:

Thank you, I'll look at these.

Regards,
    Jeff Davis




Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> > Here, I'm with v23 patch set:
>
> Thank you, I'll look at these.

Thanks. Here's the v24 patch set after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Sat, Feb 17, 2024 at 10:27 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > > Here, I'm with v23 patch set:
> >
> > Thank you, I'll look at these.
>
> Thanks. Here's the v24 patch set after rebasing.

Ran pgperltidy on the new TAP test file added. Please see the attached
v25 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Tue, Feb 20, 2024 at 11:40 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Ran pgperltidy on the new TAP test file added. Please see the attached
> v25 patch set.

Please find the v26 patches after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Thu, Mar 21, 2024 at 11:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please find the v26 patches after rebasing.

Commit f3ff7bf83b added a check in WALReadFromBuffers() to ensure the
requested WAL is not past the WAL that's been copied to WAL buffers.
So, I've dropped the v26-0001 patch.

I've attached v27 patches for further review.

0001 adds a test module to demonstrate patterns for reading from WAL
buffers, like the caller ensuring the requested WAL is fully copied to
WAL buffers using WaitXLogInsertionsToFinish(), and an implementation
of an xlogreader page_read callback that reads not-yet-flushed WAL
directly from WAL buffers.

0002 uses WALReadFromBuffers() in more places, such as logical
walsenders, logical decoding functions, and backends reading WAL.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
"Andrey M. Borodin"
Date:

> On 8 Apr 2024, at 08:17, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hi Bharath!

As far as I understand, CF entry [0] is committed? I understand that there are some open followups, but I just want to
determine the correct CF item status...

Thanks!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4060/


On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:
> As far as I understand CF entry [0] is committed? I understand that
> there are some open followups, but I just want to determine correct
> CF item status...

So much work has happened on this thread with things that have been
committed, so switching the entry to committed makes sense to me.  I
have just done that.

Bharath, could you create a new thread with the new things you are
proposing?  All that should be v18 work, particularly v27-0002:
https://www.postgresql.org/message-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-JVTvUKD_EA@mail.gmail.com
--
Michael

Attachment

Re: Improve WALRead() to suck data directly from WAL buffers when possible

From
Bharath Rupireddy
Date:
On Thu, Apr 11, 2024 at 6:31 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:
> > As far as I understand CF entry [0] is committed? I understand that
> > there are some open followups, but I just want to determine correct
> > CF item status...
>
> So much work has happened on this thread with things that have been
> committed, so switching the entry to committed makes sense to me.  I
> have just done that.
>
> Bharath, could you create a new thread with the new things you are
> proposing?  All that should be v18 work, particularly v27-0002:
> https://www.postgresql.org/message-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-JVTvUKD_EA@mail.gmail.com

Thanks. I started a new thread
https://www.postgresql.org/message-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, Apr 24, 2024 at 09:46:20PM +0530, Bharath Rupireddy wrote:
> Thanks. I started a new thread
> https://www.postgresql.org/message-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A%40mail.gmail.com.

Cool, thanks.
--
Michael

Attachment