Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Improve WALRead() to suck data directly from WAL buffers when possible
Msg-id CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com
Responses Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi,

WALRead() currently reads WAL from the WAL file on disk, which means
the walsenders serving streaming and logical replication (callers of
WALRead()) must go through the disk/OS page cache for every WAL read.
This can increase the total read IO across all walsenders, since one
typically maintains many standbys/subscribers on production servers
for high availability, disaster recovery, read replicas and so on. It
can also increase replication lag if all WAL reads hit the disk.

The WAL buffers may still contain the requested WAL; if so, WALRead()
can attempt to read from the WAL buffers first, before falling back to
the file. When the read hits the WAL buffers, the file read on disk is
avoided entirely, which mainly reduces read IO/read system calls. It
also enables other features proposed elsewhere [1].

I'm attaching a patch that implements the idea, which is also noted
elsewhere [2]. I've run some tests [3]. The WAL buffers hit ratio with
the patch stood at 95%; in other words, the walsenders avoided reading
from the file 95% of the time. Measured in terms of data volume, 79%
(13.5GB out of 17GB total) of the requested WAL was read from the WAL
buffers, as opposed to 21% from the file. Note that the hit ratio can
be much lower for write-heavy workloads, in which case file reads are
inevitable.

The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers, and does minimal work: it checks whether the requested WAL
page is present in the WAL buffers, and if so, copies the page and
releases the lock. I think taking just WALBufMappingLock is sufficient
here, as the concurrent writers depend on it to initialize and replace
pages in the WAL buffers.

I'll add this to the next commitfest.

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com

[2]
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

[3]
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900
--username=ubuntu postgres

PATCHED:
-[ RECORD 1 ]----------+----------------
application_name       | assb1
wal_read               | 31005
wal_read_bytes         | 3800607104
wal_read_time          | 779.402
wal_read_buffers       | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers  | 3033.309
sync_state             | async
-[ RECORD 2 ]----------+----------------
application_name       | ssb1
wal_read               | 31027
wal_read_bytes         | 3800932712
wal_read_time          | 696.365
wal_read_buffers       | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers  | 2989.507
sync_state             | sync

HEAD:
-[ RECORD 1 ]----+----------------
application_name | assb1
wal_read         | 705627
wal_read_bytes   | 18343480640
wal_read_time    | 7607.783
sync_state       | async
-[ RECORD 2 ]----+------------
application_name | ssb1
wal_read         | 705625
wal_read_bytes   | 18343480640
wal_read_time    | 4539.058
sync_state       | sync
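For reference, the hit ratios quoted earlier can be recomputed from the PATCHED assb1 record above (a standalone sketch, not part of the patch):

```c
/* Numbers taken from the assb1 record in [3]. */
const double file_reads = 31005,        buf_reads = 610611;
const double file_bytes = 3800607104.0, buf_bytes = 14493226440.0;

/* Fraction of reads (or bytes) served from the WAL buffers, in percent. */
double
hit_ratio(double hits, double misses)
{
    return 100.0 * hits / (hits + misses);
}

/* hit_ratio(buf_reads, file_reads) is ~95.2, i.e. the 95% hit ratio;
 * hit_ratio(buf_bytes, file_bytes) is ~79.2, i.e. the 79% by volume. */
```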

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

