Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: Improve WALRead() to suck data directly from WAL buffers when possible |
Date | |
Msg-id | CALj2ACXUbvON86vgwTkum8ab3bf1=HkMxQ5hZJZS3ZcJn8NEXQ@mail.gmail.com Whole thread Raw |
In response to | Re: Improve WALRead() to suck data directly from WAL buffers when possible (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: Improve WALRead() to suck data directly from WAL buffers when possible
|
List | pgsql-hackers |
On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Thanks for providing thoughts. > At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > > The patch introduces concurrent readers for the WAL buffers, so far > > only there are concurrent writers. In the patch, WALRead() takes just > > one lock (WALBufMappingLock) in shared mode to enable concurrent > > readers and does minimal things - checks if the requested WAL page is > > present in WAL buffers, if so, copies the page and releases the lock. > > I think taking just WALBufMappingLock is enough here as the concurrent > > writers depend on it to initialize and replace a page in WAL buffers. > > > > I'll add this to the next commitfest. > > > > Thoughts? > > This adds copying of the whole page (at least) at every WAL *record* > read, In the worst case yes, but that may not always be true. On a typical production server with decent write traffic, it happens that the callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or MAX_SEND_SIZE bytes. > fighting all WAL writers by taking WALBufMappingLock on a very > busy page while the copying. I'm a bit doubtful that it results in an > overall improvement. Well, the tests don't reflect that [1], I've run an insert work load [2]. The WAL is being read from WAL buffers 99% of the time, which is pretty cool. If you have any use-cases in mind, please share them and/or feel free to run at your end. > It seems to me almost all pread()s here happens > on file buffer so it is unclear to me that copying a whole WAL page > (then copying the target record again) wins over a pread() call that > copies only the record to read. That's not always guaranteed. Imagine a typical production server with decent write traffic and heavy analytical queries (which fills OS page cache with the table pages accessed for the queries), the WAL pread() calls turn to IOPS. Despite the WAL being present in WAL buffers, customers will be paying unnecessarily for these IOPS too. With the patch, we are basically avoiding the pread() system calls which may turn into IOPS on production servers (99% of the time for the insert use case [1][2], 95% of the time for pgbench use case specified upthread). With the patch, WAL buffers can act as L1 cache, if one calls OS page cache as L2 cache (of course this illustration is not related to the typical processor L1 and L2 ... caches). > Do you have an actual number of how > frequent WAL reads go to disk, or the actual number of performance > gain or real I/O reduction this patch offers? It might be a bit tough to generate such heavy traffic. An idea is to ensure the WAL page/file goes out of the OS page cache before WALRead() - these might help here - 0002 patch from https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com and tool https://github.com/klando/pgfincore. > This patch copies the bleeding edge WAL page without recording the > (next) insertion point nor checking whether all in-progress insertion > behind the target LSN have finished. Thus the copied page may have > holes. That being said, the sequential-reading nature and the fact > that WAL buffers are zero-initialized may make it work for recovery, > but I don't think this also works for replication. WALRead() callers are smart enough to take the flushed bytes only. Although they read the whole WAL page, they calculate the valid bytes. > I remember that the one of the advantage of reading the on-memory WAL > records is that that allows walsender to presend the unwritten > records. So perhaps we should manage how far the buffer is filled with > valid content (or how far we can presend) in this feature. Yes, the non-flushed WAL can be read and sent across if one wishes to to make replication faster and parallel flushing on primary and standbys at the cost of a bit of extra crash handling, that's mentioned here https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com. However, this can be a separate discussion. I also want to reiterate that the patch implemented a TODO item: * XXX probably this should be improved to suck data directly from the * WAL buffers when possible. */ bool WALRead(XLogReaderState *state, [1] PATCHED: 1 1470.329907 2 1437.096329 4 2966.096948 8 5978.441040 16 11405.538255 32 22933.546058 64 43341.870038 128 73623.837068 256 104754.248661 512 115746.359530 768 106106.691455 1024 91900.079086 2048 84134.278589 4096 62580.875507 -[ RECORD 1 ]----------+----------- application_name | assb1 sent_lsn | 0/1B8106A8 write_lsn | 0/1B8106A8 flush_lsn | 0/1B8106A8 replay_lsn | 0/1B8106A8 write_lag | flush_lag | replay_lag | wal_read | 104 wal_read_bytes | 10733008 wal_read_time | 1.845 wal_read_buffers | 76662 wal_read_bytes_buffers | 383598808 wal_read_time_buffers | 205.418 sync_state | async HEAD: 1 1312.054496 2 1449.429321 4 2717.496207 8 5913.361540 16 10762.978907 32 19653.449728 64 41086.124269 128 68548.061171 256 104468.415361 512 114328.943598 768 91751.279309 1024 96403.736757 2048 82155.140270 4096 66160.659511 -[ RECORD 1 ]----+----------- application_name | assb1 sent_lsn | 0/1AB5BCB8 write_lsn | 0/1AB5BCB8 flush_lsn | 0/1AB5BCB8 replay_lsn | 0/1AB5BCB8 write_lag | flush_lag | replay_lag | wal_read | 71967 wal_read_bytes | 381009080 wal_read_time | 243.616 sync_state | async [2] Test details: ./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j 8 install > install.log 2>&1 & 1 primary, 1 async standby cd inst/bin ./pg_ctl -D data -l logfile stop ./pg_ctl -D assbdata -l logfile1 stop rm -rf data assbdata rm logfile logfile1 free -m sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches' free -m ./initdb -D data rm -rf /home/ubuntu/archived_wal mkdir /home/ubuntu/archived_wal cat << EOF >> data/postgresql.conf shared_buffers = '8GB' wal_buffers = '1GB' max_wal_size = '16GB' max_connections = '5000' archive_mode = 'on' archive_command='cp %p /home/ubuntu/archived_wal/%f' track_wal_io_timing = 'on' EOF ./pg_ctl -D data -l logfile start ./psql -c "select pg_create_physical_replication_slot('assb1_repl_slot', true, false)" postgres ./pg_ctl -D data -l logfile restart ./pg_basebackup -D assbdata ./pg_ctl -D data -l logfile stop cat << EOF >> assbdata/postgresql.conf port=5433 primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu application_name=assb1' primary_slot_name='assb1_repl_slot' restore_command='cp /home/ubuntu/archived_wal/%f %p' EOF touch assbdata/standby.signal ./pg_ctl -D data -l logfile start ./pg_ctl -D assbdata -l logfile1 start ./pgbench -i -s 1 -d postgres ./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;" cat << EOF >> insert.sql \set aid random(1, 10 * :scale) \set delta random(1, 100000 * :scale) INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta); EOF ulimit -S -n 5000 for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: