Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: WIP: WAL prefetch (another approach)
Msg-id CA+hUKG+2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw@mail.gmail.com
In response to Re: WIP: WAL prefetch (another approach)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: WIP: WAL prefetch (another approach)  (Thomas Munro <thomas.munro@gmail.com>)
Re: WIP: WAL prefetch (another approach)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> OK, thanks for looking into this. I guess I'll wait for an updated patch
> before testing this further. The storage has limited capacity so I'd
> have to either reduce the amount of data/WAL or juggle with the WAL
> segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my tests.

The main change I have been working on is that there is now just a
single XLogReaderState, so no more double-reading and double-decoding
of the WAL.  It provides XLogReadRecord(), as before, but now you can
also read further ahead with XLogReadAhead().  The user interface is
much like before, except that the GUCs changed a bit.  They are now:

  recovery_prefetch=on
  recovery_prefetch_fpw=off
  wal_decode_buffer_size=256kB
  maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and
wal_decode_buffer_size much higher than those defaults.
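
To picture the single-reader arrangement described above, here is a toy
model, not code from the patch.  One reader owns a small queue of decoded
records; a read-ahead call decodes the next record into the queue without
consuming it, and the read call returns the oldest decoded record, decoding
on demand if nothing is queued.  All names here (ToyReader, toy_read_ahead,
toy_read_record) are hypothetical, and the fixed-size queue is only a
stand-in for wal_decode_buffer_size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_SIZE 4        /* stand-in for wal_decode_buffer_size */

typedef struct ToyRecord
{
    int lsn;                    /* position of the record in the toy WAL */
    int block;                  /* block number the record touches */
} ToyRecord;

typedef struct ToyReader
{
    const int *wal;             /* toy WAL: one block number per record */
    int nrecords;
    int next_decode;            /* next record to decode */
    int ndecoded;               /* total decode operations performed */
    ToyRecord queue[TOY_QUEUE_SIZE];
    int head;                   /* index of the oldest queued record */
    int count;                  /* number of queued records */
} ToyReader;

void toy_reader_init(ToyReader *r, const int *wal, int nrecords)
{
    r->wal = wal;
    r->nrecords = nrecords;
    r->next_decode = 0;
    r->ndecoded = 0;
    r->head = 0;
    r->count = 0;
}

/*
 * Decode one more record into the queue without consuming it.  Returns
 * NULL if the queue is full or the toy WAL is exhausted.
 */
const ToyRecord *toy_read_ahead(ToyReader *r)
{
    ToyRecord *rec;

    if (r->count == TOY_QUEUE_SIZE || r->next_decode == r->nrecords)
        return NULL;
    rec = &r->queue[(r->head + r->count) % TOY_QUEUE_SIZE];
    rec->lsn = r->next_decode;
    rec->block = r->wal[r->next_decode];
    r->next_decode++;
    r->ndecoded++;
    r->count++;
    return rec;
}

/*
 * Copy the oldest decoded record into *out, decoding on demand if nothing
 * is queued.  Returns 1 on success, 0 at the end of the toy WAL.
 */
int toy_read_record(ToyReader *r, ToyRecord *out)
{
    if (r->count == 0 && toy_read_ahead(r) == NULL)
        return 0;
    assert(r->count > 0);
    *out = r->queue[r->head];
    r->head = (r->head + 1) % TOY_QUEUE_SIZE;
    r->count--;
    return 1;
}
```

In this model a prefetcher would call toy_read_ahead() until it returns
NULL, look at each record's block number, and start I/O for it; replay then
calls toy_read_record() and receives records that were decoded exactly once.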

There are a few TODOs and questions remaining.  One issue I'm
wondering about is whether it is OK that bulky FPI data is now
memcpy'd into the decode buffer, whereas before we avoided that
sometimes, when it didn't happen to cross a page boundary; I have some
ideas on how to do better (basically two levels of ring buffer) but I
haven't looked into that yet.  Another issue is the new 'nowait' API
for the page-read callback; I'm trying to figure out if that is
sufficient, or something more sophisticated including perhaps a
different return value is required.  Another thing I'm wondering about
is whether I have timeline changes adequately handled.

This design opens up a lot of possibilities for future performance
improvements.  Some examples:

1.  By adding some workspace to decoded records, the prefetcher can
leave breadcrumbs for XLogReadBufferForRedoExtended(), so that it
usually avoids the need for a second buffer mapping table lookup.
Incidentally this also skips the hot smgropen() calls that Jakub
complained about.  I have added an experimental patch like that,
but I need to look into the interlocking some more.

2.  By inspecting future records in the record->next chain, a redo
function could merge work in various ways in quite a simple and
localised way.  A couple of examples:
2.1.  If there is a sequence of records of the same type touching the
same page, you could process all of them while you have the page lock.
2.2.  If there is a sequence of relation extensions (say, a sequence
of multi-tuple inserts to the end of a relation, as commonly seen in
bulk data loads) then instead of generating many pwrite(8KB of
zeroes) syscalls record-by-record to extend the relation, a single
posix_fallocate(1MB) could extend the file in one shot.  Assuming the
bgwriter is running and doing a good job, this would remove most of
the system calls from bulk-load-recovery.

3.  More sophisticated analysis could find records to merge that are a
bit further apart, under carefully controlled conditions; for example
if you have a sequence like heap-insert, btree-insert, heap-insert,
btree-insert, ... then a simple next-record system like the one in 2 won't see
the opportunities, but something a teensy bit smarter could.

4.  Since the decoding buffer can be placed in shared memory (decoded
records contain pointers, but they don't point to any other memory
region, with the exception of clearly marked oversized records), we
could begin to contemplate handing work off to other processes, given
a clever dependency analysis scheme and some more infrastructure.
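
The system call arithmetic behind point 2.2 can be sketched as follows.
This is only an illustration of the two ways to extend a file, not how the
patch or PostgreSQL's storage layer does it; the function names and
TOY_BLCKSZ (standing in for the default 8KB block size) are made up.  Each
function returns the number of system calls it issued, so 128 blocks (1MB)
cost 128 calls one way and a single call the other.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_BLCKSZ 8192         /* stand-in for the 8KB block size */

/* Current size of the file in bytes, or -1 on error. */
off_t toy_file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}

/*
 * Extend the file one block at a time by writing zeroes, as if replaying
 * one extension per record.  Returns the number of system calls issued,
 * or -1 on error.
 */
int toy_extend_block_by_block(int fd, off_t start, int nblocks)
{
    char zeroes[TOY_BLCKSZ];
    int i;

    assert(nblocks > 0);
    memset(zeroes, 0, sizeof(zeroes));
    for (i = 0; i < nblocks; i++)
    {
        off_t off = start + (off_t) i * TOY_BLCKSZ;

        if (pwrite(fd, zeroes, sizeof(zeroes), off) != (ssize_t) sizeof(zeroes))
            return -1;
    }
    return nblocks;
}

/*
 * Extend the file by the same amount with a single call.  Note that
 * posix_fallocate() returns an error number directly rather than setting
 * errno.  Returns the number of system calls issued, or -1 on error.
 */
int toy_extend_in_one_shot(int fd, off_t start, int nblocks)
{
    assert(nblocks > 0);
    if (posix_fallocate(fd, start, (off_t) nblocks * TOY_BLCKSZ) != 0)
        return -1;
    return 1;
}
```

A redo function that can see a run of extension records coming would call
something like toy_extend_in_one_shot() once for the whole run instead of
toy_extend_block_by_block() once per record.  Whether the filesystem
allocates the range efficiently is a separate question this sketch does not
address.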

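The difference between points 2 and 3 can also be shown with a toy model of
the record->next chain.  Everything below is hypothetical (the ToyDecoded
struct, the two record types, the integer page identifier), and the safety
rule in the windowed scan is a deliberately crude placeholder for the
"carefully controlled conditions" mentioned in point 3: it skips records
that touch other pages and stops at the first record of a different type
that touches the same page.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ToyType { TOY_HEAP_INSERT, TOY_BTREE_INSERT } ToyType;

typedef struct ToyDecoded
{
    ToyType type;
    int page;                   /* hypothetical page identifier */
    struct ToyDecoded *next;    /* next decoded record, or NULL */
} ToyDecoded;

/* Link an array of records into a chain through their next pointers. */
void toy_link(ToyDecoded *recs, int n)
{
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        recs[i].next = (i + 1 < n) ? &recs[i + 1] : NULL;
}

/*
 * Point 2 style: count rec plus the immediately following records that
 * have the same type and touch the same page.
 */
int toy_adjacent_run(const ToyDecoded *rec)
{
    const ToyDecoded *p;
    int n = 1;

    for (p = rec->next;
         p != NULL && p->type == rec->type && p->page == rec->page;
         p = p->next)
        n++;
    return n;
}

/*
 * Point 3 style: look at up to 'window' following records.  Records that
 * touch other pages are skipped; a record of a different type on the same
 * page ends the scan, since this toy model treats reordering past it as
 * unsafe.
 */
int toy_windowed_run(const ToyDecoded *rec, int window)
{
    const ToyDecoded *p;
    int n = 1;

    for (p = rec->next; p != NULL && window > 0; p = p->next, window--)
    {
        if (p->page != rec->page)
            continue;
        if (p->type != rec->type)
            break;
        n++;
    }
    return n;
}
```

For an interleaved heap-insert, btree-insert sequence the adjacent scan
finds nothing to merge, while the windowed scan finds the later heap
inserts to the same page.  A real implementation would need a much more
careful notion of which records may be reordered.
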
Attachment
