Re: WAL prefetch - Mailing list pgsql-hackers

From:           Sean Chittenden
Subject:        Re: WAL prefetch
Msg-id:         20180709002659.pwkjqf2xn3o3gqij@joyent.com
In response to: Re: WAL prefetch (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses:      Re: WAL prefetch
List:           pgsql-hackers
> Without prefetching, it's ~70GB of WAL. With prefetching, it's only about
> 30GB. Considering the 1-hour test generates about 90GB of WAL, this means
> the replay speed grew from 20GB/h to almost 60GB/h. That's rather
> measurable improvement ;-)

Thank you everyone for this reasonably in-depth thread on prefaulting. Because this was a sprawling thread and I haven't been keeping up with the discussion until now, let me snag a bunch of points and address them here in one shot. I've attempted to answer the questions that came up during the thread, as well as provide some clarity where questions went unanswered. Apologies in advance for the length.

There are a few points I want to highlight regarding prefaulting, including when prefaulting is and isn't useful. But first, let me introduce three terms that help characterize the problem:

1. Hot read-modify-write ("RMW") - a PG page that is modified while the page is still contained within shared_buffers.

2. Warm RMW - a PG page that's in the filesystem cache but not present in shared_buffers.

3. Cold RMW - a PG page that is in neither PG's shared_buffers nor the OS's filesystem cache.

Prefaulting is only useful in addressing the third situation, the cold read-modify-write. For fast disks, systems that hold their entire dataset in RAM, or disk systems that can perform an RMW fast enough for the velocity of incoming writes, there is no benefit to prefaulting (this is why there is a high and a low watermark in pg_prefaulter; a sketch of that gate appears below). In these situations prefaulting would only add constant overhead, especially for DBs whose workload is ~100% Hot/Warm RMW. Primaries are almost always under the Hot RMW workload (cold restarts being the exception). The Warm RMW scenario could be addressed by prefaulting into shared_buffers, but I doubt there would be a significant performance benefit because the expense of PostgreSQL faulting from shared_buffers to the OS cache is small compared to a disk read. There is something to be gained in the Warm RMW case, but compared to Cold RMW that optimization is noise and best left for a future iteration.

The real importance of prefaulting becomes apparent in two situations:

1. Priming the OS's filesystem cache, notably after an OS restart. This is of value to every PostgreSQL instance, primary or follower. Reducing database startup/recovery times is very helpful, especially when recovering from an outage or after planned maintenance. Little in PostgreSQL administration is more infuriating than watching PostgreSQL recover while the CPU sits 100% idle and the disk IO system is nearly idle (especially during or after an outage).

2. When the following two environmental factors are true:

   a. the volume of writes to discrete pages is high, and

   b. the interval between subsequent writes to a single page is long enough that the page is evicted from both shared_buffers and the filesystem cache.

Write-heavy workloads tend to see this problem, especially if you're attempting to provide consistency in your application and do not read from the followers (which would prime their OS/shared_buffers caches). If the workload is continuous, the follower may never be able to overcome the write volume, and the database never catches up.

The pg_prefaulter was born out of the last workload: a write-heavy, 24/7 constant load with a large dataset.
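To make the watermark gate concrete, here's a minimal sketch in Go (pg_prefaulter's language). The thresholds match the low and high watermarks described later in this post (one WAL segment and 100 WAL segments, respectively), but the names and structure are my illustration, not the actual pg_prefaulter API:

    // Sketch of the watermark gate: below the low watermark the
    // apply process is keeping up (Hot/Warm RMW) and prefaulting
    // is pure overhead; the high watermark caps how far ahead of
    // the apply position we bother to read.
    package prefault

    const (
        walSegmentSize     = 16 << 20 // 16MB WAL segment
        lowWatermarkBytes  = 1 * walSegmentSize
        highWatermarkBytes = 100 * walSegmentSize // ~1.6GB
    )

    // shouldPrefault reports whether prefaulting is worthwhile
    // for the current volume of unapplied WAL.
    func shouldPrefault(unappliedWALBytes int64) bool {
        return unappliedWALBytes >= lowWatermarkBytes
    }

    // readAheadLimit caps how much WAL the prefaulter snoops
    // past the apply position.
    func readAheadLimit(unappliedWALBytes int64) int64 {
        if unappliedWALBytes > highWatermarkBytes {
            return highWatermarkBytes
        }
        return unappliedWALBytes
    }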
What pg_prefaulter does is read the blocks referenced in the WAL stream (i.e. PG heap pages) and load the referenced pages into the OS filesystem cache (via threaded calls to pread(2)). The WAL apply process then gets a cache hit because the filesystem cache was primed with the heap page before the apply process attempted its read-modify-write of the heap.

It is important to highlight why this matters: there is only one synchronous pread(2) call in flight at a time from the apply/recover/startup process, which effectively acts as the speed limit for PostgreSQL. The physics of many workloads are such that followers are unable to keep up and are thus destined to always fall behind (we've all seen this at some point, likely via apply lag from a VACUUM or pg_repack). The primary can schedule concurrent IO from multiple clients all making independent SELECTs. Contrast that with a replica, which has zero knowledge of the IOs the primary recently dispatched: all of its IO looks like random reads, and likely cache misses. In effect, the pg_prefaulter raises the speed limit of the WAL apply/recovery process by priming the filesystem cache, which it does by snooping on the WAL stream.

Because PostgreSQL's WAL apply and recovery process schedules only a single synchronous pread(2) syscall at a time, even with a RAID10 and a capable IO scheduler in the OS that can read from both halves of each mirror, you're only going to perform ~150-225 pread(2) calls per second. Despite the entire disk system being able to deliver something closer to 2400-3600 IOPS (~10ms random read == ~150-225 IOPS * 16 disks), you'll only observe ~6% utilization of the server's random-read IO capacity. When you realize that the size of the unapplied WAL represents a backlog of queued, unscheduled IOs, the phrase "suboptimal" doesn't begin to do reality justice. One or more of the following activities can demonstrate the problem:

* natural random-write workloads at high velocity
* VACUUM activity
* pg_repack
* slow IO subsystems on followers
* synchronous apply

Regarding the environment where pg_prefaulter was written: the servers were reasonably beefy (256GB RAM, 16x 10K SAS disks) and the database cluster was already in a scale-out configuration. Doubling the number of database servers would only spread the load out by 2x, and we'd still only be utilizing ~6% of the IO across the fleet. We needed ~100% IO utilization when followers were falling behind. In practice we are seeing orders-of-magnitude improvement in apply lag.

Other points worth mentioning:

* The checkpoint_interval was set to anywhere between 15s and 300s; it didn't matter. We did discover a new form of lag, however: checkpoint lag. Pages were being evicted from cache faster than checkpoints could execute, leading to unbounded checkpoints and WAL growth (i.e. writes were fast enough that the checkpointer was suffering from Cold RMW). iirc, pg_prefaulter reads in both WAL pages and WAL files that are about to be used in checkpoints (it's been a while since I wrote this code).

* The pg_prefaulter saw the best performance when we increased the number of IO workers to roughly the number of IO commands the OS could schedule and dispatch (i.e. 16 disks * 150 IOPS == ~2K).

* pg_prefaulter is very aggressive about not performing work twice or reading the same page multiple times. It uses a heap page cache to prevent redundant IOs for the same PG heap page (see the sketch after this list).
* pg_prefaulter also dedupes IO requests in case the same page is referenced twice in short succession due to data locality in the WAL stream. The workload was already under cache pressure; artificially promoting a page from the ARC MRU to the MFU would result in potentially useful records in the MFU being evicted from cache.

* During the design phase, I looked into using bgworkers, but given the number of in-flight pread(2) calls required to fully utilize the IO subsystem, I opted for something threaded (I was also confined to Solaris, which doesn't support posix_fadvise(2), so I couldn't sequentially dispatch async posix_fadvise(2) calls and hope for the best).

* In my testing I successfully used pgbench(1) to simulate the workload; increasing the checkpoint_interval and segments to a very high number was sufficient. I could see the improvement for cold starts even with SSDs, but I'm not sure how big the impact would be on NVMe.

* My slides are posted and have graphs of the before and after using the pg_prefaulter, but I'm happy to provide additional details or answer more Q's.

* It would be interesting to see if posix_fadvise(2) is actually harming performance - for example, by spinning off a throw-away implementation that uses aio or a pthread worker pool + pread(2) (sketches of both paths follow this list). I remember seeing some mailing list blurbs from Mozilla where they were attempting to use posix_fadvise(2) and were seeing a slowdown in performance on Linux (I believe this has since been fixed, but it wouldn't surprise me if there were still unintended consequences from this syscall).

* I have a port of pg_prefaulter that supports PostgreSQL >= 9 ~complete, but not pushed. I'll see if I can get to that this week. For "reasons" this isn't a high priority for me at the moment, but I'm happy to help out and see this move forward.

* Tomas, feel free to contact me offline to discuss why the pg_prefaulter isn't working for you. I have it running on Linux, FreeBSD, illumos, and macOS.

* In the graph from Tomas (nice work, btw), it's clear that the bandwidth is the same. The way we verified this earlier was to run ~10-15min traces and capture the file and offset of every read by PostgreSQL and pg_prefaulter. We saw pg_prefaulter's IOs be ~100% cache misses; for PostgreSQL, ~99% of its IO was cache hits. We also verified that pg_prefaulter wasn't doing any IO that wasn't eventually performed by PostgreSQL by comparing the IOs performed against each heap segment.

* Drop a VACUUM FULL FREEZE into any pgbench testing (or a pg_repack) and it's trivial to see the effects, even on SSD. Similarly, performing a fast shutdown of a replica and amassing a large backlog of unreceived, unapplied WAL pages is pretty demonstrative.

* "In this case I see that without prefetching, the replay process uses about 20% of a CPU. With prefetching increases this to ~60%, which is nice." With the pg_prefaulter, the IO should hit 100% utilization. Like I mentioned above, Tomas, I'd like to make sure you get this working so you can compare and improve as necessary. :~] I never got CPU utilization to 100%, but I did get disk IO utilization to 100%, and that to me was the definition of success. CPU utilization of the apply process could reach 100% with fast enough disks, but in production I didn't have anything that wasn't spinning rust.

* It looks like we're still trying to figure out the nature of this problem and the cost of the various approaches.
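To make the core mechanism concrete - many pread(2) calls in flight instead of one - here is a minimal Go sketch of a worker pool plus the dedupe cache described above. The names, channel plumbing, and error handling are my illustrative assumptions rather than the actual pg_prefaulter source; the load-bearing detail is that os.(*File).ReadAt is pread(2) under the hood, so each goroutine keeps an independent read in flight:

    // Illustrative sketch, not the real pg_prefaulter source: N
    // workers issue concurrent pread(2) calls (os.(*File).ReadAt)
    // to prime the filesystem cache, with a seen-set deduping
    // repeat requests for the same heap page.
    package prefault

    import (
        "os"
        "sync"
    )

    const blockSize = 8192 // PostgreSQL heap page size

    type ioRequest struct {
        path   string // heap segment file
        offset int64  // byte offset of the page within the segment
    }

    func prefaultPages(requests <-chan ioRequest, workers int) {
        var wg sync.WaitGroup
        var seen sync.Map // dedupe: pages already prefaulted

        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                buf := make([]byte, blockSize)
                for req := range requests {
                    if _, dup := seen.LoadOrStore(req, struct{}{}); dup {
                        continue // already primed; skip the IO
                    }
                    f, err := os.Open(req.path)
                    if err != nil {
                        continue // segment may have been recycled
                    }
                    // ReadAt is pread(2); its only purpose here is
                    // to pull the page into the OS cache, so the
                    // data itself is discarded.
                    f.ReadAt(buf, req.offset)
                    f.Close()
                }
            }()
        }
        wg.Wait()
    }

With the worker count sized to the IO subsystem (the 16 disks * 150 IOPS == ~2K figure above), the disk queue stays full instead of draining one synchronous read at a time.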
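And for the posix_fadvise(2) question raised in the list above, a throw-away fadvise-based path to compare against the pread(2) pool might look like the following, using golang.org/x/sys/unix (Linux/FreeBSD only; as noted, Solaris lacks the syscall). Again, a hypothetical sketch for comparison, not anything pg_prefaulter ships:

    // Hypothetical throw-away variant: hint the kernel to read a
    // heap page ahead of the WAL apply process via
    // posix_fadvise(POSIX_FADV_WILLNEED), without copying the
    // data to userland.  Requires golang.org/x/sys/unix.
    package prefault

    import (
        "os"

        "golang.org/x/sys/unix"
    )

    const pageSize = 8192 // PostgreSQL heap page size

    func adviseWillNeed(path string, offset int64) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        return unix.Fadvise(int(f.Fd()), offset, pageSize, unix.FADV_WILLNEED)
    }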
From a rapid prototyping perspective, feel free to suggest changes to the Go pg_prefaulter and toss the various experiments behind a feature flag.

* "> But it is implemented in GO and using pg_waldump. Yeah, that's not too good if we want it in core." I fail to see the problem with a side-car in Go. *checks calendar* :~]

* In pg_prefaulter all IOs are converted into full-page reads.

* pg_prefaulter will not activate if the number of unapplied WAL pages is less than the size of one WAL segment (i.e. 16MB). This could be tuned further, but this low watermark seems to work well.

* pg_prefaulter won't read ahead more than 100 WAL segments into the future. I made the unverified assumption that PostgreSQL could not process more than 1.6GB of WAL (~1/2 of the memory bandwidth available to a single process) in less than the rotational latency of a random IO (i.e. ~10ms), and that pg_prefaulter could therefore in theory stay ahead of PG. PostgreSQL normally overtakes pg_prefaulter's ability to fault in random pages due to disk IO limitations (PostgreSQL's cache hit rate is ~99.9%, not 100%, for this very reason). In practice this has worked out, but I'm sure there's room for improvement in setting the high watermark and reducing this value. #yoloconstants

* I contemplated not reading in FPWs (full-page writes), but this would have been detrimental on ZFS because ZFS is a copy-on-write filesystem (vs a block filesystem). For ZFS we use a 16K record size, compressed down to ~8K, so we have to read the entire record in before we can modify half of the page. I suspect eliding prefaulting of FPWs will be a performance loss for nearly all hardware.

* If there is sufficient interest in these experiences, contact me offline (or via PostgreSQL Slack) and I can set up a call to answer questions in a higher-bandwidth setting such as Zoom or Google Hangouts.

I'm sorry for being late to the reply party; I've been watching posts in this thread accumulate for a while and haven't had time to respond until now. Cheers. -sc

--
Sean Chittenden