Re: [PERFORM] Questionaire: Common WAL write rates on busy servers. - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.
Date
Msg-id 5b03c0a4-1c68-1fe4-fb18-96328b6c3566@2ndquadrant.com
Whole thread Raw
In response to [PERFORM] Questionaire: Common WAL write rates on busy servers.  (Andres Freund <andres@anarazel.de>)
List pgsql-performance
Hi,

On 04/25/2017 06:17 AM, Andres Freund wrote:
> Hi,
>
> I've lately seen more and more installations where the generation of
> write-ahead-log (WAL) is one of the primary bottlenecks. I'm curious
> whether that's primarily a "sampling error" of mine, or whether
> that's indeed more common.
>

I see those cases too. To some degree it's a sampling bias. People
generally don't call us to look at the 99% of systems that perform fine,
so we tend to see the the 1% of systems under pressure.

That doesn't make that observation irrelevant, though. Those demanding
systems are one of the things that pushes us forward.

 >
> The primary reason I'm curious is that I'm pondering a few potential
> optimizations, and would like to have some guidance which are more
> and which are less important.
>

I think any optimization you do will improve at least some of those busy
systems.

> Questions (answer as many you can comfortably answer):
> - How many MB/s, segments/s do you see on busier servers?

That depends on the cause (see the next point).

> - What generates the bulk of WAL on your servers (9.5+ can use
>   pg_xlogdump --stats to compute that)?

a) systems doing large batches
    - bulk loads/updates/deletes, one or few sessions doing a lot
    - easily high hundreds of MB/s (on a separate device)

b) OLTP systems doing a lot of tiny/small transactions
    - many concurrent sessions
    - often end up much more limited by WAL, due to locking etc.
    - often the trouble is random updates all over the place, causing
      amplification due to FPI (PK on UUID is a great way to cause this
      unnecessarily even on apps with naturally tiny working set)

> - Are you seeing WAL writes being a bottleneck?OA

On the write-intensive systems, yes. Often the CPUs are waiting for WAL
I/O to complete during COMMIT.

> - What kind of backup methods are you using and is the WAL volume a
>   problem?

The large and busy systems can easily produce so much WAL, that the
basebackup is not the largest part of the backup. That is somewhat
solvable by using other means of obtaining the basebackup snapshot (e.g.
by taking some sort of filesystem / SAN / ... snapshot). That reduces
the amount of WAL needed to make the basebackup consistent, but it
doesn't solve the WAL archiving issue.

> - What kind of replication are you using and is the WAL volume a
>   problem?

Generally streaming replication, and yes, the amount of WAL may be an
issue, partially because the standby is a single-process thing. And as
it has to process something generated by N sessions on the primary, that
can't end well.

Interestingly enough, FPIs can actually make it way faster, because the
standby does not need to read the data from disk during recovery.

> - What are your settings for wal_compression, max_wal_size (9.5+) /
>   checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?

I'd say the usual practice is to tune for timed checkpoints, say 30+
minutes apart (or more). wal_compression is typically 'off' (i.e. the
default value).

> - Could you quickly describe your workload?

Pretty much a little bit of everything, depending on the customer.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-performance by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: [PERFORM] Performance issue in PostgreSQL server...
Next
From: Tomas Vondra
Date:
Subject: Re: [PERFORM] Questionaire: Common WAL write rates on busy servers.