Re: patch to allow disable of WAL recycling - Mailing list pgsql-hackers

From Jerry Jelinek
Subject Re: patch to allow disable of WAL recycling
Msg-id CACPQ5FrufdyMNB8a7qPxejzm+XqJfDdkUhHCDCu3XaUdKxd9Qw@mail.gmail.com
In response to patch to allow disable of WAL recycling  (Jerry Jelinek <jerry.jelinek@joyent.com>)
List pgsql-hackers
I was asked to perform two different tests:
1) A benchmarksql run with WAL recycling on and then off, for comparison
2) A test when the filesystem fills up

For #1, I did two 15-minute benchmarksql runs; here are the results.
wal_recycle=on
--------------
Term-00, Running Average tpmTOTAL: 299.84    Current tpmTOTAL: 29412    Memory U
14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) = 136.49
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.78
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session Start     = 2018-07-12 14:34:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session End       = 2018-07-12 14:49:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

wal_recycle=off
---------------
Term-00, Running Average tpmTOTAL: 299.85    Current tpmTOTAL: 29520    Memory U
15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) = 135.89
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.79
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session Start     = 2018-07-12 14:55:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session End       = 2018-07-12 15:10:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

As can be seen, disabling WAL recycling causes no performance regression for this workload (tpmTOTAL 299.79 vs. 299.78).

For #2, I ran the test with WAL recycling on (the current behavior, and still the default with this patch), since postgres's behavior when the filesystem fills up is orthogonal to WAL recycling.

I capped the filesystem at 32MB of free space. I set up a configuration with wal_keep_segments=50 and started a long benchmarksql run; 4 WAL files already existed when the run started.
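
For reference, here is a minimal sketch of the settings for this test. Only wal_keep_segments and the wal_recycle GUC name come from this thread; the remaining values are illustrative assumptions, not the exact configuration used.

# postgresql.conf excerpt (sketch; values other than wal_keep_segments
# are assumptions for illustration)
wal_keep_segments = 50      # retain up to 50 old segments in pg_wal
wal_recycle = on            # GUC added by the patch; on = current behavior
max_wal_size = 1GB          # assumed, not specified in the test setup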

As the filesystem fills up, postgres gets slower and slower, as would be expected. This is due to the COW nature of the filesystem and the fact that every write has to find free space.

When a new WAL file is created, it consumes essentially no space: the file is zero-filled, so apart from a little metadata no filesystem space is used. Space is consumed only as writes land in the WAL file, and eventually all space in the filesystem was used up. I could not tell whether the final failure happened on a write to an existing WAL file or on a write to the database itself. As other people have observed, WAL file creation is not the problematic operation on a COW filesystem when it fills up; it is the writes to existing files that fail. When postgres core dumped there were 6 WAL files in the pg_wal directory (well short of the 50 configured).
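
To make the failure mode concrete, here is a small standalone C sketch (not postgres code; the path and buffer size are arbitrary) showing that creating/opening a file on a nearly full filesystem still succeeds, while the writes into it eventually fail with ENOSPC:

/*
 * Standalone sketch: on a full (COW) filesystem the failing operation
 * is the write into an existing file, reported as ENOSPC; creating the
 * file itself typically still succeeds.  Not postgres code.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    int  fd = open("/full-fs/testfile", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
    {
        perror("open");             /* creation is rarely what fails */
        return 1;
    }

    memset(buf, 'x', sizeof(buf));
    for (;;)
    {
        if (write(fd, buf, sizeof(buf)) < 0)
        {
            if (errno == ENOSPC)
                fprintf(stderr, "write failed: No space left on device\n");
            else
                perror("write");
            break;
        }
    }
    close(fd);
    return 0;
}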

When the filesystem filled up, postgres core dumped and benchmarksql emitted a bunch of Java debug output, which I can provide if anyone is interested.

Here is some information from the postgres core dump. It looks like postgres aborted itself, but since the filesystem is full, there is nothing in the log file.
> ::status
debugging core file of postgres (64-bit) from f6c22f98-38aa-eb51-80d2-811ed25bed6b
file: /zones/f6c22f98-38aa-eb51-80d2-811ed25bed6b/local/pgsql/bin/postgres
initial argv: /usr/local/pgsql/bin/postgres -D /home/postgres/data
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=76019 uid=1001 code=-1
> $C
fffff9ffffdfa4b0 libc.so.1`_lwp_kill+0xa()
fffff9ffffdfa4e0 libc.so.1`raise+0x20(6)
fffff9ffffdfa530 libc.so.1`abort+0x98()
fffff9ffffdfa560 errfinish+0x230()
fffff9ffffdfa5e0 XLogWrite+0x294()
fffff9ffffdfa610 XLogBackgroundFlush+0x18d()
fffff9ffffdfaa50 WalWriterMain+0x1a8()
fffff9ffffdfaab0 AuxiliaryProcessMain+0x3ff()
fffff9ffffdfab40 0x7b5566()
fffff9ffffdfab90 reaper+0x60a()
fffff9ffffdfaba0 libc.so.1`__sighndlr+6()
fffff9ffffdfac30 libc.so.1`call_user_handler+0x1db(12, 0, fffff9ffffdfaca0)
fffff9ffffdfac80 libc.so.1`sigacthandler+0x116(12, 0, fffff9ffffdfaca0)
fffff9ffffdfb0f0 libc.so.1`__pollsys+0xa()
fffff9ffffdfb220 libc.so.1`pselect+0x26b(7, fffff9ffffdfdad0, 0, 0, fffff9ffffdfb230, 0)
fffff9ffffdfb270 libc.so.1`select+0x5a(7, fffff9ffffdfdad0, 0, 0, fffff9ffffdfb6c0)
fffff9ffffdffb00 ServerLoop+0x289()
fffff9ffffdffb70 PostmasterMain+0xcfa()
fffff9ffffdffba0 main+0x3cd()
fffff9ffffdffbd0 _start_crt+0x83()
fffff9ffffdffbe0 _start+0x18()
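
For context on the stack above: a failed WAL write is treated as a PANIC-level error, and a PANIC reported through errfinish() ends in abort(), which is why the process dies with SIGABRT under XLogWrite/errfinish and why, with the filesystem full, nothing makes it into the log file. A standalone sketch of that shape (not the actual postgres code) would be:

/*
 * Standalone sketch (not postgres code) of the pattern behind the
 * stack above: a PANIC on a failed WAL write is reported and the
 * process then calls abort(), producing the SIGABRT core.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
panic_on_write_failure(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t) len)
    {
        /*
         * postgres would go through ereport(PANIC, ...)/errfinish()
         * here; with the filesystem full there is no room to log it.
         */
        fprintf(stderr, "PANIC: could not write to log file: %s\n",
                strerror(errno ? errno : ENOSPC));
        abort();                /* -> SIGABRT, as in the core above */
    }
}

int
main(void)
{
    char buf[8192];

    memset(buf, 0, sizeof(buf));
    /* use an invalid fd to force the failure path for this demo */
    panic_on_write_failure(-1, buf, sizeof(buf));
    return 0;
}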

Let me know if there is any other information I could provide.

Thanks,
Jerry


On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com> wrote:
Hello All,

Attached is a patch to provide an option to disable WAL recycling. We have found that this can help performance by eliminating read-modify-write behavior on old WAL files that are no longer resident in the filesystem cache. There is a lot more detail on the background and motivation for this in the following thread.


A similar change has been tested against our 9.6 branch that we're currently running, but the attached patch is against master.

Thanks,
Jerry

