Re: patch to allow disable of WAL recycling - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: patch to allow disable of WAL recycling
Date:
Msg-id: CAEepm=2QXmF9xDmGDyMtoEeTEH6=jcf=b8--yLzdeVzBfVLVuA@mail.gmail.com
In response to: Re: patch to allow disable of WAL recycling (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> zfs (Linux)
> -----------
> On scale 200, there's pretty much no difference.

Speculation: It could be that the dnode and/or indirect blocks that point to data blocks are falling out of memory in my test setup[1] but not in yours.  I don't know, but I guess those blocks compete with regular data blocks in the ARC?  If so, it might come down to ARC size and the amount of other data churning through it.
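
If someone wants to poke at that theory, here's a rough sketch (assuming ZFS-on-Linux, which exposes ARC counters in /proc/spl/kstat/zfs/arcstats; the counter names vary a bit between ZFS versions) that prints the demand data vs. demand metadata hit/miss counters.  Watching demand_metadata_misses climb during the benchmark would be consistent with dnodes/indirect blocks being evicted:

/*
 * arc_meta.c: print a few ZFS ARC counters from the ZFS-on-Linux kstat
 * file, to see how demand reads for metadata (dnodes, indirect blocks)
 * and for regular data are hitting or missing the ARC.  Counter names
 * vary across ZFS versions, so unknown names are simply skipped.
 *
 * Build: cc -o arc_meta arc_meta.c
 */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *wanted[] = {
        "size",                     /* current ARC size, in bytes */
        "c_max",                    /* configured ARC limit */
        "demand_data_hits",
        "demand_data_misses",
        "demand_metadata_hits",
        "demand_metadata_misses",
        NULL
    };
    FILE *f = fopen("/proc/spl/kstat/zfs/arcstats", "r");
    char line[256];

    if (f == NULL)
    {
        perror("arcstats");
        return 1;
    }

    /* Data lines look like "<name> <type> <value>"; header lines won't parse. */
    while (fgets(line, sizeof(line), f) != NULL)
    {
        char name[128];
        unsigned long long type, value;

        if (sscanf(line, "%127s %llu %llu", name, &type, &value) != 3)
            continue;

        for (int i = 0; wanted[i] != NULL; i++)
            if (strcmp(name, wanted[i]) == 0)
                printf("%-24s %llu\n", name, value);
    }

    fclose(f);
    return 0;
}

(arc_summary shows much the same thing in more detail, if it's installed.)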

Further speculation:  Other filesystems have equivalent data structures, but for example XFS jams that data into the inode itself in a compact "extent list" format[2] if it can, to avoid the need for an external btree.  Hmm, I wonder if that format tends to be used for our segment files.  Since cached inodes are reclaimed in a different way than cached data pages, I wonder if that makes them more sticky in the face of high data churn rates (or I guess less, depending on your Linux vfs_cache_pressure setting and number of active files).  I suppose the combination of those two things, sticky inodes with internalised extent lists, might make it more likely that we can overwrite an old file without having to fault anything in.
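
Here's a rough way to check the first part of that, assuming Linux and a filesystem that supports the FIEMAP ioctl (xfs_bmap or filefrag would tell you the same thing): count how many extents a WAL segment occupies.  Roughly speaking, XFS can only keep the extent list inline in the inode while the extent count stays small, with the exact threshold depending on inode size:

/*
 * wal_extents.c: count how many extents a file occupies, using the Linux
 * FIEMAP ioctl (the same interface filefrag uses).  With fm_extent_count
 * set to 0 the kernel only reports the number of extents, which is all
 * we need here.
 *
 * Build: cc -o wal_extents wal_extents.c
 * Usage: ./wal_extents pg_wal/<segment>
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int
main(int argc, char **argv)
{
    struct fiemap fm;
    int fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm.fm_extent_count = 0;             /* only count, don't return extents */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
    {
        perror("FS_IOC_FIEMAP");
        close(fd);
        return 1;
    }

    printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
    close(fd);
    return 0;
}

If 16MB segments usually come back as just one or two extents, the compact in-inode format seems plausible for them; lots of extents would point the other way.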

One big difference between your test rig and mine is that your Optane 900P claims to do about half a million random IOPS.  That is about half a million more IOPS than my spinning disks.  (Actually, I used my 5400RPM steam-powered machine deliberately for that test: I disabled fsync so that the commit rate wouldn't be slowed down but cache misses would be obvious.  I guess Joyent's storage is somewhere between these two extremes...)

> On scale 2000, the
> throughput actually decreased a bit, by about 5% - from the chart it
> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
> for some reason.

Huh.

> I have no idea what happened at the largest scale (8000) - on master
> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
> minutes (but not fully). Without WAL reuse there's no such drop,
> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers). I'm not sure what to
> think about this, I wonder if it might be caused by almost filling the
> disk space, or something like that. I'm rerunning this with scale 600.

There are lots of reports of ZFS performance degrading when free space gets below something like 20%.

[1] https://www.postgresql.org/message-id/CAEepm%3D2pypg3nGgBDYyG0wuCH%2BxTWsAJddvJUGBNsDiyMhcXaQ%40mail.gmail.com
[2] http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure/tmp/en-US/html/Data_Extents.html

--
Thomas Munro
http://www.enterprisedb.com
