wal_level=archive gives better performance than minimal - why?

From: Tomas Vondra
Hi all,

I've run a series of pgbench benchmarks with the aim to see the effect
of moving the WAL logs to a separate drive, and one thing that really
surprised me is that the archive log level seems to give much better
performance than minimal log level.

On spinning drives this is not noticeable, but on SSDs it's quite clear.
See for example this:

  http://www.fuzzy.cz/tmp/tps-rw-minimal.png
  http://www.fuzzy.cz/tmp/tps-rw-archive.png

The minimal log level gives about 1600 tps all the time, while archive
log level gives about the same performance at the start but then it
continuously increases up to about 2000 tps.

This seems very suspicious, because AFAIK the wal level should not
really matter for pgbench and if it does I'd expect exactly the opposite
behaviour (i.e. 'archive' performing worse than 'minimal').

This was run on 9.1.2 with two SSDs (Intel 320) and EXT4, but I do see
exactly the same behaviour with a single SSD drive.

The config files are here (the only difference is the wal_level line at
the very end)

  http://www.fuzzy.cz/tmp/postgresql-minimal.conf
  http://www.fuzzy.cz/tmp/postgresql-archive.conf

pgbench results and logs are here:

  http://www.fuzzy.cz/tmp/pgbench.minimal.log.gz
  http://www.fuzzy.cz/tmp/pgbench.archive.log.gz

  http://www.fuzzy.cz/tmp/results.minimal.log
  http://www.fuzzy.cz/tmp/results.archive.log

I do plan to rerun the whole benchmark, but is there any reasonable
explanation or something that might cause such behaviour?

kind regards
Tomas

Re: wal_level=archive gives better performance than minimal - why?

From: Greg Smith
On 01/12/2012 06:17 PM, Tomas Vondra wrote:
> I've run a series of pgbench benchmarks with the aim to see the effect
> of moving the WAL logs to a separate drive, and one thing that really
> surprised me is that the archive log level seems to give much better
> performance than minimal log level.

How repeatable is this?  If you always run minimal first and then
archive, that might be the actual cause of the difference.  In this
situation I would normally run this 12 times, with this sort of pattern:

minimal
minimal
minimal
archive
archive
archive
minimal
minimal
minimal
archive
archive
archive

To make sure the difference wasn't some variation on "gets slower after
each run".  pgbench suffers a lot from problems in that class.
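The alternating pattern could be generated by a small driver script rather than typed by hand. A sketch in Python (the function name and parameters here are illustrative, not anything from pgbench itself):

```python
# Build an interleaved benchmark schedule so that neither wal_level
# always runs first; hypothetical helper, not part of pgbench.
def build_schedule(levels, group_size, rounds):
    """Repeat each level group_size times, cycling through the levels
    for the given number of rounds."""
    schedule = []
    for _ in range(rounds):
        for level in levels:
            schedule.extend([level] * group_size)
    return schedule

# Two rounds of (3x minimal, 3x archive) gives the 12-run pattern above.
schedule = build_schedule(["minimal", "archive"], group_size=3, rounds=2)
print(schedule)
```

Interleaving the groups this way separates a genuine wal_level effect from any drift that accumulates across consecutive runs.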

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: wal_level=archive gives better performance than minimal - why?

From: Tomas Vondra
On 16.1.2012 23:35, Greg Smith wrote:
> On 01/12/2012 06:17 PM, Tomas Vondra wrote:
>> I've run a series of pgbench benchmarks with the aim to see the effect
>> of moving the WAL logs to a separate drive, and one thing that really
>> surprised me is that the archive log level seems to give much better
>> performance than minimal log level.
>
> How repeatable is this?  If you always run minimal first and then
> archive, that might be the actual cause of the difference.  In this
> situation I would normally run this 12 times, with this sort of pattern:
>
> minimal
> minimal
> minimal
> archive
> archive
> archive
> minimal
> minimal
> minimal
> archive
> archive
> archive
>
> To make sure the difference wasn't some variation on "gets slower after
> each run".  pgbench suffers a lot from problems in that class.

AFAIK it's quite repeatable - the primary goal of the benchmark was to
see the benefit of moving the WAL to a separate device (with various WAL
levels and device types - SSD and HDD).

I plan to rerun the whole thing this week with a bit more details logged
to rule out basic configuration mistakes etc.

Each run is completely separate (rebuilt from scratch) and takes about 1
hour to complete. Each pgbench run consists of these steps:

  1) rebuild the data from scratch
  2) 10-minute warmup (read-only run)
  3) 20-minute read-only run
  4) checkpoint
  5) 20-minute read-write run

and the results are very stable.
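The sequence above can be sketched as a list of shell commands built in Python (the pgbench flags, scale, client count and database name are illustrative assumptions, not values taken from this thread):

```python
# Sketch of the per-run sequence described above; flags and the database
# name "pgbench" are assumptions, not settings from the benchmark.
def run_commands(db="pgbench", scale=100, clients=16):
    return [
        # 1) rebuild the data from scratch
        f"pgbench -i -s {scale} {db}",
        # 2) 10-minute warmup (read-only)
        f"pgbench -S -T 600 -c {clients} {db}",
        # 3) 20-minute read-only run
        f"pgbench -S -T 1200 -c {clients} {db}",
        # 4) explicit checkpoint between the read-only and read-write runs
        f'psql -c "CHECKPOINT" {db}',
        # 5) 20-minute read-write run
        f"pgbench -T 1200 -c {clients} {db}",
    ]

for cmd in run_commands():
    print(cmd)
```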

Tomas

Re: wal_level=archive gives better performance than minimal - why?

From: Tomas Vondra
On 17.1.2012 01:29, Tomas Vondra wrote:
> On 16.1.2012 23:35, Greg Smith wrote:
>> On 01/12/2012 06:17 PM, Tomas Vondra wrote:
>>> I've run a series of pgbench benchmarks with the aim to see the effect
>>> of moving the WAL logs to a separate drive, and one thing that really
>>> surprised me is that the archive log level seems to give much better
>>> performance than minimal log level.
>>
>> How repeatable is this?  If you always run minimal first and then
>> archive, that might be the actual cause of the difference.  In this
>> situation I would normally run this 12 times, with this sort of pattern:
>>
>> minimal
>> minimal
>> minimal
>> archive
>> archive
>> archive
>> minimal
>> minimal
>> minimal
>> archive
>> archive
>> archive
>>
>> To make sure the difference wasn't some variation on "gets slower after
>> each run".  pgbench suffers a lot from problems in that class.

So, I've rerun the whole benchmark (varying fsync method and wal level),
and the results are exactly the same as before ...

See this:

  http://www.fuzzy.cz/tmp/fsync/tps.html
  http://www.fuzzy.cz/tmp/fsync/latency.html

Each row represents one of the fsync methods, first column is archive
level, second column is minimal level. Notice that the performance with
archive level continuously increases and is noticeably better than the
minimal wal level. In some cases (e.g. fdatasync) the difference is up
to 15%. That's a lot.

This is a 20-minute pgbench read-write run, executed after a 20-minute
read-only pgbench run (to warm up the caches etc.).

The latencies seem generally the same, except that with minimal WAL level
there's a 4-minute interval of significantly higher latencies at the
beginning.

That's suspiciously similar to the checkpoint timeout (which was set to
4 minutes), but why should this matter for minimal WAL level and not for
archive?
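For reference, the two settings in play would look like this in postgresql.conf (a hypothetical fragment; only the 4-minute checkpoint timeout and the fact that wal_level is the sole differing line are taken from this thread):

```
# the only line that differs between the two configs
wal_level = minimal            # or 'archive'

# matches the 4-minute interval of higher latencies
checkpoint_timeout = 4min
```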

Tomas

Re: wal_level=archive gives better performance than minimal - why?

From: Robert Haas
2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
> That's suspiciously similar to the checkpoint timeout (which was set to
> 4 minutes), but why should this matter for minimal WAL level and not for
> archive?

I went through and looked at all the places where we invoke
XLogIsNeeded().  When XLogIsNeeded(), we:

1. WAL log creation of the _init fork of an unlogged table or an index
on an unlogged table (otherwise, an fsync is enough)
2. WAL log index builds
3. WAL log changes to max_connections, max_prepared_xacts,
max_locks_per_xact, and/or wal_level
4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is
open_sync or open_datasync
6. refuse to create named restore points
7. WAL log CLUSTER
8. WAL log COPY FROM into a newly created/truncated relation
9. WAL log ALTER TABLE .. SET TABLESPACE
10. WAL log cleanup info before doing an index vacuum (this one should
probably be changed to happen only in HS mode)
11. WAL log SELECT INTO

It's hard to see how generating more WAL could cause a performance
improvement, unless there's something about full page flushes being
more efficient than partial page flushes or something like that.  None
of the stuff above looks likely to happen very often anyway, but items
#4 and #5 on that list look like things that could potentially be
causing a problem - if WAL files are being reused regularly, then
calling POSIX_FADV_DONTNEED on them could represent a regression.  It
might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing
whether that changes anything.
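The hint itself can be exercised without recompiling: Python exposes the same system call, so a rough stand-in for what happens when a WAL file is closed with XLogIsNeeded() false looks like this (the scratch file below is only sized like a WAL segment, it is not a real one):

```python
import os
import tempfile

# Create a 16 MB scratch file, the size of a WAL segment, then ask the
# kernel to drop its pages from cache - the posix_fadvise(DONTNEED)
# call PostgreSQL makes when closing a WAL file at wal_level=minimal.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (16 * 1024 * 1024))
    path = f.name

fd = os.open(path, os.O_RDWR)
try:
    os.fsync(fd)  # dirty pages must be written back before dropping them
    if hasattr(os, "posix_fadvise"):  # not available on all platforms
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```

If recycled WAL segments are read back shortly after being dropped like this, every reuse becomes a cache miss, which would fit the "minimal is slower" pattern.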

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: wal_level=archive gives better performance than minimal - why?

From: Cédric Villemain
Le 3 février 2012 19:48, Robert Haas <robertmhaas@gmail.com> a écrit :
> 2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
>> That's suspiciously similar to the checkpoint timeout (which was set to
>> 4 minutes), but why should this matter for minimal WAL level and not for
>> archive?
>
> I went through and looked at all the places where we invoke
> XLogIsNeeded().  When XLogIsNeeded(), we:
>
> 1. WAL log creation of the _init fork of an unlogged table or an index
> on an unlogged table (otherwise, an fsync is enough)
> 2. WAL log index builds
> 3. WAL log changes to max_connections, max_prepared_xacts,
> max_locks_per_xact, and/or wal_level
> 4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
> 5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is
> open_sync or open_datasync
> 6. refuse to create named restore points
> 7. WAL log CLUSTER
> 8. WAL log COPY FROM into a newly created/truncated relation
> 9. WAL log ALTER TABLE .. SET TABLESPACE
> 10. WAL log cleanup info before doing an index vacuum (this one should
> probably be changed to happen only in HS mode)
> 11. WAL log SELECT INTO
>
> It's hard to see how generating more WAL could cause a performance
> improvement, unless there's something about full page flushes being
> more efficient than partial page flushes or something like that.  None
> of the stuff above looks likely to happen very often anyway, but items
> #4 and #5 on that list look like things that could potentially be
> causing a problem - if WAL files are being reused regularly, then
> calling POSIX_FADV_DONTNEED on them could represent a regression.  It
> might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing
> whether that changes anything.

It would be valuable to have the kernel version, and also to confirm
that the same behavior happens with XFS.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

Re: wal_level=archive gives better performance than minimal - why?

From: Tomas Vondra
On 4.2.2012 17:04, Cédric Villemain wrote:
> Le 3 février 2012 19:48, Robert Haas <robertmhaas@gmail.com> a écrit :
>> 2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
>>> That's suspiciously similar to the checkpoint timeout (which was set to
>>> 4 minutes), but why should this matter for minimal WAL level and not for
>>> archive?
>>
>> I went through and looked at all the places where we invoke
>> XLogIsNeeded().  When XLogIsNeeded(), we:
>>
>> 1. WAL log creation of the _init fork of an unlogged table or an index
>> on an unlogged table (otherwise, an fsync is enough)
>> 2. WAL log index builds
>> 3. WAL log changes to max_connections, max_prepared_xacts,
>> max_locks_per_xact, and/or wal_level
>> 4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
>> 5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is
>> open_sync or open_datasync
>> 6. refuse to create named restore points
>> 7. WAL log CLUSTER
>> 8. WAL log COPY FROM into a newly created/truncated relation
>> 9. WAL log ALTER TABLE .. SET TABLESPACE
>> 10. WAL log cleanup info before doing an index vacuum (this one should
>> probably be changed to happen only in HS mode)
>> 11. WAL log SELECT INTO
>>
>> It's hard to see how generating more WAL could cause a performance
>> improvement, unless there's something about full page flushes being
>> more efficient than partial page flushes or something like that.  None
>> of the stuff above looks likely to happen very often anyway, but items
>> #4 and #5 on that list look like things that could potentially be
>> causing a problem - if WAL files are being reused regularly, then
>> calling POSIX_FADV_DONTNEED on them could represent a regression.  It
>> might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing
>> whether that changes anything.
>
> It would be valuable to have the kernel version, and also to confirm
> that the same behavior happens with XFS.

The kernel is 3.1.5, more precisely the "uname -a" gives this:

Linux rimmer 3.1.5-gentoo #1 SMP PREEMPT Sun Dec 25 14:11:19 CET 2011
x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux

I plan to rerun the test with various settings, I'll add there XFS
results (so far everything was on EXT4) and I'll post an update to this
thread.

Tomas