Thread: wal_level=archive gives better performance than minimal - why?
Hi all,

I've run a series of pgbench benchmarks with the aim of seeing the effect of moving the WAL logs to a separate drive, and one thing that really surprised me is that the archive log level seems to give much better performance than the minimal log level. On spinning drives this is not noticeable, but on SSDs it's quite clear. See for example these:

http://www.fuzzy.cz/tmp/tps-rw-minimal.png
http://www.fuzzy.cz/tmp/tps-rw-archive.png

The minimal log level gives about 1600 tps all the time, while the archive log level gives about the same performance at the start but then continuously increases up to about 2000 tps. This seems very suspicious, because AFAIK the WAL level should not really matter for pgbench, and if it did I'd expect exactly the opposite behaviour (i.e. 'archive' performing worse than 'minimal').

This was run on 9.1.2 with two SSDs (Intel 320) and EXT4, but I do see exactly the same behaviour with a single SSD drive.

The config files are here (the only difference is the wal_level line at the very end):

http://www.fuzzy.cz/tmp/postgresql-minimal.conf
http://www.fuzzy.cz/tmp/postgresql-archive.conf

pgbench results and logs are here:

http://www.fuzzy.cz/tmp/pgbench.minimal.log.gz
http://www.fuzzy.cz/tmp/pgbench.archive.log.gz
http://www.fuzzy.cz/tmp/results.minimal.log
http://www.fuzzy.cz/tmp/results.archive.log

I do plan to rerun the whole benchmark, but is there any reasonable explanation or anything that might cause such behaviour?

kind regards
Tomas
On 01/12/2012 06:17 PM, Tomas Vondra wrote:
> I've run a series of pgbench benchmarks with the aim of seeing the effect
> of moving the WAL logs to a separate drive, and one thing that really
> surprised me is that the archive log level seems to give much better
> performance than the minimal log level.

How repeatable is this? If you always run minimal first and then archive, that might be the actual cause of the difference. In this situation I would normally run this 12 times, with this sort of pattern:

minimal
minimal
minimal
archive
archive
archive
minimal
minimal
minimal
archive
archive
archive

To make sure the difference wasn't some variation on "gets slower after each run". pgbench suffers a lot from problems in that class.

--
Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
On 16.1.2012 23:35, Greg Smith wrote:
> On 01/12/2012 06:17 PM, Tomas Vondra wrote:
>> I've run a series of pgbench benchmarks with the aim of seeing the effect
>> of moving the WAL logs to a separate drive, and one thing that really
>> surprised me is that the archive log level seems to give much better
>> performance than the minimal log level.
>
> How repeatable is this? If you always run minimal first and then
> archive, that might be the actual cause of the difference. In this
> situation I would normally run this 12 times, with this sort of pattern:
>
> minimal
> minimal
> minimal
> archive
> archive
> archive
> minimal
> minimal
> minimal
> archive
> archive
> archive
>
> To make sure the difference wasn't some variation on "gets slower after
> each run". pgbench suffers a lot from problems in that class.

AFAIK it's well repeatable - the primary goal of the benchmark was to see the benefit of moving the WAL to a separate device (with various WAL levels and device types - SSD and HDD). I plan to rerun the whole thing this week with a bit more detail logged, to rule out basic configuration mistakes etc.

Each run is completely separate (rebuilt from scratch) and takes about 1 hour to complete. Each pgbench run consists of these steps:

1) rebuild the data from scratch
2) 10-minute warmup (read-only run)
3) 20-minute read-only run
4) checkpoint
5) 20-minute read-write run

and the results are very stable.

Tomas
On 17.1.2012 01:29, Tomas Vondra wrote:
> On 16.1.2012 23:35, Greg Smith wrote:
>> On 01/12/2012 06:17 PM, Tomas Vondra wrote:
>>> I've run a series of pgbench benchmarks with the aim of seeing the effect
>>> of moving the WAL logs to a separate drive, and one thing that really
>>> surprised me is that the archive log level seems to give much better
>>> performance than the minimal log level.
>>
>> How repeatable is this? If you always run minimal first and then
>> archive, that might be the actual cause of the difference. In this
>> situation I would normally run this 12 times, with this sort of pattern:
>>
>> minimal
>> minimal
>> minimal
>> archive
>> archive
>> archive
>> minimal
>> minimal
>> minimal
>> archive
>> archive
>> archive
>>
>> To make sure the difference wasn't some variation on "gets slower after
>> each run". pgbench suffers a lot from problems in that class.

So, I've rerun the whole benchmark (varying the fsync method and WAL level), and the results are exactly the same as before ... See this:

http://www.fuzzy.cz/tmp/fsync/tps.html
http://www.fuzzy.cz/tmp/fsync/latency.html

Each row represents one of the fsync methods; the first column is the archive level, the second column is the minimal level. Notice that the performance with the archive level continuously increases and is noticeably better than with the minimal WAL level. In some cases (e.g. fdatasync) the difference is up to 15%. That's a lot.

This is a 20-minute pgbench read-write run that is executed after a 20-minute read-only pgbench run (to warm up the caches etc.).

The latencies seem generally the same, except that with the minimal WAL level there's a 4-minute interval of significantly higher latencies at the beginning. That's suspiciously similar to the checkpoint timeout (which was set to 4 minutes), but why should this matter for minimal WAL level and not for archive?

Tomas
2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
> That's suspiciously similar to the checkpoint timeout (which was set to
> 4 minutes), but why should this matter for minimal WAL level and not for
> archive?

I went through and looked at all the places where we invoke XLogIsNeeded(). When XLogIsNeeded(), we:

1. WAL log creation of the _init fork of an unlogged table or an index on an unlogged table (otherwise, an fsync is enough)
2. WAL log index builds
3. WAL log changes to max_connections, max_prepared_xacts, max_locks_per_xact, and/or wal_level
4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is open_sync or open_datasync
6. refuse to create named restore points
7. WAL log CLUSTER
8. WAL log COPY FROM into a newly created/truncated relation
9. WAL log ALTER TABLE .. SET TABLESPACE
10. WAL log cleanup info before doing an index vacuum (this one should probably be changed to happen only in HS mode)
11. WAL log SELECT INTO

It's hard to see how generating more WAL could cause a performance improvement, unless there's something about full page flushes being more efficient than partial page flushes or something like that. But none of the stuff above looks likely to happen very often anyway. But items #4 and #5 on that list look like things that could potentially be causing a problem - if WAL files are being reused regularly, then calling POSIX_FADV_DONTNEED on them could represent a regression. It might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing whether that changes anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
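To make item #4 above more concrete, below is a minimal standalone C sketch (hypothetical demo code, not an excerpt from the PostgreSQL source) of the call the backend issues when it closes a WAL segment and XLogIsNeeded() is false, i.e. with wal_level = minimal; the file name, program name and build line are illustrative assumptions.

/*
 * fadvise_dontneed.c - hypothetical standalone demo, not PostgreSQL code.
 *
 * Issues the same posix_fadvise(POSIX_FADV_DONTNEED) call that the backend
 * makes when closing a WAL segment under wal_level = minimal, asking the
 * kernel to drop the file's pages from the page cache.  Per item #4 above,
 * with wal_level = archive (or hot_standby) this call is skipped.
 *
 * Build: gcc -Wall -o fadvise_dontneed fadvise_dontneed.c
 * Usage: ./fadvise_dontneed /path/to/pg_xlog/<segment>
 */
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() in <fcntl.h> */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int         fd;
    int         rc;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <wal-segment-file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /* offset 0, length 0 means "the whole file" */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return (rc == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Rebuilding the server with POSIX_FADV_DONTNEED undefined, as suggested above, disables exactly this call, so comparing such a build against a stock one on the same SSD should show whether item #4 explains the gap.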
On 3 February 2012 19:48, Robert Haas <robertmhaas@gmail.com> wrote:
> 2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
>> That's suspiciously similar to the checkpoint timeout (which was set to
>> 4 minutes), but why should this matter for minimal WAL level and not for
>> archive?
>
> I went through and looked at all the places where we invoke
> XLogIsNeeded(). When XLogIsNeeded(), we:
>
> 1. WAL log creation of the _init fork of an unlogged table or an index
> on an unlogged table (otherwise, an fsync is enough)
> 2. WAL log index builds
> 3. WAL log changes to max_connections, max_prepared_xacts,
> max_locks_per_xact, and/or wal_level
> 4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
> 5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is
> open_sync or open_datasync
> 6. refuse to create named restore points
> 7. WAL log CLUSTER
> 8. WAL log COPY FROM into a newly created/truncated relation
> 9. WAL log ALTER TABLE .. SET TABLESPACE
> 10. WAL log cleanup info before doing an index vacuum (this one should
> probably be changed to happen only in HS mode)
> 11. WAL log SELECT INTO
>
> It's hard to see how generating more WAL could cause a performance
> improvement, unless there's something about full page flushes being
> more efficient than partial page flushes or something like that. But
> none of the stuff above looks likely to happen very often anyway. But
> items #4 and #5 on that list look like things that could potentially be
> causing a problem - if WAL files are being reused regularly, then
> calling POSIX_FADV_DONTNEED on them could represent a regression. It
> might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing
> whether that changes anything.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

It would be valuable to have the kernel version, and also to confirm that the same behavior happens with XFS.

--
Cédric Villemain   +33 (0)6 20 30 22 52   http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 4.2.2012 17:04, Cédric Villemain wrote:
> On 3 February 2012 19:48, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2012/1/22 Tomas Vondra <tv@fuzzy.cz>:
>>> That's suspiciously similar to the checkpoint timeout (which was set to
>>> 4 minutes), but why should this matter for minimal WAL level and not for
>>> archive?
>>
>> I went through and looked at all the places where we invoke
>> XLogIsNeeded(). When XLogIsNeeded(), we:
>>
>> 1. WAL log creation of the _init fork of an unlogged table or an index
>> on an unlogged table (otherwise, an fsync is enough)
>> 2. WAL log index builds
>> 3. WAL log changes to max_connections, max_prepared_xacts,
>> max_locks_per_xact, and/or wal_level
>> 4. skip calling posix_fadvise(POSIX_FADV_DONTNEED) when closing a WAL file
>> 5. skip supplying O_DIRECT when writing WAL, if wal_sync_method is
>> open_sync or open_datasync
>> 6. refuse to create named restore points
>> 7. WAL log CLUSTER
>> 8. WAL log COPY FROM into a newly created/truncated relation
>> 9. WAL log ALTER TABLE .. SET TABLESPACE
>> 10. WAL log cleanup info before doing an index vacuum (this one should
>> probably be changed to happen only in HS mode)
>> 11. WAL log SELECT INTO
>>
>> It's hard to see how generating more WAL could cause a performance
>> improvement, unless there's something about full page flushes being
>> more efficient than partial page flushes or something like that. But
>> none of the stuff above looks likely to happen very often anyway. But
>> items #4 and #5 on that list look like things that could potentially be
>> causing a problem - if WAL files are being reused regularly, then
>> calling POSIX_FADV_DONTNEED on them could represent a regression. It
>> might be worth compiling with POSIX_FADV_DONTNEED undefined and seeing
>> whether that changes anything.
>
> It would be valuable to have the kernel version, and also to confirm
> that the same behavior happens with XFS.

The kernel is 3.1.5; more precisely, "uname -a" gives this:

Linux rimmer 3.1.5-gentoo #1 SMP PREEMPT Sun Dec 25 14:11:19 CET 2011 x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux

I plan to rerun the test with various settings, I'll add XFS results (so far everything was on EXT4), and I'll post an update to this thread.

Tomas