Re: [WIP] Double-write with Fast Checksums - Mailing list pgsql-hackers

From Dan Scales
Subject Re: [WIP] Double-write with Fast Checksums
Date
Msg-id 2069626669.2741935.1326831941506.JavaMail.root@zimbra-prod-mbox-4.vmware.com
In response to Re: [WIP] Double-write with Fast Checksums  (Dan Scales <scales@vmware.com>)
Responses Re: [WIP] Double-write with Fast Checksums
Re: [WIP] Double-write with Fast Checksums
List pgsql-hackers
We have some numbers for 9.2 runs with and without double writes now.  We
are still using the double-write patch that assumes checksums on data
pages, so checksums must be turned on for double writes.

The first set of runs is 50-warehouse, 2-processor, 60-minute DBT2 runs
with checkpoints every 5 minutes.  Machine memory is 8G, cache size
(shared_buffers) is 5G.  Database size is about 9G.  The disks are enterprise Fibre Channel
disks, so there is good disk write-caching at the array.  All runs are
for virtual machines.  (We expect that the virtual machine numbers would
be representative of performance for non-virtual machines, but we know
that we need to get non-virtual numbers as well.)
               orig 9.2 |              9.2 + DW patch
               ---------+--------------------------------------------
               FPW off  | FPW off   FPW off   FPW on   DW on/FPW off
                        | CK off    CK on     CK on    CK on
---------------------------------------------------------------------
one disk:       15574      15308     15135    13337    13052   [5G shared_buffers, 8G RAM]
sep log disk:   18739      18134     18063    15823    16033

(First row is everything on one disk, second row is where the WAL log is
on a separate disk.)

So, in this case, where the cache is large and the disks probably have
write-caching, we get about the same performance with full_page_writes on
and with double-writes on.  We need to run these numbers more to get a good
average -- in some runs last night, double writes did better, closer to
what we were seeing with 9.0 (score of 17721 instead of 16033).

Note that, for one disk, there is no significant difference between the
original 9.2 code and the patched code with checksums (and double-writes)
turned off.  For two disks, there is a bigger difference (3.3%), but I'm
not sure that is really significant.
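
As a rough check on the percentages discussed above, the following sketch recomputes them from the scores in the first table. The dictionary keys and the helper name `pct_diff` are made up for illustration, and the choice of the second score as the base for each percentage is an assumption.

```python
# DBT2 scores copied from the first table above (higher is better).
scores = {
    "one disk":     {"orig": 15574, "patched": 15308, "fpw_on": 13337, "dw_on": 13052},
    "sep log disk": {"orig": 18739, "patched": 18134, "fpw_on": 15823, "dw_on": 16033},
}


def pct_diff(a, b):
    """Percentage by which score `a` differs from score `b`, relative to `b`."""
    return 100.0 * (a - b) / b


for layout, s in scores.items():
    # "patched" is the 9.2 + DW patch build with FPW off and checksums off.
    overhead = pct_diff(s["orig"], s["patched"])
    dw_vs_fpw = pct_diff(s["dw_on"], s["fpw_on"])
    print(f"{layout}: orig vs patched (CK off) {overhead:+.1f}%, "
          f"DW on vs FPW on {dw_vs_fpw:+.1f}%")
```

With these inputs the separate-log-disk gap between original and patched code comes out at about 3.3% when measured against the patched score, and double writes land within roughly 2% of full_page_writes in both layouts.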

The second set of numbers is for a hard disk with write cache turned off,
closer to the internal hard disks of servers (people were quite interested in
that result).  These are 50-warehouse, 8-processor, 60-minute DBT2 runs
with checkpoints every 5 minutes.  The RAM size is 8G, and the cache
size (shared_buffers) is 6G.
                      9.2 + DW patch
               -----------------------------------
               FPW off   FPW on   DW on/FPW off
               CK on     CK on    CK on

one disk:      12084     7849     9766   [6G shared_buffers, 8G RAM]

So, here we see a performance advantage for double writes over
full_page_writes where the cache is large and the disks do not have
write-caching.  Presumably, the cost of fsyncing the big writes (with full
pages) to the WAL log on a slow disk is traded against the fsyncs of the
double writes.

The third set of numbers is back to the first hardware setup, but with much
smaller shared_buffers.  Again, these are 50-warehouse, 2-processor,
60-minute DBT2 runs with checkpoints every 5 minutes.  But shared_buffers is
set to 1G, so there will be a great many more dirty evictions by the
backends.
                      9.2 + DW patch
               -----------------------------------
               FPW off   FPW on   DW on/FPW off
               CK on     CK on    CK on

one disk:      11078     10394    3296   [1G shared_buffers, 8G RAM]
sep log disk:  13605     12015    3412
one disk:       7731      6613    2670   [1G shared_buffers, 2G RAM]
sep log disk:   6752      6129    2722

Here we see that double writes do very badly, because of all the double
writes being done for individual blocks by the backends.  With the small
shared cache, the backends are now writing 3 times as many blocks as the
checkpointer.
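
To put the size of that drop in perspective, here is a small sketch that expresses the double-write score as a fraction of the full_page_writes score for the second and third sets of runs. The figures are copied from the tables above; the labels and the helper name `dw_relative_to_fpw` are illustrative only.

```python
# (label, score with FPW on, score with DW on/FPW off), copied from the
# second and third tables above.  All runs have checksums on.
runs = [
    ("6G shared_buffers, 8G RAM, one disk, no write cache", 7849, 9766),
    ("1G shared_buffers, 8G RAM, one disk", 10394, 3296),
    ("1G shared_buffers, 8G RAM, sep log disk", 12015, 3412),
    ("1G shared_buffers, 2G RAM, one disk", 6613, 2670),
    ("1G shared_buffers, 2G RAM, sep log disk", 6129, 2722),
]


def dw_relative_to_fpw(fpw_on, dw_on):
    """Double-write throughput as a fraction of full_page_writes throughput."""
    return dw_on / fpw_on


for label, fpw_on, dw_on in runs:
    print(f"{label}: DW/FPW = {dw_relative_to_fpw(fpw_on, dw_on):.2f}")
```

A ratio above 1 means double writes beat full_page_writes. Only the large-cache run clears that bar (about 1.24), while the 1G shared_buffers runs fall to roughly 0.28 to 0.44.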

Clearly, the double write option would have to be completely optional,
available for use for database configurations which have a well-sized
cache.

It would still be preferable that performance didn't have such a cliff
when dirty evictions become high, so, with that in mind, I am doing some
prototyping of the double-write buffer idea that folks have proposed on
this thread. 

Happy to hear all comments/suggestions.  Thanks,

Dan

----- Original Message -----
From: "Dan Scales" <scales@vmware.com>
To: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>, jkshah@gmail.com, "David Fetter" <david@fetter.org>
Sent: Wednesday, January 11, 2012 1:25:21 PM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

Thanks for all the comments and suggestions on the double-write patch.  We are working on generating performance
results for the 9.2 patch, but there is enough difference between 9.0 and 9.2 that it will take some time.
 

One thing in 9.2 that may be causing problems with the current patch is the fact that the checkpointer and bgwriter are
separated and can run at the same time (I think), and therefore will contend on the double-write file.  Is there any
thought that the bgwriter might be paused while the checkpointer is doing a checkpoint, since the checkpointer is doing
some of the cleaning that the bgwriter wants to do anyways?
 

The current patch (as mentioned) also may not do well if there are a lot of dirty-page evictions by backends, because
of the extra fsyncing just to write individual buffers.  I think Heikki's (and Simon's) idea of a growing shared
double-write buffer (only doing double-writes when it gets to a certain size) instead is a great idea that could deal
with the dirty-page eviction issue with less performance hit.  It could also deal with the checkpointer/bgwriter
contention, if we can't avoid that.  I will think about that approach and any issues that might arise.  But for now, we
will work on getting performance numbers for the current patch.
 

With respect to all the extra fsyncs, I agree they are expensive if done on individual buffers by backends.  For the
checkpointer, there will be extra fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off
against the often large & unpredictable fsyncs at the end of checkpoints.  In our performance runs on 9.0, the
configuration was such that there were not a lot of dirty evictions, and the checkpointer/bgwriter was able to finish
the checkpoint on time, even with the double writes.
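
The effect of batching on the fsync count can be illustrated with a deliberately simplified model. It assumes each batch costs exactly one fsync of the double-write file plus one fsync of the data files, which is an assumption made here for illustration and not a measurement of the patch; the function name and the batch size of 100 are hypothetical.

```python
import math


def fsyncs_for_double_writes(dirty_pages, batch_size):
    """Simplified model: every batch needs one fsync of the double-write
    file and one fsync of the data file(s), whatever the batch size."""
    batches = math.ceil(dirty_pages / batch_size)
    return 2 * batches


# A backend evicting dirty pages one at a time (batch size 1).
print(fsyncs_for_double_writes(1000, 1))
# A checkpointer writing the same pages in batches of 100 (hypothetical size).
print(fsyncs_for_double_writes(1000, 100))
```

Under this model, 1000 pages written individually need 2000 fsyncs, while batches of 100 need 20.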
 

And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.

Thanks,

Dan

----- Original Message -----
From: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>
To: "David Fetter" <david@fetter.org>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>, jkshah@gmail.com
Sent: Wednesday, January 11, 2012 4:13:01 AM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch.  While
> this one still uses the checksums from VMware, it's been
> forward-ported to 9.2.
>
> I'd like to hold off on merging Simon's checksum patch into this one
> for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the 
checksum patches, please? That would make the patch smaller and easier 
to review, and it would allow benchmarking the performance impact of 
double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the 
double-write area first and the data files immediately after that. 
That's probably beneficial if you have a BBU, and/or a fairly large 
shared_buffers setting, so that pages don't get swapped between OS and 
PostgreSQL cache too much. But when those assumptions don't hold, it 
would be interesting to treat the double-write buffers more like a 2nd 
WAL for full-page images. Whenever a dirty page is evicted from 
shared_buffers, write it to the double-write area, but don't fsync it or 
write it back to the data file yet. Instead, let it sit in the 
double-write area, and grow the double-write file(s) as necessary, until 
the next checkpoint comes along.
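
The deferred scheme described in the paragraph above could be sketched roughly as follows. This is a toy model and not PostgreSQL code: the class name, the file names, and the in-memory `pending` index are all hypothetical, and crash recovery (replaying the double-write file after a failure) is left out entirely.

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default block size


class DeferredDoubleWriteArea:
    """Toy model: evicted pages are appended to a double-write file and
    only copied into the data file at checkpoint time."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "datafile")
        self.dw_path = os.path.join(directory, "doublewrite")
        for path in (self.data_path, self.dw_path):
            open(path, "ab").close()  # create the file if it is missing
        self.pending = {}  # page number -> offset of newest copy in DW file
        self.fsync_calls = 0

    def evict(self, page_no, page):
        """Append a dirty page to the double-write file: no fsync and no
        write to the data file yet."""
        assert len(page) == PAGE_SIZE
        with open(self.dw_path, "ab") as dw:
            offset = dw.seek(0, os.SEEK_END)
            dw.write(page)
        self.pending[page_no] = offset

    def read(self, page_no):
        """Return the newest copy, preferring the double-write area."""
        if page_no in self.pending:
            path, offset = self.dw_path, self.pending[page_no]
        else:
            path, offset = self.data_path, page_no * PAGE_SIZE
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(PAGE_SIZE)

    def checkpoint(self):
        """Make the double-write copies durable first, then update and
        fsync the data file, then discard the double-write copies."""
        with open(self.dw_path, "r+b") as dw:
            os.fsync(dw.fileno())
            self.fsync_calls += 1
            with open(self.data_path, "r+b") as data:
                for page_no, offset in self.pending.items():
                    dw.seek(offset)
                    data.seek(page_no * PAGE_SIZE)
                    data.write(dw.read(PAGE_SIZE))
                data.flush()
                os.fsync(data.fileno())
                self.fsync_calls += 1
            dw.truncate(0)
        self.pending.clear()


with tempfile.TemporaryDirectory() as tmp:
    area = DeferredDoubleWriteArea(tmp)
    area.evict(3, b"a" * PAGE_SIZE)
    area.evict(3, b"b" * PAGE_SIZE)  # a newer copy of the same page
    area.checkpoint()
    print(area.fsync_calls, area.read(3)[:1])
```

In this model the number of fsyncs is two per checkpoint regardless of how many pages were evicted in between, at the cost of a double-write file that grows until the checkpoint and of reads that must consult it first.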

In general, I must say that I'm pretty horrified by all these extra 
fsync's this introduces. You really need a BBU to absorb them, and even 
then, you're fsyncing data files to disk much more frequently than you
otherwise would.

Jignesh mentioned having run some performance tests with this. I would 
like to see those results, and some analysis and benchmarks of how 
settings like shared_buffers and the presence of BBU affect this, 
compared to full_page_writes=on and off.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


