
Re: Hot standby, recovery infra - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Hot standby, recovery infra
Date	February 5, 2009 08:18:28
Msg-id	498AD906.1030507@enterprisedb.com
In response to	Re: Hot standby, recovery infra (Simon Riggs <simon@2ndQuadrant.com>)
Responses	Re: Hot standby, recovery infra
List	pgsql-hackers


Simon Riggs wrote:
> On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> So we might end up flushing more often *and* we will be doing it
>>>>> potentially in the code path of other users.
>>>> For example, imagine a database that fits completely in shared buffers. 
>>>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
>>>> If we update in XLogFlush the way I described, you only need to update 
>>>> when we flush a page from the buffer cache, which will only happen at 
>>>> restartpoints. That's far less updates.
>>> Oh, did you change the bgwriter so it doesn't do normal page cleaning? 
>> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
>> will perform XLogFlushes, but that should be pretty insignificant. When 
>> there's little page replacement going on, bgwriter will do a small 
>> trickle of page cleaning, which won't matter much. 
> 
> Yes, that case is good, but it wasn't the use case we're trying to speed
> up by having the bgwriter active during recovery. We're worried about
> I/O bound recoveries.

Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the 
control file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of 
shared_buffers available to buffer the modified pages, the average WAL 
record size, and the cache hit ratio. Let's determine the worst case:

The smallest WAL record that dirties a page is a heap deletion record. 
That contains just enough information to locate the tuple. If I'm 
reading the headers right, that record is 48 bytes long (28 bytes of 
xlog header + 18 bytes of payload + padding). Assuming that the WAL is 
full of just those records, that there are no full-page images, and that 
the cache hit ratio is 0%, we will need (16 MB / 48 B) * 8 kB = 2730 MB 
of shared_buffers to reach the mark of one fsync per 16 MB of WAL.
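
The worst-case arithmetic above can be checked with a short sketch. The
constant names are illustrative only, not PostgreSQL identifiers, and the
8-byte alignment is an assumption chosen because it reproduces the
48-byte record size quoted above.

```python
# Worst-case sizing: every WAL record is a minimal heap deletion and
# every record dirties a different page (0% cache hit ratio).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # one 16 MB WAL segment
PAGE_BYTES = 8 * 1024                 # one 8 kB buffer page
XLOG_HEADER_BYTES = 28                # header size as quoted in the text
PAYLOAD_BYTES = 18                    # payload size as quoted in the text
ALIGNMENT = 8                         # assumed alignment boundary


def align_up(n, boundary):
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary


record_bytes = align_up(XLOG_HEADER_BYTES + PAYLOAD_BYTES, ALIGNMENT)
records_per_segment = WAL_SEGMENT_BYTES // record_bytes
needed_mb = records_per_segment * PAGE_BYTES // (1024 * 1024)

print(record_bytes, records_per_segment, needed_mb)
```

That gives 48-byte records, 349525 records per segment, and 2730 MB of
dirtied pages per segment, matching the figure above.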

So if shared_buffers is set lower than about 2.7 GB, you can get more 
frequent fsyncs this way in the worst case. In the typical case, you're 
probably not doing all deletes, and you have a non-zero cache hit ratio, 
so you achieve the same frequency with a much lower shared_buffers 
setting. And if you're really that I/O bound, I doubt the few extra 
fsyncs matter much.
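
To get a feel for how quickly the requirement drops outside the worst
case, here is a rough sketch. The model is a simplification and an
assumption of this example, not something measured: each record touches
one page, and only cache misses bring a new page into shared_buffers.
The function name and the "typical" inputs (200-byte records, 90% hit
ratio) are made up for illustration.

```python
def shared_buffers_needed_mb(record_bytes, hit_ratio,
                             wal_bytes=16 * 1024 * 1024,
                             page_bytes=8 * 1024):
    """Estimate the MB of buffers dirtied while replaying wal_bytes of WAL.

    Simplified model (assumption): one page touched per record, and only
    the (1 - hit_ratio) fraction of records pulls in a new page.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    records = wal_bytes / record_bytes
    new_pages = records * (1.0 - hit_ratio)
    return new_pages * page_bytes / (1024 * 1024)


worst = shared_buffers_needed_mb(48, 0.0)     # all minimal deletes, no hits
typical = shared_buffers_needed_mb(200, 0.9)  # hypothetical workload

print(f"worst case: {worst:.1f} MB, hypothetical typical case: {typical:.1f} MB")
```

Under those made-up inputs the break-even point falls from roughly
2.7 GB to well under 100 MB, which is the sense in which the same fsync
frequency is reached with a much lower shared_buffers setting.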

Also note that when the control file is updated in XLogFlush, it's 
typically the bgwriter doing it as it cleans buffers ahead of the clock 
hand, not the startup process.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
