Re: Hot standby, recovery infra - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Hot standby, recovery infra |
Date | |
Msg-id | 498AD906.1030507@enterprisedb.com |
In response to | Re: Hot standby, recovery infra (Simon Riggs <simon@2ndQuadrant.com>) |
Responses | Re: Hot standby, recovery infra |
List | pgsql-hackers |
Simon Riggs wrote:
> On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> So we might end up flushing more often *and* we will be doing it
>>>>> potentially in the code path of other users.
>>>> For example, imagine a database that fits completely in shared buffers.
>>>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
>>>> If we update in XLogFlush the way I described, you only need to update
>>>> when we flush a page from the buffer cache, which will only happen at
>>>> restartpoints. That's far less updates.
>>> Oh, did you change the bgwriter so it doesn't do normal page cleaning?
>> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
>> will perform XLogFlushes, but that should be pretty insignificant. When
>> there's little page replacement going on, bgwriter will do a small
>> trickle of page cleaning, which won't matter much.
>
> Yes, that case is good, but it wasn't the use case we're trying to speed
> up by having the bgwriter active during recovery. We're worried about
> I/O bound recoveries.

Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the
control file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of
shared_buffers available to buffer the modified pages, the average WAL
record size, and the cache hit ratio. Let's determine the worst case:

The smallest WAL record that dirties a page is a heap deletion record.
That contains just enough information to locate the tuple. If I'm
reading the headers right, that record is 48 bytes long (28 bytes of
xlog header + 18 bytes of payload + padding). Assuming that the WAL is
full of just those records, that there are no full-page images, and that
the cache hit ratio is 0%, we will need (16 MB / 48 B) * 8 kB = 2730 MB
of shared_buffers to reach the mark of one fsync per 16 MB of WAL.

So if you have a lower shared_buffers setting than 2.7 GB, you can have
more frequent fsyncs this way in the worst case. If you think of the
typical case, you're probably not doing all deletes, and you have a
non-zero cache hit ratio, so you achieve the same frequency with a much
lower shared_buffers setting. And if you're really that I/O bound, I
doubt the few extra fsyncs matter much.

Also note that when the control file is updated in XLogFlush, it's
typically the bgwriter doing it as it cleans buffers ahead of the clock
hand, not the startup process.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
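The worst-case arithmetic in the message can be reproduced with a short Python sketch. This is only an illustration, not PostgreSQL code: the function name `shared_buffers_needed` and the `hit_ratio` parameter are hypothetical, and treating a cache hit as "no additional buffer required" is a simplifying assumption that goes slightly beyond what the message spells out. The constants (16 MB segment, 48-byte record, 8 kB page) are taken from the message.

```python
# Back-of-the-envelope check of the worst case described in the message.
# Not PostgreSQL code; names and the hit_ratio model are assumptions.

WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # one 16 MB WAL segment
RECORD_BYTES = 48                     # 28 B header + 18 B payload + padding
PAGE_BYTES = 8 * 1024                 # one 8 kB buffer page


def shared_buffers_needed(segment_bytes=WAL_SEGMENT_BYTES,
                          record_bytes=RECORD_BYTES,
                          page_bytes=PAGE_BYTES,
                          hit_ratio=0.0):
    """Bytes of shared_buffers needed to absorb one WAL segment's worth of
    page modifications before any dirty page has to be flushed.

    Assumption: each record dirties one page, and a cache hit reuses a page
    that is already buffered, so only misses consume a new buffer.
    """
    records = segment_bytes / record_bytes
    new_pages = records * (1.0 - hit_ratio)
    return new_pages * page_bytes


worst_case_mb = shared_buffers_needed() / (1024 * 1024)
print(f"worst case: about {int(worst_case_mb)} MB of shared_buffers")

# With a hypothetical 90% hit ratio the same fsync frequency needs far less.
typical_mb = shared_buffers_needed(hit_ratio=0.9) / (1024 * 1024)
print(f"90% hit ratio: about {int(typical_mb)} MB of shared_buffers")
```

With a 0% hit ratio this lands on the roughly 2.7 GB figure quoted above; raising the hit ratio or the average record size shrinks the requirement proportionally under this model.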