Re: [GENERAL] Slow PITR restore - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: [GENERAL] Slow PITR restore
Date:
Msg-id: 47619523.8010107@enterprisedb.com
In response to: Re: [GENERAL] Slow PITR restore  (Simon Riggs <simon@2ndquadrant.com>)
Responses: Re: [GENERAL] Slow PITR restore  (Simon Riggs <simon@2ndquadrant.com>)
           Re: [GENERAL] Slow PITR restore  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: [GENERAL] Slow PITR restore  (Gregory Stark <stark@enterprisedb.com>)
           Re: [GENERAL] Slow PITR restore  (Hannu Krosing <hannu@skype.net>)
List: pgsql-hackers
Simon Riggs wrote:
> Allocate a recovery cache of size maintenance_work_mem that goes away
> when recovery ends.
> 
> For every block mentioned in a WAL record that isn't an overwrite, first
> check shared_buffers. If it's in shared_buffers, apply immediately and
> move on. If not in shared_buffers, then put it in the recovery cache.
> 
> When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
> blockid, lsn. Then we scan through the records, applying them in
> sequence. That way we will accumulate changes on each block so we only
> need to request it once rather than thrashing the cache. We may get
> lucky and pick up some OS readahead also. We would also use buffer
> recycling when emptying the recovery cache, to ensure that we don't
> trash the main cache and also gain from L2 cache efficiency.
> 
> When recovery ends, empty the cache.

Hmm. That assumes that nothing other than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it 
doesn't seem impossible that something like Florian's read-only queries 
on a standby server would change that.
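
To make the sort-and-flush step concrete, here is a rough C sketch of
emptying such a recovery cache. The struct and function names
(DeferredRecord, apply_record, recovery_cache_flush) are invented for
illustration and are not actual backend code; only the sort key,
(rmgrid, rel, blockid, lsn), comes from the proposal above:

#include <stdint.h>
#include <stdlib.h>

typedef struct DeferredRecord
{
    uint8_t     rmgrid;     /* resource manager id */
    uint32_t    rel;        /* relation, simplified to a single id */
    uint32_t    blockid;    /* block number within the relation */
    uint64_t    lsn;        /* WAL location, preserves per-block ordering */
    void       *record;     /* the buffered WAL record itself */
} DeferredRecord;

static int
deferred_cmp(const void *a, const void *b)
{
    const DeferredRecord *x = a;
    const DeferredRecord *y = b;

    if (x->rmgrid != y->rmgrid)
        return x->rmgrid < y->rmgrid ? -1 : 1;
    if (x->rel != y->rel)
        return x->rel < y->rel ? -1 : 1;
    if (x->blockid != y->blockid)
        return x->blockid < y->blockid ? -1 : 1;
    if (x->lsn != y->lsn)
        return x->lsn < y->lsn ? -1 : 1;
    return 0;
}

/* Stub: real code would read/pin the target block and apply the record. */
static void
apply_record(DeferredRecord *rec)
{
    (void) rec;
}

/* Empty the recovery cache: sort, then replay in block order, so each
 * block is requested only once even if it has many changes queued. */
static void
recovery_cache_flush(DeferredRecord *cache, size_t n)
{
    qsort(cache, n, sizeof(DeferredRecord), deferred_cmp);
    for (size_t i = 0; i < n; i++)
        apply_record(&cache[i]);
}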

> I think that is better than both methods mentioned, and definitely
> simpler than my brute-force method. It also lends itself to using both
> previously mentioned methods as additional techniques if we really
> needed to. I suspect reordering the I/Os in this way is going to make a
> huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going 
to be issuing only one read request at a time. The order of those 
requests will be better from a cache-hit point of view, which is good, but 
the problem remains that if the modified data blocks are scattered 
around the database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of 
long PITR archives, though. You could scan all the WAL records (or at 
least scan far ahead), and make note of where there are full-page writes 
for each page. Whenever there's a full-page write further ahead in the 
log, you could ignore all earlier changes to that page, because they're 
going to be overwritten anyway. It won't help with normal recovery, 
because there won't be more than one full-page image of each page after 
the last checkpoint, but with PITR it would help.
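
A rough sketch of that pre-scan, again with invented names (PageFpw,
note_full_page_write, can_skip) and a toy fixed-size table standing in
for a proper hash table:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct PageFpw
{
    uint32_t    rel;            /* relation, simplified to a single id */
    uint32_t    blockno;        /* block within the relation */
    uint64_t    last_fpw_lsn;   /* LSN of the latest full-page write seen */
} PageFpw;

#define MAX_TRACKED 1024        /* toy bound; a real version wants a hash table */
static PageFpw tracked[MAX_TRACKED];
static size_t  ntracked = 0;

/* Pass 1: while scanning ahead in the archived WAL, remember the newest
 * full-page write seen for each page. */
static void
note_full_page_write(uint32_t rel, uint32_t blockno, uint64_t lsn)
{
    for (size_t i = 0; i < ntracked; i++)
    {
        if (tracked[i].rel == rel && tracked[i].blockno == blockno)
        {
            if (lsn > tracked[i].last_fpw_lsn)
                tracked[i].last_fpw_lsn = lsn;
            return;
        }
    }
    if (ntracked < MAX_TRACKED)
        tracked[ntracked++] = (PageFpw) { rel, blockno, lsn };
}

/* Pass 2 (the actual replay): a record can be dropped if a later
 * full-page image of the same page will overwrite its effects anyway. */
static bool
can_skip(uint32_t rel, uint32_t blockno, uint64_t record_lsn)
{
    for (size_t i = 0; i < ntracked; i++)
    {
        if (tracked[i].rel == rel && tracked[i].blockno == blockno)
            return record_lsn < tracked[i].last_fpw_lsn;
    }
    return false;
}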

> Looks like each rmgr_redo call would need to be split into two calls:
> rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> apply if possible, otherwise place in cache. The second gets called
> repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise approach.
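
As a sketch of how that split might look from the caller's side, with
stub bodies and an invented XLogRecordStub type standing in for the
real record (this is not the actual rmgr interface):

#include <stdbool.h>

typedef struct XLogRecordStub XLogRecordStub;   /* stand-in for a WAL record */

/* Stub: copy the record into the recovery cache for later replay. */
static void
defer_to_recovery_cache(XLogRecordStub *record)
{
    (void) record;
}

/* First call: apply immediately if the target block is already in shared
 * buffers; return false to tell the caller to defer the record. */
static bool
rmgr_redo_apply(XLogRecordStub *record)
{
    (void) record;
    return false;           /* stub: pretend the block was not resident */
}

/* Second call: used while emptying the cache; reads the block (once, in
 * sorted order) and applies the buffered record to it. */
static void
rmgr_redo_cache(XLogRecordStub *record)
{
    (void) record;
}

/* How the two calls fit together in the main replay loop. */
static void
replay_one_record(XLogRecordStub *record)
{
    if (!rmgr_redo_apply(record))
        defer_to_recovery_cache(record);
}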

It seems that in all the proposed schemes we need to know which blocks a 
given WAL record will need to access. For multiple recovery processes, 
you need that to figure out which WAL records you can safely replay. In 
the posix_fadvise scheme, you need that to issue the posix_fadvises 
without modifying anything.
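
For example, once the block reference has been extracted from a record,
issuing the hint is just a posix_fadvise(POSIX_FADV_WILLNEED) on the
right offset. The file descriptor handling and BLCKSZ arithmetic below
are simplified assumptions, not the real smgr layer:

#define _XOPEN_SOURCE 600       /* for posix_fadvise */
#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Hint the kernel to start reading a block that an upcoming WAL record
 * will need, without waiting for the read to finish. */
static void
prefetch_block(int fd, uint32_t blockno)
{
    (void) posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}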

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

