Re: [GENERAL] Slow PITR restore - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [GENERAL] Slow PITR restore
Date
Msg-id 1197580396.4255.1906.camel@ebony.site
Whole thread Raw
In response to Re: [GENERAL] Slow PITR restore  (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses Re: [GENERAL] Slow PITR restore  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
On Thu, 2007-12-13 at 20:25 +0000, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Allocate a recovery cache of size maintenance_work_mem that goes away
> > when recovery ends.
> > 
> > For every block mentioned in WAL record that isn't an overwrite, first
> > check shared_buffers. If its in shared_buffers apply immediately and
> > move on. If not in shared_buffers then put in recovery cache.
> > 
> > When cache fills, empty it. Qsort WAL records by by rmgrid, rel,
> > blockid, lsn. Then we scan through the records applying them in
> > sequence. That way we will accumulate changes on each block so we only
> > need to request it once rather than thrashing the cache. We may get
> > lucky and pick up some OS readahead also. We would also use buffer
> > recycling when emptying the recovery cache, to ensure that we don't
> > trash the main cache and also gain from L2 cache efficiency.
> > 
> > When recovery ends, empty the cache.
> 
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it 
> doesn't seem impossible that something like Florian's read-only queries 
> on a stand by server would change that.

Agreed, though I was imagining to use the cache as a secondary hash
table. Not sure about that now I write it. I think the accumulation idea
mostly makes sense for heaps. Indexes look much harder.

Whatever happens I think we should get Florian's work in there first,
then tune.

> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make a
> > huge difference to cache hit rates.
> 
> But it won't actually do anything to scale the I/O. You're still going 
> to be issuing only one read request at a time. The order of those 
> requests will be better from cache hit point of view, which is good, but 
> the problem remains that if the modified data blocks are scattered 
> around the database, you'll be doing random I/O, one request at a time.

Yeh, agreed. That's why I went for the parallelism approach originally:
you can't escape the basic physics.

I've re-read your post. If you think the buffer manager changes are
roughly the same as would be needed for other gains on index scans, then
your async I/O seems like the most profitable approach. I still don't
like it as much, but that aspect tips the balance, I think.

> It would be interesting to do something like that to speed up replay of 
> long PITR archives, though. You could scan all (or at least far ahead) 
> the WAL records, and make note of where there is full page writes for 
> each page. Whenever there's a full page write further ahead in the log, 
> you could ignore all changes to that page before that, because they're 
> going to be overwritten anyway. It won't help with normal recovery, 
> because there won't be more than one full page image of each page after 
> the last checkpoint, but with PITR it would help.

Another good idea.

Of course if we scan that far ahead we can start removing aborted
transactions also, which is the more standard optimization of recovery.

I was imagining we would just memory map the files rather than buffer
them explicitly, BTW.

> > Looks like each rmgr_redo call would need to be split into two calls:
> > rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> > apply if possible, otherwise place in cache. The second gets called
> > repeatedly during cache emptying.
> 
> Yeah, much like the split I had to do for the posix_fadvise.
> 
> It seems that in all the proposed schemes we need to know which blocks a 
> given WAL record will need to access. For multiple recovery processes, 
> you need that to figure out which WAL records you can safely replay. In 
> the posix_fadvise scheme, you need that to issue the posix_fadvises 
> without modifying anything.

Yeh, I think it should be easy enough to group together the block-based
rmgrs so they all have the same basic structure. Heap and the indexes,
that is, but some parts are harder than others. My feeling is that heaps
will easily accumulate, though secondary indexes will often be random. 

Incidentally, HOT will speed up recovery also, since there will be fewer
index operations to replay.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: [GENERAL] Slow PITR restore
Next
From: Simon Riggs
Date:
Subject: Re: [GENERAL] Slow PITR restore