Re: [GENERAL] Slow PITR restore - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: [GENERAL] Slow PITR restore
Msg-id: 1197555385.4255.1773.camel@ebony.site
In response to: Re: [GENERAL] Slow PITR restore (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses: Re: [GENERAL] Slow PITR restore (Heikki Linnakangas <heikki@enterprisedb.com>)
List: pgsql-hackers

On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > 
> >> We would have readbuffers in shared memory, like wal_buffers in reverse.
> >> Each worker would read the next WAL record and check that there is no
> >> conflict with other concurrent WAL records. If there is no conflict, it
> >> applies the record immediately; otherwise it waits for the conflicting
> >> worker to complete.
> > 
> > Well I guess you would have to bring up the locking infrastructure and lock
> > any blocks in the record you're applying (sorted first to avoid deadlocks). As
> > I understand it we don't use locks during recovery now but I'm not sure if
> > that's just because we don't have to or if there are practical problems which
> > would have to be solved to do so.
> 
> We do use locks during recovery: XLogReadBuffer takes an exclusive lock
> on the buffer. According to the comments there, it wouldn't be strictly
> necessary, but I believe we do actually need it to protect against the
> bgwriter writing out a buffer while it's being modified. We only lock one
> page at a time, which is good enough for WAL replay, but not to protect
> things like a b-tree split from concurrent access.
> 
> I hacked together a quick & dirty prototype of using posix_fadvise in
> recovery a while ago. First of all, there are the changes to the buffer
> manager, which we'd need anyway if we wanted to use posix_fadvise for
> speeding up other stuff like index scans. Then there are changes to
> xlog.c to buffer a number of WAL records, so that you can read ahead
> the data pages needed by WAL records beyond the one you're actually
> replaying.
> 
> I added a new function, readahead, to the rmgr API. It's similar to the
> redo function, but instead of actually replaying the WAL record it just
> issues the fadvise calls to the buffer manager for the pages that are
> needed to replay the record. This needs to be implemented for each
> resource manager that we want to do readahead for. If we had the list of
> blocks in the WAL record in an rmgr-independent format, we could do it
> in a more generic way, as we do for backup block restoration.
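
As an aside, a rough sketch of what such a readahead hook might look like
for a heap update record; this is not from the prototype, and the
XLogPrefetchBlock() helper is an invented name, with only posix_fadvise
itself being a real call:

    /*
     * Sketch only (names invented for illustration, not from the patch).
     * A per-rmgr readahead function mirrors the redo function but only
     * hints at the blocks that redo will need.
     */
    static void
    heap_xlog_readahead(XLogRecPtr lsn, XLogRecord *record)
    {
        xl_heap_update *xlrec = (xl_heap_update *) XLogRecGetData(record);

        /* hint that the old and the new block will be needed shortly */
        XLogPrefetchBlock(xlrec->target.node,
                          ItemPointerGetBlockNumber(&(xlrec->target.tid)));
        XLogPrefetchBlock(xlrec->target.node,
                          ItemPointerGetBlockNumber(&(xlrec->newtid)));
    }

    /*
     * XLogPrefetchBlock() would locate the file and offset for the page
     * and, if it isn't already in shared_buffers, issue something like
     *
     *     posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
     *                   POSIX_FADV_WILLNEED);
     */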
> 
> The multiple-process approach seems a lot more complex to me. You need a
> lot of bookkeeping to keep the processes from stepping on each other's
> toes, and to choose the next WAL record to replay. I think you have the
> same problem: you need an rmgr-specific function to extract the data
> block numbers required to replay a WAL record, or you have to add that
> list to the WAL record header in a generic format. The multi-process
> approach is nice because it allows you to parallelize the CPU work of
> replaying the records as well, but I wonder how much that really scales
> given all the locking required. Also, I don't think replaying WAL records
> is very expensive CPU-wise. You'd need a pretty impressive RAID array to
> read WAL from to saturate a single CPU.

With all this talk, I thought of a much better way. We don't actually
need to apply the changes in the order they are received; we just need
to enforce enough ordering to ensure that each block's changes are
applied in LSN order.

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in a WAL record that isn't a full-page
overwrite, first check shared_buffers. If the block is in shared_buffers,
apply the record immediately and move on; if not, put the record in the
recovery cache.
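
Roughly, something like this; BufferProbe() and RecoveryCacheAdd() are
invented names standing in for "is this block already in shared_buffers?"
and "stash the record until the cache is flushed":

    /*
     * Sketch of the per-record decision (illustration only).
     */
    static void
    ApplyOrDefer(XLogRecPtr lsn, XLogRecord *record,
                 RelFileNode rnode, BlockNumber blkno)
    {
        if (record->xl_info & XLR_BKP_BLOCK_MASK)
        {
            /* carries a full-page image: an overwrite, so apply now */
            RmgrTable[record->xl_rmid].rm_redo(lsn, record);
        }
        else if (BufferProbe(rnode, blkno))
        {
            /* block already resident: apply immediately, no extra read */
            RmgrTable[record->xl_rmid].rm_redo(lsn, record);
        }
        else
        {
            /* block not cached: defer, preserving per-block LSN order */
            RecoveryCacheAdd(lsn, rnode, blkno, record);
        }
    }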

When the cache fills, empty it: qsort the cached WAL records by (rmgrid,
rel, blockid, lsn), then scan through the records applying them in
sequence. That way we accumulate all the changes for each block, so we
only need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead as well. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache, and we gain some L2 cache efficiency too.
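
For illustration, the flush could look something like this (the
CachedRecord struct and comparator are invented for the sketch):

    /*
     * Sorting by (rmgr, relation, block, LSN) groups all of a block's
     * changes together while keeping them in LSN order within the block.
     */
    typedef struct CachedRecord
    {
        RmgrId       rmid;
        RelFileNode  rnode;
        BlockNumber  blkno;
        XLogRecPtr   lsn;
        XLogRecord  *record;
    } CachedRecord;

    static int
    cached_record_cmp(const void *a, const void *b)
    {
        const CachedRecord *ra = (const CachedRecord *) a;
        const CachedRecord *rb = (const CachedRecord *) b;
        int          c;

        if (ra->rmid != rb->rmid)
            return (ra->rmid < rb->rmid) ? -1 : 1;
        if ((c = memcmp(&ra->rnode, &rb->rnode, sizeof(RelFileNode))) != 0)
            return c;
        if (ra->blkno != rb->blkno)
            return (ra->blkno < rb->blkno) ? -1 : 1;
        if (XLByteLT(ra->lsn, rb->lsn))
            return -1;
        if (XLByteLT(rb->lsn, ra->lsn))
            return 1;
        return 0;
    }

    /*
     * Emptying the cache is then just:
     *
     *     qsort(cache, ncached, sizeof(CachedRecord), cached_record_cmp);
     *
     * followed by walking the sorted array and calling the rmgr redo
     * routine for each record.
     */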

When recovery ends, empty the cache.

I think that is better than both of the methods mentioned above, and
definitely simpler than my brute-force method. It also lends itself to
using both of those methods as additional techniques if we really needed
to. I suspect reordering the I/O in this way is going to make a huge
difference to cache hit rates.

It looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply(), which returns bool, and rmgr_redo_cache(). The first
applies the record if possible and otherwise places it in the cache; the
second gets called repeatedly while the cache is being emptied.
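
In terms of the rmgr table, that might look something like this (sketch
only; the two new function pointers follow the naming above, and the
other RmgrData fields stay as they are):

    typedef struct RmgrData
    {
        const char *rm_name;

        /*
         * Apply the record now if its block is an overwrite or already
         * in shared_buffers; otherwise place it in the recovery cache
         * and return false.
         */
        bool        (*rm_redo_apply) (XLogRecPtr lsn, XLogRecord *record);

        /* Called repeatedly while the recovery cache is being emptied. */
        void        (*rm_redo_cache) (XLogRecPtr lsn, XLogRecord *record);

        /* rm_desc, rm_startup, rm_cleanup, rm_safe_restartpoint unchanged */
    } RmgrData;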

That sounds like it might not be I/O efficient, in that it would suffer
from producer/consumer flip-flopping. But with a large
maintenance_work_mem you'll get the I/O from possibly hundreds of WAL
files accumulated together before it is issued, assuming only a small
percentage of WAL records go into the cache; many of those will then
have their I/O reduced to zero because of the sequential cache access.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


