Thread: Re: [GENERAL] Slow PITR restore

Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> >
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally). In short, if we are in recovery, and
> >> we are not saturating the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> >
> > You sure about that?  I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run.  What I saw was that the machine was 100%
> > disk write bound.  Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
>
> There are at least three definitions of "saturating the I/O" and it sounds
> like you two are using two different ones.
>
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable
>
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a raid array there
> can be a large difference, and the wider the raid stripe the larger the
> difference.
>
> In Tom's results:
>
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> >  0  9  70024  29232  19876 824368    0    0     0  3152 1447  233  0  1  0 99  0
> >  0  9  70024  29232  19876 824368    0    0     0  3660 1474  252  0  1  0 99  0
> >  0  8  70024  28960  19876 824404    0    0     0  3176 1448  265  0  2  1 97  0
> >
> > I don't see the machine sitting around doing nothing ...
>
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms which is
> actually pretty fast. So if this is a single drive (1) and (2) seem to be
> pretty much the same here.
>
> But note that if this were a RAID array Postgres wouldn't be getting any
> better results. A RAID array wouldn't improve I/O latency at all, and since
> it's already 99% waiting for I/O Postgres is not going to be able to issue any
> more. But only one drive in the RAID array will be busy at a time, which would
> be far less than the maximum random access I/O the RAID array is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the
I/O scalability issue could be such a large problem. Koichi showed me
some results on a much larger server that illustrated the I/O problem.

But let's remember it's only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but much less than Josh's 10% of people even.

> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recover will need before actually attempting to
> recover them. On a raid array that would bring the 3MB/s above up to the
> maximum number of random accesses the raid array can handle (ie, definition
> (2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the
AV solution where we have lots of recovery workers, with one recovery
master reading the WAL files. We can then distribute WAL records to
workers in some manner.

It's true that many WAL records depend upon each other, but it's also
true that the performance problems only occur when the WAL records don't
depend upon each other. If they did, they would touch the same blocks
and those blocks would be cached. So as long as we have a safe mechanism
for splitting up the work, everything is fine.

We can divide up the WAL records by looking at the rmgr field, plus
looking deeper into the records themselves, so that workers touch
different relations/blocks.
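A toy sketch of one such distribution scheme (the struct and function names are invented for illustration, not PostgreSQL code): hash the (relation, block) pair to pick a worker, so every record touching a given block always goes to the same worker and per-block LSN order is preserved.

```c
#include <stdint.h>

/* Hypothetical routing of a WAL record to a recovery worker.
 * All records touching the same relation/block hash to the same
 * worker, so each block's changes are applied in WAL order. */
typedef struct
{
	uint32_t	rel_oid;	/* relation the record modifies */
	uint32_t	block_no;	/* block number within the relation */
} WalTarget;

static uint32_t
route_to_worker(WalTarget t, uint32_t nworkers)
{
	/* Mix relation and block into one hash (FNV-1a style),
	 * then reduce modulo the worker count. */
	uint32_t	h = 2166136261u;

	h = (h ^ t.rel_oid) * 16777619u;
	h = (h ^ t.block_no) * 16777619u;
	return h % nworkers;
}
```

The key property is determinism: two records for the same block can never end up on different workers, so no cross-worker ordering is needed for them.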

So I'm planning to review this *after* Florian has introduced his patch,
so we can build upon it.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: [GENERAL] Slow PITR restore

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
>> Heikki proposed a while back to use posix_fadvise() when processing logs to
>> read-ahead blocks which the recover will need before actually attempting to
>> recover them. On a raid array that would bring the 3MB/s above up to the
>> maximum number of random accesses the raid array can handle (ie, definition
>> (2) above).
>
> It's a good idea, but it will require more complex code. I prefer the
> simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 09:45 +0000, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> >> Heikki proposed a while back to use posix_fadvise() when processing logs to
> >> read-ahead blocks which the recover will need before actually attempting to
> >> recover them. On a raid array that would bring the 3MB/s above up to the
> >> maximum number of random accesses the raid array can handle (ie, definition
> >> (2) above).
> >
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
> 
> Huh, I forgot about that idea. Ironically that was what I suggested when
> Heikki described the problem.
> 
> I think it's more complex than using posix_fadvise. 

Some handwaving...

ISTM it's just autovacuum launcher + Hot Standby mixed.

I guess I've added two new backends now (Tom groans...) so that part
seems very straightforward. The harder part is distributing the work and
the hardest part is doing that evenly enough to make a difference. It
will also require rmgr changes for state handling, but then I think that
needs work anyway. Maybe we don't even need a master.

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete. 

> But it's also more ambitious. 

Time is the issue, I think, so you may be right. That's why I always
grunt so much about freeze dates.

Anyway, I'll leave this now, since I think we need to do Florian's work
first either way and that is much more eagerly awaited I think.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: [GENERAL] Slow PITR restore

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> We would have readbuffers in shared memory, like wal_buffers in reverse.
> Each worker would read the next WAL record and check there is no
> conflict with other concurrent WAL records. If not, it will apply the
> record immediately, otherwise wait for the conflicting worker to
> complete. 

Well I guess you would have to bring up the locking infrastructure and lock
any blocks in the record you're applying (sorted first to avoid deadlocks). As
I understand it we don't use locks during recovery now but I'm not sure if
that's just because we don't have to or if there are practical problems which
would have to be solved to do so.
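For illustration, the "sorted first to avoid deadlocks" point reduces to imposing one canonical order before acquiring locks: if every worker locks the blocks of a record in the same order, no lock-wait cycle can form. This sketch (invented names, with the actual lock calls elided) shows just the ordering step:

```c
#include <stdint.h>
#include <stdlib.h>

/* Compare two block numbers for qsort; returns -1/0/1. */
static int
block_cmp(const void *a, const void *b)
{
	uint32_t	x = *(const uint32_t *) a;
	uint32_t	y = *(const uint32_t *) b;

	return (x > y) - (x < y);
}

/* Sort the blocks a WAL record touches into canonical order.
 * Acquiring buffer locks in this order across all workers makes
 * deadlock between workers impossible. */
static void
sort_blocks_for_locking(uint32_t *blocks, size_t n)
{
	qsort(blocks, n, sizeof(uint32_t), block_cmp);
	/* ... then acquire the buffer locks in this order ... */
}
```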

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: [GENERAL] Slow PITR restore

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
>> We would have readbuffers in shared memory, like wal_buffers in reverse.
>> Each worker would read the next WAL record and check there is no
>> conflict with other concurrent WAL records. If not, it will apply the
>> record immediately, otherwise wait for the conflicting worker to
>> complete. 
> 
> Well I guess you would have to bring up the locking infrastructure and lock
> any blocks in the record you're applying (sorted first to avoid deadlocks). As
> I understand it we don't use locks during recovery now but I'm not sure if
> that's just because we don't have to or if there are practical problems which
> would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock 
on the buffer. According to the comments there, it wouldn't be strictly 
necessary, but I believe we do actually need it to protect against the 
bgwriter writing out a buffer while it's being modified. We only lock one 
page at a time, which is good enough for WAL replay, but not to protect 
things like a b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in 
recovery a while ago. First of all, there's the changes to the buffer 
manager, which we'd need anyway if we wanted to use posix_fadvise for 
speeding up other stuff like index scans. Then there's changes to 
xlog.c, to buffer a number of WAL records, so that you can read ahead 
the data pages needed by WAL records ahead of the WAL record you're 
actually replaying.

I added a new function, readahead, to the rmgr API. It's similar to the 
redo function, but it doesn't actually replay the WAL record, but just 
issues the fadvise calls to the buffer manager for the pages that are 
needed to replay the WAL record. This needs to be implemented for each 
resource manager that we want to do readahead for. If we had the list of 
blocks in the WAL record in a rmgr-independent format, we could do that 
in a more generic way, like we do the backup block restoration.

The multiple-process approach seems a lot more complex to me. You need a 
lot of bookkeeping to keep the processes from stepping on each other's 
toes, and to choose the next WAL record to replay. I think you have the 
same problem in that you need a rmgr-specific function to extract 
the data block #s required to replay a WAL record, or to add that list 
to the WAL record header in a generic format. The multi-process approach is 
nice because it allows you to parallelize the CPU work of replaying the 
records as well, but I wonder how much that really scales given all the 
locking required. Also, I don't think replaying WAL records is very 
expensive CPU-wise. You'd need a pretty impressive RAID array to read 
WAL from, to saturate a single CPU.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [GENERAL] Slow PITR restore

From
"Guillaume Smet"
Date:
Simon,

On Dec 13, 2007 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Anyway, I'll leave this now, since I think we need to do Florian's work
> first either way and that is much more eagerly awaited I think.

Speaking of that, is there any news about it and about Florian? It was
a really promising work.

Thanks.

--
Guillaume


Re: [GENERAL] Slow PITR restore

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise. But it's also more
> ambitious. It would allow us to use not only the full random access i/o
> bandwidth but also allow us to use more cpu. In cases where the database fits
> entirely in ram and we're recovering many many operations modifying the same
> blocks over and over that might help a lot.

Actually, if you are modifying the same blocks over and over it will
help *less*, because applying each record needs to occur only after the
previous records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that it applies to a block that's not locked (which
means you have to remember the skipped record and apply it sometime in
the future), or you put the process to sleep until the lock has been
released.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)

Re: [GENERAL] Slow PITR restore

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> ISTM its just autovacuum launcher + Hot Standby mixed.

I don't think you need a launcher at all.  Just get the postmaster to
start a configurable number of wal-replay processes (currently the
number is hardcoded to 1).

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)


Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > 
> >> We would have readbuffers in shared memory, like wal_buffers in reverse.
> >> Each worker would read the next WAL record and check there is no
> >> conflict with other concurrent WAL records. If not, it will apply the
> >> record immediately, otherwise wait for the conflicting worker to
> >> complete. 
> > 
> > Well I guess you would have to bring up the locking infrastructure and lock
> > any blocks in the record you're applying (sorted first to avoid deadlocks). As
> > I understand it we don't use locks during recovery now but I'm not sure if
> > that's just because we don't have to or if there are practical problems which
> > would have to be solved to do so.
> 
> We do use locks during recovery, XLogReadBuffer takes an exclusive lock 
> on the buffer. According to the comments there, it wouldn't be strictly 
> necessary.  But I believe we do actually need it to protect from 
> bgwriter writing out a buffer while it's been modified. We only lock one 
> page at a time, which is good enough for WAL replay, but not to protect 
> things like b-tree split from concurrent access.
> 
> I hacked together a quick & dirty prototype of using posix_fadvise in 
> recovery a while ago. First of all, there's the changes to the buffer 
> manager, which we'd need anyway if we wanted to use posix_fadvise for 
> speeding up other stuff like index scans. Then there's changes to 
> xlog.c, to buffer a number of WAL records, so that you can read ahead 
> the data pages needed by WAL records ahead of the WAL record you're 
> actually replaying.
> 
> I added a new function, readahead, to the rmgr API. It's similar to the 
> redo function, but it doesn't actually replay the WAL record, but just 
> issues the fadvise calls to the buffer manager for the pages that are 
> needed to replay the WAL record. This needs to be implemented for each 
> resource manager that we want to do readahead for. If we had the list of 
> blocks in the WAL record in a rmgr-independent format, we could do that 
> in a more generic way, like we do the backup block restoration.
> 
> The multiple-process approach seems a lot more complex to me. You need a 
> lot of bookkeeping to keep the processes from stepping on each others 
> toes, and to choose the next WAL record to replay. I think you have the 
> same problem that you need to have a rmgr-specific function to extract 
> the data blocks #s required to replay that WAL record, or add that list 
> to the WAL record header in a generic format. Multi-process approach is 
> nice because it allows you to parallelize the CPU work of replaying the 
> records as well, but I wonder how much that really scales given all the 
> locking required. Also, I don't think replaying WAL records is very 
> expensive CPU-wise. You'd need a pretty impressive RAID array to read 
> WAL from, to saturate a single CPU.

With all this talk, I thought of a much better way. We don't actually
need to apply the changes in the order they are received, we just need
to apply sufficient ordering to ensure that each block's changes are
applied in LSN order.

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in a WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers, apply immediately and
move on. If not in shared_buffers, then put it in the recovery cache.

When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
blockid, lsn. Then we scan through the records applying them in
sequence. That way we will accumulate changes on each block, so we only
need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead also. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache and also gain from L2 cache efficiency.

When recovery ends, empty the cache.
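The sort step could look roughly like this; the record layout and names are invented for illustration, assuming each cached entry carries its rmgr id, relation, block and LSN:

```c
#include <stdint.h>
#include <stdlib.h>

/* One buffered WAL record in the hypothetical recovery cache. */
typedef struct
{
	uint8_t		rmgr_id;	/* resource manager of the record */
	uint32_t	rel_oid;	/* relation modified */
	uint32_t	block_no;	/* block modified */
	uint64_t	lsn;		/* WAL position, to keep per-block order */
} CachedRecord;

/* Order by (rmgr, rel, block, lsn): all changes to one block become
 * adjacent and are applied in LSN order. */
static int
cached_record_cmp(const void *a, const void *b)
{
	const CachedRecord *x = a;
	const CachedRecord *y = b;

	if (x->rmgr_id != y->rmgr_id)
		return x->rmgr_id < y->rmgr_id ? -1 : 1;
	if (x->rel_oid != y->rel_oid)
		return x->rel_oid < y->rel_oid ? -1 : 1;
	if (x->block_no != y->block_no)
		return x->block_no < y->block_no ? -1 : 1;
	if (x->lsn != y->lsn)
		return x->lsn < y->lsn ? -1 : 1;
	return 0;
}

static void
sort_recovery_cache(CachedRecord *recs, size_t n)
{
	qsort(recs, n, sizeof(CachedRecord), cached_record_cmp);
}
```

After the sort, each block needs to be read into the buffer pool only once, with all its accumulated changes applied in one pass.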

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply(), returning bool, and rmgr_redo_cache(). The first will
apply the record if possible, otherwise place it in the cache. The second
gets called repeatedly during cache emptying.

That sounds like it might not be I/O efficient, in that it would suffer
from producer/consumer flip-flopping. But with a large maintenance_work_mem
you'll get the I/O from possibly hundreds of WAL files all accumulated
together before it is issued - assuming only a small % of WAL records go
into the cache, and many of those will then have their I/O reduced to
zero because of the sequential cache access.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >
> > > It's a good idea, but it will require more complex code. I prefer the
> > > simpler solution of using more processes to solve the I/O problem.
> >
> > Huh, I forgot about that idea. Ironically that was what I suggested when
> > Heikki described the problem.
> >
> > I think it's more complex than using posix_fadvise. But it's also more
> > ambitious. It would allow us to use not only the full random access i/o
> > bandwidth but also allow us to use more cpu. In cases where the database fits
> > entirely in ram and we're recovering many many operations modifying the same
> > blocks over and over that might help a lot.
>
> Actually, if you are modifying the same blocks over and over it will
> help *less*, because applying each record needs to occur only after the
> previous records that modify the same block have been applied.
>
> So you have two possibilities: you skip that record and try to apply the
> next one, hoping that that record applies to a block that's not locked,
> (which means you have to remember the skipped record and apply it
> sometime in the future), or you put the process to sleep until the lock
> has been released.

Ah, OK, I can see we're on the same lines of thought there. Just posted
a reply to Heikki about this sort of idea.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: [GENERAL] Slow PITR restore

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Allocate a recovery cache of size maintenance_work_mem that goes away
> when recovery ends.
> 
> For every block mentioned in WAL record that isn't an overwrite, first
> check shared_buffers. If it's in shared_buffers, apply immediately and
> move on. If not in shared_buffers then put in recovery cache.
> 
> When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
> blockid, lsn. Then we scan through the records applying them in
> sequence. That way we will accumulate changes on each block so we only
> need to request it once rather than thrashing the cache. We may get
> lucky and pick up some OS readahead also. We would also use buffer
> recycling when emptying the recovery cache, to ensure that we don't
> trash the main cache and also gain from L2 cache efficiency.
> 
> When recovery ends, empty the cache.

Hmm. That assumes that nothing other than WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it 
doesn't seem impossible that something like Florian's read-only queries 
on a standby server would change that.

> I think that is better than both methods mentioned, and definitely
> simpler than my brute-force method. It also lends itself to using both
> previously mentioned methods as additional techniques if we really
> needed to. I suspect reordering the I/Os in this way is going to make a
> huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going 
to be issuing only one read request at a time. The order of those 
requests will be better from cache hit point of view, which is good, but 
the problem remains that if the modified data blocks are scattered 
around the database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of 
long PITR archives, though. You could scan all the WAL records (or at 
least scan far ahead), and make note of where there are full page writes 
for each page. Whenever there's a full page write further ahead in the 
log, you could ignore all changes to that page before it, because they're 
going to be overwritten anyway. It won't help with normal recovery, 
because there won't be more than one full page image of each page after 
the last checkpoint, but with PITR it would help.
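A toy two-pass sketch of that idea, with invented names and a flat array standing in for a real (rel, block) hash table:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 16			/* toy limit; a real version would hash (rel, block) */

typedef struct
{
	uint32_t	block_no;
	uint64_t	lsn;
	int			is_full_page_write;
} ScanRecord;

/* Pass 1: remember the LSN of the last full-page image of each block. */
static void
note_full_page_writes(const ScanRecord *recs, size_t n,
					  uint64_t last_fpw_lsn[MAX_BLOCKS])
{
	for (size_t i = 0; i < MAX_BLOCKS; i++)
		last_fpw_lsn[i] = 0;
	for (size_t i = 0; i < n; i++)
		if (recs[i].is_full_page_write)
			last_fpw_lsn[recs[i].block_no] = recs[i].lsn;
}

/* Pass 2: a record is skippable if a later full-page write to the same
 * block will overwrite its effects anyway. */
static int
can_skip(const ScanRecord *rec, const uint64_t last_fpw_lsn[MAX_BLOCKS])
{
	return rec->lsn < last_fpw_lsn[rec->block_no];
}
```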

> Looks like each rmgr_redo call would need to be split into two calls:
> rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> apply if possible, otherwise place in cache. The second gets called
> repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a 
given WAL record will need to access. For multiple recovery processes, 
you need that to figure out which WAL records you can safely replay. In 
the posix_fadvise scheme, you need that to issue the posix_fadvises 
without modifying anything.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 20:25 +0000, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Allocate a recovery cache of size maintenance_work_mem that goes away
> > when recovery ends.
> > 
> > For every block mentioned in WAL record that isn't an overwrite, first
> > check shared_buffers. If it's in shared_buffers, apply immediately and
> > move on. If not in shared_buffers then put in recovery cache.
> > 
> > When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
> > blockid, lsn. Then we scan through the records applying them in
> > sequence. That way we will accumulate changes on each block so we only
> > need to request it once rather than thrashing the cache. We may get
> > lucky and pick up some OS readahead also. We would also use buffer
> > recycling when emptying the recovery cache, to ensure that we don't
> > trash the main cache and also gain from L2 cache efficiency.
> > 
> > When recovery ends, empty the cache.
> 
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it 
> doesn't seem impossible that something like Florian's read-only queries 
> on a stand by server would change that.

Agreed, though I was imagining using the cache as a secondary hash
table. Not sure about that now I write it. I think the accumulation idea
mostly makes sense for heaps. Indexes look much harder.

Whatever happens I think we should get Florian's work in there first,
then tune.

> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make a
> > huge difference to cache hit rates.
> 
> But it won't actually do anything to scale the I/O. You're still going 
> to be issuing only one read request at a time. The order of those 
> requests will be better from cache hit point of view, which is good, but 
> the problem remains that if the modified data blocks are scattered 
> around the database, you'll be doing random I/O, one request at a time.

Yeh, agreed. That's why I went for the parallelism approach originally:
you can't escape the basic physics.

I've re-read your post. If you think the buffer manager changes are
roughly the same as would be needed for other gains on index scans, then
your async I/O seems like the most profitable approach. I still don't
like it as much, but that aspect tips the balance, I think.

> It would be interesting to do something like that to speed up replay of 
> long PITR archives, though. You could scan all (or at least far ahead) 
> the WAL records, and make note of where there is full page writes for 
> each page. Whenever there's a full page write further ahead in the log, 
> you could ignore all changes to that page before that, because they're 
> going to be overwritten anyway. It won't help with normal recovery, 
> because there won't be more than one full page image of each page after 
> the last checkpoint, but with PITR it would help.

Another good idea.

Of course if we scan that far ahead we can start removing aborted
transactions also, which is the more standard optimization of recovery.

I was imagining we would just memory map the files rather than buffer
them explicitly, BTW.

> > Looks like each rmgr_redo call would need to be split into two calls:
> > rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> > apply if possible, otherwise place in cache. The second gets called
> > repeatedly during cache emptying.
> 
> Yeah, much like the split I had to do for the posix_fadvise.
> 
> It seems that in all the proposed schemes we need to know which blocks a 
> given WAL record will need to access. For multiple recovery processes, 
> you need that to figure out which WAL records you can safely replay. In 
> the posix_fadvise scheme, you need that to issue the posix_fadvises 
> without modifying anything.

Yeh, I think it should be easy enough to group together the block-based
rmgrs so they all have the same basic structure: heap and the indexes,
that is, though some parts are harder than others. My feeling is that
heaps will accumulate easily, while secondary indexes will often be random.

Incidentally, HOT will speed up recovery also, since there will be fewer
index operations to replay.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 21:13 +0000, Simon Riggs wrote:
> Of course if we scan that far ahead we can start removing aborted
> transactions also, which is the more standard optimization of
> recovery.

Recall that thought!

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: [GENERAL] Slow PITR restore

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it 
> doesn't seem impossible that something like Florian's read-only queries 
> on a stand by server would change that.

A general comment on this thread: the idea of putting any sort of
asynchronous behavior into WAL recovery gives me the willies.

Recovery is inherently one of the least-exercised parts of the system,
and it gets more so with each robustness improvement we make elsewhere.
Moreover, because it is fairly dumb, anything that does go wrong will
likely result in silent data corruption that may not be noted until much
later.  Any bugs we introduce into recovery will be very hard to find
... and timing-dependent ones will be damn near impossible.

So in my mind the watchword has got to be KISS.  If that means that
recovery isn't terribly speedy, so be it.   I'd far rather get the
right answer slower.

Also, I have not seen anyone provide a very credible argument why
we should spend a lot of effort on optimizing a part of the system
that is so little-exercised.  Don't tell me about warm standby
systems --- they are fine as long as recovery is at least as fast
as the original transactions, and no evidence has been provided to
suggest that it's not.
        regards, tom lane


Re: [GENERAL] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 16:41 -0500, Tom Lane wrote:

> Recovery is inherently one of the least-exercised parts of the system,
> and it gets more so with each robustness improvement we make elsewhere.
> Moreover, because it is fairly dumb, anything that does go wrong will
> likely result in silent data corruption that may not be noted until much
> later.  Any bugs we introduce into recovery will be very hard to find
> ... and timing-dependent ones will be damn near impossible.
> 
> So in my mind the watchword has got to be KISS.  If that means that
> recovery isn't terribly speedy, so be it.   I'd far rather get the
> right answer slower.

Very much agreed, and that's really the reason the main recovery code
has remained essentially untouched for so long. That thought was the #1
priority when writing PITR. Thanks for reminding me/us.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com



Re: [GENERAL] Slow PITR restore

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Also, I have not seen anyone provide a very credible argument why
> we should spend a lot of effort on optimizing a part of the system
> that is so little-exercised.  Don't tell me about warm standby
> systems --- they are fine as long as recovery is at least as fast
> as the original transactions, and no evidence has been provided to
> suggest that it's not.

Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in 
May. They had set up two identical systems, one running the benchmark 
and the other as a warm standby. The standby couldn't keep up; it 
couldn't replay the WAL as quickly as the primary server produced it. 
IIRC, replaying the WAL generated in a 1h benchmark run took 6 hours.

It sounds unbelievable at first, but the problem is that our WAL replay 
doesn't scale. On the primary server, you can have (and they did) a huge 
RAID array with dozens of disks, and a lot of concurrent activity 
keeping it busy. On the standby, we do all the same work, but with a 
single process. Every time we need to read in a page to modify it, we 
block. No matter how many disks you have in the array, it won't help, 
because we only issue one I/O request at a time.

That said, I think the change we made in Spring to not read in pages for 
full page writes will help a lot with that. It would be nice to see some 
new benchmark results to measure that. However, it didn't fix the 
underlying scalability problem.

One KISS approach would be to just do full page writes more often. It 
would obviously bloat the WAL, but it would make the replay faster.

Another reason you would care about fast recovery is PITR. If you do 
base backups only once a week, for example, when you need to recover 
using the archive, you might have to replay a week's worth of WAL in the 
worst case. You don't want to wait a week for the replay to finish.

-- 
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com


Re: [GENERAL] Slow PITR restore

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in 
> May. They had setup two identical systems, one running the benchmark, 
> and another one as a warm stand-by. The stand-by couldn't keep up; it 
> couldn't replay the WAL as quickly as the primary server produced it. 
> IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

[ shrug... ]  This is not consistent with my experience.  I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

> One KISS approach would be to just do full page writes more often. It 
> would obviously bloat the WAL, but it would make the replay faster.

... at the cost of making the primary lots slower.
        regards, tom lane


Re: [GENERAL] Slow PITR restore

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> It would be interesting to do something like that to speed up replay of long
> PITR archives, though. You could scan all (or at least far ahead) the WAL
> records, and make note of where there is full page writes for each page.
> Whenever there's a full page write further ahead in the log, you could ignore
> all changes to that page before that, because they're going to be overwritten
> anyway. 

Hm, you could generate that data when you generate the logs. Keep a hash of
block number and when the last full page write was. Then whenever you switch
log files dump out that hash in a second file alongside.

PITR recovery could read that in before it starts reading any file and consult
it before applying any records.
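A minimal sketch of that idea (hypothetical record layout, not PostgreSQL code): build a map of the last full-page write for each block, then consult it during replay to skip incremental changes that a later full-page image will overwrite anyway.

```python
# Sketch of Gregory's skip-map idea. Each WAL record is modeled as a dict
# with a relation id, block number, and an is_fpw flag marking full page
# writes; the real data structures would of course differ.

def build_fpw_map(wal_records):
    """Map (rel, block) -> index of the last full-page write for it."""
    fpw_at = {}
    for i, rec in enumerate(wal_records):
        if rec["is_fpw"]:
            fpw_at[(rec["rel"], rec["block"])] = i
    return fpw_at

def replay(wal_records, apply_fn):
    """Replay WAL, skipping changes doomed to be clobbered by a later FPW."""
    fpw_at = build_fpw_map(wal_records)
    applied = 0
    for i, rec in enumerate(wal_records):
        last_fpw = fpw_at.get((rec["rel"], rec["block"]), -1)
        # An incremental change before the block's last FPW is dead work.
        if not rec["is_fpw"] and i < last_fpw:
            continue
        apply_fn(rec)
        applied += 1
    return applied
```

In the scheme Gregory proposes, the map would be dumped per WAL segment at generation time, so the replay side only pays a lookup, not a scan-ahead.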

-- 
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's PostGIS support!


Re: [GENERAL] Slow PITR restore

From
Josh Berkus
Date:
Tom,

> [ shrug... ]  This is not consistent with my experience.  I can't help
> suspecting misconfiguration; perhaps shared_buffers much smaller on the
> backup, for example.

You're only going to see it on SMP systems which have a high degree of CPU 
utilization.  That is, when you have 16 cores processing flat-out, then 
the *single* core which will replay that log could certainly have trouble 
keeping up.  And this wouldn't be an issue which would show up testing on 
a dual-core system.

I don't have extensive testing data on that myself (I depended on Koichi's 
as well) but I do have another real-world case where our slow recovery 
time is a serious problem: clustered filesystem failover configurations, 
e.g. RHCFS, OpenHACluster, Veritas.  For those configurations, when one 
node fails PostgreSQL is started on a 2nd node against the same data ... 
and goes through recovery.  On very high-volume systems, the recovery can 
be quite slow, up to 15 minutes, which is a long time for a web site to be 
down.

I completely agree that we don't want to risk the reliability of recovery 
in attempts to speed it up, though, so maybe this isn't something we can 
do right now.  But I don't agree that it's not an issue for users.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: [GENERAL] Slow PITR restore

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>> [ shrug... ]  This is not consistent with my experience.  I can't help
>> suspecting misconfiguration; perhaps shared_buffers much smaller on the
>> backup, for example.

> You're only going to see it on SMP systems which have a high degree of CPU 
> utilization.  That is, when you have 16 cores processing flat-out, then 
> the *single* core which will replay that log could certainly have trouble 
> keeping up.

You are supposing that replay takes as much CPU as live query
processing, which is nonsense (at least as long as we don't load it
down with a bunch of added complexity ;-)).

The argument that Heikki actually made was that multiple parallel
queries could use more of the I/O bandwidth of a multi-disk array
than recovery could.  Which I believe, but I question how much of a
real-world problem it is.  For it to be an issue, you'd need a workload
that is almost all updates (else recovery wins by not having to
replicate reads of pages that don't get modified) and the updates have
to range over a working set significantly larger than physical RAM
(else I/O bandwidth won't be the bottleneck anyway).  I think we're
talking about an extremely small population of real users.
        regards, tom lane


Re: [GENERAL] Slow PITR restore

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> The argument that Heikki actually made was that multiple parallel
> queries could use more of the I/O bandwidth of a multi-disk array
> than recovery could.  Which I believe, but I question how much of a
> real-world problem it is.  For it to be an issue, you'd need a workload
> that is almost all updates (else recovery wins by not having to
> replicate reads of pages that don't get modified) and the updates have
> to range over a working set significantly larger than physical RAM
> (else I/O bandwidth won't be the bottleneck anyway).  I think we're
> talking about an extremely small population of real users.

Of course that describes most benchmarks pretty well...

I think of this as a scalability problem, not so much a sheer speed problem.
If Postgres isn't fast enough for you you should be able to buy a faster
processor or faster disk or faster something to run it faster. The problem
with this situation is that buying a faster raid controller doesn't help you.

-- 
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's RemoteDBA services!


Re: [GENERAL] Slow PITR restore

From
Markus Schiltknecht
Date:
Hi,

Alvaro Herrera wrote:
> Simon Riggs wrote:
> 
>> ISTM it's just autovacuum launcher + Hot Standby mixed.
> 
> I don't think you need a launcher at all.  Just get the postmaster to
> start a configurable number of wal-replay processes (currently the
> number is hardcoded to 1).

I also see similarity to what I do for Postgres-R: a manager and helper 
backends which can be started upon request. Such a scheme is currently 
used for autovacuum, I'm using it for replication, it could help for 
parallelizing recovery and it certainly helps for parallelizing queries 
as discussed in another thread.

Maybe it's worth considering a general framework for such a manager or 
auto-launcher, as well as for helper backends. It certainly depends on 
the complexity of that manager, but it should probably be a separate 
process.

What all of the helper backends have in common, AFAICT:
 - a connection to a database
 - no client connection
 - superuser privileges

(For parallelized queries, superuser privileges might appear wrong, but 
I'm arguing that parallelizing the rights checking isn't worth the 
trouble, so the initiating worker backend should do that and only 
delegate safe jobs to helper backends. Or is that a serious limitation 
in a way?)

Most code for that already exists, as we already have various helpers. 
What's missing, IMO, is a communication channel between the worker and 
helper backends as well as between the backends and the manager. That's 
needed, for example, for worker backends to be able to request helper 
backends and feed them with their wishes.
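As an illustration only, here is the request/feed pattern described above, modeled with Python threads and queues (IMessage itself is shared-memory based inside the postmaster; all names here are hypothetical):

```python
# Toy model of a worker backend feeding jobs to a pool of pre-started
# helper backends via message queues, and collecting their results.

import queue
import threading

def helper(requests, results):
    """An idle helper waits for jobs and reports results back."""
    while True:
        job = requests.get()
        if job is None:                # shutdown message from the manager
            break
        results.put(("done", job, job * job))  # stand-in for real work

def run_pool(jobs, n_helpers=2):
    requests, results = queue.Queue(), queue.Queue()
    pool = [threading.Thread(target=helper, args=(requests, results))
            for _ in range(n_helpers)]
    for t in pool:
        t.start()
    for j in jobs:                     # the worker feeds helpers its wishes
        requests.put(j)
    out = [results.get() for _ in jobs]
    for _ in pool:                     # the manager retires idle helpers
        requests.put(None)
    for t in pool:
        t.join()
    return out
```

The interesting design questions Markus raises (pooling idle helpers already connected to a database, handling helpers that die) sit on top of exactly this kind of channel.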

Unix pipes can only be set up between the parent and the child of a 
fork, they eat file descriptors, need to copy data to the kernel and 
back and IIRC, there were portability issues. That's why I've written 
the internal message passing (IMessage) stuff, see -patches [1].

I'm all for unifying such a manager process and generalizing the 
requesting and launching of helpers as well as management of their state 
(handling died helper processes, keeping a pool of idle helpers which 
are already connected to a database, etc..). Most of that already exists 
in my Postgres-R code, maybe I can derive a general purpose patch to 
start contributing code from Postgres-R?

Comments? Use cases I'm missing?

Regards

Markus

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php


Re: [GENERAL] Slow PITR restore

From
Hannu Krosing
Date:
On Fri, 2007-12-14 at 10:39, Markus Schiltknecht wrote:
> Hi,
>

> (For parallelized queries, superuser privileges might appear wrong, but
> I'm arguing that parallelizing the rights checking isn't worth the
> trouble, so the initiating worker backend should do that and only
> delegate safe jobs to hepler backends. Or is that a serious limitation
> in a way?)

at least functions defined with SECURITY DEFINER may be a problem

------------
Hannu




Re: [GENERAL] Slow PITR restore

From
Hannu Krosing
Date:
On Thu, 2007-12-13 at 20:25, Heikki Linnakangas wrote:
...
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it
> doesn't seem impossible that something like Florian's read-only queries
> on a stand by server would change that.
>
> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make a
> > huge difference to cache hit rates.
>
> But it won't actually do anything to scale the I/O. You're still going
> to be issuing only one read request at a time. The order of those
> requests will be better from cache hit point of view, which is good, but
> the problem remains that if the modified data blocks are scattered
> around the database, you'll be doing random I/O, one request at a time.

Why one-at-a-time ?

You could have a long list of pages that need to be read in, and ask for
them all at the same time.

Here's what I mean

1) Allocate buffers for N database pages, and a queue for N WAL records.
2) Read N WAL records into the WAL record queue, assign database page
numbers from these to buffer pages, and issue posix_fadvise() for each
as you go.
2a) If there were repeated pages and thus there are free buffers,
allocate queue items and read some more WAL records, assigning buffers
and fadvising, until N fubbers used.
3) Process the WAL record queue against the buffers read in by 2).
4) Write the buffers back to disk.

repeat from 2), freeing LRU buffers

Here the reads in 2) will be optimised by the system via
posix_fadvise(), and the work can also be split between multiple workers
by page-number hash or some other random/uniform means, to use more than
one CPU.
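A rough sketch of that batched-prefetch loop, in Python with hypothetical record and file layouts (the real thing would be C inside the startup process): hint the kernel about every page in the next batch via posix_fadvise(WILLNEED), then do the blocking reads, which can now overlap at the device level.

```python
# Sketch of Hannu's loop: prefetch hints for a whole batch of WAL
# records before reading any page. WAL records are modeled as dicts
# carrying just a block number; one data file stands in for the cluster.

import os

BLOCK_SIZE = 8192  # PostgreSQL page size

def replay_batched(data_fd, wal_records, apply_fn, batch=64):
    for start in range(0, len(wal_records), batch):
        chunk = wal_records[start:start + batch]
        # Step 2: issue prefetch hints for all distinct pages up front.
        for blkno in {rec["block"] for rec in chunk}:
            if hasattr(os, "posix_fadvise"):  # POSIX-only call
                os.posix_fadvise(data_fd, blkno * BLOCK_SIZE,
                                 BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
        # Step 3: these reads now hit pages already in flight or cached.
        for rec in chunk:
            os.lseek(data_fd, rec["block"] * BLOCK_SIZE, os.SEEK_SET)
            page = os.read(data_fd, BLOCK_SIZE)
            apply_fn(rec, page)
        # Step 4 (writing modified buffers back) omitted for brevity.
```

With batch=1 this degenerates to today's one-request-at-a-time behavior; the win comes entirely from the kernel seeing N requests at once.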

-------------
Hannu



Re: [GENERAL] Slow PITR restore

From
Markus Schiltknecht
Date:
Hello Hannu,

Hannu Krosing wrote:
>> (For parallelized queries, superuser privileges might appear wrong, but 
>> I'm arguing that parallelizing the rights checking isn't worth the 
>> trouble, so the initiating worker backend should do that and only 
>> delegate safe jobs to helper backends. Or is that a serious limitation 
>> in a way?)
> 
> at least functions defined with SECURITY DEFINER; may be a problem

Uhm.. what I had in mind was parallelizing sequential scans, index scans, 
joins and such - database internal stuff.

Parallelizing user defined functions (or what did you have in mind?) is 
more difficult and sometimes impossible, because the planner cannot know 
ahead, what the function's going to do.

However, thinking about it, maybe, one could and should try to 
parallelize computationally intensive IMMUTABLE functions. But already 
with STABLE ones I'm getting suspicious. It would require users to write 
truly thread-safe (well, multi-process-safe) functions, which I somewhat 
doubt they do. Granted, they theoretically *should* be safe, but...

Anyway, if that's the only show stopper, one could certainly tell helper 
backends to substitute their superuser privileges with the invoker's 
privileges. Not sure if that's worth the trouble, though.

Regards

Markus



Re: [GENERAL] Slow PITR restore

From
Markus Schiltknecht
Date:
Hannu Krosing wrote:
> until N fubbers used

..whatever a fubber is :-)

Nice typo!

Markus