Thread: Re: [GENERAL] Slow PITR restore
On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> >
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally). In short, if we are in recovery, and
> >> we are not saturating the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> >
> > You sure about that? I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run. What I saw was that the machine was 100%
> > disk write bound. Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
>
> There are at least three definitions of "saturating the I/O", and it sounds
> like you two are using two different ones.
>
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable of
>
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a RAID array
> there can be a large difference, and the wider the RAID stripe the larger
> the difference.
>
> In Tom's results:
>
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
> >  0  9  70024  29232  19876 824368    0    0     0  3152 1447   233  0  1  0 99  0
> >  0  9  70024  29232  19876 824368    0    0     0  3660 1474   252  0  1  0 99  0
> >  0  8  70024  28960  19876 824404    0    0     0  3176 1448   265  0  2  1 97  0
> >
> > I don't see the machine sitting around doing nothing ...
>
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms, which is
> actually pretty fast. So if this is a single drive, (1) and (2) seem to be
> pretty much the same here.
>
> But note that if this were a RAID array Postgres wouldn't be getting any
> better results. A RAID array wouldn't improve I/O latency at all, and since
> it's already 99% waiting for I/O Postgres is not going to be able to issue
> any more. But only one drive in the RAID array would be busy at a time,
> which would be far less than the maximum random access I/O the RAID array
> is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the I/O
scalability issue could be such a large problem. Koichi showed me some
results on a much larger server that illustrated the I/O problem.

But let's remember it's only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but far fewer than even Josh's 10%.

> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recovery will need before actually attempting
> to recover them. On a RAID array that would bring the 3MB/s above up to the
> maximum number of random accesses the RAID array can handle (ie, definition
> (2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.
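[To make the read-ahead idea concrete, here is a minimal C sketch of issuing
a prefetch hint with posix_fadvise(). The call itself is standard POSIX; the
BLCKSZ value and the helper name are illustrative assumptions, not code from
any patch discussed in this thread.]

    /* Minimal sketch: hint the kernel that a data block will be needed
     * soon, so a RAID array can service many such reads in parallel while
     * recovery keeps decoding WAL. prefetch_block() is an assumed name. */
    #define _XOPEN_SOURCE 600       /* expose posix_fadvise */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    static void
    prefetch_block(int fd, long blockno)
    {
        /* posix_fadvise returns an errno value directly, 0 on success */
        int rc = posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                               POSIX_FADV_WILLNEED);

        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    }

The hint is asynchronous: the kernel can queue many such reads at once,
which is exactly what lifts a RAID array from definition (1) toward
definition (2) above.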
Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the AV
solution where we have lots of recovery workers, with one recovery master
reading the WAL files. We can then distribute WAL records to workers in
some manner.

It's true that many WAL records depend upon each other, but it's also true
that the performance problems only occur in the situation where the WAL
records don't depend upon each other. If they did, they would touch the
same blocks, and those blocks would already be cached. So as long as we
have a safe mechanism for splitting up the work, everything is fine. We can
divide up the WAL records by looking at the rmgr field, plus looking deeper
into the records themselves to see which relations/blocks they touch; a
sketch of that routing follows below.

So I'm planning to review this *after* Florian has introduced his patch, so
we can build upon it.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
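[A hedged sketch of one way such a master might route records, assuming for
illustration that each record's relation and block number have already been
decoded; the struct and function names here are hypothetical, not from any
patch. Records touching the same (relation, block) hash to the same worker,
so their relative WAL order is preserved.]

    /* Route WAL records to recovery workers so that any two records
     * touching the same (relation, block) always land on the same worker
     * and are thus replayed in WAL order. All names are assumptions. */
    #include <stdint.h>

    typedef struct WalRecordInfo
    {
        int      rmgr_id;     /* resource manager of the record */
        uint32_t rel_node;    /* relation the record touches */
        uint32_t block_no;    /* block within that relation */
    } WalRecordInfo;

    static int
    choose_worker(const WalRecordInfo *rec, int n_workers)
    {
        uint32_t h = (uint32_t) rec->rmgr_id;

        h = h * 2654435761u + rec->rel_node;  /* multiplicative hashing */
        h = h * 2654435761u + rec->block_no;
        return (int) (h % (uint32_t) n_workers);
    }

Records that touch multiple blocks, or no block at all (commits, for
instance), would need special handling that this sketch glosses over.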
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote: >> Heikki proposed a while back to use posix_fadvise() when processing logs to >> read-ahead blocks which the recover will need before actually attempting to >> recover them. On a raid array that would bring the 3MB/s above up to the >> maximum number of random accesses the raid array can handle (ie, definition >> (2) above). > > It's a good idea, but it will require more complex code. I prefer the > simpler solution of using more processes to solve the I/O problem. Huh, I forgot about that idea. Ironically that was what I suggested when Heikki described the problem. I think it's more complex than using posix_fadvise. But it's also more ambitious. It would allow us to use not only the full random access i/o bandwidth but also allow us to use more cpu. In cases where the database fits entirely in ram and we're recovering many many operations modifying the same blocks over and over that might help a lot. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
On Thu, 2007-12-13 at 09:45 +0000, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically, that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise.

Some handwaving...

ISTM it's just autovacuum launcher + Hot Standby mixed. I guess I've added
two new backends now (Tom groans...), so that part seems very
straightforward. The harder part is distributing the work, and the hardest
part is doing that evenly enough to make a difference. It will also require
rmgr changes for state handling, but then I think that needs work anyway.

Maybe we don't even need a master. We would have readbuffers in shared
memory, like wal_buffers in reverse. Each worker would read the next WAL
record and check there is no conflict with other concurrent WAL records. If
not, it will apply the record immediately, otherwise wait for the
conflicting worker to complete.

> But it's also more ambitious.

Time is the issue, I think, so you may be right. That's always why I grunt
so much about freeze dates.

Anyway, I'll leave this now, since I think we need to do Florian's work
first either way, and that is much more eagerly awaited I think.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > We would have readbuffers in shared memory, like wal_buffers in reverse. > Each worker would read the next WAL record and check there is no > conflict with other concurrent WAL records. If not, it will apply the > record immediately, otherwise wait for the conflicting worker to > complete. Well I guess you would have to bring up the locking infrastructure and lock any blocks in the record you're applying (sorted first to avoid deadlocks). As I understand it we don't use locks during recovery now but I'm not sure if that's just because we don't have to or if there are practical problems which would have to be solved to do so. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
>> We would have readbuffers in shared memory, like wal_buffers in reverse.
>> Each worker would read the next WAL record and check there is no
>> conflict with other concurrent WAL records. If not, it will apply the
>> record immediately, otherwise wait for the conflicting worker to
>> complete.
>
> Well, I guess you would have to bring up the locking infrastructure and
> lock any blocks in the record you're applying (sorted first to avoid
> deadlocks). As I understand it we don't use locks during recovery now, but
> I'm not sure if that's just because we don't have to or if there are
> practical problems which would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock on
the buffer. According to the comments there, it wouldn't be strictly
necessary, but I believe we do actually need it to protect against the
bgwriter writing out a buffer while it's being modified. We only lock one
page at a time, which is good enough for WAL replay, but not enough to
protect things like a b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in
recovery a while ago. First of all, there are the changes to the buffer
manager, which we'd need anyway if we wanted to use posix_fadvise for
speeding up other stuff like index scans. Then there are changes to xlog.c
to buffer a number of WAL records, so that you can read ahead the data
pages needed by WAL records further ahead than the record you're actually
replaying.

I added a new function, readahead, to the rmgr API. It's similar to the
redo function, but it doesn't actually replay the WAL record; it just
issues the fadvise calls to the buffer manager for the pages that are
needed to replay the WAL record. This needs to be implemented for each
resource manager that we want to do readahead for. If we had the list of
blocks in the WAL record in an rmgr-independent format, we could do it in a
more generic way, like we do the backup block restoration; a rough sketch
of the API split follows below.

The multiple-process approach seems a lot more complex to me. You need a
lot of bookkeeping to keep the processes from stepping on each other's
toes, and to choose the next WAL record to replay. I think you have the
same problem, in that you need an rmgr-specific function to extract the
data block #s required to replay a WAL record, or you must add that list to
the WAL record header in a generic format. The multi-process approach is
nice because it allows you to parallelize the CPU work of replaying the
records as well, but I wonder how much that really scales given all the
locking required. Also, I don't think replaying WAL records is very
expensive CPU-wise. You'd need a pretty impressive RAID array to read WAL
from to saturate a single CPU.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
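[A hedged sketch of the rmgr split described above, with invented type and
field names; the actual prototype may look quite different. Each resource
manager gains a readahead callback that only issues prefetch hints,
alongside the redo callback that does the replay.]

    /* Sketch of an rmgr table carrying both redo and readahead callbacks.
     * XLogRecord is treated as opaque; RmgrData's layout here is an
     * assumption for illustration, not the real prototype. */
    typedef struct XLogRecord XLogRecord;

    typedef struct RmgrData
    {
        const char *rm_name;
        /* replays the record against the data pages */
        void (*rm_redo)(XLogRecord *record);
        /* issues prefetch hints for the pages rm_redo will need;
         * must not modify anything */
        void (*rm_readahead)(XLogRecord *record);
    } RmgrData;

During replay, xlog.c could then walk a window of buffered records calling
rm_readahead on each before calling rm_redo on the oldest one, so the reads
for upcoming records overlap with the current record's replay.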
Simon,

On Dec 13, 2007 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Anyway, I'll leave this now, since I think we need to do Florian's work
> first either way, and that is much more eagerly awaited I think.

Speaking of that, is there any news about it and about Florian? It was
really promising work.

Thanks.

--
Guillaume
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically, that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise. But it's also more
> ambitious. It would allow us to use not only the full random access I/O
> bandwidth but also more CPU. In cases where the database fits entirely in
> RAM and we're recovering many, many operations modifying the same blocks
> over and over, that might help a lot.

Actually, if you are modifying the same blocks over and over it will help
*less*, because applying each record can only occur after the previous
records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that it applies to a block that's not locked (which means
you have to remember the skipped record and apply it sometime in the
future), or you put the process to sleep until the lock has been released.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"The conclusion we can draw from those studies is that
we can draw no conclusion from them" (Tanenbaum)
Simon Riggs wrote:
> ISTM it's just autovacuum launcher + Hot Standby mixed.

I don't think you need a launcher at all. Just get the postmaster to start
a configurable number of wal-replay processes (currently the number is
hardcoded to 1).

--
Alvaro Herrera                http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude." (Brian Kernighan)
On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote:
> We do use locks during recovery: XLogReadBuffer takes an exclusive lock
> on the buffer. [...]
>
> The multiple-process approach seems a lot more complex to me. You need a
> lot of bookkeeping to keep the processes from stepping on each other's
> toes, and to choose the next WAL record to replay. [...] Also, I don't
> think replaying WAL records is very expensive CPU-wise. You'd need a
> pretty impressive RAID array to read WAL from to saturate a single CPU.

With all this talk, I thought of a much better way. We don't actually need
to apply the changes in the order they are received; we just need to apply
sufficient ordering to ensure that each block's changes are applied in LSN
order.

Allocate a recovery cache of size maintenance_work_mem that goes away when
recovery ends.

For every block mentioned in a WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers, apply immediately and move
on. If not in shared_buffers, then put it in the recovery cache.

When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
blockid, lsn, then scan through the records applying them in sequence. That
way we accumulate the changes to each block, so we only need to request it
once rather than thrashing the cache. We may get lucky and pick up some OS
readahead also. We would also use buffer recycling when emptying the
recovery cache, to ensure that we don't trash the main cache and also gain
from L2 cache efficiency.

When recovery ends, empty the cache.

I think that is better than both methods mentioned, and definitely simpler
than my brute-force method. It also lends itself to using both previously
mentioned methods as additional techniques if we really needed to. I
suspect reordering the I/Os in this way is going to make a huge difference
to cache hit rates.

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply(), returning bool, and rmgr_redo_cache(). The first applies
the record if possible, otherwise places it in the cache. The second gets
called repeatedly during cache emptying; a sketch of the sort order is
below.

That sounds like it might not be I/O efficient, in that it could suffer
from producer/consumer flip-flopping. But with a large maintenance_work_mem
you'll get all the I/O from possibly hundreds of WAL files accumulated
together before it is issued - assuming only a small % of WAL records go
into the cache, and then many of those will have their I/O reduced to zero
because of the sequential cache access.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
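[A minimal sketch of that sort order, with hypothetical struct and field
names; the real WAL record layout differs. The comparator orders cached
records by (rmgr, relation, block, LSN), so all changes to one block come
out adjacent and in LSN order.]

    /* Sort cached WAL records so each block's changes are applied
     * together, in LSN order. All names are illustrative assumptions. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct CachedRecord
    {
        int      rmgr_id;
        uint32_t rel_node;
        uint32_t block_no;
        uint64_t lsn;
        /* ... record body ... */
    } CachedRecord;

    static int
    cached_record_cmp(const void *a, const void *b)
    {
        const CachedRecord *ra = a, *rb = b;

        if (ra->rmgr_id != rb->rmgr_id)
            return ra->rmgr_id < rb->rmgr_id ? -1 : 1;
        if (ra->rel_node != rb->rel_node)
            return ra->rel_node < rb->rel_node ? -1 : 1;
        if (ra->block_no != rb->block_no)
            return ra->block_no < rb->block_no ? -1 : 1;
        if (ra->lsn != rb->lsn)
            return ra->lsn < rb->lsn ? -1 : 1;
        return 0;
    }

    /* emptying the cache:
     *   qsort(cache, n, sizeof(CachedRecord), cached_record_cmp);
     * then walk the array, fetching each block once and applying its
     * accumulated records in LSN order */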
On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote:
> Actually, if you are modifying the same blocks over and over it will
> help *less*, because applying each record can only occur after the
> previous records that modify the same block have been applied.
>
> So you have two possibilities: you skip that record and try to apply the
> next one, hoping that it applies to a block that's not locked (which
> means you have to remember the skipped record and apply it sometime in
> the future), or you put the process to sleep until the lock has been
> released.

Ah, OK, I can see we're on the same line of thought there. Just posted a
reply to Heikki about this sort of idea.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> Allocate a recovery cache of size maintenance_work_mem that goes away
> when recovery ends.
>
> For every block mentioned in a WAL record that isn't an overwrite, first
> check shared_buffers. If it's in shared_buffers, apply immediately and
> move on. If not in shared_buffers, then put it in the recovery cache.
>
> When the cache fills, empty it. Qsort the WAL records by rmgrid, rel,
> blockid, lsn, then scan through the records applying them in sequence.
> That way we accumulate the changes to each block, so we only need to
> request it once rather than thrashing the cache. We may get lucky and
> pick up some OS readahead also. We would also use buffer recycling when
> emptying the recovery cache, to ensure that we don't trash the main
> cache and also gain from L2 cache efficiency.
>
> When recovery ends, empty the cache.

Hmm. That assumes that nothing other than WAL replay will read pages into
shared buffers. I guess that's true at the moment, but it doesn't seem
impossible that something like Florian's read-only queries on a standby
server would change that.

> I think that is better than both methods mentioned, and definitely
> simpler than my brute-force method. It also lends itself to using both
> previously mentioned methods as additional techniques if we really
> needed to. I suspect reordering the I/Os in this way is going to make a
> huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going to
be issuing only one read request at a time. The order of those requests
will be better from a cache hit point of view, which is good, but the
problem remains that if the modified data blocks are scattered around the
database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of
long PITR archives, though. You could scan all (or at least far ahead) of
the WAL records, and make note of where there are full page writes for each
page. Whenever there's a full page write further ahead in the log, you
could ignore all changes to that page before that, because they're going to
be overwritten anyway. It won't help with normal recovery, because there
won't be more than one full page image of each page after the last
checkpoint, but with PITR it would help; a sketch is below.

> Looks like each rmgr_redo call would need to be split into two calls:
> rmgr_redo_apply(), returning bool, and rmgr_redo_cache(). The first
> applies the record if possible, otherwise places it in the cache. The
> second gets called repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a
given WAL record will need to access. For multiple recovery processes, you
need that to figure out which WAL records you can safely replay. In the
posix_fadvise scheme, you need it to issue the posix_fadvises without
modifying anything.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
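[A hedged sketch of that scan-ahead pass, with all names invented. Pass 1
records, per page, the LSN of the newest full-page image found anywhere
ahead in the log; pass 2 replays, skipping any record for a page whose
changes a later image will overwrite. A real version would hash rather than
scan linearly.]

    /* Two-pass PITR replay sketch; everything here is illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_TRACKED_PAGES 1024

    typedef struct FpwEntry
    {
        uint32_t rel_node;
        uint32_t block_no;
        uint64_t last_fpw_lsn;  /* newest full-page-image LSN for page */
    } FpwEntry;

    static FpwEntry fpw_map[MAX_TRACKED_PAGES];
    static int      fpw_count = 0;

    /* pass 1: called for each full-page write found while scanning ahead */
    static void
    fpw_note(uint32_t rel, uint32_t blk, uint64_t lsn)
    {
        for (int i = 0; i < fpw_count; i++)
            if (fpw_map[i].rel_node == rel && fpw_map[i].block_no == blk)
            {
                if (lsn > fpw_map[i].last_fpw_lsn)
                    fpw_map[i].last_fpw_lsn = lsn;
                return;
            }
        if (fpw_count < MAX_TRACKED_PAGES)
            fpw_map[fpw_count++] = (FpwEntry){ rel, blk, lsn };
    }

    /* pass 2: true if a later full-page image makes this record redundant */
    static bool
    fpw_can_skip(uint32_t rel, uint32_t blk, uint64_t record_lsn)
    {
        for (int i = 0; i < fpw_count; i++)
            if (fpw_map[i].rel_node == rel && fpw_map[i].block_no == blk)
                return record_lsn < fpw_map[i].last_fpw_lsn;
        return false;
    }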
On Thu, 2007-12-13 at 20:25 +0000, Heikki Linnakangas wrote:
> Hmm. That assumes that nothing other than WAL replay will read pages
> into shared buffers. I guess that's true at the moment, but it doesn't
> seem impossible that something like Florian's read-only queries on a
> standby server would change that.

Agreed, though I was imagining using the cache as a secondary hash table.
Not sure about that now I write it. I think the accumulation idea mostly
makes sense for heaps. Indexes look much harder.

Whatever happens, I think we should get Florian's work in there first, then
tune.

> But it won't actually do anything to scale the I/O. You're still going
> to be issuing only one read request at a time. The order of those
> requests will be better from a cache hit point of view, which is good,
> but the problem remains that if the modified data blocks are scattered
> around the database, you'll be doing random I/O, one request at a time.

Yeh, agreed. That's why I went for the parallelism approach originally: you
can't escape the basic physics.

I've re-read your post. If you think the buffer manager changes are roughly
the same as would be needed for other gains on index scans, then your async
I/O seems like the most profitable approach. I still don't like it as much,
but that aspect tips the balance, I think.

> It would be interesting to do something like that to speed up replay of
> long PITR archives, though. You could scan all (or at least far ahead)
> of the WAL records, and make note of where there are full page writes
> for each page. Whenever there's a full page write further ahead in the
> log, you could ignore all changes to that page before that, because
> they're going to be overwritten anyway. It won't help with normal
> recovery, because there won't be more than one full page image of each
> page after the last checkpoint, but with PITR it would help.

Another good idea. Of course, if we scan that far ahead we can start
removing aborted transactions also, which is the more standard optimization
of recovery. I was imagining we would just memory-map the files rather than
buffer them explicitly, BTW.

> Yeah, much like the split I had to do for the posix_fadvise.
>
> It seems that in all the proposed schemes we need to know which blocks a
> given WAL record will need to access. For multiple recovery processes,
> you need that to figure out which WAL records you can safely replay. In
> the posix_fadvise scheme, you need it to issue the posix_fadvises
> without modifying anything.

Yeh, I think it should be easy enough to group together the block-based
rmgrs so they all have the same basic structure: heap and the indexes, that
is, though some parts are harder than others. My feeling is that heap
changes will easily accumulate, though secondary indexes will often be
random.

Incidentally, HOT will speed up recovery also, since there will be fewer
index operations to replay.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
On Thu, 2007-12-13 at 21:13 +0000, Simon Riggs wrote:
> Of course, if we scan that far ahead we can start removing aborted
> transactions also, which is the more standard optimization of
> recovery.

Recall that thought!

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hmm. That assumes that nothing other than WAL replay will read pages
> into shared buffers. I guess that's true at the moment, but it doesn't
> seem impossible that something like Florian's read-only queries on a
> standby server would change that.

A general comment on this thread: the idea of putting any sort of
asynchronous behavior into WAL recovery gives me the willies. Recovery is
inherently one of the least-exercised parts of the system, and it gets more
so with each robustness improvement we make elsewhere. Moreover, because it
is fairly dumb, anything that does go wrong will likely result in silent
data corruption that may not be noted until much later. Any bugs we
introduce into recovery will be very hard to find ... and timing-dependent
ones will be damn near impossible.

So in my mind the watchword has got to be KISS. If that means that recovery
isn't terribly speedy, so be it. I'd far rather get the right answer
slower.

Also, I have not seen anyone provide a very credible argument why we should
spend a lot of effort on optimizing a part of the system that is so
little-exercised. Don't tell me about warm standby systems --- they are
fine as long as recovery is at least as fast as the original transactions,
and no evidence has been provided to suggest that it's not.

			regards, tom lane
On Thu, 2007-12-13 at 16:41 -0500, Tom Lane wrote:
> Recovery is inherently one of the least-exercised parts of the system,
> and it gets more so with each robustness improvement we make elsewhere.
> Moreover, because it is fairly dumb, anything that does go wrong will
> likely result in silent data corruption that may not be noted until much
> later. Any bugs we introduce into recovery will be very hard to find
> ... and timing-dependent ones will be damn near impossible.
>
> So in my mind the watchword has got to be KISS. If that means that
> recovery isn't terribly speedy, so be it. I'd far rather get the
> right answer slower.

Very much agreed, and really that is the reason the main recovery code has
remained essentially untouched for so long. That thought was the #1
priority when writing PITR. Thanks for reminding me/us.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
Tom Lane wrote:
> Also, I have not seen anyone provide a very credible argument why
> we should spend a lot of effort on optimizing a part of the system
> that is so little-exercised. Don't tell me about warm standby
> systems --- they are fine as long as recovery is at least as fast
> as the original transactions, and no evidence has been provided to
> suggest that it's not.

Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
May. They had set up two identical systems, one running the benchmark, and
another one as a warm standby. The standby couldn't keep up; it couldn't
replay the WAL as quickly as the primary server produced it. IIRC,
replaying WAL generated in a 1h benchmark run took 6 hours.

It sounds unbelievable at first, but the problem is that our WAL replay
doesn't scale. On the primary server, you can have (and they did) a huge
RAID array with dozens of disks, and a lot of concurrent activity keeping
it busy. On the standby, we do all the same work, but in a single process.
Every time we need to read in a page to modify it, we block. No matter how
many disks you have in the array, it won't help, because we only issue one
I/O request at a time.

That said, I think the change we made in the spring to not read in pages
for full page writes will help a lot with that. It would be nice to see
some new benchmark results to measure it. However, it didn't fix the
underlying scalability problem.

One KISS approach would be to just do full page writes more often. It would
obviously bloat the WAL, but it would make the replay faster.

Another reason you would care about fast recovery is PITR. If you do base
backups only once a week, for example, then when you need to recover using
the archive you might have to replay a week's worth of WAL in the worst
case. You don't want to wait a week for the replay to finish.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
> May. They had set up two identical systems, one running the benchmark,
> and another one as a warm standby. The standby couldn't keep up; it
> couldn't replay the WAL as quickly as the primary server produced it.
> IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

[ shrug... ] This is not consistent with my experience. I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

> One KISS approach would be to just do full page writes more often. It
> would obviously bloat the WAL, but it would make the replay faster.

... at the cost of making the primary lots slower.

			regards, tom lane
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > It would be interesting to do something like that to speed up replay of long > PITR archives, though. You could scan all (or at least far ahead) the WAL > records, and make note of where there is full page writes for each page. > Whenever there's a full page write further ahead in the log, you could ignore > all changes to that page before that, because they're going to be overwritten > anyway. Hm, you could generate that data when you generate the logs. Keep a hash of block number and when the last full page write was. Then whenever you switch log files dump out that hash in a second file alongside. PITR recovery could read that in before it starts reading any file and consult it before applying any records. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Tom,

> [ shrug... ] This is not consistent with my experience. I can't help
> suspecting misconfiguration; perhaps shared_buffers much smaller on the
> backup, for example.

You're only going to see it on SMP systems which have a high degree of CPU
utilization. That is, when you have 16 cores processing flat-out, the
*single* core which replays that log could certainly have trouble keeping
up. And this isn't an issue which would show up in testing on a dual-core
system.

I don't have extensive testing data on that myself (I depended on Koichi's
as well), but I do have another real-world case where our slow recovery
time is a serious problem: clustered filesystem failover configurations,
e.g. RHCFS, OpenHACluster, Veritas. In those configurations, when one node
fails, PostgreSQL is started on a second node against the same data ... and
goes through recovery. On very high-volume systems, the recovery can be
quite slow, up to 15 minutes, which is a long time for a web site to be
down.

I completely agree that we don't want to risk the reliability of recovery
in attempts to speed it up, though, so maybe this isn't something we can do
right now. But I don't agree that it's not an issue for users.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>> [ shrug... ] This is not consistent with my experience. I can't help
>> suspecting misconfiguration; perhaps shared_buffers much smaller on the
>> backup, for example.
>
> You're only going to see it on SMP systems which have a high degree of
> CPU utilization. That is, when you have 16 cores processing flat-out,
> the *single* core which replays that log could certainly have trouble
> keeping up.

You are supposing that replay takes as much CPU as live query processing,
which is nonsense (at least as long as we don't load it down with a bunch
of added complexity ;-)).

The argument that Heikki actually made was that multiple parallel queries
could use more of the I/O bandwidth of a multi-disk array than recovery
could. Which I believe, but I question how much of a real-world problem it
is. For it to be an issue, you'd need a workload that is almost all updates
(else recovery wins by not having to replicate reads of pages that don't
get modified), and the updates have to range over a working set
significantly larger than physical RAM (else I/O bandwidth won't be the
bottleneck anyway). I think we're talking about an extremely small
population of real users.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > The argument that Heikki actually made was that multiple parallel > queries could use more of the I/O bandwidth of a multi-disk array > than recovery could. Which I believe, but I question how much of a > real-world problem it is. For it to be an issue, you'd need a workload > that is almost all updates (else recovery wins by not having to > replicate reads of pages that don't get modified) and the updates have > to range over a working set significantly larger than physical RAM > (else I/O bandwidth won't be the bottleneck anyway). I think we're > talking about an extremely small population of real users. Of course that describes most benchmarks pretty well... I think of this as a scalability problem, not so much a sheer speed problem. If Postgres isn't fast enough for you you should be able to buy a faster processor or faster disk or faster something to run it faster. The problem with this situation is that buying a faster raid controller doesn't help you. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Hi,

Alvaro Herrera wrote:
> Simon Riggs wrote:
>
>> ISTM it's just autovacuum launcher + Hot Standby mixed.
>
> I don't think you need a launcher at all. Just get the postmaster to
> start a configurable number of wal-replay processes (currently the
> number is hardcoded to 1).

I also see similarity to what I do for Postgres-R: a manager and helper
backends which can be started upon request. Such a scheme is currently used
for autovacuum, I'm using it for replication, it could help for
parallelizing recovery, and it certainly helps for parallelizing queries as
discussed in another thread.

Maybe it's worth considering a general framework for such a manager or auto
launcher, as well as helper backends. It certainly depends on the
complexity of that manager, but it should probably better be an external
process.

What all of the helper backends have in common, AFAICT:

 - a connection to a database
 - no client connection
 - superuser privileges

(For parallelized queries, superuser privileges might appear wrong, but I'm
arguing that parallelizing the rights checking isn't worth the trouble, so
the initiating worker backend should do that and only delegate safe jobs to
helper backends. Or is that a serious limitation in a way?)

Most code for that already exists, as we already have various helpers.
What's missing, IMO, is a communication channel between the worker and
helper backends, as well as between the backends and the manager. That's
needed e.g. for worker backends being able to request helper backends and
feed them with their wishes. Unix pipes can only be set up between the
parent and the child of a fork, they eat file descriptors, data needs to be
copied to the kernel and back, and IIRC there were portability issues.
That's why I've written the internal message passing (IMessage) stuff, see
-patches [1].

I'm all for unifying such a manager process and generalizing the requesting
and launching of helpers, as well as management of their state (handling
dead helper processes, keeping a pool of idle helpers which are already
connected to a database, etc.). Most of that already exists in my
Postgres-R code; maybe I can derive a general-purpose patch to start
contributing code from Postgres-R?

Comments? Use cases I'm missing?

Regards

Markus

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php
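[For a flavor of what such a shared-memory message channel involves, here
is a guess at the shape of a message, not the actual IMessage layout from
the linked patch; all names and the enum values are invented.]

    /* Rough sketch: a small header followed by a payload, queued in
     * shared memory per recipient backend. */
    #include <stdint.h>

    typedef enum
    {
        IMSG_REQUEST_HELPER,   /* worker asks the manager for a helper */
        IMSG_HELPER_READY,     /* manager tells worker which helper */
        IMSG_JOB,              /* worker feeds a job to its helper */
        IMSG_JOB_DONE          /* helper reports completion */
    } IMessageType;

    typedef struct IMessageHdr
    {
        IMessageType type;
        int32_t      sender_pid;   /* backend that sent the message */
        uint32_t     payload_len;  /* payload bytes after the header */
        /* payload follows immediately in shared memory */
    } IMessageHdr;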
On Fri, 2007-12-14 at 10:39, Markus Schiltknecht wrote:
> Hi,
>
> (For parallelized queries, superuser privileges might appear wrong, but
> I'm arguing that parallelizing the rights checking isn't worth the
> trouble, so the initiating worker backend should do that and only
> delegate safe jobs to helper backends. Or is that a serious limitation
> in a way?)

At least functions defined with SECURITY DEFINER may be a problem.

------------
Hannu
On Thu, 2007-12-13 at 20:25, Heikki Linnakangas wrote:

...

> Hmm. That assumes that nothing other than WAL replay will read pages
> into shared buffers. I guess that's true at the moment, but it doesn't
> seem impossible that something like Florian's read-only queries on a
> standby server would change that.
>
> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make
> > a huge difference to cache hit rates.
>
> But it won't actually do anything to scale the I/O. You're still going
> to be issuing only one read request at a time. The order of those
> requests will be better from a cache hit point of view, which is good,
> but the problem remains that if the modified data blocks are scattered
> around the database, you'll be doing random I/O, one request at a time.

Why one at a time? You could have a long list of pages that need to be read
in, and ask for them all at the same time. Here's what I mean (sketched in
code below):

1) Allocate buffers for N database pages, and a queue for N WAL records.

2) Read N WAL records into the WAL record queue, assign database page
   numbers from these to buffer pages, and issue posix_fadvise() for all
   of them as you go.

2a) If there were repeated pages and thus there are free buffers, allocate
    queue items and read some more WAL records, assigning buffers and
    fadvising, until N fubbers are used.

3) Process the WAL record queue against the buffers read in by 2).

4) Write the buffers back to disk.

Repeat from 2), freeing LRU buffers.

Here the reads in 2) will be optimised by the system via posix_fadvise, and
the caches can also be split between multiple workers by page number hash,
or some other random/uniform means, to use more than one CPU.

-------------
Hannu
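[A compressed C sketch of that pipeline under assumed names; nothing here is
a real PostgreSQL API. wal_next_record and replay_record are stand-in
declarations a real implementation would have to supply; the point is the
shape: fadvise a whole batch of pages first, then replay, so the kernel can
overlap the random reads.]

    /* Batched-prefetch replay loop; all names are stand-ins. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    #define BLCKSZ 8192
    #define BATCH  64             /* N: WAL records prefetched per round */

    typedef struct WalRec
    {
        long page_no;             /* data page this record modifies */
        /* ... record body ... */
    } WalRec;

    /* stand-in: read the next WAL record, returning 0 at end of WAL */
    extern int wal_next_record(WalRec *rec);
    /* stand-in: apply a record to its (now cached) page */
    extern void replay_record(int datafd, const WalRec *rec);

    void
    batched_replay(int datafd)
    {
        WalRec queue[BATCH];
        int    n;

        do
        {
            /* step 2: fill the queue, issuing one prefetch hint per page */
            for (n = 0; n < BATCH && wal_next_record(&queue[n]); n++)
                (void) posix_fadvise(datafd,
                                     (off_t) queue[n].page_no * BLCKSZ,
                                     BLCKSZ, POSIX_FADV_WILLNEED);

            /* step 3: replay the queue; reads mostly hit warm cache now */
            for (int i = 0; i < n; i++)
                replay_record(datafd, &queue[i]);
        } while (n == BATCH);
    }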
Hello Hannu,

Hannu Krosing wrote:
>> (For parallelized queries, superuser privileges might appear wrong, but
>> I'm arguing that parallelizing the rights checking isn't worth the
>> trouble, so the initiating worker backend should do that and only
>> delegate safe jobs to helper backends. Or is that a serious limitation
>> in a way?)
>
> At least functions defined with SECURITY DEFINER may be a problem.

Uhm... what I had in mind was parallelizing sequential scans, index scans,
joins and such - database-internal stuff. Parallelizing user-defined
functions (or what did you have in mind?) is more difficult and sometimes
impossible, because the planner cannot know ahead of time what the
function's going to do.

However, thinking about it, maybe one could and should try to parallelize
computationally intensive IMMUTABLE functions. But already with STABLE ones
I'm getting suspicious. It would require users to write truly thread-safe
(well, multi-process-safe) functions, which I doubt somewhat. Granted, they
theoretically *should* be safe, but...

Anyway, if that's the only show stopper, one could certainly tell helper
backends to substitute their superuser privileges with the invoker's
privileges. Not sure if that's worth the trouble, though.

Regards

Markus
Hannu Krosing wrote:
> until N fubbers are used

...whatever a fubber is :-) Nice typo!

Markus