Thread: Proposal of PITR performance improvement for 8.4.
Hi, This is my first proposal of a PITR performance improvement for PostgreSQL 8.4 development. The proposal adds a readahead mechanism for data pages that will be read by redo() routines during recovery. This is especially effective in recovery without full page writes. Readahead is done by posix_fadvise(), as proposed for the index scan improvement. Details of the implementation can be found in the README file in the attached material. -- ------ Koichi Suzuki
Attachment
"Koichi Suzuki" <koichi.szk@gmail.com> writes: > This is my first proposal of PITR performance improvement for > PostgreSQL 8.4 development. This proposal includes readahead > mechanism of data pages which will be read by redo() routines in the > recovery. This is especially effective in the recovery without full > page write. Readahead is done by posix_fadvise() as proposed in > index scan improvement. Incidentally: a bit of background for anyone who wasn't around when last this came up: prefetching is especially for our recovery code because it's single-threaded. If you have a raid array you're effectively limited to using a single drive at a time. This is a major problem because the logs could have been written by many processes hammering the raid array concurrently. In other words your warm standby database might not be able to keep up with the logs from the master database even on identical (or even better) hardware. Simon (I think?) proposed allowing our recovery code to be multi-threaded. Heikki suggested using prefetching. > Details of the implementation will be found in README file in the material. I've read through this and I think I disagree with the idea of using a separate program. It's a lot of extra code -- and duplicated code from the normal recovery too. I recognise that it's awkward to handle during recovery since you would have to rework the wal reading logic quite a bit. But it might be worth doing anyways for non-raid situations. Even if you don't have a raid array it would be worthwhile to do the equivalent of a bitmap heap scan by fetching blocks in order. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
On Tue, 2008-10-28 at 16:33 +0900, Koichi Suzuki wrote: > This is my first proposal of a PITR performance improvement for > PostgreSQL 8.4 development. The proposal adds a readahead > mechanism for data pages that will be read by redo() routines during > recovery. This is especially effective in recovery without full > page writes. Readahead is done by posix_fadvise(), as proposed for > the index scan improvement. > > Details of the implementation can be found in the README file in the attached material. I'm happy with the idea of a readahead process. I thought we were implementing a BackgroundReader process for other uses. Is that dead now? I see you're doing this as a standalone command. Shame we can't include this in core for now, but I can see the reasons why. How does the whole toolchain look with this? Can you give a simple example of using pg_lesslog, pg_readahead and pg_standby together, if that is appropriate? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Gregory Stark wrote: > "Koichi Suzuki" <koichi.szk@gmail.com> writes: > >> This is my first proposal of a PITR performance improvement for >> PostgreSQL 8.4 development. The proposal adds a readahead >> mechanism for data pages that will be read by redo() routines during >> recovery. This is especially effective in recovery without full >> page writes. Readahead is done by posix_fadvise(), as proposed for >> the index scan improvement. > > Incidentally: a bit of background for anyone who wasn't around when last this > came up: prefetching is especially important for our recovery code because it's > single-threaded. If you have a raid array you're effectively limited to using > a single drive at a time. This is a major problem because the logs could have > been written by many processes hammering the raid array concurrently. In other > words your warm standby database might not be able to keep up with the logs > from the master database even on identical (or even better) hardware. > > Simon (I think?) proposed allowing our recovery code to be multi-threaded. > Heikki suggested using prefetching. I actually played around with the prefetching, and even wrote a quick prototype of it, about a year ago. It read ahead a fixed number of the WAL records in xlog.c, calling posix_fadvise() for all pages that were referenced in them. I never got around to finishing it, as I wanted to see Greg's posix_fadvise() patch get done first and rely on the same infrastructure, but here's some lessons I learned: 1. You should avoid useless posix_fadvise() calls. In the naive implementation, where you simply call posix_fadvise() for every page referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per WAL record, and that's a lot of overhead. We face the same design question as with Greg's patch to use posix_fadvise() to prefetch index and bitmap scans: what should the interface to the buffer manager look like? The simplest approach would be a new function call like AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the page if it's not in the buffer cache, but is a no-op otherwise. But that means more overhead, since for every page access, we need to find the page twice in the buffer cache; once for the AdviseBuffer() call, and a 2nd time for the actual ReadBuffer(). It would be more efficient to pin the buffer in the AdviseBuffer() call already, but that requires many more changes to the callers. 2. The format of each WAL record is different, so you need a "readahead handler" for every resource manager, for every record type. It would be a lot simpler if there was a standardized way to store that information in the WAL records. 3. IIRC I tried to handle just a few most important WAL records at first, but it turned out that you really need to handle all WAL records (that are used at all) before you see any benefit. Otherwise, every time you hit a WAL record that you haven't done posix_fadvise() on, the recovery "stalls", and you don't need many of those to diminish the gains. Not sure how these apply to your approach; it's very different. You seem to handle 1. by collecting all the page references for the WAL file, and sorting and removing the duplicates. I wonder how much CPU time is spent on that? >> Details of the implementation can be found in the README file in the attached material. > > I've read through this and I think I disagree with the idea of using a > separate program. It's a lot of extra code -- and duplicated code from the > normal recovery too. Agreed, this belongs into core. 
The nice thing about a separate process is that you could hook it into restore_command, with no changes to the server, but as you note in the README, we'd want to use this in crash recovery as well, and the interaction between the external command and the startup process seems overly complex for that. Besides, we want to use the posix_fadvise() stuff in the backend anyway, so we should use the same infrastructure during WAL replay as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
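To make the interface question concrete, a sketch of the simpler of the two designs Heikki describes; BufferCacheContains() and FadviseRelationBlock() are hypothetical helpers standing in for the real buffer-manager and storage-manager internals:

/*
 * Sketch of the AdviseBuffer() interface described above.  Both helper
 * functions are hypothetical placeholders, not actual PostgreSQL APIs.
 */
void
AdviseBuffer(Relation reln, BlockNumber blockNum)
{
    /*
     * First probe of the buffer cache: if the page is already
     * resident, there is nothing to prefetch.
     */
    if (BufferCacheContains(reln, blockNum))
        return;

    /*
     * Not cached: hint the kernel.  The cost Heikki points out is that
     * the later ReadBuffer() must search the buffer cache for this page
     * a second time; pinning here would avoid that but complicates
     * every caller.
     */
    FadviseRelationBlock(reln, blockNum);
}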
Simon Riggs <simon@2ndQuadrant.com> writes: > I'm happy with the idea of a readahead process. I thought we were > implementing a BackgroundReader process for other uses. Is that dead > now? You and Bruce seem to keep resurrecting that idea. I've never liked it -- I always hated that in Oracle and thought it was a terrible kludge. I think the inter-process communication would be way too heavy-weight; by the time the other process is scheduled, the process which needed the blocks would probably have done many of them already anyways. Worse, you would need a large number of reading processes and would start to run into locking contention on the work-queue as well. In any case it would be a lot of code to do what posix_fadvise does for us with a simple syscall anyways. Am I misjudging this? Is this a popular idea? We could always implement it as a fall-back implementation for mdprefetch() where posix_fadvise (and libaio, assuming we implement that as well) don't work. It has the advantage of working with any system at all, even if it predates 1003.1. But it seems like an awful lot of code to implement a solution that I fear won't work very well. And are people running raid arrays not likely to be running modern OSes anyways? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> I'm happy with the idea of a readahead process. I thought we were >> implementing a BackgroundReader process for other uses. Is that dead >> now? > > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I > always hated that in Oracle and thought it was a terrible kludge. > > I think the inter-process communication would be way too heavy-weight; by the > time the other process is scheduled, the process which needed the blocks would > probably have done many of them already anyways. Worse, you would need a > large number of reading processes and would start to run into locking > contention on the work-queue as well. > > In any case it would be a lot of code to do what posix_fadvise does for us > with a simple syscall anyways. I agree. > Am I misjudging this? Is this a popular idea? We could always implement it as > a fall-back implementation for mdprefetch() where posix_fadvise (and libaio, > assuming we implement that as well) don't work. It has the advantage of > working with any system at all, even if it predates 1003.1. Yeah, if we want to support prefetching on systems that don't have posix_fadvise(), that would be the way to do it. posix_fadvise() is a good API. We can support multiple platform-specific implementations of it if there's interest, using posix_fadvise(), aio to a dummy buffer, whatever functions there are on Windows, or background reader processes or threads. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
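One possible shape for that, as a sketch only: a single prefetch entry point in the storage manager that compiles down to whatever the platform offers. The name mdprefetch() is borrowed from Greg's mail above but the signature is simplified; StartDummyAioRead() and the USE_AIO_PREFETCH guard are hypothetical.

/*
 * Sketch of a platform-dispatched prefetch hint.  Only the
 * posix_fadvise() branch reflects a real API; the aio fallback is a
 * hypothetical placeholder.
 */
#include <fcntl.h>

void
mdprefetch(int fd, off_t offset, size_t len)
{
#if defined(POSIX_FADV_WILLNEED)
    /* Preferred path: cheap asynchronous hint, result ignorable. */
    (void) posix_fadvise(fd, offset, (off_t) len, POSIX_FADV_WILLNEED);
#elif defined(USE_AIO_PREFETCH)
    /* Fallback: start an aio_read() into a throwaway buffer. */
    StartDummyAioRead(fd, offset, len);
#else
    /*
     * No asynchronous primitive available: do nothing rather than
     * turn the hint into a blocking synchronous read.
     */
    (void) fd;
    (void) offset;
    (void) len;
#endif
}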
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > And are people running raid arrays not likely to be > running modern OSes anyways? > No, and those that are can upgrade. Joshua D. Drake
On Tue, 2008-10-28 at 12:34 +0000, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > I'm happy with the idea of a readahead process. I thought we were > > implementing a BackgroundReader process for other uses. Is that dead > > now? > > You and Bruce seem to keep resurrecting that idea. Actually just asking for status, not trying to change design. > I've never liked it -- I > always hated that in Oracle and thought it was a terrible kludge. But now... If you have a better way, great, but that doesn't make perfectly workable and fairly simple userspace solutions into kludges. That's just an emotive description of your preference. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2008-10-28 at 12:34 +0000, Gregory Stark wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >> >> I've never liked it -- I >> always hated that in Oracle and thought it was a terrible kludge. > > But now... If you have a better way, great, but that doesn't make > perfectly workable and fairly simple userspace solutions into kludges. > That's just an emotive description of your preference. I won't dispute that calling it a kludge is a personal emotional reaction. I do think it's *far* from simple though. We would have to manage a variable number of worker processes, which is a fair amount of code right off the bat. But the real complexity I fear is in managing the work queue. We would have yet another fixed-size shared memory data structure to configure and a lot of locking and memory contention on it. Keep in mind that potentially every backend process will be trying to enqueue hundreds of blocks and you'll have dozens (or even hundreds) of worker processes trying to dequeue blocks. Worst of all, you'll (hopefully) have someone trying to sort the blocks too, which would make it basically impossible to do anything to minimize the contention. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
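To see why the contention argument bites, consider the simplest possible shared prefetch queue, sketched below. SpinLockAcquire()/SpinLockRelease() and slock_t are real PostgreSQL spinlock names, but PrefetchQueue, BlockRequest and QUEUE_SIZE are hypothetical:

/*
 * Sketch of the naive shared work queue being argued against: a
 * single spinlock serializes every producer (backend) and every
 * consumer (reader process).
 */
#define QUEUE_SIZE 1024         /* fixed-size shared memory array */

typedef struct BlockRequest
{
    Oid         relNode;        /* simplified page identity */
    BlockNumber blockNo;
} BlockRequest;

typedef struct PrefetchQueue
{
    slock_t     mutex;          /* one lock guarding everything */
    int         head;
    int         tail;
    BlockRequest items[QUEUE_SIZE];
} PrefetchQueue;

static bool
EnqueuePrefetch(PrefetchQueue *q, BlockRequest req)
{
    bool        ok = false;

    SpinLockAcquire(&q->mutex); /* every backend serializes here */
    if ((q->tail + 1) % QUEUE_SIZE != q->head)
    {
        q->items[q->tail] = req;
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        ok = true;
    }
    SpinLockRelease(&q->mutex);

    /*
     * Any attempt to sort the queued blocks for sequential I/O would
     * have to happen under this same lock -- Greg's "worst of all"
     * point.
     */
    return ok;
}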
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > I'm happy with the idea of a readahead process. I thought we were > > implementing a BackgroundReader process for other uses. Is that dead > > now? > > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I > always hated that in Oracle and thought it was a terrible kludge. I didn't think I was promoting the separate reader process after you had the posix_fadvise() idea. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: > Gregory Stark wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > > > I'm happy with the idea of a readahead process. I thought we were > > > implementing a BackgroundReader process for other uses. Is that dead > > > now? > > > > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I > > always hated that in Oracle and thought it was a terrible kludge. > > I didn't think I was promoting the separate reader process after you had > the posix_fadvise() idea. I think Greg is misinterpreting our occasional lack of exactness as disagreement. The end solution is the goal, not any of the discussed mechanisms. It's always good to have a name for it that sums up the goals rather than the methods, e.g. frequent update optimisation rather than update-in-place. It would be good if the solutions for normal running and recovery were similar. Greg, please could you look into that? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: >> Gregory Stark wrote: >> > Simon Riggs <simon@2ndQuadrant.com> writes: >> > >> > > I'm happy with the idea of a readahead process. I thought we were >> > > implementing a BackgroundReader process for other uses. Is that dead >> > > now? >> > >> > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I >> > always hated that in Oracle and thought it was a terrible kludge. >> >> I didn't think I was promoting the separate reader process after you had >> the posix_fadvise() idea. I'm sorry, I thought I remembered you mentioning it again. But perhaps I was thinking of someone else (perhaps it was Simon again?) or perhaps it was before you saw the actual patch. > It would be good if the solutions for normal running and recovery were > similar. Greg, please could you look into that? I could do the readahead side of things, but what I'm not sure how to arrange is how to restructure the wal reading logic to read records ahead of the actual replay. I think we would have to maintain two pointers, one for the prefetch and one for the actual replay. But the logic for recovery is complex enough that I'm concerned about changing it enough to do that, and whether it can be done without uglifying the code quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Thanks for a lot of inspiring discussions. Please note that my proposal includes only a few lines of change to the recovery code itself. It does not affect buffer management, the order of WAL record application, etc. The only change needed is to invoke the prefetch feature when redo is going to read a WAL record that has not yet been handled by the prefetch (the prefetch function returns the last-handled LSN). Before writing the readahead code, I ran several experiments on how posix_fadvise() speeds up random reads, and I found that POSIX_FADV_WILLNEED can improve total read performance by around five times if we schedule the posix_fadvise() calls in block-position order. Without this ordering, the improvement ratio was around three times. This result was achieved with a single process, but against a RAID configuration; I'd like to do a similar measurement against a single disk. I'd also like to run some benchmarks to clarify the improvement, as I agree I should show how my proposal is useful. In terms of its influence on the recovery code, pg_readahead just calls posix_fadvise() to tell the operating system to prefetch the data page into the kernel's cache, not PG's shared memory, so we don't have to implement this in PG core code. Because of this, and because I think it is more practical to keep platform-specific code outside the core where possible, I wrote most of the prefetch logic in an external process, which could be made available in contrib or on PgFoundry, perhaps the latter. Heikki suggested having a separate reader process. I think it's a very good idea, but it could change PG's performance dramatically: better in some cases, but possibly worse in others. I don't have a clear picture of this. So I think the background reader issue should be a challenge for 8.5 or later, and we should call for research work on it. For now, I think it is reasonable to keep improving the existing code. I'd like to hear some more about these points. I'm more than happy to write all the code inside PG core to avoid the overhead of creating another process. --- Koichi Suzuki 2008/10/29 Gregory Stark <stark@enterprisedb.com>: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote: >>> Gregory Stark wrote: >>> > Simon Riggs <simon@2ndQuadrant.com> writes: >>> > >>> > > I'm happy with the idea of a readahead process. I thought we were >>> > > implementing a BackgroundReader process for other uses. Is that dead >>> > > now? >>> > >>> > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I >>> > always hated that in Oracle and thought it was a terrible kludge. >>> >>> I didn't think I was promoting the separate reader process after you had >>> the posix_fadvise() idea. > > I'm sorry, I thought I remembered you mentioning it again. But perhaps I was > thinking of someone else (perhaps it was Simon again?) or perhaps it was > before you saw the actual patch. > >> It would be good if the solutions for normal running and recovery were >> similar. Greg, please could you look into that? > > I could do the readahead side of things, but what I'm not sure how to arrange > is how to restructure the wal reading logic to read records ahead of the > actual replay. > > I think we would have to maintain two pointers, one for the prefetch and one > for the actual replay. But the logic for recovery is complex enough that > I'm concerned about changing it enough to do that, and whether it can be done > without uglifying the code quite a bit. 
> > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > Ask me about EnterpriseDB's RemoteDBA services! >
Hi, > I think we would have to maintain two pointers, one for the prefetch and one > for the actual replay. But the logic for recovery is complex enough that > I'm concerned about changing it enough to do that, and whether it can be done > without uglifying the code quite a bit. Yes, this is what the code is doing. The prefetch function returns the LSN up to which prefetch was performed, and the redo routine has to call prefetch whenever it is going to read a WAL record beyond that point. -- ------ Koichi Suzuki
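In pseudo-C, the interlock described here might look like the sketch below. XLByteLE() is the LSN-comparison macro of that era, but PrefetchThroughSegment() and ApplyWalRecord() are hypothetical stand-ins for the patch's actual entry points:

/*
 * Sketch of the prefetch/replay interlock: keep the prefetch pointer
 * ahead of the replay pointer.
 */
static XLogRecPtr lastPrefetched;   /* LSN the prefetcher has reached */

static void
RedoOneRecord(XLogRecPtr readPtr, XLogRecord *record)
{
    /*
     * About to replay past what has been prefetched: advance the
     * prefetcher first so the kernel cache stays warm ahead of redo.
     * The prefetch call returns the last LSN it handled.
     */
    if (XLByteLE(lastPrefetched, readPtr))
        lastPrefetched = PrefetchThroughSegment(readPtr);

    ApplyWalRecord(readPtr, record);    /* normal redo */
}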
Heikki, 1. In the code, all the referenced pages are extracted from the WAL records in scope and sorted, both to schedule the calls and to avoid duplicate posix_fadvise() calls. If a full page write is included as the first WAL record for a page, that page is not prefetched. There's still some risk of calling posix_fadvise() for pages already in the kernel cache, but I don't think the overhead is that big. I'd like to measure how much improvement we can achieve with some benchmark. 2. Yes, we have to analyze WAL records using rm-specific code, and I think that is not good. At present, the major part of this comes from xlogdump. Are you proposing to improve the WAL record format to be less rm-specific? That's interesting, too. 3. Yes, the code handles all the WAL record types except those which do not need any "read" operation in the recovery, such as records creating new pages. Improving the WAL record format as in 2. will help to simplify this as well. 2008/10/28 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Gregory Stark wrote: >> >> "Koichi Suzuki" <koichi.szk@gmail.com> writes: >> >>> This is my first proposal of a PITR performance improvement for >>> PostgreSQL 8.4 development. The proposal adds a readahead >>> mechanism for data pages that will be read by redo() routines during >>> recovery. This is especially effective in recovery without full >>> page writes. Readahead is done by posix_fadvise(), as proposed for >>> the index scan improvement. >> >> Incidentally: a bit of background for anyone who wasn't around when last >> this >> came up: prefetching is especially important for our recovery code because >> it's >> single-threaded. If you have a raid array you're effectively limited to >> using >> a single drive at a time. This is a major problem because the logs could >> have >> been written by many processes hammering the raid array concurrently. In >> other >> words your warm standby database might not be able to keep up with the >> logs >> from the master database even on identical (or even better) hardware. >> >> Simon (I think?) proposed allowing our recovery code to be multi-threaded. >> Heikki suggested using prefetching. > > I actually played around with the prefetching, and even wrote a quick > prototype of it, about a year ago. It read ahead a fixed number of the WAL > records in xlog.c, calling posix_fadvise() for all pages that were > referenced in them. I never got around to finishing it, as I wanted to see > Greg's posix_fadvise() patch get done first and rely on the same > infrastructure, but here's some lessons I learned: > > 1. You should avoid useless posix_fadvise() calls. In the naive > implementation, where you simply call posix_fadvise() for every page > referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per > WAL record, and that's a lot of overhead. We face the same design question > as with Greg's patch to use posix_fadvise() to prefetch index and bitmap > scans: what should the interface to the buffer manager look like? The > simplest approach would be a new function call like AdviseBuffer(Relation, > BlockNumber), that calls posix_fadvise() for the page if it's not in the > buffer cache, but is a no-op otherwise. But that means more overhead, since > for every page access, we need to find the page twice in the buffer cache; > once for the AdviseBuffer() call, and a 2nd time for the actual ReadBuffer(). > It would be more efficient to pin the buffer in the AdviseBuffer() call > already, but that requires many more changes to the callers. > > 2. 
The format of each WAL record is different, so you need a "readahead > handler" for every resource manager, for every record type. It would be a > lot simpler if there was a standardized way to store that information in the > WAL records. > > 3. IIRC I tried to handle just a few most important WAL records at first, > but it turned out that you really need to handle all WAL records (that are > used at all) before you see any benefit. Otherwise, every time you hit a WAL > record that you haven't done posix_fadvise() on, the recovery "stalls", and > you don't need many of those to diminish the gains. > > Not sure how these apply to your approach; it's very different. You seem to > handle 1. by collecting all the page references for the WAL file, and > sorting and removing the duplicates. I wonder how much CPU time is spent on > that? > >>> Details of the implementation can be found in the README file in the >>> attached material. >> >> I've read through this and I think I disagree with the idea of using a >> separate program. It's a lot of extra code -- and duplicated code from the >> normal recovery too. > > Agreed, this belongs into core. The nice thing about a separate process is > that you could hook it into restore_command, with no changes to the server, > but as you note in the README, we'd want to use this in crash recovery as > well, and the interaction between the external command and the startup > process seems overly complex for that. Besides, we want to use the > posix_fadvise() stuff in the backend anyway, so we should use the same > infrastructure during WAL replay as well. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > -- ------ Koichi Suzuki
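The collect-sort-deduplicate pass described in point 1 above could look roughly like this sketch; PageRef and fadvise_page() are illustrative names, not the tool's actual ones:

/*
 * Sketch of one prefetch pass over a WAL segment: sort the collected
 * page references into block order, then issue one fadvise per
 * distinct page.
 */
#include <stdlib.h>

typedef struct PageRef
{
    unsigned int relNode;       /* simplified relation identity */
    unsigned int blockNo;
} PageRef;

static int
cmp_pageref(const void *a, const void *b)
{
    const PageRef *x = a;
    const PageRef *y = b;

    if (x->relNode != y->relNode)
        return (x->relNode < y->relNode) ? -1 : 1;
    if (x->blockNo != y->blockNo)
        return (x->blockNo < y->blockNo) ? -1 : 1;
    return 0;
}

static void
prefetch_segment(PageRef *refs, size_t n)
{
    qsort(refs, n, sizeof(PageRef), cmp_pageref);

    for (size_t i = 0; i < n; i++)
    {
        /* Duplicate removal: skip repeats of the previous entry. */
        if (i > 0 && cmp_pageref(&refs[i - 1], &refs[i]) == 0)
            continue;
        fadvise_page(refs[i].relNode, refs[i].blockNo); /* hypothetical */
    }
}

Since the array is sorted, duplicate removal is one comparison per entry, which suggests the CPU cost Heikki asks about is dominated by the O(n log n) qsort().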
On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote: > 1. You should avoid useless posix_fadvise() calls. In the naive > implementation, where you simply call posix_fadvise() for every page > referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls > per WAL record, and that's a lot of overhead. We face the same design > question as with Greg's patch to use posix_fadvise() to prefetch index > and bitmap scans: what should the interface to the buffer manager look > like? The simplest approach would be a new function call like > AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the > page if it's not in the buffer cache, but is a no-op otherwise. But that > means more overhead, since for every page access, we need to find the > page twice in the buffer cache; once for the AdviseBuffer() call, and a > 2nd time for the actual ReadBuffer(). That's a much smaller overhead than waiting for an I/O. The CPU overhead isn't really a problem if we're I/O bound. > It would be more efficient to pin > the buffer in the AdviseBuffer() call already, but that requires many > more changes to the callers. That would be hard to clean up safely, plus we'd have difficulty with timing: is there enough buffer space to allow all the prefetched blocks to live in cache at once? If not, this approach would cause problems. > 2. The format of each WAL record is different, so you need a "readahead > handler" for every resource manager, for every record type. It would be > a lot simpler if there was a standardized way to store that information > in the WAL records. I would prefer a new rmgr API call that returns a list of blocks. That's better than trying to make everything fit one pattern. If the call doesn't exist then that rmgr won't get prefetch. > 3. IIRC I tried to handle just a few most important WAL records at > first, but it turned out that you really need to handle all WAL records > (that are used at all) before you see any benefit. Otherwise, every time > you hit a WAL record that you haven't done posix_fadvise() on, the > recovery "stalls", and you don't need many of those to diminish the gains. > > Not sure how these apply to your approach; it's very different. You seem > to handle 1. by collecting all the page references for the WAL file, and > sorting and removing the duplicates. I wonder how much CPU time is spent > on that? Removing duplicates seems like it will save CPU. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
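A sketch of what that rmgr call might look like, reusing the PageRef struct from the sketch above; the rm_blocklist field and the parsing details are hypothetical, though rm_redo and XLR_MAX_BKP_BLOCKS follow the shape of the real 8.x rmgr table:

/*
 * Sketch of an optional per-rmgr "list the blocks this record touches"
 * callback.  Rmgrs that don't provide one simply get no prefetch.
 */
typedef struct RmgrData
{
    const char *rm_name;
    void        (*rm_redo) (XLogRecPtr lsn, XLogRecord *record);
    /*
     * New optional call: fill blocks[] with up to maxblocks page
     * references and return how many were written, or -1 if this
     * record type cannot be prefetched.
     */
    int         (*rm_blocklist) (XLogRecord *record,
                                 PageRef *blocks, int maxblocks);
} RmgrData;

static void
prefetch_record(const RmgrData *rmgr, XLogRecord *record)
{
    PageRef     blocks[XLR_MAX_BKP_BLOCKS * 2]; /* illustrative bound */
    int         n;

    if (rmgr->rm_blocklist == NULL)
        return;                 /* no handler, no prefetch */

    n = rmgr->rm_blocklist(record, blocks, XLR_MAX_BKP_BLOCKS * 2);
    for (int i = 0; i < n; i++)
        fadvise_page(blocks[i].relNode, blocks[i].blockNo);
}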
On Wed, 2008-10-29 at 09:55 +0900, Koichi Suzuki wrote: > I'd like to hear some more about these points. I'm more than happy to write > all the code inside PG core to avoid the overhead of creating another > process. Having an external program can help earlier releases also, so I think this is the right approach for now. In the next PG release we should bring this into core, along with streaming. The interface would be better if it accepted pg_readahead filename or pg_readahead filename start-lsn since we don't always need a starting LSN. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
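Picking up the earlier question about the toolchain: under the interface above, one hedged guess at how the pieces could fit into a warm standby's restore_command, as a sketch only. The script name is invented, and the exact argument conventions of pg_standby and pg_decompresslog should be checked against their own documentation.

#!/bin/sh
# Hypothetical restore_wal.sh, wired in as:
#   restore_command = '/usr/local/bin/restore_wal.sh %f %p'
F="$1"; P="$2"

# 1. Wait for the archived (compressed) segment and stage it.
pg_standby /archive "$F" "$P.z" || exit $?

# 2. Restore the full WAL file from the pg_compresslog'd archive.
pg_decompresslog "$P.z" "$P" || exit $?

# 3. Start kernel readahead for the pages the segment touches, in the
#    background so replay isn't delayed (interface as described above).
pg_readahead "$P" &

exit 0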
I'm not sure if the blocks referred to by all the WAL records in a single WAL segment can fit in the kernel cache. This is why the current pg_readahead returns the last LSN and requires a starting LSN. So far, with FPW, it seems that we can prefetch all the pages in a WAL segment, so it will be okay for archive logs compressed with pg_compresslog. I'll test whether it's okay in the case of full_page_writes=off. Anyway, I'd like to keep my proposal for 8.4 and continue the testing and evaluation, reporting to the mailing list. I'll also change the whole code to run in the core. ------- Koichi Suzuki 2008/10/29 Simon Riggs <simon@2ndquadrant.com>: > > On Wed, 2008-10-29 at 09:55 +0900, Koichi Suzuki wrote: > >> I'd like to hear some more about these points. I'm more than happy to write >> all the code inside PG core to avoid the overhead of creating another >> process. > > Having an external program can help earlier releases also, so I think > this is the right approach for now. > > In the next PG release we should bring this into core, along with streaming. > > The interface would be better if it accepted > pg_readahead filename > or pg_readahead filename start-lsn > since we don't always need a starting LSN. > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- ------ Koichi Suzuki
Hi, 2008/10/29 Simon Riggs <simon@2ndquadrant.com>: > > On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote: > >> 1. You should avoid useless posix_fadvise() calls. In the naive >> implementation, where you simply call posix_fadvise() for every page >> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls >> per WAL record, and that's a lot of overhead. We face the same design >> question as with Greg's patch to use posix_fadvise() to prefetch index >> and bitmap scans: what should the interface to the buffer manager look >> like? The simplest approach would be a new function call like >> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the >> page if it's not in the buffer cache, but is a no-op otherwise. But that >> means more overhead, since for every page access, we need to find the >> page twice in the buffer cache; once for the AdviseBuffer() call, and a >> 2nd time for the actual ReadBuffer(). > > That's a much smaller overhead than waiting for an I/O. The CPU overhead > isn't really a problem if we're I/O bound. As discussed last year regarding parallel recovery and the random read problem, recovery is really I/O bound, especially when FPW is not available. And it is not practical to ask all the archive logs to include huge FPWs. > >> It would be more efficient to pin >> the buffer in the AdviseBuffer() call already, but that requires many >> more changes to the callers. > > That would be hard to clean up safely, plus we'd have difficulty with > timing: is there enough buffer space to allow all the prefetched blocks to > live in cache at once? If not, this approach would cause problems. I'm not positive about the AdviseBuffer() idea. If we do this, we need all the pages referred to by a WAL segment in the shared buffers. This may be several GB and will compete with the kernel cache. Current PostgreSQL relies heavily on the kernel cache (and kernel I/O scheduling), and it is not a good idea to make shared buffers too large; the worst case is to devote half of the physical memory to shared buffers, and the performance will be very bad. Rather, I prefer to ask the kernel to prefetch. > >> 2. The format of each WAL record is different, so you need a "readahead >> handler" for every resource manager, for every record type. It would be >> a lot simpler if there was a standardized way to store that information >> in the WAL records. > > I would prefer a new rmgr API call that returns a list of blocks. That's > better than trying to make everything fit one pattern. If the call > doesn't exist then that rmgr won't get prefetch. Yes, I like this idea. Could you let me try this API through a prefetch implementation in the core (if it is agreed)? > >> 3. IIRC I tried to handle just a few most important WAL records at >> first, but it turned out that you really need to handle all WAL records >> (that are used at all) before you see any benefit. Otherwise, every time >> you hit a WAL record that you haven't done posix_fadvise() on, the >> recovery "stalls", and you don't need many of those to diminish the gains. >> >> Not sure how these apply to your approach; it's very different. You seem >> to handle 1. by collecting all the page references for the WAL file, and >> sorting and removing the duplicates. I wonder how much CPU time is spent >> on that? > > Removing duplicates seems like it will save CPU. If we invoke posix_fadvise() on blocks already in the kernel cache, the call does nothing but consume some overhead in the kernel. I think duplicate removal saves more than it costs. 
> > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > -- ------ Koichi Suzuki
On Thu, 2008-10-30 at 09:46 +0900, Koichi Suzuki wrote: > I'm not sure if the blocks referred to by all the WAL records in a single > WAL segment can fit in the kernel cache. This is why the current > pg_readahead returns the last LSN and requires a starting LSN. So far, > with FPW, it seems that we can prefetch all the pages in a WAL segment, > so it will be okay for archive logs compressed with pg_compresslog. I'll > test whether it's okay in the case of full_page_writes=off. I'd prefer to be able to specify max_readahead_pages rather than have to control things at a micro level like that. If you have lots of memory you can set that higher. > Anyway, I'd like to keep my proposal for 8.4 and continue the testing and > evaluation, reporting to the mailing list. > > I'll also change the whole code to run in the core. OK, I quite liked the idea of a separate program. That allows it to work with 8.3 as well as 8.4. No problem with it being in core at all. As ever, good thinking, good patch. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
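A sketch of the knob Simon suggests, capping how far the prefetcher runs ahead of replay instead of tying the window to WAL-segment boundaries. max_readahead_pages as a GUC is his proposal; the counter mechanics here, and the reuse of PageRef and fadvise_page() from the earlier sketches, are illustrative:

/*
 * Sketch of a bounded prefetch window.  pages_in_flight would be
 * decremented when replay actually reads an advised page.
 */
static int  max_readahead_pages = 1024; /* would come from a GUC */
static int  pages_in_flight = 0;

static void
maybe_prefetch(PageRef ref)
{
    if (pages_in_flight >= max_readahead_pages)
        return;                 /* window full: let replay catch up */

    fadvise_page(ref.relNode, ref.blockNo);
    pages_in_flight++;
}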