Thread: V2 of PITR performance improvement for 8.4
Please find enclosed a revised version of pg_readahead and a patch to
invoke pg_readahead.

Changes from the previous version are as follows:

pg_readahead no longer returns any prefetched point. It simply
prefetches all the data pages referred to from WAL records in a given
WAL segment, except for those whose first WAL record includes a full
page write. Because of this change, the patch to the core was changed
so that pg_readahead is invoked when a WAL segment is opened.

Details will be found in the README.

I've done a benchmark to see the effect of the prefetch. Here's a
report.

--------------------------------
Benchmark: DBT-2
Database size: 20GB

We gave fewer transactions than the DBT-2 default to avoid an overload
condition. We ran the benchmark for one hour with checkpoint timeout
30min and completion_target 0.5. Then we collected all the archive logs
and ran PITR.

Disks: RAID0 array (8 disks, 7200rpm). Detailed conditions are given at
the end.

The measurement results are as follows (for readability, a PDF chart is
also attached):

----------------------+------------+--------------------+---------------
WAL conditions        | Recovery   | Amount of          | recovery
                      | time (sec) | physical read (MB) | rate (TX/min)
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with cp      | 6,611      | 5,435              | 402
FPW=off               |            |                    |
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with cp      | 1,683      | 801                | 1,458
FPW=on                |            |                    | (8.3)
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with lesslog | 6,644      | 5,090              | 369
FPW=on                |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with cp      | 1,161      | 5,543              | 2,290
FPW=off               |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with cp      | 1,415      | 2,157              | 1,733
FPW=on                |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with lesslog | 1,196      | 5,369              | 2,051
FPW=on                |            |                    | (This proposal)
----------------------+------------+--------------------+---------------
*  lesslog means pg_compresslog
** DBT-2 throughput: 682TPM (FPW=on), 739TPM (FPW=off)

This shows that although prefetch does not reduce the physical read, it
tremendously improves the time to read. As a result, if the WAL archive
is taken with pg_lesslog and prefetch is done, recovery duration is
somewhat shorter than the current FPW=on score (1,196 sec against
1,683 sec).

An important point is that the recovery rate is much higher than the
DBT-2 throughput. Therefore, this can be combined with synchronous
replication and hot standby, tremendously reducing the amount of logs
to be shipped (down to as little as one tenth), improving the recovery
time and maintaining the crash recovery success chance. With just
FPW=off or pg_compresslog, that is, without prefetch, recovery does not
catch up.

Because the current pg_readahead only works on Linux, I'd like the
patch to go into the core and pg_readahead into contrib.

Other (major) environment details are given below.

----<< H/W and OS >>-------------------
CPU: Pentium D, 2.8GHz
Memory: 2GB
Internal Disk: SATA 150GB, used to archive WAL.
External Disk: RAID 0 (Ultra Wide SCSI), 8 disks (SATA 7200rpm)
OS: RHEL ES 5.1 (64bit)

----<< Other PostgreSQL configuration >>--------
PostgreSQL: 8.4 dev. head, as of Oct. 28th
max_connections: 100
shared_buffers: 32MB
checkpoint_segments: 1000
checkpoint_timeout: 30min
checkpoint_completion_target: 0.5
archive_mode: on
autovacuum: on
logging_collector: on

-- 
------
Koichi Suzuki
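The page-selection rule described above (prefetch every data page referred to in a WAL segment, except pages whose first WAL record carries a full page write) can be sketched as follows. This is only an illustration under stated assumptions, not the pg_readahead code: `WalRef` is a hypothetical, simplified stand-in for a parsed WAL record, and the use of `posix_fadvise` with `POSIX_FADV_WILLNEED` is an assumption about how a Linux-only prefetch might be issued; the message does not say which call pg_readahead uses.

```python
import os
from typing import Iterable, List, NamedTuple, Tuple

BLOCK_SIZE = 8192  # PostgreSQL's default data page size


class WalRef(NamedTuple):
    """Hypothetical, simplified view of one WAL record's page reference."""
    path: str      # data file the record touches
    block: int     # block number within that file
    has_fpw: bool  # True if the record carries a full page image


def blocks_to_prefetch(records: Iterable[WalRef]) -> List[Tuple[str, int]]:
    """Return (path, block) pairs worth prefetching for one WAL segment.

    A page is skipped when the first record that references it carries a
    full page image, because redo then overwrites the page without
    needing to read the old contents from disk.
    """
    first_has_fpw = {}  # (path, block) -> has_fpw of the first record seen
    for rec in records:
        key = (rec.path, rec.block)
        if key not in first_has_fpw:
            first_has_fpw[key] = rec.has_fpw
    return [key for key, fpw in first_has_fpw.items() if not fpw]


def prefetch(pages: Iterable[Tuple[str, int]]) -> None:
    """Ask the kernel to start reading the given pages (Unix only)."""
    for path, block in pages:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, block * BLOCK_SIZE, BLOCK_SIZE,
                             os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
```

Calling `blocks_to_prefetch` once per segment matches the "invoked when a WAL segment is opened" behaviour described in the message; opening the file once per page is kept only for brevity.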
On Thu, 2008-11-27 at 21:04 +0900, Koichi Suzuki wrote:

> We ran the benchmark for one hour with checkpoint timeout 30min and
> completion_target 0.5. Then we collected all the archive logs and
> ran PITR.

> ----------------------+------------+--------------------+---------------
> WAL conditions        | Recovery   | Amount of          | recovery
>                       | time (sec) | physical read (MB) | rate (TX/min)
> ----------------------+------------+--------------------+---------------
> w/o prefetch          |            |                    |
> archived with cp      | 6,611      | 5,435              | 402
> FPW=off               |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with cp      | 1,161      | 5,543              | 2,290
> FPW=off               |            |                    |
> ----------------------+------------+--------------------+---------------

There's clearly a huge gain using prefetch, when we have
full_page_writes = off. But that does make me think: Why do we need
prefetch at all if we use full page writes? There's nothing to prefetch
if we can keep it in cache.

I notice we set the checkpoint_timeout to 30 mins, which is long enough
to exceed the cache on the standby. I wonder if we reduced the timeout
would we use the cache better on the standby and not need readahead at
all? Do you have any results to examine cache overflow/shorter timeouts?

> w/o prefetch          |            |                    |
> archived with cp      | 1,683      | 801                | 1,458
> FPW=on                |            |                    | (8.3)
> ----------------------+------------+--------------------+---------------
> w/o prefetch          |            |                    |
> archived with lesslog | 6,644      | 5,090              | 369
> FPW=on                |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with cp      | 1,415      | 2,157              | 1,733
> FPW=on                |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with lesslog | 1,196      | 5,369              | 2,051
> FPW=on                |            |                    | (This proposal)
> ----------------------+------------+--------------------+---------------

So I'm wondering if we only need prefetch because we're using lesslog?

If we integrated lesslog better into the new replication would we be
able to forget about doing the prefetch altogether?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
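For reference, the size of the gain in each configuration can be worked out directly from the quoted table. The short sketch below only restates the reported figures (recovery time in seconds, physical read in MB); the dictionary layout and function names are illustrative, not part of the benchmark.

```python
# (recovery time in seconds, physical read in MB), copied from the report
RESULTS = {
    ("cp", "FPW=off"):     {"without": (6611, 5435), "with": (1161, 5543)},
    ("cp", "FPW=on"):      {"without": (1683, 801),  "with": (1415, 2157)},
    ("lesslog", "FPW=on"): {"without": (6644, 5090), "with": (1196, 5369)},
}


def speedup(case):
    """Recovery time without prefetch divided by recovery time with it."""
    return case["without"][0] / case["with"][0]


def read_ratio(case):
    """Physical read with prefetch divided by physical read without it."""
    return case["with"][1] / case["without"][1]


for (archiver, fpw), case in RESULTS.items():
    print(f"{archiver:8s} {fpw:8s} "
          f"speedup {speedup(case):.2f}x, "
          f"physical read x{read_ratio(case):.2f}")
```

The gain is large where full page images are absent from the replayed WAL (FPW=off, or lesslog) and small for plain FPW=on, which is the pattern the questions above are probing.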
Hi,

As to checkpoint timeout, yes, this measurement is hard for the FPW=on
case. I'll do a similar measurement with checkpoint timeout = 5min and
post the result. I expect that the recovery time will be almost the
same in the case of FPW=on, lesslog=yes and prefetch=yes.

2008/12/2 Simon Riggs <simon@2ndquadrant.com>:
>
> On Thu, 2008-11-27 at 21:04 +0900, Koichi Suzuki wrote:
>
>> We ran the benchmark for one hour with checkpoint timeout 30min and
>> completion_target 0.5. Then we collected all the archive logs and
>> ran PITR.
>
>> ----------------------+------------+--------------------+---------------
>> WAL conditions        | Recovery   | Amount of          | recovery
>>                       | time (sec) | physical read (MB) | rate (TX/min)
>> ----------------------+------------+--------------------+---------------
>> w/o prefetch          |            |                    |
>> archived with cp      | 6,611      | 5,435              | 402
>> FPW=off               |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with cp      | 1,161      | 5,543              | 2,290
>> FPW=off               |            |                    |
>> ----------------------+------------+--------------------+---------------
>
> There's clearly a huge gain using prefetch, when we have
> full_page_writes = off. But that does make me think: Why do we need
> prefetch at all if we use full page writes? There's nothing to prefetch
> if we can keep it in cache.

Agreed. This is why I proposed making prefetch optional through a GUC.

>
> I notice we set the checkpoint_timeout to 30 mins, which is long enough
> to exceed the cache on the standby. I wonder if we reduced the timeout
> would we use the cache better on the standby and not need readahead at
> all? Do you have any results to examine cache overflow/shorter timeouts?
>
>> w/o prefetch          |            |                    |
>> archived with cp      | 1,683      | 801                | 1,458
>> FPW=on                |            |                    | (8.3)
>> ----------------------+------------+--------------------+---------------
>> w/o prefetch          |            |                    |
>> archived with lesslog | 6,644      | 5,090              | 369
>> FPW=on                |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with cp      | 1,415      | 2,157              | 1,733
>> FPW=on                |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with lesslog | 1,196      | 5,369              | 2,051
>> FPW=on                |            |                    | (This proposal)
>> ----------------------+------------+--------------------+---------------
>
> So I'm wondering if we only need prefetch because we're using lesslog?
>
> If we integrated lesslog better into the new replication would we be
> able to forget about doing the prefetch altogether?

In the case of lesslog, almost all the FPW is replaced with the
corresponding incremental log, and recovery takes longer. Prefetch
dramatically improves this, as you can see in the above result. To
improve recovery time with FPW=off, or with FPW=on and lesslog=yes, we
need prefetch.

>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>

-- 
------
Koichi Suzuki
Hi,

On Thu, Nov 27, 2008 at 9:04 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> Please find enclosed a revised version of pg_readahead and a patch to
> invoke pg_readahead.

Some similar functions are in xlog.c and pg_readahead.c (for example,
RecordIsValid). I think that we should unify them as a common function,
which helps to develop the tool (for example, xlogdump) treating WAL in
the future.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Agreed.

I borrowed the WAL parsing code from xlogdump, and I think WAL parsing
should be another candidate.

2008/12/3 Fujii Masao <masao.fujii@gmail.com>:
> Hi,
>
> On Thu, Nov 27, 2008 at 9:04 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>> Please find enclosed a revised version of pg_readahead and a patch to
>> invoke pg_readahead.
>
> Some similar functions are in xlog.c and pg_readahead.c (for example,
> RecordIsValid). I think that we should unify them as a common function,
> which helps to develop the tool (for example, xlogdump) treating WAL in
> the future.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

-- 
------
Koichi Suzuki
On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:

> > There's clearly a huge gain using prefetch, when we have
> > full_page_writes = off. But that does make me think: Why do we need
> > prefetch at all if we use full page writes? There's nothing to prefetch
> > if we can keep it in cache.
>
> Agreed. This is why I proposed making prefetch optional through a GUC.
>
> > So I'm wondering if we only need prefetch because we're using lesslog?
> >
> > If we integrated lesslog better into the new replication would we be
> > able to forget about doing the prefetch altogether?
>
> In the case of lesslog, almost all the FPW is replaced with the
> corresponding incremental log, and recovery takes longer. Prefetch
> dramatically improves this, as you can see in the above result. To
> improve recovery time with FPW=off, or with FPW=on and lesslog=yes, we
> need prefetch.

It does sound like it is needed, yes. But if you look at the
architecture of synchronous replication in 8.4 then I don't think it
makes sense any more. It would be very useful for the architecture we
had in 8.3, but that time has gone.

If we have FPW=on on primary then we will stream WAL with FPW to
standby. There seems little point removing it *after* it has been sent,
then putting it back again before we recover, especially when it causes
a drop in performance that then needs to be fixed (by this patch).

pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.

So if we find a way of streaming WAL without FPW then this patch makes
sense, but not until then. So far many people have argued in favour of
using FPW=on, which was the whole point of pg_lesslog. Are we now saying
that we would run the primary with FPW=off?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Hi,

On Thu, Dec 4, 2008 at 6:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:
>
>> > There's clearly a huge gain using prefetch, when we have
>> > full_page_writes = off. But that does make me think: Why do we need
>> > prefetch at all if we use full page writes? There's nothing to prefetch
>> > if we can keep it in cache.
>>
>> Agreed. This is why I proposed making prefetch optional through a GUC.
>>
>> > So I'm wondering if we only need prefetch because we're using lesslog?
>> >
>> > If we integrated lesslog better into the new replication would we be
>> > able to forget about doing the prefetch altogether?
>>
>> In the case of lesslog, almost all the FPW is replaced with the
>> corresponding incremental log, and recovery takes longer. Prefetch
>> dramatically improves this, as you can see in the above result. To
>> improve recovery time with FPW=off, or with FPW=on and lesslog=yes, we
>> need prefetch.
>
> It does sound like it is needed, yes. But if you look at the
> architecture of synchronous replication in 8.4 then I don't think it
> makes sense any more. It would be very useful for the architecture we
> had in 8.3, but that time has gone.

Agreed. I also think that lesslog is for archiving on a single node
rather than for replication between multiple nodes. Of course, it's very
useful for users who don't use replication, etc.

> So if we find a way of streaming WAL without FPW then this patch makes
> sense, but not until then. So far many people have argued in favour of
> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
> that we would run the primary with FPW=off?

If we always recover a database from a base backup, the primary can run
with FPW=off. Since we might need a fresh backup when making the failed
server catch up with the current primary, such a restriction (always
recovering from a backup) might not matter.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
I understood your point. In the case of synchronous replication,
because the slave takes over when the master crashes, there's no need
to keep FPW from the beginning.

In this case, only prefetch will work. Fujii's code at the slave looks
very similar to pg_standby, and pg_readahead will help in this case
with no modification.

2008/12/4 Simon Riggs <simon@2ndquadrant.com>:
> It does sound like it is needed, yes. But if you look at the
> architecture of synchronous replication in 8.4 then I don't think it
> makes sense any more. It would be very useful for the architecture we
> had in 8.3, but that time has gone.
>
> If we have FPW=on on primary then we will stream WAL with FPW to
> standby. There seems little point removing it *after* it has been sent,
> then putting it back again before we recover, especially when it causes
> a drop in performance that then needs to be fixed (by this patch).
>
> pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.
>
> So if we find a way of streaming WAL without FPW then this patch makes
> sense, but not until then. So far many people have argued in favour of
> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
> that we would run the primary with FPW=off?

-- 
------
Koichi Suzuki
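Whether a standby can keep up in each configuration follows from comparing the reported recovery rates with the reported DBT-2 throughput on the primary. The sketch below assumes that TX/min in the results table and TPM in its footnote count the same transactions, as the original report implies; the labels and helper name are illustrative only.

```python
# DBT-2 throughput on the primary, from the footnote of the report
PRIMARY_TPM = {"FPW=on": 682, "FPW=off": 739}

# (configuration, FPW setting on the primary, recovery rate in TX/min)
RECOVERY = [
    ("w/o prefetch, cp",       "FPW=off", 402),
    ("w/o prefetch, cp",       "FPW=on",  1458),
    ("w/o prefetch, lesslog",  "FPW=on",  369),
    ("with prefetch, cp",      "FPW=off", 2290),
    ("with prefetch, cp",      "FPW=on",  1733),
    ("with prefetch, lesslog", "FPW=on",  2051),
]


def falls_behind(rate, fpw):
    """True if replay is slower than the primary generates transactions."""
    return rate < PRIMARY_TPM[fpw]


lagging = [(label, fpw) for label, fpw, rate in RECOVERY
           if falls_behind(rate, fpw)]

for label, fpw in lagging:
    print(f"cannot catch up: {label}, {fpw}")
```

On these figures only the two configurations that drop full page images without prefetching replay more slowly than the primary runs, which is the case the FPW=off discussion above is concerned with.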
Hi,

On Mon, Dec 8, 2008 at 2:54 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> I understood your point. In the case of synchronous replication,
> because the slave takes over when the master crashes, there's no need
> to keep FPW from the beginning.
>
> In this case, only prefetch will work. Fujii's code at the slave
> looks very similar to pg_standby, and pg_readahead will help in this
> case with no modification.

As a result of the discussion, I will change the way recovery is done
on the standby: we won't use PITR for the WAL which walreceiver
received. Instead, the startup process reads it *record* by record from
pg_xlog and redoes it. So, I'm afraid that synchronous replication
doesn't match well with pg_readahead.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hmmm, it really looks like pg_readahead needs to be included in the
core. I don't think it's a big job, and I will try to do this.

2008/12/9 Fujii Masao <masao.fujii@gmail.com>:
> As a result of the discussion, I will change the way recovery is done
> on the standby: we won't use PITR for the WAL which walreceiver
> received. Instead, the startup process reads it *record* by record from
> pg_xlog and redoes it. So, I'm afraid that synchronous replication
> doesn't match well with pg_readahead.

-- 
------
Koichi Suzuki
On Fri, Dec 12, 2008 at 9:08 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> Hmmm, it really looks like pg_readahead needs to be included in the
> core. I don't think it's a big job, and I will try to do this.
>

Yes, I think it's best to have it in core. I would actually combine it
with the other idea of reading xlog files directly into xlog buffers
during recovery.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
I'm now writing a v3 patch of the PITR improvement, to work with sync
rep and Hot Standby. I would like to move to a new thread.

2008/12/12 Pavan Deolasee <pavan.deolasee@gmail.com>:
> Yes, I think it's best to have it in core. I would actually combine it
> with the other idea of reading xlog files directly into xlog buffers
> during recovery.
>
> Thanks,
> Pavan

-- 
------
Koichi Suzuki