Thread: V2 of PITR performance improvement for 8.4

V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
Please find enclosed a revised version of pg_readahead and a patch to
invoke pg_readahead.
Changes from the previous version are as follows:

pg_readahead no longer returns a prefetched point.  It simply
prefetches all the data pages referred to by WAL records in a given
WAL segment, except for those whose first WAL record includes a full
page write.

Because of this change, the patch to the core was changed so that
pg_readahead is invoked when a WAL segment is opened.
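
For readers who want a picture of the selection rule described above, here
is a minimal sketch in Python.  It is not the pg_readahead code.  The
record tuples are a hypothetical, already-decoded view of a WAL segment,
and the use of posix_fadvise() with POSIX_FADV_WILLNEED is an assumption
about how such a prefetch could be issued on Linux.

```python
import os

BLOCK_SIZE = 8192  # PostgreSQL's default page size (BLCKSZ)


def select_blocks_to_prefetch(records):
    """Pick the data pages worth prefetching for one WAL segment.

    `records` is a hypothetical, already-decoded view of the segment:
    (data_file_path, block_number, has_full_page_image) tuples in WAL order.
    A page is skipped when its first record in the segment carries a full
    page image, because redo then restores the page without reading it.
    """
    seen = set()
    wanted = []
    for path, block_no, has_fpi in records:
        key = (path, block_no)
        if key in seen:
            continue  # only the first record for a page decides
        seen.add(key)
        if not has_fpi:
            wanted.append(key)
    return wanted


def prefetch(blocks):
    """Hint the kernel to start reading each page (assumed mechanism)."""
    by_file = {}
    for path, block_no in blocks:
        by_file.setdefault(path, []).append(block_no)
    for path, block_nos in by_file.items():
        fd = os.open(path, os.O_RDONLY)
        try:
            for block_no in sorted(block_nos):
                os.posix_fadvise(fd, block_no * BLOCK_SIZE, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
```

Calling prefetch(select_blocks_to_prefetch(records)) once per segment
mirrors the "invoke when a WAL segment is opened" behaviour; the hint is
advisory, so the kernel may ignore it.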

Details can be found in the README.

I've done a benchmark to see the effect of the prefetch.   Here's a report.

--------------------------------
Benchmark: DBT-2

Database size: 20GB

We submitted fewer transactions than the DBT-2 default to avoid
overloading the system.  We ran the benchmark for one hour with
checkpoint timeout 30min and completion_target 0.5.
Then we collected all the archive logs and ran PITR.

Disks: RAID0 array (8 disks, 7200rpm).

Detailed conditions are given at the end.

Measurement results are as follows: (for readability, a PDF chart is also attached)

----------------------+------------+--------------------+---------------
WAL conditions        | Recovery   | Amount of          | recovery
                      | time (sec) | physical read (MB) | rate (TX/min)
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with cp      |  6,611     |     5,435          |    402
FPW=off               |            |                    |
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with cp      |  1,683     |       801          |  1,458
FPW=on                |            |                    |  (8.3)
----------------------+------------+--------------------+---------------
w/o prefetch          |            |                    |
archived with lesslog |  6,644     |     5,090          |    369
FPW=on                |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with cp      |  1,161     |     5,543          |  2,290
FPW=off               |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with cp      |  1,415     |     2,157          |  1,733
FPW=on                |            |                    |
----------------------+------------+--------------------+---------------
With prefetch         |            |                    |
archived with lesslog |  1,196     |     5,369          |  2,051
FPW=on                |            |                    | (This proposal)
----------------------+------------+--------------------+---------------
* lesslog means pg_compresslog
** DBT-2 throughput: 682TPM (FPW=on), 739TPM (FPW=off)


This shows that although prefetch does not reduce the physical read,
it tremendously improves the time to read.  As a result, if the WAL
archive is taken with pg_lesslog and prefetch is done, recovery
duration is somewhat shorter than the current FPW=on score.
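
As an illustration of this reading of the table, the short Python sketch
below derives the effective read rate and the recovery-time ratio from the
reported numbers.  The figures are copied from the table; the dictionary
keys and function names are made up for this example.

```python
# (recovery time in seconds, physical read in MB), copied from the table
RESULTS = {
    "no prefetch, cp, FPW=off": (6611, 5435),
    "no prefetch, cp, FPW=on": (1683, 801),
    "no prefetch, lesslog, FPW=on": (6644, 5090),
    "prefetch, cp, FPW=off": (1161, 5543),
    "prefetch, cp, FPW=on": (1415, 2157),
    "prefetch, lesslog, FPW=on": (1196, 5369),
}


def read_rate_mb_per_sec(name):
    """Physical read volume divided by total recovery time."""
    seconds, megabytes = RESULTS[name]
    return megabytes / seconds


def speedup(before, after):
    """How many times shorter the recovery time is in `after`."""
    return RESULTS[before][0] / RESULTS[after][0]


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name:32s} {read_rate_mb_per_sec(name):6.2f} MB/s")
    print("lesslog speedup from prefetch:",
          round(speedup("no prefetch, lesslog, FPW=on",
                        "prefetch, lesslog, FPW=on"), 2))
```

With lesslog, the amount read stays around 5 GB while the recovery time
drops by more than a factor of five, so the gain comes from reading faster
rather than from reading less.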

An important point is that the recovery rate is much higher than the
DBT-2 throughput.
Therefore, this can be combined with synchronous replication and hot standby,
tremendously reducing the amount of log to be shipped (down to as
little as one tenth), improving the recovery time, and maintaining the
chance of successful crash recovery.

With just FPW=off or just pg_compresslog (that is, without prefetch),
recovery does not catch up.
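
The catch-up comparison can be spelled out from the table with a small
Python sketch.  It assumes that the recovery rate (TX/min) and the DBT-2
throughput (TPM) count the same kind of transaction, which the report
does not state explicitly; the names used here are illustrative only.

```python
# DBT-2 throughput on the primary, by full_page_writes setting (TPM)
PRIMARY_TPM = {"on": 682, "off": 739}

# (prefetch, archiver, FPW setting, recovery rate in TX/min) from the table
RECOVERY_RATES = [
    ("no prefetch", "cp", "off", 402),
    ("no prefetch", "cp", "on", 1458),
    ("no prefetch", "lesslog", "on", 369),
    ("prefetch", "cp", "off", 2290),
    ("prefetch", "cp", "on", 1733),
    ("prefetch", "lesslog", "on", 2051),
]


def keeps_up(fpw, recovery_rate):
    """True if redo replays transactions faster than the primary makes them."""
    return recovery_rate > PRIMARY_TPM[fpw]


def catch_up_report():
    """Map each measured configuration to whether recovery can catch up."""
    return {
        (prefetch, archiver, fpw): keeps_up(fpw, rate)
        for prefetch, archiver, fpw, rate in RECOVERY_RATES
    }
```

Under that assumption, the two configurations that drop full page images
without prefetching are the only ones where replay is slower than the
primary.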

Because the current pg_readahead only works on Linux, I'd like the patch
to go into the core and pg_readahead into contrib.

Other (major) environment details are given below.

----<< H/W and OS >>-------------------
CPU: Pentium D, 2.8GHz
Memory: 2GB
Internal Disk: SATA 150GB, used to archive WAL.
External Disk: RAID 0 (Ultra Wide SCSI), 8 disks (SATA 7200rpm)
OS: RHEL ES 5.1 (64bit)

----<< Other PostgreSQL configuration >>--------
PostgreSQL: 8.4 dev. head, as of Oct.28th
max_connections: 100
shared_buffers:  32MB
checkpoint_segments: 1000
checkpoint_timeout: 30min
checkpoint_completion_target: 0.5
archive_mode: on
autovacuum: on
logging_collector: on

--
------
Koichi Suzuki


Re: V2 of PITR performance improvement for 8.4

From
Simon Riggs
Date:
On Thu, 2008-11-27 at 21:04 +0900, Koichi Suzuki wrote:

> We ran the
> benchmark for one hour with checkpoint timeout 30min and completion_target 0.5.
> Then, collected all the archive log and run PITR.

> ----------------------+------------+--------------------+---------------
> WAL conditions        | Recovery   | Amount of          | recovery
>                       | time (sec) | physical read (MB) | rate (TX/min)
> ----------------------+------------+--------------------+---------------
> w/o prefetch          |            |                    |
> archived with cp      |  6,611     |     5,435          |    402
> FPW=off               |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with cp      |  1,161     |     5,543          |  2,290
> FPW=off               |            |                    |
> ----------------------+------------+--------------------+---------------

There's clearly a huge gain using prefetch, when we have
full_page_writes = off. But that does make me think: Why do we need
prefetch at all if we use full page writes? There's nothing to prefetch
if we can keep it in cache.

I notice we set the checkpoint_timeout to 30 mins, which is long enough
to exceed the cache on the standby. I wonder if we reduced the timeout
would we use the cache better on the standby and not need readahead at
all? Do you have any results to examine cache overflow/shorter timeouts?

> w/o prefetch          |            |                    |
> archived with cp      |  1,683     |       801          |  1,458
> FPW=on                |            |                    |  (8.3)
> ----------------------+------------+--------------------+---------------
> w/o prefetch          |            |                    |
> archived with lesslog |  6,644     |     5,090          |    369
> FPW=on                |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with cp      |  1,415     |     2,157          |  1,733
> FPW=on                |            |                    |
> ----------------------+------------+--------------------+---------------
> With prefetch         |            |                    |
> archived with lesslog |  1,196     |     5,369          |  2,051
> FPW=on                |            |                    | (This proposal)
> ----------------------+------------+--------------------+---------------

So I'm wondering if we only need prefetch because we're using lesslog?

If we integrated lesslog better into the new replication would we be
able to forget about doing the prefetch altogether?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
Hi,

As to the checkpoint timeout: yes, this measurement is hard on the
FPW=on case.  I'll do a similar measurement with checkpoint timeout =
5min and post the result.  I expect that the recovery time will be
almost the same in the case of FPW=on, lesslog=yes and prefetch=yes.

2008/12/2 Simon Riggs <simon@2ndquadrant.com>:
>
> On Thu, 2008-11-27 at 21:04 +0900, Koichi Suzuki wrote:
>
>> We ran the
>> benchmark for one hour with checkpoint timeout 30min and completion_target 0.5.
>> Then, collected all the archive log and run PITR.
>
>> ----------------------+------------+--------------------+---------------
>> WAL conditions        | Recovery   | Amount of          | recovery
>>                       | time (sec) | physical read (MB) | rate (TX/min)
>> ----------------------+------------+--------------------+---------------
>> w/o prefetch          |            |                    |
>> archived with cp      |  6,611     |     5,435          |    402
>> FPW=off               |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with cp      |  1,161     |     5,543          |  2,290
>> FPW=off               |            |                    |
>> ----------------------+------------+--------------------+---------------
>
> There's clearly a huge gain using prefetch, when we have
> full_page_writes = off. But that does make me think: Why do we need
> prefetch at all if we use full page writes? There's nothing to prefetch
> if we can keep it in cache.

Agreed.  This is why I proposed making prefetch optional through a GUC.

>
> I notice we set the checkpoint_timeout to 30 mins, which is long enough
> to exceed the cache on the standby. I wonder if we reduced the timeout
> would we use the cache better on the standby and not need readahead at
> all? Do you have any results to examine cache overflow/shorter timeouts?
>
>> w/o prefetch          |            |                    |
>> archived with cp      |  1,683     |       801          |  1,458
>> FPW=on                |            |                    |  (8.3)
>> ----------------------+------------+--------------------+---------------
>> w/o prefetch          |            |                    |
>> archived with lesslog |  6,644     |     5,090          |    369
>> FPW=on                |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with cp      |  1,415     |     2,157          |  1,733
>> FPW=on                |            |                    |
>> ----------------------+------------+--------------------+---------------
>> With prefetch         |            |                    |
>> archived with lesslog |  1,196     |     5,369          |  2,051
>> FPW=on                |            |                    | (This proposal)
>> ----------------------+------------+--------------------+---------------
>
> So I'm wondering if we only need prefetch because we're using lesslog?
>
> If we integrated lesslog better into the new replication would we be
> able to forget about doing the prefetch altogether?

In the case of lesslog, almost all the FPW is replaced with the
corresponding incremental log, and recovery takes longer.  Prefetch
dramatically improves this, as you can see in the above result.  To
improve recovery time with FPW=off, or with FPW=on and lesslog=yes, we
need prefetch.
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>



-- 
------
Koichi Suzuki


Re: V2 of PITR performance improvement for 8.4

From
"Fujii Masao"
Date:
Hi,

On Thu, Nov 27, 2008 at 9:04 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> Please find enclosed a revised version of pg_readahead and a patch to
> invoke pg_readahead.

Some similar functions are in xlog.c and pg_readahead.c (for example,
RecordIsValid). I think that we should unify them into common functions,
which would help in developing tools that process WAL (for example,
xlogdump) in the future.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
Agreed.

I borrowed the WAL parsing code from XLogdump, and I think WAL parsing
should be another candidate for unification.

2008/12/3 Fujii Masao <masao.fujii@gmail.com>:
> Hi,
>
> On Thu, Nov 27, 2008 at 9:04 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>> Please find enclosed a revised version of pg_readahead and a patch to
>> invoke pg_readahead.
>
> Some similar functions are in xlog.c and pg_readahead.c (for example,
> RecordIsValid). I think that we should unify them as a common function,
> which helps to develop the tool (for example, xlogdump) treating WAL in
> the future.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>



-- 
------
Koichi Suzuki


Re: V2 of PITR performance improvement for 8.4

From
Simon Riggs
Date:
On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:

> > There's clearly a huge gain using prefetch, when we have
> > full_page_writes = off. But that does make me think: Why do we need
> > prefetch at all if we use full page writes? There's nothing to prefetch
> > if we can keep it in cache.
> 
> Agreed.   This is why I proposed prefetch optional through GUC.
> 
> > So I'm wondering if we only need prefetch because we're using lesslog?
> >
> > If we integrated lesslog better into the new replication would we be
> > able to forget about doing the prefetch altogether?
> 
> In the case of lesslog, almost all the FPW is replaced with
> corresponding incremental log and recovery takes longer.   Prefetch
> dramatically improves this, as you will see in the above result.    To
> improve recovery time with FPW=off or FPW=on and lesslog=yes, we need
> prefetch.

It does sound like it is needed, yes. But if you look at the
architecture of synchronous replication in 8.4 then I don't think it
makes sense any more. It would be very useful for the architecture we
had in 8.3, but that time has gone.

If we have FPW=on on primary then we will stream WAL with FPW to
standby. There seems little point removing it *after* it has been sent,
then putting it back again before we recover, especially when it causes
a drop in performance that then needs to be fixed (by this patch).

pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.

So if we find a way of streaming WAL without FPW then this patch makes
sense, but not until then. So far many people have argued in favour of
using FPW=on, which was the whole point of pg_lesslog. Are we now saying
that we would run the primary with FPW=off?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: V2 of PITR performance improvement for 8.4

From
"Fujii Masao"
Date:
Hi,

On Thu, Dec 4, 2008 at 6:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:
>
>> > There's clearly a huge gain using prefetch, when we have
>> > full_page_writes = off. But that does make me think: Why do we need
>> > prefetch at all if we use full page writes? There's nothing to prefetch
>> > if we can keep it in cache.
>>
>> Agreed.   This is why I proposed prefetch optional through GUC.
>>
>> > So I'm wondering if we only need prefetch because we're using lesslog?
>> >
>> > If we integrated lesslog better into the new replication would we be
>> > able to forget about doing the prefetch altogether?
>>
>> In the case of lesslog, almost all the FPW is replaced with
>> corresponding incremental log and recovery takes longer.   Prefetch
>> dramatically improves this, as you will see in the above result.    To
>> improve recovery time with FPW=off or FPW=on and lesslog=yes, we need
>> prefetch.
>
> It does sound like it is needed, yes. But if you look at the
> architecture of synchronous replication in 8.4 then I don't think it
> makes sense any more. It would be very useful for the architecture we
> had in 8.3, but that time has gone.

Agreed. I also think that lesslog is for archiving on a single node rather
than for replication between multiple nodes. Of course, it's very useful
for users who don't use replication.

> So if we find a way of streaming WAL without FPW then this patch makes
> sense, but not until then. So far many people have argued in favour of
> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
> that we would run the primary with FPW=off?

If we always recover a database from a base backup, the primary can run
with FPW=off. Since we might need a fresh backup anyway when making the
failed server catch up with the current primary, such a restriction
(always recovering from a backup) might not matter.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
I understand your point.  In the case of synchronous replication,
because the slave fails over when the master crashes, there is no need
to keep FPW in the first place.

In this case, only prefetch applies.  Fujii's code at the slave
looks very similar to pg_standby, and pg_readahead will help in this
case with no modification.

2008/12/4 Simon Riggs <simon@2ndquadrant.com>:
>
> On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:
>
>> > There's clearly a huge gain using prefetch, when we have
>> > full_page_writes = off. But that does make me think: Why do we need
>> > prefetch at all if we use full page writes? There's nothing to prefetch
>> > if we can keep it in cache.
>>
>> Agreed.   This is why I proposed prefetch optional through GUC.
>>
>> > So I'm wondering if we only need prefetch because we're using lesslog?
>> >
>> > If we integrated lesslog better into the new replication would we be
>> > able to forget about doing the prefetch altogether?
>>
>> In the case of lesslog, almost all the FPW is replaced with
>> corresponding incremental log and recovery takes longer.   Prefetch
>> dramatically improves this, as you will see in the above result.    To
>> improve recovery time with FPW=off or FPW=on and lesslog=yes, we need
>> prefetch.
>
> It does sound like it is needed, yes. But if you look at the
> architecture of synchronous replication in 8.4 then I don't think it
> makes sense any more. It would be very useful for the architecture we
> had in 8.3, but that time has gone.
>
> If we have FPW=on on primary then we will stream WAL with FPW to
> standby. There seems little point removing it *after* it has been sent,
> then putting it back again before we recover, especially when it causes
> a drop in performance that then needs to be fixed (by this patch).
>
> pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.
>
> So if we find a way of streaming WAL without FPW then this patch makes
> sense, but not until then. So far many people have argued in favour of
> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
> that we would run the primary with FPW=off?
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>



-- 
------
Koichi Suzuki


Re: V2 of PITR performance improvement for 8.4

From
"Fujii Masao"
Date:
Hi,

On Mon, Dec 8, 2008 at 2:54 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> I understood your point.  In the case of synchronous replication,
> because slave fails over when master crashes,  there is no need to
> leave FPW from the beginning.
>
> In this case, only prefetch will work.   Fujii's code at the slave
> looks very similar to pg_standby and pg_readahead will help in this
> case with no modification.

As a result of the discussion, I will change how recovery works on the
standby: we won't use PITR for the WAL that walreceiver has received;
instead, the startup process reads it by *record* from pg_xlog and redoes
it. So, I'm afraid that synchronous replication doesn't match well with
pg_readahead.

Regards,

>
> 2008/12/4 Simon Riggs <simon@2ndquadrant.com>:
>>
>> On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:
>>
>>> > There's clearly a huge gain using prefetch, when we have
>>> > full_page_writes = off. But that does make me think: Why do we need
>>> > prefetch at all if we use full page writes? There's nothing to prefetch
>>> > if we can keep it in cache.
>>>
>>> Agreed.   This is why I proposed prefetch optional through GUC.
>>>
>>> > So I'm wondering if we only need prefetch because we're using lesslog?
>>> >
>>> > If we integrated lesslog better into the new replication would we be
>>> > able to forget about doing the prefetch altogether?
>>>
>>> In the case of lesslog, almost all the FPW is replaced with
>>> corresponding incremental log and recovery takes longer.   Prefetch
>>> dramatically improves this, as you will see in the above result.    To
>>> improve recovery time with FPW=off or FPW=on and lesslog=yes, we need
>>> prefetch.
>>
>> It does sound like it is needed, yes. But if you look at the
>> architecture of synchronous replication in 8.4 then I don't think it
>> makes sense any more. It would be very useful for the architecture we
>> had in 8.3, but that time has gone.
>>
>> If we have FPW=on on primary then we will stream WAL with FPW to
>> standby. There seems little point removing it *after* it has been sent,
>> then putting it back again before we recover, especially when it causes
>> a drop in performance that then needs to be fixed (by this patch).
>>
>> pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.
>>
>> So if we find a way of streaming WAL without FPW then this patch makes
>> sense, but not until then. So far many people have argued in favour of
>> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
>> that we would run the primary with FPW=off?
>>
>> --
>>  Simon Riggs           www.2ndQuadrant.com
>>  PostgreSQL Training, Services and Support
>>
>>
>
>
>
> --
> ------
> Koichi Suzuki
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
Hmmm, it really looks like pg_readahead needs to be included in the core.
I don't think it's a big job, and I will try to do this.

2008/12/9 Fujii Masao <masao.fujii@gmail.com>:
> Hi,
>
> On Mon, Dec 8, 2008 at 2:54 PM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>> I understood your point.  In the case of synchronous replication,
>> because slave fails over when master crashes,  there is no need to
>> leave FPW from the beginning.
>>
>> In this case, only prefetch will work.   Fujii's code at the slave
>> looks very similar to pg_standby and pg_readahead will help in this
>> case with no modification.
>
> As the result of discussion, I will change the way to recover on the standby;
> we don't use PITR for the WAL which walreceiver received, instead, startup
> process read it by *record* from pg_xlog and redo. So, I'm afraid that
> synchronous replication doesn't match well with pg_readahead.
>
> Regards,
>
>>
>> 2008/12/4 Simon Riggs <simon@2ndquadrant.com>:
>>>
>>> On Wed, 2008-12-03 at 14:22 +0900, Koichi Suzuki wrote:
>>>
>>>> > There's clearly a huge gain using prefetch, when we have
>>>> > full_page_writes = off. But that does make me think: Why do we need
>>>> > prefetch at all if we use full page writes? There's nothing to prefetch
>>>> > if we can keep it in cache.
>>>>
>>>> Agreed.   This is why I proposed prefetch optional through GUC.
>>>>
>>>> > So I'm wondering if we only need prefetch because we're using lesslog?
>>>> >
>>>> > If we integrated lesslog better into the new replication would we be
>>>> > able to forget about doing the prefetch altogether?
>>>>
>>>> In the case of lesslog, almost all the FPW is replaced with
>>>> corresponding incremental log and recovery takes longer.   Prefetch
>>>> dramatically improves this, as you will see in the above result.    To
>>>> improve recovery time with FPW=off or FPW=on and lesslog=yes, we need
>>>> prefetch.
>>>
>>> It does sound like it is needed, yes. But if you look at the
>>> architecture of synchronous replication in 8.4 then I don't think it
>>> makes sense any more. It would be very useful for the architecture we
>>> had in 8.3, but that time has gone.
>>>
>>> If we have FPW=on on primary then we will stream WAL with FPW to
>>> standby. There seems little point removing it *after* it has been sent,
>>> then putting it back again before we recover, especially when it causes
>>> a drop in performance that then needs to be fixed (by this patch).
>>>
>>> pg_lesslog allowed us to write FPW to disk, yet send WAL without FPW.
>>>
>>> So if we find a way of streaming WAL without FPW then this patch makes
>>> sense, but not until then. So far many people have argued in favour of
>>> using FPW=on, which was the whole point of pg_lesslog. Are we now saying
>>> that we would run the primary with FPW=off?
>>>
>>> --
>>>  Simon Riggs           www.2ndQuadrant.com
>>>  PostgreSQL Training, Services and Support
>>>
>>>
>>
>>
>>
>> --
>> ------
>> Koichi Suzuki
>>
>
>
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>



-- 
------
Koichi Suzuki


Re: V2 of PITR performance improvement for 8.4

From
"Pavan Deolasee"
Date:
On Fri, Dec 12, 2008 at 9:08 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> Hmmm,  it's really like pg_readahead needs to be included in the core.
>   I don't think it's a big work and will try to do this.
>

Yes, I think it's best to have it in core. I would actually combine it
with the other idea of reading xlog files directly into xlog buffers
during recovery.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: V2 of PITR performance improvement for 8.4

From
"Koichi Suzuki"
Date:
I'm now writing a v3 patch of the PITR improvement, to work with sync rep
and Hot Standby.  I'd like to move to a new thread for it.

2008/12/12 Pavan Deolasee <pavan.deolasee@gmail.com>:
> On Fri, Dec 12, 2008 at 9:08 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>> Hmmm,  it's really like pg_readahead needs to be included in the core.
>>   I don't think it's a big work and will try to do this.
>>
>
> Yes, I think it's best to have it in core. I would actually combine it
> with the other idea of reading xlog files directly into xlog buffers
> during recovery.
>
> Thanks,
> Pavan
>
> --
> Pavan Deolasee
> EnterpriseDB     http://www.enterprisedb.com
>



-- 
------
Koichi Suzuki