Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size - Mailing list pgsql-bugs

From Valentine Gogichashvili
Subject Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
Date
Msg-id CAP93muXCLBBnHuWrbr8Lh6tNTFNVYRTp9VRTSApMK1UY+QpYYA@mail.gmail.com
In response to Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size  (Valentine Gogichashvili <valgog@gmail.com>)
List pgsql-bugs
Hello Andres,

here is the process, which is currently not using any CPU at all, with
shared_buffers set to 2GB:

50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres: startup process   recovering 000000050000262E000000FD

It has been hanging on that file for several minutes now.

Here is the strace output:

$ strace -c -f -p 50978
Process 50978 attached - interrupt to quit
 Process 50978 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.82    0.007999          37       215           select
  2.73    0.000230           1       215           getppid
  2.45    0.000207           1       215       215 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.008436                   645       215 total
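
As a rough aid for reading the summary above, here is a minimal Python sketch that parses the `strace -c` table and reports, per syscall, the call count and how many calls failed. The function name `parse_strace_summary` and the dictionary keys are illustrative only and are not part of strace; the input is copied verbatim from the output above.

```python
# Sketch: parse the per-syscall rows of an `strace -c` summary table.
# The sample text is the table from this message; names are illustrative.

STRACE_SUMMARY = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.82    0.007999          37       215           select
  2.73    0.000230           1       215           getppid
  2.45    0.000207           1       215       215 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.008436                   645       215 total
"""


def parse_strace_summary(text):
    """Return one dict per syscall row, skipping header, rules and total."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows have 5 tokens (no errors) or 6 tokens (with errors)
        # and start with a number; the "total" row is skipped.
        if len(tokens) < 5 or not tokens[0][0].isdigit():
            continue
        if tokens[-1] == "total":
            continue
        rows.append({
            "pct_time": float(tokens[0]),
            "seconds": float(tokens[1]),
            "usecs_per_call": int(tokens[2]),
            "calls": int(tokens[3]),
            "errors": int(tokens[4]) if len(tokens) == 6 else 0,
            "syscall": tokens[-1],
        })
    return rows


rows = parse_strace_summary(STRACE_SUMMARY)
for row in rows:
    print(f'{row["syscall"]:<8} calls={row["calls"]} errors={row["errors"]}')

total_seconds = sum(row["seconds"] for row in rows)
print(f"time spent in traced syscalls: {total_seconds:.6f} s")
```

In this sample the three syscalls are each issued the same number of times and every `stat` call fails, with well under a hundredth of a second spent in syscalls overall. That pattern looks like a loop that sleeps and polls rather than one doing replay work, though the trace alone does not show which file is being checked.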

What kind of additional profiling information would you like to see?

Regards,

-- Valentin


On Wed, Aug 15, 2012 at 4:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> Hi,
>
> On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      7494
> > Logged by:          Valentine Gogichashvili
> > Email address:      valgog@gmail.com
> > PostgreSQL version: 9.0.7
> > Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
> > Description:
> >
> > We are experiencing strange(?) behavior on the replication slave
> > machines. The master machine has a very heavy update load, where many
> > processes are updating lots of data. It generates up to 30GB of WAL files
> > per hour. Normally it is not a problem for the slave machines to replay
> > this amount of WAL files on time and keep up with the master. But at some
> > moments, the slaves are “hanging” with 100% CPU usage on the WAL replay
> > process and 3% IOWait, needing up to 30 seconds to process one WAL file.
> > If this tipping point is reached, then a huge WAL replication lag builds
> > up quite fast, which also leads to overfilling of the XLOG directory on
> > the slave machines, as the WAL receiver is putting the WAL files it gets
> > via streaming replication into the XLOG directory (which, in many cases,
> > is a separate disk partition of quite limited size).
> Could you try to get a profile of that 100% cpu time?
>
> Greetings,
>
> Andres
> --
> Andres Freund           http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
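
For a sense of scale, the figures in the quoted report can be turned into a per-segment time budget. This is a back-of-the-envelope sketch only: it assumes the default 16MB WAL segment size and 1GB = 1024MB, neither of which is stated in the report, and all variable names are illustrative.

```python
# Figures taken from the quoted bug report.
wal_gb_per_hour = 30            # "up to 30GB of WAL files per hour"
slow_replay_s_per_segment = 30  # "up to 30 seconds to process one WAL file"

# Assumptions (not stated in the report): default 16MB segments, 1GB = 1024MB.
segment_mb = 16

# Segments the master produces per hour at the peak rate.
segments_per_hour = wal_gb_per_hour * 1024 // segment_mb

# Time the slave may spend on each segment without falling behind.
budget_s_per_segment = 3600 / segments_per_hour

# Segments the slave replays per hour while in the slow state.
replayed_per_hour = 3600 // slow_replay_s_per_segment

# Backlog that accumulates per hour while in the slow state.
backlog_gb_per_hour = (segments_per_hour - replayed_per_hour) * segment_mb / 1024

print("segments produced per hour:", segments_per_hour)
print("replay budget per segment (s):", budget_s_per_segment)
print("segments replayed per hour when slow:", replayed_per_hour)
print("backlog growth (GB/hour):", backlog_gb_per_hour)
```

Under these assumptions the slave has a little under two seconds per segment at the peak WAL rate, so 30 seconds per segment leaves almost all incoming WAL queued in the XLOG directory.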
