Thread: BUG #8192: On very large tables the concurrent update with vacuum lag the hot_standby replica

The following bug has been logged on the website:

Bug reference:      8192
Logged by:          Federico Campoli
Email address:      federico@brandwatch.com
PostgreSQL version: 9.2.4
Operating system:   Debian 6.0
Description:


/*

Description:

It seems that on very large tables a concurrent update with vacuum (or
autovacuum), when the slave is in hot standby mode, generates long read
loops on a single WAL segment during the recovery process.

This has two nasty effects: a massive read IO peak, and the replay lag
increasing as the recovery process hangs for long periods in a pointless
loop.

PostgreSQL version: PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled
by gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

Steps to reproduce the error:
set up a hot standby server.
The error occurs with streaming replication both enabled and disabled.

*/
SET client_min_messages='debug2';
SET trace_sort='on';


--create a new empty table
DROP TABLE IF EXISTS t_vacuum;
CREATE table t_vacuum
    (
       i_id_row integer,
       ts_time  timestamp with time zone default now()
    )
;
--disable the autovacuum, we need to run it manually
ALTER TABLE t_vacuum SET (autovacuum_enabled='off');

--insert into the table enough data to have multiple file nodes
INSERT INTO t_vacuum
    (
        i_id_row
    )
    SELECT * FROM generate_series(1,40000000)
;
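
--optionally, check that the table really spans multiple 1GB file segments
--(just a sanity check; pg_relation_size reports the size of the main fork):
SELECT pg_size_pretty(pg_relation_size('t_vacuum'));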


/*
start iotop on the hot standby, monitoring the postgres: startup process
and, if needed, the wal receiver, for example: iotop -p 31293
Wait for all the wal files generated during the update to be applied.

This does not apply to live servers of course, but it is useful to capture
only the IO activity generated by vacuum.

If streaming replication is on, this select (run on the master) will tell
when the slave has finished.

In any case it is useful to check the replication lag during the vacuum.

SELECT
    pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location))
FROM
    pg_stat_replication
;

*/
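/*
A rough replay-delay check can also be run on the standby itself (a sketch;
pg_last_xact_replay_timestamp() is available from 9.1 onwards):

SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
*/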

--in a new session, start a huge table update
UPDATE t_vacuum SET ts_time=now() WHERE i_id_row<20000000;

--then vacuum the table

VACUUM VERBOSE t_vacuum;

--at some point the startup process will get stuck recovering one single wal
--file and the DISK READ column will show huge IO for a while.
--if you also monitor the wal receiver you will notice the stream continues
--without problems.
--sooner or later the startup process will resume the recovery, but meanwhile
--the DISK READ is not zero and the slave will accumulate replication lag.

On 30 May 2013 17:43,  <federico@brandwatch.com> wrote:

> It seems on very large tables the concurrent update with vacuum (or
> autovacuum),
> when the slave is in hot standby mode, generates long loops in read on a
> single wal segment during the recovery process.

Not all aspects of hot standby are optimally tuned, as yet. Some room
for improvement exists.

In some cases, conscious choices were made to keep processing on
master fast at the expense of speed on the standby.

There doesn't seem to be anything which is in itself an error here.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thursday, May 30, 2013, wrote:

> The following bug has been logged on the website:
>
> Bug reference:      8192
> Logged by:          Federico Campoli
> Email address:      federico@brandwatch.com
> PostgreSQL version: 9.2.4
> Operating system:   Debian 6.0
> Description:
>
> /*
>
> Description:
>
> It seems on very large tables the concurrent update with vacuum (or
> autovacuum),
> when the slave is in hot standby mode, generates long loops in read on a
> single wal segment during the recovery process.
>
> This have two nasty effects.
> A massive read IO peak and the replay lag increasing as the recovery
> process
> hangs for long periods on a pointless loop.
>

Are you observing a loop, and if so how are you observing it?  What is it
that is looping?


> SET client_min_messages='debug2';
> SET trace_sort='on';
>

Are these settings useful?  What are they showing you?


>
> --in a new session and start an huge table update
> UPDATE t_vacuum set ts_time=now() WHERE i_id_row<20000000;
>
> --then vacuum the table
> VACUUM VERBOSE t_vacuum;
>

Are you running the update and vacuum concurrently or serially?


>
> --at some point the startup process will stuck recovering one single wal
> file and
> --the DISK READ column will show a huge IO for a while.
>

What is huge?


I don't know if I can reproduce this or not.  I certainly get spiky lag,
but I see no reason to think it is anything other than IO congestion,
occurring during stretches of WAL records where compact records describe a
larger amount of work that needs to be done in terms of poorly-cached IO.
 Perhaps the kernel's read-ahead mechanism is not working as well as it
theoretically could be.  Also the standby isn't using a ring-buffer
strategy, but I see no reason to think it would help were it to do so.

The DISK READ column is not what I would call huge during this, often 10-15
MB/s, because much of the IO is scattered rather than sequential.  The IO
wait % on the other hand is maxed out.

It is hard to consider it as a bug that the performance is not as high as
one might wish it to be.  Is this behavior a regression from some earlier
version?  What if hot-standby is turned off?

Cheers,

Jeff
On 02/06/13 01:17, Jeff Janes wrote:
> On Thursday, May 30, 2013, wrote:
>
>     The following bug has been logged on the website:
>
>     Bug reference:      8192
>     Logged by:          Federico Campoli
>     Email address: federico@brandwatch.com
>     PostgreSQL version: 9.2.4
>     Operating system:   Debian 6.0
>     Description:
>
>     /*
>
>     Description:
>
>     It seems on very large tables the concurrent update with vacuum (or
>     autovacuum),
>     when the slave is in hot standby mode, generates long loops in read on a
>     single wal segment during the recovery process.
>
>     This have two nasty effects.
>     A massive read IO peak and the replay lag increasing as the recovery
>     process
>     hangs for long periods on a pointless loop.
>
>
> Are you observing a loop, and if so how are you observing it?  What is
> it that is looping?

I'm sorry, I was just guessing it could be a loop.
The read remains stuck on the same segment.
On my testbox I see at least 1 minute at 20 Mb/s.
On the live system the peak is 124 Mb/s for 2 to 3 minutes without any
progress in the WAL replay.

I've attached the part of the postgresql log with debug2 from my sandbox
when that happens.

In warm standby everything is fine, no lag at all.

At the moment, as a workaround, I've switched the slaves to warm standby
because, despite the low WAL generation on the master (~200 Mb/minute), the
slave accumulates a massive lag when the autovacuum process starts with hot
standby (the peak was 400 GB and was still increasing before switching to
warm standby).

The database is ~4 TB and constantly updated.

Many thanks
Federico

--
Federico Campoli
Database Administrator brandwatch.com

Attachment
On 2013-06-04 13:57:58 +0100, Federico Campoli wrote:
> On 02/06/13 01:17, Jeff Janes wrote:
> >On Thursday, May 30, 2013, wrote:
> >
> >    The following bug has been logged on the website:
> >
> >    Bug reference:      8192
> >    Logged by:          Federico Campoli
> >    Email address: federico@brandwatch.com
> >    PostgreSQL version: 9.2.4
> >    Operating system:   Debian 6.0
> >    Description:
> >
> >    /*
> >
> >    Description:
> >
> >    It seems on very large tables the concurrent update with vacuum (or
> >    autovacuum),
> >    when the slave is in hot standby mode, generates long loops in read on a
> >    single wal segment during the recovery process.
> >
> >    This have two nasty effects.
> >    A massive read IO peak and the replay lag increasing as the recovery
> >    process
> >    hangs for long periods on a pointless loop.
> >
> >
> >Are you observing a loop, and if so how are you observing it?  What is
> >it that is looping?
>
> I'm sorry, just guessing it could be a loop.
> The read remains stuck on the same segment.

Well, if you check the output in that file you can see that 'apply' is
progressing, so it's not stuck in some specific area.
Is the startup process cpu bound during that time? If so, any chance to
get a profile?
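
A quick way to tell whether it is CPU or IO bound (a sketch; the PID is just
an example and pidstat comes with the sysstat package):

pidstat -u -d -p 31293 1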

> In warm standby everything is fine no lag at all.
>
> At moment as workaround I've switched to warm standby the slaves as, despite
> the low wal generation on the master ~200Mb/minute the slave accumulates a
> massive lag when the autovacuum processes starts with hot standby (the peak
> was 400 GB and was still increasing before switching to warm standby).

By switching to a warm standby you mean disabling hot_standby=on in the
standbys or changing wal_level?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2013-06-04 16:20:12 +0100, Federico Campoli wrote:
> >Well, if you check the output in that file you can see that 'apply' is
> >progressing, so it's not stuck in some specific area.
> >Is the startup process cpu bound during that time? If so, any chance to
> >get a profile?
>
> Please find attached the profile file and the postgresql log generated in
> the test run on my sandbox.

Unfortunately a gprof profile isn't meaningful without the generating
binary as far as I know. Could you either generate the callgraph or just
generate a perf profile like:

# start_test
# perf record -a
# ctrl-C
# perf report > somefile

And then send somefile?
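
If call stacks would help, recording with call graphs should also work
(just a sketch):

# perf record -a -g
# ctrl-C
# perf report > somefile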

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 04/06/13 16:29, Andres Freund wrote:
> Hi,
>
> On 2013-06-04 16:20:12 +0100, Federico Campoli wrote:
>>> Well, if you check the output in that file you can see that 'apply' is
>>> progressing, so it's not stuck in some specific area.
>>> Is the startup process cpu bound during that time? If so, any chance to
>>> get a profile?
>>
>> Please find attached the profile file and the postgresql log generated in
>> the test run on my sandbox.
>
> Unfortunately a gprof profile isn't meaningful without the generating
> binary as far as I know. Could you either generate the callgraph or just
> generate a perf profile like:
>
> # start_test
> # perf record -a
> # ctrl-C
> # perf report > somefile
>
> And then send somefile?
>
> Greetings,
>
> Andres Freund
>

This is the gprof output for the profile.

Many thanks
Federico

--
Federico Campoli
Database Administrator brandwatch.com

Attachment
On 04/06/13 14:14, Andres Freund wrote:
> On 2013-06-04 13:57:58 +0100, Federico Campoli wrote:
>> On 02/06/13 01:17, Jeff Janes wrote:
>>> On Thursday, May 30, 2013, wrote:
>>>
>>>     The following bug has been logged on the website:
>>>
>>>     Bug reference:      8192
>>>     Logged by:          Federico Campoli
>>>     Email address: federico@brandwatch.com
>>>     PostgreSQL version: 9.2.4
>>>     Operating system:   Debian 6.0
>>>     Description:
>>>
>>>     /*
>>>
>>>     Description:
>>>
>>>     It seems on very large tables the concurrent update with vacuum (or
>>>     autovacuum),
>>>     when the slave is in hot standby mode, generates long loops in read on a
>>>     single wal segment during the recovery process.
>>>
>>>     This have two nasty effects.
>>>     A massive read IO peak and the replay lag increasing as the recovery
>>>     process
>>>     hangs for long periods on a pointless loop.
>>>
>>>
>>> Are you observing a loop, and if so how are you observing it?  What is
>>> it that is looping?
>>
>> I'm sorry, just guessing it could be a loop.
>> The read remains stuck on the same segment.
>
> Well, if you check the output in that file you can see that 'apply' is
> progressing, so it's not stuck in some specific area.
> Is the startup process cpu bound during that time? If so, any chance to
> get a profile?

Please find attached the profile file and the postgresql log generated
in the test run on my sandbox.

>
> By switching to a warm standby you mean disabling hot_standby=on in the
> standbys or changing wal_level?

The master's wal_level is hot_standby.

I set up the warm standby on the slave by changing the hot_standby
parameter to off.
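
On Debian that amounts to something like the following on the standby (a
sketch; the cluster name and config path are assumptions):

# in /etc/postgresql/9.2/main/postgresql.conf: hot_standby = off
pg_ctlcluster 9.2 main restart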

Cheers
Federico

--
Federico Campoli
Database Administrator brandwatch.com

Attachment
On Tue, Jun 4, 2013 at 5:57 AM, Federico Campoli <federico@brandwatch.com> wrote:

>
> I'm sorry, just guessing it could be a loop.
> The read remains stuck on the same segment.
> On my testbox I have at least 1 minute to 20 Mb/s.
> On the live system the peak is 124 Mb/s for 2 to 3 minutes without any
> progress in the wal reply.
>
> I've attached the part of postgresql's log with debug2  from my sandbox
> when that happens.
>

It looks like it is making progress, just very slowly.  Basically I think
every WAL record that needs to be replayed triggers a random read of some
data block which is not already cached on the standby.
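
One way to see that pattern from the OS side is to watch request sizes and
wait times on the standby's data disk while it lags (a sketch; iostat comes
with the sysstat package, column names as in sysstat 9.x):

iostat -x 1
# a small avgrq-sz together with a high await and %util points to scattered
# single-block reads rather than sequential WAL reads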



> In warm standby everything is fine no lag at all.
>

OK, then I don't think I can reproduce it.  The spiky replay I see is the
same whether the standby is hot or warm.


Cheers,

Jeff
On Tue, Jun 4, 2013 at 8:36 AM, Federico Campoli <federico@brandwatch.com> wrote:

>
> This is the gprof output for the profile.
>

This looks to me like the gprof of a process that is either idle, or
completely IO bound.

I'd probably approach this with a combination of "strace -T -ttt -p <PID>"
and "lsof -p <PID>" on the PID of the start-up process, to see which file
descriptors it is waiting for read or writes on, and what the underlying
names of the files for those file descriptors are.
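
For example (the PID of the startup process is just a placeholder):

strace -T -ttt -p 31293 -o startup_strace.out
lsof -p 31293 > startup_lsof.out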

Cheers,

Jeff
On 06/06/13 21:22, Jeff Janes wrote:
> On Tue, Jun 4, 2013 at 8:36 AM, Federico Campoli
> <federico@brandwatch.com <mailto:federico@brandwatch.com>> wrote:
>
>
>     This is the gprof output for the profile.
>
>
> This looks to me like the gprof of a process that is either idle, or
> completely IO bound.
>
> I'd probably approach this with a combination of "strace -T -ttt -p
> <PID>" and "lsof -p <PID>" on the PID of the start-up process, to see
> which file descriptors it is waiting for read or writes on, and what the
> underlying names of the files for those file descriptors are.
>
> Cheers,
>
> Jeff

I've generated a strace from the startup process, a new profile report
and a new log file.

It's quite big so you can download here

http://d01.megashares.com/dl/BqZLh5H/HS_data_lag.tar

I've monitored the startup process with lsof and the only files accessed are
the WAL being recovered and the relation affected by the master's activity.

Regarding the warm standby, I've repeated the test on my sandbox with
the slave in warm standby and I've noticed a replication lag spike.

This does not happen in production, which runs the same database version
and architecture from the Debian package.

If I turn on hot standby on the failover it accumulates a lag of several
gigabytes in 30 minutes. When I switch it off the lag disappears.

I can only guess that in warm standby the lag has less impact because the
random reads resume more quickly than in hot standby.

Many thanks
Federico

--
Federico Campoli
Database Administrator brandwatch.com
On Fri, Jun 7, 2013 at 6:08 AM, Federico Campoli <federico@brandwatch.com> wrote:

> On 06/06/13 21:22, Jeff Janes wrote:
>
>>
>>
>> I'd probably approach this with a combination of "strace -T -ttt -p
>> <PID>" and "lsof -p <PID>" on the PID of the start-up process, to see
>> which file descriptors it is waiting for read or writes on, and what the
>> underlying names of the files for those file descriptors are.
>>
>> Cheers,
>>
>> Jeff
>>
>
> I've generated a strace from the startup process a new profile report and
> a new log file.
>
> It's quite big so you can download here
>
> http://d01.megashares.com/dl/**BqZLh5H/HS_data_lag.tar<http://d01.megashares.com/dl/BqZLh5H/HS_data_lag.tar>
>
> I've monitored the startup process lsof  and the only files accessed are
> the recovering wal and the relation affected by the master's activity.
>

You didn't need to strace the entire operation, just a sampling for ~1
minute in the middle of it should be enough.


It looks like it is basically idle before 1370603193.846487, so I ignored
that part.  Of the remaining 472 seconds covered (during which time it
replays ~19 WAL files), the sum of the system call times covers only 210
seconds, so the rest is presumably CPU time (although the error due to
rounding and instrumentation vagaries could be quite high).

Even during the active part, quite a bit of the time is still spent idle
("poll([{fd=3, events=POLLIN}], 1, 5000)").  And the biggest remaining
chunk is spent on "lseek(11,", but again I don't know how much to trust
that as any individual call is at the limit of measurability so it could
just be the accumulation of instrumentation errors, not real time.

zcat pg_HS_strace.out.gz | perl -lne '/<([\d.]+)>/ and $_ > 1370603193.846487 and print "$1\t$_"' | sort -rn | perl -lne '$sum+=$_; print "$sum\t$_"' | less

zcat pg_HS_strace.out.gz | perl -lne '/ ([^,]+,).*<([\d.]+)>/ and $_ > 1370603193.846487 and $h{$1}+=$2; END {print "$h{$_}\t$_" foreach keys %h}' | sort -g


For the warm standby, if I cut off the uninteresting parts at the beginning
and end (1370604295.966518 and 1370605296.458290), then it applies ~33 WAL
files in 1000 seconds (assuming the default 16 MB segment size, that is
roughly 0.5 MB/s, against roughly 0.6 MB/s for the hot standby trace above),
so it doesn't seem to be any faster than the hot standby.

Are you sure the strace covered the events of interest?

Do you have output of vmstat or sar for the period of interest?  It would
be interesting to see what the user time and wait time are.
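
For example, one-second samples taken during a lag spike (a sketch):

vmstat 1 60 > vmstat.out
sar -u 1 60 > sar_cpu.out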



>
> Regarding the warm standby, I've repeated the test on my sandbox with the
> slave in warm standby and I've noticed a replication lag spike.
>
> This does not affect the production,same database version and architecture
> from debian package.
>

I'm not sure what you are saying here.  Both HS and WS lag on your test
machine, but only HS does on the production?  And the straces you posted
are from the test machine?

In any case, nothing really stands out here.  Other than the idle periods,
no one thing seems to be dominating the time.

Cheers,

Jeff