Thread: [BUGS] BUG #14538: streaming replication same wal missing

[BUGS] BUG #14538: streaming replication same wal missing

From

vodevsh@gmail.com

Date:

09 February 2017, 19:58:08

The following bug has been logged on the website:

Bug reference:      14538
Logged by:          Vladimir Svedov
Email address:      vodevsh@gmail.com
PostgreSQL version: 9.3.11
Operating system:   Linux
Description:

Hi,
I posted formatted text as a question
here:http://dba.stackexchange.com/questions/163735/streaming-replication-same-wal-missing

Several days ago I saw this in master log:

 replica 2017-02-06 16:10:55 UTC ERROR:  requested WAL segment
000000030000096D00000052 has already been removed
This message was repeating until I stopped slave.

Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
decided that it produced huge amount of WALs. Checking them confirmed my
assumption - all wal_keep_segments=500 were last minute. So I thought that
network bandwidth was not sufficient to send WALs fast enough to slave to
replay them. Master has archive_command = cd ., so no way back.

I did not panic, I recreate replication. I:

delete all data files on slave
pg_start_backup() on prod
rsync -zahv data
pg_start slave
pg_stop_backup() on prod and check logs.
Imagine my desperation to see that slave has this:

-bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
LOG:  could not open usermap file "/pg/data93/pg_ident.conf": No such file
or directory
LOG:  database system was shut down in recovery at 2017-02-09 12:15:07 UTC
LOG:  entering standby mode
LOG:  redo starts at 982/39074640
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  consistent recovery state reached at 982/3AEC60F8
LOG:  unexpected pageaddr 97C/83EC8000 in log segment
00000003000009820000003A, offset 15499264
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 982/3A000000 on timeline 3
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
and so on... 3 days passed! That segment has gone, but why is it looking for
it??? I copied whole data directory.

All pages are readable (pg_dump works on all dbs), changes from master go to
slave. ERROR keeps appearing each several seconds on slave.

Please, tell me I did something stupid and it is not a bug. It happened on
prod.


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14538: streaming replication same wal missing

From

Michael Paquier

Date:

10 February 2017, 07:48:34

On Thu, Feb 9, 2017 at 10:58 PM,  <vodevsh@gmail.com> wrote:
> Bug reference:      14538
> Logged by:          Vladimir Svedov
> Email address:      vodevsh@gmail.com
> PostgreSQL version: 9.3.11

You should really update to 9.3.16, the latest minor release in the
9.3 series. You are missing more than 1 year worth of bug fixes here.

> Several days ago I saw this in master log:
>
>  replica 2017-02-06 16:10:55 UTC ERROR:  requested WAL segment
> 000000030000096D00000052 has already been removed
> This message was repeating until I stopped slave.

That happens :"(

> Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
> decided that it produced huge amount of WALs. Checking them confirmed my
> assumption - all wal_keep_segments=500 were last minute. So I thought that
> network bandwidth was not sufficient to send WALs fast enough to slave to
> replay them. Master has archive_command = cd ., so no way back.
>
> I did not panic, I recreate replication. I:
>
> delete all data files on slave
> pg_start_backup() on prod
> rsync -zahv data
> pg_start slave

Does this line mean that you are starting the slave? You may want to
start it after issuing pg_stop_backup() so a segment switch is done
and a standby is able to fetch all the data it needs has been properly
archived...

> pg_stop_backup() on prod and check logs.
> Imagine my desperation to see that slave has this:
> -bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
> LOG:  could not open usermap file "/pg/data93/pg_ident.conf": No such file
> or directory
> LOG:  database system was shut down in recovery at 2017-02-09 12:15:07 UTC
> LOG:  entering standby mode
> LOG:  redo starts at 982/39074640
> FATAL:  the database system is starting up
> FATAL:  the database system is starting up
> FATAL:  the database system is starting up
> LOG:  consistent recovery state reached at 982/3AEC60F8
> LOG:  unexpected pageaddr 97C/83EC8000 in log segment
> 00000003000009820000003A, offset 15499264
> LOG:  database system is ready to accept read only connections
> LOG:  started streaming WAL from primary at 982/3A000000 on timeline 3
> ERROR:  requested WAL segment 000000030000096D00000052 has already been
> removed

This error message is being generated by a WAL sender running on your
new *slave*, meaning that there is a cascading standby connected to
this slave server and trying to fetch this WAL segment. But it does
not have it as it has been recycled already during recovery.

If the slave you just created was complaining about a missing segment
from its primary it would throw that instead:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
segment 000000010000000000000007 has already been removed

In short your new slave is fine, but you need to rebuild the cascading
standby behind though.
-- 
Michael


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14538: streaming replication same wal missing

From

Vladimir Svedov

Date:

10 February 2017, 15:10:19

Thank you Michael,

Sorry guys for bothering you!

It is not a bug - it is nothing more then my lack of knowledge of ERRORs produced and panic.

I mistook error as produced by slave not finding WAL at master, which would be "FATAL: could not receive data from WAL stream".

As Michael said error was produced by one of cascaded slaves I forgot to take down

I did not know how to call my bug report back, unless somebody answers it, so I'm doing it only now.

Thank you and my apologies again

2017-02-10 1:48 GMT+00:00 Michael Paquier <michael.paquier@gmail.com>:

On Thu, Feb 9, 2017 at 10:58 PM, <vodevsh@gmail.com> wrote:
> Bug reference: 14538
> Logged by: Vladimir Svedov
> Email address: vodevsh@gmail.com
> PostgreSQL version: 9.3.11

You should really update to 9.3.16, the latest minor release in the
9.3 series. You are missing more than 1 year worth of bug fixes here.

> Several days ago I saw this in master log:
>
> replica 2017-02-06 16:10:55 UTC ERROR: requested WAL segment
> 000000030000096D00000052 has already been removed
> This message was repeating until I stopped slave.

That happens :"(

> Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
> decided that it produced huge amount of WALs. Checking them confirmed my
> assumption - all wal_keep_segments=500 were last minute. So I thought that
> network bandwidth was not sufficient to send WALs fast enough to slave to
> replay them. Master has archive_command = cd ., so no way back.
>
> I did not panic, I recreate replication. I:
>
> delete all data files on slave
> pg_start_backup() on prod
> rsync -zahv data
> pg_start slave

Does this line mean that you are starting the slave? You may want to
start it after issuing pg_stop_backup() so a segment switch is done
and a standby is able to fetch all the data it needs has been properly
archived...

> pg_stop_backup() on prod and check logs.
> Imagine my desperation to see that slave has this:
> -bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
> LOG: could not open usermap file "/pg/data93/pg_ident.conf": No such file
> or directory
> LOG: database system was shut down in recovery at 2017-02-09 12:15:07 UTC
> LOG: entering standby mode
> LOG: redo starts at 982/39074640
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> FATAL: the database system is starting up
> LOG: consistent recovery state reached at 982/3AEC60F8
> LOG: unexpected pageaddr 97C/83EC8000 in log segment
> 00000003000009820000003A, offset 15499264
> LOG: database system is ready to accept read only connections
> LOG: started streaming WAL from primary at 982/3A000000 on timeline 3
> ERROR: requested WAL segment 000000030000096D00000052 has already been
> removed

This error message is being generated by a WAL sender running on your
new *slave*, meaning that there is a cascading standby connected to
this slave server and trying to fetch this WAL segment. But it does
not have it as it has been recycled already during recovery.

If the slave you just created was complaining about a missing segment
from its primary it would throw that instead:
FATAL: could not receive data from WAL stream: ERROR: requested WAL
segment 000000010000000000000007 has already been removed

In short your new slave is fine, but you need to rebuild the cascading
standby behind though.
--
Michael