[BUGS] BUG #14538: streaming replication same wal missing - Mailing list pgsql-bugs

From vodevsh@gmail.com
Subject [BUGS] BUG #14538: streaming replication same wal missing
Date
Msg-id 20170209135808.1405.57354@wrigleys.postgresql.org
Whole thread Raw
Responses Re: [BUGS] BUG #14538: streaming replication same wal missing  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14538
Logged by:          Vladimir Svedov
Email address:      vodevsh@gmail.com
PostgreSQL version: 9.3.11
Operating system:   Linux
Description:

Hi,
I posted formatted text as a question
here:http://dba.stackexchange.com/questions/163735/streaming-replication-same-wal-missing

Several days ago I saw this in master log:

 replica 2017-02-06 16:10:55 UTC ERROR:  requested WAL segment
000000030000096D00000052 has already been removed
This message was repeating until I stopped slave.

Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
decided that it produced huge amount of WALs. Checking them confirmed my
assumption - all wal_keep_segments=500 were last minute. So I thought that
network bandwidth was not sufficient to send WALs fast enough to slave to
replay them. Master has archive_command = cd ., so no way back.

I did not panic, I recreate replication. I:

delete all data files on slave
pg_start_backup() on prod
rsync -zahv data
pg_start slave
pg_stop_backup() on prod and check logs.
Imagine my desperation to see that slave has this:

-bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
LOG:  could not open usermap file "/pg/data93/pg_ident.conf": No such file
or directory
LOG:  database system was shut down in recovery at 2017-02-09 12:15:07 UTC
LOG:  entering standby mode
LOG:  redo starts at 982/39074640
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  consistent recovery state reached at 982/3AEC60F8
LOG:  unexpected pageaddr 97C/83EC8000 in log segment
00000003000009820000003A, offset 15499264
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 982/3A000000 on timeline 3
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
ERROR:  requested WAL segment 000000030000096D00000052 has already been
removed
and so on... 3 days passed! That segment has gone, but why is it looking for
it??? I copied whole data directory.

All pages are readable (pg_dump works on all dbs), changes from master go to
slave. ERROR keeps appearing each several seconds on slave.

Please, tell me I did something stupid and it is not a bug. It happened on
prod.


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: mcinp@o2.pl
Date:
Subject: [BUGS] BUG #14537: Gist index irrational growth
Next
From: Tom Lane
Date:
Subject: Re: [BUGS] BUG #14537: Gist index irrational growth