Re: Standby corruption after master is restarted - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: Standby corruption after master is restarted
Msg-id ce06163c-58ed-5dda-ea5c-138c86b62132@2ndquadrant.com
In response to Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Responses Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
List pgsql-bugs

Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time on production.  The primary
> instance was in a loop of being killed by Linux OOM-killer and being
> restarted in 1 minute intervals.  The corruption only happened on one
> of the two standbys.  The primary and the other standby have no
> problems.  Only the primary was killed and restarted, the standbys were not.
> There weren't any unusual settings, "fsync" was not disabled.  Here is
> the information I collected.
> 

I've been trying to reproduce this by running a master with a couple of
replicas, and randomly restarting the master (while pgbench is running).
But so far no luck, so I guess something else is required to reproduce
the behavior ...

> The logs at the time standby broke:
> 
>> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL:  invalid memory alloc request size 1916035072
> 
> And from the next try:
> 
>> 2018-03-28 14:02:15 UTC [26808-5] LOG:  consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL:  invalid memory alloc request size 191603507
> 
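
A note on reading the LSNs in the first log excerpt: the invalid record is
reported at 1DFB/D43BE688 and streaming then restarts at 1DFB/D4000000,
which matches the start of the WAL segment containing that record. A
minimal Python sketch of the arithmetic, assuming the default 16MB
wal_segment_size (the thread does not state the actual setting); the
helper names are hypothetical:

```python
# Assumption: default 16MB WAL segments; not confirmed for these servers.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN printed as 'HI/LO' (both hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Format a 64-bit position back into the 'HI/LO' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def segment_start(text, segment_size=WAL_SEGMENT_SIZE):
    """Return the LSN of the first byte of the segment containing 'text'."""
    lsn = parse_lsn(text)
    return format_lsn(lsn - (lsn % segment_size))


if __name__ == "__main__":
    # LSN of the "invalid resource manager ID" message from the log above.
    print(segment_start("1DFB/D43BE688"))
```

With a different wal_segment_size the computed boundary may differ, so
this only shows that the two logged positions are consistent with each
other under the default setting.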

In the initial report (from August 2017) you shared pg_xlogdump output,
showing that the corrupted WAL record is an FPI_FOR_HINT right after
CHECKPOINT_SHUTDOWN. Was it the same case this time?

BTW which versions are we talking about? I see the initial report
mentioned catversion 201608131, this one mentions 201510051, so I'm
guessing 9.6 and 9.5. Which minor versions?
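
For reference, the catalog version can be read from the "Catalog version
number" line of pg_controldata output. The sketch below is only an
illustration in Python: the lookup table contains just the two values
mentioned in this thread, the function names are hypothetical, and the
sample input is an abbreviated single line rather than real output from
the affected servers. The catversion identifies the major release only,
so it cannot answer the minor-version question.

```python
import re

# Only the two catversion values mentioned in this thread; not a full list.
KNOWN_CATVERSIONS = {
    201608131: "9.6",
    201510051: "9.5",
}


def parse_catversion(controldata_output):
    """Extract the catalog version number from pg_controldata text output."""
    match = re.search(
        r"^Catalog version number:\s*(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


def major_version(catversion):
    """Map a catversion to a major release, or 'unknown' if not in the table."""
    return KNOWN_CATVERSIONS.get(catversion, "unknown")


if __name__ == "__main__":
    # Abbreviated, illustrative input: one line in pg_controldata's format.
    sample = "Catalog version number:               201608131\n"
    print(major_version(parse_catversion(sample)))
```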

Is the master under load (accepting writes) before shutdown?

How was it restarted, actually? I see you're mentioning OOM killer, so I
guess "kill -9". What about the first report - was it the same case, or
was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL
archive? What archive/restore commands do you use?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

