Re: Standby corruption after master is restarted - Mailing list pgsql-bugs

From	Tomas Vondra
Subject	Re: Standby corruption after master is restarted
Date	April 14, 2018 23:38:49
Msg-id	ce06163c-58ed-5dda-ea5c-138c86b62132@2ndquadrant.com
In response to	Re: Standby corruption after master is restarted (Emre Hasegeli <emre@hasegeli.com>)
Responses	Re: Standby corruption after master is restarted (Tomas Vondra <tomas.vondra@2ndquadrant.com>) Re: Standby corruption after master is restarted (Emre Hasegeli <emre@hasegeli.com>)
List	pgsql-bugs

Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time on production.  The primary
> instance was in a loop of being killed by Linux OOM-killer and being
> restarted in 1 minute intervals.  The corruption only happened on one
> of the two standbys.  The primary and the other standby have no
> problems.  Only the primary was killed and restarted, the standbys were not.
> There weren't any unusual settings, "fsync" was not disabled.  Here is
> the information I collected.
> 

I've been trying to reproduce this by running a master with a couple of
replicas, and randomly restarting the master (while pgbench is running).
But so far no luck, so I guess something else is required to reproduce
the behavior ...

> The logs at the time standby broke:
> 
>> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL:  invalid memory alloc request size 1916035072
> 
> And from the next try:
> 
>> 2018-03-28 14:02:15 UTC [26808-5] LOG:  consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL:  invalid memory alloc request size 191603507
> 

In the initial report (from August 2017) you shared pg_xlogdump output,
showing that the corrupted WAL record is an FPI_FOR_HINT right after
CHECKPOINT_SHUTDOWN. Was it the same case this time?
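
To check this, it helps to know which WAL segment file holds the record
the standby complained about. Below is a minimal sketch of that mapping,
assuming the default 16MB WAL segment size and assuming the record at
1DFB/D43BE688 is on timeline 5 (the timeline in the log above). The
function names are made up for illustration, they are not part of any
PostgreSQL tool.

```python
# Sketch only: maps an LSN such as "1DFB/D43BE688" to the WAL segment file
# that should contain it. Assumes the default 16MB segment size; a server
# built with a different segment size needs a different value here.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn, timeline, segment_size=WAL_SEGMENT_SIZE):
    """Return the 24-character WAL file name containing the given LSN."""
    hi_hex, lo_hex = lsn.split("/")
    hi, lo = int(hi_hex, 16), int(lo_hex, 16)
    # Number of segments covered by one value of the high 32 bits of the LSN.
    segments_per_xlogid = 0x100000000 // segment_size
    segno = hi * segments_per_xlogid + lo // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


def offset_in_segment(lsn, segment_size=WAL_SEGMENT_SIZE):
    """Return the byte offset of the LSN inside its segment file."""
    lo = int(lsn.split("/")[1], 16)
    return lo % segment_size


if __name__ == "__main__":
    lsn = "1DFB/D43BE688"  # LSN from the "invalid resource manager ID" line
    print(wal_segment_name(lsn, timeline=5))
    print(hex(offset_in_segment(lsn)))
```

The resulting file name can be given to pg_xlogdump as the start segment,
on the primary and on both standbys, to compare the records around that
position.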

BTW which versions are we talking about? I see the initial report
mentioned catversion 201608131, this one mentions 201510051, so I'm
guessing 9.6 and 9.5. Which minor versions?

Is the master under load (accepting writes) before shutdown?

How was it restarted, actually? I see you're mentioning OOM killer, so I
guess "kill -9". What about the first report - was it the same case, or
was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL
archive? What archive/restore commands do you use?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
