Re: Segfault leading to crash, recovery mode, and TOAST corruption - Mailing list pgsql-general

From Jonathan Marks
Subject Re: Segfault leading to crash, recovery mode, and TOAST corruption
Date
Msg-id CF9FED80-6E1A-47F0-969B-B3E4757BFC2B@gmail.com
In response to Re: Segfault leading to crash, recovery mode, and TOAST corruption  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Thank you so very much, Tom.

Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances tonight (the latest version RDS offers is
10.3, but that’s a start).


> On Jun 5, 2018, at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jonathan Marks <jonathanaverymarks@gmail.com> writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
>
> I'd suggest updating to 10.4 ... see below.
>
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was terminated by signal 11: Segmentation fault
>
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly.  There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
>
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656
>
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4.  AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well.  But an update would be prudent to
> prevent it from happening again.
>
>             regards, tom lane
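
For anyone hitting the same error, here is a minimal sketch of one way to find which parent table to VACUUM. It pulls the TOAST table name out of the error text and builds a catalog query against pg_class.reltoastrelid. The function name parent_lookup_sql and the regular expression are illustrative assumptions, not something from this thread. The generated query only identifies the parent table; the VACUUM itself would still be run by hand against whatever table it returns.

```python
import re

# Matches the TOAST error text quoted in the message above.
# The pattern is an assumption based on that single log line.
TOAST_ERROR = re.compile(
    r"unexpected chunk number (?P<got>\d+) \(expected (?P<expected>\d+)\) "
    r"for toast value (?P<value>\d+) in (?P<toast_table>pg_toast_\d+)"
)


def parent_lookup_sql(log_line):
    """Return SQL that finds the table owning the TOAST table named in
    log_line, or None if the line is not a TOAST chunk error."""
    m = TOAST_ERROR.search(log_line)
    if m is None:
        return None
    # pg_class.reltoastrelid on the parent row points at its TOAST table.
    return (
        "SELECT oid::regclass AS parent_table FROM pg_class "
        f"WHERE reltoastrelid = 'pg_toast.{m['toast_table']}'::regclass;"
    )


line = ("ERROR:  unexpected chunk number 0 (expected 1) "
        "for toast value 1592283014 in pg_toast_26656")
print(parent_lookup_sql(line))
```

Running the printed query in psql should name the parent table, which can then be vacuumed as suggested above.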


