Segfault leading to crash, recovery mode, and TOAST corruption

From
Jonathan Marks
Date:
Hello —

We had two issues today (once this morning and once a few minutes ago) with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB total disk space, 20k PIOPS) where the database suddenly crashed and went into recovery mode. The first time this happened, we restarted the server after about 5 minutes in an attempt to get the system live, and the second time we let it stay in recovery mode until it recovered (took about 10 minutes). The system was not under high load in either case.

Both times that the server crashed, we saw this in the logs:

2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  canceling statement due to statement timeout
2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was terminated by signal 11: Segmentation fault

After the first crash, we then started getting errors like:

2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656

We were able to identify 15 rows that are corrupted and the exact fields that are being TOASTED. We’re following Josh Berkus’ post here: http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html.
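
In case it helps anyone reproduce our steps, we found the bad rows with a per-row detoast loop along these lines (our_table, big_column, and id are placeholders for our real schema):

DO $$
DECLARE
  r record;
BEGIN
  -- our_table, big_column, and id stand in for the real parent table,
  -- its TOASTed column, and its primary key.
  FOR r IN SELECT id FROM our_table LOOP
    BEGIN
      -- Force detoasting of the suspect column for this row.
      PERFORM length(big_column::text) FROM our_table WHERE id = r.id;
    EXCEPTION WHEN OTHERS THEN
      RAISE NOTICE 'bad row id = %: %', r.id, SQLERRM;
    END;
  END LOOP;
END $$;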

We have tried to clear the bad fields in those rows with UPDATE, and to remove the rows entirely with DELETE, but every time we do we get an error: ERROR: tuple concurrently updated

We’re intending to reindex the TOAST table this evening, then try the deletes again, and then run pg_repack (sketched below). However, while that may clean up the TOAST corruption, we don’t believe it addresses the root cause of the crashes. We can in theory restore from one of our backups, but that would mean data loss for our clients and would not necessarily resolve the issue. We’re worried that this is a Postgres bug, perhaps related to parallel query; we’d appreciate any guidance people can give.
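
For reference, the cleanup steps we have in mind look roughly like this (our_table and the ids are placeholders for our real schema; pg_repack runs from the shell):

-- Rebuild the indexes of the TOAST table named in the error above.
REINDEX TABLE pg_toast.pg_toast_26656;

-- Then retry removing the rows we identified as corrupt.
DELETE FROM our_table WHERE id IN (1, 2, 3);  -- replace with the 15 bad ids found above

-- Finally, repack the parent table from the shell, e.g.:
--   pg_repack -d OURDB -t our_table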

Thank you!



Re: Segfault leading to crash, recovery mode, and TOAST corruption

From
Tom Lane
Date:
Jonathan Marks <jonathanaverymarks@gmail.com> writes:
> We had two issues today (once this morning and once a few minutes ago)
> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
> total disk space, 20k PIOPS) where the database suddenly crashed and
> went into recovery mode.

I'd suggest updating to 10.4 ... see below.

> Both times that the server crashed, we saw this in the logs:
> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  canceling statement due to statement timeout
> 2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
> 2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was terminated by signal 11: Segmentation fault

This looks to be a parallel leader process getting confused when a worker
process exits unexpectedly.  There were some related fixes in 10.2, which
might resolve the issue, though it's also possible we have more to do there.
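
If updating right away isn't practical, one possible stopgap (my suggestion, not part of the fix) is to keep parallel workers from being launched for the affected workload:

-- Stopgap only: disable parallel query so no parallel workers are
-- launched; this can also be set per-session, per-role, or per-database.
SET max_parallel_workers_per_gather = 0;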

> After the first crash, we then started getting errors like:
> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656

This definitely looks to be the "reuse of TOAST OIDs immediately after
crash" issue that was fixed in 10.4.  AFAIK it's recoverable corruption;
I believe you'll find that VACUUMing the parent table will make the
errors stop, and all will be well.  But an update would be prudent to
prevent it from happening again.
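
Concretely, something along these lines (the first query just maps the TOAST table back to its parent; substitute the real table name in the VACUUM):

-- Find the parent table of pg_toast_26656.
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.reltoastrelid = 'pg_toast.pg_toast_26656'::regclass;

-- Then vacuum that table ("our_table" stands in for the result above).
VACUUM our_table;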

            regards, tom lane


Re: Segfault leading to crash, recovery mode, and TOAST corruption

From
Jonathan Marks
Date:
Thank you so very much, Tom.

Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances tonight (the latest version RDS offers is 10.3, but that’s a start).


> On Jun 5, 2018, at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jonathan Marks <jonathanaverymarks@gmail.com> writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
>
> I'd suggest updating to 10.4 ... see below.
>
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was terminated by signal 11: Segmentation fault
>
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly.  There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
>
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656
>
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4.  AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well.  But an update would be prudent to
> prevent it from happening again.
>
>             regards, tom lane