Thread: server process (PID 2964738) was terminated by signal 11: Segmentation fault

server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Stefan Froehlich
Date:
This was starting with an invalid memory alloc request size, like in
<https://www.postgresql.org/message-id/2080966.1592844878%40sss.pgh.pa.us>.

I followed the suggestion to trace down the faulty record, found and
fixed it. Now I can access that record again, but if I try to dump
the table I get:

| 2022-11-06 11:52:36.367 CET [2098-35] LOG:  server process (PID 2964738) was terminated by signal 11: Segmentation fault
| [...]
| 2022-11-06 11:53:46.229 CET [2964744-2] LOG:  database system was not properly shut down; automatic recovery in progress
| 2022-11-06 11:53:46.263 CET [2964744-3] LOG:  redo starts at 7D/E53B6440
| 2022-11-06 11:53:46.265 CET [2964744-4] LOG:  invalid record length at 7D/E5412958: wanted 24, got 0
| 2022-11-06 11:53:46.265 CET [2964744-5] LOG:  redo done at 7D/E5412920
| 2022-11-06 11:53:46.749 CET [2098-39] LOG:  database system is ready to accept connections

Whatever was "done" does not prevent the problem from being
reproducible. I tried to track down this faulty record as well, but
binary search proves difficult with 25*10^6 rows if the server crashes
and goes through recovery after each try.

Is there any reasonable way to proceed from here?

The system seems to be working fine (the corrupted table contains
textual information only), but right now I cannot even make proper
backups, which is not a good thing.
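For what it's worth, the bisection can be scripted rather than done by hand. A minimal sketch, assuming the table has an integer primary key `id` running from 1 to 25 million; in real use `check()` would run a `COPY (SELECT ...)` through psql and test its exit status, which is simulated here with a known bad id so the narrowing logic is visible:

```shell
#!/bin/sh
# Bisect the id range to find the first row whose retrieval crashes
# the server.  In real use check() would be something like:
#   psql -qc "COPY (SELECT * FROM faultytable
#             WHERE id BETWEEN $1 AND $2) TO '/dev/null'" mydb
# and test psql's exit status.  A known bad id simulates that here.
BAD_ID=13500000                     # hypothetical corrupted row
check() {                           # returns 0 if range $1..$2 is clean
  [ "$BAD_ID" -lt "$1" ] || [ "$BAD_ID" -gt "$2" ]
}
lo=1; hi=25000000
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  if check "$lo" "$mid"; then
    lo=$(( mid + 1 ))               # lower half clean: bad row is above
  else
    hi=$mid                         # lower half fails: bad row is in it
  fi
done
echo "first bad id: $lo"
```

Each probe halves the range, so only about log2(25*10^6) ≈ 25 probes (and thus at most ~25 crash/recovery cycles) are needed instead of millions.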

Regards,

Stefan



Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> I followed the suggestion to trace down the faulty record, found and
> fixed it. Now I can access that record again, but if I try to dump
> the table I get:
> | 2022-11-06 11:52:36.367 CET [2098-35] LOG:  server process (PID 2964738) was terminated by signal 11: Segmentation fault

> Whatever was "done" does not prevent the problem from being
> reproducible. I tried to track down this faulty record as well, but
> binary search proves difficult with 25*10^6 rows if the server crashes
> and goes through recovery after each try.
> Is there any reasonable way to proceed from here?

contrib/amcheck might help to identify the faulty data (at this
point there's reason to fear multiple corruptions ...).  If
you're running v14 or v15 there's a frontend for that called
pg_amcheck.
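On those versions the whole database can be checked from the shell; a sketch (the database and table names are placeholders):

```shell
# Check every relation in the database; --heapallindexed additionally
# verifies that each heap tuple is also present in its btree indexes.
pg_amcheck --heapallindexed mydb

# Or narrow the run to the suspect table and its indexes:
pg_amcheck --heapallindexed --table=faultytablename mydb
```

Both forms need a running server to connect to, so this is diagnostic only, not a repair tool.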

            regards, tom lane



Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Stefan Froehlich
Date:
On Sun, Nov 06, 2022 at 09:13:08AM -0500, Tom Lane wrote:
> > | 2022-11-06 11:52:36.367 CET [2098-35] LOG:  server process (PID 2964738) was terminated by signal 11: Segmentation fault

> contrib/amcheck might help to identify the faulty data (at this
> point there's reason to fear multiple corruptions ...).  If you're
> running v14 or v15 there's a frontend for that called pg_amcheck.

I am using v13, but well:

| # create extension amcheck;
| # select oid, relname from pg_class where relname ='faultytablename_pkey';
| [returns oid 537203]
| # select bt_index_check(537203, true);
| server closed the connection unexpectedly
|         This probably means the server terminated abnormally
|         before or while processing the request.
| The connection to the server was lost. Attempting reset: Failed.

This seems to be quite fucked up ("how" would be a good question,
too, but for the moment the priority is to work around the problem).

If I have to delete 1k records in this table, I'll do so and survive,
but first I have to find out which ones.

Bye,
Stefan



Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> I am using v13, but well:

> | # create extension amcheck;
> | # select oid, relname from pg_class where relname ='faultytablename_pkey';
> | [returns oid 537203]
> | # select bt_index_check(537203, true);
> | server closed the connection unexpectedly

Oh ... up through v13, amcheck lacks any functions to check heaps,
only indexes, so I guess it won't help you much.  You could try
reindexing, but if pg_dump is failing that's a good sign you have
heap corruption not (or at least not only) index corruption.

Another idea is to try using contrib/pageinspect to examine each
page of the table.  Its output is just gobbledegook to most people,
but there's a good chance it'd fail visibly on the corrupted page(s).
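A sketch of that page-by-page scan from the shell (database and table names are placeholders); whichever block makes the query fail localizes the corruption:

```shell
#!/bin/sh
DB=mydb
TABLE=faultytablename
# Number of pages in the table (relpages is an estimate, refreshed by
# VACUUM/ANALYZE, but good enough as a loop bound here).
NBLOCKS=$(psql -Atc "SELECT relpages FROM pg_class
                     WHERE relname = '$TABLE'" "$DB")
blk=0
while [ "$blk" -lt "$NBLOCKS" ]; do
  psql -qc "SELECT * FROM heap_page_items(get_raw_page('$TABLE', $blk))" \
       "$DB" > /dev/null 2>&1 || echo "block $blk looks corrupt"
  blk=$(( blk + 1 ))
done
```

If a probe actually segfaults the backend, the server goes through recovery and the following probes will fail too, so the loop would have to be restarted from the last good block.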

Also, please read
https://wiki.postgresql.org/wiki/Corruption
if you didn't already.

            regards, tom lane



Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Stefan Froehlich
Date:
On Sun, Nov 06, 2022 at 09:48:32AM -0500, Tom Lane wrote:
> Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> > | # create extension amcheck;
> > | # select oid, relname from pg_class where relname ='faultytablename_pkey';
> > | [returns oid 537203]
> > | # select bt_index_check(537203, true);
> > | server closed the connection unexpectedly

> Another idea is to try using contrib/pageinspect to examine each
> page of the table.  Its output is just gobbledegook to most
> people, but there's a good chance it'd fail visibly on the
> corrupted page(s).

Fortunately I was able to identify a window of 100 records (out of
25 million) containing all the errors. After deleting and re-inserting
those records everything seems to be ok (at least, pg_dump and
"reindex database" work without errors).

I suspect a bad RAM module is the root of the problem. We'll
see.

Side question: if it is possible to simply delete and re-create such
records, is it necessary for the server to *core* *dump*? There could
be a switch adding additional safety (at the cost of performance),
which would make troubleshooting not only much faster but also
non-invasive for the other databases on the same server.

Bye,
Stefan



Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Laurenz Albe
Date:
On Mon, 2022-11-07 at 11:17 +0100, Stefan Froehlich wrote:
> On Sun, Nov 06, 2022 at 09:48:32AM -0500, Tom Lane wrote:
> > Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> > > > # create extension amcheck;
> > > > # select oid, relname from pg_class where relname ='faultytablename_pkey';
> > > > [returns oid 537203]
> > > > # select bt_index_check(537203, true);
> > > > server closed the connection unexpectedly
> 
> > Another idea is to try using contrib/pageinspect to examine each
> > page of the table.  Its output is just gobbledegook to most
> > people, but there's a good chance it'd fail visibly on the
> > corrupted page(s).
> 
> Fortunately I was able to identify a window of 100 records (out of
> 25 mio.) containing all the errors. After deleting and re-inserting
> those records everything seems to be ok (at least, pg_dump and
> "reindex database" work without errors).

Don't continue to work with that cluster even if everything seems OK now.
"pg_dumpall" and restore to a new cluster on good hardware.

> I suspect a bad RAM module to be the root of the problems. We'll
> see.
> 
> Side question: If it is possible to simply delete and create such
> records is it necessary that the server *core* *dumps*? There could
> be a switch adding additional safety (at the cost of performance)
> which would make troubleshooting not only much faster but
> non-invasive for the other databases on the same server as well.

Crashing is never nice.  On the other hand, adding checks and error
messages for conditions that are always true in a correct block costs
performance.  I can't tell about your specific case, but a build of
PostgreSQL configured with --enable-cassert has additional checks in
the code.  It will still crash, but the log will show which condition
was violated.
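For reference, such a build can be made from source; a sketch (the install prefix is a placeholder, chosen so it does not overwrite the production binaries):

```shell
# From an unpacked PostgreSQL source tree: assertion-enabled build
# with debug symbols, installed beside the production copy.
./configure --enable-cassert --enable-debug --prefix=/opt/pgsql-assert
make -j"$(nproc)"
make install
```

Assertion builds are noticeably slower, so this is for a throwaway diagnostic instance, not for production use.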

Yours,
Laurenz Albe



Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Mladen Gogala
Date:
On 11/7/22 06:19, Laurenz Albe wrote:
> Don't continue to work with that cluster even if everything seems OK now.
> "pg_dumpall" and restore to a new cluster on good hardware.

Why would that be necessary if the original machine works well now?

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Stefan Froehlich
Date:
On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote:
> On 11/7/22 06:19, Laurenz Albe wrote:
> >Don't continue to work with that cluster even if everything seems OK now.
> >"pg_dumpall" and restore to a new cluster on good hardware.
 
> Why would that be necessary if the original machine works well now?

I can understand the idea not to trust hardware anymore once a (not
clearly identified) problem occurred.

In this case new hardware would - for reasons beyond the scope of
this list - not be any more or less trustworthy than the existing
one and thus (IMO) not make any difference.

Bye,
Stefan



Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote:
>> On 11/7/22 06:19, Laurenz Albe wrote:
>>> Don't continue to work with that cluster even if everything seems OK now.
>>> "pg_dumpall" and restore to a new cluster on good hardware.
 
>> Why would that be necessary if the original machine works well now?

> I can understand the idea not to trust hardware anymore once a (not
> clearly identified) problem occurred.

> In this case new hardware would - for reasons beyond the scope of
> this list - not be any more or less trustworthy than the existing
> one and thus (IMO) not make any difference.

Whether you want to continue to trust the hardware or not is your
call.  It'd still be recommendable to pg_dumpall and restore into
a freshly-initdb'd cluster, because otherwise you can't be real
sure that you identified and cleared all the data corruption.
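In outline (ports and paths are placeholders; the new cluster here runs temporarily on port 5433 beside the old one):

```shell
# Create and start a fresh cluster next to the suspect one.
initdb -D /var/lib/postgresql/13/fresh
pg_ctl -D /var/lib/postgresql/13/fresh -o '-p 5433' -l fresh.log start

# Pipe a full logical dump of the old cluster into the new one;
# pg_dumpall includes roles and tablespaces as well as all databases.
pg_dumpall -p 5432 | psql -p 5433 -d postgres
```

Once the restore is verified, the new cluster can be switched to the production port and the old data directory retired.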

            regards, tom lane



Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault

From
Stefan Froehlich
Date:
On Mon, Nov 07, 2022 at 09:02:26AM -0500, Tom Lane wrote:
> Stefan Froehlich <postgresql@froehlich.priv.at> writes:
> > On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote:
> >> On 11/7/22 06:19, Laurenz Albe wrote:
> >>> Don't continue to work with that cluster even if everything seems OK now.
> >>> "pg_dumpall" and restore to a new cluster on good hardware.
>  
> >> Why would that be necessary if the original machine works well now?
> 
> > I can understand the idea not to trust hardware anymore once a (not
> > clearly identified) problem occurred.
> 
> > In this case new hardware would - for reasons beyond the scope of
> > this list - not be any more or less trustworthy than the existing
> > one and thus (IMO) not make any difference.
> 
> Whether you want to continue to trust the hardware or not is your
> call.  It'd still be recommendable to pg_dumpall and restore into
> a freshly-initdb'd cluster, because otherwise you can't be real
> sure that you identified and cleared all the data corruption.

Thanks, yes. This is in fact on my schedule for next weekend, as
it implies a downtime of several hours.

Bye,
Stefan



On 11/7/22 08:02, Tom Lane wrote:
[snip]
> call. It'd still be recommendable to pg_dumpall and restore into
> a freshly-initdb'd cluster, because otherwise you can't be real
> sure that you identified and cleared all the data corruption.

Why just pg_dumpall instead of "pg_dumpall --globals-only" and parallel (for speed) dumps of databases?

--
Angular momentum makes the world go 'round.



Ron <ronljohnsonjr@gmail.com> writes:
> On 11/7/22 08:02, Tom Lane wrote:
>> call. It'd still be recommendable to pg_dumpall and restore into
>> a freshly-initdb'd cluster, because otherwise you can't be real
>> sure that you identified and cleared all the data corruption.

> Why *just* pg_dumpall instead of "pg_dumpall --globals-only" and parallel 
> (for speed) dumps of databases?

[ shrug... ] Sure, if you like to make your life complicated.

            regards, tom lane



On 11/7/22 09:43, Tom Lane wrote:
> Ron <ronljohnsonjr@gmail.com> writes:
>> On 11/7/22 08:02, Tom Lane wrote:
>>> call. It'd still be recommendable to pg_dumpall and restore into
>>> a freshly-initdb'd cluster, because otherwise you can't be real
>>> sure that you identified and cleared all the data corruption.
>> Why *just* pg_dumpall instead of "pg_dumpall --globals-only" and parallel
>> (for speed) dumps of databases?
> [ shrug... ] Sure, if you like to make your life complicated.

I don't remember if OP specified the size of his cluster, but Size Matters.  
Dumping a multi-TB database to an SQL text file, and then restoring it will 
be a whole lot slower than doing multi-threaded dump/restore.
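A sketch of that faster route (database name, paths and the new cluster's port are placeholders):

```shell
# Roles, tablespaces etc. still come from pg_dumpall:
pg_dumpall --globals-only > globals.sql

# Directory-format dump of the big database with 8 parallel workers:
pg_dump -Fd -j 8 -f /backup/mydb.dir mydb

# On the freshly initdb'd cluster (port 5433 here):
psql -p 5433 -f globals.sql postgres
createdb -p 5433 mydb
pg_restore -p 5433 -j 8 -d mydb /backup/mydb.dir
```

Parallel dump requires the directory (-Fd) format; plain SQL dumps are single-threaded on both ends.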

-- 
Angular momentum makes the world go 'round.