Thread: server process (PID 2964738) was terminated by signal 11: Segmentation fault
server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Stefan Froehlich
Date:
This was starting with an invalid memory alloc request size, like in <https://www.postgresql.org/message-id/2080966.1592844878%40sss.pgh.pa.us>. I followed the suggestion to trace down the faulty record, found and fixed it. Now I can access that record again, but if I try to dump the table I get: | 2022-11-06 11:52:36.367 CET [2098-35] LOG: server process (PID 2964738) was terminated by signal 11: Segmentation fault | [...] | 2022-11-06 11:53:46.229 CET [2964744-2] LOG: database system was not properly shut down; automatic recovery in progress | 2022-11-06 11:53:46.263 CET [2964744-3] LOG: redo starts at 7D/E53B6440 | 2022-11-06 11:53:46.265 CET [2964744-4] LOG: invalid record length at 7D/E5412958: wanted 24, got 0 | 2022-11-06 11:53:46.265 CET [2964744-5] LOG: redo done at 7D/E5412920 | 2022-11-06 11:53:46.749 CET [2098-39] LOG: database system is ready to accept connections Whatever was "done" does not prevent the problem from being reproducible. I tried to track down this faulty record as well, but binary proves difficult with 25*10^6 rows if the server crashes and rebuilds after each try. Is there any reasonable way to proceed from here? The system seems to be working fine (the currupted table contains textual information only), but right now I cannot even make proper backups which is not a good thing. Regards, Stefan
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Stefan Froehlich <postgresql@froehlich.priv.at> writes: > I followed the suggestion to trace down the faulty record, found and > fixed it. Now I can access that record again, but if I try to dump > the table I get: > | 2022-11-06 11:52:36.367 CET [2098-35] LOG: server process (PID 2964738) was terminated by signal 11: Segmentation fault > Whatever was "done" does not prevent the problem from being > reproducible. I tried to track down this faulty record as well, but > binary proves difficult with 25*10^6 rows if the server crashes and > rebuilds after each try. > Is there any reasonable way to proceed from here? contrib/amcheck might help to identify the faulty data (at this point there's reason to fear multiple corruptions ...). If you're running v14 or v15 there's a frontend for that called pg_amcheck. regards, tom lane
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Stefan Froehlich
Date:
On Sun, Nov 06, 2022 at 09:13:08AM -0500, Tom Lane wrote: > > | 2022-11-06 11:52:36.367 CET [2098-35] LOG: server process (PID 2964738) was terminated by signal 11: Segmentationfault > contrib/amcheck might help to identify the faulty data (at this > point there's reason to fear multiple corruptions ...). If you're > running v14 or v15 there's a frontend for that called pg_amcheck. I am using v13, but well: | # create extension amcheck; | # select oid, relname from pg_class where relname ='faultytablename_pkey'; | [returns oid 537203] | # select bt_index_check(537203, true); | server closed the connection unexpectedly | This probably means the server terminated abnormally | before or while processing the request. | The connection to the server was lost. Attempting reset: Failed. This seems to be quite fucked up ("how" would be a good question, too, but for the moment priority is to work around the problem). If I have to delete 1k records in this table, I'll do and survive, but first I have to find out which ones. Bye, Stefan
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Stefan Froehlich <postgresql@froehlich.priv.at> writes: > I am using v13, but well: > | # create extension amcheck; > | # select oid, relname from pg_class where relname ='faultytablename_pkey'; > | [returns oid 537203] > | # select bt_index_check(537203, true); > | server closed the connection unexpectedly Oh ... up through v13, amcheck lacks any functions to check heaps, only indexes, so I guess it won't help you much. You could try reindexing, but if pg_dump is failing that's a good sign you have heap corruption not (or at least not only) index corruption. Another idea is to try using contrib/pageinspect to examine each page of the table. Its output is just gobbledegook to most people, but there's a good chance it'd fail visibly on the corrupted page(s). Also, please read https://wiki.postgresql.org/wiki/Corruption if you didn't already. regards, tom lane
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Stefan Froehlich
Date:
On Sun, Nov 06, 2022 at 09:48:32AM -0500, Tom Lane wrote: > Stefan Froehlich <postgresql@froehlich.priv.at> writes: > > | # create extension amcheck; > > | # select oid, relname from pg_class where relname ='faultytablename_pkey'; > > | [returns oid 537203] > > | # select bt_index_check(537203, true); > > | server closed the connection unexpectedly > Another idea is to try using contrib/pageinspect to examine each > page of the table. Its output is just gobbledegook to most > people, but there's a good chance it'd fail visibly on the > corrupted page(s). Fortunately I was able to identify a window of 100 records (out of 25 mio.) containing all the errors. After deleting and re-inserting those records everything seems to be ok (at least, pg_dump and "reindex database" work without errors). I suspect a bad RAM module to be the root of the problems. We'll see. Side question: If it is possible to simply delete and create such records is it necessary that the server *core* *dumps*? There could be a switch adding additional safety (at the cost of performance) which would make troubleshooting not only much faster but non-invasive for the other databases on the same server as well. Bye, Stefan
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Laurenz Albe
Date:
On Mon, 2022-11-07 at 11:17 +0100, Stefan Froehlich wrote: > On Sun, Nov 06, 2022 at 09:48:32AM -0500, Tom Lane wrote: > > Stefan Froehlich <postgresql@froehlich.priv.at> writes: > > > > # create extension amcheck; > > > > # select oid, relname from pg_class where relname ='faultytablename_pkey'; > > > > [returns oid 537203] > > > > # select bt_index_check(537203, true); > > > > server closed the connection unexpectedly > > > Another idea is to try using contrib/pageinspect to examine each > > page of the table. Its output is just gobbledegook to most > > people, but there's a good chance it'd fail visibly on the > > corrupted page(s). > > Fortunately I was able to identify a window of 100 records (out of > 25 mio.) containing all the errors. After deleting and re-inserting > those records everything seems to be ok (at least, pg_dump and > "reindex database" work without errors). Don't continue to work with that cluster even if everything seems OK now. "pg_dumpall" and restore to a new cluster on good hardware. > I suspect a bad RAM module to be the root of the problems. We'll > see. > > Side question: If it is possible to simply delete and create such > records is it necessary that the server *core* *dumps*? There could > be a switch adding additional safety (at the cost of performance) > which would make troubleshooting not only much faster but > non-invasive for the other databases on the same server as well. Crashing is never nice. On the other hand, adding checks and error messages for conditions that are always true in a correct block cost performance. I can't tell about your specific case, but a build of PostgreSQL --enable-cassert has assitional checks in the code. That will still crash, but the log will show what condition was violated. Yours, Laurenz Albe
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Mladen Gogala
Date:
On 11/7/22 06:19, Laurenz Albe wrote:
Don't continue to work with that cluster even if everything seems OK now. "pg_dumpall" and restore to a new cluster on good hardware.
Why would that be necessary if the original machine works well now?
-- Mladen Gogala Database Consultant Tel: (347) 321-1217 https://dbwhisperer.wordpress.com
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Stefan Froehlich
Date:
On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote: > On 11/7/22 06:19, Laurenz Albe wrote: > >Don't continue to work with that cluster even if everything seems OK now. > >"pg_dumpall" and restore to a new cluster on good hardware. > Why would that be necessary if the original machine works well now? I can understand the idea not to trust hardware anymore once a (not clearly identified) problem occured. In this case new hardware would - for reasons beyond the scope of this list - not be any more or less trustworthy than the existing one and thus (IMO) not make any difference. Bye, Stefan
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Stefan Froehlich <postgresql@froehlich.priv.at> writes: > On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote: >> On 11/7/22 06:19, Laurenz Albe wrote: >>> Don't continue to work with that cluster even if everything seems OK now. >>> "pg_dumpall" and restore to a new cluster on good hardware. >> Why would that be necessary if the original machine works well now? > I can understand the idea not to trust hardware anymore once a (not > clearly identified) problem occured. > In this case new hardware would - for reasons beyond the scope of > this list - not be any more or less trustworthy than the existing > one and thus (IMO) not make any difference. Whether you want to continue to trust the hardware or not is your call. It'd still be recommendable to pg_dumpall and restore into a freshly-initdb'd cluster, because otherwise you can't be real sure that you identified and cleared all the data corruption. regards, tom lane
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Stefan Froehlich
Date:
On Mon, Nov 07, 2022 at 09:02:26AM -0500, Tom Lane wrote: > Stefan Froehlich <postgresql@froehlich.priv.at> writes: > > On Mon, Nov 07, 2022 at 08:17:10AM -0500, Mladen Gogala wrote: > >> On 11/7/22 06:19, Laurenz Albe wrote: > >>> Don't continue to work with that cluster even if everything seems OK now. > >>> "pg_dumpall" and restore to a new cluster on good hardware. > > >> Why would that be necessary if the original machine works well now? > > > I can understand the idea not to trust hardware anymore once a (not > > clearly identified) problem occured. > > > In this case new hardware would - for reasons beyond the scope of > > this list - not be any more or less trustworthy than the existing > > one and thus (IMO) not make any difference. > > Whether you want to continue to trust the hardware or not is your > call. It'd still be recommendable to pg_dumpall and restore into > a freshly-initdb'd cluster, because otherwise you can't be real > sure that you identified and cleared all the data corruption. Thanks, yes. This is in fact on my schedule for the next weekend as it implies a downtime of serveral hours. Bye, Stefan
On 11/7/22 08:02, Tom Lane wrote:
[snip]
Why just pg_dumpall instead of "pg_dumpall --globals-only" and parallel (for speed) dumps of databases?
[snip]
call. It'd still be recommendable to pg_dumpall and restore intoa freshly-initdb'd cluster, because otherwise you can't be real sure that you identified and cleared all the data corruption.
Why just pg_dumpall instead of "pg_dumpall --globals-only" and parallel (for speed) dumps of databases?
--
Angular momentum makes the world go 'round.
Angular momentum makes the world go 'round.
Re: server process (PID 2964738) was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Ron <ronljohnsonjr@gmail.com> writes: > On 11/7/22 08:02, Tom Lane wrote: >> call. It'd still be recommendable to pg_dumpall and restore into >> a freshly-initdb'd cluster, because otherwise you can't be real >> sure that you identified and cleared all the data corruption. > Why *just* pg_dumpall instead of "pg_dumpall --globals-only" and parallel > (for speed) dumps of databases? [ shrug... ] Sure, if you like to make your life complicated. regards, tom lane
On 11/7/22 09:43, Tom Lane wrote: > Ron <ronljohnsonjr@gmail.com> writes: >> On 11/7/22 08:02, Tom Lane wrote: >>> call. It'd still be recommendable to pg_dumpall and restore into >>> a freshly-initdb'd cluster, because otherwise you can't be real >>> sure that you identified and cleared all the data corruption. >> Why *just* pg_dumpall instead of "pg_dumpall --globals-only" and parallel >> (for speed) dumps of databases? > [ shrug... ] Sure, if you like to make your life complicated. I don't remember if OP specified the size of his cluster, but Size Matters. Dumping a multi-TB database to an SQL text file, and then restoring it will be a whole lot slower than doing multi-threaded dump/restore. -- Angular momentum makes the world go 'round.