Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves... - Mailing list pgsql-general

From Sean Chittenden
Subject Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date
Msg-id EFE82544-2AA5-4899-9FA3-F092D26BF3D4@chittenden.org
Whole thread Raw
In response to Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
>> While the master was still being setup and data was being loaded, the
>> system ran out of open file descriptors and Pg crashed a few times
>> until this was realized and corrected.
>
> Could you provide a bit more detail than that?  Usually the behavior we
> see under out-of-FDs conditions is that everything else on the system
> starts to fall over badly, but PG keeps soldiering along to some extent.
> I would not expect data corruption to result, anyway.

Yup, I ran out of time earlier and wasn't able to dig them out, but I just recovered them. It looks like there are two
errormodes that I ran in to as a result of running out of available FDs. 

> If you've got the postmaster logs from this episode, it would be useful
> to see what complaints got logged.


These failures are against 9.2.1, btw.

The first crash scenario:

Oct  5 15:00:24 db01 postgres[75852]: [6449-2] javafail@dbcluster 75852 0: STATEMENT:  SELECT /* query */ FROM tbl AS
this_WHERE this_.user_id=$1 
Oct  5 15:00:24 db01 postgres[75852]: [6456-1] javafail@dbcluster 75852 0: ERROR:  could not seek to end of file
"base/16387/20013":Too many open files in system 
[snip - lots of could not seek to end of file errors. How does seek(2) consume a file descriptor??? ]
Oct  5 15:00:24 db01 postgres[75852]: [6480-1] javafail@dbcluster 75852 0: ERROR:  could not open file
"base/16387/20275":Too many open files in system 
Oct  5 15:00:24 db01 postgres[75852]: [6480-2] javafail@dbcluster 75852 0: STATEMENT:  SELECT /* some other query */
FROMtbl2 AS this_ WHERE this_.user_id=$1 limit $2 
Oct  5 15:00:25 db01 postgres[75852]: [6481-1] javafail@dbcluster 75852 0: ERROR:  current transaction is aborted,
commandsignored until end of transaction block 
Oct  5 15:00:25 db01 postgres[75852]: [6481-2] javafail@dbcluster 75852 0: STATEMENT:  SELECT 1
Oct  5 15:00:25 db01 postgres[76648]: [5944-1] javafail@dbcluster 76648 0: FATAL:  pipe() failed: Too many open files
insystem 
Oct  5 15:00:25 db01 postgres[76649]: [5944-1] javafail@dbcluster 76649 0: FATAL:  pipe() failed: Too many open files
insystem 
[ snip - this makes sense, pipe(2)'s able to return ENFILE ]
Oct  5 15:00:25 db01 postgres[76407]: [5894-1] javafail@dbcluster 76407 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76312]: [6295-1] javafail@dbcluster 76312 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76331]: [8612-1] javafail@dbcluster 76331 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76407]: [5894-2] javafail@dbcluster 76407 0: DETAIL:  The postmaster has commanded this
serverprocess to roll back the current transaction and exit, because another server process exited abnormally and
possiblycorrupted shared memory. 
Oct  5 15:00:25 db01 postgres[76312]: [6295-2] javafail@dbcluster 76312 0: DETAIL:  The postmaster has commanded this
serverprocess to roll back the current transaction and exit, because another server process exited abnormally and
possiblycorrupted shared memory. 
Oct  5 15:00:25 db01 postgres[76315]: [7317-1] javafail@dbcluster 76315 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76407]: [5894-3] javafail@dbcluster 76407 0: HINT:  In a moment you should be able to
reconnectto the database and repeat your command. 
Oct  5 15:00:25 db01 postgres[76302]: [10249-1] javafail@dbcluster 76302 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76331]: [8612-2] javafail@dbcluster 76331 0: DETAIL:  The postmaster has commanded this
serverprocess to roll back the current transaction and exit, because another server process exited abnormally and
possiblycorrupted shared memory. 
Oct  5 15:00:25 db01 postgres[76286]: [7471-1] javafail@dbcluster 76286 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76307]: [7628-1] javafail@dbcluster 76307 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76285]: [7034-1] javafail@dbcluster 76285 0: WARNING:  terminating connection because of
crashof another server process 
Oct  5 15:00:25 db01 postgres[76331]: [8612-3] javafail@dbcluster 76331 0: HINT:  In a moment you should be able to
reconnectto the database and repeat your command. 
[snip]
Oct  5 15:00:28 db01 postgres[76875]: [5949-1] javafail@dbcluster 76875 0: FATAL:  the database system is in recovery
mode
[snip]
Oct  5 15:01:28 db01 postgres[76888]: [5951-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too many open files in system 
Oct  5 15:01:28 db01 postgres[76888]: [5953-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too many open files in system 
Oct  5 15:01:28 db01 postgres[76888]: [5955-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too many open files in system 
Oct  5 15:01:28 db01 postgres[76888]: [5957-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too many open files in system 
Oct  5 15:01:28 db01 postgres[76888]: [5959-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too many open files in system 
[snip]
Oct  5 15:01:31 db01 postgres[76888]: [6503-1] @ 76888 0: WARNING:  could not open statistics file
"pg_stat_tmp/pgstat.stat":Too m 
any open files in system
Oct  5 15:01:31 db01 postgres[77571]: [5978-1] @ 77571 0: FATAL:  pipe() failed: Too many open files in system
Oct  5 15:01:31 db01 postgres[77572]: [5979-1] javafail@dbcluster 77572 0: FATAL:  pipe() failed: Too many open files
insystem 
Oct  5 15:01:31 db01 postgres[77572]: [5980-1] javafail@dbcluster 77572 0: WARNING:  terminating connection because of
crashof anothe 
r server process
Oct  5 15:01:31 db01 postgres[77572]: [5980-2] javafail@dbcluster 77572 0: DETAIL:  The postmaster has commanded this
serverprocess t 
o roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory. 
Oct  5 15:01:31 db01 postgres[77572]: [5980-3] javafail@dbcluster 77572 0: HINT:  In a moment you should be able to
reconnectto the d 
atabase and repeat your command.
Oct  5 15:01:31 db01 postgres[77544]: [7490-1] javafail@dbcluster 77544 0: WARNING:  terminating connection because of
crashof anothe 
r server process
[snip]
Oct  5 15:01:34 db01 postgres[77811]: [5983-1] javafail@dbcluster 77811 0: FATAL:  the database system is in recovery
mode
Oct  5 15:02:20 db01 postgres[78473]: [5987-1] javafail@dbcluster 78473 0: FATAL:  pipe() failed: Too many open files
insystem 
Oct  5 15:02:20 db01 postgres[78466]: [5989-1] javafail@dbcluster 78466 0: WARNING:  terminating connection because of
crashof another server process 


What's odd to me is not the failure scenarios that come from a system running out of FDs (though seek(2)'ing consuming
anFD seems odd), it's more that it's still possible for a master DB's VACUUM to clean up a bogus or partial page write,
andhave the slave crash when the WAL entry is shipped over. 

Are there any other bits of info that you want me to hold on to, extract, or things you want me to test? -sc


--
Sean Chittenden
sean@chittenden.org



pgsql-general by date:

Previous
From: Jan Nielsen
Date:
Subject: Re: strange hot_standby behaviour
Next
From: Edson Richter
Date:
Subject: Help with query timeout