Thread: BUG #15438: Standby corruption after "Too many open files in system" error
BUG #15438: Standby corruption after "Too many open files in system" error
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15438
Logged by:          Juanjo Santamaría
Email address:      juanjo.santamaria@gmail.com
PostgreSQL version: 10.4
Operating system:   Redhat 7.3
Description:

After running for some time under poor conditions (the open file limit was too low for the postgres user), the streaming replica ended up in a corrupted state. The primary kept working the whole time, and recreating the standby replica after raising the user limits seems to have solved the issue.

The log is quite verbose, so I have tried to summarize at least one occurrence of every error:

Oct 15 10:47:54 postgres[9507]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:48:38 postgres[9507]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:48:38 postgres[40098]: [15-1] LOG: record with incorrect prev-link 6/A1A993C8 at 6/BEA99400
Oct 15 10:48:38 postgres[9550]: [9-1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
Oct 15 10:48:38 postgres[9550]: [9-2] #011#011This probably means the server terminated abnormally
Oct 15 10:48:38 postgres[9550]: [9-3] #011#011before or while processing the request.
Oct 15 10:48:53 postgres[9586]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:49:32 postgres[9586]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:49:32 postgres[40098]: [16-1] LOG: invalid record length at 6/BEABD0F8: wanted 24, got 0
Oct 15 10:50:31 postgres[9765]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:50:39 postgres[9765]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:50:39 postgres[40098]: [19-1] LOG: invalid resource manager ID 101 at 6/BEAE3708
Oct 15 10:51:01 postgres[9767]: [9-1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
Oct 15 10:51:01 postgres[9767]: [9-2] #011#011This probably means the server terminated abnormally
Oct 15 10:51:01 postgres[9767]: [9-3] #011#011before or while processing the request.
Oct 15 10:51:06 postgres[9783]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:51:39 postgres[9783]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:51:39 postgres[40098]: [20-1] LOG: record with incorrect prev-link 3E/746F6F72 at 6/BEC3D5E8
Oct 15 11:03:39 postgres[10850]: [9-1] FATAL: could not connect to the primary server: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 15 12:00:10 postgres[40101]: [3732-1] LOG: restartpoint complete: wrote 61 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.159 s, sync=0.018 s, total=6.223 s; sync files=16, longest=0.006 s, average=0.001 s; distance=451 kB, estimate=2215 kB
Oct 15 12:01:40 postgres[15182]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:01:40 postgres[40098]: [80-1] LOG: invalid resource manager ID 64 at 6/BEE34F00
Oct 15 12:01:42 postgres[15711]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 12:05:38 postgres[15711]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:05:38 postgres[40098]: [81-1] LOG: unexpected pageaddr 6/A1E6E000 in log segment 0000000300000006000000BE, offset 15130624
Oct 16 01:34:15 postgres[18890]: [9-1] LOG: started streaming WAL from primary at 6/C3000000 on timeline 3
Oct 16 01:34:15 postgres[18890]: [10-1] FATAL: could not receive data from WAL stream: ERROR: could not open file "pg_wal/0000000300000006000000C3": Too many open files in system
Oct 16 17:58:44 postgres[38030]: [9-1] FATAL: the database system is starting up
Oct 16 17:58:47 postgres[38029]: [9-1] LOG: entering standby mode
Oct 16 17:58:47 postgres[38029]: [10-1] LOG: redo starts at 6/CF9B3598
Oct 16 17:58:47 postgres[38029]: [11-1] LOG: consistent recovery state reached at 6/CF9C53B8
Oct 16 17:58:47 postgres[38026]: [8-1] LOG: database system is ready to accept read only connections
Oct 16 17:58:48 postgres[38029]: [12-1] FATAL: invalid memory alloc request size 1715044352
Oct 16 17:58:48 postgres[38026]: [9-1] LOG: startup process (PID 38029) exited with exit code 1
Oct 16 17:58:48 postgres[38026]: [10-1] LOG: terminating any other active server processes
Oct 16 17:58:48 postgres[38026]: [11-1] LOG: database system is shut down
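A note on reading these messages: on Linux, "Too many open files in system" is the strerror() text for ENFILE, i.e. the kernel-wide file table (fs.file-max) is full, while exhausting only the per-process limit (ulimit -n / RLIMIT_NOFILE) produces EMFILE, "Too many open files". The tiny sketch below is just an illustration of that mapping, not part of the report:

/* Minimal illustration (not from the report): print the two errno strings
 * so the per-process and system-wide variants are easy to tell apart. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("EMFILE: %s\n", strerror(EMFILE));   /* "Too many open files" */
    printf("ENFILE: %s\n", strerror(ENFILE));   /* "Too many open files in system" */
    return 0;
}

So the wording in the log points at the machine-wide table being exhausted, which can be worth checking (fs.file-max) in addition to the per-user limits.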
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Alvaro Herrera
Date:
On 2018-Oct-18, PG Bug reporting form wrote:

> The primary was working the whole time and recreating the standby replica,
> after configuring the user limits, seems to solve the issue.

Hmm, we recently fixed a bug where file descriptors were being leaked, after 10.5 was tagged (commit e00f4b68dc87 [1]), but that was in logical decoding, and the symptoms seem completely different anyway. I'm not sure it would affect you. The epoll_create stuff in latch.c is pretty young, though.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Andres Freund
Date:
Hi,

On 2018-10-22 14:45:04 -0300, Alvaro Herrera wrote:
> On 2018-Oct-18, PG Bug reporting form wrote:
> > The primary was working the whole time and recreating the standby replica,
> > after configuring the user limits, seems to solve the issue.

When you say corruption, do you mean that the data was corrupted afterwards? Or just that the standby spewed errors until you restarted with adjusted limits?

> Hmm, we recently fixed a bug where file descriptors were being leaked,
> after 10.5 was tagged (commit e00f4b68dc87 [1]), but that was in logical
> decoding, and the symptoms seem completely different anyway. I'm not
> sure it would affect you. The epoll_create stuff in latch.c is pretty
> young, though.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
> [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

I'd assume the fact that this hit epoll_create is mostly owing to the fact that the file descriptor used for epoll isn't going through fd.c. That means we don't first release an fd if we're already at our limit. That's supposed to be catered for by a reserve of fds, but that reserve isn't that large. There have previously been discussions about whether we can make fd.c handle fds that need to be created outside of fd.c's remit (by making sure there's space, and then tracking them similarly to OpenTransientFile()).

Greetings,

Andres Freund
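To make that mechanism concrete, here is a minimal sketch of the "make sure there's space, then retry" idea for descriptors created outside fd.c. It is not PostgreSQL source; close_lru_cached_file() is a hypothetical stand-in for fd.c's ability to close a least-recently-used cached file:

/*
 * Sketch only: create an epoll descriptor the way fd.c handles its own
 * opens, i.e. if the kernel says no descriptors are left (EMFILE/ENFILE),
 * ask the fd cache to give one back and retry.  close_lru_cached_file()
 * is a hypothetical stand-in for that fd.c responsibility.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/epoll.h>

static bool
close_lru_cached_file(void)
{
    /* A real implementation would close one file tracked by the fd
     * manager and return true; in this sketch there is nothing to close. */
    return false;
}

int
create_epoll_fd_with_retry(void)
{
    for (;;)
    {
        int epfd = epoll_create1(EPOLL_CLOEXEC);

        if (epfd >= 0)
            return epfd;

        /* Out of descriptors: release one we already own and try again. */
        if ((errno == EMFILE || errno == ENFILE) && close_lru_cached_file())
            continue;

        perror("epoll_create1 failed");
        return -1;
    }
}

Without a fallback along these lines the epoll fd simply fails once the limit is reached, which matches the repeated epoll_create1 failures in the log above.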
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Juan José Santamaría Flecha
Date:
Hi,
> When you say corruption, do you mean that the data was corrupted
> afterwards? Or just that the standby spewed errors until you restarted
> with adjusted limits?
We didn't do a proper low-level check to verify that the database was corrupted, but the symptoms led us to believe so: after a restart, the system crashed almost immediately with an 'invalid memory alloc request size' error, no matter the user limits.
There is a backup still available if any test can be of interest.
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Andres Freund
Date:
Hi,

On 2018-10-22 20:18:32 +0200, Juan José Santamaría Flecha wrote:
> > When you say corruption, do you mean that the data was corrupted
> > afterwards? Or just that the standby spewed errors until you restarted
> > with adjusted limits?
>
> We didn't do a proper low-level check to verify that the database was
> corrupted, but the symptoms led us to believe so: after a restart, the
> system crashed almost immediately with an 'invalid memory alloc request
> size' error, no matter the user limits.
>
> There is a backup still available if any test can be of interest.

Yea, I think looking at that would be worthwhile. That an FD leak or such could cause replay to get stuck is not that surprising, but it shouldn't cause corruption. That hints that something further is borked.

Greetings,

Andres Freund