Thread: BUG #15438: Standby corruption after "Too many open files in system" error
BUG #15438: Standby corruption after "Too many open files in system" error
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15438
Logged by:          Juanjo Santamaría
Email address:      juanjo.santamaria@gmail.com
PostgreSQL version: 10.4
Operating system:   Redhat 7.3
Description:

After running for some time under poor conditions (the open file limit was too low for the postgres user), the streaming replica ended up in a corrupted state. The primary kept working the whole time, and recreating the standby replica after raising the user limits seems to have solved the issue.

The log is quite verbose, so I have tried to summarize at least one occurrence of every error:

Oct 15 10:47:54 postgres[9507]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:48:38 postgres[9507]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:48:38 postgres[40098]: [15-1] LOG: record with incorrect prev-link 6/A1A993C8 at 6/BEA99400
Oct 15 10:48:38 postgres[9550]: [9-1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
Oct 15 10:48:38 postgres[9550]: [9-2] #011#011This probably means the server terminated abnormally
Oct 15 10:48:38 postgres[9550]: [9-3] #011#011before or while processing the request.
Oct 15 10:48:53 postgres[9586]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:49:32 postgres[9586]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:49:32 postgres[40098]: [16-1] LOG: invalid record length at 6/BEABD0F8: wanted 24, got 0
Oct 15 10:50:31 postgres[9765]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:50:39 postgres[9765]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:50:39 postgres[40098]: [19-1] LOG: invalid resource manager ID 101 at 6/BEAE3708
Oct 15 10:51:01 postgres[9767]: [9-1] FATAL: could not connect to the primary server: server closed the connection unexpectedly
Oct 15 10:51:01 postgres[9767]: [9-2] #011#011This probably means the server terminated abnormally
Oct 15 10:51:01 postgres[9767]: [9-3] #011#011before or while processing the request.
Oct 15 10:51:06 postgres[9783]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 10:51:39 postgres[9783]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:51:39 postgres[40098]: [20-1] LOG: record with incorrect prev-link 3E/746F6F72 at 6/BEC3D5E8
Oct 15 11:03:39 postgres[10850]: [9-1] FATAL: could not connect to the primary server: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 15 12:00:10 postgres[40101]: [3732-1] LOG: restartpoint complete: wrote 61 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.159 s, sync=0.018 s, total=6.223 s; sync files=16, longest=0.006 s, average=0.001 s; distance=451 kB, estimate=2215 kB
Oct 15 12:01:40 postgres[15182]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:01:40 postgres[40098]: [80-1] LOG: invalid resource manager ID 64 at 6/BEE34F00
Oct 15 12:01:42 postgres[15711]: [9-1] LOG: started streaming WAL from primary at 6/BE000000 on timeline 3
Oct 15 12:05:38 postgres[15711]: [10-1] FATAL: could not receive data from WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:05:38 postgres[40098]: [81-1] LOG: unexpected pageaddr 6/A1E6E000 in log segment 0000000300000006000000BE, offset 15130624
Oct 16 01:34:15 postgres[18890]: [9-1] LOG: started streaming WAL from primary at 6/C3000000 on timeline 3
Oct 16 01:34:15 postgres[18890]: [10-1] FATAL: could not receive data from WAL stream: ERROR: could not open file "pg_wal/0000000300000006000000C3": Too many open files in system
Oct 16 17:58:44 postgres[38030]: [9-1] FATAL: the database system is starting up
Oct 16 17:58:47 postgres[38029]: [9-1] LOG: entering standby mode
Oct 16 17:58:47 postgres[38029]: [10-1] LOG: redo starts at 6/CF9B3598
Oct 16 17:58:47 postgres[38029]: [11-1] LOG: consistent recovery state reached at 6/CF9C53B8
Oct 16 17:58:47 postgres[38026]: [8-1] LOG: database system is ready to accept read only connections
Oct 16 17:58:48 postgres[38029]: [12-1] FATAL: invalid memory alloc request size 1715044352
Oct 16 17:58:48 postgres[38026]: [9-1] LOG: startup process (PID 38029) exited with exit code 1
Oct 16 17:58:48 postgres[38026]: [10-1] LOG: terminating any other active server processes
Oct 16 17:58:48 postgres[38026]: [11-1] LOG: database system is shut down
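A note on reading these messages: on Linux, "Too many open files in system" is the strerror() text for ENFILE, i.e. the kernel-wide file table (fs.file-max) is full, while exhausting only the per-process limit (ulimit -n / RLIMIT_NOFILE) produces EMFILE, "Too many open files". The tiny sketch below is just an illustration of that mapping, not part of the report:

/* Minimal illustration (not from the report): print the two errno strings
 * so the per-process and system-wide variants are easy to tell apart. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("EMFILE: %s\n", strerror(EMFILE));   /* "Too many open files" */
    printf("ENFILE: %s\n", strerror(ENFILE));   /* "Too many open files in system" */
    return 0;
}

So the wording in the log points at the machine-wide table being exhausted, which can be worth checking (fs.file-max) in addition to the per-user limits.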
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Alvaro Herrera
Date:
On 2018-Oct-18, PG Bug reporting form wrote:

> The primary was working the whole time and recreating the standby replica,
> after configuring the user limits, seems to solve the issue.

Hmm, we recently fixed a bug where file descriptors were being leaked, after 10.5 was tagged (commit e00f4b68dc87 [1]), but that was in logical decoding, and the symptoms seem completely different anyway. I'm not sure it would affect you. The epoll_create stuff in latch.c is pretty young, though.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Andres Freund
Date:
Hi,

On 2018-10-22 14:45:04 -0300, Alvaro Herrera wrote:
> On 2018-Oct-18, PG Bug reporting form wrote:
> > The primary was working the whole time and recreating the standby replica,
> > after configuring the user limits, seems to solve the issue.

When you say corruption, do you mean that the data was corrupted afterwards? Or just that the standby spewed errors until you restarted with adjusted limits?

> Hmm, we recently fixed a bug where file descriptors were being leaked,
> after 10.5 was tagged (commit e00f4b68dc87 [1]), but that was in logical
> decoding, and the symptoms seem completely different anyway. I'm not
> sure it would affect you. The epoll_create stuff in latch.c is pretty
> young, though.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
> [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

I'd assume the fact that this hit epoll_create is mostly owing to the fact that the file descriptor used for epoll isn't going through fd.c. That means we don't first release an fd if we're already at our limit. That's supposed to be catered for by a reserve of fds, but that reserve isn't that large. There have previously been discussions about whether we can make fd.c handle fds that need to be created outside of fd.c's remit (by making sure there's space, and then tracking them similarly to OpenTransientFile()).

Greetings,

Andres Freund
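To make that mechanism concrete, here is a minimal sketch of the "make sure there's space, then retry" idea for descriptors created outside fd.c. It is not PostgreSQL source; close_lru_cached_file() is a hypothetical stand-in for fd.c's ability to close a least-recently-used cached file:

/*
 * Sketch only: create an epoll descriptor the way fd.c handles its own
 * opens, i.e. if the kernel says no descriptors are left (EMFILE/ENFILE),
 * ask the fd cache to give one back and retry.  close_lru_cached_file()
 * is a hypothetical stand-in for that fd.c responsibility.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/epoll.h>

static bool
close_lru_cached_file(void)
{
    /* A real implementation would close one file tracked by the fd
     * manager and return true; in this sketch there is nothing to close. */
    return false;
}

int
create_epoll_fd_with_retry(void)
{
    for (;;)
    {
        int epfd = epoll_create1(EPOLL_CLOEXEC);

        if (epfd >= 0)
            return epfd;

        /* Out of descriptors: release one we already own and try again. */
        if ((errno == EMFILE || errno == ENFILE) && close_lru_cached_file())
            continue;

        perror("epoll_create1 failed");
        return -1;
    }
}

Without a fallback along these lines the epoll fd simply fails once the limit is reached, which matches the repeated epoll_create1 failures in the log above.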
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Juan José Santamaría Flecha
Date:
Hi,
> When you say corruption, do you mean that the data was corrupted
> afterwards? Or just that the standby spewed errors until you restarted
> with adjusted limits?
We didn't do a proper low-level check to verify that the database was corrupted, but the symptoms led us to believe so: after a restart, the system crashed almost immediately with an 'invalid memory alloc request size' error, no matter the user limits.
There is a backup still available if any test can be of interest.
Re: BUG #15438: Standby corruption after "Too many open files in system" error
From
Andres Freund
Date:
Hi,

On 2018-10-22 20:18:32 +0200, Juan José Santamaría Flecha wrote:
> > When you say corruption, do you mean that the data was corrupted
> > afterwards? Or just that the standby spewed errors until you restarted
> > with adjusted limits?
>
> We didn't do a proper low-level check to verify that the database was
> corrupted, but the symptoms led us to believe so: after a restart, the
> system crashed almost immediately with an 'invalid memory alloc request
> size' error, no matter the user limits.
>
> There is a backup still available if any test can be of interest.

Yea, I think looking at that would be worthwhile. That an FD leak or such could cause replay to get stuck is not that surprising, but it shouldn't cause corruption. That hints that something further is borked.

Greetings,

Andres Freund