Thread: 13.4 on RDS, SSL SYSCALL EOF on restore
Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway with the dreaded "SSL SYSCALL error: EOF detected" error.
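For reference, a restore with 16 parallel jobs looks roughly like the following; the endpoint, database name, and dump path are placeholders and the exact flags are an assumption, not the literal command used:

  pg_restore -h <rds-endpoint> -U postgres -d <dbname> -j 16 <dump-dir-or-file>

(pg_restore only parallelizes custom- or directory-format dumps, so -j 16 assumes one of those formats.)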
I did create a parameter group to hopefully speed up the restoration process; it includes:
- wal_buffers 8192 (64MB)
- checkpoint_timeout 3600 (1h)
- min_wal_size 192 (192MB)
- max_wal_size 102400 (100GB)
- shared_buffers 524288 (4GB)
- synchronous_commit 0 (off)
- autovacuum 0 (off)
- maintenance_work_mem 2097152 (2GB)
- work_mem 32768 (32MB)
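For reference, settings like these live in a custom RDS parameter group. A rough AWS CLI sketch (the group name is a placeholder; dynamic parameters such as maintenance_work_mem apply immediately, while static ones such as shared_buffers need ApplyMethod=pending-reboot and an instance reboot):

  aws rds modify-db-parameter-group \
      --db-parameter-group-name <restore-tuning-group> \
      --parameters "ParameterName=maintenance_work_mem,ParameterValue=2097152,ApplyMethod=immediate" \
                   "ParameterName=shared_buffers,ParameterValue=524288,ApplyMethod=pending-reboot"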
I sourced these from a few different folks as well as some trial and error, but now it's blowing up on me.
If I revert the RDS instance back to default PG parameters, it restores, but it takes 3x the time.
Appreciate any tips.
On Fri, Oct 8, 2021 at 02:27:55PM -0700, Wells Oliver wrote:
> Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS
> instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway
> with the dreaded "SSL SYSCALL error: EOF detected" error.
>
> I did create a parameter group to hopefully speed the restoration process, it
> includes:
>
> - wal_buffers 8192 (64MB)
> - checkpoint_timeout 3600 (1h)
> - min_wal_size 192 (192MB)
> - max_wal_size 102400 (100GB)
> - shared_buffers 524288 (4GB)
> - synchronous_commit 0 (off)
> - autovacuum 0 (off)
> - maintenance_work_mem 2097152 (2GB)
> - work_mem 32768 (32MB)
>
> I sourced these from a few different folks as well as some trial and error, but
> now it's blowing up on me.
>
> If I revert the RDS instance back to default PG parameters, it restores, but it
> takes 3x the time.

Wow, that is weird.  I see that error string happening when PG can't
extend the receipt buffer on the client side:

    appendPQExpBufferStr(&conn->errorMessage,
                         libpq_gettext("SSL SYSCALL error: EOF detected\n"));

Can you check the server logs to see if there is any error there?  If I
had to take a guess, I would reduce maintenance_work_mem to 1GB and
retest.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
The only things I see in the RDS log are:
2021-10-06 16:36:05.359 GMT [5009] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05.359 GMT [5009] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05 UTC::@:[5009]:LOG: database system is shut down
Postgres Shared Memory Value: 4319797248 bytes
2021-10-06 16:36:05.619 GMT [5035] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05.619 GMT [5035] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05 UTC::@:[5035]:LOG: redirecting log output to logging collector process
2021-10-06 16:36:05 UTC::@:[5035]:HINT: Future log output will appear in directory "/rdsdbdata/log/error".
2021-10-07 21:52:21.343 GMT [12664] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21.343 GMT [12664] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21 UTC::@:[12664]:LOG: database system is shut down
Postgres Shared Memory Value: 4560945152 bytes
2021-10-07 21:52:21.542 GMT [12682] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21.543 GMT [12682] LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21 UTC::@:[12682]:LOG: redirecting log output to logging collector process
2021-10-07 21:52:21 UTC::@:[12682]:HINT: Future log output will appear in directory "/rdsdbdata/log/error".
2021-10-07 22:19:47 UTC::@:[12683]:LOG: skipping missing configuration file "/rdsdbdata/config/recovery.conf"
Is there some parameter I can change to get more verbose logging in the event of these failures?
On Fri, Oct 8, 2021 at 2:35 PM Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Oct 8, 2021 at 02:27:55PM -0700, Wells Oliver wrote:
> Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS
> instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway
> with the dreaded "SSL SYSCALL error: EOF detected" error.
>
> I did create a parameter group to hopefully speed the restoration process, it
> includes:
>
> - wal_buffers 8192 (64MB)
> - checkpoint_timeout 3600 (1h)
> - min_wal_size 192 (192MB)
> - max_wal_size 102400 (100GB)
> - shared_buffers 524288 (4GB)
> - synchronous_commit 0 (off)
> - autovacuum 0 (off)
> - maintenance_work_mem 2097152 (2GB)
> - work_mem 32768 (32MB)
>
> I sourced these from a few different folks as well as some trial and error, but
> now it's blowing up on me.
>
> If I revert the RDS instance back to default PG parameters, it restores, but it
> takes 3x the time.
Wow, that is weird. I see that error string happening when PG can't
extend the receipt buffer on the client side:
appendPQExpBufferStr(&conn->errorMessage,
libpq_gettext("SSL SYSCALL error: EOF detected\n"));
Can you check the server logs to see if there is any error there? If I
had to take a guess, I would reduce maintenance_work_mem to 1GB and
retest.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Wells Oliver
wells.oliver@gmail.com
Wells Oliver <wells.oliver@gmail.com> writes:
> The only thing I see in the RDS log are:
> 2021-10-06 16:36:05 UTC::@:[5035]:LOG: redirecting log output to logging
> collector process
> 2021-10-06 16:36:05 UTC::@:[5035]:HINT: Future log output will appear in
> directory "/rdsdbdata/log/error".

So what's in /rdsdbdata/log/error ?

                        regards, tom lane
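There is no shell access to /rdsdbdata on RDS, so the files under log/error have to be pulled through the console or the CLI. A rough sketch (the instance identifier is a placeholder; the log file name comes from the first command's output):

  aws rds describe-db-log-files --db-instance-identifier <instance-id>
  aws rds download-db-log-file-portion --db-instance-identifier <instance-id> \
      --log-file-name <log-file-name> --starting-token 0 --output text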
Dug out some more logging:
2021-10-08 20:18:46 UTC::@:[27842]:LOG: checkpoint complete: wrote 43719 buffers (8.3%); 0 WAL file(s) added, 0 removed, 842 recycled; write=3239.859 s, sync=0.560 s, total=3242.821 s; sync files=67, longest=0.050 s, average=0.008 s; distance=55175592 kB, estimate=55185037 kB
2021-10-08 20:24:44 UTC::@:[27842]:LOG: checkpoint starting: time
2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was terminated by signal 9: Killed
2021-10-08 20:35:08 UTC::@:[12682]:DETAIL: Failed process was running: CREATE INDEX ...
2021-10-08 20:35:08 UTC::@:[12682]:LOG: terminating any other active server processes
2021-10-08 20:35:08 UTC::@:[12682]:LOG: archiver process (PID 27845) exited with exit code 2
2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:WARNING: terminating connection because of crash of another server process
2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
That block of 3 lines is repeated several times, and then:
2021-10-08 20:35:08 UTC:172.31.3.72(46978):[unknown]@[unknown]:[22197]:LOG: PID 3970 in cancel request did not match any process
2021-10-08 20:35:09 UTC:172.31.3.72(46984):postgres@dbname:[22202]:FATAL: the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46986):postgres@dbname:[22203]:FATAL: the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46988):postgres@dbname:[22207]:FATAL: the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46990):postgres@dbname:[22208]:FATAL: the database system is in recovery mode
2021-10-08 20:35:09 UTC::@:[12682]:LOG: all server processes terminated; reinitializing
2021-10-08 20:35:09 UTC::@:[22209]:LOG: database system was interrupted; last known up at 2021-10-08 20:18:43 UTC
2021-10-08 20:35:09 UTC::@:[22209]:LOG: database system was not properly shut down; automatic recovery in progress
2021-10-08 20:35:09 UTC::@:[22209]:LOG: redo starts at 144/14652BF0
2021-10-08 20:35:20 UTC:127.0.0.1(27834):rdsadmin@rdsadmin:[22478]:FATAL: the database system is in recovery mode
That last line repeats _several_ times, and then:
2021-10-08 20:37:48 UTC::@:[22209]:LOG: redo done at 14B/5D9DF848
2021-10-08 20:37:48 UTC::@:[22209]:LOG: checkpoint starting: end-of-recovery immediate
2021-10-08 20:37:51 UTC:127.0.0.1(28136):rdsadmin@rdsadmin:[25911]:FATAL: the database system is in recovery mode
2021-10-08 20:37:53 UTC:127.0.0.1(28138):rdsadmin@rdsadmin:[26058]:FATAL: the database system is in recovery mode
2021-10-08 20:37:53 UTC:127.0.0.1(28140):rdsadmin@rdsadmin:[26059]:FATAL: the database system is in recovery mode
2021-10-08 20:38:00 UTC::@:[22209]:LOG: checkpoint complete: wrote 524188 buffers (100.0%); 0 WAL file(s) added, 434 removed, 0 recycled; write=10.006 s, sync=0.007 s, total=11.772 s; sync files=77, longest=0.007 s, average=0.001 s; distance=30559795 kB, estimate=30559795 kB
2021-10-08 20:38:00 UTC::@:[12682]:LOG: database system is ready to accept connections
On Fri, Oct 8, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Wells Oliver <wells.oliver@gmail.com> writes:
> The only thing I see in the RDS log are:
> 2021-10-06 16:36:05 UTC::@:[5035]:LOG: redirecting log output to logging
> collector process
> 2021-10-06 16:36:05 UTC::@:[5035]:HINT: Future log output will appear in
> directory "/rdsdbdata/log/error".
So what's in /rdsdbdata/log/error ?
regards, tom lane
Wells Oliver
wells.oliver@gmail.com
On Fri, Oct 8, 2021 at 03:34:04PM -0700, Wells Oliver wrote:
> Dug out some more logging:
>
> 2021-10-08 20:18:46 UTC::@:[27842]:LOG: checkpoint complete: wrote 43719
> buffers (8.3%); 0 WAL file(s) added, 0 removed, 842 recycled; write=3239.859 s,
> sync=0.560 s, total=3242.821 s; sync files=67, longest=0.050 s, average=0.008
> s; distance=55175592 kB, estimate=55185037 kB
> 2021-10-08 20:24:44 UTC::@:[27842]:LOG: checkpoint starting: time
> 2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was
> terminated by signal 9: Killed
  ------------------------------

Ah, this line above is probably the out-of-memory killer, meaning you
allocated too much memory and the operating system killed a backend
process with kill -9, causing the server to have to restart.  Reducing
maintenance_work_mem is probably the right solution, or increase your
RAM.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
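A rough back-of-envelope, assuming several of the 16 restore jobs were building indexes at the same time:

  16 workers x 2GB maintenance_work_mem  = up to 32GB of sort memory
  + shared_buffers                       =         4GB
  against total instance RAM             =        16GB

so only a handful of concurrent CREATE INDEX statements is enough to push the instance into the OOM killer.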
On 2021-Oct-08, Wells Oliver wrote:

> Dug out some more logging:

> 2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was terminated by signal 9: Killed
> 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL: Failed process was running: CREATE INDEX ...

So what's happening here is that the instance is running out of RAM
while creating some index, and the kernel is killing the process.  I
would probably blame the combination of shared_buffers=4GB with
maintenance_work_mem=2GB, together with the instance's total RAM.

I would suggest keeping an eye on memory usage (if there is such a
thing) in the RDS console; with some experimentation, reducing
maintenance_work_mem as Bruce advised and/or shared_buffers, you can
probably find some combination that doesn't take as long as with the
default params but is able to complete.

--
Álvaro Herrera              PostgreSQL Developer  —  https://www.EnterpriseDB.com/
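For the "keep an eye on memory usage" part: RDS publishes a FreeableMemory metric to CloudWatch, so something along these lines works from the CLI (the instance identifier and time window are placeholders):

  aws cloudwatch get-metric-statistics --namespace AWS/RDS \
      --metric-name FreeableMemory \
      --dimensions Name=DBInstanceIdentifier,Value=<instance-id> \
      --start-time 2021-10-08T20:00:00Z --end-time 2021-10-08T21:00:00Z \
      --period 60 --statistics Minimum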
On 2021-Oct-08, Alvaro Herrera wrote:

> On 2021-Oct-08, Wells Oliver wrote:
>
> > Dug out some more logging:
>
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL: Failed process was running: CREATE INDEX ...
>
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process.  I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.

Also, maybe RDS could be smarter about this situation.

--
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"The first law of live demos is: don't try to use the system.  Write a
script that doesn't touch anything, so as not to cause damage."
(Jakob Nielsen)
Thanks all. Reducing maintenance_work_mem to 1GB and keeping shared_buffers at 4GB did allow the restore to complete. It took ~770 minutes with 16 processes and that configuration, but only ~500 minutes with maintenance_work_mem at 2GB, shared_buffers at 4GB, and 8 processes. Of course, with maintenance_work_mem at 2GB and 16 processes, we ran out of memory and kaboom.
If anyone has any more parameter values I should try to improve on that restore time, I'd love to hear them.
Thanks for the help here.
On Fri, Oct 8, 2021 at 3:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Oct-08, Alvaro Herrera wrote:
> On 2021-Oct-08, Wells Oliver wrote:
>
> > Dug out some more logging:
>
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL: Failed process was running: CREATE INDEX ...
>
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process. I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.
Also, maybe RDS could be smarter about this situation.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Wells Oliver
wells.oliver@gmail.com
What if you use a bigger instance (and maybe provisioned IOPS, too) just for the time it takes to restore the database, and then modify it down to a "production-sized" instance?
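A sketch of that approach with the AWS CLI (the instance identifier and target class are illustrative only; the class change involves a brief outage, and the same command scales it back down afterwards):

  aws rds modify-db-instance --db-instance-identifier <instance-id> \
      --db-instance-class db.m6g.2xlarge --apply-immediately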
On 10/9/21 1:20 PM, Wells Oliver wrote:
Thanks all. Reducing maintenance_work_mem to 1GB and keeping shared_buffers at 4GB did allow the restore to complete. It took ~770m with 16 processes and that configuration, but only ~500m with maintenance_work_mem at 2GB, shared_buffers at 4GB, and 8 processes. Of course, with maintenance_work_mem at 2GB and 16 processes, we ran out of memory and kaboom.

If anyone has any more parameter values I should try to improve on that restore time, I'd love to hear them.

Thanks for the help here.

On Fri, Oct 8, 2021 at 3:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2021-Oct-08, Alvaro Herrera wrote:
> On 2021-Oct-08, Wells Oliver wrote:
>
> > Dug out some more logging:
>
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG: server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL: Failed process was running: CREATE INDEX ...
>
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process. I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.
Also, maybe RDS could be smarter about this situation.
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)--Wells Oliver
wells.oliver@gmail.com
--
Angular momentum makes the world go 'round.