Thread: 13.4 on RDS, SSL SYSCALL EOF on restore

13.4 on RDS, SSL SYSCALL EOF on restore

From
Wells Oliver
Date:
Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway with the dreaded "SSL SYSCALL error: EOF detected" error.

I did create a parameter group to hopefully speed the restoration process, it includes:

- wal_buffers 8192 (64MB)
- checkpoint_timeout 3600 (1h)
- min_wal_size 192 (192MB)
- max_wal_size 102400 (100GB)
- shared_buffers 524288 (4GB)
- synchronous_commit 0 (off)
- autovacuum 0 (off)
- maintenance_work_mem 2097152 (2GB)
- work_mem 32768 (32MB)

I sourced these from a few different folks as well as some trial and error, but now it's blowing up on me. 

If I revert the RDS instance back to default PG parameters, it restores, but it takes 3x the time.

Appreciate any tips.

--

Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Bruce Momjian
Date:
On Fri, Oct  8, 2021 at 02:27:55PM -0700, Wells Oliver wrote:
> Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS
> instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway
> with the dreaded "SSL SYSCALL error: EOF detected" error.
> 
> I did create a parameter group to hopefully speed the restoration process, it
> includes:
> 
> - wal_buffers 8192 (64MB)
> - checkpoint_timeout 3600 (1h)
> - min_wal_size 192 (192MB)
> - max_wal_size 102400 (100GB)
> - shared_buffers 524288 (4GB)
> - synchronous_commit 0 (off)
> - autovacuum 0 (off)
> - maintenance_work_mem 2097152 (2GB)
> - work_mem 32768 (32MB)
> 
> I sourced these from a few different folks as well as some trial and error, but
> now it's blowing up on me. 
> 
> If I revert the RDS instance back to default PG parameters, it restores, but it
> takes 3x the time.

Wow, that is weird.  I see that error string happening when PG can't
extend the receipt buffer on the client side:

     appendPQExpBufferStr(&conn->errorMessage,
                   libpq_gettext("SSL SYSCALL error: EOF detected\n"));

Can you check the server logs to see if there is any error there?  If I
had to take a guess, I would reduce maintenance_work_mem to 1GB and
retest.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Wells Oliver
Date:
The only thing I see in the RDS log are:

2021-10-06 16:36:05.359 GMT [5009] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05.359 GMT [5009] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05 UTC::@:[5009]:LOG:  database system is shut down
Postgres Shared Memory Value: 4319797248 bytes
2021-10-06 16:36:05.619 GMT [5035] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05.619 GMT [5035] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-06 16:36:05 UTC::@:[5035]:LOG:  redirecting log output to logging collector process
2021-10-06 16:36:05 UTC::@:[5035]:HINT:  Future log output will appear in directory "/rdsdbdata/log/error".
2021-10-07 21:52:21.343 GMT [12664] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21.343 GMT [12664] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21 UTC::@:[12664]:LOG:  database system is shut down
Postgres Shared Memory Value: 4560945152 bytes
2021-10-07 21:52:21.542 GMT [12682] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21.543 GMT [12682] LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"
2021-10-07 21:52:21 UTC::@:[12682]:LOG:  redirecting log output to logging collector process
2021-10-07 21:52:21 UTC::@:[12682]:HINT:  Future log output will appear in directory "/rdsdbdata/log/error".
2021-10-07 22:19:47 UTC::@:[12683]:LOG:  skipping missing configuration file "/rdsdbdata/config/recovery.conf"

Is there some parameter I can change to get more verbose logging in the event of these failures?


On Fri, Oct 8, 2021 at 2:35 PM Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Oct  8, 2021 at 02:27:55PM -0700, Wells Oliver wrote:
> Hi: I am restoring a ~100GB backup using 16 jobs from an EC2 instance to an RDS
> instance (db.m6g.xlarge, which is 16GB RAM and 4 CPU) and it's dying midway
> with the dreaded "SSL SYSCALL error: EOF detected" error.
>
> I did create a parameter group to hopefully speed the restoration process, it
> includes:
>
> - wal_buffers 8192 (64MB)
> - checkpoint_timeout 3600 (1h)
> - min_wal_size 192 (192MB)
> - max_wal_size 102400 (100GB)
> - shared_buffers 524288 (4GB)
> - synchronous_commit 0 (off)
> - autovacuum 0 (off)
> - maintenance_work_mem 2097152 (2GB)
> - work_mem 32768 (32MB)
>
> I sourced these from a few different folks as well as some trial and error, but
> now it's blowing up on me. 
>
> If I revert the RDS instance back to default PG parameters, it restores, but it
> takes 3x the time.

Wow, that is weird.  I see that error string happening when PG can't
extend the receipt buffer on the client side:

     appendPQExpBufferStr(&conn->errorMessage,
                   libpq_gettext("SSL SYSCALL error: EOF detected\n"));

Can you check the server logs to see if there is any error there?  If I
had to take a guess, I would reduce maintenance_work_mem to 1GB and
retest.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



--

Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Tom Lane
Date:
Wells Oliver <wells.oliver@gmail.com> writes:
> The only thing I see in the RDS log are:

> 2021-10-06 16:36:05 UTC::@:[5035]:LOG:  redirecting log output to logging
> collector process
> 2021-10-06 16:36:05 UTC::@:[5035]:HINT:  Future log output will appear in
> directory "/rdsdbdata/log/error".

So what's in /rdsdbdata/log/error ?

            regards, tom lane



Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Wells Oliver
Date:
Dug out some more logging:

2021-10-08 20:18:46 UTC::@:[27842]:LOG:  checkpoint complete: wrote 43719 buffers (8.3%); 0 WAL file(s) added, 0 removed, 842 recycled; write=3239.859 s, sync=0.560 s, total=3242.821 s; sync files=67, longest=0.050 s, average=0.008 s; distance=55175592 kB, estimate=55185037 kB
2021-10-08 20:24:44 UTC::@:[27842]:LOG:  checkpoint starting: time
2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was terminated by signal 9: Killed
2021-10-08 20:35:08 UTC::@:[12682]:DETAIL:  Failed process was running: CREATE INDEX ...

2021-10-08 20:35:08 UTC::@:[12682]:LOG:  terminating any other active server processes
2021-10-08 20:35:08 UTC::@:[12682]:LOG:  archiver process (PID 27845) exited with exit code 2

2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:WARNING:  terminating connection because of crash of another server process
2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-10-08 20:35:08 UTC:172.31.3.72(46884):postgres@dbname:[3969]:HINT:  In a moment you should be able to reconnect to the database and repeat your command.

That block of 3 lines i repeated several times, and then:

2021-10-08 20:35:08 UTC:172.31.3.72(46978):[unknown]@[unknown]:[22197]:LOG:  PID 3970 in cancel request did not match any process
2021-10-08 20:35:09 UTC:172.31.3.72(46984):postgres@dbname:[22202]:FATAL:  the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46986):postgres@dbname:[22203]:FATAL:  the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46988):postgres@dbname:[22207]:FATAL:  the database system is in recovery mode
2021-10-08 20:35:09 UTC:172.31.3.72(46990):postgres@dbname:[22208]:FATAL:  the database system is in recovery mode
2021-10-08 20:35:09 UTC::@:[12682]:LOG:  all server processes terminated; reinitializing
2021-10-08 20:35:09 UTC::@:[22209]:LOG:  database system was interrupted; last known up at 2021-10-08 20:18:43 UTC
2021-10-08 20:35:09 UTC::@:[22209]:LOG:  database system was not properly shut down; automatic recovery in progress
2021-10-08 20:35:09 UTC::@:[22209]:LOG:  redo starts at 144/14652BF0
2021-10-08 20:35:20 UTC:127.0.0.1(27834):rdsadmin@rdsadmin:[22478]:FATAL:  the database system is in recovery mode

Repeat that last line _several_ times and then:

2021-10-08 20:37:48 UTC::@:[22209]:LOG:  redo done at 14B/5D9DF848
2021-10-08 20:37:48 UTC::@:[22209]:LOG:  checkpoint starting: end-of-recovery immediate
2021-10-08 20:37:51 UTC:127.0.0.1(28136):rdsadmin@rdsadmin:[25911]:FATAL:  the database system is in recovery mode
2021-10-08 20:37:53 UTC:127.0.0.1(28138):rdsadmin@rdsadmin:[26058]:FATAL:  the database system is in recovery mode
2021-10-08 20:37:53 UTC:127.0.0.1(28140):rdsadmin@rdsadmin:[26059]:FATAL:  the database system is in recovery mode
2021-10-08 20:38:00 UTC::@:[22209]:LOG:  checkpoint complete: wrote 524188 buffers (100.0%); 0 WAL file(s) added, 434 removed, 0 recycled; write=10.006 s, sync=0.007 s, total=11.772 s; sync files=77, longest=0.007 s, average=0.001 s; distance=30559795 kB, estimate=30559795 kB
2021-10-08 20:38:00 UTC::@:[12682]:LOG:  database system is ready to accept connections



On Fri, Oct 8, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Wells Oliver <wells.oliver@gmail.com> writes:
> The only thing I see in the RDS log are:

> 2021-10-06 16:36:05 UTC::@:[5035]:LOG:  redirecting log output to logging
> collector process
> 2021-10-06 16:36:05 UTC::@:[5035]:HINT:  Future log output will appear in
> directory "/rdsdbdata/log/error".

So what's in /rdsdbdata/log/error ?

                        regards, tom lane


--

Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Bruce Momjian
Date:
On Fri, Oct  8, 2021 at 03:34:04PM -0700, Wells Oliver wrote:
> Dug out some more logging:
> 
> 2021-10-08 20:18:46 UTC::@:[27842]:LOG:  checkpoint complete: wrote 43719
> buffers (8.3%); 0 WAL file(s) added, 0 removed, 842 recycled; write=3239.859 s,
> sync=0.560 s, total=3242.821 s; sync files=67, longest=0.050 s, average=0.008
> s; distance=55175592 kB, estimate=55185037 kB
> 2021-10-08 20:24:44 UTC::@:[27842]:LOG:  checkpoint starting: time
> 2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was
> terminated by signal 9: Killed
  ------------------------------

Ah, this line above is probably the out-of-memory killer, meaning you
allocated too much memory and the operating system killed a backend
process with kill -9, causing the server to have to restart.  Reducing
maintenance_work_mem is probably the right solution, or increase your
RAM.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Alvaro Herrera
Date:
On 2021-Oct-08, Wells Oliver wrote:

> Dug out some more logging:

> 2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was terminated by signal 9: Killed
> 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL:  Failed process was running: CREATE INDEX ...

So what's happening here is that the instance is running out of RAM
while creating some index, and the kernel is killing the process.  I
would probably blame the combination of shared_buffers=4GB with
maintenance_work_mem=2GB, together with the instance's total RAM.

I would suggest to keep an eye on memory usage (if there is such a
thing) in the RDS console; with some experimentation, reducing
maintenace_work_mem as Bruce adviced and/or shared_buffers, you can
probably find some combination that doesn't take as long as with the
default params but is able to complete.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Alvaro Herrera
Date:
On 2021-Oct-08, Alvaro Herrera wrote:

> On 2021-Oct-08, Wells Oliver wrote:
> 
> > Dug out some more logging:
> 
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL:  Failed process was running: CREATE INDEX ...
> 
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process.  I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.

Also, maybe RDS could be smarter about this situation.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)



Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Wells Oliver
Date:
Thanks all. Reducing maintenance_work_mem to 1GB and keeping shared_buffers at 4GB did allow the restore to complete. It took ~770m with 16 processes and that configuration, but only ~500m with maintenance_work_mem at 2GB, shared_buffers at 4GB, and 8 processes. Of course, with maintenance_work_mem at 2GB and 16 processes, we ran out of memory and kaboom.

If anyone has any more parameter values I should try to improve on that restore time, I'd love to hear them.

Thanks for the help here.


On Fri, Oct 8, 2021 at 3:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Oct-08, Alvaro Herrera wrote:

> On 2021-Oct-08, Wells Oliver wrote:
>
> > Dug out some more logging:
>
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL:  Failed process was running: CREATE INDEX ...
>
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process.  I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.

Also, maybe RDS could be smarter about this situation.

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)


--

Re: 13.4 on RDS, SSL SYSCALL EOF on restore

From
Ron
Date:
What if you use a bigger instance (and maybe provisioned IO, too) just for the time it takes to restore the database, and then modify it down to a "production sized" instance?

On 10/9/21 1:20 PM, Wells Oliver wrote:
Thanks all. Reducing maintenance_work_mem to 1GB and keeping shared_buffers at 4GB did allow the restore to complete. It took ~770m with 16 processes and that configuration, but only ~500m with maintenance_work_mem at 2GB, shared_buffers at 4GB, and 8 processes. Of course, with maintenance_work_mem at 2GB and 16 processes, we ran out of memory and kaboom.

If anyone has any more parameter values I should try to improve on that restore time, I'd love to hear them.

Thanks for the help here.


On Fri, Oct 8, 2021 at 3:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Oct-08, Alvaro Herrera wrote:

> On 2021-Oct-08, Wells Oliver wrote:
>
> > Dug out some more logging:
>
> > 2021-10-08 20:35:08 UTC::@:[12682]:LOG:  server process (PID 3970) was terminated by signal 9: Killed
> > 2021-10-08 20:35:08 UTC::@:[12682]:DETAIL:  Failed process was running: CREATE INDEX ...
>
> So what's happening here is that the instance is running out of RAM
> while creating some index, and the kernel is killing the process.  I
> would probably blame the combination of shared_buffers=4GB with
> maintenance_work_mem=2GB, together with the instance's total RAM.

Also, maybe RDS could be smarter about this situation.

--
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)


--

--
Angular momentum makes the world go 'round.