Thread: Postgres PANIC when it could not open file in pg_logical/snapshots directory

Hi all,

I have a Postgres version 11.11 configured with both physical replication slots (for repmgr) as well as some logical replication slots (for AWS Database Migration Service (DMS)). This morning, the server went panic with the following messages found in the log file:

2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]" user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  server process (PID 19457) was terminated by signal 6: Aborted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  terminating any other active server processes


The PG server then terminates all existing PG processes.

The process with 19457 is from one of the DMS replication tasks, I have no clue why it suddenly couldn't open a snapshot file. I checked the server load and file systems and didn't find anything unusual at that time.

Appreciate if you can give me some guidance on troubleshooting this issue

Thanks

Regards,
Mike Yeap

On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020@gmail.com> wrote:
Hi all,

I have a Postgres version 11.11 configured with both physical replication slots (for repmgr) as well as some logical replication slots (for AWS Database Migration Service (DMS)). This morning, the server went panic with the following messages found in the log file:

2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]" user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  server process (PID 19457) was terminated by signal 6: Aborted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  terminating any other active server processes

Are you sure there is nothing else, do you see anything in /var/log/kern.log or dmesg logs. 
 i just did a small simulation of logical replication from A -> B, i deleted one of the snapshots live, i also changed permissions to make it RO
my server did not crash at all. (pg14beta though) although i can try other things to check at pg layer, but if something else externally has happened,
it would be difficult to reproduce.
pardon me for speculating, but 
Is it network storage? did the underlying storage layer have a blip of some kind? 
are the mounts fine? are they readonly or were temporarily readonly ?
no bad hardware ?
If none of the above, did the server restart solve the issue? or is it broken still, unable to start?


The PG server then terminates all existing PG processes.

The process with 19457 is from one of the DMS replication tasks, I have no clue why it suddenly couldn't open a snapshot file. I checked the server load and file systems and didn't find anything unusual at that time.

Appreciate if you can give me some guidance on troubleshooting this issue

Thanks

Regards,
Mike Yeap

is it crashing and dumping cores? 
can you strace the postmaster on its startup to check what it going on ?

I can share my demo setup, but it would be too noisy in the thread, but can do it later if you want.
the above assumptions are based on repmgnr and AWS do not interfere in your primary server internals, just failover and publication.


--
Thanks,
Vijay
Mumbai, India


On Tue, 22 Jun 2021 at 14:34, Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:

On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020@gmail.com> wrote:
Hi all,

I have a Postgres version 11.11 configured with both physical replication slots (for repmgr) as well as some logical replication slots (for AWS Database Migration Service (DMS)). This morning, the server went panic with the following messages found in the log file:

2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]" user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  server process (PID 19457) was terminated by signal 6: Aborted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  terminating any other active server processes


I just tried the below case, when a running logical replication is denied access to the snapshot folder via chattr +i  to make the pg crash at publisher.
Hence I was speculating, something was occured at the filesystem, this may not be related, but just trying to get a scenario to simulate a repeatable crash.

A (port 5001 publisher) -> logical replication all tables -> B(port 5002 subscriber), all working fine.

postgres@db:~/playground/logical_replication$ psql -p 5001 -c 'select count(1) from t;'
 count
-------
  1000
(1 row)

postgres@db:~/playground/logical_replication$ psql -p 5002 -c 'select count(1) from t;'
 count
-------
  1000
(1 row)

now i change file attributes of pg_logical folder  at the publisher

root@db:/home/postgres/playground/logical_replication/db1/pg_logical# cd /home/postgres/playground/logical_replication/db1/pg_logical
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# chattr -R +i *   # do not allow mod
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
----i---------e----- ./mappings
----i---------e----- ./snapshots
----i---------e----- ./replorigin_checkpoint

 psql -p 5001 -c 'delete from t; checkpoint;'  # and crash

2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.003 IST [2899] PANIC:  could not open file "pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.092 IST [2897] LOG:  checkpointer process (PID 2899) was terminated by signal 6: Aborted
2021-06-24 00:22:37.092 IST [2897] LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
2021-06-24 00:22:37.093 IST [2897] LOG:  all server processes terminated; reinitializing
postgres@db:~/playground/logical_replication$ 2021-06-24 00:22:37.100 IST [2920] LOG:  database system was interrupted; last known up at 2021-06-24 00:20:54 IST
2021-06-24 00:22:37.100 IST [2921] FATAL:  the database system is in recovery mode
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16EF1E0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.212 IST [2920] LOG:  database system was not properly shut down; automatic recovery in progress
2021-06-24 00:22:37.214 IST [2920] LOG:  redo starts at 0/16E72B8
2021-06-24 00:22:37.214 IST [2920] LOG:  invalid record length at 0/16EF218: wanted 24, got 0
2021-06-24 00:22:37.214 IST [2920] LOG:  redo done at 0/16EF1E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.219 IST [2920] PANIC:  could not open file "pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.309 IST [2897] LOG:  startup process (PID 2920) was terminated by signal 6: Aborted
2021-06-24 00:22:37.309 IST [2897] LOG:  aborting startup due to startup process failure
2021-06-24 00:22:37.310 IST [2897] LOG:  database system is shut down




*************************
now i restore the attr back for pg_logical and restart the db

root@db:/home/postgres/playground/logical_replication/db1/pg_logical# chattr -R -i *
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./replorigin_checkpoint.tmp
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint


2021-06-24 00:24:00.026 IST [2947] LOG:  database system is ready to accept connections
 done
server started
postgres@db:~/playground/logical_replication$ 2021-06-24 00:24:02.357 IST [2957] LOG:  starting logical decoding for slot "mysub"
2021-06-24 00:24:02.357 IST [2957] DETAIL:  Streaming transactions committing after 0/16EF1E0, reading WAL from 0/16EF1E0.
2021-06-24 00:24:02.357 IST [2957] STATEMENT:  START_REPLICATION SLOT "mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')
2021-06-24 00:24:02.357 IST [2957] LOG:  logical decoding found consistent point at 0/16EF1E0
2021-06-24 00:24:02.357 IST [2957] DETAIL:  There are no running transactions.
2021-06-24 00:24:02.357 IST [2957] STATEMENT:  START_REPLICATION SLOT "mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')
 
I might be wrong, but if you can put up monitoring around your underlying storage, that may help.
I do not know how it works on the cloud, but I used to do it over the hypervisor layer on vmware and emit metrics for the same to relate errors with crash.

--
Thanks,
Vijay
Mumbai, India


On Thu, Jun 24, 2021, 9:28 AM Mike Yeap <wkk1020@gmail.com> wrote:
Hi Vijay, thanks for the tests, it's very helpful.

Just that, isn't it too extreme when one of the processes having problems accessing a snap file, it causes all other processes to be terminated? After all, most of the other processes do not need to access the snap files. Is there a way to fine tune this behaviour?

Thank you.

Regards,
Mike Yeap

I guess the best people to ask this question is the core team, I am just over a year in pg, I am not in a position to call this an issue with postgres unless I understand the underlying design and possible data corruption if any if it does not abort etc.
I'll leave it to the core team, then to give incorrect workaround.


Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory

From
Alvaro Herrera
Date:
On 2021-Jun-22, Mike Yeap wrote:

> I have a Postgres version 11.11 configured with both physical replication
> slots (for repmgr) as well as some logical replication slots (for AWS
> Database Migration Service (DMS)). This morning, the server went panic with
> the following messages found in the log file:
> 
> 2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]"
> user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file
> "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

Hmm, isn't this strange?  open(3) is not documented to return EPERM,
which is what this error string maps to.

You should definitely not mess with the PG data dir; if you do, it's
your problem when things break.  In this case you may not be messing
with the data dir yourself, but perhaps you kernel is buggy or you have
some security module that prevents the operation from working properly,
or something.
 
-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/