Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory - Mailing list pgsql-general

From Vijaykumar Jain
Subject Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory
Date
Msg-id CAM+6J94d-nr-1kUqaXgttwbski_UCawRsTYheZgZfM7A7j3aPg@mail.gmail.com
In response to Postgres PANIC when it could not open file in pg_logical/snapshots directory  (Mike Yeap <wkk1020@gmail.com>)
Responses Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
List pgsql-general

On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020@gmail.com> wrote:
Hi all,

I have a Postgres 11.11 server configured with both physical replication slots (for repmgr) and some logical replication slots (for AWS Database Migration Service (DMS)). This morning, the server panicked with the following messages in the log file:

2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]" user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  server process (PID 19457) was terminated by signal 6: Aborted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  terminating any other active server processes

Are you sure there is nothing else? Do you see anything in /var/log/kern.log or in the dmesg output?
I just ran a small simulation of logical replication from A -> B: I deleted one of the snapshot files while it was running, and I also changed permissions to make them read-only.
My server did not crash at all (that was on pg14beta, though). I can try other things at the PG layer, but if something external happened,
it would be difficult to reproduce.
Pardon me for speculating, but:
Is it network storage? Did the underlying storage layer have a blip of some kind?
Are the mounts fine? Are they read-only, or were they temporarily read-only?
Is there any bad hardware?
If none of the above, did a server restart solve the issue, or is it still broken and unable to start?
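
To make the mount questions above concrete, here is a minimal Python sketch that looks up which mount a data directory lives on and whether that mount is currently read-only or on network storage. It assumes a Linux host where /proc/mounts is available; the names find_mount, is_read_only, is_network_fs and NETWORK_FS_TYPES are hypothetical helpers for illustration, and the list of network filesystem types is deliberately partial. It only reports the current state, so it cannot tell whether a mount was read-only for a moment in the past.

```python
import os

# Partial, illustrative list of filesystem types that indicate network storage.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs"}


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type, options) for the mount containing path.

    mounts_text is text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # /proc/mounts escapes spaces in mount points as \040.
        mount_point = fields[1].replace("\\040", " ")
        fs_type = fields[2]
        options = fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest matching mount point; later entries win ties.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type, options)
    return best


def is_read_only(path, mounts_text):
    """True if the mount containing path carries the 'ro' option."""
    mount = find_mount(path, mounts_text)
    return mount is not None and "ro" in mount[2]


def is_network_fs(path, mounts_text):
    """True if the mount containing path uses a known network filesystem type."""
    mount = find_mount(path, mounts_text)
    return mount is not None and mount[1] in NETWORK_FS_TYPES


def read_proc_mounts():
    """Read the live mount table (Linux only)."""
    with open("/proc/mounts") as handle:
        return handle.read()
```

On the affected server one would call is_read_only(data_dir, read_proc_mounts()), where data_dir is the PostgreSQL data directory that contains pg_logical.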


The PG server then terminates all existing PG processes.

The process with PID 19457 belongs to one of the DMS replication tasks, and I have no clue why it suddenly couldn't open a snapshot file. I checked the server load and file systems and didn't find anything unusual at that time.

I would appreciate any guidance on troubleshooting this issue.

Thanks

Regards,
Mike Yeap

Is it crashing and dumping cores?
Can you strace the postmaster on startup to check what is going on?
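
If a trace is captured, the interesting lines are the open calls that fail. The following is a small Python sketch for picking those out of strace output. It assumes the trace was written in strace's default text format (for example by following child processes with -f and saving to a file with -o), where a failed call ends in "= -1 ERRNO (message)"; failed_opens is a hypothetical helper name, not part of any tool.

```python
import re

# Matches a failed open()/openat() call in strace's default output format,
# e.g.  openat(AT_FDCWD, "some/path", O_RDONLY) = -1 EPERM (Operation not permitted)
FAILED_OPEN = re.compile(
    r'\b(?P<call>open|openat)\((?P<args>.*)\)\s+=\s+-1\s+'
    r'(?P<errno>E[A-Z0-9]+)\s+\((?P<msg>[^)]*)\)'
)

# First double-quoted argument, which is the path for open() and openat().
QUOTED_PATH = re.compile(r'"((?:[^"\\]|\\.)*)"')


def failed_opens(strace_lines):
    """Return a list of (call, path, errno, message) for failed open calls."""
    results = []
    for line in strace_lines:
        match = FAILED_OPEN.search(line)
        if match is None:
            continue
        path_match = QUOTED_PATH.search(match.group("args"))
        path = path_match.group(1) if path_match else None
        results.append(
            (match.group("call"), path, match.group("errno"), match.group("msg"))
        )
    return results
```

A postmaster startup usually produces many harmless ENOENT results, so it helps to filter the returned list for paths under pg_logical/snapshots or for errors such as EPERM, EACCES, EROFS, or EIO.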

I can share my demo setup, but it would be too noisy for the thread; I can do it later if you want.
The above assumes that repmgr and AWS do not interfere with your primary server's internals and only handle failover and publication.


--
Thanks,
Vijay
Mumbai, India
