Thread: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows
BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 16039 Logged by: Hans Buschmann Email address: buschmann@nidsa.net PostgreSQL version: 12.0 Operating system: Windows Server 2019 64bit Description: We just moved our production cluster from pg 11.5 to pg 12.0 under windows using pg_dump/initdb/pg_restore. When we activated the replication slots by SELECT * FROM pg_create_physical_replication_slot('sam_repli_3'); and tried restarting the server, we got a PANIC in error log: CPS PRD 2019-10-04 19:10:07 CEST 00000 1:> LOG: database system was shut down at 2019-10-04 19:10:02 CEST CPS PRD 2019-10-04 19:10:07 CEST XX000 2:> PANIC: could not fsync file "pg_replslot/sam_repli_3/state": Bad file descriptor CPS PRD 2019-10-04 19:10:07 CEST 00000 6:> LOG: startup process (PID 4028) was terminated by exception 0xC0000409 CPS PRD 2019-10-04 19:10:07 CEST 00000 7:> HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. CPS PRD 2019-10-04 19:10:07 CEST 00000 8:> LOG: aborting startup due to startup process failure CPS PRD 2019-10-04 19:10:07 CEST 00000 9:> LOG: database system is shut down We use the EDB distribution from the website under Windows Server 2019 (September 2019 patch level). select version (); version ------------------------------------------------------------ PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit (1 Zeile) This seems to me like a fatal bug which makes the streaming replication unusable under Windows x64 /pg12. The same configuration worked flawlessly under pg 11.x until pg 11.5 By searching on google we encountered a similar error from 2015 under pg 9.4.1 reported under BUG #13287: https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org We were able to remove the replication slot by deleting the directory pg_replslot/sam_repli_3. The server seems to be working fine now, but without streaming replication of course. BTW: Is it necessary to restart the cluster after creating the replication slot or is it a consequence of the error, that the newly created slot is not active automatically? select * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn -------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- sam_repli_3 | | physical | | | f | f | | | | | sam_repli_5 | | physical | | | f | f | | | | | (2 Zeilen) Many thanks! Hans Buschmann
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Andres Freund
Date:
Hi, Thanks for the report! On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote: > We just moved our production cluster from pg 11.5 to pg 12.0 under windows > using pg_dump/initdb/pg_restore. > > When we activated the replication slots by > > SELECT * FROM pg_create_physical_replication_slot('sam_repli_3'); > > and tried restarting the server, we got a PANIC in error log: > > CPS PRD 2019-10-04 19:10:07 CEST 00000 1:> LOG: database system was shut > down at 2019-10-04 19:10:02 CEST > CPS PRD 2019-10-04 19:10:07 CEST XX000 2:> PANIC: could not fsync file > "pg_replslot/sam_repli_3/state": Bad file descriptor > CPS PRD 2019-10-04 19:10:07 CEST 00000 6:> LOG: startup process (PID > 4028) was terminated by exception 0xC0000409 > CPS PRD 2019-10-04 19:10:07 CEST 00000 7:> HINT: See C include file > "ntstatus.h" for a description of the hexadecimal value. > CPS PRD 2019-10-04 19:10:07 CEST 00000 8:> LOG: aborting startup due to > startup process failure > CPS PRD 2019-10-04 19:10:07 CEST 00000 9:> LOG: database system is shut > down > > We use the EDB distribution from the website under Windows Server 2019 > (September 2019 patch level). > > select version (); > version > ------------------------------------------------------------ > PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit > (1 Zeile) > > This seems to me like a fatal bug which makes the streaming replication > unusable under Windows x64 /pg12. > > The same configuration worked flawlessly under pg 11.x until pg 11.5 > > By searching on google we encountered a similar error from 2015 under pg > 9.4.1 reported under BUG #13287: > > https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org Uh, Michael? You just reintroduced this bug in commit 82a5649fb9dbef12d04cd24799be6bf298d889a6 Author: Michael Paquier <michael@paquier.xyz> Date: 2019-03-09 08:50:55 +0900 Tighten use of OpenTransientFile and CloseTransientFile This fixes two sets of issues related to the use of transient files in the backend: 1) OpenTransientFile() has been used in some code paths with read-write flags while read-only is sufficient, so switch those calls to be read-only where necessary. These have been reported by Joe Conway. You pretty much entirely reverted: commit dfbaed459754e71e01bb0cc90a12802bba3f9786 Author: Andres Freund <andres@anarazel.de> Date: 2015-04-28 00:12:38 +0200 Use a fd opened for read/write when syncing slots during startup. Some operating systems, including the reporter's windows, return EBADFD or similar when fsync() is invoked on a O_RDONLY file descriptor. Unfortunately RestoreSlotFromDisk() does exactly that; which causes failures after restarts in at least some scenarios. If you hit the bug the error message will be something like ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor Simply use O_RDWR instead of O_RDONLY when opening the relevant file descriptor to fix the bug. Unfortunately I have no way of verifying the fix, but we've seen similar problems in the past. This bug goes back to 9.4 where slots were introduced. Backpatch accordingly. Reported-By: Patrice Drolet Bug: #13143: Discussion: 20150424101006.2556.60897@wrigleys.postgresql.org I realize I perhaps should have added a comment explaining this, but this is far from the only location that knows we have to know open fds r/w to be able to fsync them. What were you even trying to fix by changing this? Seems also pretty clear that we need a few animals running with fsync enabled. Not sure how we best can write test infrastructure to make it easy to set that for all tests. Guess I best start a thread about it on -hackers. Greetings, Andres Freund
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Andres Freund
Date:
Hi Hans, On 2019-10-04 13:06:05 -0700, Andres Freund wrote: > On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote: > > CPS PRD 2019-10-04 19:10:07 CEST XX000 2:> PANIC: could not fsync file > > "pg_replslot/sam_repli_3/state": Bad file descriptor > > CPS PRD 2019-10-04 19:10:07 CEST 00000 6:> LOG: startup process (PID > > 4028) was terminated by exception 0xC0000409 Thanks again for the report! I've committed the fix. Unfortunately, unless you can compile yourself, you'll have to wait until the next minor release for the fix. Given this is the .0 release of 12, and that we already have a number of bugs to fix, I'm sure that won't be too far away however. If you want to work around the problem, you can, but it's not entirely unproblematic: If, just when restarting, you set fsync to false, you'll avoid the problem. But then you'd have to immediately re-enable afterwards, and there's still a small data loss window. If however all you want is to continue testing, fsync=off might be a reasonable workaround. Greetings, Andres Freund
AW: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Hans Buschmann
Date:
________________________________________ Von: Andres Freund <andres@anarazel.de> Gesendet: Freitag, 4. Oktober 2019 22:49 An: Hans Buschmann; pgsql-bugs@lists.postgresql.org; Michael Paquier Betreff: Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows >If you want to work around the problem, you can, but it's not entirely >unproblematic: If, just when restarting, you set fsync to false, you'll >avoid the problem. But then you'd have to immediately re-enable >afterwards, and there's still a small data loss window. If however all >you want is to continue testing, fsync=off might be a reasonable >workaround. Hi Andres, Thanks for the quick fix. I reported it immediately to get it soon fixed before 12.1. In the meantime we continue to use pg 12 on the production System (quite confident in overall Software Quality!). I can't reenable it because there may be unexpected restarts (power failures for example) and then the unattended restartwould not succeed. In quite a Long journey I have got deeper Knowledge of the whole System. So now I am able to recompile from source BUT: It is VERY difficult to get the whole Environment under Windows! I use Visual Studio Express 2019 (Version 16.3 from Oct 2019) and got the core (source tarball) compiled without error. But I have not managed to get all the additional packages downloaded and adopted to the most recent Visual Studio (e.g zlib,XML, ssl etc.) The Manual only states the Locations for download, but doesn't help in compiling. It would be VERY helpfull to provide a .ZIP file for download on the Postgres Website containing all the additional packagesin a usable Directory tree and some config file for compiling. This (mostly) should exist already for the buildfarmor for Building the release. Because the upstream Software doesn't Change often, it would (as quick alternative) be sufficient to have the include filesand the compiled libraries to easily compile Postgres from source. Thanks for considering this proposal! Hans Buschmann PS Uppercase Problems are from the mail System!
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Michael Paquier
Date:
On Fri, Oct 04, 2019 at 01:06:05PM -0700, Andres Freund wrote: > I realize I perhaps should have added a comment explaining this, but > this is far from the only location that knows we have to know open fds > r/w to be able to fsync them. Sorry for the late reply here. It looks like I messed up here, my apologies for that. And thanks for fixing the issue. It would have been nice to add some sanity checks based on fcntl() but directory handling in pg_fsync() makes that annoying. Anyway, I have checked the code with a little trick, and I have spotted a second bug: CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file with RDONLY. This is incorrect since b89e151. > What were you even trying to fix by changing this? Hardening of the code. Some code paths clearly relied on the operations to be read-only. > Seems also pretty clear that we need a few animals running with fsync > enabled. Not sure how we best can write test infrastructure to make it > easy to set that for all tests. Guess I best start a thread about it on > -hackers. I think that we would need more infrastructure here for TAP tests, aka how to be able to enforce some parameters when setting the configuration of a new node. Attached are two patches: the actual bug fix and an extra patch with the trick I have used to find it out (contrib/test_decoding/ was the part which has blown up). -- Michael
Attachment
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Michael Paquier
Date:
On Sat, Oct 05, 2019 at 07:40:00AM +0000, Hans Buschmann wrote: > Thanks for the quick fix. I reported it immediately to get it soon > fixed before 12.1. Sorry that you got annoyed by that :/ You can blame me about it.. > But I have not managed to get all the additional packages downloaded > and adopted to the most recent Visual Studio (e.g zlib, XML, ssl > etc.) > > The Manual only states the Locations for download, but doesn't help > in compiling. There is documentation on the matter: https://www.postgresql.org/docs/devel/install-windows-full.html#id-1.6.4.8.10 > It would be VERY helpful to provide a .ZIP file for download on the > Postgres Website containing all the additional packages in a usable > Directory tree and some config file for compiling. This (mostly) > should exist already for the buildfarm or for Building the release. I think that this would be difficult, per licensing issues and such, no? The maintenance of such a tarball would be annoying as well, for example OpenSSL version changes quite often and needs to be bundled with the installer so as DLLs loading it are able to work. -- Michael
Attachment
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Michael Paquier
Date:
On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote: > It would have been nice to add some sanity checks based on fcntl() but > directory handling in pg_fsync() makes that annoying. Anyway, I have > checked the code with a little trick, and I have spotted a second bug: > CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file > with RDONLY. This is incorrect since b89e151. Andres, others, any thoughts about this issue? Are there any objections if I just fix it? -- Michael
Attachment
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Andres Freund
Date:
Hi, On 2019-10-08 09:32:40 +0900, Michael Paquier wrote: > On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote: > > It would have been nice to add some sanity checks based on fcntl() but > > directory handling in pg_fsync() makes that annoying. I wondered about adding something like that too. Not sure what you mean by directory handling problems? Couldn't that just be solved by doing an fstat()? > > Anyway, I have checked the code with a little trick, and I have > > spotted a second bug: CheckPointLogicalRewriteHeap() fsyncs a > > logical rewrite mapping file with RDONLY. This is incorrect since > > b89e151. Yuck :(. Luckily that's a pretty narrow case to hit. We really need windows coverage for this stuff. And also just general buildfarm coverage, it's not like we're immune from bugs on unixoid OSs etiher. > Andres, others, any thoughts about this issue? Are there any > objections if I just fix it? Not here. Greetings, Andres Freund
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Michael Paquier
Date:
On Tue, Oct 08, 2019 at 09:09:53AM -0700, Andres Freund wrote: > I wondered about adding something like that too. Not sure what you mean > by directory handling problems? Couldn't that just be solved by doing an > fstat()? Yeah, we should just do that. > Yuck :(. Luckily that's a pretty narrow case to hit. We really need > windows coverage for this stuff. And also just general buildfarm > coverage, it's not like we're immune from bugs on unixoid OSs etiher. test_decoding hits that easily for me, and we test that on Windows for some time now as there is a buildfarm module. That's a bit depressing. >> Andres, others, any thoughts about this issue? Are there any >> objections if I just fix it? > > Not here. Done this one. -- Michael
Attachment
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Andres Freund
Date:
Hi, On 2019-10-09 13:42:07 +0900, Michael Paquier wrote: > > Yuck :(. Luckily that's a pretty narrow case to hit. We really need > > windows coverage for this stuff. And also just general buildfarm > > coverage, it's not like we're immune from bugs on unixoid OSs etiher. > > test_decoding hits that easily for me, and we test that on Windows for > some time now as there is a buildfarm module. That's a bit depressing. Well, that's because it intentionally tries to hit that case ;) But yea, it's not good. I think we really ought to remove the -F from pg_regress.c, and leave that up to the caller. Right now it's just about impossible to configure without patching the source, unless I miss something. > >> Andres, others, any thoughts about this issue? Are there any > >> objections if I just fix it? > > > > Not here. > > Done this one. Thanks. Greetings, Andres Freund
Re: BUG #16039: PANIC when activating replication slots in Postgres12.0 64bit under Windows
From
Michael Paquier
Date:
On Thu, Oct 10, 2019 at 08:25:15AM -0700, Andres Freund wrote: > I think we really ought to remove the -F from pg_regress.c, and leave > that up to the caller. Right now it's just about impossible to > configure without patching the source, unless I miss something. Well, I have just posted something about that: https://www.postgresql.org/message-id/20191009062640.GB21379@paquier.xyz Perhaps there are better ideas for that particular problem, let's see. -- Michael