Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1) - Mailing list pgsql-bugs

From Liam Bowen
Subject Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)
Date
Msg-id CAE7q7Eit4Eq2=bxce=Fm8HAStECjaXUE=WBQc-sDDcgJQ7s7eg@mail.gmail.com
Responses Re: Further information on BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
This is to expand on this thread that Clemens started (Re: BUG #17299: Exit code 3 when open connections concurrently (PQisthreadsafe() == 1))
https://www.postgresql.org/message-id/flat/45a5a8c0-da4c-31f7-0bf9-23a622bc44e6%40sussol.net#96569bd6039e6f2969f7fa9be9ef25fe

This bug affects me and I would like to help resolve it. Unfortunately, the only way I can reproduce it is by using Rust (with Diesel and r2d2). I think multithreading is involved, because invoking the executable the same way reproduces the crash only around 90% of the time, and reducing the number of connections makes the problem go away. When my postmortem debugger is launched after a crash, however, only one thread is running.

I tried several versions from the EnterpriseDB installer and narrowed the bug down to having been introduced between 13.5 and 14.1. Then I used git bisect to find the revision that actually introduced it. At each step, I built libpq, copied the DLL into the same directory as my executable, and verified that my build of libpq was the one being loaded. Eventually my bisection pointed to 52a1022.
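For anyone repeating the bisection, here is a minimal Python sketch of why each step needs more than one run when the crash reproduces only about 90% of the time. It is an illustration, not the procedure actually used above: the function names, the revision labels, and the `crashes_once` callback (which would build libpq at a revision, run the executable once, and report whether it crashed) are all hypothetical.

```python
from typing import Callable, Sequence


def runs_needed(repro_rate: float, max_miss_prob: float) -> int:
    """Smallest n with (1 - repro_rate) ** n <= max_miss_prob.

    If a bad revision crashes with probability repro_rate per run, then n
    clean runs in a row happen with probability (1 - repro_rate) ** n.
    """
    if not 0.0 < repro_rate <= 1.0 or not 0.0 < max_miss_prob < 1.0:
        raise ValueError("rates must be probabilities in (0, 1]")
    n = 1
    while (1.0 - repro_rate) ** n > max_miss_prob:
        n += 1
    return n


def first_bad(
    revisions: Sequence[str],
    crashes_once: Callable[[str], bool],
    runs: int,
) -> str:
    """Binary search for the first bad revision.

    Assumes revisions are in commit order, revisions[0] is known good,
    revisions[-1] is known bad, and every revision after the first bad
    one is also bad (the same assumption git bisect makes).
    """

    def is_bad(rev: str) -> bool:
        # One crash is enough to call a revision bad; it is called good
        # only after `runs` consecutive clean runs.
        return any(crashes_once(rev) for _ in range(runs))

    lo, hi = 0, len(revisions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]
```

With a 90% reproduction rate, a single clean run mislabels a bad revision as good about one time in ten, which can send the whole bisection to the wrong commit. Repeating the run at each step makes that much less likely.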

Here is a debugging session: https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18. I really couldn't glean too much that was new from this, but I am confident that the bug was not present before revision 52a1022. One thing that I found a bit strange is that in libpq_binddomain, ldir = "/share/locale" which looks like a Unix path and this bug only happens on Windows. Is this relevant? I have no idea. This frame seems to have the values I would expect: https://gist.github.com/hut8/3b25e6a581a600589bdc62644734de18#check-out-frame-9 -- displayed_host, displayed_port, and host_addr all seem fine. And conn->errorMessage is empty, which seems right too. I was trying to find values that would create memory corruption, like a buffer overflow or something, but haven't found any yet.

It is true that the immediate crash is in libintl-9.dll -- however, I'm confident that almost everyone who's using Postgres on Windows is using the EnterpriseDB distribution, and I verified that in all of the recent versions (including 12.* and 13.*), the libintl-9.dll is exactly the same as in 14.*. I can't find a way to build libintl-9.dll in the exact same way as EnterpriseDB, and the instructions for obtaining it in the documentation haven't worked for a long time (I reported that on pgsql-docs). This really hampers my debugging; I don't know what revision is being used to build libintl-9.dll or various other details that would make the build reproducible so I could at least get relevant symbols and links to source.

So, at least I found a problematic revision. I'm a bit stuck at this point, but I'm happy to provide any more information that I can. Perhaps a dump would be useful; I can send one to whoever wants it. Thank you all for your time.

--
Liam Bowen
