Thread: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, I amd working on adding an "installcheck" equivalent mode to the meson build. One invocation of something "installcheck-world" like lead to things getting stuck. Lots of log messages like: 2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] LOG: still waiting for backend with PID2705178 to accept ProcSignalBarrier 2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] STATEMENT: DROP DATABASE IF EXISTS "regression_test_parser_regress" 2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] LOG: still waiting for backend with PID2705178 to accept ProcSignalBarrier 2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] STATEMENT: DROP DATABASE IF EXISTS "regression_test_predtest_regress" a stacktrace of 2705178 shows: (gdb) bt #0 0x00007f67d26fe1b3 in __GI___poll (fds=fds@entry=0x7ffebe187c88, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00007f67cfd03c1c in pqSocketPoll (sock=<optimized out>, forRead=forRead@entry=1, forWrite=forWrite@entry=0, end_time=end_time@entry=-1) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1125 #2 0x00007f67cfd04310 in pqSocketCheck (conn=conn@entry=0x562f875a9b70, forRead=forRead@entry=1, forWrite=forWrite@entry=0,end_time=end_time@entry=-1) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1066 #3 0x00007f67cfd043fd in pqWaitTimed (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x562f875a9b70,finish_time=finish_time@entry=-1) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:998 #4 0x00007f67cfcfc47b in connectDBComplete (conn=conn@entry=0x562f875a9b70) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:2166 #5 0x00007f67cfcfe248 in PQconnectdbParams (keywords=keywords@entry=0x562f87613d20, values=values@entry=0x562f87613d70,expand_dbname=expand_dbname@entry=0) at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:659 #6 0x00007f67cfd29536 in connect_pg_server (server=server@entry=0x562f876139b0, user=user@entry=0x562f87613980) at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:474 #7 0x00007f67cfd29910 in make_new_connection (entry=entry@entry=0x562f8758b2c8, user=user@entry=0x562f87613980) at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:344 #8 0x00007f67cfd29da0 in GetConnection (user=0x562f87613980, will_prep_stmt=will_prep_stmt@entry=false, state=state@entry=0x562f876136a0) at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:204 #9 0x00007f67cfd35294 in postgresBeginForeignScan (node=0x562f87612e70, eflags=<optimized out>) and it turns out that backend can't be graciously be terminated. Maybe I am missing something, but I don't think it's OK for connect_pg_server() to connect in a blocking manner, without accepting interrupts? Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, I just re-discovered this issue, via https://www.postgresql.org/message-id/20221209003607.bz2zdznvfnkq4zz3%40awork3.anarazel.de On 2022-09-25 16:22:37 -0700, Andres Freund wrote: > Maybe I am missing something, but I don't think it's OK for > connect_pg_server() to connect in a blocking manner, without accepting > interrupts? It's definitely not. This basically means network issues or such can lead to connections being unkillable... We know how to do better, c.f. libpqrcv_connect(). I hacked that up for postgres_fdw, and got through quite a few runs without related issues ([1]). The same problem is present in two places in dblink.c. Obviously we can copy and paste the code to dblink.c as well. But that means we have the same not quite trivial code in three different c files. There's already a fair bit of duplicated code around AcquireExternalFD(). It seems we should find a place to put backend specific libpq wrapper code somewhere. Unless we want to relax the rule about not linking libpq into the backend we can't just put it in the backend directly, though. The only alternative way to provide a wrapper that I can think of are to a) introduce a new static library that can be linked to by libpqwalreceiver, postgres_fdw, dblink b) add a header with static inline functions implementing interrupt-processing connection establishment for libpq Neither really has precedent. The attached quick patch just adds and uses libpq_connect_interruptible() in postgres_fdw. If we wanted to move this somewhere generic, at least part of the external FD handling should also be moved into it. dblink.c uses a lot of other blocking libpq functions, which obviously also isn't ok. Perhaps we could somehow make this easier from within libpq? My first thought was a connection parameter that'd provide a different implementation of pqSocketCheck() or such - but I don't think that'd work, because we might throw errors when processing interrupts, and that'd not be ok from deep within libpq. Greetings, Andres Freund [1] I eventually encountered a deadlock in REINDEX, but it didn't involve postgres_fw / ProcSignalBarrier
Attachment
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2022-12-08 18:08:15 -0800, Andres Freund wrote: > I just re-discovered this issue, via > https://www.postgresql.org/message-id/20221209003607.bz2zdznvfnkq4zz3%40awork3.anarazel.de > > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote: > > Maybe I am missing something, but I don't think it's OK for > > connect_pg_server() to connect in a blocking manner, without accepting > > interrupts? > > It's definitely not. This basically means network issues or such can lead to > connections being unkillable... > > We know how to do better, c.f. libpqrcv_connect(). I hacked that up for > postgres_fdw, and got through quite a few runs without related issues ([1]). > > The same problem is present in two places in dblink.c. Obviously we can copy > and paste the code to dblink.c as well. But that means we have the same not > quite trivial code in three different c files. There's already a fair bit of > duplicated code around AcquireExternalFD(). > > It seems we should find a place to put backend specific libpq wrapper code > somewhere. Unless we want to relax the rule about not linking libpq into the > backend we can't just put it in the backend directly, though. > > The only alternative way to provide a wrapper that I can think of are to > a) introduce a new static library that can be linked to by libpqwalreceiver, > postgres_fdw, dblink > b) add a header with static inline functions implementing interrupt-processing > connection establishment for libpq > > Neither really has precedent. > > The attached quick patch just adds and uses libpq_connect_interruptible() in > postgres_fdw. If we wanted to move this somewhere generic, at least part of > the external FD handling should also be moved into it. > > > dblink.c uses a lot of other blocking libpq functions, which obviously also > isn't ok. > > > Perhaps we could somehow make this easier from within libpq? My first thought > was a connection parameter that'd provide a different implementation of > pqSocketCheck() or such - but I don't think that'd work, because we might > throw errors when processing interrupts, and that'd not be ok from deep within > libpq. Any opinions? Due to the simplicity I'm currently leaning to a header-only helper, but I don't feel confident about it. The rate of cfbot failures is high enough that it'd be good to do something about this. Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Thomas Munro
Date:
On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-12-08 18:08:15 -0800, Andres Freund wrote: > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote: > > The only alternative way to provide a wrapper that I can think of are to > > a) introduce a new static library that can be linked to by libpqwalreceiver, > > postgres_fdw, dblink > > b) add a header with static inline functions implementing interrupt-processing > > connection establishment for libpq > > > > Neither really has precedent. > Any opinions? Due to the simplicity I'm currently leaning to a header-only > helper, but I don't feel confident about it. The header idea is a little bit sneaky (IIUC: a header that is part of the core tree, but can't be used by core and possibly needs special treatment in 'headercheck' to get the right include search path, can only be used by libpqwalreceiver et al which are allowed to link to libpq), but I think it is compatible with other goals we have discussed in other threads. I think in the near future we'll probably remove the concept of non-threaded server builds (as proposed before in the post HP-UX 10 cleanup thread, with patches, but not quite over the line yet). Then I think the server could be allowed to link libpq directly? And at that point this code wouldn't be sneaky anymore and could optionally move into a .c. Does that makes sense?
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes: > The header idea is a little bit sneaky (IIUC: a header that is part of > the core tree, but can't be used by core and possibly needs special > treatment in 'headercheck' to get the right include search path, can > only be used by libpqwalreceiver et al which are allowed to link to > libpq), but I think it is compatible with other goals we have > discussed in other threads. I think in the near future we'll probably > remove the concept of non-threaded server builds (as proposed before > in the post HP-UX 10 cleanup thread, with patches, but not quite over > the line yet). Then I think the server could be allowed to link libpq > directly? And at that point this code wouldn't be sneaky anymore and > could optionally move into a .c. Does that makes sense? I don't like the idea of linking libpq directly into the backend. It should remain a dynamically-loaded library to avoid problems during software updates. regards, tom lane
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2022-12-30 10:31:22 +1300, Thomas Munro wrote: > On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-12-08 18:08:15 -0800, Andres Freund wrote: > > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote: > > > The only alternative way to provide a wrapper that I can think of are to > > > a) introduce a new static library that can be linked to by libpqwalreceiver, > > > postgres_fdw, dblink > > > b) add a header with static inline functions implementing interrupt-processing > > > connection establishment for libpq > > > > > > Neither really has precedent. > > > Any opinions? Due to the simplicity I'm currently leaning to a header-only > > helper, but I don't feel confident about it. > > The header idea is a little bit sneaky (IIUC: a header that is part of > the core tree, but can't be used by core and possibly needs special > treatment in 'headercheck' to get the right include search path, can > only be used by libpqwalreceiver et al which are allowed to link to > libpq), but I think it is compatible with other goals we have > discussed in other threads. Hm, what special search path / headerscheck magic are you thinking of? I think something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of static inlines should "just" work? We likely could guard against that header being included from code ending up in the postgres binary directly by #error'ing if BUILDING_DLL is defined. That's a very badly named define, but it IIRC has to be iff building code ending up in postgres directly. > I think in the near future we'll probably remove the concept of non-threaded > server builds (as proposed before in the post HP-UX 10 cleanup thread, with > patches, but not quite over the line yet). Then I think the server could be > allowed to link libpq directly? And at that point this code wouldn't be > sneaky anymore and could optionally move into a .c. Does that makes sense? I was wondering about linking in libpq directly as well. But I am not sure it's a good idea. I suspect we'd run into some issues around libraries (including extensions) linking to different versions of libpq etc - if we directly link to libpq that'd end up in tears. It might be a different story if we had a version of libpq built with different symbol names etc. But that's not exactly trivial either. Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Thomas Munro
Date:
On Fri, Dec 30, 2022 at 10:54 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-12-30 10:31:22 +1300, Thomas Munro wrote: > > On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote: > > > On 2022-12-08 18:08:15 -0800, Andres Freund wrote: > > > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote: > > > > The only alternative way to provide a wrapper that I can think of are to > > > > a) introduce a new static library that can be linked to by libpqwalreceiver, > > > > postgres_fdw, dblink > > > > b) add a header with static inline functions implementing interrupt-processing > > > > connection establishment for libpq > > > > > > > > Neither really has precedent. > > > > > Any opinions? Due to the simplicity I'm currently leaning to a header-only > > > helper, but I don't feel confident about it. > > > > The header idea is a little bit sneaky (IIUC: a header that is part of > > the core tree, but can't be used by core and possibly needs special > > treatment in 'headercheck' to get the right include search path, can > > only be used by libpqwalreceiver et al which are allowed to link to > > libpq), but I think it is compatible with other goals we have > > discussed in other threads. > > Hm, what special search path / headerscheck magic are you thinking of? I think > something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of > static inlines should "just" work? Oh, I was imagining something slightly different. Not something under src/include/libpq, but conceptually a separate header-only library that is above both the backend and libpq. Maybe something like src/include/febe_util/libpq_connect_interruptible.h. In other words, I thought your idea b was a header-only version of your idea a. I think that might be a bit nicer than putting it under libpq? Superficial difference, perhaps... And then I assumed that headerscheck would need to be told to add libpq's header location in -I for that header, but on closer inspection it already adds that unconditionally so I retract that comment. > > I think in the near future we'll probably remove the concept of non-threaded > > server builds (as proposed before in the post HP-UX 10 cleanup thread, with > > patches, but not quite over the line yet). Then I think the server could be > > allowed to link libpq directly? And at that point this code wouldn't be > > sneaky anymore and could optionally move into a .c. Does that makes sense? > > I was wondering about linking in libpq directly as well. But I am not sure > it's a good idea. I suspect we'd run into some issues around libraries > (including extensions) linking to different versions of libpq etc - if we > directly link to libpq that'd end up in tears. > > It might be a different story if we had a version of libpq built with > different symbol names etc. But that's not exactly trivial either. Hmm, yeah. Some interesting things to think about. Whether it's a feature or an accident that new backends can pick up new libpq minor updates without restarting the postmaster, and how we'd manage a future libpq major version/ABI break. Getting a bit off topic for this thread I suppose.
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2022-12-30 14:14:55 +1300, Thomas Munro wrote: > Oh, I was imagining something slightly different. Not something under > src/include/libpq, but conceptually a separate header-only library > that is above both the backend and libpq. Maybe something like > src/include/febe_util/libpq_connect_interruptible.h. In other words, > I thought your idea b was a header-only version of your idea a. I > think that might be a bit nicer than putting it under libpq? > Superficial difference, perhaps... It doesn't seem entirely right to introduce a top-level "module" for libpq-in-extensions to me - we don't do that for other APIs used for extensions. But a header only library also doesn't quite seem right. So ... Looking at turning my patch upthread into something slightly less prototype-y, I noticed that libpqwalreceiver doesn't do AcquireExternalFD(), added to other backend uses of libpq in 3d475515a15. It's unlikely to matter for walreceiver.c itelf, but it seems problematic for the logical replication cases? It's annoying to introduce PG_TRY/CATCH blocks, just to deal with the potential for errors inside WaitLatchOrSocket(), which should never happen. I wonder if we should consider making latch.c error out fatally, instead of elog(ERROR). If latches are broken, things are bad. The PG_CATCH() logic in postgres_fdw's GetConnection() looks quite suspicious to me. It looks like 32a9c0bdf493 took entirely the wrong path. Instead of optionally not throwing or directly re-establishing connections in begin_remote_xact(), the PG_CATCH() hammer was applied. The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver, dblink, postgres_fdw. As I made libpq-be-fe-helpers.h handle reserving external fds, libpqwalreceiver now does so. I briefly looked through its users without seeing cases of leaking in case of errors - which would already have been bad, since we'd already have leaked a libpq connection/socket. Given the lack of field complaints and the size of the required changes, I don't think we should backpatch this, even though it's pretty clearly buggy as-is. Some time back Thomas had a patch to introduce a wrapper around libpq-in-extensions that fixed issues cause by some events being edge-triggered on windows. It's possible that combining these two efforts would yield something better. I resisted the urge to create a wrapper around each connection in this patch, as it'd have ended up being a whole lot more invasive. But... Thomas, do you have a reference to that code handy? Greetings, Andres Freund
Attachment
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Robert Haas
Date:
On Sun, Sep 25, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote: > Maybe I am missing something, but I don't think it's OK for > connect_pg_server() to connect in a blocking manner, without accepting > interrupts? I remember noticing various problems in this area years ago. I'm not sure whether I noticed this particular one, but the commit message for ae9bfc5d65123aaa0d1cca9988037489760bdeae mentions a few others that I did notice. -- Robert Haas EDB: http://www.enterprisedb.com
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2023-01-03 12:05:20 -0800, Andres Freund wrote: > The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver, > dblink, postgres_fdw. > As I made libpq-be-fe-helpers.h handle reserving external fds, > libpqwalreceiver now does so. I briefly looked through its users without > seeing cases of leaking in case of errors - which would already have been bad, > since we'd already have leaked a libpq connection/socket. > > > Given the lack of field complaints and the size of the required changes, I > don't think we should backpatch this, even though it's pretty clearly buggy > as-is. Any comments on this? Otherwise I think I'll go with this approach. Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Thomas Munro
Date:
On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-03 12:05:20 -0800, Andres Freund wrote: > > The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver, > > dblink, postgres_fdw. > > > As I made libpq-be-fe-helpers.h handle reserving external fds, > > libpqwalreceiver now does so. I briefly looked through its users without > > seeing cases of leaking in case of errors - which would already have been bad, > > since we'd already have leaked a libpq connection/socket. > > > > > > Given the lack of field complaints and the size of the required changes, I > > don't think we should backpatch this, even though it's pretty clearly buggy > > as-is. > > Any comments on this? Otherwise I think I'll go with this approach. +1. Not totally convinced about the location but we are free to re-organise it any time, and the random CI failures are bad.
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2023-01-14 14:45:06 +1300, Thomas Munro wrote: > On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-01-03 12:05:20 -0800, Andres Freund wrote: > > > The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver, > > > dblink, postgres_fdw. > > > > > As I made libpq-be-fe-helpers.h handle reserving external fds, > > > libpqwalreceiver now does so. I briefly looked through its users without > > > seeing cases of leaking in case of errors - which would already have been bad, > > > since we'd already have leaked a libpq connection/socket. > > > > > > > > > Given the lack of field complaints and the size of the required changes, I > > > don't think we should backpatch this, even though it's pretty clearly buggy > > > as-is. > > > > Any comments on this? Otherwise I think I'll go with this approach. > > +1. Not totally convinced about the location but we are free to > re-organise it any time, and the random CI failures are bad. Cool. Updated patch attached. I split it into multiple pieces. 1) A fix for [1], included here because I encountered it while testing 2) Introduction of libpq-be-fe-helpers.h 3) Convert dblink and postgres_fdw to the helper 4) Convert libpqwalreceiver.c to the helper Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as there's no bug (although perhaps the lack of FD handling could be called a bug?). There's also some light polishing, improving commit message, comments and moving some internal helper functions to later in the file. Greetings, Andres Freund [1] https://postgr.es/m/20230121011237.q52apbvlarfv6jm6%40awork3.anarazel.de
Attachment
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2023-01-20 19:00:08 -0800, Andres Freund wrote: > Updated patch attached. I split it into multiple pieces. > 1) A fix for [1], included here because I encountered it while testing > 2) Introduction of libpq-be-fe-helpers.h > 3) Convert dblink and postgres_fdw to the helper > 4) Convert libpqwalreceiver.c to the helper > > Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as > there's no bug (although perhaps the lack of FD handling could be called a > bug?). > > There's also some light polishing, improving commit message, comments and > moving some internal helper functions to later in the file. After a tiny bit further polishing, and after separately pushing a resource leak fix for walrcv_connect(), I pushed this. Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Nathan Bossart
Date:
On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote: > After a tiny bit further polishing, and after separately pushing a resource > leak fix for walrcv_connect(), I pushed this. My colleague Robins Tharakan (CC'd) noticed crashes when testing recent commits, and he traced it back to e460248. From this information, I discovered the following typo: diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 8982d623d3..78a8bcee6e 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS) else { if (pconn->conn) - libpqsrv_disconnect(conn); + libpqsrv_disconnect(pconn->conn); pconn->conn = conn; } -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2023-01-30 11:30:08 -0800, Nathan Bossart wrote: > On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote: > > After a tiny bit further polishing, and after separately pushing a resource > > leak fix for walrcv_connect(), I pushed this. > > My colleague Robins Tharakan (CC'd) noticed crashes when testing recent > commits, and he traced it back to e460248. From this information, I > discovered the following typo: > > diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c > index 8982d623d3..78a8bcee6e 100644 > --- a/contrib/dblink/dblink.c > +++ b/contrib/dblink/dblink.c > @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS) > else > { > if (pconn->conn) > - libpqsrv_disconnect(conn); > + libpqsrv_disconnect(pconn->conn); > pconn->conn = conn; > } Ugh. Good catch. Why don't the dblink tests catch this? Any chance you or Robins could prepare a patch with fix and test, given that you know how to trigger this? Greetings, Andres Freund
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Nathan Bossart
Date:
On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote: > Why don't the dblink tests catch this? Any chance you or Robins could prepare > a patch with fix and test, given that you know how to trigger this? It's trivially reproducible by calling 1-argument dblink_connect() multiple times and then calling dblink_disconnect(). Here's a patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Robins Tharakan
Date:
Hi Andres, On Tue, 31 Jan 2023 at 06:31, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote: > > Why don't the dblink tests catch this? Any chance you or Robins could prepare > > a patch with fix and test, given that you know how to trigger this? > > It's trivially reproducible by calling 1-argument dblink_connect() multiple > times and then calling dblink_disconnect(). Here's a patch. > My test instance has been running Nathan's patch for the past few hours, and it looks like this should resolve the issue. A little bit of background, since the past few days I was noticing frequent crashes on a test instance. They're not simple to reproduce manually, but if left running I can reliably see crashes on an ~hourly basis. In trying to trace the origin, I ran a multiple-hour test for each commit going back a few commits and noticed that the crashes stopped prior to commit e4602483e9, which is when the issue became visible. The point is now moot, but let me know if full backtraces still help (I was concerned looking at the excerpts from some of the crashes): "double free or corruption (out)" "munmap_chunk(): invalid pointer" "free(): invalid pointer" str->data[0] = '\0'; Some backtraces ############### Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u21 postgres 127.0.0.1(45334) SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102 #0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102 #1 0x00007fc0062dfefd in pqDropConnection (conn=0x564bb3e1a080, flushInput=true) at fe-connect.c:495 #2 0x00007fc0062e5cb3 in closePGconn (conn=0x564bb3e1a080) at fe-connect.c:4112 #3 0x00007fc0062e5d55 in PQfinish (conn=0x564bb3e1a080) at fe-connect.c:4134 #4 0x00007fc0061a442b in libpqsrv_disconnect (conn=0x564bb3e1a080) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #5 0x00007fc0061a4df1 in dblink_disconnect (fcinfo=0x564bb3d980f0) at dblink.c:357 #6 0x0000564bb0e70aa7 in ExecInterpExpr (state=0x564bb3d98018, econtext=0x564bb3d979a0, isnull=0x7ffd60824b0f) at execExprInterp.c:728 #7 0x0000564bb0e72f36 in ExecInterpExprStillValid (state=0x564bb3d98018, econtext=0x564bb3d979a0, isNull=0x7ffd60824b0f) at execExprInterp.c:1838 ============ Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u52 postgres 127.0.0.1(58778) SELECT '. Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007fc021792859 in __GI_abort () at abort.c:79 #2 0x00007fc0217fd26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fc021927298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155 #3 0x00007fc0218052fc in malloc_printerr ( str=str@entry=0x7fc021929670 "double free or corruption (out)") at malloc.c:5347 #4 0x00007fc021806fa0 in _int_free (av=0x7fc02195cb80 <main_arena>, p=0x7fc02195cbf0 <main_arena+112>, have_lock=<optimized out>) at malloc.c:4314 #5 0x00007fc0062e16ed in freePGconn (conn=0x564bb6e36b80) at fe-connect.c:3977 #6 0x00007fc0062e1d61 in PQfinish (conn=0x564bb6e36b80) at fe-connect.c:4135 #7 0x00007fc0062a142b in libpqsrv_disconnect (conn=0x564bb6e36b80) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #8 0x00007fc0062a1df1 in dblink_disconnect (fcinfo=0x564bb5647b58) at dblink.c:357 ============ Program terminated with signal SIGSEGV, Segmentation fault. #0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153 153 str->data[0] = '\0'; #0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153 #1 0x00007f2240b0a876 in PQexecStart (conn=0x559d3af0e410) at fe-exec.c:2320 #2 0x00007f2240b0a688 in PQexec (conn=0x559d3af0e410, query=0x559d56fb8ee8 "p3") at fe-exec.c:2227 #3 0x00007f223ba8c7e4 in dblink_exec (fcinfo=0x559d3b101f58) at dblink.c:1432 #4 0x0000559d2f003c82 in ExecInterpExpr (state=0x559d3b101ac0, econtext=0x559d34e76578, isnull=0x7ffe3d590fdf) at execExprInterp.c:752 ============ Core was generated by `postgres: 728f86fec6@(HEAD detached at 728f86fec6)@sqith: u19 postgres 127.0.0.'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007f96f5fc6e99 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1 #0 0x00007f96f5fc6e99 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1 #1 0x00007f96da56027a in pgtls_close (conn=0x55d919752fb0) at fe-secure-openssl.c:1555 #2 0x00007f96da558e41 in pqsecure_close (conn=0x55d919752fb0) at fe-secure.c:192 #3 0x00007f96da53dd12 in pqDropConnection (conn=0x55d919752fb0, flushInput=true) at fe-connect.c:449 #4 0x00007f96da543cb3 in closePGconn (conn=0x55d919752fb0) at fe-connect.c:4112 #5 0x00007f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134 #6 0x00007f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #7 0x00007f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8) at dblink.c:357 ============ Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007f5f6b632859 in __GI_abort () at abort.c:79 #2 0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155 #3 0x00007f5f6b6a52fc in malloc_printerr ( str=str@entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer") at malloc.c:5347 #4 0x00007f5f6b6a554c in munmap_chunk (p=<optimized out>) at malloc.c:2830 #5 0x00007f5f50085efd in pqDropConnection (conn=0x55d12ebcd100, flushInput=true) at fe-connect.c:495 #6 0x00007f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100) at fe-connect.c:4112 #7 0x00007f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134 #8 0x00007f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 ============ Program terminated with signal SIGABRT, Aborted. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007f5f6b632859 in __GI_abort () at abort.c:79 #2 0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155 #3 0x00007f5f6b6a52fc in malloc_printerr ( str=str@entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347 #4 0x00007f5f6b6a6b2c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173 #5 0x00007f5f500fe6ed in freePGconn (conn=0x55d142273000) at fe-connect.c:3977 #6 0x00007f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135 #7 0x00007f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #8 0x00007f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8) at dblink.c:357 ============ Core was generated by `postgres: e4602483e9@(HEAD detached at e4602483e9)@sqith: u73 postgres 127.0.0.'. Program terminated with signal SIGSEGV, Segmentation fault. #0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335) at malloc.c:3154 #0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335) at malloc.c:3154 #1 0x00007fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148, conn=0x55b191aa9380) at fe-misc.c:329 #2 0x00007fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380) at fe-misc.c:476 #3 0x00007fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380) at fe-connect.c:4076 #4 0x00007fb7bc097c97 in closePGconn (conn=0x55b191aa9380) at fe-connect.c:4096 #5 0x00007fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134 #6 0x00007fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380) at ../../src/include/libpq/libpq-be-fe-helpers.h:117 #7 0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00) at dblink.c:357 ============ Thanks to SQLSmith for helping with this find. - Robins Tharakan Amazon Web Services
Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
From
Andres Freund
Date:
Hi, On 2023-01-30 12:00:55 -0800, Nathan Bossart wrote: > On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote: > > Why don't the dblink tests catch this? Any chance you or Robins could prepare > > a patch with fix and test, given that you know how to trigger this? > > It's trivially reproducible by calling 1-argument dblink_connect() multiple > times and then calling dblink_disconnect(). Here's a patch. Thanks for the quick patch and for the find. Pushed. Greetings, Andres