Re: libpq copy error handling busted - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: libpq copy error handling busted |
Date | |
Msg-id | CA+hUKG+haDNc6u0Y_QiS_5Obq0CMQVb8EBn8NQZSR=s2orLvsQ@mail.gmail.com Whole thread Raw |
In response to | Re: libpq copy error handling busted (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: libpq copy error handling busted
|
List | pgsql-hackers |
On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > * pqSendSome() is responsible not only for pushing out data, but for > > calling pqReadData in any situation where it can't get rid of the data > > promptly. 1f39a1c06 overlooked that requirement, and the upshot is > > that we don't necessarily notice that the connection is broken (it's > > pqReadData's job to detect that). Putting a pqReadData call into > > the early-exit path helps, but doesn't fix things completely. > > Ah, it's better if I put the pqReadData call into *both* the paths > where 1f39a1c06 made pqSendSome give up. The attached patch seems > to fix the issue for the "pgbench -i" scenario, with either fast- > or immediate-mode server stop. I tried it with and without SSL too, > just to see. Still, it's not clear to me whether this might worsen > any of the situations we discussed in the lead-up to 1f39a1c06 [1]. > Thomas, are you in a position to redo any of that testing? Yes, sure. The testing consisted of running on a system with OpenSSL 1.1.1a (older versions didn't have the problem). It originally showed up on eelpout, a very underpowered build farm machine running Linux on ARM64, but then later we worked out we could make it happen on a Mac or any other Linux system if we had bad enough luck or if we added a sleep in a particular spot. We could do it with psql running in a loop using a bad certificate from the testing setup stuff, as shown here: https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com I don't have access to eelpout from where I am right now, but I'll try that test now on the Debian 10 amd64 system I have here. OpenSSL has since moved on to 1.1.1d-0+deb10u3, but that should be fine, the triggering change was the move to TLS1.3 so let me see what happens if I do that with your patch applied... > > * The more longstanding problem is that the PQputCopyData code path > > doesn't have any mechanism for consuming an 'E' (error) message > > once pqReadData has collected it. > > At least with pgbench's approach (die immediately on PQputline failure) > this isn't very relevant once we apply the attached. Perhaps we should > revisit this behavior anyway, but I'd be afraid to back-patch a change > of that nature. > > > * As for control-C not getting out of it: there is > > if (CancelRequested) > > break; > > in pgbench's loop, but this does nothing in this scenario because > > fe-utils/cancel.c only sets that flag when it successfully sends a > > Cancel ... which it certainly cannot if the postmaster is gone. > > I'll send a patch for this later. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
pgsql-hackers by date: