Re: libpq copy error handling busted - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: libpq copy error handling busted
Date
Msg-id CA+hUKG+haDNc6u0Y_QiS_5Obq0CMQVb8EBn8NQZSR=s2orLvsQ@mail.gmail.com
Whole thread Raw
In response to Re: libpq copy error handling busted  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: libpq copy error handling busted
List pgsql-hackers
On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > * pqSendSome() is responsible not only for pushing out data, but for
> > calling pqReadData in any situation where it can't get rid of the data
> > promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> > that we don't necessarily notice that the connection is broken (it's
> > pqReadData's job to detect that).  Putting a pqReadData call into
> > the early-exit path helps, but doesn't fix things completely.
>
> Ah, it's better if I put the pqReadData call into *both* the paths
> where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> to fix the issue for the "pgbench -i" scenario, with either fast-
> or immediate-mode server stop.  I tried it with and without SSL too,
> just to see.  Still, it's not clear to me whether this might worsen
> any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> Thomas, are you in a position to redo any of that testing?

Yes, sure.  The testing consisted of running on a system with OpenSSL
1.1.1a (older versions didn't have the problem).  It originally showed
up on eelpout, a very underpowered build farm machine running Linux on
ARM64, but then later we worked out we could make it happen on a Mac
or any other Linux system if we had bad enough luck or if we added a
sleep in a particular spot.  We could do it with psql running in a
loop using a bad certificate from the testing setup stuff, as shown
here:

https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com

I don't have access to eelpout from where I am right now, but I'll try
that test now on the Debian 10 amd64 system I have here.  OpenSSL has
since moved on to 1.1.1d-0+deb10u3, but that should be fine, the
triggering change was the move to TLS1.3 so let me see what happens if
I do that with your patch applied...




> > * The more longstanding problem is that the PQputCopyData code path
> > doesn't have any mechanism for consuming an 'E' (error) message
> > once pqReadData has collected it.
>
> At least with pgbench's approach (die immediately on PQputline failure)
> this isn't very relevant once we apply the attached.  Perhaps we should
> revisit this behavior anyway, but I'd be afraid to back-patch a change
> of that nature.
>
> > * As for control-C not getting out of it: there is
> >               if (CancelRequested)
> >                       break;
> > in pgbench's loop, but this does nothing in this scenario because
> > fe-utils/cancel.c only sets that flag when it successfully sends a
> > Cancel ... which it certainly cannot if the postmaster is gone.
>
> I'll send a patch for this later.
>
>                         regards, tom lane
>
> [1]
https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: elog(DEBUG2 in SpinLocked section.
Next
From: Tom Lane
Date:
Subject: Re: elog(DEBUG2 in SpinLocked section.