Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;' - Mailing list pgsql-general

From Matthias Apitz
Subject Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
Date
Msg-id 20200505181200.GA3640@c720-r342378
Whole thread Raw
In response to Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
List pgsql-general
El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió:

> Matthias Apitz <guru@unixarea.de> writes:
> > El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió:
> >> Can you get a stack trace from the connected backend?
> 
> > (gdb) bt
> > #0  0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6
> > #1  0x000000000084476c in WaitEventSetWaitBlock ()
> > #2  0x0000000000844647 in WaitEventSetWait ()
> > #3  0x00000000006f89d2 in secure_read ()
> > #4  0x0000000000707425 in pq_recvbuf ()
> > #5  0x0000000000707709 in pq_discardbytes ()
> > #6  0x0000000000707aba in pq_getmessage ()
> > #7  0x000000000086b478 in SocketBackend ()
> > #8  0x000000000086b4c4 in ReadCommand ()
> > #9  0x000000000086fda9 in PostgresMain ()
> 
> Oh, that is *very* interesting, because there is only one caller of
> pq_discardbytes:
> 
>         /*
>          * Allocate space for message.  If we run out of room (ridiculously
>          * large message), we will elog(ERROR), but we want to discard the
>          * message body so as not to lose communication sync.
>          */
>         PG_TRY();
>         {
>             enlargeStringInfo(s, len);
>         }
>         PG_CATCH();
>         {
>             if (pq_discardbytes(len) == EOF)
>                 ereport(COMMERROR,
>                         (errcode(ERRCODE_PROTOCOL_VIOLATION),
>                          errmsg("incomplete message from client")));
>             ...
> 
> What this code intends to handle is the case where the client has sent a
> message that is so long that the backend hasn't enough memory to buffer
> it.  What's actually happened, more likely, is that the received message
> length is corrupt and just appears to be large, since the client-side
> trace shows that libpq has sent what it has to send and is now waiting for
> a reply.  If the received length were correct then the pq_discardbytes
> call would have completed after eating the message.
> 
> So what it looks like is that something is corrupting data on its
> way from the client to the server.  Flaky firewall maybe?  If you're
> using SSL, maybe an SSL library bug?  I'm reduced to speculation at
> this point.  It's hard even to say what to do to gather more info.
> If you could reproduce it then I'd suggest watching the connection
> with wireshark or the like to see what data is actually going across
> the wire ... but IIUC it's pretty random, so that approach seems
> unlikely to work.
> 
> If you're in a position to run a modified server, you could try
> inserting a debug log message:
> 
>         }
>         PG_CATCH();
>         {
> +           elog(COMMERROR, "bogus received message length: %d", len);
>             if (pq_discardbytes(len) == EOF)
>                 ereport(COMMERROR,
> 
> (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.)
> While this seems unlikely to teach us a huge amount, perhaps the
> value of the incorrect length would be informative.
> 
> Are you always seeing this error at the exact same place so far as
> the client side is concerned?  It's hard to see why a transport-level
> problem would preferentially affect one particular query ...

I've added the printout of the length in this case and another one, and
see this in the server's log file:


2020-05-04 10:05:49.977 CEST [32092] LOG:  invalid length 33554940 of startup packet
2020-05-04 10:05:50.207 CEST [32094] LOG:  invalid length 33554940 of startup packet
2020-05-04 12:32:50.781 CEST [20334] LOG:  bogus received message length: 1650422894
2020-05-04 12:36:41.293 CEST [20380] LOG:  bogus received message length: 1650422894
2020-05-04 12:39:39.461 CEST [20441] LOG:  bogus received message length: 1650422894
2020-05-04 13:01:50.566 CEST [24222] LOG:  bogus received message length: 1650422894

Any comments? Thanks

    matthias


-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



pgsql-general by date:

Previous
From: Peter
Date:
Subject: Re: 12.2: Howto check memory-leak in worker?
Next
From: Tom Lane
Date:
Subject: Re: Lock Postgres account after X number of failed logins?