Thread: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
Hello, I encounter from time to time that one of our ESQL/C clients hangs with the details described below for ever, while accessing on startup a table 'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients start up fine (they all check this table 'sisisinst' on startup to verify the correct version of our application) Also SQL works fine. What can I do to get to the source of the problem? Thanks matthias ESQL/C statement: EXEC SQL PREPARE sid_sisisinst FROM :select_anw; host variable :select_anw contains: (gdb) p select_anw $2 = "SELECT ctid, * from sisisinst WHERE version = :v1", '\000' process PID 23957 is hanging in poll() on network socket to PostgreSQL server: (Version 11.4 on Linux) (gdb) bt #0 0x00007ff9020c2664 in poll () from /lib64/libc.so.6 #1 0x00007ff90144c22f in pqSocketPoll () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #2 0x00007ff90144c0ff in pqSocketCheck () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #3 0x00007ff90144bf83 in pqWaitTimed () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #4 0x00007ff90144bf55 in pqWait () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #5 0x00007ff901447d3a in PQgetResult () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #6 0x00007ff901448478 in PQexecFinish () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #7 0x00007ff90144819a in PQprepare () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #8 0x00007ff901cc788d in prepare_common () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #9 0x00007ff901cc7a26 in ECPGprepare () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #10 0x00007ff9044a86ee in select_record (scroll=1, lock=0, key=2, sel_anw=0x7ffed86186c0 "SELECT ctid, * from %s", where_anw=0x7ffed8617330 "%s = :v1", p_daten=0x7ff904dd7700 <hrec_sisisinst>, i_between=0, p_oben=0x7ffed86171d0, order_by=0x0, auf_ab=0x0, group_by=0x0, having=0x0, into_temp=0x0, count=0x7ffed8617328) at sisisinst.pgc:925 #11 0x00007ff9044a7a8d in sisisinst (zugriff=1, scroll=1, lock=0, key=2, sto=-20000, p_daten=0x7ffed8618c30, sel_anw=0x7ffed86186c0 "SELECT ctid, * from %s", where_anw=0x7ffed8617330 "%s = :v1", p_btw_daten=0x0, order_by=0x0, auf_ab=0x0, group_by=0x0, having=0x0, into_temp=0x0, count=0x7ffed8617328) at sisisinst.pgc:450 #12 0x00007ff9043718a5 in DB_rdir (tabmodul=0x7ff9044a7639 <sisisinst>, key=2, scroll=1, lock=0, p_daten=0x7ffed8618c30) at dbcall.pgc:1665 #13 0x00007ff90436f6c9 in DB_ChkVer () at dbcall.pgc:307 #14 0x00007ff9043708d3 in DB_opdbP (mode=1) at dbcall.pgc:840 #15 0x00007ff90436fff8 in DB_opdb () at dbcall.pgc:543 #16 0x0000000000405efa in InitIndex (setid=FSTAB_Personen) at ./INDEX.c:1000 #17 0x0000000000405241 in main (argc=2, argv=0x7ffed8619fa8) at ./INDEX.c:412 running PostgreSQL processes: # ps ax | grep postgres: 329 pts/13 S+ 0:00 grep --color=auto postgres: 2354 ? Ss 0:17 postgres: checkpointer 2355 ? Ss 0:09 postgres: background writer 2356 ? Ss 0:10 postgres: walwriter 2357 ? Ss 0:04 postgres: autovacuum launcher 2358 ? Ss 0:00 postgres: archiver last was 000000010000000F00000062 2359 ? Ss 0:31 postgres: stats collector 2360 ? Ss 0:00 postgres: logical replication launcher 14044 ? Ss 0:00 postgres: sisis sisis ::1(53426) idle 14100 ? Ss 0:00 postgres: sisis sisis ::1(53432) idle 16852 ? Ss 0:00 postgres: sisis sisis ::1(53486) idle 17575 ? Ss 0:00 postgres: sisis sisis ::1(48736) idle 17577 ? Ss 0:00 postgres: sisis sisis ::1(48742) idle 17583 ? Ss 0:00 postgres: sisis sisis 10.23.33.19(49630) idle 17584 ? Ss 0:00 postgres: sisis sisis 10.23.33.19(49632) idle 17585 ? Ss 0:00 postgres: sisis sisis 10.23.33.19(49634) idle 17587 ? Ss 0:00 postgres: sisis sisis ::1(48762) idle 17590 ? Ss 0:00 postgres: sisis sisis ::1(48770) idle 17593 ? Ss 0:00 postgres: sisis sisis ::1(48778) idle 17609 ? Ss 0:04 postgres: sisis sisis ::1(48794) idle 17612 ? Ss 0:00 postgres: sisis sisis ::1(48802) idle 17615 ? Ss 0:04 postgres: sisis sisis ::1(48810) idle 17618 ? Ss 0:00 postgres: sisis sisis ::1(48818) idle 17621 ? Ss 0:08 postgres: sisis sisis ::1(48826) idle 17866 ? Ss 0:00 postgres: sisis sisis 10.23.33.19(49758) idle 17868 ? Ss 0:00 postgres: sisis sisis ::1(48886) idle 17907 ? Ss 0:00 postgres: sisis sisis ::1(48900) idle 17948 ? Ss 0:00 postgres: sisis sisis ::1(48914) idle 21457 ? Ss 0:00 postgres: sisis sisis ::1(53664) idle 23958 ? Ss 0:00 postgres: sisis sisis ::1(53742) idle ^^^^^ lsof -P -p 23957 | grep 5432 INDEX 23957 sisis 2u IPv6 117475261 0t0 TCP localhost:53742->localhost:5432 (ESTABLISHED) -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > I encounter from time to time that one of our ESQL/C clients hangs with > the details described below for ever, while accessing on startup a table > 'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients > start up fine (they all check this table 'sisisinst' on startup to > verify the correct version of our application) Also SQL works > fine. > What can I do to get to the source of the problem? Thanks Can you get a stack trace from the connected backend? Or even just look at what it's doing according to pg_stat_activity? (If it's a reasonably modern server, the wait-condition fields would be interesting.) regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > I encounter from time to time that one of our ESQL/C clients hangs with > > the details described below for ever, while accessing on startup a table > > 'sisisinst' which has only some 50 rows, all of CHAR columns. Other ESQL/C clients > > start up fine (they all check this table 'sisisinst' on startup to > > verify the correct version of our application) Also SQL works > > fine. > > > What can I do to get to the source of the problem? Thanks > > Can you get a stack trace from the connected backend? Or even just look > at what it's doing according to pg_stat_activity? (If it's a reasonably > modern server, the wait-condition fields would be interesting.) srap32dxr1:/opt/lib/sisis/catserver/bin # lsof -P | grep 53742 INDEX 23957 sisis 2u IPv6 117475261 0t0 TCP localhost:53742->localhost:5432(ESTABLISHED) postmaste 23958 postgres 9u IPv6 117478789 0t0 TCP localhost:5432->localhost:53742(ESTABLISHED) 'INDEX' is the hanging proc; PID 23958 its server; gdb shows: # gdb /usr/local/sisis-pap/pgsql/bin/postgres 23958 ... (gdb) bt #0 0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6 #1 0x000000000084476c in WaitEventSetWaitBlock () #2 0x0000000000844647 in WaitEventSetWait () #3 0x00000000006f89d2 in secure_read () #4 0x0000000000707425 in pq_recvbuf () #5 0x0000000000707709 in pq_discardbytes () #6 0x0000000000707aba in pq_getmessage () #7 0x000000000086b478 in SocketBackend () #8 0x000000000086b4c4 in ReadCommand () #9 0x000000000086fda9 in PostgresMain () #10 0x00000000007dde15 in BackendRun () #11 0x00000000007dd595 in BackendStartup () #12 0x00000000007d9c9b in ServerLoop () #13 0x00000000007d9556 in PostmasterMain () #14 0x000000000070e2a5 in main () sisis=# select * from pg_stat_activity where client_port = 53742; datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state| backend_xid | backend_xmin | query | backend_type 93831 | sisis | 23958 | 16384 | sisis | | ::1 | | 53742 | 27.04.202012:37:46.941441 CEST | | | 27.04.2020 12:37:46.943073 CEST | Client | ClientRead| idle | | | | client backend (1 Zeile) HIH && Kinde Regards matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
As the client seems to be waiting on the server and the server on the client... could it have been caused by this in /etc/hosts (what our IT department configured): /etc/hosts: 127.0.0.1 localhost # special IPv6 addresses ::1 localhost ipv6-localhost ipv6-loopback i.e. the client has sent its request on IPv4 and the server answered on IPv6? I will remove the word 'localhost' from the IPv6 addr and restart everything, or do you want to see any additional inforation? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió: >> Can you get a stack trace from the connected backend? > (gdb) bt > #0 0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6 > #1 0x000000000084476c in WaitEventSetWaitBlock () > #2 0x0000000000844647 in WaitEventSetWait () > #3 0x00000000006f89d2 in secure_read () > #4 0x0000000000707425 in pq_recvbuf () > #5 0x0000000000707709 in pq_discardbytes () > #6 0x0000000000707aba in pq_getmessage () > #7 0x000000000086b478 in SocketBackend () > #8 0x000000000086b4c4 in ReadCommand () > #9 0x000000000086fda9 in PostgresMain () Oh, that is *very* interesting, because there is only one caller of pq_discardbytes: /* * Allocate space for message. If we run out of room (ridiculously * large message), we will elog(ERROR), but we want to discard the * message body so as not to lose communication sync. */ PG_TRY(); { enlargeStringInfo(s, len); } PG_CATCH(); { if (pq_discardbytes(len) == EOF) ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), errmsg("incomplete message from client"))); ... What this code intends to handle is the case where the client has sent a message that is so long that the backend hasn't enough memory to buffer it. What's actually happened, more likely, is that the received message length is corrupt and just appears to be large, since the client-side trace shows that libpq has sent what it has to send and is now waiting for a reply. If the received length were correct then the pq_discardbytes call would have completed after eating the message. So what it looks like is that something is corrupting data on its way from the client to the server. Flaky firewall maybe? If you're using SSL, maybe an SSL library bug? I'm reduced to speculation at this point. It's hard even to say what to do to gather more info. If you could reproduce it then I'd suggest watching the connection with wireshark or the like to see what data is actually going across the wire ... but IIUC it's pretty random, so that approach seems unlikely to work. If you're in a position to run a modified server, you could try inserting a debug log message: } PG_CATCH(); { + elog(COMMERROR, "bogus received message length: %d", len); if (pq_discardbytes(len) == EOF) ereport(COMMERROR, (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.) While this seems unlikely to teach us a huge amount, perhaps the value of the incorrect length would be informative. Are you always seeing this error at the exact same place so far as the client side is concerned? It's hard to see why a transport-level problem would preferentially affect one particular query ... regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > As the client seems to be waiting on the server and the server on the > client... could it have been caused by this in /etc/hosts (what our IT > department configured): Doubt it. Your connection evidently got through (at least) the authentication exchange, so basic data transmission seems to be working. I speculate that there's something data-dependent about the command that's failing. As an example, it would not be terribly surprising to get a behavior like this if there were something deciding to do an LF -> CRLF or similar transformation on the data stream. The substitution might accidentally not trigger during authentication, only to break some actual query a bit later. But there really shouldn't be anything making such changes in our data stream. Still, looking at recent network-level configuration changes wouldn't be a bad idea. regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
I've added the proposed log line to the server code in src/backend/libpq/pqcomm.c and restarted the server with this. In the log file /data/postgresql11/data/serverlog are other messages about length of packages, like: 2020-04-28 09:49:49.877 CEST [16229] LOG: invalid length of startup packet 2020-04-28 09:49:50.117 CEST [16231] LOG: invalid length of startup packet They're coming from: grep 'invalid length of startup packet' src/backend/*/*.c src/backend/postmaster/postmaster.c: errmsg("invalid length of startup packet"))); I will change the code there to to get the 'len' printed with the message ... if (len < (int32) sizeof(ProtocolVersion) || len > MAX_STARTUP_PACKET_LENGTH) { ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), errmsg("invalid length %d of startup packet", len))); return STATUS_ERROR; } -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub May, 9: Спаси́бо освободители! Thank you very much, Russian liberators!
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > El día Montag, April 27, 2020 a las 08:40:04 -0400, Tom Lane escribió: > >> Can you get a stack trace from the connected backend? > > > (gdb) bt > > #0 0x00007fd567776000 in epoll_pwait () from /lib64/libc.so.6 > > #1 0x000000000084476c in WaitEventSetWaitBlock () > > #2 0x0000000000844647 in WaitEventSetWait () > > #3 0x00000000006f89d2 in secure_read () > > #4 0x0000000000707425 in pq_recvbuf () > > #5 0x0000000000707709 in pq_discardbytes () > > #6 0x0000000000707aba in pq_getmessage () > > #7 0x000000000086b478 in SocketBackend () > > #8 0x000000000086b4c4 in ReadCommand () > > #9 0x000000000086fda9 in PostgresMain () > > Oh, that is *very* interesting, because there is only one caller of > pq_discardbytes: > > /* > * Allocate space for message. If we run out of room (ridiculously > * large message), we will elog(ERROR), but we want to discard the > * message body so as not to lose communication sync. > */ > PG_TRY(); > { > enlargeStringInfo(s, len); > } > PG_CATCH(); > { > if (pq_discardbytes(len) == EOF) > ereport(COMMERROR, > (errcode(ERRCODE_PROTOCOL_VIOLATION), > errmsg("incomplete message from client"))); > ... > > What this code intends to handle is the case where the client has sent a > message that is so long that the backend hasn't enough memory to buffer > it. What's actually happened, more likely, is that the received message > length is corrupt and just appears to be large, since the client-side > trace shows that libpq has sent what it has to send and is now waiting for > a reply. If the received length were correct then the pq_discardbytes > call would have completed after eating the message. > > So what it looks like is that something is corrupting data on its > way from the client to the server. Flaky firewall maybe? If you're > using SSL, maybe an SSL library bug? I'm reduced to speculation at > this point. It's hard even to say what to do to gather more info. > If you could reproduce it then I'd suggest watching the connection > with wireshark or the like to see what data is actually going across > the wire ... but IIUC it's pretty random, so that approach seems > unlikely to work. > > If you're in a position to run a modified server, you could try > inserting a debug log message: > > } > PG_CATCH(); > { > + elog(COMMERROR, "bogus received message length: %d", len); > if (pq_discardbytes(len) == EOF) > ereport(COMMERROR, > > (This is in src/backend/libpq/pqcomm.c, around line 1300 as of HEAD.) > While this seems unlikely to teach us a huge amount, perhaps the > value of the incorrect length would be informative. > > Are you always seeing this error at the exact same place so far as > the client side is concerned? It's hard to see why a transport-level > problem would preferentially affect one particular query ... I've added the printout of the length in this case and another one, and see this in the server's log file: 2020-05-04 10:05:49.977 CEST [32092] LOG: invalid length 33554940 of startup packet 2020-05-04 10:05:50.207 CEST [32094] LOG: invalid length 33554940 of startup packet 2020-05-04 12:32:50.781 CEST [20334] LOG: bogus received message length: 1650422894 2020-05-04 12:36:41.293 CEST [20380] LOG: bogus received message length: 1650422894 2020-05-04 12:39:39.461 CEST [20441] LOG: bogus received message length: 1650422894 2020-05-04 13:01:50.566 CEST [24222] LOG: bogus received message length: 1650422894 Any comments? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió: >> If you're in a position to run a modified server, you could try >> inserting a debug log message: > I've added the printout of the length in this case and another one, and > see this in the server's log file: > 2020-05-04 10:05:49.977 CEST [32092] LOG: invalid length 33554940 of startup packet > 2020-05-04 10:05:50.207 CEST [32094] LOG: invalid length 33554940 of startup packet > 2020-05-04 12:32:50.781 CEST [20334] LOG: bogus received message length: 1650422894 > 2020-05-04 12:36:41.293 CEST [20380] LOG: bogus received message length: 1650422894 > 2020-05-04 12:39:39.461 CEST [20441] LOG: bogus received message length: 1650422894 > 2020-05-04 13:01:50.566 CEST [24222] LOG: bogus received message length: 1650422894 Hmph. That confirms that we're getting a bogus message length, but not much more. It's quite interesting though that the bogus value is always the same. According to my calculator 1650422894 corresponds to ASCII "b_tn", or possibly "nt_b" depending on what you want to assume about endianness, which looks tantalizingly like it could be a fragment of a SQL query. So I'm still leaning to the idea that the client has sent a malformed query stream --- but how? Or if the data got corrupted in transit, how did that happen? Can you work backwards to what the client was doing just before the point at which it hangs? It's likely that the particular PQprepare call it's stuck on is just the victim of prior misfeasance. If you can find "b_tn" or "nt_b" in any strings the client should have been sending up to this point, that might shed light too. regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día martes, mayo 05, 2020 a las 04:18:02p. m. -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > El día lunes, abril 27, 2020 a las 09:40:39a. m. -0400, Tom Lane escribió: > >> If you're in a position to run a modified server, you could try > >> inserting a debug log message: > > > I've added the printout of the length in this case and another one, and > > see this in the server's log file: > > > 2020-05-04 10:05:49.977 CEST [32092] LOG: invalid length 33554940 of startup packet > > 2020-05-04 10:05:50.207 CEST [32094] LOG: invalid length 33554940 of startup packet > > 2020-05-04 12:32:50.781 CEST [20334] LOG: bogus received message length: 1650422894 > > 2020-05-04 12:36:41.293 CEST [20380] LOG: bogus received message length: 1650422894 > > 2020-05-04 12:39:39.461 CEST [20441] LOG: bogus received message length: 1650422894 > > 2020-05-04 13:01:50.566 CEST [24222] LOG: bogus received message length: 1650422894 > > Hmph. That confirms that we're getting a bogus message length, but not > much more. It's quite interesting though that the bogus value is always > the same. According to my calculator 1650422894 corresponds to ASCII > "b_tn", or possibly "nt_b" depending on what you want to assume about > endianness, which looks tantalizingly like it could be a fragment of a > SQL query. So I'm still leaning to the idea that the client has sent > a malformed query stream --- but how? Or if the data got corrupted in > transit, how did that happen? > > Can you work backwards to what the client was doing just before the > point at which it hangs? It's likely that the particular PQprepare > call it's stuck on is just the victim of prior misfeasance. If you > can find "b_tn" or "nt_b" in any strings the client should have been > sending up to this point, that might shed light too. $ printf "0x%x\n" 1650422894 0x625f746e $ printf "0x%x\n" 1650422894 | xxd -r b_tn I will try to add in the case of the 'bogus received message length...' more logging of the content of the message. The client in question is written in ESQL/C and I will enable the (very good) ESQL/C logging feature of this interface to get an understanding what the client was sending before this connection gets stuck. We will produce tons of log files, but apart of Linux strace or tcpdump with even more tons, I don't see any way at the moment. The situation is relatively seldom. Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Dienstag, April 28, 2020 a las 10:10:18 +0200, Matthias Apitz escribió: > > I've added the proposed log line to the server code in src/backend/libpq/pqcomm.c > and restarted the server with this. In the log file /data/postgresql11/data/serverlog > are other messages about length of packages, like: > > 2020-04-28 09:49:49.877 CEST [16229] LOG: invalid length of startup packet > 2020-04-28 09:49:50.117 CEST [16231] LOG: invalid length of startup packet > > They're coming from: > > grep 'invalid length of startup packet' src/backend/*/*.c > src/backend/postmaster/postmaster.c: errmsg("invalid length of startup packet"))); > > I will change the code there to to get the 'len' printed with the > message > > ... > if (len < (int32) sizeof(ProtocolVersion) || > len > MAX_STARTUP_PACKET_LENGTH) > { > ereport(COMMERROR, > (errcode(ERRCODE_PROTOCOL_VIOLATION), > errmsg("invalid length %d of startup packet", len))); > return STATUS_ERROR; > } I'm still trying to catch from where (i.e. from which client connection) the above message are originated. Over the actual weeken(!), I see the following picture: # egrep 'bogus|incomplete message from|invalid length' /data/postgresql11/data/serverlog ... 2020-05-09 05:35:20.124 CEST [6905] LOG: invalid length 33554940 of startup packet 2020-05-09 05:35:20.368 CEST [6907] LOG: invalid length 33554940 of startup packet 2020-05-09 05:38:14.381 CEST [7186] LOG: invalid length 33554940 of startup packet 2020-05-09 05:38:14.669 CEST [7246] LOG: invalid length 33554940 of startup packet 2020-05-10 05:34:56.716 CEST [13241] LOG: invalid length 33554940 of startup packet 2020-05-10 05:34:56.942 CEST [13243] LOG: invalid length 33554940 of startup packet 2020-05-10 05:37:46.599 CEST [13921] LOG: invalid length 33554940 of startup packet 2020-05-10 05:37:49.577 CEST [14017] LOG: invalid length 33554940 of startup packet i.e. the messages only appear short after 5:3x am; this is exactly the time when (based on cronjobs) our software is deployed to this server and all servers (the clients for PostgreSQL) are started. Following questions: 1. In addition to the above code, can I get somehow the connecting client's PID? If so, I could get from its PID its command line from the /proc file system. This would help to instrument the client side with logging. There are some hundred clients of different server types booting up at this time frame after the deployment. 2. Can I get something out of the buffer of the startup packet? For example, with adding to the above change something like: { char buf[1024+1]; memset (buf, 0, sizeof(buf)); pq_getbytes(buf, 1024); buf[1024] = '\0'; ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), errmsg("first 1024 bytes of startup packet [%s]", buf))); } Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > 1. In addition to the above code, can I get somehow the connecting client's PID? If Not in any portable way. If you believe that the offending client is on the local system, you could try building something involving netstat, lsof, etc. getpeereid() might be useful too. > 2. Can I get something out of the buffer of the startup packet? I'm fairly sure that at the point where the error is thrown, we've only read the putative length word, so there's nothing buffered. But sure, you could try to collect whatever is available from the socket and dump that --- probably a hex dump would be advisable, because it'll contain \0 bytes. Seems like a good idea. regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Sonntag, Mai 10, 2020 a las 11:33:41 -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > 1. In addition to the above code, can I get somehow the connecting client's PID? If > > Not in any portable way. If you believe that the offending client is on > the local system, you could try building something involving netstat, > lsof, etc. getpeereid() might be useful too. I launched a shell script to get some more information in this case: ... if (len < (int32) sizeof(ProtocolVersion) || len > MAX_STARTUP_PACKET_LENGTH) { int pid = getpid(); char cmd[80]; sprintf(cmd, "/home/sisis/guru/getClient.sh %d", pid); system(cmd); ereport(COMMERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), errmsg("invalid length %d of startup packet", len))); return STATUS_ERROR; } ... The process with the PID in question is 'postmaster': PS: 14622 ? Ss 0:00 /usr/local/sisis-pap/pgsql/bin/postmaster -D /data/postgresql11/data lsof -P -p 14622 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME postmaste 14622 postgres cwd DIR 254,0 4096 5373968 /data/postgresql11/data postmaste 14622 postgres rtd DIR 254,0 4096 2 / postmaste 14622 postgres txt REG 254,0 9959488 5636120 /usr/local/sisis-pap/pgsql/bin/postgres postmaste 14622 postgres DEL REG 0,5 131559958 /dev/zero ... postmaste 14622 postgres 9u IPv4 131903453 0t0 TCP srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED) ... Interestingly, the other side of this connection of the port 5432 is not visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > I launched a shell script to get some more information in this case: ok > The process with the PID in question is 'postmaster': Yeah, this should actually be a newly-forked postmaster child process, but it won't have done anything yet to change its ps-visible command line. > postmaste 14622 postgres 9u IPv4 131903453 0t0 TCP srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED) OK, so it's gotten a TCP not Unix-socket connection; that's already going to narrow things a little bit. > Interestingly, the other side of this connection of the port 5432 is not > visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment. Permissions problem maybe? I'm not sure that lsof will tell you much about non-postgres-owned processes, unless you run it as root. regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Montag, Mai 11, 2020 a las 09:50:12 -0400, Tom Lane escribió: > > > postmaste 14622 postgres 9u IPv4 131903453 0t0 TCP srap32dxr1.dev.oclc.org:5432->srap32dxr1.dev.oclc.org:44836(ESTABLISHED) > > OK, so it's gotten a TCP not Unix-socket connection; that's already > going to narrow things a little bit. > > > Interestingly, the other side of this connection of the port 5432 is not > > visible, i.e. a 'lsof -P | grep 5432' does not show it in this moment. > > Permissions problem maybe? I'm not sure that lsof will tell you much > about non-postgres-owned processes, unless you run it as root. Ah, yes. As this is just a dev host only, I set 'chmod 4755 /usr/bin/lsof' and now I see, the connecting client is one of our Java tools and I can look at the problem from this side. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Montag, Mai 11, 2020 a las 06:24:52 +0200, Matthias Apitz escribió: > Ah, yes. As this is just a dev host only, I set 'chmod 4755 /usr/bin/lsof' > and now I see, the connecting client is one of our Java tools and I can > look at the problem from this side. > With the knowledge which Java process it is causing, I'm able to simple reproduce the message and with knowing the 4-tuple of the TCP connection to identify the exchange in a capture with TCPDUMP sniffing on localhost port 5432. Below is the exchange. The visible strings are: 'sisis' (the PG user), 'sisis123' (its password in PG, a dummy password used for testing) and 'srap32dxr1' the hostname. The server is terminating the connection with a F-pkg: 19:54:02.940205 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [S], seq 3950072774, win 43690, options [mss 65495,sackOK,TSval 3334863612 ecr 0,nop,wscale 7], length 0 0x0000: 4500 003c e98b 4000 4006 fadc 0a17 2113 E..<..@.@.....!. 0x0010: 0a17 2113 bd36 1538 eb71 53c6 0000 0000 ..!..6.8.qS..... 0x0020: a002 aaaa 5682 0000 0204 ffd7 0402 080a ....V........... 0x0030: c6c5 fafc 0000 0000 0103 0307 ............ 19:54:02.940217 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [S.], seq 749639996, ack 3950072775, win 43690, options [mss65495,sackOK,TS val 3334863612 ecr 3334863612,nop,wscale 7], length 0 0x0000: 4500 003c 0000 4000 4006 e468 0a17 2113 E..<..@.@..h..!. 0x0010: 0a17 2113 1538 bd36 2cae 993c eb71 53c7 ..!..8.6,..<.qS. 0x0020: a012 aaaa 5682 0000 0204 ffd7 0402 080a ....V........... 0x0030: c6c5 fafc c6c5 fafc 0103 0307 ............ 19:54:02.940226 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [.], ack 1, win 342, options [nop,nop,TS val 3334863612 ecr3334863612], length 0 0x0000: 4500 0034 e98c 4000 4006 fae3 0a17 2113 E..4..@.@.....!. 0x0010: 0a17 2113 bd36 1538 eb71 53c7 2cae 993d ..!..6.8.qS.,..= 0x0020: 8010 0156 567a 0000 0101 080a c6c5 fafc ...VVz.......... 0x0030: c6c5 fafc .... 19:54:02.948877 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [P.], seq 1:513, ack 1, win 342, options [nop,nop,TS val 3334863621ecr 3334863612], length 512 0x0000: 4500 0234 e98d 4000 4006 f8e2 0a17 2113 E..4..@.@.....!. 0x0010: 0a17 2113 bd36 1538 eb71 53c7 2cae 993d ..!..6.8.qS.,..= 0x0020: 8018 0156 587a 0000 0101 080a c6c5 fb05 ...VXz.......... 0x0030: c6c5 fafc 0200 0200 0000 0000 7372 6170 ............srap 0x0040: 3332 6478 7231 0000 0000 0000 0000 0000 32dxr1.......... 0x0050: 0000 0000 0000 0000 0000 0a73 6973 6973 ...........sisis 0x0060: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x0070: 0000 0000 0000 0000 0005 7369 7369 7331 ..........sisis1 0x0080: 3233 0000 0000 0000 0000 0000 0000 0000 23.............. 0x0090: 0000 0000 0000 0000 0800 0000 0000 0000 ................ 0x00a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x00b0: 0000 0000 0000 0001 0200 0604 0801 0000 ................ 0x00c0: 0000 0002 0000 0000 0000 0000 0000 0000 ................ 0x00d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x00e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x00f0: 0000 .. 19:54:02.948886 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 513, win 350, options [nop,nop,TS val 3334863621ecr 3334863621], length 0 0x0000: 4500 0034 6004 4000 4006 846c 0a17 2113 E..4`.@.@..l..!. 0x0010: 0a17 2113 1538 bd36 2cae 993d eb71 55c7 ..!..8.6,..=.qU. 0x0020: 8010 015e 567a 0000 0101 080a c6c5 fb05 ...^Vz.......... 0x0030: c6c5 fb05 .... 19:54:02.948970 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [P.], seq 513:612, ack 1, win 342, options [nop,nop,TS val3334863621 ecr 3334863621], length 99 0x0000: 4500 0097 e98e 4000 4006 fa7e 0a17 2113 E.....@.@..~..!. 0x0010: 0a17 2113 bd36 1538 eb71 55c7 2cae 993d ..!..6.8.qU.,..= 0x0020: 8018 0156 56dd 0000 0101 080a c6c5 fb05 ...VV........... 0x0030: c6c5 fb05 0201 0063 0000 0000 0000 0000 .......c........ 0x0040: 0000 0600 0000 0000 0800 0000 0000 0000 ................ 0x0050: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x0060: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0x0070: 0135 3132 0000 0003 0000 0000 e200 1801 .512............ 0x0080: 0c07 cdff 85ee ef65 7fff ffff d602 0800 .......e........ 0x0090: 0680 0648 0000 00 ...H... 19:54:02.948974 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 612, win 350, options [nop,nop,TS val 3334863621ecr 3334863621], length 0 0x0000: 4500 0034 6005 4000 4006 846b 0a17 2113 E..4`.@.@..k..!. 0x0010: 0a17 2113 1538 bd36 2cae 993d eb71 562a ..!..8.6,..=.qV* 0x0020: 8010 015e 567a 0000 0101 080a c6c5 fb05 ...^Vz.......... 0x0030: c6c5 fb05 .... 19:54:04.176794 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [F.], seq 1, ack 612, win 350, options [nop,nop,TS val 3334864848ecr 3334863621], length 0 0x0000: 4500 0034 6006 4000 4006 846a 0a17 2113 E..4`.@.@..j..!. 0x0010: 0a17 2113 1538 bd36 2cae 993d eb71 562a ..!..8.6,..=.qV* 0x0020: 8011 015e 567a 0000 0101 080a c6c5 ffd0 ...^Vz.......... 0x0030: c6c5 fb05 .... 19:54:04.176861 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [.], ack 2, win 342, options [nop,nop,TS val 3334864849 ecr3334864848], length 0 0x0000: 4500 0034 e98f 4000 4006 fae0 0a17 2113 E..4..@.@.....!. 0x0010: 0a17 2113 bd36 1538 eb71 562a 2cae 993e ..!..6.8.qV*,..> 0x0020: 8010 0156 567a 0000 0101 080a c6c5 ffd1 ...VVz.......... 0x0030: c6c5 ffd0 .... 19:54:04.177085 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [F.], seq 612, ack 2, win 342, options [nop,nop,TS val 3334864849ecr 3334864848], length 0 0x0000: 4500 0034 e990 4000 4006 fadf 0a17 2113 E..4..@.@.....!. 0x0010: 0a17 2113 bd36 1538 eb71 562a 2cae 993e ..!..6.8.qV*,..> 0x0020: 8011 0156 567a 0000 0101 080a c6c5 ffd1 ...VVz.......... 0x0030: c6c5 ffd0 .... 19:54:04.177096 IP 10.23.33.19.5432 > 10.23.33.19.48438: Flags [.], ack 613, win 350, options [nop,nop,TS val 3334864849ecr 3334864849], length 0 0x0000: 4500 0034 0000 4000 4006 e470 0a17 2113 E..4..@.@..p..!. 0x0010: 0a17 2113 1538 bd36 2cae 993e eb71 562b ..!..8.6,..>.qV+ 0x0020: 8010 015e b7e3 0000 0101 080a c6c5 ffd1 ...^............ 0x0030: c6c5 ffd1 .... -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPARE sid_sisisinst FROM :select_anw;'
From
Tom Lane
Date:
Matthias Apitz <guru@unixarea.de> writes: > Below is the exchange. The visible strings are: 'sisis' (the PG user), > 'sisis123' (its password in PG, a dummy password used for testing) and > 'srap32dxr1' the hostname. The server is terminating the connection with > a F-pkg: [ squint... ] That looks nothing like a Postgres-protocol exchange to me. If it weren't for the runs of zeroes, I'd wonder if the connection had SSL encryption turned on. Perhaps you captured the wrong session? regards, tom lane
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Montag, Mai 11, 2020 a las 02:41:29 -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > Below is the exchange. The visible strings are: 'sisis' (the PG user), > > 'sisis123' (its password in PG, a dummy password used for testing) and > > 'srap32dxr1' the hostname. The server is terminating the connection with > > a F-pkg: > > [ squint... ] That looks nothing like a Postgres-protocol exchange > to me. If it weren't for the runs of zeroes, I'd wonder if the > connection had SSL encryption turned on. Perhaps you captured the > wrong session? I don't think so. It is exactly the problematic connection started by the Java process against the port 5432; look the SYN pkg: 19:54:02.940205 IP 10.23.33.19.48438 > 10.23.33.19.5432: Flags [S], seq 3950072774, win 43690, options [mss 65495,sackOK,TS val 3334863612 ecr 0,nop,wscale 7], length 0 As you say, that this is not any Postgres-protocol exchange, I will check the configuration of this part of our software. It can be some kind of misconfiguration, some part is talking TCP/IP to the wrong server. Thanks for the hint. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: PostgreSQL client hangs sometimes on 'EXEC SQL PREPAREsid_sisisinst FROM :select_anw;'
From
Matthias Apitz
Date:
El día Montag, Mai 11, 2020 a las 02:41:29 -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > Below is the exchange. The visible strings are: 'sisis' (the PG user), > > 'sisis123' (its password in PG, a dummy password used for testing) and > > 'srap32dxr1' the hostname. The server is terminating the connection with > > a F-pkg: > > [ squint... ] That looks nothing like a Postgres-protocol exchange > to me. If it weren't for the runs of zeroes, I'd wonder if the > connection had SSL encryption turned on. Perhaps you captured the > wrong session? It turned out, that the messages: 2020-05-12 05:33:06.828 CEST [6271] LOG: invalid length 33554940 of startup packet 2020-05-12 05:33:08.305 CEST [6291] LOG: invalid length 33554940 of startup packet 2020-05-12 05:34:07.495 CEST [6596] LOG: invalid length 33554940 of startup packet 2020-05-12 05:34:09.710 CEST [6927] LOG: invalid length 33554940 of startup packet have been caused by a misconfiguration of the component in question: the application used the Sybase JDBC driver to talk to the PostgreSQL server :-( This part of the thread is hereby solved. Remaining is still: 2020-04-30 15:14:03.367 CEST [3843] LOG: bogus received message length: 1650422894 2020-04-30 15:57:48.926 CEST [10380] LOG: bogus received message length: 1650422894 2020-04-30 15:58:31.851 CEST [10380] LOG: incomplete message from client 2020-04-30 15:58:53.578 CEST [10439] LOG: bogus received message length: 1650422894 2020-04-30 16:00:13.500 CEST [10439] LOG: incomplete message from client matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub