Thread: client waits for end of update operation and server proc is idle
Hello, We face (somehow reproducible) the following situation with a 13.1 server and a client written in ESQL/C: # ps ax | grep post ... 27106 ? Ss 0:00 postgres: sisis sisis ::1(49518) idle in transaction # lsof | grep 49518 INDEX 27074 sisis 2u IPv6 26465150 0t0 TCP localhost:49518->localhost:postgresql(ESTABLISHED) postmaste 27106 postgres 9u IPv6 26464149 0t0 TCP localhost:postgresql->localhost:49518(ESTABLISHED) # psql -Usisis -dsisis psql (11.4, server 13.1) WARNING: psql major version 11, server major version 13. Some psql features might not work. Type "help" for help. sisis=# select * from pg_stat_activity where client_port = 49518; datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname| client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid| backend_xmin | query | backend_type --------+---------+-------+------------+----------+---------+---------------------------------+-------------+-----------------+-------------+---------------------------------+---------------------------------+---------------------------------+---------------------------------+-----------------+------------+---------------------+-------------+--------------+---------------------+---------------- 951941 | sisis | 27106 | | 16384 | sisis | SunRise DBCALL V7.2 (pid=27074) | ::1 | | 49518 | 23.04.2021 05:57:19.536206 CEST | 23.04.2021 05:57:19.562102 CEST | 23.04.2021 05:57:19.664471 CEST| 23.04.2021 05:57:19.664495 CEST | Client | ClientRead | idle in transaction | 18228156 | 18228156 |fetch per_kette_seq | client backend (1 row) i.e. the server is in 'ClientRead | idle in transaction'; The ESQL/C code the client is executing at this moment is: /* * update the actual row */ posDebug(POSDEBUG_FLOW, "ES UPDATE per_kette set .. WHERE CURRENT OF hc_per_kette"); EXEC SQL UPDATE per_kette SET desknr =:hrec_per_kette.desknr, frei =:hrec_per_kette.frei, modif =:hrec_per_kette.modif, anz =:hrec_per_kette.anz, katlist =:hrec_per_kette.katlist WHERE CURRENT OF hc_per_kette; An attached GDB shows: # gdb /opt/lib/sisis/catserver/bin/INDEX 27074 #0 0x00007f1e65c1b5c4 in poll () from /lib64/libc.so.6 #1 0x00007f1e648ac22f in pqSocketPoll () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #2 0x00007f1e648ac0ff in pqSocketCheck () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #3 0x00007f1e648abf83 in pqWaitTimed () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #4 0x00007f1e648abf55 in pqWait () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #5 0x00007f1e648a7d3a in PQgetResult () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #6 0x00007f1e648a8478 in PQexecFinish () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #7 0x00007f1e648a812e in PQexecParams () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #8 0x00007f1e65818216 in ecpg_execute () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #9 0x00007f1e658195de in ecpg_do () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #10 0x00007f1e658196ef in ECPGdo () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #11 0x00007f1e67e091c6 in update_record () at per_kette.pgc:1968 #12 0x00007f1e67e04d16 in per_kette (zugriff=10, scroll=-20000, lock=-20000, key=-20000, sto=-20000, p_daten=0x7fff4e6db9b0, sel_anw=0x0, where_anw=0x0, p_btw_daten=0x0, order_by=0x0, auf_ab=0x0, group_by=0x0, having=0x0, into_temp=0x0, count=0x7fff4e6da6d0) at per_kette.pgc:497 #13 0x00007f1e67d7d70f in DB_rewr (tabmodul=0x7f1e67e047ca <per_kette>, p_daten=0x7fff4e6db9b0) at dbcall.pgc:3274 #14 0x00007f1e68ade821 in BKDeskInsert (setid=FSTAB_Personen, kette=..., katkey=168952979) at ./BKDesk.c:1447 #15 0x00007f1e68ae79c1 in InvAufnehmen (setid=FSTAB_Personen, katkey=168952979) #16 0x00007f1e68ae6e17 in BKBearbeiteIndexAuftrag (setid=FSTAB_Personen, auftrag=...) at ./BKInvert.c:109 #17 0x0000000000405384 in main (argc=3, argv=0x7fff4e6ec1d8) at ./INDEX.c:525 i.e. our client issued ECPGdo() and now waits on finish; the provided data looks fine: (gdb) p hrec_per_kette.desknr $2 = 1384118 (gdb) p hrec_per_kette.katlist value requires 130000 bytes, which is more than max-value-size (gdb) x/1s hrec_per_kette.katlist 0x7f1e6864af6e <hrec_per_kette+14>: "\\xee9e01001eaf010057af010005ce010082d20100d4f801008e0a0200af340200253a0200a58e0200ae8e020044a602003ea7020020f1030024f1030054170400a04304003f670400ba8904003bed000005490100925201007454010083600100828301"... the server pocess is in epoll_pwait(): # gdb /usr/local/sisis-pap/pgsql-13.0/bin/postmaster 27106 ... (gdb) bt #0 0x00007fe7ee953fe0 in epoll_pwait () from /lib64/libc.so.6 #1 0x00000000008acf10 in WaitEventSetWaitBlock () #2 0x00000000008acde9 in WaitEventSetWaitBlock () #3 0x000000000073c5a5 in check_ssl_key_file_permissions () #4 0x000000000074636e in RemoveSocketFiles () #5 0x000000000074662b in pq_getbyte_if_available () #6 0x000000000074697b in pq_getmessage () #7 0x00000000008d543b in pg_analyze_and_rewrite_params () #8 0x00007f54fec14780 in ?? () #9 0x00007ffffec148a0 in ?? () #10 0x00007ffffec14780 in ?? () #11 0x00000053008d3b1b in ?? () #12 0x00007ffffec14780 in ?? () #13 0x00000000008d5482 in pg_analyze_and_rewrite_params () The client has a log file for this update which shows the actual time of this operation: # ls -l --full-time /var/spool/sisis/catserver/tmp/.lock_index-A.sisis.2 -rw-rw---- 1 sisis sisis 6 2021-04-23 05:57:19.529476471 +0200 /var/spool/sisis/catserver/tmp/.lock_index-A.sisis.2 The serverlog has around this time (sorry for German): 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (broken pipe) 2021-04-23 05:55:23.593 CEST [2317] FATAL: Verbindung zum Client wurde verloren 2021-04-23 05:57:20.339 CEST [27190] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion 2021-04-23 05:59:17.419 CEST [26827] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion What else could be checked for this? Any hints about this issue? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
On Fri, 2021-04-23 at 10:15 +0200, Matthias Apitz wrote: > We face (somehow reproducible) the following situation with a 13.1 > server and a client written in ESQL/C: > > # ps ax | grep post > ... > 27106 ? Ss 0:00 postgres: sisis sisis ::1(49518) idle in transaction > > # lsof | grep 49518 > INDEX 27074 sisis 2u IPv6 26465150 0t0 TCP localhost:49518->localhost:postgresql(ESTABLISHED) > postmaste 27106 postgres 9u IPv6 26464149 0t0 TCP localhost:postgresql->localhost:49518(ESTABLISHED) > > sisis=# select * from pg_stat_activity where client_port = 49518; > > [...] ClientRead | idle in transaction [...] > > The ESQL/C code the client is executing at this moment is: > > /* > * update the actual row > */ > posDebug(POSDEBUG_FLOW, "ES UPDATE per_kette set .. WHERE CURRENT OF hc_per_kette"); > EXEC SQL UPDATE per_kette > SET desknr =:hrec_per_kette.desknr, > frei =:hrec_per_kette.frei, > modif =:hrec_per_kette.modif, > anz =:hrec_per_kette.anz, > katlist =:hrec_per_kette.katlist > WHERE CURRENT OF hc_per_kette; > > An attached GDB shows: > > # gdb /opt/lib/sisis/catserver/bin/INDEX 27074 > > #0 0x00007f1e65c1b5c4 in poll () from /lib64/libc.so.6 > #1 0x00007f1e648ac22f in pqSocketPoll () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #2 0x00007f1e648ac0ff in pqSocketCheck () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #3 0x00007f1e648abf83 in pqWaitTimed () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #4 0x00007f1e648abf55 in pqWait () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #5 0x00007f1e648a7d3a in PQgetResult () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #6 0x00007f1e648a8478 in PQexecFinish () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #7 0x00007f1e648a812e in PQexecParams () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 > #8 0x00007f1e65818216 in ecpg_execute () > from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 > #9 0x00007f1e658195de in ecpg_do () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 > #10 0x00007f1e658196ef in ECPGdo () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 > > i.e. our client issued ECPGdo() and now waits on finish; the provided > data looks fine: > > the server pocess is in epoll_pwait(): > > # gdb /usr/local/sisis-pap/pgsql-13.0/bin/postmaster 27106 > ... > (gdb) bt > #0 0x00007fe7ee953fe0 in epoll_pwait () from /lib64/libc.so.6 > #1 0x00000000008acf10 in WaitEventSetWaitBlock () > #2 0x00000000008acde9 in WaitEventSetWaitBlock () > #3 0x000000000073c5a5 in check_ssl_key_file_permissions () > #4 0x000000000074636e in RemoveSocketFiles () > #5 0x000000000074662b in pq_getbyte_if_available () > #6 0x000000000074697b in pq_getmessage () > #7 0x00000000008d543b in pg_analyze_and_rewrite_params () > > The client has a log file for this update which shows the actual time of > this operation: > > # ls -l --full-time /var/spool/sisis/catserver/tmp/.lock_index-A.sisis.2 > -rw-rw---- 1 sisis sisis 6 2021-04-23 05:57:19.529476471 +0200 /var/spool/sisis/catserver/tmp/.lock_index-A.sisis.2 > > > The serverlog has around this time (sorry for German): > > 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client > 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht > 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. > 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (broken pipe) > 2021-04-23 05:55:23.593 CEST [2317] FATAL: Verbindung zum Client wurde verloren > 2021-04-23 05:57:20.339 CEST [27190] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > 2021-04-23 05:59:17.419 CEST [26827] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > What else could be checked for this? Any hints about this issue? It seems pretty clear that both the client and the server are waiting for each other. The server is of the opinion that it has completed the query, while the client is still waiting for the response. I would look for the problem in the network between both. The error messages (unexpected EOF on client connection, broken pipe) seem to point in the same direction. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
Am Fri, Apr 23, 2021 at 10:48:24AM +0200 schrieb Laurenz Albe: > > The serverlog has around this time (sorry for German): > > > > 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client > > 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht > > 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. > > 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (brokenpipe) > > 2021-04-23 05:55:23.593 CEST [2317] FATAL: Verbindung zum Client wurde verloren > > 2021-04-23 05:57:20.339 CEST [27190] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > 2021-04-23 05:59:17.419 CEST [26827] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > > > What else could be checked for this? Any hints about this issue? > > It seems pretty clear that both the client and the server are waiting for each other. > > The server is of the opinion that it has completed the query, while the client is > still waiting for the response. > > I would look for the problem in the network between both. The error messages > (unexpected EOF on client connection, broken pipe) seem to point in the same direction. It also reports an out-of-memory situation: >> 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht >> 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. It seems what is to be sent to the client is "too large". Following which sending the data seems (intentionally?) aborted by the server, which the other side likely reports as loss-of-signal. Also, where does the "imcomplete message from client" originate from ? Karsten -- GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
El día viernes, abril 23, 2021 a las 10:48:24a. m. +0200, Laurenz Albe escribió: > > The serverlog has around this time (sorry for German): > > > > 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client > > 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht > > 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. > > 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (brokenpipe) > > 2021-04-23 05:55:23.593 CEST [2317] FATAL: Verbindung zum Client wurde verloren > > 2021-04-23 05:57:20.339 CEST [27190] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > 2021-04-23 05:59:17.419 CEST [26827] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > > > What else could be checked for this? Any hints about this issue? > > It seems pretty clear that both the client and the server are waiting for each other. Yes, they are both alive as processes. Both are running on the same Linux server, i.e. no real network between. And any loss of data would have caused perhaps the end of one (and then the other) process. I will try to gather a TCPDUMP for this. > > The server is of the opinion that it has completed the query, while the client is > still waiting for the response. > > I would look for the problem in the network between both. The error messages > (unexpected EOF on client connection, broken pipe) seem to point in the same direction. The other 'unerwartetes EOF auf Client' (unexpected EOF client) are logged by other PostgreSQL server processes, not invloved here, and caused by killing other clients with 'kill -9'. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub ¡Con Cuba no te metas! «» Don't mess with Cuba! «» Leg Dich nicht mit Kuba an! http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/
El día viernes, abril 23, 2021 a las 11:32:02a. m. +0200, Matthias Apitz escribió: > El día viernes, abril 23, 2021 a las 10:48:24a. m. +0200, Laurenz Albe escribió: > > > > The serverlog has around this time (sorry for German): > > > > > > 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client > > > 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht > > > 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. > > > 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (brokenpipe) > > > 2021-04-23 05:55:23.593 CEST [2317] FATAL: Verbindung zum Client wurde verloren > > > 2021-04-23 05:57:20.339 CEST [27190] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > > 2021-04-23 05:59:17.419 CEST [26827] LOG: unerwartetes EOF auf Client-Verbindung mit einer offenen Transaktion > > > > > > What else could be checked for this? Any hints about this issue? > > > > It seems pretty clear that both the client and the server are waiting for each other. > > Yes, they are both alive as processes. Both are running on the same > Linux server, i.e. no real network between. And any loss of data would > have caused perhaps the end of one (and then the other) process. > I will try to gather a TCPDUMP for this. > I hit the same situation again. The ESQL/C client sent an UPDATE to the PostgreSQL server: # gdb /opt/lib/sisis/catserver/bin/INDEX 31864 ... #0 0x00007f688bb955c4 in poll () from /lib64/libc.so.6 #1 0x00007f688a82622f in pqSocketPoll () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #2 0x00007f688a8260ff in pqSocketCheck () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #3 0x00007f688a825f83 in pqWaitTimed () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #4 0x00007f688a825f55 in pqWait () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #5 0x00007f688a821d3a in PQgetResult () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #6 0x00007f688a822478 in PQexecFinish () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #7 0x00007f688a82212e in PQexecParams () from /usr/local/sisis-pap/pgsql/lib/libpq.so.5 #8 0x00007f688b792216 in ecpg_execute () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #9 0x00007f688b7935de in ecpg_do () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #10 0x00007f688b7936ef in ECPGdo () from /usr/local/sisis-pap/pgsql/lib/libecpg.so.6 #11 0x00007f688dd831c6 in update_record () at per_kette.pgc:1968 #12 0x00007f688dd7ed16 in per_kette (zugriff=10, scroll=-20000, lock=-20000, key=-20000, sto=-20000, p_daten=0x7ffef0499950, sel_anw=0x0, where_anw=0x0, p_btw_daten=0x0, order_by=0x0, auf_ab=0x0, group_by=0x0, having=0x0, into_temp=0x0, count=0x7ffef0498670) at per_kette.pgc:497 #13 0x00007f688dcf770f in DB_rewr (tabmodul=0x7f688dd7e7ca <per_kette>, p_daten=0x7ffef0499950) at dbcall.pgc:3274 #14 0x00007f688ea58821 in BKDeskInsert (setid=FSTAB_Personen, kette=..., katkey=168952979) at ./BKDesk.c:1447 and the Pos server sits idle: # select * from pg_stat_activity where client_port = 52288; datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname| client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid| backend_xmin | query | backend_type --------+---------+-------+------------+----------+---------+---------------------------------+-------------+-----------------+-------------+---------------------------------+---------------------------------+---------------------------------+---------------------------------+-----------------+------------+---------------------+-------------+--------------+---------------------+---------------- 951941 | sisis | 31894 | | 16384 | sisis | SunRise DBCALL V7.2 (pid=31864) | 127.0.0.1 | | 52288 | 25.04.2021 10:57:15.922137 CEST | 25.04.2021 10:57:15.968081 CEST | 25.04.2021 10:57:16.051256 CEST| 25.04.2021 10:57:16.051273 CEST | Client | ClientRead | idle in transaction | 18314796 | 18314796 |fetch per_kette_seq | client backend But this time I was collecting all TCP traffic of all clients to and from the port 5432. This TCPDUMP command writes all traffic as binary data to files of 100.000.000 bytes in size (-C 100) and appends a number when a new file is needed: # /usr/sbin/tcpdump -n -i lo -C 100 -w /var/spool/sisis/tcp5432. port 5432 & Later one can look into the files for the local port 52288 of the connection with: # tcpdump -n -r tcp5432.2 -Xx -s1024 port 52288 > INDEX.52288.txt 10:57:15.921783 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [S], seq 3652783347, win 65495, options [mss 65495,sackOK,TS val3424445579 ecr 0,nop,wscale 7], length 0 0x0000: 4500 003c b720 4000 4006 8599 7f00 0001 E..<..@.@....... 0x0010: 7f00 0001 cc40 1538 d9b9 0cf3 0000 0000 .....@.8........ 0x0020: a002 ffd7 fe30 0000 0204 ffd7 0402 080a .....0.......... 0x0030: cc1c e48b 0000 0000 0103 0307 ............ 10:57:15.921792 IP 127.0.0.1.5432 > 127.0.0.1.52288: Flags [S.], seq 4241480135, ack 3652783348, win 65483, options [mss65495,sackOK,TS val 3424445579 ecr 3424445579,nop,wscale 7], length 0 0x0000: 4500 003c 0000 4000 4006 3cba 7f00 0001 E..<..@.@.<..... 0x0010: 7f00 0001 1538 cc40 fccf d9c7 d9b9 0cf4 .....8.@........ 0x0020: a012 ffcb fe30 0000 0204 ffd7 0402 080a .....0.......... 0x0030: cc1c e48b cc1c e48b 0103 0307 ............ 10:57:15.921799 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [.], ack 1, win 512, options [nop,nop,TS val 3424445579 ecr 3424445579],length 0 0x0000: 4500 0034 b721 4000 4006 85a0 7f00 0001 E..4.!@.@....... 0x0010: 7f00 0001 cc40 1538 d9b9 0cf4 fccf d9c8 .....@.8........ 0x0020: 8010 0200 fe28 0000 0101 080a cc1c e48b .....(.......... 0x0030: cc1c e48b .... And here you see the last command (the UPDATE) sent to the Pos server: 10:57:16.051388 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [P.], seq 14407:14812, ack 120304, win 512, options [nop,nop,TSval 3424445708 ecr 3424445708], length 405 0x0000: 4500 01c9 b84f 4000 4006 82dd 7f00 0001 E....O@.@....... 0x0010: 7f00 0001 cc40 1538 d9b9 453a fcd1 afb7 .....@.8..E:.... 0x0020: 8018 0200 ffbd 0000 0101 080a cc1c e50c ................ 0x0030: cc1c e50c 5000 0000 7e00 7570 6461 7465 ....P...~.update 0x0040: 2070 6572 5f6b 6574 7465 2073 6574 2064 .per_kette.set.d 0x0050: 6573 6b6e 7220 3d20 2431 2020 2c20 6672 esknr.=.$1..,.fr 0x0060: 6569 203d 2024 3220 202c 206d 6f64 6966 ei.=.$2..,.modif 0x0070: 203d 2024 3320 202c 2061 6e7a 203d 2024 .=.$3..,.anz.=.$ 0x0080: 3420 202c 206b 6174 6c69 7374 203d 2024 4..,.katlist.=.$ 0x0090: 3520 2077 6865 7265 2063 7572 7265 6e74 5..where.current 0x00a0: 206f 6620 6863 5f70 6572 5f6b 6574 7465 .of.hc_per_kette 0x00b0: 0000 0042 0000 00ff 0000 0000 0005 0000 ...B............ 0x00c0: 0007 3133 3834 3131 3800 0000 0130 0000 ..1384118....0.. 0x00d0: 0001 3000 0000 0232 3600 0000 d25c 7865 ..0....26....\xe 0x00e0: 6539 6530 3130 3031 6561 6630 3130 3035 e9e01001eaf01005 0x00f0: 3761 6630 3130 3030 3563 6530 3130 3038 7af010005ce01008 0x0100: 3264 3230 3130 3064 3466 3830 3130 3038 2d20100d4f801008 0x0110: 6530 6130 3230 3061 6633 3430 3230 3032 e0a0200af3402002 0x0120: 3533 6130 3230 3061 3538 6530 3230 3061 53a0200a58e0200a 0x0130: 6538 6530 3230 3034 3461 3630 3230 3033 e8e020044a602003 0x0140: 6561 3730 3230 3032 3066 3130 3330 3032 ea7020020f103002 0x0150: 3466 3130 3330 3035 3431 3730 3430 3061 4f1030054170400a 0x0160: 3034 3330 3430 3033 6636 3730 3430 3062 04304003f670400b 0x0170: 6138 3930 3430 3033 6265 6430 3030 3030 a8904003bed00000 0x0180: 3534 3930 3130 3039 3235 3230 3130 3037 5490100925201007 0x0190: 3435 3430 3130 3038 3336 3030 3130 3038 4540100836001008 0x01a0: 3238 3330 3130 3039 3330 3431 3230 6100 28301009304120a. 0x01b0: 0100 0044 0000 0006 5000 4500 0000 0900 ...D....P.E..... 0x01c0: 0000 0000 5300 0000 04 ....S.... 10:57:16.051395 IP 127.0.0.1.5432 > 127.0.0.1.52288: Flags [.], ack 14812, win 512, options [nop,nop,TS val 3424445708 ecr3424445708], length 0 0x0000: 4500 0034 453c 4000 4006 f785 7f00 0001 E..4E<@.@....... 0x0010: 7f00 0001 1538 cc40 fcd1 afb7 d9b9 46cf .....8.@......F. 0x0020: 8010 0200 fe28 0000 0101 080a cc1c e50c .....(.......... 0x0030: cc1c e50c .... The send buffer is complete, the bytea string \x ....00 is terminated by a NULL byte. TCP ACK'ed the package and then no further reaction of the Pos server on this connection. It still sits there if someone needs more information about this connection. From the collected data, I tend to say: this must be a bug in the Pos server... Any ideas? matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
El día domingo, abril 25, 2021 a las 01:54:49p. m. +0200, Matthias Apitz escribió: >From the collected data, I tend to say: this must be a bug in the Pos > server... > At the end of the day, it turned out that out client caused the problem. Because we were hunting some other issue, the client was logging some message with fprintf(stderr, ...) but the fd=2 was not stderr, fd=2 was the connection to the socket of the PostgreSQL server. And ofc the Pos server could not understand our human message. Problem solved. Sorry for all the noise. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Matthias Apitz <guru@unixarea.de> writes: > At the end of the day, it turned out that out client caused the problem. > Because we were hunting some other issue, the client was logging some > message with fprintf(stderr, ...) but the fd=2 was not stderr, fd=2 was the > connection to the socket of the PostgreSQL server. And ofc the Pos > server could not understand our human message. Hmph. So why wasn't this visible in the tcpdump trace? My initial thought was that the server process should log a message and drop the connection once it received something that didn't look like a valid Postgres-protocol message, on the assumption that message-boundary sync could never be recovered reliably. But on further thought, it would start by taking the second through fifth bytes of the input as a message length, and it wouldn't really do anything interesting until it had collected that much input --- and, if those bytes were text, that would look like a large number. So what the server is really doing is waiting for the rest of the imaginary message, which will never arrive. Maybe we should have more sanity checking on alleged message lengths. I recall there is some checking of that sort in libpq, but I'm less sure about the server. regards, tom lane
El día domingo, abril 25, 2021 a las 11:51:45a. m. -0400, Tom Lane escribió: > Matthias Apitz <guru@unixarea.de> writes: > > At the end of the day, it turned out that out client caused the problem. > > Because we were hunting some other issue, the client was logging some > > message with fprintf(stderr, ...) but the fd=2 was not stderr, fd=2 was the > > connection to the socket of the PostgreSQL server. And ofc the Pos > > server could not understand our human message. > > Hmph. So why wasn't this visible in the tcpdump trace? It was visible. That's why I detected it while looking up the TCP packages before the UPDATE: 10:57:16.051326 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [P.], seq 14280:14407, ack 120304, win 512, options [nop,nop,TSval 3424445708 ecr 3424445708], length 127 0x0000: 4500 00b3 b84e 4000 4006 83f4 7f00 0001 E....N@.@....... 0x0010: 7f00 0001 cc40 1538 d9b9 44bb fcd1 afb7 .....@.8..D..... 0x0020: 8018 0200 fea7 0000 0101 080a cc1c e50c ................ 0x0030: cc1c e50c 5352 502d 3236 3937 363a 2072 ....SRP-26976:.r 0x0040: 6573 746f 7265 6420 7374 7275 6374 206b estored.struct.k 0x0050: 6574 7465 7361 747a 2066 6f72 2073 6574 ettesatz.for.set 0x0060: 6964 3a32 2c20 6465 736b 3a33 3831 3832 id:2,.desk:38182 0x0070: 3038 302c 2066 656c 646e 723a 362c 206b 080,.feldnr:6,.k 0x0080: 6174 6b65 793a 3136 3839 3532 3937 392c atkey:168952979, 0x0090: 206b 6574 7465 7361 747a 2e66 7265 693a .kettesatz.frei: 0x00a0: 302c 206b 6574 7465 7361 747a 2e61 6e7a 0,.kettesatz.anz 0x00b0: 3235 0a 25. 10:57:16.051388 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [P.], seq 14407:14812, ack 120304, win 512, options [nop,nop,TSval 3424445708 ecr 3424445708], length 405 0x0000: 4500 01c9 b84f 4000 4006 82dd 7f00 0001 E....O@.@....... 0x0010: 7f00 0001 cc40 1538 d9b9 453a fcd1 afb7 .....@.8..E:.... 0x0020: 8018 0200 ffbd 0000 0101 080a cc1c e50c ................ 0x0030: cc1c e50c 5000 0000 7e00 7570 6461 7465 ....P...~.update 0x0040: 2070 6572 5f6b 6574 7465 2073 6574 2064 .per_kette.set.d 0x0050: 6573 6b6e 7220 3d20 2431 2020 2c20 6672 esknr.=.$1..,.fr > > My initial thought was that the server process should log a message > and drop the connection once it received something that didn't look > like a valid Postgres-protocol message, on the assumption that > message-boundary sync could never be recovered reliably. But on > further thought, it would start by taking the second through fifth > bytes of the input as a message length, and it wouldn't really do > anything interesting until it had collected that much input --- > and, if those bytes were text, that would look like a large number. > So what the server is really doing is waiting for the rest of the > imaginary message, which will never arrive. > > Maybe we should have more sanity checking on alleged message lengths. > I recall there is some checking of that sort in libpq, but I'm less > sure about the server. Yes, this would make sense. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Matthias Apitz <guru@unixarea.de> writes: > El día domingo, abril 25, 2021 a las 11:51:45a. m. -0400, Tom Lane escribió: >> Hmph. So why wasn't this visible in the tcpdump trace? > It was visible. That's why I detected it while looking up the TCP > packages before the UPDATE: > 10:57:16.051326 IP 127.0.0.1.52288 > 127.0.0.1.5432: Flags [P.], seq 14280:14407, ack 120304, win 512, options [nop,nop,TSval 3424445708 ecr 3424445708], length 127 > 0x0000: 4500 00b3 b84e 4000 4006 83f4 7f00 0001 E....N@.@....... > 0x0010: 7f00 0001 cc40 1538 d9b9 44bb fcd1 afb7 .....@.8..D..... > 0x0020: 8018 0200 fea7 0000 0101 080a cc1c e50c ................ > 0x0030: cc1c e50c 5352 502d 3236 3937 363a 2072 ....SRP-26976:.r > 0x0040: 6573 746f 7265 6420 7374 7275 6374 206b estored.struct.k > 0x0050: 6574 7465 7361 747a 2066 6f72 2073 6574 ettesatz.for.set > 0x0060: 6964 3a32 2c20 6465 736b 3a33 3831 3832 id:2,.desk:38182 > 0x0070: 3038 302c 2066 656c 646e 723a 362c 206b 080,.feldnr:6,.k > 0x0080: 6174 6b65 793a 3136 3839 3532 3937 392c atkey:168952979, > 0x0090: 206b 6574 7465 7361 747a 2e66 7265 693a .kettesatz.frei: > 0x00a0: 302c 206b 6574 7465 7361 747a 2e61 6e7a 0,.kettesatz.anz > 0x00b0: 3235 0a 25. Ah. On looking at the server code, I see there *is* some validation on the message type code --- but here, the first byte is 'S' which happens to be a legal message type, so you got past that check. Still, the apparent message length is then "RP-2" (0x52502d32) which is bigger than 1GB so I don't quite understand why you didn't get an "out of memory" error. In any case, we could clearly do with some more sanity checking than is there now. I'll post a patch for that in a bit. regards, tom lane
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > Am Sun, Apr 25, 2021 at 01:21:25PM -0400 schrieb Tom Lane: >> Still, the apparent message length is then "RP-2" (0x52502d32) >> which is bigger than 1GB so I don't quite understand why you didn't >> get an "out of memory" error. > But he did, didn't he ? Those log messages do appear to match the theory, but in the case at hand, the server seems to have hung up without throwing any error. However, we'd reproduce that if the apparent "message length" was less than 1GB (assuming the server process could malloc that much memory). That'd occur for any second byte less than 0x40, such as an ASCII digit. regards, tom lane
El día domingo, abril 25, 2021 a las 08:05:21p. m. +0200, Karsten Hilbert escribió: > > > > The serverlog has around this time (sorry for German): > > > > > > > > 2021-04-23 05:55:23.591 CEST [2317] LOG: unvollständige Message vom Client > > "incomplete message from client" > > > > > 2021-04-23 05:55:23.593 CEST [2317] FEHLER: Speicher aufgebraucht > > "memory exhausted" > > > > > 2021-04-23 05:55:23.593 CEST [2317] DETAIL: Kann Zeichenkettenpuffer mit 0 Bytes nicht um 1380986158 Bytes vergrößern. > > "cannot extend 0-byte string buffer by 1380986158 bytes" > > > > > 2021-04-23 05:55:23.593 CEST [2317] LOG: konnte Daten nicht an den Client senden: Datenübergabe unterbrochen (brokenpipe) > These messages at 05:55:xx are caused by our automatic deployment of the software every day which starts at 05:55:00 by cron and short after installing all software all servers (the PostgreSQL clients) get stopped (i.e. killed) and restarted. The hung appears a few minutes later at 05:57:xx. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
El día domingo, abril 25, 2021 a las 01:54:49p. m. +0200, Matthias Apitz escribió: ... > # select * from pg_stat_activity where client_port = 52288; > > datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname| client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid| backend_xmin | query | backend_type > --------+---------+-------+------------+----------+---------+---------------------------------+-------------+-----------------+-------------+---------------------------------+---------------------------------+---------------------------------+---------------------------------+-----------------+------------+---------------------+-------------+--------------+---------------------+---------------- > 951941 | sisis | 31894 | | 16384 | sisis | SunRise DBCALL V7.2 (pid=31864) | 127.0.0.1 | | 52288 | 25.04.2021 10:57:15.922137 CEST | 25.04.2021 10:57:15.968081 CEST | 25.04.2021 10:57:16.051256 CEST| 25.04.2021 10:57:16.051273 CEST | Client | ClientRead | idle in transaction | 18314796 | 18314796 |fetch per_kette_seq | client backend Having solved this issue, one last question: Is there any good way for a C-written or ESQL/C-written client to get the local port number or even the PID of the PostgreSQL server process on the other end of the connection for logging/tracing purpose? Thanks matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub ¡Con Cuba no te metas! «» Don't mess with Cuba! «» Leg Dich nicht mit Kuba an! http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/
El día lunes, abril 26, 2021 a las 07:38:40a. m. +0200, Matthias Apitz escribió: > Having solved this issue, one last question: Is there any good way for a > C-written or ESQL/C-written client to get the local port number or even > the PID of the PostgreSQL server process on the other end of the > connection for logging/tracing purpose? Ah, I found it as int PQbackendPID(const PGconn *conn); One should read firt the fine documentatio befor asking stupid questions: https://www.postgresql.org/docs/11/libpq-status.html matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub ¡Con Cuba no te metas! «» Don't mess with Cuba! «» Leg Dich nicht mit Kuba an! http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/