From: Alexander Kukushkin
Subject: Re: BUG #8647: Backend process hangs and becomes unkillable when SSL client looses connection
Date:
Msg-id: CAFh8B=n84eWSmvau3ETL0OMUfrHag9Rotmc2AzizqaZBpQsLZg@mail.gmail.com
In response to: BUG #8647: Backend process hangs and becomes unkillable when SSL client looses connection (valgog@gmail.com)
List: pgsql-bugs
Hi all,

I am sorry, but I was not subscribed to this mailing list before and so cannot
reply directly to the original message.

This message contains all the information I have gathered about backends
hanging when SSL is enabled.

We recently enabled SSL on our postgres servers and started observing some
hanging (unkillable) postgres processes.
They usually appear when the client has died or the connection has gone away
for some reason. I can reproduce the problem on 9.0, 9.1 and 9.2, but not on
9.3.
To reproduce it, I run a relatively big query that produces enough data to
fill the system buffers, then stop psql with Ctrl+Z and use iptables to block
all communication between server and client.
After that you can even kill psql, but on the server side the connection stays
in state ESTABLISHED and the server is stuck in the sendto system call.

After some time (15-30 minutes) the server changes its behaviour and starts
doing strange things:
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
sendto(10,
"\0\346\203\300\350a6\326G\206\266\237q\220\6\20\217\234\344D\232S\234\234=\306\255=[\342\225o"...,
7750, 0, NULL, 0) = -1 EPIPE (Broken pipe)

It is clearly retrying the send to the client in an infinite loop, without
handling the error properly.
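
Just to illustrate the failure mode in isolation: the backend ignores SIGPIPE,
so send() to a peer that is gone does not kill the process, it simply fails
with EPIPE -- exactly what the strace above shows over and over again. Here is
a minimal standalone demo of that (my own illustration, not the actual TCP
reproduction):

----------------------------------------------------------------------------------------------------------------------------
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
        int             sv[2];
        char            buf[8192];

        memset(buf, 0, sizeof(buf));
        signal(SIGPIPE, SIG_IGN);               /* the backend ignores SIGPIPE too */

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                return 1;
        close(sv[1]);                           /* the "client" disappears */

        if (send(sv[0], buf, sizeof(buf), 0) < 0)
                printf("send: %s\n", strerror(errno));  /* prints "Broken pipe" */

        return 0;
}
----------------------------------------------------------------------------------------------------------------------------
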
After analysing the source code of postgres and openssl I have found (or at
least I think so) the reason why this is happening:

----------------------------------------------------------------------------------------------------------------------------
static int
my_sock_write(BIO *h, const char *buf, int size)
{
        int                     res = 0;

        res = send(h->num, buf, size, 0);
        if (res <= 0)
        {
                if (errno == EINTR)
                {
                        BIO_set_retry_write(h);
                }
        }

        return res;
}

static BIO_METHOD *
my_BIO_s_socket(void)
{
        if (!my_bio_initialized)
        {
                memcpy(&my_bio_methods, BIO_s_socket(), sizeof(BIO_METHOD));
                my_bio_methods.bread = my_sock_read;
                my_bio_methods.bwrite = my_sock_write;
                my_bio_initialized = true;
        }
        return &my_bio_methods;
}

----------------------------------------------------------------------------------------------------------------------------
From this code snippet one can see that postgres overrides the standard
openssl function (originally sock_write in crypto/bio/bss_sock.c) with its own
function, my_sock_write.
This function does handle one error case (if (errno == EINTR)), but what we
are receiving is EPIPE...
The original openssl function looks a bit different:

----------------------------------------------------------------------------------------------------------------------------
static int sock_write(BIO *b, const char *in, int inl)
        {
        int ret;

        clear_socket_error();
        ret=writesocket(b->num,in,inl);
        BIO_clear_retry_flags(b);
        if (ret <= 0)
                {
                if (BIO_sock_should_retry(ret))
                        BIO_set_retry_write(b);
                }
        return(ret);
        }

----------------------------------------------------------------------------------------------------------------------------
As you can see, it first resets the flags (BIO_clear_retry_flags(b)) and only
then, if the error is non-fatal, sets the retry-write flags again by calling
BIO_set_retry_write(b).
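
To see concretely what these macros do, here is a tiny standalone program (my
own illustration, assuming OpenSSL 1.0.x headers where the BIO structure is
still public): BIO_set_retry_write() sets exactly the bits 0x2 | 0x8 = 10 that
we will see in gdb below, and only BIO_clear_retry_flags() clears them again:

----------------------------------------------------------------------------------------------------------------------------
#include <stdio.h>
#include <openssl/bio.h>

int
main(void)
{
        BIO        *b = BIO_new(BIO_s_mem());   /* any BIO type will do for this demo */

        BIO_set_retry_write(b);
        /* BIO_FLAGS_WRITE (0x2) | BIO_FLAGS_SHOULD_RETRY (0x8) */
        printf("after BIO_set_retry_write:   flags = %d\n", b->flags);  /* expect 10 */

        BIO_clear_retry_flags(b);
        printf("after BIO_clear_retry_flags: flags = %d\n", b->flags);  /* expect 0 */

        BIO_free(b);
        return 0;
}
----------------------------------------------------------------------------------------------------------------------------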

This leads to a problem, because ssize_t secure_write(Port *port, void *ptr,
size_t len) relies on the return value of SSL_get_error, which in my case
(after the EPIPE/SIGPIPE) returns SSL_ERROR_WANT_WRITE, and that causes an
infinite loop.
This happens because SSL_get_error derives its return code from internal
state, and SSL_ERROR_WANT_WRITE is returned because the BIO's b->flags == 10
(0x2 - write | 0x8 - retry).
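
For clarity, here is a simplified sketch of that retry loop (my own paraphrase
of secure_write from be-secure.c, reduced to a bare SSL * instead of
PostgreSQL's Port so it stands alone -- it is not the literal source):

----------------------------------------------------------------------------------------------------------------------------
#include <sys/types.h>
#include <openssl/ssl.h>

static ssize_t
secure_write_sketch(SSL *ssl, const void *ptr, size_t len)
{
        for (;;)
        {
                int             n = SSL_write(ssl, ptr, (int) len);

                switch (SSL_get_error(ssl, n))
                {
                        case SSL_ERROR_NONE:
                                return n;       /* bytes accepted by the SSL layer */

                        case SSL_ERROR_WANT_READ:
                        case SSL_ERROR_WANT_WRITE:
                                continue;       /* retry -- loops forever while the
                                                 * BIO retry flags stay set */

                        default:
                                return -1;      /* the EPIPE should end up here, but
                                                 * never does as long as WANT_WRITE
                                                 * keeps being reported */
                }
        }
}
----------------------------------------------------------------------------------------------------------------------------
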
Let's find out how this value was set by debugging the backend.

I opened a new connection, ran the query, stopped psql with Ctrl+Z, blocked
all communication between psql and postgres with iptables, and killed psql.

And on the server side:

(gdb) attach 17556
Attaching to process 17556
Reading symbols from /server/postgres/9.2.5/bin/postgres...done.
/* skipped long output of shared libraries */
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.
0x00007f35d4d72707 in kill () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f35d4d72707 in kill () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000006560fc in CheckStatementTimeout () at proc.c:1699
#2  0x00000000006578b5 in CheckStatementTimeout () at proc.c:1687
#3  handle_sig_alarm (postgres_signal_arg=<optimized out>) at proc.c:1754
#4  <signal handler called>
#5  0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#6  0x00000000005b3b0e in my_sock_write (h=0x20cf6f0, buf=<optimized out>,
size=<optimized out>) at be-secure.c:451
#7  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
#8  0x00007f35d5e84fc0 in ssl3_write_pending (s=0x20ced10, type=<optimized
out>, buf=<optimized out>, len=<optimized out>) at s3_pkt.c:881
#9  0x00007f35d5e856c4 in ssl3_write_bytes (s=0x20ced10, type=23,
buf_=0x20ccd00, len=<optimized out>) at s3_pkt.c:609
#10 0x00000000005b47e8 in secure_write (port=0x20cc890, ptr=0x20ccd00,
len=<optimized out>) at be-secure.c:352
#11 0x00000000005be41d in internal_flush () at pqcomm.c:1222
#12 0x00000000005be57d in internal_putbytes (s=0x2208d7b
"\002\061\066\377\377\377\377", len=186) at pqcomm.c:1168
#13 0x00000000005bf61b in pq_putmessage (msgtype=68 'D', s=0x2208d60 "",
len=<optimized out>) at pqcomm.c:1365
#14 0x00000000005c0234 in pq_endmessage (buf=0x7fff45fb48c0) at
pqformat.c:346
#15 0x0000000000461bfc in printtup (slot=0x2208650, self=0x2202210) at
printtup.c:359
#16 0x000000000058b276 in ExecutePlan (dest=0x2202210, direction=<optimized
out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
planstate=0x2207480, estate=0x2207370) at execMain.c:1420
#17 standard_ExecutorRun (queryDesc=0x21f43d0, direction=<optimized out>,
count=0) at execMain.c:303
#18 0x00007f35d0e7c58d in pgss_ExecutorRun (queryDesc=0x21f43d0,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:719
#19 0x0000000000668687 in PortalRunSelect (portal=0x21f6b20,
forward=<optimized out>, count=0, dest=0x2202210) at pquery.c:946
#20 0x0000000000669b40 in PortalRun (portal=0x21f6b20,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2202210,
altdest=0x2202210, completionTag=0x7fff45fb4e70 "") at pquery.c:790
#21 0x0000000000665ac4 in exec_simple_query (query_string=0x20a0c50 "select
* from zcat_data.price_archive limit 1000000 offset 10000000;") at
postgres.c:1046
#22 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x2088800 "integration_catalog1_db", username=<optimized out>) at
postgres.c:3966
#23 0x000000000062304b in BackendRun (port=0x20cc890) at postmaster.c:3614
#24 BackendStartup (port=0x20cc890) at postmaster.c:3304
#25 ServerLoop () at postmaster.c:1367
#26 0x0000000000623b21 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1127
#27 0x000000000045e872 in main (argc=3, argv=0x2086940) at main.c:199
(gdb) frame 7
#7  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
247     bio_lib.c: No such file or directory.
(gdb) print *b
$1 = {method = 0xb4bc00, callback = 0, cb_arg = 0x0, init = 1, shutdown =
0, flags = 0, retry_reason = 0, num = 10, ptr = 0x0, next_bio = 0x0,
prev_bio = 0x0, references = 1, num_read = 1715, num_write = 94657, ex_data
= {sk = 0x0, dummy = 0}}
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.
0x00007f35d4d72707 in kill () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f35d4d72707 in kill () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000656108 in CheckStatementTimeout () at proc.c:1701
#2  0x00000000006578b5 in CheckStatementTimeout () at proc.c:1687
#3  handle_sig_alarm (postgres_signal_arg=<optimized out>) at proc.c:1754
#4  <signal handler called>
#5  0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#6  0x00000000005b3b0e in my_sock_write (h=0x20cf6f0, buf=<optimized out>,
size=<optimized out>) at be-secure.c:451
#7  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
#8  0x00007f35d5e84fc0 in ssl3_write_pending (s=0x20ced10, type=<optimized
out>, buf=<optimized out>, len=<optimized out>) at s3_pkt.c:881
#9  0x00007f35d5e856c4 in ssl3_write_bytes (s=0x20ced10, type=23,
buf_=0x20ccd00, len=<optimized out>) at s3_pkt.c:609
#10 0x00000000005b47e8 in secure_write (port=0x20cc890, ptr=0x20ccd00,
len=<optimized out>) at be-secure.c:352
#11 0x00000000005be41d in internal_flush () at pqcomm.c:1222
#12 0x00000000005be57d in internal_putbytes (s=0x2208d7b
"\002\061\066\377\377\377\377", len=186) at pqcomm.c:1168
#13 0x00000000005bf61b in pq_putmessage (msgtype=68 'D', s=0x2208d60 "",
len=<optimized out>) at pqcomm.c:1365
#14 0x00000000005c0234 in pq_endmessage (buf=0x7fff45fb48c0) at
pqformat.c:346
#15 0x0000000000461bfc in printtup (slot=0x2208650, self=0x2202210) at
printtup.c:359
#16 0x000000000058b276 in ExecutePlan (dest=0x2202210, direction=<optimized
out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
planstate=0x2207480, estate=0x2207370) at execMain.c:1420
#17 standard_ExecutorRun (queryDesc=0x21f43d0, direction=<optimized out>,
count=0) at execMain.c:303
#18 0x00007f35d0e7c58d in pgss_ExecutorRun (queryDesc=0x21f43d0,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:719
#19 0x0000000000668687 in PortalRunSelect (portal=0x21f6b20,
forward=<optimized out>, count=0, dest=0x2202210) at pquery.c:946
#20 0x0000000000669b40 in PortalRun (portal=0x21f6b20,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2202210,
altdest=0x2202210, completionTag=0x7fff45fb4e70 "") at pquery.c:790
#21 0x0000000000665ac4 in exec_simple_query (query_string=0x20a0c50 "select
* from zcat_data.price_archive limit 1000000 offset 10000000;") at
postgres.c:1046
#22 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x2088800 "integration_catalog1_db", username=<optimized out>) at
postgres.c:3966
#23 0x000000000062304b in BackendRun (port=0x20cc890) at postmaster.c:3614
#24 BackendStartup (port=0x20cc890) at postmaster.c:3304
#25 ServerLoop () at postmaster.c:1367
#26 0x0000000000623b21 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1127
#27 0x000000000045e872 in main (argc=3, argv=0x2086940) at main.c:199
(gdb) frame 7
#7  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
247     bio_lib.c: No such file or directory.
(gdb) print *b
$2 = {method = 0xb4bc00, callback = 0, cb_arg = 0x0, init = 1, shutdown =
0, flags = 0, retry_reason = 0, num = 10, ptr = 0x0, next_bio = 0x0,
prev_bio = 0x0, references = 1, num_read = 1715, num_write = 94657, ex_data
= {sk = 0x0, dummy = 0}}
(gdb) c
Continuing.
^C /* pressed Ctrl+C */
Program received signal SIGINT, Interrupt.
0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
28      ../sysdeps/unix/sysv/linux/x86_64/send.c: No such file or directory.
(gdb) bt
#0  0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#1  0x00000000005b3b0e in my_sock_write (h=0x20cf6f0, buf=<optimized out>,
size=<optimized out>) at be-secure.c:451
#2  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
#3  0x00007f35d5e84fc0 in ssl3_write_pending (s=0x20ced10, type=<optimized
out>, buf=<optimized out>, len=<optimized out>) at s3_pkt.c:881
#4  0x00007f35d5e856c4 in ssl3_write_bytes (s=0x20ced10, type=23,
buf_=0x20ccd00, len=<optimized out>) at s3_pkt.c:609
#5  0x00000000005b47e8 in secure_write (port=0x20cc890, ptr=0x20ccd00,
len=<optimized out>) at be-secure.c:352
#6  0x00000000005be41d in internal_flush () at pqcomm.c:1222
#7  0x00000000005be57d in internal_putbytes (s=0x2208d7b
"\002\061\066\377\377\377\377", len=186) at pqcomm.c:1168
#8  0x00000000005bf61b in pq_putmessage (msgtype=68 'D', s=0x2208d60 "",
len=<optimized out>) at pqcomm.c:1365
#9  0x00000000005c0234 in pq_endmessage (buf=0x7fff45fb48c0) at
pqformat.c:346
#10 0x0000000000461bfc in printtup (slot=0x2208650, self=0x2202210) at
printtup.c:359
#11 0x000000000058b276 in ExecutePlan (dest=0x2202210, direction=<optimized
out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
planstate=0x2207480, estate=0x2207370) at execMain.c:1420
#12 standard_ExecutorRun (queryDesc=0x21f43d0, direction=<optimized out>,
count=0) at execMain.c:303
#13 0x00007f35d0e7c58d in pgss_ExecutorRun (queryDesc=0x21f43d0,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:719
#14 0x0000000000668687 in PortalRunSelect (portal=0x21f6b20,
forward=<optimized out>, count=0, dest=0x2202210) at pquery.c:946
#15 0x0000000000669b40 in PortalRun (portal=0x21f6b20,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2202210,
altdest=0x2202210, completionTag=0x7fff45fb4e70 "") at pquery.c:790
#16 0x0000000000665ac4 in exec_simple_query (query_string=0x20a0c50 "select
* from zcat_data.price_archive limit 1000000 offset 10000000;") at
postgres.c:1046
#17 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x2088800 "integration_catalog1_db", username=<optimized out>) at
postgres.c:3966
#18 0x000000000062304b in BackendRun (port=0x20cc890) at postmaster.c:3614
#19 BackendStartup (port=0x20cc890) at postmaster.c:3304
#20 ServerLoop () at postmaster.c:1367
#21 0x0000000000623b21 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1127
#22 0x000000000045e872 in main (argc=3, argv=0x2086940) at main.c:199
(gdb) frame 2
#2  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
247     bio_lib.c: No such file or directory.
(gdb) print *b
$3 = {method = 0xb4bc00, callback = 0, cb_arg = 0x0, init = 1, shutdown =
0, flags = 10, retry_reason = 0, num = 10, ptr = 0x0, next_bio = 0x0,
prev_bio = 0x0, references = 1, num_read = 1715, num_write = 94657, ex_data
= {sk = 0x0, dummy = 0}}
flags == 10 (0x2 - write | 0x8 - retry) -- these flags will stay set forever
and lead to the infinite loop in the ssize_t secure_write(Port *port, void *ptr,
size_t len) function
// As one can see, at this point the value of flags is already 10 (0x2 | 0x8)
== write | retry, and my_sock_write will never change it...
// Well, let's wait for SIGPIPE
(gdb) c
Continuing.
and ... after a very long wait, here it is


Program received signal SIGPIPE, Broken pipe.
0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
28      ../sysdeps/unix/sysv/linux/x86_64/send.c: No such file or directory.
(gdb) c
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
28      in ../sysdeps/unix/sysv/linux/x86_64/send.c
(gdb)
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
28      in ../sysdeps/unix/sysv/linux/x86_64/send.c
(gdb)
Continuing.

Program received signal SIGPIPE, Broken pipe.
0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
28      in ../sysdeps/unix/sysv/linux/x86_64/send.c
(gdb) bt
#0  0x00007f35d4e314d2 in __libc_send (fd=10, buf=0x20ea96a, n=7750,
flags=<optimized out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#1  0x00000000005b3b0e in my_sock_write (h=0x20cf6f0, buf=<optimized out>,
size=<optimized out>) at be-secure.c:451
#2  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
#3  0x00007f35d5e84fc0 in ssl3_write_pending (s=0x20ced10, type=<optimized
out>, buf=<optimized out>, len=<optimized out>) at s3_pkt.c:881
#4  0x00007f35d5e856c4 in ssl3_write_bytes (s=0x20ced10, type=23,
buf_=0x20ccd00, len=<optimized out>) at s3_pkt.c:609
#5  0x00000000005b47e8 in secure_write (port=0x20cc890, ptr=0x20ccd00,
len=<optimized out>) at be-secure.c:352
#6  0x00000000005be41d in internal_flush () at pqcomm.c:1222
#7  0x00000000005be57d in internal_putbytes (s=0x2208d7b
"\002\061\066\377\377\377\377", len=186) at pqcomm.c:1168
#8  0x00000000005bf61b in pq_putmessage (msgtype=68 'D', s=0x2208d60 "",
len=<optimized out>) at pqcomm.c:1365
#9  0x00000000005c0234 in pq_endmessage (buf=0x7fff45fb48c0) at
pqformat.c:346
#10 0x0000000000461bfc in printtup (slot=0x2208650, self=0x2202210) at
printtup.c:359
#11 0x000000000058b276 in ExecutePlan (dest=0x2202210, direction=<optimized
out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
planstate=0x2207480, estate=0x2207370) at execMain.c:1420
#12 standard_ExecutorRun (queryDesc=0x21f43d0, direction=<optimized out>,
count=0) at execMain.c:303
#13 0x00007f35d0e7c58d in pgss_ExecutorRun (queryDesc=0x21f43d0,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:719
#14 0x0000000000668687 in PortalRunSelect (portal=0x21f6b20,
forward=<optimized out>, count=0, dest=0x2202210) at pquery.c:946
#15 0x0000000000669b40 in PortalRun (portal=0x21f6b20,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2202210,
altdest=0x2202210, completionTag=0x7fff45fb4e70 "") at pquery.c:790
#16 0x0000000000665ac4 in exec_simple_query (query_string=0x20a0c50 "select
* from zcat_data.price_archive limit 1000000 offset 10000000;") at
postgres.c:1046
#17 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x2088800 "integration_catalog1_db", username=<optimized out>) at
postgres.c:3966
#18 0x000000000062304b in BackendRun (port=0x20cc890) at postmaster.c:3614
#19 BackendStartup (port=0x20cc890) at postmaster.c:3304
#20 ServerLoop () at postmaster.c:1367
#21 0x0000000000623b21 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1127
#22 0x000000000045e872 in main (argc=3, argv=0x2086940) at main.c:199
(gdb) frame 2
#2  0x00007f35d5b693c7 in BIO_write (b=0x20cf6f0, in=0x20ea96a, inl=7750)
at bio_lib.c:247
247     bio_lib.c: No such file or directory.
(gdb) print (*b)->flags
$4 = 10 -- still retry write flag

Let's reset the value of the flags manually:

(gdb) print (*b)->flags
$4 = 10
(gdb) set variable (*b)->flags = 0
(gdb) print (*b)->flags
$5 = 0
(gdb) c
Continuing.
[Inferior 1 (process 17556) exited with code 01]


And the hanging process exited, going through the normal error path for a lost
connection:

2013-12-05 11:44:39.130
CET,"myuser","mydb",17556,"127.0.0.1:50245",52a04e6f.4494,5,"SELECT",2013-12-05
10:59:11 CET,14/5,0,LOG,08006,"could not send data to client: Broken
pipe",,,,,,"select * from test limit 1000000 offset 10000000;",,,"psql"
2013-12-05 11:44:39.130
CET,"myuser","mydb",17556,"127.0.0.1:50245",52a04e6f.4494,6,"SELECT",2013-12-05
10:59:11 CET,14/5,0,FATAL,08006,"connection to client lost",,,,,,"select *
from test limit 1000000 offset 10000000;",,,"psql"

To me, the fix for this issue should look like the following:

----------------------------------------------------------------------------------------------------------------------------
static int
my_sock_write(BIO *h, const char *buf, int size)
{
        int                     res = 0;

        res = send(h->num, buf, size, 0);
+        BIO_clear_retry_flags(h);
        if (res <= 0)
        {
                if (errno == EINTR)
                {
                        BIO_set_retry_write(h);
                }
        }

        return res;
}

----------------------------------------------------------------------------------------------------------------------------

We have already tried it on one of our databases, and it has been working
without any problems.

Best regards,
Alexander Kukushkin
