Thread: BUG #18732: Segfault in pgbench on max_connections starvation

BUG #18732: Segfault in pgbench on max_connections starvation

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18732
Logged by:          Mikhail Kot
Email address:      mikhail@neon.tech
PostgreSQL version: 16.6
Operating system:   Debian 12
Description:

When --client connections in pgbench exceed max_connections in postgres,
pgbench 16 sometimes exits with a segfault when a (presumably) SSL certificate
validation error occurs.

OpenSSL version: 3.0.15-1~deb12u1
pgbench version: 16.6 (f5cfc6fa898544050e821ac688adafece1ac3cff)
pgbench params: pgbench postgresql://REDACTED/neondb?sslmode=require -c 2000
-T 60 -P 1 -j 20 --protocol=prepared

#0  0x00007f097342d3f0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1  0x00007f097342da0a in OPENSSL_LH_retrieve () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#2  0x00007f097346a283 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3  0x00007f097340bced in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4  0x00007f097340c122 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007f09733f60ba in EVP_MD_fetch () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007f09733f67f0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007f097342899f in HMAC_Init_ex () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007f09737ba60a in pg_hmac_init (ctx=0x7f09484a7360,
key=0x7f09484a7880 "R1M6EFcoABKs", len=12)
    at /home/myrrc/neon/vendor/postgres-v16/src/common/hmac_openssl.c:174
#9  0x00007f09737b54ef in scram_SaltedPassword (password=0x7f09484a7880
"R1M6EFcoABKs", 
    hash_type=PG_SHA256, key_length=32, salt=0x7f09484a78c0
"\260\376E\302@\0341Z\025%'\244H", 
    saltlen=16, iterations=4096, 
    result=0x7f09484a4118
"\235\245\371\260\203\243矮\357\224\305F\204F\341K\212\025$#\030CL\"\325ɑ\247\021os\340=IH\t\177",
errstr=0x7f09719fead8)
    at /home/myrrc/neon/vendor/postgres-v16/src/common/scram-common.c:87
#10 0x00007f097379452c in calculate_client_proof (state=0x7f09484a40f0, 
    client_final_message_without_proof=0x7f0948489920

"c=cD10bHMtc2VydmVyLWVuZC1wb2ludCwstoyKkoGIYqGK5C4vgGtRjvNeDwvmGQlaYHBXl8ZybAA=,r=RbPpYlql+b/rBgDtitBWxtAdW9BcFuPI9WsP7VCILEORedB6",

    result=0x7f09719feaf0 "0", errstr=0x7f09719fead8)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:788
#11 0x00007f0973793e55 in build_client_final_message
(state=0x7f09484a40f0)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:565
#12 0x00007f0973793403 in scram_exchange (opaq=0x7f09484a40f0, 
    input=0x7f09484a60a0
"r=RbPpYlql+b/rBgDtitBWxtAdW9BcFuPI9WsP7VCILEORedB6", inputlen=84, 
    output=0x7f09719febe0, outputlen=0x7f09719febdc, done=0x7f09719febdb,
success=0x7f09719febda)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:255
#13 0x00007f09737b002e in pg_SASL_continue (conn=0x7f0948486540,
payloadlen=84, final=false)
#14 0x00007f09737af729 in pg_fe_sendauth (areq=11, payloadlen=84,
conn=0x7f0948486540)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth.c:1139
#15 0x00007f0973798c5d in PQconnectPoll (conn=0x7f0948486540)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3802
#16 0x00007f0973794c9c in connectDBComplete (conn=0x7f0948486540)
    at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#17 0x00007f09737949b4 in PQconnectdbParams (keywords=0x7f09719ff890,
values=0x7f09719ff850, 
    expand_dbname=1) at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#18 0x0000558da1510b5e in doConnect ()
    at /home/myrrc/neon/vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#19 0x0000558da15113d0 in threadRun (arg=0x558db50ebce0)
    at /home/myrrc/neon/vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#20 0x00007f09730a81c4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x00007f097312885c in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Steps to reproduce:
1. Launch a postgres server with max_connections=900
2. Launch pgbench a couple of times with -c 2000

I was also able to reproduce this error by running multiple pgbench instances
with the same launch parameters. This error doesn't reproduce on pgbench 17.2
or 15.10.
I can provide the coredump upon request.


Re: BUG #18732: Segfault in pgbench on max_connections starvation

From
Heikki Linnakangas
Date:
On 03/12/2024 14:23, PG Bug reporting form wrote:
> When --client connections in pgbench exceed max_connections in postgres,
> pgbench 16 sometimes exits with a segfault when a (presumably) SSL certificate
> validation error occurs.
> 
> ...
> 
> Steps to reproduce:
> 1. Launch a postgres server with max_connections=900
> 2. Launch pgbench a couple of times with -c 2000
> 
> I was also able to reproduce this error by running multiple pgbench instances
> with the same launch parameters. This error doesn't reproduce on pgbench 17.2
> or 15.10.
> I can provide the coredump upon request.

I was able to reproduce this on both REL_16_STABLE and REL_17_STABLE. 
Didn't try v15, but I presume this issue is present in all branches (see 
analysis below).

Backtrace from thread 1:

#0  0x00007f19dfa55516 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1  0x00007f19dfa55bce in OPENSSL_LH_retrieve () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#2  0x00007f19dfb456d5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3  0x00007f19dfa2e943 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4  0x00007f19dfa2edc1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007f19dfa17eee in EVP_MD_fetch () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007f19dfa1855b in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007f19dfa4c22a in HMAC_Init_ex () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007f19e00a9296 in pg_hmac_init (ctx=ctx@entry=0x7f19cc51bb90, 
key=key@entry=0x7f19cc50d560 "foo", len=len@entry=3) at 
../src/common/hmac_openssl.c:180
#9  0x00007f19e00a62b0 in scram_SaltedPassword (password=0x7f19cc50d560 
"foo", hash_type=<optimized out>, key_length=32, salt=<optimized out>, 
saltlen=<optimized out>, iterations=4096,
     result=0x7f19cc51bb08 
"w\351אI\256\035\330\003y\021ւ\205\327ƿ\217Q\332\362}\a\0364\243^\324\321a\034H0\250P\314\031\177", 
errstr=0x7f19dd4bb928) at ../src/common/scram-common.c:87
#10 0x00007f19e0089bcd in calculate_client_proof (state=0x7f19cc51bae0,
     client_final_message_without_proof=0x7f19cc50b040 

"c=cD10bHMtc2VydmVyLWVuZC1wb2ludCwsvkIO06ZPSH1cmElOgC2DbPafilVET0yej6RhzH30Rzw=,r=Wkk2fofG+RP23HT1tBMqx0ijin6taf2xdjPuJBYqBqw2853/",


     result=<optimized out>, errstr=<optimized out>) at 
../src/interfaces/libpq/fe-auth-scram.c:788
#11 build_client_final_message (state=0x7f19cc51bae0) at 
../src/interfaces/libpq/fe-auth-scram.c:565
#12 scram_exchange (opaq=0x7f19cc51bae0, input=<optimized out>, 
inputlen=<optimized out>, output=0x7f19dd4bba28, outputlen=<optimized 
out>, done=<optimized out>, success=<optimized out>)
     at ../src/interfaces/libpq/fe-auth-scram.c:255
#13 0x00007f19e008a642 in pg_SASL_continue (conn=0x7f19cc4ff1f0, 
payloadlen=84, final=<optimized out>) at 
../src/interfaces/libpq/fe-auth.c:654
#14 pg_fe_sendauth (areq=11, payloadlen=84, 
conn=conn@entry=0x7f19cc4ff1f0) at ../src/interfaces/libpq/fe-auth.c:1139
#15 0x00007f19e008f756 in PQconnectPoll (conn=conn@entry=0x7f19cc4ff1f0) 
at ../src/interfaces/libpq/fe-connect.c:3802
#16 0x00007f19e008bae8 in connectDBComplete 
(conn=conn@entry=0x7f19cc4ff1f0) at 
../src/interfaces/libpq/fe-connect.c:2511
#17 0x00007f19e008b2bf in PQconnectdbParams 
(keywords=keywords@entry=0x7f19dd4bc1f0, 
values=values@entry=0x7f19dd4bc1b0, expand_dbname=expand_dbname@entry=1)
     at ../src/interfaces/libpq/fe-connect.c:685
#18 0x000056350c35efa5 in doConnect () at ../src/bin/pgbench/pgbench.c:1560
#19 0x000056350c35f2c5 in threadRun (arg=0x56350d1184a0) at 
../src/bin/pgbench/pgbench.c:7396
#20 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at 
./nptl/pthread_create.c:447
#21 0x00007f19dfe998f8 in __GI___clone3 () at 
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2:

#0  0x00007f19dfe28a04 in _int_free_merge_chunk 
(av=av@entry=0x7f19dff70ac0 <main_arena>, p=0x56350d126280, size=144) at 
./malloc/malloc.c:4675
#1  0x00007f19dfe28d31 in _int_free (av=0x7f19dff70ac0 <main_arena>, 
p=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at 
./malloc/malloc.c:4646
#2  0x00007f19dfe2b4ff in __GI___libc_free (mem=<optimized out>) at 
./malloc/malloc.c:3398
#3  0x00007f19dfa5580e in OPENSSL_LH_free () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#4  0x00007f19dfb4489f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007f19dfa6e0e7 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007f19dfb44c35 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007f19dfa565a5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007f19dfa56aa0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#9  0x00007f19dfa5ac32 in OPENSSL_cleanup () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#10 0x00007f19dfdcb1e1 in __run_exit_handlers (status=status@entry=1, 
listp=0x7f19dff70680 <__exit_funcs>, 
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
     at ./stdlib/exit.c:108
#11 0x00007f19dfdcb29a in __GI_exit (status=status@entry=1) at 
./stdlib/exit.c:138
#12 0x000056350c362ae6 in threadRun (arg=<optimized out>) at 
../src/bin/pgbench/pgbench.c:7399
#13 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at 
./nptl/pthread_create.c:447
#14 0x00007f19dfe998f8 in __GI___clone3 () at 
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Sometimes you also get this error instead of a crash, which is 
presumably another symptom of the same race condition:

pgbench (16.6, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432 
failed: FATAL:  sorry, too many clients already
pgbench: error: could not create connection for client 1145
pgbench: error: connection to server at "localhost" (::1), port 5432 
failed: could not verify server signature: OpenSSL failure

Once I also got this:

pgbench (17.2, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432 
failed: FATAL:  sorry, too many clients already
pgbench: error: could not create connection for client 1045
k5_mutex_lock: Received error 22 (Invalid argument)
*** %n in writable segment detected ***

It looks like a race condition between OpenSSL's exit handler and the
HMAC_Init_ex() call in another thread. I think we could use the
OPENSSL_INIT_NO_ATEXIT option to prevent the atexit handler from 
running. The OpenSSL man page on OPENSSL_init_crypto says:

> OPENSSL_INIT_NO_ATEXIT
> 
> By default OpenSSL will attempt to clean itself up when the process
> exits via an "atexit" handler. Using this option suppresses that
> behaviour. This means that the application will have to clean up
> OpenSSL explicitly using OPENSSL_cleanup().

I don't understand why that cleanup would be needed. When the program 
exits, all resources are gone anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: BUG #18732: Segfault in pgbench on max_connections starvation

From
Andres Freund
Date:
Hi,

On 2024-12-03 16:52:32 +0200, Heikki Linnakangas wrote:
> It looks like a race condition between OpenSSL's exit handler and the
> HMAC_Init_ex() call in another thread. I think we could use the
> OPENSSL_INIT_NO_ATEXIT option to prevent the atexit handler from running.
> The OpenSSL man page on OPENSSL_init_crypto says:

Using exit() while another thread is running is, IIRC, undefined behaviour,
regardless of OPENSSL_INIT_NO_ATEXIT's pointlessness. The whole atexit()
mechanism is not threadsafe; two threads exit()ing at the same time can
cause a lot of havoc.

Short term it's probably easiest to just use _exit(). Medium term I think we
should just exit individual threads - which would probably require the main
thread to not run a benchmark itself.
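
A rough sketch of that medium-term shape, under stated assumptions: the names
worker_arg, worker_main() and run_workers() are hypothetical, should_fail
stands in for a failed doConnect(), and pgbench's real thread structures
differ. A failing worker records the error and returns, and only the main
thread, which runs no benchmark itself, decides the process exit status after
joining everyone.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical per-thread state; pgbench's real structures differ. */
typedef struct
{
    int         id;
    int         should_fail;    /* stands in for a failed doConnect() */
} worker_arg;

static atomic_int failed_workers;

static void *
worker_main(void *p)
{
    worker_arg *arg = p;

    if (arg->should_fail)
    {
        /* Record the failure and end only this thread; no exit() here. */
        atomic_fetch_add(&failed_workers, 1);
        return NULL;
    }
    /* ... the benchmark loop would run here ... */
    return NULL;
}

/*
 * Start nworkers threads, wait for all of them, and return the status the
 * process should exit with. Only the caller (the main thread) ever exits.
 * Pass failing_id = -1 to simulate a run without failures.
 */
int
run_workers(int nworkers, int failing_id)
{
    pthread_t  *threads;
    worker_arg *args;
    int         started = 0;
    int         status;

    if (nworkers <= 0)
        return EXIT_FAILURE;
    threads = calloc((size_t) nworkers, sizeof(*threads));
    args = calloc((size_t) nworkers, sizeof(*args));
    if (threads == NULL || args == NULL)
    {
        free(threads);
        free(args);
        return EXIT_FAILURE;
    }

    atomic_store(&failed_workers, 0);
    for (int i = 0; i < nworkers; i++)
    {
        args[i].id = i;
        args[i].should_fail = (i == failing_id);
        if (pthread_create(&threads[i], NULL, worker_main, &args[i]) != 0)
        {
            atomic_fetch_add(&failed_workers, 1);
            break;
        }
        started++;
    }
    assert(started <= nworkers);

    /* No thread is still running once every started worker is joined. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    status = (atomic_load(&failed_workers) == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
    free(threads);
    free(args);
    return status;
}
```

A main() built on this would simply end with "return run_workers(...)", so the
atexit handlers run only after all other threads have finished.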


> > By default OpenSSL will attempt to clean itself up when the process
> > exits via an "atexit" handler. Using this option suppresses that
> > behaviour. This means that the application will have to clean up
> > OpenSSL explicitly using OPENSSL_cleanup().
> 
> I don't understand why that cleanup would be needed. When the program exits,
> all resources are gone anyway.

Somewhat random aside: This is also bad for postgres performance. Postmaster
initializes OpenSSL. When a child exits, it runs OPENSSL_cleanup(), completely
pointlessly, which modifies a lot of data structures that were set up in
postmaster. That, in turn, requires all those pages to be copied on write,
just for the copies to be discarded immediately at process exit.
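
To make the mechanism concrete, here is a small POSIX sketch. It is only an
illustration, not postgres code: cleanup_handler() stands in for
OPENSSL_cleanup(), and child_runs_cleanup() is a hypothetical helper. A forked
child inherits the atexit handler registered by its parent and runs it when
leaving through exit(), but not when leaving through _exit(), as suggested
above for pgbench.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; only ever set inside a forked child. */
static int  report_fd = -1;

/* Stand-in for an expensive library cleanup such as OPENSSL_cleanup(). */
static void
cleanup_handler(void)
{
    if (report_fd >= 0 && write(report_fd, "x", 1) < 0)
    {
        /* nothing useful to do about a failed write in this sketch */
    }
}

/*
 * Fork a child that inherits an atexit handler from its parent and leaves
 * through exit() or _exit(). Returns 1 if the handler ran in the child,
 * 0 if it did not, and -1 on error.
 */
int
child_runs_cleanup(int use_underscore_exit)
{
    static int  registered = 0;
    int         fds[2];
    int         status;
    pid_t       pid;
    char        c;
    ssize_t     n;

    if (!registered)
    {
        if (atexit(cleanup_handler) != 0)
            return -1;
        registered = 1;
    }
    if (pipe(fds) != 0)
        return -1;

    fflush(NULL);               /* avoid duplicated stdio output in the child */
    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
    {
        /* Child: the handler registered by the parent is inherited. */
        close(fds[0]);
        report_fd = fds[1];
        if (use_underscore_exit)
            _exit(0);           /* skips atexit handlers */
        exit(0);                /* runs cleanup_handler() */
    }

    /* Parent: end-of-file without data means the handler never ran. */
    close(fds[1]);
    n = read(fds[0], &c, 1);
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0 || n < 0)
        return -1;
    return n == 1;
}
```

Here child_runs_cleanup(0) reports that the inherited handler ran in the
child, while child_runs_cleanup(1) reports that it did not.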

Greetings,

Andres Freund