libpq's multi-threaded SSL callback handling is busted - Mailing list pgsql-hackers
From | Jan Urbański |
---|---|
Subject | libpq's multi-threaded SSL callback handling is busted |
Date | |
Msg-id | 871tlzrlkq.fsf@wulczer.org Whole thread Raw |
Responses |
Re: libpq's multi-threaded SSL callback handling is busted
Re: libpq's multi-threaded SSL callback handling is busted |
List | pgsql-hackers |
I did some more digging on bug http://www.postgresql.org/message-id/CAHUL3dpWYFnUgdgo95OHYDQ4kugdnBKPTjq0mNbTuBhCMG4xvQ@mail.gmail.com which describes a deadlock when using libpq with SSL in a multi-threaded environment with other threads doing SSL independently. Attached is a reproducing Python script in my experience is faster at showing the problem. Run it with python -u pg_lock.py As Heikki correctly diagnosed, the problem is with libpq unsetting the OpenSSL locking callback while another thread is holding one of the locks. The other thread then never releases the lock and the next time anyone tries to take it, the application deadlocks. The exact way it goes down is like this: T1 (libpq) T2 (Python) start libpq connection init ssl system add locking callback start ssl operation take lock finish libpq connection destroy ssl system remove locking callback (!) release lock, noop since no callback And the next time any thread tries to take the lock, it deadlocks. We added unsetting the locking callback in 4e816286533dd34c10b368487d4079595a3e1418 due to this bug report: http://www.postgresql.org/message-id/48620925.6070806@pws.com.au Indeed, commenting out the CRYPTO_set_locking_callback(NULL) call in fe-secure-openssl.c gets rid of the deadlock. However, it makes php segfault with the (attached) reproduction script from the original 2008 bug report. If your php.ini loads both the pgsql and curl extensions, reproduce the segfault with: php -f pg_segfault.php The most difficult part about fixing this bug is to determine *who's at fault*. I now lean towards the opinion that we shouldn't be messing with OpenSSL callbacks *at all*. First of all, the current behaviour is crazy. We're setting and unsetting the locking callback every time a connection is made/closed, which is not how OpenSSL is supposed to be used. The *application* using libpq should set a callback before it starts threads, it's no business of the library's. The old behaviour was slightly less insane (set callbacks first time we're engaging OpenSSL code, never touch them again). The real sane solution is to leave it up to the application. I posit we should remove all CRYPTO_set_*_callback functions and associated cruft from libpq. This unfortunately means that multi-threaded applications using libpq and SSL will break if they haven't been setting their own callbacks (if they have, well, tough luck! libpq will just stomp over them the first time it connects to Postgres, but at least *some* callbacks are left present after that). However, AFAICS if your app is not in C, then runtimes already handle that for you (as they should). Python: https://hg.python.org/cpython/file/dc820b44ce21/Modules/_ssl.c#l4284 PHP: https://github.com/php/php-src/blob/master/ext/curl/interface.c#L1235 Note that the PHP pgsql extension doesn't set the OpenSSL callbacks, because libpq was setting them on its own. If we remove the callback handling from libpq, PHP will need to add them. By the way, the MySQL extension for PHP also does not set those callbacks. Let me reiterate: I now believe the callbacks should be set by the application, libraries should not touch them, since they don't know what will they be stomping on. If the application is run through a VM like Python or PHP, it's the VM that should make sure the callbacks are set. I could submit a patch to get rid of the crazy CRYPTO_*_callback dance in libpq, but at the very least this will require a warning in the release notes about how you can't assume that libpq will take care of making sure your app is multi-threaded safe when using OpenSSL. I also don't know how far that's back-patcheable. I would very much like to have this change back-patched, since setting and resetting the callback makes using libpq in a threaded OpenSSL-enabled app arguably less safe than if it didn't use any locking. If the app is written correctly, it will have set locking callbacks before starting. Then libpq will happily stomp on them. If the app hasn't set callbacks, it wasn't written correctly in the first place and it will get segfaults instead of deadlocks. Thanks, Jan #!/usr/bin/env python import sys import time import threading import urllib import psycopg2 DB_HOST = '127.0.0.1' DB_USER = 'postgres' DB_NAME = 'postgres' HTTPS_URL = 'https://google.com' NUM_HTTPS_THREADS = 10 def connect(): conn = psycopg2.connect( host=DB_HOST, user=DB_USER, database=DB_NAME, sslmode='require') return conn class Worker(threading.Thread): def __init__(self): super(Worker, self).__init__() self._stop = threading.Event() def stop(self): self._stop.set() def stopped(self): return self._stop.isSet() def run(self): i = 0 while not self.stopped(): i += 1 self.do_work(i) class Libpq(Worker): def do_work(self, i): conn = connect() cur = conn.cursor() cur.execute('SELECT 1') cur.close() conn.close() if not i % 50: sys.stdout.write('+') class HTTPS(Worker): def do_work(self, i): urllib.urlopen(HTTPS_URL).read() if not i % 50: sys.stdout.write('=') def main(): # make sure libpq has set its locking callbacks conn = connect() cur = conn.cursor() cur.execute('SELECT 1') cur.close() conn.close() threads = [Libpq()] threads += [HTTPS() for _ in range(NUM_HTTPS_THREADS)] print 'starting worker threads' try: for t in threads: t.start() while True: time.sleep(1000) except KeyboardInterrupt: for t in threads: t.stop() for t in threads: t.join() if __name__ == "__main__": sys.exit(main())
Attachment
pgsql-hackers by date: