libpq's multi-threaded SSL callback handling is busted - Mailing list pgsql-hackers

From Jan Urbański
Subject libpq's multi-threaded SSL callback handling is busted
Date
Msg-id 871tlzrlkq.fsf@wulczer.org
Whole thread Raw
Responses Re: libpq's multi-threaded SSL callback handling is busted
Re: libpq's multi-threaded SSL callback handling is busted
List pgsql-hackers
I did some more digging on bug
http://www.postgresql.org/message-id/CAHUL3dpWYFnUgdgo95OHYDQ4kugdnBKPTjq0mNbTuBhCMG4xvQ@mail.gmail.com
which describes a deadlock when using libpq with SSL in a multi-threaded
environment with other threads doing SSL independently.

Attached is a reproducing Python script in my experience is faster at showing
the problem. Run it with

python -u pg_lock.py

As Heikki correctly diagnosed, the problem is with libpq unsetting the OpenSSL
locking callback while another thread is holding one of the locks. The other
thread then never releases the lock and the next time anyone tries to take it,
the application deadlocks.

The exact way it goes down is like this:



T1 (libpq)                     T2 (Python)

start libpq connection
init ssl system
add locking callback

                               start ssl operation
                               take lock

finish libpq connection
destroy ssl system
remove locking callback

                               (!) release lock, noop since no callback



And the next time any thread tries to take the lock, it deadlocks.

We added unsetting the locking callback in
4e816286533dd34c10b368487d4079595a3e1418 due to this bug report:
http://www.postgresql.org/message-id/48620925.6070806@pws.com.au

Indeed, commenting out the CRYPTO_set_locking_callback(NULL) call in
fe-secure-openssl.c gets rid of the deadlock. However, it makes php segfault
with the (attached) reproduction script from the original 2008 bug report. If
your php.ini loads both the pgsql and curl extensions, reproduce the segfault with:

php -f pg_segfault.php

The most difficult part about fixing this bug is to determine *who's at
fault*. I now lean towards the opinion that we shouldn't be messing with
OpenSSL callbacks *at all*.

First of all, the current behaviour is crazy. We're setting and unsetting the
locking callback every time a connection is made/closed, which is not how
OpenSSL is supposed to be used. The *application* using libpq should set a
callback before it starts threads, it's no business of the library's.

The old behaviour was slightly less insane (set callbacks first time we're
engaging OpenSSL code, never touch them again). The real sane solution is to
leave it up to the application.

I posit we should remove all CRYPTO_set_*_callback functions and associated
cruft from libpq. This unfortunately means that multi-threaded applications
using libpq and SSL will break if they haven't been setting their own callbacks
(if they have, well, tough luck! libpq will just stomp over them the first time
it connects to Postgres, but at least *some* callbacks are left present after
that).

However, AFAICS if your app is not in C, then runtimes already handle that for
you (as they should).

Python:

https://hg.python.org/cpython/file/dc820b44ce21/Modules/_ssl.c#l4284

PHP:

https://github.com/php/php-src/blob/master/ext/curl/interface.c#L1235

Note that the PHP pgsql extension doesn't set the OpenSSL callbacks, because
libpq was setting them on its own. If we remove the callback handling from
libpq, PHP will need to add them. By the way, the MySQL extension for PHP also
does not set those callbacks.

Let me reiterate: I now believe the callbacks should be set by the application,
libraries should not touch them, since they don't know what will they be
stomping on. If the application is run through a VM like Python or PHP, it's
the VM that should make sure the callbacks are set.

I could submit a patch to get rid of the crazy CRYPTO_*_callback dance in
libpq, but at the very least this will require a warning in the release notes
about how you can't assume that libpq will take care of making sure your app is
multi-threaded safe when using OpenSSL. I also don't know how far that's
back-patcheable.

I would very much like to have this change back-patched, since setting and
resetting the callback makes using libpq in a threaded OpenSSL-enabled app
arguably less safe than if it didn't use any locking. If the app is written
correctly, it will have set locking callbacks before starting. Then libpq will
happily stomp on them. If the app hasn't set callbacks, it wasn't written
correctly in the first place and it will get segfaults instead of deadlocks.

Thanks,
Jan
#!/usr/bin/env python
import sys
import time
import threading
import urllib

import psycopg2


DB_HOST = '127.0.0.1'
DB_USER = 'postgres'
DB_NAME = 'postgres'

HTTPS_URL = 'https://google.com'
NUM_HTTPS_THREADS = 10


def connect():
    conn = psycopg2.connect(
        host=DB_HOST, user=DB_USER,
        database=DB_NAME, sslmode='require')
    return conn


class Worker(threading.Thread):

    def __init__(self):
        super(Worker, self).__init__()
        self._stop = threading.Event()

    def stop(self):
        self._stop.set()

    def stopped(self):
        return self._stop.isSet()

    def run(self):
        i = 0
        while not self.stopped():
            i += 1
            self.do_work(i)


class Libpq(Worker):

    def do_work(self, i):
        conn = connect()
        cur = conn.cursor()
        cur.execute('SELECT 1')
        cur.close()
        conn.close()
        if not i % 50:
            sys.stdout.write('+')


class HTTPS(Worker):

    def do_work(self, i):
        urllib.urlopen(HTTPS_URL).read()
        if not i % 50:
            sys.stdout.write('=')


def main():
    # make sure libpq has set its locking callbacks
    conn = connect()
    cur = conn.cursor()
    cur.execute('SELECT 1')
    cur.close()
    conn.close()

    threads = [Libpq()]
    threads += [HTTPS() for _ in range(NUM_HTTPS_THREADS)]

    print 'starting worker threads'

    try:
        for t in threads:
            t.start()
        while True:
            time.sleep(1000)
    except KeyboardInterrupt:
        for t in threads:
            t.stop()

    for t in threads:
        t.join()


if __name__ == "__main__":
    sys.exit(main())

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: RangeType internal use
Next
From: Robert Haas
Date:
Subject: sloppy back-patching of column-privilege leak