Thread: [PATCH] fix race condition in libpq (related to ssl connections)

[PATCH] fix race condition in libpq (related to ssl connections)

From
Willi Mann
Date:
Hi,

I've found a race condition in libpq. It is about the initialization of
the my_bio_methods static variable in fe-secure-openssl.c, which is not
protected by any lock. The race condition may make the initialization of
the connection fail, and as an additional weird consequence, it might
cause openssl call close(0), so stdin of the client application gets
closed.

I've prepared a patch to protect the initialization of my_bio_methods
from the race. This is my first patch submission to the postgresql
project, so I hope I didn't miss anything. Any comments and suggestions
are of course very welcome.

I also prepared a testcase. In the testcase tarball, there is a patch
that adds sleeps at the right positions to make the close(0) problem
occur practically always. It also includes comments to explain how the
race can end up calling close(0).

Concerning the patch, it is only tested on Linux. I'm unsure about
whether the simple initialization of the mutex would work nowadays also
on Windows or whether the more complicated initialization also to be
found for the ssl_config_mutex in the same source file needs to be used.
Let me know whether I should adapt that.

We discovered the problem with release 11.5, but the patch and the 
testcase are against the master branch.

Regards,
Willi

-- 
___________________________________________________

Dr. Willi Mann | Principal Software Engineer, Tech Lead PQL

Celonis SE | Theresienstrasse 6 | 80333 Munich, Germany
F: +4989416159679
w.mann@celonis.com | www.celonis.com | LinkedIn | Twitter | Xing

AG Munich HRB 225439 | Management: Martin Klenk, Bastian Nominacher, 
Alexander Rinke
Attachment

Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Aleksander Alekseev
Date:
Hi,

> I've found a race condition in libpq. It is about the initialization of
> the my_bio_methods static variable in fe-secure-openssl.c, which is not
> protected by any lock. The race condition may make the initialization of
> the connection fail, and as an additional weird consequence, it might
> cause openssl call close(0), so stdin of the client application gets
> closed.

Thanks for the patch. Interestingly enough we have PQisthreadsafe()
[1], but it's current implementation is:

```
/* libpq is thread-safe? */
int
PQisthreadsafe(void)
{
    return true;
}
```

I wonder if we should just document that libpq is thread safe as of PG
v??? and deprecate PQisthreadsafe() at some point. Currently the
documentation gives an impression that the library may or may not be
thread safe depending on the circumstances.

> I've prepared a patch to protect the initialization of my_bio_methods
> from the race. This is my first patch submission to the postgresql
> project, so I hope I didn't miss anything. Any comments and suggestions
> are of course very welcome.
>
> I also prepared a testcase. In the testcase tarball, there is a patch
> that adds sleeps at the right positions to make the close(0) problem
> occur practically always. It also includes comments to explain how the
> race can end up calling close(0).
>
> Concerning the patch, it is only tested on Linux. I'm unsure about
> whether the simple initialization of the mutex would work nowadays also
> on Windows or whether the more complicated initialization also to be
> found for the ssl_config_mutex in the same source file needs to be used.
> Let me know whether I should adapt that.

Please add the patch to the nearest open commit fest [2]. The patch
will be automatically picked up by cfbot [3] and tested on different
platforms. Also this way it will not be lost among other patches.

The code looks OK but I would appreciate a second opinion from cfbot.
Also maybe a few comments before my_BIO_methods_init_mutex and/or
pthread_mutex_lock would be appropriate. Personally I am inclined to
think that the automatic test in this particular case is redundant.

[1]: https://www.postgresql.org/docs/current/libpq-threading.html
[2]: https://commitfest.postgresql.org/
[3]: http://cfbot.cputube.org/

-- 
Best regards,
Aleksander Alekseev



Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Michael Paquier
Date:
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:

Thanks for the report, Willi, and the test case!  Thanks Aleksander
for the reply.

> I wonder if we should just document that libpq is thread safe as of PG
> v??? and deprecate PQisthreadsafe() at some point. Currently the
> documentation gives an impression that the library may or may not be
> thread safe depending on the circumstances.

Because --disable-thread-safe has been removed recently in
68a4b58eca03.  The routine could be marked as deprecated on top of
saying that it always returns 1 for 17~.

> Please add the patch to the nearest open commit fest [2]. The patch
> will be automatically picked up by cfbot [3] and tested on different
> platforms. Also this way it will not be lost among other patches.
>
> The code looks OK but I would appreciate a second opinion from cfbot.
> Also maybe a few comments before my_BIO_methods_init_mutex and/or
> pthread_mutex_lock would be appropriate. Personally I am inclined to
> think that the automatic test in this particular case is redundant.

I am not really convinced that we require a second mutex here, as
there is always a concern with inter-locking changes.  I may be
missing something, of course, but BIO_s_socket() is just a pointer to
a set of callbacks hidden in bss_sock.c with BIO_meth_new() and
BIO_get_new_index() assigning some centralized data to handle the
methods in a controlled way in OpenSSL.  We only case about
initializing once for the sake of libpq's threads, so wouldn't it be
better to move my_BIO_s_socket() in pgtls_init() where the
initialization of the BIO methods would be protected by
ssl_config_mutex?  That's basically what Willi means in his first
message, no?
--
Michael

Attachment

Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Michael Paquier
Date:
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:
> Please add the patch to the nearest open commit fest [2]. The patch
> will be automatically picked up by cfbot [3] and tested on different
> platforms. Also this way it will not be lost among other patches.

I have noticed that this was not tracked yet, so I have added an entry
here:
https://commitfest.postgresql.org/46/4670/

Willi, note that this requires a PostgreSQL community account, and it
does not seem like you have one, or I would have added you as author
;)
--
Michael

Attachment

Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Thomas Munro
Date:
On Wed, Nov 22, 2023 at 2:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:
> > I wonder if we should just document that libpq is thread safe as of PG
> > v??? and deprecate PQisthreadsafe() at some point. Currently the
> > documentation gives an impression that the library may or may not be
> > thread safe depending on the circumstances.
>
> Because --disable-thread-safe has been removed recently in
> 68a4b58eca03.  The routine could be marked as deprecated on top of
> saying that it always returns 1 for 17~.

See also commit ce0b0fa3 "Doc: Adjust libpq docs about thread safety."



Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Michael Paquier
Date:
On Wed, Nov 22, 2023 at 02:58:15PM +1300, Thomas Munro wrote:
> On Wed, Nov 22, 2023 at 2:44 PM Michael Paquier <michael@paquier.xyz> wrote:
>> Because --disable-thread-safe has been removed recently in
>> 68a4b58eca03.  The routine could be marked as deprecated on top of
>> saying that it always returns 1 for 17~.
>
> See also commit ce0b0fa3 "Doc: Adjust libpq docs about thread safety."

Sure, I've noticed that in the docs, but declaring it as deprecated is
a different topic, and we don't actually use this term in the docs for
this routine, no?
--
Michael

Attachment

Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Michael Paquier
Date:
On Wed, Nov 22, 2023 at 10:43:32AM +0900, Michael Paquier wrote:
> I am not really convinced that we require a second mutex here, as
> there is always a concern with inter-locking changes.  I may be
> missing something, of course, but BIO_s_socket() is just a pointer to
> a set of callbacks hidden in bss_sock.c with BIO_meth_new() and
> BIO_get_new_index() assigning some centralized data to handle the
> methods in a controlled way in OpenSSL.

I was looking at the origin of this one, and this is an issue coming
down to 8bb14cdd33de that has removed the ssl_config_mutex taken in
pgtls_open_client() when we called my_SSL_set_fd().  The commit has
accidentally missed that path with the static BIO method where the
mutex mattered.

> We only case about
> initializing once for the sake of libpq's threads, so wouldn't it be
> better to move my_BIO_s_socket() in pgtls_init() where the
> initialization of the BIO methods would be protected by
> ssl_config_mutex?  That's basically what Willi means in his first
> message, no?

I've looked at this idea, and finished by being unhappy with the error
handling that we are currently assuming in my_SSL_set_fd() in the
event of an error in the bio method setup, which would be most likely
an OOM, so let's use ssl_config_mutex in my_BIO_s_socket().  Another
thing is that I have minimized the manipulation of my_bio_methods in
the setup routine.

I've been also testing the risks of collusion, and it takes me quite a
a few tries with hundred threads to reproduce the failure even without
any forced sleep, so that seems really hard to reach.

Please find attached a patch (still need to do more checks with older
versions of OpenSSL).  Any thoughts or comments?
--
Michael

Attachment

Re: [PATCH] fix race condition in libpq (related to ssl connections)

From
Michael Paquier
Date:
On Fri, Nov 24, 2023 at 04:48:58PM +0900, Michael Paquier wrote:
> I've looked at this idea, and finished by being unhappy with the error
> handling that we are currently assuming in my_SSL_set_fd() in the
> event of an error in the bio method setup, which would be most likely
> an OOM, so let's use ssl_config_mutex in my_BIO_s_socket().  Another
> thing is that I have minimized the manipulation of my_bio_methods in
> the setup routine.

I've spent more time on that today, and the patch I've posted on
Friday had a small mistake in the non-HAVE_BIO_METH_NEW path when
saving the BIO_METHODs causing the SSL tests to fail with older
OpenSSL versions.  I've fixed that and the patch was straight-forward,
so applied it down to v12.  I didn't use Willi's patch at the end,
still credited him as author as his original patch is rather close to
the result committed and it feels that he has spent a good deal of
time on this issue.
--
Michael

Attachment