Thread: Problem in SQLFreeHandle (Statement)

Problem in SQLFreeHandle (Statement)

From

Jade Koskela

Date:

12 June 2014, 05:30:29

Hello,

I am using psql-odbc 9.3 on OSX. I have one thread running a query, and another thread cancels the query using SQLCancel. The thread running the query successfully cancels and exits the driver code. When we try to free the statement handle, it blocks on a mutex doing ENTER_CONN_CS, or something similar in SQLFreeHandle.

Does anyone have any input on this problem? I don't see who is holding the mutex, the thread running the query is done.

Thanks!

Re: Problem in SQLFreeHandle (Statement)

From

Heikki Linnakangas

Date:

12 June 2014, 16:30:29

On 06/12/2014 05:30 AM, Jade Koskela wrote:
> Hello,
> I am using psql-odbc 9.3 on OSX.

Which version, exactly? There were some changes to locking in
SQLFreeHandle in 09.03.0200. Not sure if those changes might've fixed
this, or caused it.

Can easily you test on another platform, or with different version of
unixODBC?

> I have one thread running a query, and
> another thread cancels the query using SQLCancel. The thread running the
> query successfully cancels and exits the driver code. When we try to free
> the statement handle, it blocks on a mutex doing ENTER_CONN_CS, or
> something similar in SQLFreeHandle.
>
> Does anyone have any input on this problem? I don't see who is holding the
> mutex, the thread running the query is done.

I tried to reproduce this with the attached little test program. It
works fine. Can you post the test program you're using, or modify the
attached so that it exhibits the behavior you're seeing, please?

- Heikki

Attachment

Re: Problem in SQLFreeHandle (Statement)

From

Jade Koskela

Date:

17 June 2014, 23:06:17

I couldn't reproduce the behavior using the test program, or a similar one.

I recompiled the driver to enable extra logging.

Here is the grep on ENTER_CONN_CS, the second address is the address of the mutex (pthread recursive).

This was compiled on OSX with Clang, and then also with GCC with the same result. Behavior doesn't reproduce on Windows.

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[4466282496][[SQLAllocHandle]]

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- new thread grabbed mutex, count=1

[4466282496]**** PGAPI_AllocStmt: hdbc = 0x7f8553857000, stmt = 0x7f8552d17100

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[4466282496]**** PGAPI_Prepare: STMT_ALLOCATED, copy

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=2

[140735257568016]*** LEAVE_CONN_CS [140210608634616] *** <- this thread doesn't own, count=2

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] *** <-thread blocks on mutex here, should be unlocked?

On Thu, Jun 12, 2014 at 6:30 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 06/12/2014 05:30 AM, Jade Koskela wrote:
Hello,
I am using psql-odbc 9.3 on OSX.

Which version, exactly? There were some changes to locking in SQLFreeHandle in 09.03.0200. Not sure if those changes might've fixed this, or caused it.

Can easily you test on another platform, or with different version of unixODBC?

I have one thread running a query, and
another thread cancels the query using SQLCancel. The thread running the
query successfully cancels and exits the driver code. When we try to free
the statement handle, it blocks on a mutex doing ENTER_CONN_CS, or
something similar in SQLFreeHandle.

Does anyone have any input on this problem? I don't see who is holding the
mutex, the thread running the query is done.

I tried to reproduce this with the attached little test program. It works fine. Can you post the test program you're using, or modify the attached so that it exhibits the behavior you're seeing, please?

- Heikki

Attachment

mylog_tail.log

Re: Problem in SQLFreeHandle (Statement)

From

Heikki Linnakangas

Date:

19 June 2014, 15:33:22

On 06/17/2014 11:06 PM, Jade Koskela wrote:
> I couldn't reproduce the behavior using the test program, or a similar one.
> I recompiled the driver to enable extra logging.
> Here is the grep on ENTER_CONN_CS, the second address is the address of the
> mutex (pthread recursive).
> This was compiled on OSX with Clang, and then also with GCC with the same
> result. Behavior doesn't reproduce on Windows.
>
>
> [140735257568016][[SQLFreeHandle]]
>
> [140735257568016]*** ENTER_CONN_CS [140210608634616] ***
>
> [140735257568016]*** ENTER_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [140735257568016][SQLGetConnectAttrW]
>
> [140735257568016]*** ENTER_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [140735257568016][SQLGetConnectAttrW]
>
> [140735257568016]*** ENTER_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [140735257568016][SQLGetConnectAttrW]
>
> [140735257568016]*** ENTER_CONN_CS [140210608634616] ***
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] ***
>
> [4466282496][[SQLAllocHandle]]
>
> [4466282496]*** ENTER_CONN_CS [140210608634616] ***    <- new thread
> grabbed mutex, count=1
>
> [4466282496]**** PGAPI_AllocStmt: hdbc = 0x7f8553857000, stmt =
> 0x7f8552d17100
>
> [4466282496]*** LEAVE_CONN_CS [140210608634616] ***    <- count=0
>
> [4466282496]**** PGAPI_Prepare: STMT_ALLOCATED, copy
>
> [4466282496]*** ENTER_CONN_CS [140210608634616] ***    <- count=1
>
> [4466282496]*** ENTER_CONN_CS [140210608634616] ***    <- count=2
>
> [140735257568016]*** LEAVE_CONN_CS [140210608634616] *** <- this thread
> doesn't own, count=2

A-ha. The problem is that after issuing the query cancel, the
SQLCancel() function calls DiscardStatementSvp(), to discard any
savepoints associated with the statement. SetStatementSvp() and
DiscardStatementSvp() maintain a counter of how many times the CONN_CS
has been acquired, and DiscardStatementSvp() tries to completely release
it by releasing it as many times as the counter says it's been locked.
However, unlocking a pthread mutex from a different thread than the one
that locked it is undefined behavior. Apparently it happens to work on
other platforms by accident, but OS X is somehow more picky about it.

We never actually check the return value of pthread_mutex_lock() /
pthread_mutex_unlock(), so it's possible that they are reporting errors
even on other platforms, and we just don't pay attention to it. That
ought to be fixed too, at least in debug builds. Also for debugging
purposes, I think we should add a thread ID field next to the
lock_CC_for_rb field to remember which thread is holding the locks, as a
sanity check.

Attached are two patches. The first one is a narrow fix, which should
fix the bug. It simply removes the DiscardStatementSvp() call. AFAICS
it's never needed: if the statement was busy executing in a different
thread (this is the case you hit), we send a query cancel request to the
server, and it's not cool to do anything else with the statement. And in
the codepaths where the statement was not busy executing, PGAPI_Cancel()
already called DiscardStatementSvp(), with the appropriate lock held.

The second patch refactors PGAPI_Cancel() and adds comments, to make it
more readable. It's not absolutely required, but I found PGAPI_Cancel()
very difficult to debug as it is.

Can you test these patches, to confirm that it fixes the bug you're seeing?

- Heikki

Attachment

Re: Problem in SQLFreeHandle (Statement)

From

Jade Koskela

Date:

19 June 2014, 22:00:07

I couldn't reproduce the behavior using the test program, or a similar one.

I recompiled the driver to enable extra logging.

Here is the grep on ENTER_CONN_CS, the second address is the address of the mutex (pthread recursive).

This was compiled on OSX with Clang, and then also with GCC with the same result. Behavior doesn't reproduce on Windows.

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[4466282496][[SQLAllocHandle]]

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- new thread grabbed mutex, count=1

[4466282496]**** PGAPI_AllocStmt: hdbc = 0x7f8553857000, stmt = 0x7f8552d17100

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[4466282496]**** PGAPI_Prepare: STMT_ALLOCATED, copy

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=2

[140735257568016]*** LEAVE_CONN_CS [140210608634616] *** <- this thread doesn't own, count=2

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] *** <-thread blocks on mutex here, should be unlocked?

On Thu, Jun 12, 2014 at 6:30 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 06/12/2014 05:30 AM, Jade Koskela wrote:
Hello,
I am using psql-odbc 9.3 on OSX.

Which version, exactly? There were some changes to locking in SQLFreeHandle in 09.03.0200. Not sure if those changes might've fixed this, or caused it.

Can easily you test on another platform, or with different version of unixODBC?

I have one thread running a query, and
another thread cancels the query using SQLCancel. The thread running the
query successfully cancels and exits the driver code. When we try to free
the statement handle, it blocks on a mutex doing ENTER_CONN_CS, or
something similar in SQLFreeHandle.

Does anyone have any input on this problem? I don't see who is holding the
mutex, the thread running the query is done.

I tried to reproduce this with the attached little test program. It works fine. Can you post the test program you're using, or modify the attached so that it exhibits the behavior you're seeing, please?

- Heikki

Attachment

mylog_jkoskela50004.log

Re: Problem in SQLFreeHandle (Statement)

From

Barry Bell

Date:

19 June 2014, 22:00:12

I was able to login with the olddominion login and download the file

(This is normal FTP , this is NOT SFTP)

Using port 21

Which port are you using?

Barry Bell
Senior Developer/Analyst
Logistics
Harte Hanks

1525 NW 3rd ST
Deerfield Beach FL, 33442

954-429-3771 Ext 267 office
954-281-1464 fax

hartehanks.com / linkedin / twitter / facebook

From: pgsql-odbc-owner@postgresql.org [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of Jade Koskela
Sent: Tuesday, June 17, 2014 4:06 PM
To: Heikki Linnakangas; pgsql-odbc@postgresql.org
Subject: Re: [ODBC] Problem in SQLFreeHandle (Statement)

I couldn't reproduce the behavior using the test program, or a similar one.

I recompiled the driver to enable extra logging.

Here is the grep on ENTER_CONN_CS, the second address is the address of the mutex (pthread recursive).

This was compiled on OSX with Clang, and then also with GCC with the same result. Behavior doesn't reproduce on Windows.

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[4466282496][[SQLAllocHandle]]

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- new thread grabbed mutex, count=1

[4466282496]**** PGAPI_AllocStmt: hdbc = 0x7f8553857000, stmt = 0x7f8552d17100

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[4466282496]**** PGAPI_Prepare: STMT_ALLOCATED, copy

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=2

[140735257568016]*** LEAVE_CONN_CS [140210608634616] *** <- this thread doesn't own, count=2

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] *** <-thread blocks on mutex here, should be unlocked?

On Thu, Jun 12, 2014 at 6:30 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 06/12/2014 05:30 AM, Jade Koskela wrote:
Hello,
I am using psql-odbc 9.3 on OSX.

Which version, exactly? There were some changes to locking in SQLFreeHandle in 09.03.0200. Not sure if those changes might've fixed this, or caused it.

Can easily you test on another platform, or with different version of unixODBC?

I have one thread running a query, and
another thread cancels the query using SQLCancel. The thread running the
query successfully cancels and exits the driver code. When we try to free
the statement handle, it blocks on a mutex doing ENTER_CONN_CS, or
something similar in SQLFreeHandle.

Does anyone have any input on this problem? I don't see who is holding the
mutex, the thread running the query is done.

I tried to reproduce this with the attached little test program. It works fine. Can you post the test program you're using, or modify the attached so that it exhibits the behavior you're seeing, please?

- Heikki

Attachment

image001.png

Re: Problem in SQLFreeHandle (Statement)

From

Jade Koskela

Date:

20 June 2014, 01:26:08

Thanks Heikki !

This has solved the problem. I'm still curious as to the root cause.

According to the SUSv3, which OSX/Darwin is supposed to conform to, recursive mutexes should return error codes if unlocked from a different thread, but it is not undefined behavior (it is only undefined for normal mutexes). I tried to test this by doing a bunch of locking/unlocking of the same mutex from different threads and couldn't see any problems. However, if I compile without support for recursive mutexes then the problem disappears. Maybe we'll never know.

Anyways back to other work, thanks again.

Jade Koskela

Tableau Software

On Thu, Jun 19, 2014 at 5:33 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 06/17/2014 11:06 PM, Jade Koskela wrote:
I couldn't reproduce the behavior using the test program, or a similar one.
I recompiled the driver to enable extra logging.
Here is the grep on ENTER_CONN_CS, the second address is the address of the
mutex (pthread recursive).
This was compiled on OSX with Clang, and then also with GCC with the same
result. Behavior doesn't reproduce on Windows.

[140735257568016][[SQLFreeHandle]]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[140735257568016][SQLGetConnectAttrW]

[140735257568016]*** ENTER_CONN_CS [140210608634616] ***

[140735257568016]*** LEAVE_CONN_CS [140210608634616] ***

[4466282496][[SQLAllocHandle]]

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- new thread
grabbed mutex, count=1

[4466282496]**** PGAPI_AllocStmt: hdbc = 0x7f8553857000, stmt =
0x7f8552d17100

[4466282496]*** LEAVE_CONN_CS [140210608634616] *** <- count=0

[4466282496]**** PGAPI_Prepare: STMT_ALLOCATED, copy

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=1

[4466282496]*** ENTER_CONN_CS [140210608634616] *** <- count=2

[140735257568016]*** LEAVE_CONN_CS [140210608634616] *** <- this thread
doesn't own, count=2

A-ha. The problem is that after issuing the query cancel, the SQLCancel() function calls DiscardStatementSvp(), to discard any savepoints associated with the statement. SetStatementSvp() and DiscardStatementSvp() maintain a counter of how many times the CONN_CS has been acquired, and DiscardStatementSvp() tries to completely release it by releasing it as many times as the counter says it's been locked. However, unlocking a pthread mutex from a different thread than the one that locked it is undefined behavior. Apparently it happens to work on other platforms by accident, but OS X is somehow more picky about it.

We never actually check the return value of pthread_mutex_lock() / pthread_mutex_unlock(), so it's possible that they are reporting errors even on other platforms, and we just don't pay attention to it. That ought to be fixed too, at least in debug builds. Also for debugging purposes, I think we should add a thread ID field next to the lock_CC_for_rb field to remember which thread is holding the locks, as a sanity check.

Attached are two patches. The first one is a narrow fix, which should fix the bug. It simply removes the DiscardStatementSvp() call. AFAICS it's never needed: if the statement was busy executing in a different thread (this is the case you hit), we send a query cancel request to the server, and it's not cool to do anything else with the statement. And in the codepaths where the statement was not busy executing, PGAPI_Cancel() already called DiscardStatementSvp(), with the appropriate lock held.

The second patch refactors PGAPI_Cancel() and adds comments, to make it more readable. It's not absolutely required, but I found PGAPI_Cancel() very difficult to debug as it is.

Can you test these patches, to confirm that it fixes the bug you're seeing?

- Heikki