Thread: Protocol de-synchronisation bug, bogus query sent

Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
Hi folks

I've found a bug in psqlODBC, in CC_send_query_append, where psqlODBC
gets out of sync with the server's idea of the connection state and both
wait for each other indefinitely.

So far I've only reproduced it as part of debugging a larger issue. I've
only actually seen it when psqlODBC establishes an "isolated" XA
connection after failing to acquire a lock on the DTC connection in
IAsyncPG::getLockedXAConn(). If the client has specified a connect-time
query string with a DSN like:

    DRIVER={PostgreSQL ANSI};SERVER=LOCALHOST;DATABASE=postgres;A6=set
search_path to someschema;UID=;PWD=;BI=-5;debug=1;commlog=1;

... but it's clearly a bug in its own right, so I'll write it up here
and propose a fix.


Patch attached, or you can pull the branch:

    fix-febe-protocol-desync-emptyquery

from https://github.com/ringerc/psqlODBC.git



The whole code block around connection.c:3082 appears to be attempting
to send an empty query in order to elicit an 'I' (EmptyQueryResponse)
message. It sends a malformed protocol message:

    SOCK_put_string(self->sock, "Q ");
    SOCK_flush_output(self->sock);

so the server just waits for the rest of the message, never responding.

When the connection is killed by program exit later, the server will log:

LOG:  unexpected EOF within message length word
LOG:  disconnection: session time: 0:00:08.910 user=Administrator
database=XXXX host=127.0.0.1 port=53705
LOG:  could not receive data from client: No connection could be made
because the target machine actively refused it.


That's because the V3 protocol specifies that the message format for
Query is:

'Q'
[32 bit integer]
[query string]

http://www.postgresql.org/docs/current/static/protocol-message-formats.html

... but psqlODBC just sends:

'Q'
'\x20'
'\x00'

If it is really necessary to go through this empty-query dance, then for
the V3 protocol the correct message would be:

/* Send an empty query to elicit an 'I' (EmptyQueryResponse)
 * from the server */
SOCK_put_char(self->sock, 'Q');
SOCK_put_int(self->sock, 5, 4);
SOCK_put_char(self->sock, '\0');



Since no 'I' message is ever received the next message is 'Z'
(ReadyForQuery). psqlODBC consumes this and carries on waiting for the
next message, as it has set emptyreqs=1 to expect 'I' messages. So it
waits forever, since the server thinks it's waiting for psqlODBC to
finish sending the prior protocol message.

It isn't clear to me why psqlODBC does this weird dance with empty
queries at all, but it's certainly doing it wrong.

The attached patch fixes the immediate problem, but I'd like to see if
this can be simplified out entirely in favour of, in ascending
size-of-work order, (a) using a protocol sync message, then (b) deleting
as much as possible of this custom protocol handling code entirely and
using libpq.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Protocol de-synchronisation bug, bogus query sent

From
"Inoue, Hiroshi"
Date:
Hi Craig,

The code seems for v0/v1 protocol. Though I don't remember well, the
dance was probably because v0/v1 didn't have clear sync mechanism.
I wonder why the code reached there. The flag beforeV2 is broken?

regards,
Hiroshi Inoue

(2014/06/19 12:44), Craig Ringer wrote:
> Hi folks
>
> I've found a bug in psqlODBC, in CC_send_query_append, where psqlODBC
> gets out of sync with the server's idea of the connection state and both
> wait for each other indefinitely.
>
> So far I've only reproduced it as part of debugging a larger issue. I've
> only actually seen it when psqlODBC establishes an "isolated" XA
> connection after failing to acquire a lock on the DTC connection in
> IAsyncPG::getLockedXAConn(). If the client has specified a connect-time
> query string with a DSN like:
>
>      DRIVER={PostgreSQL ANSI};SERVER=LOCALHOST;DATABASE=postgres;A6=set
> search_path to someschema;UID=;PWD=;BI=-5;debug=1;commlog=1;
>
> ... but it's clearly a bug in its own right, so I'll write it up here
> and propose a fix.
>
>
> Patch attached, or you can pull the branch:
>
>      fix-febe-protocol-desync-emptyquery
>
> from https://github.com/ringerc/psqlODBC.git
>
>
>
> The whole code block around connection.c:3082 appears to be attempting
> to send an empty query in order to elicit an 'I' (EmptyQueryResponse)
> message. It sends a malformed protocol message:
>
>      SOCK_put_string(self->sock, "Q ");
>      SOCK_flush_output(self->sock);
>
> so the server just waits for the rest of the message, never responding.
>
> When the connection is killed by program exit later, the server will log:
>
> LOG:  unexpected EOF within message length word
> LOG:  disconnection: session time: 0:00:08.910 user=Administrator
> database=XXXX host=127.0.0.1 port=53705
> LOG:  could not receive data from client: No connection could be made
> because the target machine actively refused it.
>
>
> That's because the V3 protocol specifies that the message format for
> Query is:
>
> 'Q'
> [32 bit integer]
> [query string]
>
> http://www.postgresql.org/docs/current/static/protocol-message-formats.html
>
> ... but psqlODBC just sends:
>
> 'Q'
> '\x20'
> '\x00'
>
> If it is really necessary to go through this empty-query dance, then for
> the V3 protocol the correct message would be:
>
> /* Send an empty query to elicit an 'I' (EmptyQueryResponse)
>   * from the server */
> SOCK_put_char(self->sock, 'Q');
> SOCK_put_int(self->sock, 5, 4);
> SOCK_put_char(self->sock, '\0');
>
>
>
> Since no 'I' message is ever received the next message is 'Z'
> (ReadyForQuery). psqlODBC consumes this and carries on waiting for the
> next message, as it has set emptyreqs=1 to expect 'I' messages. So it
> waits forever, since the server thinks it's waiting for psqlODBC to
> finish sending the prior protocol message.
>
> It isn't clear to me why psqlODBC does this weird dance with empty
> queries at all, but it's certainly doing it wrong.
>
> The attached patch fixes the immediate problem, but I'd like to see if
> this can be simplified out entirely in favour of, in ascending
> size-of-work order, (a) using a protocol sync message, then (b) deleting
> as much as possible of this custom protocol handling code entirely and
> using libpq.

--
I am using the free version of SPAMfighter.
SPAMfighter has removed 10973 of my spam emails to date.
Get the free SPAMfighter here: http://www.spamfighter.com/len

Do you have a slow PC? Try a Free scan
http://www.spamfighter.com/SLOW-PCfighter?cid=sigen



Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/19/2014 12:25 PM, Inoue, Hiroshi wrote:
>
>
> Patch attached, or you can pull the branch:
>
>      fix-febe-protocol-desync-emptyquery
>
> from https://github.com/ringerc/psqlODBC.git


BTW, the macros:

PG_PROTOCOL_74

and

PROTOCOL_74(...)

are NOT the same thing at all. The former produces a value identifying
the protocol; the latter queries the protocol version requested in the
conninfo struct against the named version.

Tad confusing.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/19/2014 12:25 PM, Inoue, Hiroshi wrote:
>
>
> The code seems for v0/v1 protocol. Though I don't remember well, the
> dance was probably because v0/v1 didn't have clear sync mechanism.
> I wonder why the code reached there. The flag beforeV2 is broken?

Hm, yes.

beforeV2 seems to be 1, somehow. But only on these "isolated"
connections established via the callstack:


CC_connect(...) with trimmed conninfo

+        dsn     ""
+        desc     ""
+        drivername     "{PostgreSQL ANSI}"
+        server     "LOCALHOST"
+        database     "XXXX"
+        username     "Administrator"
+        protocol     "7.4"
+        port     "5432"
+        sslmode     "disable"
+        conn_settings    {name=0x02e68fb0 "set search_path to
femsconfig_femsmqsv1601" }    pgNAME
        use_server_side_prepare    1 '\x1'    char
        lower_case_identifier    0 '\0'    char
        rollback_on_error    -1 'ÿ'    char
        force_abbrev_connstr    0 '\0'    char
        autocommit_public    1 '\x1'    char
        extra_opts    0    unsigned int
        xa_opt    1 '\x1'    char

PgDtc_connect(...)
IAsyncPG::getLockedXAConn()
IAsyncPG::RequestExec(...)  with CommitExec
DtcRequestExec(...)


that hit the problematic code in:

>      psqlodbc30a.dll!CC_send_query_append(ConnectionClass_ * self, const char * query, QueryInfo_ * qi, unsigned long
flag,StatementClass_ * stmt, const char * appendq) Line 3082    C 
>      psqlodbc30a.dll!SC_execute(StatementClass_ * self) Line 2092    C
>      psqlodbc30a.dll!Exec_with_parameters_resolved(StatementClass_ * stmt, int * exec_end) Line 511    C
>      psqlodbc30a.dll!PGAPI_Execute(void * hstmt, unsigned short flag) Line 1150    C
>      psqlodbc30a.dll!PGAPI_ExecDirect(void * hstmt, const unsigned char * szSqlStr, long cbSqlStr, unsigned short
flag)Line 195    C 
>>    psqlodbc30a.dll!CC_send_settings(ConnectionClass_ * self) Line 3794    C
>      psqlodbc30a.dll!CC_connect(ConnectionClass_ * self, char password_req, char * salt_para) Line 2187    C
>      psqlodbc30a.dll!PgDtc_connect(void * self) Line 4471    C
>      pgenlista.dll!IAsyncPG::getLockedXAConn() Line 597    C++
>      pgenlista.dll!IAsyncPG::RequestExec(unsigned long type, HRESULT res) Line 657    C++
>      pgenlista.dll!DtcRequestExec(void * para) Line 996    C++

The conninfo says it clearly _wants_ to use the V3 protocol. Hm. Maybe
the flag is not set up yet because the connection params are still being
sent? Nope, it's set at the top of CC_send_query_append.

However, the version is garbage:

        pg_version    0x02d08830 ""    char[128]
        pg_version_number    0.000000000    float
        pg_version_major    0    short
        pg_version_minor    0    short


Unsurprisingly, version 0 is not >= 7.4 .

So there's at least another bug here causing the version to be zero. It
looks like that's because the connection isn't established yet, per
connection.c:2187:


 /*
  * Since these functions allocate statements, and since the connection
  * is not established yet, it would violate odbc state transition
  * rules.  Therefore, these functions call the corresponding local
  * function instead.
  */
inolog("CC_send_settings\n");
  retsend = CC_send_settings(self);


and we're attempting to use the version obtained from the server before
we have it, instead of the conninfo-specified version to use for
protocol selection purposes.

A debug-mode assertion in the SERVER_VERSION_GT, SERVER_VERSION_GT and
SERVER_VERSION_EQ macros that bailed out if the version was 0 and it's
in the debug configuration would help, I think. Inclined to accept a
patch? If so, how does a newbie Windows programmer like me actually *do*
such an assertion in Windows-y code?



--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/19/2014 12:56 PM, Craig Ringer wrote:
>
> and we're attempting to use the version obtained from the server before
> we have it, instead of the conninfo-specified version to use for
> protocol selection purposes.

The attached patch fixes this.

Pushed to branch fix-febe-protocol-desync-emptyquery

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Protocol de-synchronisation bug, bogus query sent

From
Heikki Linnakangas
Date:
On 06/19/2014 06:44 AM, Craig Ringer wrote:
> The whole code block around connection.c:3082 appears to be attempting
> to send an empty query in order to elicit an 'I' (EmptyQueryResponse)
> message. It sends a malformed protocol message:
>
>      SOCK_put_string(self->sock, "Q ");
>      SOCK_flush_output(self->sock);
>
> so the server just waits for the rest of the message, never responding.
>
> When the connection is killed by program exit later, the server will log:
>
> LOG:  unexpected EOF within message length word
> LOG:  disconnection: session time: 0:00:08.910 user=Administrator
> database=XXXX host=127.0.0.1 port=53705
> LOG:  could not receive data from client: No connection could be made
> because the target machine actively refused it.
>
>
> That's because the V3 protocol specifies that the message format for
> Query is:
>
> 'Q'
> [32 bit integer]
> [query string]
>
> http://www.postgresql.org/docs/current/static/protocol-message-formats.html
>
> ... but psqlODBC just sends:
>
> 'Q'
> '\x20'
> '\x00'
>
> If it is really necessary to go through this empty-query dance, then for
> the V3 protocol the correct message would be:
>
> /* Send an empty query to elicit an 'I' (EmptyQueryResponse)
>   * from the server */
> SOCK_put_char(self->sock, 'Q');
> SOCK_put_int(self->sock, 5, 4);
> SOCK_put_char(self->sock, '\0');

Hmm. Just before that code, there's this:

>  if (!beforeV2)
>      break;

So we shouldn't even get there with protocol version 3. But if you
managed to hit that, apparently we do. Can you debug that further?
- Heikki


Re: Protocol de-synchronisation bug, bogus query sent

From
Hiroshi Inoue
Date:
(2014/06/19 11:47), Craig Ringer wrote:
> On 06/19/2014 12:56 PM, Craig Ringer wrote:
>>
>> and we're attempting to use the version obtained from the server before
>> we have it, instead of the conninfo-specified version to use for
>> protocol selection purposes.
>
> The attached patch fixes this.

Thanks.
I would also check the source of this trouble.

regards,
Hiroshi Inoue



Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/19/2014 08:08 PM, Hiroshi Inoue wrote:
> (2014/06/19 11:47), Craig Ringer wrote:
>> On 06/19/2014 12:56 PM, Craig Ringer wrote:
>>>
>>> and we're attempting to use the version obtained from the server before
>>> we have it, instead of the conninfo-specified version to use for
>>> protocol selection purposes.
>>
>> The attached patch fixes this.
>
> Thanks.
> I would also check the source of this trouble.

Huh?

That _is_ the source of this trouble. Some code that runs during
connection setup, before the server version is available, tries to use
the version obtained from the server instead of the protocol version
requested in the conninfo.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/20/2014 08:29 AM, Craig Ringer wrote:
> On 06/19/2014 08:08 PM, Hiroshi Inoue wrote:
>> (2014/06/19 11:47), Craig Ringer wrote:
>>> On 06/19/2014 12:56 PM, Craig Ringer wrote:
>>>>
>>>> and we're attempting to use the version obtained from the server before
>>>> we have it, instead of the conninfo-specified version to use for
>>>> protocol selection purposes.
>>>
>>> The attached patch fixes this.
>>
>> Thanks.
>> I would also check the source of this trouble.
>
> Huh?
>
> That _is_ the source of this trouble. Some code that runs during
> connection setup, before the server version is available, tries to use
> the version obtained from the server instead of the protocol version
> requested in the conninfo.

Just to clarify, only the 0002 patch is required. My branch at:

https://github.com/ringerc/psqlodbc/tree/fix-febe-protocol-desync-emptyquery

only contains the second patch. I've attached a revised version that
should make that clearer.

Please apply / pull.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Protocol de-synchronisation bug, bogus query sent

From
"Inoue, Hiroshi"
Date:
(2014/06/20 10:55), Craig Ringer wrote:
> On 06/20/2014 08:29 AM, Craig Ringer wrote:
>> On 06/19/2014 08:08 PM, Hiroshi Inoue wrote:
>>> (2014/06/19 11:47), Craig Ringer wrote:
>>>> On 06/19/2014 12:56 PM, Craig Ringer wrote:
>>>>>
>>>>> and we're attempting to use the version obtained from the server before
>>>>> we have it, instead of the conninfo-specified version to use for
>>>>> protocol selection purposes.
>>>>
>>>> The attached patch fixes this.
>>>
>>> Thanks.
>>> I would also check the source of this trouble.
>>
>> Huh?
>>
>> That _is_ the source of this trouble. Some code that runs during
>> connection setup, before the server version is available, tries to use
>> the version obtained from the server instead of the protocol version
>> requested in the conninfo.
>
> Just to clarify, only the 0002 patch is required. My branch at:
>
> https://github.com/ringerc/psqlodbc/tree/fix-febe-protocol-desync-emptyquery
>
> only contains the second patch. I've attached a revised version that
> should make that clearer.
>
> Please apply / pull.

Thanks.
Applied.

I also pushd another change. It fixes another (or a real) cause of
this problem. I borrowed your beforeV2 in it. Thanks again.

regards,
Hiroshi Inoue


--
I am using the free version of SPAMfighter.
SPAMfighter has removed 10986 of my spam emails to date.
Get the free SPAMfighter here: http://www.spamfighter.com/len

Do you have a slow PC? Try a Free scan
http://www.spamfighter.com/SLOW-PCfighter?cid=sigen



Re: Protocol de-synchronisation bug, bogus query sent

From
Craig Ringer
Date:
On 06/20/2014 11:27 AM, Inoue, Hiroshi wrote:
> Thanks.
> Applied.
>
> I also pushd another change. It fixes another (or a real) cause of
> this problem. I borrowed your beforeV2 in it. Thanks again.

Much appreciated. I've pulled git head and confirmed that the patches
fix the issue as applied. The case you patched is certainly another
incidence of the same issue, so I'm glad you looked. Thanks.

I'm still chasing a couple of other issues, so expect more mail from me
soon.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services