Thread: Protocol de-synchronisation bug, bogus query sent
Hi folks I've found a bug in psqlODBC, in CC_send_query_append, where psqlODBC gets out of sync with the server's idea of the connection state and both wait for each other indefinitely. So far I've only reproduced it as part of debugging a larger issue. I've only actually seen it when psqlODBC establishes an "isolated" XA connection after failing to acquire a lock on the DTC connection in IAsyncPG::getLockedXAConn(). If the client has specified a connect-time query string with a DSN like: DRIVER={PostgreSQL ANSI};SERVER=LOCALHOST;DATABASE=postgres;A6=set search_path to someschema;UID=;PWD=;BI=-5;debug=1;commlog=1; ... but it's clearly a bug in its own right, so I'll write it up here and propose a fix. Patch attached, or you can pull the branch: fix-febe-protocol-desync-emptyquery from https://github.com/ringerc/psqlODBC.git The whole code block around connection.c:3082 appears to be attempting to send an empty query in order to elicit an 'I' (EmptyQueryResponse) message. It sends a malformed protocol message: SOCK_put_string(self->sock, "Q "); SOCK_flush_output(self->sock); so the server just waits for the rest of the message, never responding. When the connection is killed by program exit later, the server will log: LOG: unexpected EOF within message length word LOG: disconnection: session time: 0:00:08.910 user=Administrator database=XXXX host=127.0.0.1 port=53705 LOG: could not receive data from client: No connection could be made because the target machine actively refused it. That's because the V3 protocol specifies that the message format for Query is: 'Q' [32 bit integer] [query string] http://www.postgresql.org/docs/current/static/protocol-message-formats.html ... but psqlODBC just sends: 'Q' '\x20' '\x00' If it is really necessary to go through this empty-query dance, then for the V3 protocol the correct message would be: /* Send an empty query to elicit an 'I' (EmptyQueryResponse) * from the server */ SOCK_put_char(self->sock, 'Q'); SOCK_put_int(self->sock, 5, 4); SOCK_put_char(self->sock, '\0'); Since no 'I' message is ever received the next message is 'Z' (ReadyForQuery). psqlODBC consumes this and carries on waiting for the next message, as it has set emptyreqs=1 to expect 'I' messages. So it waits forever, since the server thinks it's waiting for psqlODBC to finish sending the prior protocol message. It isn't clear to me why psqlODBC does this weird dance with empty queries at all, but it's certainly doing it wrong. The attached patch fixes the immediate problem, but I'd like to see if this can be simplified out entirely in favour of, in ascending size-of-work order, (a) using a protocol sync message, then (b) deleting as much as possible of this custom protocol handling code entirely and using libpq. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Craig, The code seems for v0/v1 protocol. Though I don't remember well, the dance was probably because v0/v1 didn't have clear sync mechanism. I wonder why the code reached there. The flag beforeV2 is broken? regards, Hiroshi Inoue (2014/06/19 12:44), Craig Ringer wrote: > Hi folks > > I've found a bug in psqlODBC, in CC_send_query_append, where psqlODBC > gets out of sync with the server's idea of the connection state and both > wait for each other indefinitely. > > So far I've only reproduced it as part of debugging a larger issue. I've > only actually seen it when psqlODBC establishes an "isolated" XA > connection after failing to acquire a lock on the DTC connection in > IAsyncPG::getLockedXAConn(). If the client has specified a connect-time > query string with a DSN like: > > DRIVER={PostgreSQL ANSI};SERVER=LOCALHOST;DATABASE=postgres;A6=set > search_path to someschema;UID=;PWD=;BI=-5;debug=1;commlog=1; > > ... but it's clearly a bug in its own right, so I'll write it up here > and propose a fix. > > > Patch attached, or you can pull the branch: > > fix-febe-protocol-desync-emptyquery > > from https://github.com/ringerc/psqlODBC.git > > > > The whole code block around connection.c:3082 appears to be attempting > to send an empty query in order to elicit an 'I' (EmptyQueryResponse) > message. It sends a malformed protocol message: > > SOCK_put_string(self->sock, "Q "); > SOCK_flush_output(self->sock); > > so the server just waits for the rest of the message, never responding. > > When the connection is killed by program exit later, the server will log: > > LOG: unexpected EOF within message length word > LOG: disconnection: session time: 0:00:08.910 user=Administrator > database=XXXX host=127.0.0.1 port=53705 > LOG: could not receive data from client: No connection could be made > because the target machine actively refused it. > > > That's because the V3 protocol specifies that the message format for > Query is: > > 'Q' > [32 bit integer] > [query string] > > http://www.postgresql.org/docs/current/static/protocol-message-formats.html > > ... but psqlODBC just sends: > > 'Q' > '\x20' > '\x00' > > If it is really necessary to go through this empty-query dance, then for > the V3 protocol the correct message would be: > > /* Send an empty query to elicit an 'I' (EmptyQueryResponse) > * from the server */ > SOCK_put_char(self->sock, 'Q'); > SOCK_put_int(self->sock, 5, 4); > SOCK_put_char(self->sock, '\0'); > > > > Since no 'I' message is ever received the next message is 'Z' > (ReadyForQuery). psqlODBC consumes this and carries on waiting for the > next message, as it has set emptyreqs=1 to expect 'I' messages. So it > waits forever, since the server thinks it's waiting for psqlODBC to > finish sending the prior protocol message. > > It isn't clear to me why psqlODBC does this weird dance with empty > queries at all, but it's certainly doing it wrong. > > The attached patch fixes the immediate problem, but I'd like to see if > this can be simplified out entirely in favour of, in ascending > size-of-work order, (a) using a protocol sync message, then (b) deleting > as much as possible of this custom protocol handling code entirely and > using libpq. -- I am using the free version of SPAMfighter. SPAMfighter has removed 10973 of my spam emails to date. Get the free SPAMfighter here: http://www.spamfighter.com/len Do you have a slow PC? Try a Free scan http://www.spamfighter.com/SLOW-PCfighter?cid=sigen
On 06/19/2014 12:25 PM, Inoue, Hiroshi wrote: > > > Patch attached, or you can pull the branch: > > fix-febe-protocol-desync-emptyquery > > from https://github.com/ringerc/psqlODBC.git BTW, the macros: PG_PROTOCOL_74 and PROTOCOL_74(...) are NOT the same thing at all. The former produces a value identifying the protocol; the latter queries the protocol version requested in the conninfo struct against the named version. Tad confusing. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/19/2014 12:25 PM, Inoue, Hiroshi wrote: > > > The code seems for v0/v1 protocol. Though I don't remember well, the > dance was probably because v0/v1 didn't have clear sync mechanism. > I wonder why the code reached there. The flag beforeV2 is broken? Hm, yes. beforeV2 seems to be 1, somehow. But only on these "isolated" connections established via the callstack: CC_connect(...) with trimmed conninfo + dsn "" + desc "" + drivername "{PostgreSQL ANSI}" + server "LOCALHOST" + database "XXXX" + username "Administrator" + protocol "7.4" + port "5432" + sslmode "disable" + conn_settings {name=0x02e68fb0 "set search_path to femsconfig_femsmqsv1601" } pgNAME use_server_side_prepare 1 '\x1' char lower_case_identifier 0 '\0' char rollback_on_error -1 'ÿ' char force_abbrev_connstr 0 '\0' char autocommit_public 1 '\x1' char extra_opts 0 unsigned int xa_opt 1 '\x1' char PgDtc_connect(...) IAsyncPG::getLockedXAConn() IAsyncPG::RequestExec(...) with CommitExec DtcRequestExec(...) that hit the problematic code in: > psqlodbc30a.dll!CC_send_query_append(ConnectionClass_ * self, const char * query, QueryInfo_ * qi, unsigned long flag,StatementClass_ * stmt, const char * appendq) Line 3082 C > psqlodbc30a.dll!SC_execute(StatementClass_ * self) Line 2092 C > psqlodbc30a.dll!Exec_with_parameters_resolved(StatementClass_ * stmt, int * exec_end) Line 511 C > psqlodbc30a.dll!PGAPI_Execute(void * hstmt, unsigned short flag) Line 1150 C > psqlodbc30a.dll!PGAPI_ExecDirect(void * hstmt, const unsigned char * szSqlStr, long cbSqlStr, unsigned short flag)Line 195 C >> psqlodbc30a.dll!CC_send_settings(ConnectionClass_ * self) Line 3794 C > psqlodbc30a.dll!CC_connect(ConnectionClass_ * self, char password_req, char * salt_para) Line 2187 C > psqlodbc30a.dll!PgDtc_connect(void * self) Line 4471 C > pgenlista.dll!IAsyncPG::getLockedXAConn() Line 597 C++ > pgenlista.dll!IAsyncPG::RequestExec(unsigned long type, HRESULT res) Line 657 C++ > pgenlista.dll!DtcRequestExec(void * para) Line 996 C++ The conninfo says it clearly _wants_ to use the V3 protocol. Hm. Maybe the flag is not set up yet because the connection params are still being sent? Nope, it's set at the top of CC_send_query_append. However, the version is garbage: pg_version 0x02d08830 "" char[128] pg_version_number 0.000000000 float pg_version_major 0 short pg_version_minor 0 short Unsurprisingly, version 0 is not >= 7.4 . So there's at least another bug here causing the version to be zero. It looks like that's because the connection isn't established yet, per connection.c:2187: /* * Since these functions allocate statements, and since the connection * is not established yet, it would violate odbc state transition * rules. Therefore, these functions call the corresponding local * function instead. */ inolog("CC_send_settings\n"); retsend = CC_send_settings(self); and we're attempting to use the version obtained from the server before we have it, instead of the conninfo-specified version to use for protocol selection purposes. A debug-mode assertion in the SERVER_VERSION_GT, SERVER_VERSION_GT and SERVER_VERSION_EQ macros that bailed out if the version was 0 and it's in the debug configuration would help, I think. Inclined to accept a patch? If so, how does a newbie Windows programmer like me actually *do* such an assertion in Windows-y code? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/19/2014 12:56 PM, Craig Ringer wrote: > > and we're attempting to use the version obtained from the server before > we have it, instead of the conninfo-specified version to use for > protocol selection purposes. The attached patch fixes this. Pushed to branch fix-febe-protocol-desync-emptyquery -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 06/19/2014 06:44 AM, Craig Ringer wrote: > The whole code block around connection.c:3082 appears to be attempting > to send an empty query in order to elicit an 'I' (EmptyQueryResponse) > message. It sends a malformed protocol message: > > SOCK_put_string(self->sock, "Q "); > SOCK_flush_output(self->sock); > > so the server just waits for the rest of the message, never responding. > > When the connection is killed by program exit later, the server will log: > > LOG: unexpected EOF within message length word > LOG: disconnection: session time: 0:00:08.910 user=Administrator > database=XXXX host=127.0.0.1 port=53705 > LOG: could not receive data from client: No connection could be made > because the target machine actively refused it. > > > That's because the V3 protocol specifies that the message format for > Query is: > > 'Q' > [32 bit integer] > [query string] > > http://www.postgresql.org/docs/current/static/protocol-message-formats.html > > ... but psqlODBC just sends: > > 'Q' > '\x20' > '\x00' > > If it is really necessary to go through this empty-query dance, then for > the V3 protocol the correct message would be: > > /* Send an empty query to elicit an 'I' (EmptyQueryResponse) > * from the server */ > SOCK_put_char(self->sock, 'Q'); > SOCK_put_int(self->sock, 5, 4); > SOCK_put_char(self->sock, '\0'); Hmm. Just before that code, there's this: > if (!beforeV2) > break; So we shouldn't even get there with protocol version 3. But if you managed to hit that, apparently we do. Can you debug that further? - Heikki
(2014/06/19 11:47), Craig Ringer wrote: > On 06/19/2014 12:56 PM, Craig Ringer wrote: >> >> and we're attempting to use the version obtained from the server before >> we have it, instead of the conninfo-specified version to use for >> protocol selection purposes. > > The attached patch fixes this. Thanks. I would also check the source of this trouble. regards, Hiroshi Inoue
On 06/19/2014 08:08 PM, Hiroshi Inoue wrote: > (2014/06/19 11:47), Craig Ringer wrote: >> On 06/19/2014 12:56 PM, Craig Ringer wrote: >>> >>> and we're attempting to use the version obtained from the server before >>> we have it, instead of the conninfo-specified version to use for >>> protocol selection purposes. >> >> The attached patch fixes this. > > Thanks. > I would also check the source of this trouble. Huh? That _is_ the source of this trouble. Some code that runs during connection setup, before the server version is available, tries to use the version obtained from the server instead of the protocol version requested in the conninfo. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/20/2014 08:29 AM, Craig Ringer wrote: > On 06/19/2014 08:08 PM, Hiroshi Inoue wrote: >> (2014/06/19 11:47), Craig Ringer wrote: >>> On 06/19/2014 12:56 PM, Craig Ringer wrote: >>>> >>>> and we're attempting to use the version obtained from the server before >>>> we have it, instead of the conninfo-specified version to use for >>>> protocol selection purposes. >>> >>> The attached patch fixes this. >> >> Thanks. >> I would also check the source of this trouble. > > Huh? > > That _is_ the source of this trouble. Some code that runs during > connection setup, before the server version is available, tries to use > the version obtained from the server instead of the protocol version > requested in the conninfo. Just to clarify, only the 0002 patch is required. My branch at: https://github.com/ringerc/psqlodbc/tree/fix-febe-protocol-desync-emptyquery only contains the second patch. I've attached a revised version that should make that clearer. Please apply / pull. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
(2014/06/20 10:55), Craig Ringer wrote: > On 06/20/2014 08:29 AM, Craig Ringer wrote: >> On 06/19/2014 08:08 PM, Hiroshi Inoue wrote: >>> (2014/06/19 11:47), Craig Ringer wrote: >>>> On 06/19/2014 12:56 PM, Craig Ringer wrote: >>>>> >>>>> and we're attempting to use the version obtained from the server before >>>>> we have it, instead of the conninfo-specified version to use for >>>>> protocol selection purposes. >>>> >>>> The attached patch fixes this. >>> >>> Thanks. >>> I would also check the source of this trouble. >> >> Huh? >> >> That _is_ the source of this trouble. Some code that runs during >> connection setup, before the server version is available, tries to use >> the version obtained from the server instead of the protocol version >> requested in the conninfo. > > Just to clarify, only the 0002 patch is required. My branch at: > > https://github.com/ringerc/psqlodbc/tree/fix-febe-protocol-desync-emptyquery > > only contains the second patch. I've attached a revised version that > should make that clearer. > > Please apply / pull. Thanks. Applied. I also pushd another change. It fixes another (or a real) cause of this problem. I borrowed your beforeV2 in it. Thanks again. regards, Hiroshi Inoue -- I am using the free version of SPAMfighter. SPAMfighter has removed 10986 of my spam emails to date. Get the free SPAMfighter here: http://www.spamfighter.com/len Do you have a slow PC? Try a Free scan http://www.spamfighter.com/SLOW-PCfighter?cid=sigen
On 06/20/2014 11:27 AM, Inoue, Hiroshi wrote: > Thanks. > Applied. > > I also pushd another change. It fixes another (or a real) cause of > this problem. I borrowed your beforeV2 in it. Thanks again. Much appreciated. I've pulled git head and confirmed that the patches fix the issue as applied. The case you patched is certainly another incidence of the same issue, so I'm glad you looked. Thanks. I'm still chasing a couple of other issues, so expect more mail from me soon. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services