Thread: [HACKERS] statement_timeout is not working as expected with postgres_fdw
Hello All,
A query on a foreign table hangs for roughly 16 minutes before exiting when the network to the remote server goes down.
statement_timeout is expected to apply in this case as well, but when I tried it, it did not work as expected.
Below is a test case to reproduce the issue (I am testing this on two different machines):
1. Set up postgres_fdw:
postgres=# CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '172.16.173.237', dbname 'postgres', port '5432', keepalives '1', keepalives_interval '3', keepalives_idle '3', keepalives_count '1');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER myserver OPTIONS (user 'postgres', password 'edb');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE test(id int) SERVER myserver;
CREATE FOREIGN TABLE
postgres=# select * from test;
id
----
1
10
2
1
(4 rows)
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+------+---------------+----------
public | test | foreign table | postgres
(1 row)
postgres=#
postgres=# select * from pg_foreign_server ;
 srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl | srvoptions
----------+----------+--------+---------+------------+--------+------------
 myserver |       10 |  16387 |         |            |        | {host=172.16.173.237,dbname=postgres,port=5432,keepalives=1,keepalives_interval=3,keepalives_idle=3,keepalives_count=1}
(1 row)
2. Execute the insert command, e.g.:
insert into test values (generate_series(1,1000000));
* You need to take the network to the remote server down while the insert command is running.
postgres=# set statement_timeout to 6000;
SET
postgres=# insert into test values (generate_series(1,1000000));
WARNING: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
CONTEXT: Remote SQL command: ABORT TRANSACTION
ERROR: could not receive data from server: No route to host
CONTEXT: Remote SQL command: INSERT INTO public.test(id) VALUES ($1)
Time: 931427.002 ms
The query hung for approximately 15-16 minutes before exiting.
In pg_log, I see the below errors for the above query:
=========
2017-04-20 15:22:02 IST ERROR: could not receive data from server: No route to host
2017-04-20 15:22:02 IST CONTEXT: Remote SQL command: INSERT INTO public.test(id) VALUES ($1)
2017-04-20 15:22:02 IST STATEMENT: insert into test values (generate_series(1,1000000));
2017-04-20 15:22:02 IST WARNING: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-04-20 15:22:02 IST CONTEXT: Remote SQL command: ABORT TRANSACTION
==========
I tested the patch submitted by Ashutosh Bapat on the community list [https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com] to make statement_timeout work. With it, statement_timeout does fire, but the query takes some extra time beyond statement_timeout to return.
postgres=# set statement_timeout to 6000;
SET
postgres=#
postgres=# \timing on
Timing is on.
postgres=# insert into test values (generate_series(1,1000000)); -- [after executing the query, immediately disconnect the ethernet on the remote server]
2017-04-20 07:10:51.588 IST [10467] ERROR: canceling statement due to statement timeout
2017-04-20 07:10:51.588 IST [10467] STATEMENT: insert into test values (generate_series(1,1000000));
2017-04-20 07:11:25.590 IST [10467] WARNING: discarding connection 0xfe4260 because of failed cancel request: PQcancel() -- connect() failed: No route to host
WARNING: discarding connection 0xfe4260 because of failed cancel request: PQcancel() -- connect() failed: No route to host
ERROR: canceling statement due to statement timeout
Time: 40001.765 ms (00:40.002)
postgres=#
postgres=#
In the above case, I got the statement-timeout error after 6 seconds,
but it took more time (approx. 34 seconds, varying on each run; see below) to terminate the connection and exit the query. As per the TCP keepalive settings for the foreign server, it should take at most 6 seconds to terminate the connection.
postgres=#
postgres=# set statement_timeout to 20000;
SET
Time: 0.254 ms
postgres=#
postgres=# insert into test values (generate_series(1,1000000));
2017-04-20 07:13:25.666 IST [10467] ERROR: canceling statement due to statement timeout
2017-04-20 07:13:25.666 IST [10467] STATEMENT: insert into test values (generate_series(1,1000000));
2017-04-20 07:13:43.668 IST [10467] WARNING: discarding connection 0xfe4260 because of failed cancel request: PQcancel() -- connect() failed: No route to host
WARNING: discarding connection 0xfe4260 because of failed cancel request: PQcancel() -- connect() failed: No route to host
ERROR: canceling statement due to statement timeout
Time: 38004.169 ms (00:38.004)
postgres=#
When I tested this over a local TCP connection, the query exited immediately after the statement timeout.
Suraj Kharage
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
From
Ashutosh Bapat
Date:
On Thu, Apr 20, 2017 at 4:08 PM, Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:

> I tested the patch submitted by Ashutosh Bapat on community
> [https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com]
> to make the statement_timeout working and i can see statement_timeout is
> working but it is taking some extra time as statement_timeout.

Thanks for testing the patch.

> In above case, I got the error related to statement timeout after 6 seconds,
> but it is taking more time (approx 34 sec, varying each time if you see
> below) to terminate the connection and to exit from query. As per the
> tcp keepalive settings for the foreign server, it should take max 6 sec
> to terminate the connection.

The logs above show that 34 seconds elapsed between starting to abort the transaction and knowing that the foreign server is unreachable. It looks like it took that much time for the local server to realise that the foreign server is not reachable. Looking at the PQcancel code, it seems to be trying to connect to the foreign server to cancel the query. But somehow it doesn't seem to honor the connect_timeout setting. Is that expected?

Irrespective of what PQcancel does, it looks like postgres_fdw should just slam the connection shut if the query is being aborted because of statement_timeout. But then pgfdw_xact_callback() doesn't seem to have a way to know whether this ABORT is because of a user's request to cancel the query, a statement timeout, an abort because of some other error, or a user-requested abort. Except for statement timeout (and maybe a user's request to cancel the query?), it should try to keep the connection around to avoid any future reconnection. But I am not able to see how we can provide that information to pgfdw_xact_callback().

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 20 Apr 2017 19:57:30 +0530, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote in <CAFjFpRdMD25Lia_J0P1eiKzNXAXtH9AGNx2vCZ2dBPNffNOaog@mail.gmail.com>:

> The logs above show that 34 seconds elapsed between starting to abort
> the transaction and knowing that the foreign server is unreachable. [...]
> Looking at PQcancel code, it seems to be trying to connect to the foreign
> server to cancel the query. But somehow it doesn't seem to honor
> connect_timeout setting. Is that expected?

Yes, and no. I think PQcancel requires a connection timeout, but I think it is not necessarily the same as that of the foreign server.

> Irrespective of what PQcancel does, it looks like postgres_fdw should
> just slam the connection if query is being aborted because of
> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
> a way to know whether this ABORT is because of user's request to
> cancel the query, statement timeout, an abort because of some other
> error or a user requested abort. [...] But I am not able to see how can
> we provide that information to pgfdw_xact_callback().

Expiration of statement_timeout doesn't mean a stall of the foreign connections. If we are to keep connections on, for example, a cancel request from the client, we also should keep them on statement_timeout, because it is not necessarily triggered by a stall of the foreign connection.

I think we can detect a stall of the channel the foreign connections are on by issuing a cancel request with a very short timeout, although it is a bit incorrect.

I reconsidered this problem, and my proposal for this issue is as follows:

- Foreign servers get a new option 'stall_detection_threshold' in milliseconds, defaulting to the connect_timeout of the foreign server setting. For many foreign servers in a local network, it could be lowered to several tens of milliseconds.
- Invoke PQcancel with the stall_detection_threshold. (It doesn't accept such a parameter so far.)
  - If PQcancel returns with a timeout, slam the connection down.
  - Else, continue to send ABORT TRANSACTION on the connection.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
From
Ashutosh Bapat
Date:
On Tue, Apr 25, 2017 at 1:31 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Yes, and No. I think PQcancel requires connection timeout, but I
> think it is not necessariry the same with that of a foreign
> server.

Since connect_timeout is a property of the foreign server, it should be honored by any connection made to that server from the local server, including the one made by PQcancel().

> Expiration of statement_timeout doesn't mean a stall of foreign
> connections. If we are to keep connections by, for example, a
> cancel request from client, we also should keep them on
> statememt_timeout because it is not necessariry triggered by a
> stall of foreign connection.

When statement_timeout fires, we don't want to spend more time trying to cancel queries, especially when there are many foreign servers, each consuming some "timeout" time or even trying to send an ABORT TRANSACTION statement. Instead, we should slam those connections down. I consider this to be different from query cancellation, since query cancellation doesn't have a hard bound on time, although we would like to cancel the running query as fast as possible. Rethinking it, probably we should slam down the connection in the query-cancel case as well.

> I reconsider this problem and my proposal for this issue is as
> the follows.
>
> - Foreign servers have a new options 'stall_detection_threshold'
> in milliseconds, maybe defaults to connect_timeout of the
> foreign server setting. For many foreign servers in a local
> network, it could be lowered to several tens of milliseconds.

A connect_timeout of less than 2 seconds is not encouraged: https://www.postgresql.org/docs/devel/static/libpq-connect.html. So we cannot set stall_detection_threshold smaller than 2 seconds. statement_timeout, however, is set in milliseconds, so 2 seconds per connection would be quite a lot compared to the statement_timeout setting. Waiting 2 seconds to cancel a query when statement_timeout itself is 2 seconds would mean the query is cancelled after 4 seconds, which is kind of funny.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw
From
Kyotaro HORIGUCHI
Date:
At Tue, 25 Apr 2017 15:43:59 +0530, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote in <CAFjFpRe6uf3-nU-orTEAJ0_CKb3MMiPQ5AHJ_SwDguV7Sjs6Ww@mail.gmail.com>:

> Since connect_timeout is property of foreign server, it should be
> honored by any connection made to that server from local server
> including the one by PQcancel().

The two connections have different characteristics. A query (client) connection should eventually be established. On the other hand, we may give up establishing a cancel connection in certain cases. For example, when a network stall is anticipated, we can rather choose to trash the session on it than wait for the timeout.

> When statement_timeout completes, we don't want to spend more time in
> trying to cancel queries: esp when there are many foreign server, each
> consuming some "timeout" time OR even trying to send Abort transaction
> statement. Instead, we should slam those down. [...] Rethinking about it,
> probably we should slam down the connection in case of query cancel as
> well.

Yeah, both are rather unusual; in other words, we don't expect either to happen twice or more in a short interval. With that prerequisite, I think slamming everything down is optimal. But definitely not for the commanded-rollback case.

> A connect_timeout less than 2 seconds is not encouraged.
> https://www.postgresql.org/docs/devel/static/libpq-connect.html. So,
> we can not set stall_detection_threshold to be smaller than 2 seconds.

That limit seems to come from the DNS lookup timeout, so users can set a shorter timeout. Anyway, the several tens or hundreds of milliseconds was an extreme example. But 2 seconds for every connection would be too long, and the resolution is only 1 second anyway.

> statement_timeout however is set in milliseconds, so 2 seconds per
> connection would be quite a lot compared to statement_timeout setting.
> Waiting to cancel query for 2 seconds when the statement_timeout
> itself is 2 seconds would mean the query would be cancelled after 4
> seconds, which is kind of funny.

Yes, we could keep the connections in those cases, but in turn that can compel a user to wait an incredibly long time on a cancel request. We could reduce the wait with some complex machinery, but I don't think we can cancel queries on hundreds of foreign servers within a second without asynchronicity or parallelism, which seems unreasonably complex. In short, I'd like to just discard all connections for the two kinds of triggers.

I suppose we can inform the transaction callback of the cause as an event, but I haven't figured out how AbortCurrentTransaction can detect the cause. Letting ProcessInterrupts set an additional global variable, such as requested_immediate_abort, could work.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > The logs above show that 34 seconds elapsed between starting to abort > the transaction and knowing that the foreign server is unreachable. It > looks like it took that much time for the local server to realise that > the foreign server is not reachable. Looking at PQcancel code, it > seems to be trying to connect to the foreign server to cancel the > query. But somehow it doesn't seem to honor connect_timeout setting. > Is that expected? Well, there's no code to do anything else. Regular connections go through connectDBComplete() which uses non-blocking mode and timed waits to respect connection_timeout. PQcancel() has no such handling. internal_cancel just opens a socket and, without setting any options on it, calls connect(). No loop, no timed waits, nada. So it's going to take as long as it takes for the operating system to notice. > Irrespective of what PQcancel does, it looks like postgres_fdw should > just slam the connection if query is being aborted because of > statement_timeout. But then pgfdw_xact_callback() doesn't seem to have > a way to know whether this ABORT is because of user's request to > cancel the query, statement timeout, an abort because of some other > error or a user requested abort. Except statement timeout (may be > user's request to cancel the query?), it should try to keep the > connection around to avoid any future reconnection. But I am not able > to see how can we provide that information to pgfdw_xact_callback(). I don't think we can. In general, PostgreSQL has no facility for telling error cleanup handlers why the abort happened, and it wouldn't really be the right thing anyway. The fact that statement_timeout fired doesn't necessarily mean that the connection is dead; it could equally well mean that the query ran for a long time. I think we basically have two choices. 
One is to bound the amount of time we spend performing error cleanup, and the other is just to always drop the connection. A problem with the latter is that it might do the wrong thing if we're aborting a subtransaction but not the whole transaction. In that case, we need to undo changes since the relevant savepoint, but not the ones made before that; closing the connection amounts to a full rollback. Therefore, I think the patch you proposed in https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com isn't correct, because if the cancel fails it will slam the connection shut regardless of whether we're in pgfdw_xact_callback or pgfdw_subxact_callback. It looks to me like there's a fairly lengthy list of conceptually separate but practically related problems here: 1. PQcancel() ignores the keepalive parameters, because it makes no attempt to set the relevant socket options before connecting. 2. PQcancel() ignores connection_timeout, because it doesn't use non-blocking mode and has no wait loop. 3. There is no asynchronous equivalent of PQcancel(), so we can't use a loop to allow responding to interrupts while the cancel is outstanding (see pgfdw_get_result for an example). 4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec() rather than the asynchronous interfaces for sending queries and checking for results, so the SQL commands they send are not interruptible. 5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the connection back to a good state by using ABORT TRANSACTION for the former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter. But if it doesn't work, then they just emit a WARNING and continue on as if they'd succeeded. That seems highly likely to make the next use of that connection fail in unpredictable ways. 6. 
postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use pgfdw_exec_query() or something like it rather than PQexec() to submit queries, they might still block if we fail to send the query, as the comments for pgfdw_exec_query() explain. This is not possibly but not particularly likely to happen for queries being sent out of error handling paths, because odds are good that the connection was sent just before. However, if the user is pressing ^C because the remote server isn't responding, it's quite probable that we'll run into this exact issue. 7. postgres_fdw never considers slamming the connection shut as a response to trouble. It seems pretty clear that this is only a safe response if we're aborting the toplevel transaction. If we did it for a subtransaction, we'd end up reconnecting if the server were accessed again, which would at the very least change the snapshot (note that we use at least REPEATABLE READ on the remote side regardless of the local isolation level) and might discard changes made on the remote side at outer transaction levels. Even for a top-level transaction, it might not always be the right way to proceed, as it incurs some overhead to reconnect, but it seems worth considering at least in the case where we've already got some indication that the connection is unhealthy (e.g. a previous attempt to clean up the connection state failed, as in #5, or PQcancel failed, as in Ashutosh's proposed patch). 8. Before 9.6, PQexec() is used in many more places, so many more queries are entirely non-interruptible. It seems pretty clear to me that we are not going to fix all of these issues in one patch. Here's a sketch of an idea for how to start making things better: - Add an in_abort_cleanup flag to ConnCacheEntry. - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort cleanup, check whether the flag is set. 
If so, slam the connection shut unless that's already been done; furthermore, if the flag is set and we're in pgfdw_xact_callback (i.e. this is a toplevel abort), forget about the connection entry entirely. On the other hand, if the flag is not set, set the flag and attempt abort cleanup. If we succeed, clear the flag.

- Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin pre-commit processing, check whether the flag is set. If so, throw an ERROR, so that we switch over to abort processing.

- Change uses of PQexec() in the abort path to use pgfdw_exec_query() instead. If we exit pgfdw_exec_query() with an error, we'll re-enter the abort path, but now in_abort_cleanup will be set, so we'll just drop the connection (and force any outer transaction levels to abort as well).

- For bonus points, give pgfdw_exec_query() an optional timeout argument, and set it to 30 seconds or so when we're doing abort cleanup. If the timeout expires, it errors out (which again re-enters the abort path, dropping the connection and forcing outer transaction levels to abort as well).

Assuming the above works out, I propose to back-patch f039eaac7131ef2a4cf63a10cf98486f8bcd09d2, 1b812afb0eafe125b820cc3b95e7ca03821aa675, and the changes above to all supported releases. That would fix problems 4, 5, 7, and 8 from the above list for all branches. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
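The flag handling sketched above can be modeled as a tiny state machine. The following Python is purely illustrative (the real logic is C inside postgres_fdw; ConnEntry, run_remote_abort, and pre_commit are invented stand-ins for this sketch):

```python
class ConnEntry:
    """Toy stand-in for postgres_fdw's ConnCacheEntry."""
    def __init__(self):
        self.in_abort_cleanup = False
        self.closed = False

def abort_cleanup(entry, run_remote_abort):
    """On re-entry (flag already set), slam the connection shut instead
    of retrying remote commands; otherwise set the flag, attempt the
    remote abort, and clear the flag only on success."""
    if entry.in_abort_cleanup:
        entry.closed = True  # slam the connection shut
        return
    entry.in_abort_cleanup = True
    if run_remote_abort():   # e.g. ABORT TRANSACTION succeeded remotely
        entry.in_abort_cleanup = False

def pre_commit(entry):
    """Pre-commit refuses to proceed after incomplete abort cleanup,
    switching the transaction over to abort processing."""
    if entry.in_abort_cleanup:
        raise RuntimeError("abort cleanup incomplete; forcing abort")
```

For example, a failed remote ABORT leaves the flag set, so a later commit attempt errors out, and a second pass through abort cleanup drops the connection rather than retrying.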
On Wed, May 3, 2017 at 3:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > It seems pretty clear to me that we are not going to fix all of these > issues in one patch. Here's a sketch of an idea for how to start > making things better: Patch attached. > - Add an in_abort_cleanup flag to ConnCacheEntry. Ended up renaming this to abort_cleanup_incomplete. > - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort > cleanup, check whether the flag is set. If so, slam the connection > shut unless that's already been done; furthermore, if the flag is set > and we're in pgfdw_xact_callback (i.e. this is a toplevel abort), > forget about the connection entry entirely. On the other hand, if the > flag is not set, set it flag and attempt abort cleanup. If we > succeed, clear the flag. Did approximately this. It turned out not to be necessary to add any new calls to PQfinish(); the existing one was adequate. > - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin > pre-commit processing, check whether the flag is set. If so, throw an > ERROR, so that we switch over to abort processing. Did this. > - Change uses of PQexec() in the abort path to use pgfdw_exec_query() > instead. If we exit pgfdw_exec_query() with an error, we'll re-enter > the abort path, but now in_abort_cleanup will be set, so we'll just > drop the connection (and force any outer transaction levels to abort > as well). Created a new function pgfdw_exec_cleanup_query() for this, also incorporating a timeout. Also fixed things so that after issuing PQcancel() we actually throw away the pending result from the cancelled query, and added some error recursion protection. Review would be appreciated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 05/04/2017 08:01 AM, Robert Haas wrote: > Patch attached. I tried at my end after applying the patch against PG HEAD, Case 1 - without setting statement_timeout i.e default X machine - create table test1(a int); Y machine - CREATE SERVER myserver_ppas FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'X', dbname 'postgres', port '5432', connect_timeout '3'); CREATE USER MAPPING FOR centos SERVER myserver_ppas OPTIONS (user 'centos', password 'adminedb'); create foreign table ft_test_ppas (a int ) server myserver_ppas options (table_name 'test1'); statement_timeout =0; \timing insert into ft_test_ppas values (generate_series(1,10000000)); X machine- disconnect network Y machine - postgres=# insert into ft_test_ppas values (generate_series(1,10000000)); ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent ^CCancel request sent WARNING: could not send cancel request: PQcancel() -- connect() failed: Connection timed out ERROR: canceling statement due to user request Time: 81073.872 ms (01:21.074) Case 2- when statement_timeout=6000 Y machine - CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'X', dbname 'postgres', port '5432',keepalives '1', keepalives_interval '3',keepalives_idle '3', keepalives_count '1'); CREATE USER MAPPING FOR centos SERVER myserver OPTIONS (user 'centos', password 'adminedb'); create foreign table ft_test_ppas1 (a int ) server myserver options (table_name 'test1'); set statement_timeout=6000; \timing insert into ft_test_ppas1 values (generate_series(1,10000000)); X machine- disconnect network Y machine postgres=# insert into ft_test_ppas1 values (generate_series(1,10000000)); WARNING: could not send cancel request: PQcancel() -- connect() failed: Connection timed out ERROR: canceling statement due to statement timeout Time: 69009.875 ms (01:09.010) postgres=# Case 3-when 
statement_timeout=20000

Y machine -
CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'X', dbname 'postgres', port '5432', keepalives '1', keepalives_interval '3', keepalives_idle '3', keepalives_count '1');
CREATE USER MAPPING FOR centos SERVER myserver OPTIONS (user 'centos', password 'adminedb');
create foreign table ft_test_ppas1 (a int) server myserver options (table_name 'test1');
set statement_timeout=20000;
\timing
insert into ft_test_ppas1 values (generate_series(1,10000000));

X machine - disconnect network

Y machine -
postgres=# insert into ft_test_ppas1 values (generate_series(1,10000000));
WARNING: could not send cancel request: PQcancel() -- connect() failed: Connection timed out
ERROR: canceling statement due to statement timeout
Time: 83014.503 ms (01:23.015)

We can see that statement_timeout is working, but it takes some extra time; I am not sure whether this is expected behavior in the above case. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
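As an aside on the keepalives_* options used in these server definitions: for a regular connection libpq turns them into socket options set on the connection's socket, which is precisely what the PQcancel() socket never gets (problem 1 upthread). A rough Python illustration of the socket-level equivalents follows; this is not libpq code, and TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are the Linux spellings of the option names:

```python
import socket

def make_keepalive_socket(idle=3, interval=3, count=1):
    """Create a TCP socket with keepalive probing configured before
    connect(), roughly what keepalives_idle/interval/count request."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific option names; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s
```

With the cancel socket left at defaults, a dead peer is only detected by the OS-level connect and retransmission timeouts, which is consistent with the long waits reported above.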
On 05/04/2017 03:53 PM, tushar wrote:
> We can see statement_timeout is working but it is taking some extra
> time,not sure this is an expected behavior in above case or not.
This happens only when a remote server is involved; when both servers are on the same machine, it works as expected.

d1=# CREATE SERVER myserver_ppas FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'postgres', port '5432', connect_timeout '3');
CREATE SERVER
d1=# CREATE USER MAPPING FOR centos SERVER myserver_ppas OPTIONS (user 'centos', password 'adminedb');
CREATE USER MAPPING
d1=# create foreign table ft_test_ppas (a int) server myserver_ppas options (table_name 'test1');
CREATE FOREIGN TABLE
d1=# insert into ft_test_ppas values (1);
INSERT 0 1

Case 1 -
d1=# \timing
Timing is on.
d1=# set statement_timeout =6000;
SET
Time: 0.360 ms
d1=# insert into ft_test_ppas values (generate_series(1,10000000));
ERROR: canceling statement due to statement timeout
Time: 6002.509 ms (00:06.003)

Case 2 -
d1=# set statement_timeout =20000;
SET
Time: 0.693 ms
d1=# insert into ft_test_ppas values (generate_series(1,10000000));
ERROR: canceling statement due to statement timeout
Time: 20001.741 ms (00:20.002)

-- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Thu, May 4, 2017 at 6:23 AM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> We can see statement_timeout is working but it is taking some extra time,not
> sure this is an expected behavior in above case or not.
Yeah, that's expected. To fix that, we'd need libpq to have an async equivalent of PQcancel(), which doesn't currently exist. I think it would be a good idea to add that, but that's another discussion.

With this patch:

1. If postgres_fdw is waiting for one of the cleanup queries to execute, you can cancel it, and things like statement_timeout will also work.

2. If the cancel fails or any of the cleanup queries fail, postgres_fdw will forcibly close the connection and force the current transaction and all subtransactions to abort. This isn't wonderful behavior, but if we can't talk to the remote server any more there doesn't seem to be any other real alternative. (We could improve this, I think, by tracking whether the connection had ever been used by an outer transaction level, and if not, allowing the transaction to commit if it never again tried to access the failed connection, but I left the design and installation of such a mechanism to a future patch.)

3. But if you're stuck trying to send the cancel request itself, you still have to wait for that to fail before anything else happens. Once it does, we'll proceed as outlined in point #2. This stinks, but it's the inevitable consequence of having no async equivalent of PQcancel(). My goal is to make things noticeably better with this patch, not to fix them completely.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
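The long PQcancel() stalls come down to its plain blocking connect(); the bounded alternative used for regular libpq connections is a non-blocking connect plus a timed wait. An illustrative Python sketch of that pattern (not libpq code; the function name is invented):

```python
import errno
import select
import socket

def connect_with_timeout(addr, timeout):
    """Non-blocking connect() bounded by a timeout, the pattern that
    lets regular libpq connections honor connect_timeout."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    rc = s.connect_ex(addr)
    if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        s.close()
        raise OSError(rc, "connect failed immediately")
    # Bounded wait for the socket to become writable (connect finished).
    _, writable, _ = select.select([], [s], [], timeout)
    if not writable:
        s.close()
        raise TimeoutError("connect timed out")
    # A writable socket may still carry a deferred connect error.
    err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        s.close()
        raise OSError(err, "connect failed")
    s.setblocking(True)
    return s
```

An interruptible version would additionally poll for cancel interrupts between short select() waits; internal_cancel() currently does neither.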
On Thu, May 4, 2017 at 1:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> The logs above show that 34 seconds elapsed between starting to abort >> the transaction and knowing that the foreign server is unreachable. It >> looks like it took that much time for the local server to realise that >> the foreign server is not reachable. Looking at PQcancel code, it >> seems to be trying to connect to the foreign server to cancel the >> query. But somehow it doesn't seem to honor connect_timeout setting. >> Is that expected? > > Well, there's no code to do anything else. Regular connections go > through connectDBComplete() which uses non-blocking mode and timed > waits to respect connection_timeout. PQcancel() has no such handling. > internal_cancel just opens a socket and, without setting any options > on it, calls connect(). No loop, no timed waits, nada. So it's going > to take as long as it takes for the operating system to notice. > >> Irrespective of what PQcancel does, it looks like postgres_fdw should >> just slam the connection if query is being aborted because of >> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have >> a way to know whether this ABORT is because of user's request to >> cancel the query, statement timeout, an abort because of some other >> error or a user requested abort. Except statement timeout (may be >> user's request to cancel the query?), it should try to keep the >> connection around to avoid any future reconnection. But I am not able >> to see how can we provide that information to pgfdw_xact_callback(). > > I don't think we can. In general, PostgreSQL has no facility for > telling error cleanup handlers why the abort happened, and it wouldn't > really be the right thing anyway. 
The fact that statement_timeout > fired doesn't necessarily mean that the connection is dead; it could > equally well mean that the query ran for a long time. I think we > basically have two choices. One is to bound the amount of time we > spend performing error cleanup, and the other is just to always drop > the connection. A problem with the latter is that it might do the > wrong thing if we're aborting a subtransaction but not the whole > transaction. In that case, we need to undo changes since the relevant > savepoint, but not the ones made before that; closing the connection > amounts to a full rollback. > > Therefore, I think the patch you proposed in > https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com > isn't correct, because if the cancel fails it will slam the connection > shut regardless of whether we're in pgfdw_xact_callback or > pgfdw_subxact_callback. > > It looks to me like there's a fairly lengthy list of conceptually > separate but practically related problems here: > > 1. PQcancel() ignores the keepalive parameters, because it makes no > attempt to set the relevant socket options before connecting. > > 2. PQcancel() ignores connection_timeout, because it doesn't use > non-blocking mode and has no wait loop. > > 3. There is no asynchronous equivalent of PQcancel(), so we can't use > a loop to allow responding to interrupts while the cancel is > outstanding (see pgfdw_get_result for an example). > > 4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec() > rather than the asynchronous interfaces for sending queries and > checking for results, so the SQL commands they send are not > interruptible. > > 5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the > connection back to a good state by using ABORT TRANSACTION for the > former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter. 
> But if it doesn't work, then they just emit a WARNING and continue on
> as if they'd succeeded. That seems highly likely to make the next use
> of that connection fail in unpredictable ways.

In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails for any reason then I think it will close the connection. The relevant code is:

if (PQstatus(entry->conn) != CONNECTION_OK ||
    PQtransactionStatus(entry->conn) != PQTRANS_IDLE)

Basically, if the abort of the transaction fails then the transaction status won't be PQTRANS_IDLE. Also, does it matter if pgfdw_subxact_callback fails during execution of "ROLLBACK TO SAVEPOINT"? The user anyway needs to execute a rollback-to-savepoint command before proceeding, and that should be good enough.

About the patch:

+ /* Rollback all remote subtransactions during abort */
+ snprintf(sql, sizeof(sql),
+          "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
+          curlevel, curlevel);
+ if (!pgfdw_exec_cleanup_query(entry->conn, sql))
+     abort_cleanup_failure = true;

If the above execution fails for any reason, then it won't even allow committing the main transaction, which I am not sure is the right thing to do. I tried it by manually editing the above command in the debugger; the result is as below. Create a foreign table and try the following scenario:

postgres=# begin;
BEGIN
postgres=# savepoint s1;
SAVEPOINT
postgres=# insert into foreign_test values(8);
INSERT 0 1
postgres=# savepoint s2;
SAVEPOINT
postgres=# insert into foreign_test values(generate_series(1,1000000));
Cancel request sent
WARNING: syntax error at or near "OOLLBACK"
ERROR: canceling statement due to user request
-- In the debugger, I have changed ROLLBACK to mimic the failure.
postgres=# Rollback to savepoint s2;
ROLLBACK
postgres=# commit;
ERROR: connection to server "foreign_server" was lost

Considering the above, I am not sure if we should treat all failures during abort of a subtransaction as indicating a pending-cleanup state and then disallow further commits.
 #include "utils/memutils.h"
-
+#include "utils/syscache.h"

Looks like a spurious line removal.

> 6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use pgfdw_exec_query() or something like it rather than PQexec() to submit queries, they might still block if we fail to send the query, as the comments for pgfdw_exec_query() explain. This is not impossible, but not particularly likely to happen for queries sent from error handling paths, because odds are good that the connection was used successfully just before. However, if the user is pressing ^C because the remote server isn't responding, it's quite probable that we'll run into this exact issue.
>
> 7. postgres_fdw never considers slamming the connection shut as a response to trouble. It seems pretty clear that this is only a safe response if we're aborting the toplevel transaction. If we did it for a subtransaction, we'd end up reconnecting if the server were accessed again, which would at the very least change the snapshot (note that we use at least REPEATABLE READ on the remote side regardless of the local isolation level) and might discard changes made on the remote side at outer transaction levels. Even for a top-level transaction, it might not always be the right way to proceed, as it incurs some overhead to reconnect, but it seems worth considering at least in the case where we've already got some indication that the connection is unhealthy (e.g. a previous attempt to clean up the connection state failed, as in #5, or PQcancel failed, as in Ashutosh's proposed patch).
>
> 8. Before 9.6, PQexec() is used in many more places, so many more queries are entirely non-interruptible.
>
> It seems pretty clear to me that we are not going to fix all of these issues in one patch. Here's a sketch of an idea for how to start making things better:
>
> - Add an in_abort_cleanup flag to ConnCacheEntry.
> > - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort > cleanup, check whether the flag is set. If so, slam the connection > shut unless that's already been done; furthermore, if the flag is set > and we're in pgfdw_xact_callback (i.e. this is a toplevel abort), > forget about the connection entry entirely. On the other hand, if the > flag is not set, set it flag and attempt abort cleanup. If we > succeed, clear the flag. > > - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin > pre-commit processing, check whether the flag is set. If so, throw an > ERROR, so that we switch over to abort processing. > > - Change uses of PQexec() in the abort path to use pgfdw_exec_query() > instead. If we exit pgfdw_exec_query() with an error, we'll re-enter > the abort path, but now in_abort_cleanup will be set, so we'll just > drop the connection (and force any outer transaction levels to abort > as well). > > - For bonus points, give pgfdw_exec_query() an optional timeout > argument, and set it to 30 seconds or so when we're doing abort > cleanup. If the timeout succeeds, it errors out (which again > re-enters the abort path, dropping the connection and forcing outer > transaction levels to abort as well). > Why do we need this 30 seconds timeout for abort cleanup failures? Isn't it sufficient if we propagate the error only on failures? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 4, 2017 at 7:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails > due to any reason then I think it will close the connection. The > relavant code is: > if (PQstatus(entry->conn) != CONNECTION_OK || > PQtransactionStatus(entry->conn) != PQTRANS_IDLE) > > Basically, if abort transaction fails then transaction status won't be > PQTRANS_IDLE. I don't think that's necessarily true. Look at this code: /* Rollback all remote subtransactions during abort */ snprintf(sql, sizeof(sql), "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d", curlevel, curlevel); res = PQexec(entry->conn,sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) pgfdw_report_error(WARNING,res, entry->conn, true, sql); else PQclear(res); If that sql command somehow returns a result status other than PGRES_COMMAND_OK, like say due to a local OOM failure or whatever, the connection is PQTRANS_IDLE but the rollback wasn't actually done. > Also, does it matter if pgfdw_subxact_callback fails > during execution of "ROLLBACK TO SAVEPOINT", because anyway user needs > to execute rollback to savepoint command before proceeding and that > should be good enough? I don't understand this. If pgfdw_subxact_callback doesn't roll the remote side back to the savepoint, how is it going to get done later? There's no code in postgres_fdw other than that code to issue the rollback. > About patch: > > + /* Rollback all remote subtransactions during abort */ > + snprintf(sql, sizeof(sql), > + "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d", > + curlevel, curlevel); > + if (!pgfdw_exec_cleanup_query(entry->conn, sql)) > + abort_cleanup_failure = true; > > If the above execution fails due to any reason, then it won't allow > even committing the mail transaction which I am not sure is the right > thing to do. Well, as I said in my reply to Tushar, I think it's the only correct thing to do. 
Suppose you do the following: (1) make a change to the foreign table (2) enter a subtransaction (3) make another change to the foreign table (4) roll back the subtransaction (5) commit the transaction If the remote rollback gets skipped in step 4, it is dead wrong for step 5 to commit the remote transaction anyway. Both the change made in step (1) and the change made in step (3) would get made on the remote side, which is 100% wrong. Given the failure of the rollback in step 4, there is no way to achieve the desired outcome of committing the step (1) change and rolling back the step (3) change. The only way forward is to roll back everything - i.e. force a toplevel abort. > #include "utils/memutils.h" > - > +#include "utils/syscache.h" > > Looks like spurious line removal OK. >> - For bonus points, give pgfdw_exec_query() an optional timeout >> argument, and set it to 30 seconds or so when we're doing abort >> cleanup. If the timeout succeeds, it errors out (which again >> re-enters the abort path, dropping the connection and forcing outer >> transaction levels to abort as well). > > Why do we need this 30 seconds timeout for abort cleanup failures? > Isn't it sufficient if we propagate the error only on failures? Well, the problem is that right now we're waiting for vast amounts of time when the remote connection dies. First, the user notices that the command is running for too long and hits ^C. At that point, the connection is probably already dead, and the user's been waiting for a while already. Then, we wait for PQcancel() to time out. Then, we wait for PQexec() to time out. The result of that, as Suraj mentioned in the original email, is that it takes about 16 minutes for the session to recover when the connection dies, even with keepalive settings and connection timeout set all the way to the minimum. The goal here is to make it take a more reasonable time to recover. 
Without modifying libpq, we can't do anything about the need to wait for PQcancel() to time out, but we can improve the rest of it. If we removed that 30-second timeout, we would win in the case where either ABORT TRANSACTION or ROLLBACK; RELEASE eventually succeeds but takes more than 30 seconds. In that case, the change you are proposing would allow us to reuse the connection instead of dropping it, and would prevent a forced toplevel abort when subtransactions are in use. However, the cost would be that when one of those commands never succeeds, we would wait a lot longer before finishing a transaction abort. I argue that if the remote server fails to respond to one of those commands within 30 seconds, it's probably dead or nearly dead, and we're probably better off just treating it as dead rather than waiting longer. That's a judgement call, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
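The five-step scenario earlier in this message can be played out with a toy model of the remote transaction (pure illustration; RemoteTxn and its methods are invented for this sketch):

```python
class RemoteTxn:
    """Toy remote transaction: an ordered log of changes plus savepoints."""
    def __init__(self):
        self.changes = []
        self.savepoints = {}
        self.committed = None

    def execute(self, change):
        self.changes.append(change)

    def savepoint(self, name):
        self.savepoints[name] = len(self.changes)

    def rollback_to(self, name):
        self.changes = self.changes[:self.savepoints[name]]

    def commit(self):
        self.committed = list(self.changes)

txn = RemoteTxn()
txn.execute("change 1")   # (1) change the foreign table
txn.savepoint("s1")       # (2) enter a subtransaction
txn.execute("change 2")   # (3) change it again
# (4) ROLLBACK TO SAVEPOINT s1 fails and is silently skipped
txn.commit()              # (5) commit anyway: both changes persist
```

Since txn.committed now contains both changes, the step-(3) change survives despite the rollback request, which is exactly the 100% wrong outcome described above; the only safe response to the failed rollback is a toplevel abort.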
On Thu, May 4, 2017 at 8:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 4, 2017 at 7:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >>> - For bonus points, give pgfdw_exec_query() an optional timeout >>> argument, and set it to 30 seconds or so when we're doing abort >>> cleanup. If the timeout succeeds, it errors out (which again >>> re-enters the abort path, dropping the connection and forcing outer >>> transaction levels to abort as well). >> >> Why do we need this 30 seconds timeout for abort cleanup failures? >> Isn't it sufficient if we propagate the error only on failures? > > Well, the problem is that right now we're waiting for vast amounts of > time when the remote connection dies. First, the user notices that > the command is running for too long and hits ^C. At that point, the > connection is probably already dead, and the user's been waiting for a > while already. Then, we wait for PQcancel() to time out. Then, we > wait for PQexec() to time out. The result of that, as Suraj mentioned > in the original email, is that it takes about 16 minutes for the > session to recover when the connection dies, even with keepalive > settings and connection timeout set all the way to the minimum. The > goal here is to make it take a more reasonable time to recover. > Without modifying libpq, we can't do anything about the need to wait > for PQcancel() to time out, but we can improve the rest of it. If we > removed that 30-second timeout, we would win in the case where either > ABORT TRANSACTION or ROLLBACK; RELEASE eventually succeeds but takes > more than 30 seconds. In that case, the change you are proposing > would allow us to reuse the connection instead of dropping it, and > would prevent a forced toplevel abort when subtransactions are in use. > However, the cost would be that when one of those commands never > succeeds, we would wait a lot longer before finishing a transaction > abort. 
>
As soon as the first command fails due to timeout, we will set 'abort_cleanup_failure', which will force the toplevel transaction to abort and also won't allow other statements to execute. The patch is trying to enforce a 30-second timeout after statement execution, so it has to wait anyway until the command execution completes (irrespective of whether the command succeeds or fails). It is quite possible I am missing some point; could you explain in somewhat more detail in which exact case the 30-second timeout will allow us to abort the toplevel transaction, given that we already do that when a statement fails? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > As soon as the first command fails due to timeout, we will set > 'abort_cleanup_failure' which will make a toplevel transaction to > abort and also won't allow other statements to execute. The patch is > trying to enforce a 30-second timeout after statement execution, so it > has to anyway wait till the command execution completes (irrespective > of whether the command succeeds or fail). I don't understand what you mean by this. If the command doesn't finish within 30 seconds, we *don't* wait for it to finish. To avoid getting inconsistent results in consequence, we set abort_cleanup_failure. > It is quite possible I am > missing some point, is it possible for you to tell in somewhat more > detail in which exact case 30-second timeout will allow us to abort > the toplevel transaction if we already do that in the case of > statement failure case? Suppose the user's original connection is stuck for some reason but the postmaster is still accepting new connections - e.g. somebody attached gdb to the remote server process and it is therefore stopped. The cancel request gets sent OK, but without the 30-second timeout, we'll hang forever (or until gdb continues or detaches the processes) waiting for it to take effect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 4, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As soon as the first command fails due to timeout, we will set
>> 'abort_cleanup_failure' which will make a toplevel transaction to
>> abort and also won't allow other statements to execute. The patch is
>> trying to enforce a 30-second timeout after statement execution, so it
>> has to anyway wait till the command execution completes (irrespective
>> of whether the command succeeds or fail).
>
> I don't understand what you mean by this. If the command doesn't
> finish within 30 seconds, we *don't* wait for it to finish.
>

+ /*
+  * Submit a query.  Since we don't use non-blocking mode, this also can
+  * block.  But its risk is relatively small, so we ignore that for now.
+  */
+ if (!PQsendQuery(conn, query))
+ {
+     pgfdw_report_error(WARNING, NULL, conn, false, query);
+     return false;
+ }
+
+ /* Get the result of the query. */
+ if (pgfdw_get_cleanup_result(conn, endtime, &result))
+     return false;

The 30-second logic is in the function pgfdw_get_cleanup_result; can't the command hang in PQsendQuery? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
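For context, the deadline handling inside pgfdw_get_cleanup_result() is essentially a poll-until-endtime loop. A generic Python sketch of that shape (illustrative only; poll_fn stands in for the PQconsumeInput()/PQisBusy()/wait-on-socket dance, and the function name is invented):

```python
import time

def get_result_with_deadline(poll_fn, endtime, interval=0.01):
    """Poll poll_fn() until it yields a result or endtime (a
    time.monotonic() value) passes.  Returns (timed_out, result); on
    timeout the caller errors out, re-entering the abort path and
    dropping the connection."""
    while True:
        result = poll_fn()
        if result is not None:
            return False, result
        if time.monotonic() >= endtime:
            return True, None
        time.sleep(interval)  # real code waits on the socket instead
```

As Amit points out, a bound of this shape covers only the wait for the result; a blocking PQsendQuery() issued before the loop is not covered by it.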
On Thu, May 4, 2017 at 1:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, May 4, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >>> As soon as the first command fails due to timeout, we will set >>> 'abort_cleanup_failure' which will make a toplevel transaction to >>> abort and also won't allow other statements to execute. The patch is >>> trying to enforce a 30-second timeout after statement execution, so it >>> has to anyway wait till the command execution completes (irrespective >>> of whether the command succeeds or fail). >> >> I don't understand what you mean by this. If the command doesn't >> finish within 30 seconds, we *don't* wait for it to finish. >> > > + /* > + * Submit a query. Since we don't use non-blocking mode, this also can > + * block. But its risk is relatively small, so we ignore that for now. > + */ > + if (!PQsendQuery(conn, query)) > + { > + pgfdw_report_error(WARNING, NULL, conn, false, query); > + return false; > + } > + > + /* Get the result of the query. */ > + if (pgfdw_get_cleanup_result(conn, endtime, &result)) > + return false; > > The 30 seconds logic is in function pgfdw_get_cleanup_result, can't > the command hang in PQsendQuery? Sure. I thought about trying to fix that too, by using PQsetnonblocking(), but I thought the patch was doing enough already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 4, 2017 at 8:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 4, 2017 at 7:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails
>> due to any reason then I think it will close the connection. The
>> relevant code is:
>> if (PQstatus(entry->conn) != CONNECTION_OK ||
>> PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
>>
>> Basically, if abort transaction fails then transaction status won't be
>> PQTRANS_IDLE.
>
> I don't think that's necessarily true. Look at this code:
>
> /* Rollback all remote subtransactions during abort */
> snprintf(sql, sizeof(sql),
>          "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
>          curlevel, curlevel);
> res = PQexec(entry->conn, sql);
> if (PQresultStatus(res) != PGRES_COMMAND_OK)
>     pgfdw_report_error(WARNING, res, entry->conn, true, sql);
> else
>     PQclear(res);
>
> If that sql command somehow returns a result status other than
> PGRES_COMMAND_OK, like say due to a local OOM failure or whatever, the
> connection is PQTRANS_IDLE but the rollback wasn't actually done.

I have tried to verify the above point, and it seems to me that in such a situation the transaction status will be either PQTRANS_INTRANS or PQTRANS_INERROR. I mimicked an "out of memory" failure by making PQmakeEmptyPGresult return NULL (malloc failure). AFAIU, it will set the state to PQTRANS_IDLE only when the statement has executed successfully.

>> Also, does it matter if pgfdw_subxact_callback fails
>> during execution of "ROLLBACK TO SAVEPOINT", because anyway user needs
>> to execute rollback to savepoint command before proceeding and that
>> should be good enough?
>
> I don't understand this. If pgfdw_subxact_callback doesn't roll the
> remote side back to the savepoint, how is it going to get done later?
> There's no code in postgres_fdw other than that code to issue the
> rollback.

It is done via toplevel transaction ABORT.
I think how it works is after "ROLLBACK TO SAVEPOINT" failure, any next execution is going to fail. Let us say if we issue Commit after failure, it will try to execute RELEASE SAVEPOINT which will fail and lead to toplevel transaction ABORT. Now if the toplevel transaction ABORT also fails, it will drop the connection. >> About patch: >> >> + /* Rollback all remote subtransactions during abort */ >> + snprintf(sql, sizeof(sql), >> + "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d", >> + curlevel, curlevel); >> + if (!pgfdw_exec_cleanup_query(entry->conn, sql)) >> + abort_cleanup_failure = true; >> >> If the above execution fails due to any reason, then it won't allow >> even committing the mail transaction which I am not sure is the right >> thing to do. > > Well, as I said in my reply to Tushar, I think it's the only correct > thing to do. Suppose you do the following: > > (1) make a change to the foreign table > (2) enter a subtransaction > (3) make another change to the foreign table > (4) roll back the subtransaction > (5) commit the transaction > > If the remote rollback gets skipped in step 4, it is dead wrong for > step 5 to commit the remote transaction anyway. Both the change made > in step (1) and the change made in step (3) would get made on the > remote side, which is 100% wrong. Given the failure of the rollback > in step 4, there is no way to achieve the desired outcome of > committing the step (1) change and rolling back the step (3) change. > The only way forward is to roll back everything - i.e. force a > toplevel abort. > Agreed, in such a situation toplevel transaction abort is the only way out. However, it seems to me that will anyway happen in current code even if we don't try to force it via abort_cleanup_failure. I understand that it might be better to force it as you have done in the patch, but not sure if it is good to drop the connection where previously it could have done without it (for ex. 
if the first statement Rollback To Savepoint fails, but ABORT succeeds). I feel it is worth considering whether such a behavior difference is okay as you have proposed to backpatch it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, May 4, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 4, 2017 at 1:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, May 4, 2017 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> As soon as the first command fails due to timeout, we will set
>>>> 'abort_cleanup_failure', which will make the toplevel transaction
>>>> abort and also won't allow other statements to execute. The patch is
>>>> trying to enforce a 30-second timeout after statement execution, so it
>>>> has to wait anyway till the command execution completes (irrespective
>>>> of whether the command succeeds or fails).
>>>
>>> I don't understand what you mean by this. If the command doesn't
>>> finish within 30 seconds, we *don't* wait for it to finish.
>>
>> + /*
>> + * Submit a query. Since we don't use non-blocking mode, this also can
>> + * block. But its risk is relatively small, so we ignore that for now.
>> + */
>> + if (!PQsendQuery(conn, query))
>> + {
>> +     pgfdw_report_error(WARNING, NULL, conn, false, query);
>> +     return false;
>> + }
>> +
>> + /* Get the result of the query. */
>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>> +     return false;
>>
>> The 30-second logic is in function pgfdw_get_cleanup_result; can't
>> the command hang in PQsendQuery?
>
> Sure. I thought about trying to fix that too, by using
> PQsetnonblocking(), but I thought the patch was doing enough already.

Okay, it seems better to deal with it in a separate patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 5, 2017 at 1:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have tried to verify the above point, and it seems to me that in such a
> situation the transaction status will be either PQTRANS_INTRANS or
> PQTRANS_INERROR. I tried to mimic an "out of memory" failure by making
> PQmakeEmptyPGresult return NULL (malloc failure). AFAIU, it will set the
> state to PQTRANS_IDLE only when the statement is executed successfully.

Hrm. OK, you might be right, but still I don't see why having the
30-second timeout is bad. People don't want a transaction rollback to hang
for a long period of time waiting for ABORT TRANSACTION to run if we
could've just dropped the connection to get the same effect. Sure, the
next transaction will then have to reconnect, but that takes a few tens of
milliseconds; if you wait for one extra second to avoid that outcome, you
come out behind even if you do manage to avoid the reconnect. Now for a
subtransaction you could argue that it's worth being more patient, because
forcing a toplevel abort sucks. But what are the chances that, after
failing to respond to a ROLLBACK; RELEASE SAVEPOINT within 30 seconds, the
remote side is going to wake up again and start behaving normally? I'd
argue that it's not terribly likely and most people aren't going to want
to wait for multiple minutes to see whether it does, but I suppose we
could impose the 30-second timeout only at the toplevel and let
subtransactions wait however long the operating system takes to time out.

>>> Also, does it matter if pgfdw_subxact_callback fails
>>> during execution of "ROLLBACK TO SAVEPOINT", because anyway the user needs
>>> to execute the rollback to savepoint command before proceeding and that
>>> should be good enough?
>>
>> I don't understand this. If pgfdw_subxact_callback doesn't roll the
>> remote side back to the savepoint, how is it going to get done later?
>> There's no code in postgres_fdw other than that code to issue the
>> rollback.
>
> It is done via the toplevel transaction ABORT. I think the way it works
> is that after a "ROLLBACK TO SAVEPOINT" failure, any subsequent execution
> is going to fail. Say we issue Commit after the failure: it will try to
> execute RELEASE SAVEPOINT, which will fail and lead to a toplevel
> transaction ABORT. Now if the toplevel transaction ABORT also fails, it
> will drop the connection.

Mmmph. I guess that would work, but I think it's better to track it
explicitly. I tried this (bar is a foreign table):

begin;
select * from bar;
savepoint s1;
select * from bar;
savepoint s2;
select * from bar;
savepoint s3;
select * from bar;
commit;

On the remote side, the commit statement produces this:

2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s4
2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s3
2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s2
2017-05-05 09:28:52.597 EDT [80263] LOG: statement: COMMIT TRANSACTION

But that's obviously silly. Somebody could easily come along and optimize
that by getting rid of the RELEASE SAVEPOINT commands, which are
completely unnecessary if we're going to commit the toplevel transaction
anyway, and then the argument that you're making wouldn't be true any
more. It seems smart to me to explicitly track whether the remote side is
known to be in a bad state rather than relying on the future sequence of
commands to be such that we'll get the right result anyway.

> Agreed; in such a situation a toplevel transaction abort is the only way
> out. However, it seems to me that will happen anyway in the current code,
> even if we don't try to force it via abort_cleanup_failure. I understand
> that it might be better to force it as you have done in the patch, but I
> am not sure if it is good to drop the connection where previously it
> could have survived (for example, if the first statement Rollback To
> Savepoint fails, but ABORT succeeds). I feel it is worth considering
> whether such a behavior difference is okay, given that you have proposed
> to backpatch it.

Well, I think the whole point of this patch is that if one command on a
connection fails, the chances of future commands succeeding are pretty
poor. It's not impossible, but it's not very likely either.

To recap, the problem at present is that if network connectivity is
interrupted and the statement is then killed by statement_timeout or a
local cancel request, we try to send a cancel request to the remote side,
which takes a while to time out. Even if that fails, we then send an ABORT
TRANSACTION, hoping against hope that it will succeed despite the failure
of the cancel request. The user ends up waiting for the cancel request to
time out and then for the ABORT TRANSACTION to time out, which takes 15-16
minutes no matter how you set the keepalive settings and no matter how
many times you hit ^C on the local side.

Now, if you want to allow for the possibility that some future operation
might succeed never mind past failures, then you're basically saying that
trying the ABORT TRANSACTION even though the query cancel has failed is
the correct behavior. And if you don't want the ABORT TRANSACTION to have
a 30-second timeout either, then you're saying we should wait however long
it takes the operating system to detect the failure, which again is the
current behavior. If we rip those things out of the patch, then there's
not much left of the patch. The only thing that would be left would be
making the wait for ABORT TRANSACTION interruptible, so that if you hit ^C
*again* after you've already canceled the query, we then give up on
waiting for the ABORT TRANSACTION. I guess that would be an improvement
over the status quo, but it's not really a solution. People don't expect
to have to send multiple cancel requests to kill the same query in a
timely fashion, and if the commands are being sent by an application
rather than by a human being it probably won't happen.

Do you want to propose a competing patch to fix this problem? I'm happy to
listen to suggestions. But I think if you're determined to say "let's rely
entirely on the OS timeouts rather than having any of our own" and you're
also determined to say "let's not assume that a failure of one operation
means that the connection is dead", then you're basically arguing that
there's nothing really broken here, and I don't agree with that. Aborting
a transaction shouldn't take a quarter of an hour.

You could perhaps argue that it's better to make only one of the two
changes I'm proposing and not both, but I don't really see it. If you make
the abort_cleanup_failure changes but don't apply the 30-second timeout,
then suppose the cancel request works but the connection is dead and the
ABORT TRANSACTION command takes five minutes to time out. Well, if you'd
had the timeout, you would have just closed the connection after 30
seconds and the only thing it would have cost you is a reconnect the next
time that foreign server was used. Instead, you waited 5 minutes in the
hope of saving a reconnect cycle that probably would have taken tens of
milliseconds or less. Blech. On the other hand, if you make the 30-second
timeout changes but not the abort_cleanup_failure changes, then suppose
network connectivity is entirely interrupted. Then you'll wait however
long it takes for the cancel to succeed plus another 30 seconds for the
ABORT TRANSACTION, whereas with the patch as proposed you only wait for
the first one. Also blech.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 5, 2017 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 5, 2017 at 1:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have tried to verify the above point, and it seems to me that in such a
>> situation the transaction status will be either PQTRANS_INTRANS or
>> PQTRANS_INERROR. I tried to mimic an "out of memory" failure by making
>> PQmakeEmptyPGresult return NULL (malloc failure). AFAIU, it will set the
>> state to PQTRANS_IDLE only when the statement is executed successfully.
>
> Hrm. OK, you might be right, but still I don't see why having the
> 30-second timeout is bad. People don't want a transaction rollback to
> hang for a long period of time waiting for ABORT TRANSACTION to run if
> we could've just dropped the connection to get the same effect. Sure,
> the next transaction will then have to reconnect, but that takes a few
> tens of milliseconds; if you wait for one extra second to avoid that
> outcome, you come out behind even if you do manage to avoid the
> reconnect. Now for a subtransaction you could argue that it's worth
> being more patient, because forcing a toplevel abort sucks. But what
> are the chances that, after failing to respond to a ROLLBACK; RELEASE
> SAVEPOINT within 30 seconds, the remote side is going to wake up again
> and start behaving normally? I'd argue that it's not terribly likely
> and most people aren't going to want to wait for multiple minutes to
> see whether it does, but I suppose we could impose the 30-second
> timeout only at the toplevel and let subtransactions wait however long
> the operating system takes to time out.
>
>>>> Also, does it matter if pgfdw_subxact_callback fails
>>>> during execution of "ROLLBACK TO SAVEPOINT", because anyway the user
>>>> needs to execute the rollback to savepoint command before proceeding
>>>> and that should be good enough?
>>>
>>> I don't understand this. If pgfdw_subxact_callback doesn't roll the
>>> remote side back to the savepoint, how is it going to get done later?
>>> There's no code in postgres_fdw other than that code to issue the
>>> rollback.
>>
>> It is done via the toplevel transaction ABORT. I think the way it works
>> is that after a "ROLLBACK TO SAVEPOINT" failure, any subsequent
>> execution is going to fail. Say we issue Commit after the failure: it
>> will try to execute RELEASE SAVEPOINT, which will fail and lead to a
>> toplevel transaction ABORT. Now if the toplevel transaction ABORT also
>> fails, it will drop the connection.
>
> Mmmph. I guess that would work, but I think it's better to track it
> explicitly. I tried this (bar is a foreign table):
>
> begin;
> select * from bar;
> savepoint s1;
> select * from bar;
> savepoint s2;
> select * from bar;
> savepoint s3;
> select * from bar;
> commit;
>
> On the remote side, the commit statement produces this:
>
> 2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s4
> 2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s3
> 2017-05-05 09:28:52.597 EDT [80263] LOG: statement: RELEASE SAVEPOINT s2
> 2017-05-05 09:28:52.597 EDT [80263] LOG: statement: COMMIT TRANSACTION
>
> But that's obviously silly. Somebody could easily come along and
> optimize that by getting rid of the RELEASE SAVEPOINT commands, which
> are completely unnecessary if we're going to commit the toplevel
> transaction anyway, and then the argument that you're making wouldn't
> be true any more. It seems smart to me to explicitly track whether
> the remote side is known to be in a bad state rather than relying on
> the future sequence of commands to be such that we'll get the right
> result anyway.

Sure, tracking it explicitly might be a better way to deal with it, but
always dropping the connection in case of error might not be the best
thing to do.

>> Agreed; in such a situation a toplevel transaction abort is the only way
>> out. However, it seems to me that will happen anyway in the current
>> code, even if we don't try to force it via abort_cleanup_failure. I
>> understand that it might be better to force it as you have done in the
>> patch, but I am not sure if it is good to drop the connection where
>> previously it could have survived (for example, if the first statement
>> Rollback To Savepoint fails, but ABORT succeeds). I feel it is worth
>> considering whether such a behavior difference is okay, given that you
>> have proposed to backpatch it.
>
> Well, I think the whole point of this patch is that if one command on
> a connection fails, the chances of future commands succeeding are
> pretty poor. It's not impossible, but it's not very likely either.
> To recap, the problem at present is that if network connectivity is
> interrupted and the statement is then killed by statement_timeout or a
> local cancel request, we try to send a cancel request to the remote
> side, which takes a while to time out. Even if that fails, we then
> send an ABORT TRANSACTION, hoping against hope that it will succeed
> despite the failure of the cancel request. The user ends up waiting
> for the cancel request to time out and then for the ABORT TRANSACTION
> to time out, which takes 15-16 minutes no matter how you set the
> keepalive settings and no matter how many times you hit ^C on the
> local side.
>
> Now, if you want to allow for the possibility that some future
> operation might succeed never mind past failures, then you're
> basically saying that trying the ABORT TRANSACTION even though the
> query cancel has failed is the correct behavior. And if you don't
> want the ABORT TRANSACTION to have a 30-second timeout either, then
> you're saying we should wait however long it takes the operating
> system to detect the failure, which again is the current behavior. If
> we rip those things out of the patch, then there's not much left of
> the patch.

I am not saying to rip out those changes. Most of your mail stresses the
30-second timeout which you have added in the patch to detect timeouts
during failures (rollback processing). I have no objection to that part of
the patch, except that there is still a race condition around PQsendQuery,
and you indicated that it is better to deal with it as a separate patch,
to which I agree. The point on which I am not in total agreement is about
failures other than the timeout.

+pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
{
..
+ /* Get the result of the query. */
+ if (pgfdw_get_cleanup_result(conn, endtime, &result))
+     return false;
+
+ /* Issue a warning if not successful. */
+ if (PQresultStatus(result) != PGRES_COMMAND_OK)
+ {
+     pgfdw_report_error(WARNING, result, conn, true, query);
+     return false;
+ }
..
}

In the above code, if there is a failure due to timeout then it will
return from the first statement (pgfdw_get_cleanup_result); however, if
there is any other failure, then it will issue a warning and return false.
I am not sure if there is a need to return false in the second case,
because even without that it will work fine (and there is a chance that it
won't drop the connection), but maybe your point is that it is better to
be consistent for all kinds of error handling in this path.

> The only thing that would be left would be making the wait
> for ABORT TRANSACTION interruptible, so that if you hit ^C *again*
> after you've already canceled the query, we then give up on waiting
> for the ABORT TRANSACTION. I guess that would be an improvement over
> the status quo, but it's not really a solution. People don't expect
> to have to send multiple cancel requests to kill the same query in a
> timely fashion, and if the commands are being sent by an application
> rather than by a human being it probably won't happen.
>
> Do you want to propose a competing patch to fix this problem? I'm
> happy to listen to suggestions. But I think if you're determined to
> say "let's rely entirely on the OS timeouts rather than having any of
> our own" and you're also determined to say "let's not assume that a
> failure of one operation means that the connection is dead", then
> you're basically arguing that there's nothing really broken here, and
> I don't agree with that.

No, I am not saying any of those. I hope the previous point clarifies what
I want to say.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 5, 2017 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I am not saying to rip out those changes. Most of your mail stresses the
> 30-second timeout which you have added in the patch to detect timeouts
> during failures (rollback processing). I have no objection to that part
> of the patch, except that there is still a race condition around
> PQsendQuery, and you indicated that it is better to deal with it as a
> separate patch, to which I agree.

OK.

> The point on which I am not in total agreement is about failures other
> than the timeout.
>
> +pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
> {
> ..
> + /* Get the result of the query. */
> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
> +     return false;
> +
> + /* Issue a warning if not successful. */
> + if (PQresultStatus(result) != PGRES_COMMAND_OK)
> + {
> +     pgfdw_report_error(WARNING, result, conn, true, query);
> +     return false;
> + }
> ..
> }
>
> In the above code, if there is a failure due to timeout then it will
> return from the first statement (pgfdw_get_cleanup_result); however, if
> there is any other failure, then it will issue a warning and return
> false. I am not sure if there is a need to return false in the second
> case, because even without that it will work fine (and there is a chance
> that it won't drop the connection), but maybe your point is that it is
> better to be consistent for all kinds of error handling in this path.

There are three possible queries that can be executed by
pgfdw_exec_cleanup_query; let's consider them individually.

1. ABORT TRANSACTION. If ABORT TRANSACTION fails, I think that we have no
alternative but to close the connection. If we don't, then the next local
transaction that tries to use this connection will probably also fail,
because it will find the connection inside an aborted transaction rather
than not in a transaction. If we do close the connection, the next
transaction will try to reconnect and everything will be fine.

2. DEALLOCATE ALL. If DEALLOCATE ALL fails, we do not have to close the
connection, but the reason why we're running that statement in the first
place is that we've potentially lost track of which statements are
prepared on the remote side. If we don't drop the connection, we'll still
be confused about that. Reconnecting will fix it.

3. ROLLBACK TO SAVEPOINT %d; RELEASE SAVEPOINT %d. If this fails, the
local toplevel transaction is doomed. It would be wrong to try to commit
on the remote side, because there could be remote changes made by the
aborted subtransaction which must not survive, and the ROLLBACK TO
SAVEPOINT %d command is essential in order for us to get rid of them, and
that has already failed. So we must eventually abort the remote
transaction, which we can do either by closing the connection or by trying
to execute ABORT TRANSACTION. The former, however, is quick, and the
latter is very possibly going to involve a long hang, so closing the
connection seems better.

In all of these cases, the failure of one of these statements seems to me
to *probably* mean that the remote side is just dead, in which case
dropping the connection is the only option. But even if the remote side is
still alive and the statement failed for some other reason -- say somebody
changed kwlist.h so that ABORT now has to be spelled ABURT -- returning
true rather than false just causes us to end up with a remote connection
whose state is out of step with the local server state, and such a
connection is not going to be usable. I think part of your point is that
in case 3, we might still get back to a usable connection if ABORT
TRANSACTION is successfully executed later, but that's not very likely
and, even if it would have happened, reconnecting instead doesn't really
leave us any worse off.

> No, I am not saying any of those. I hope the previous point clarifies
> what I want to say.

Apologies, but I still cannot see quite what you are getting at.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 5, 2017 at 9:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 5, 2017 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I am not saying to rip out those changes. Most of your mail stresses the
>> 30-second timeout which you have added in the patch to detect timeouts
>> during failures (rollback processing). I have no objection to that part
>> of the patch, except that there is still a race condition around
>> PQsendQuery, and you indicated that it is better to deal with it as a
>> separate patch, to which I agree.
>
> OK.
>
>> The point on which I am not in total agreement is about failures other
>> than the timeout.
>>
>> +pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
>> {
>> ..
>> + /* Get the result of the query. */
>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>> +     return false;
>> +
>> + /* Issue a warning if not successful. */
>> + if (PQresultStatus(result) != PGRES_COMMAND_OK)
>> + {
>> +     pgfdw_report_error(WARNING, result, conn, true, query);
>> +     return false;
>> + }
>> ..
>> }
>>
>> In the above code, if there is a failure due to timeout then it will
>> return from the first statement (pgfdw_get_cleanup_result); however, if
>> there is any other failure, then it will issue a warning and return
>> false. I am not sure if there is a need to return false in the second
>> case, because even without that it will work fine (and there is a chance
>> that it won't drop the connection), but maybe your point is that it is
>> better to be consistent for all kinds of error handling in this path.
>
> There are three possible queries that can be executed by
> pgfdw_exec_cleanup_query; let's consider them individually.
>
> 1. ABORT TRANSACTION. If ABORT TRANSACTION fails, I think that we
> have no alternative but to close the connection. If we don't, then
> the next local transaction that tries to use this connection will
> probably also fail, because it will find the connection inside an
> aborted transaction rather than not in a transaction. If we do close
> the connection, the next transaction will try to reconnect and
> everything will be fine.
>
> 2. DEALLOCATE ALL. If DEALLOCATE ALL fails, we do not have to close
> the connection, but the reason why we're running that statement in the
> first place is that we've potentially lost track of which statements
> are prepared on the remote side. If we don't drop the connection,
> we'll still be confused about that. Reconnecting will fix it.

There is a comment in the code which explains why we currently ignore
errors from DEALLOCATE ALL:

    * DEALLOCATE ALL only exists in 8.3 and later, so this
    * constrains how old a server postgres_fdw can
    * communicate with. We intentionally ignore errors in
    * the DEALLOCATE, so that we can hobble along to some
    * extent with older servers (leaking prepared statements
    * as we go; but we don't really support update operations
    * pre-8.3 anyway).

It is not entirely clear to me whether it is only for the Commit case or
in general. From the code, it appears that the comment holds true for both
commit and abort. If it is for both, then after this patch the above
comment will no longer be valid, and you might want to update it in case
we want to proceed with your proposed changes for the DEALLOCATE ALL
statement.

> 3. ROLLBACK TO SAVEPOINT %d; RELEASE SAVEPOINT %d. If this fails, the
> local toplevel transaction is doomed. It would be wrong to try to
> commit on the remote side, because there could be remote changes made
> by the aborted subtransaction which must not survive, and the ROLLBACK
> TO SAVEPOINT %d command is essential in order for us to get rid of
> them, and that has already failed. So we must eventually abort the
> remote transaction, which we can do either by closing the connection
> or by trying to execute ABORT TRANSACTION. The former, however, is
> quick, and the latter is very possibly going to involve a long hang,
> so closing the connection seems better.
>
> In all of these cases, the failure of one of these statements seems to
> me to *probably* mean that the remote side is just dead, in which case
> dropping the connection is the only option. But even if the remote
> side is still alive and the statement failed for some other reason --
> say somebody changed kwlist.h so that ABORT now has to be spelled
> ABURT -- returning true rather than false just causes us to end up
> with a remote connection whose state is out of step with the local
> server state, and such a connection is not going to be usable. I
> think part of your point is that in case 3, we might still get back to
> a usable connection if ABORT TRANSACTION is successfully executed
> later,

Yes.

> but that's not very likely and, even if it would have happened,
> reconnecting instead doesn't really leave us any worse off.

I see your point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 6, 2017 at 4:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 5, 2017 at 9:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, May 5, 2017 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I am not saying to rip out those changes. Most of your mail stresses
>>> the 30-second timeout which you have added in the patch to detect
>>> timeouts during failures (rollback processing). I have no objection to
>>> that part of the patch, except that there is still a race condition
>>> around PQsendQuery, and you indicated that it is better to deal with it
>>> as a separate patch, to which I agree.
>>
>> OK.
>>
>>> The point on which I am not in total agreement is about failures other
>>> than the timeout.
>>>
>>> +pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
>>> {
>>> ..
>>> + /* Get the result of the query. */
>>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>>> +     return false;
>>> +
>>> + /* Issue a warning if not successful. */
>>> + if (PQresultStatus(result) != PGRES_COMMAND_OK)
>>> + {
>>> +     pgfdw_report_error(WARNING, result, conn, true, query);
>>> +     return false;
>>> + }
>>> ..
>>> }
>>>
>>> In the above code, if there is a failure due to timeout then it will
>>> return from the first statement (pgfdw_get_cleanup_result); however, if
>>> there is any other failure, then it will issue a warning and return
>>> false. I am not sure if there is a need to return false in the second
>>> case, because even without that it will work fine (and there is a
>>> chance that it won't drop the connection), but maybe your point is that
>>> it is better to be consistent for all kinds of error handling in this
>>> path.
>>
>> There are three possible queries that can be executed by
>> pgfdw_exec_cleanup_query; let's consider them individually.
>>
>> 1. ABORT TRANSACTION. If ABORT TRANSACTION fails, I think that we
>> have no alternative but to close the connection. If we don't, then
>> the next local transaction that tries to use this connection will
>> probably also fail, because it will find the connection inside an
>> aborted transaction rather than not in a transaction. If we do close
>> the connection, the next transaction will try to reconnect and
>> everything will be fine.
>>
>> 2. DEALLOCATE ALL. If DEALLOCATE ALL fails, we do not have to close
>> the connection, but the reason why we're running that statement in the
>> first place is that we've potentially lost track of which statements
>> are prepared on the remote side. If we don't drop the connection,
>> we'll still be confused about that. Reconnecting will fix it.
>
> There is a comment in the code which explains why we currently ignore
> errors from DEALLOCATE ALL:
>
>     * DEALLOCATE ALL only exists in 8.3 and later, so this
>     * constrains how old a server postgres_fdw can
>     * communicate with. We intentionally ignore errors in
>     * the DEALLOCATE, so that we can hobble along to some
>     * extent with older servers (leaking prepared statements
>     * as we go; but we don't really support update operations
>     * pre-8.3 anyway).
>
> It is not entirely clear to me whether it is only for the Commit case or
> in general. From the code, it appears that the comment holds true for
> both commit and abort. If it is for both, then after this patch the above
> comment will no longer be valid, and you might want to update it in case
> we want to proceed with your proposed changes for the DEALLOCATE ALL
> statement.

Apart from this, I don't see any other problem with the patch.
Additionally, I have run the regression tests with the patch and
everything passed.

As a side note, why do we want the 30-second timeout only for
Rollback-related commands, and not for Commit and DEALLOCATE ALL as well?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 5, 2017 at 7:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> There is a comment in the code which explains why we currently ignore
> errors from DEALLOCATE ALL.
>
>     * DEALLOCATE ALL only exists in 8.3 and later, so this
>     * constrains how old a server postgres_fdw can
>     * communicate with.  We intentionally ignore errors in
>     * the DEALLOCATE, so that we can hobble along to some
>     * extent with older servers (leaking prepared statements
>     * as we go; but we don't really support update operations
>     * pre-8.3 anyway).
>
> It is not entirely clear to me whether it is only for the commit case
> or in general. From the code, it appears that the comment holds true
> for both commit and abort. If it is for both, then after this patch
> the above comment will no longer be valid, and you might want to
> update it in case we want to proceed with your proposed changes for
> the DEALLOCATE ALL statement.

Oh! Good catch. Given that the behavior in question is intentional
there and intended to provide backward compatibility, changing it in
back-branches doesn't seem like a good plan. I'll adjust the patch so
that it continues to ignore errors in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 6, 2017 at 12:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Oh! Good catch. Given that the behavior in question is intentional
> there and intended to provide backward compatibility, changing it in
> back-branches doesn't seem like a good plan. I'll adjust the patch so
> that it continues to ignore errors in that case.

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, May 6, 2017 at 10:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, May 6, 2017 at 12:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Oh! Good catch. Given that the behavior in question is intentional
>> there and intended to provide backward compatibility, changing it in
>> back-branches doesn't seem like a good plan. I'll adjust the patch so
>> that it continues to ignore errors in that case.
>
> Updated patch attached.

Patch looks good to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, May 7, 2017 at 12:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, May 6, 2017 at 10:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, May 6, 2017 at 12:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Oh! Good catch. Given that the behavior in question is intentional
>>> there and intended to provide backward compatibility, changing it in
>>> back-branches doesn't seem like a good plan. I'll adjust the patch
>>> so that it continues to ignore errors in that case.
>>
>> Updated patch attached.
>
> Patch looks good to me.

I'm having second thoughts based on some more experimentation I did
this morning. I'll update again once I've had a bit more time to poke
at it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, May 7, 2017 at 11:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm having second thoughts based on some more experimentation I did
> this morning. I'll update again once I've had a bit more time to poke
> at it.

So the issue that I noticed here is that this problem really isn't
confined to abort processing. If we ROLLBACK TO SAVEPOINT or ABORT
TRANSACTION on an open connection, we really do not know what the
state of that connection is until we get an acknowledgement that the
command completed successfully. The patch handles that. However, the
same is true if we're sending a SAVEPOINT or RELEASE SAVEPOINT
command, and the patch that I posted doesn't do anything about those
cases. I think it would be best to fix all transaction control
commands in a symmetric manner.

Concretely, I think we should replace the abort_cleanup_incomplete
flag from my previous patch with a changing_xact_state flag and set
that flag around all transaction state changes, clearing it when such
changes have succeeded. On error, the flag remains set, so we know
that the state of that connection is unknown and that we must close it
(killing outer transaction levels as needed).

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Concretely, I think we should replace the abort_cleanup_incomplete
> flag from my previous patch with a changing_xact_state flag and set
> that flag around all transaction state changes, clearing it when such
> changes have succeeded. On error, the flag remains set, so we know
> that the state of that connection is unknown and that we must close
> it (killing outer transaction levels as needed).
>
> Thoughts?

Sounds plausible.

			regards, tom lane
Hello,

At Tue, 16 May 2017 12:45:39 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <22556.1494953139@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
>> Concretely, I think we should replace the abort_cleanup_incomplete
>> flag from my previous patch with a changing_xact_state flag and set
>> that flag around all transaction state changes, clearing it when such
>> changes have succeeded. On error, the flag remains set, so we know
>> that the state of that connection is unknown and that we must close
>> it (killing outer transaction levels as needed).
>>
>> Thoughts?
>
> Sounds plausible.

I think that the current issue is the usability of the current
connection. Even if any other command should fail, we can continue
using the connection as long as ABORT TRANSACTION succeeds. Failure of
ABORT immediately means that the connection is no longer available. If
this reasoning holds, changing_xact_state might be too much.

By the way, if an fdw connection stalls in do_sql_command while waiting
for a result, a cancel request doesn't work (in other words,
pgsql_xact_callback is not called), since EINTR is just ignored in
libpq. Maybe we should teach libpq to error out on EINTR for some
additional reasons. PQsetEINTRcallback() or something?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, May 17, 2017 at 7:24 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I think that the current issue is the usability of the current
> connection. Even if any other command should fail, we can continue
> using the connection as long as ABORT TRANSACTION succeeds. Failure
> of ABORT immediately means that the connection is no longer
> available. If this reasoning holds, changing_xact_state might be too
> much.

pgfdw_get_result() raises an ERROR, which will end up in
pgfdw_xact_callback with ABORT. There, if we find the connection in a
bad state, we will slam it down. The problem seems to be that the waits
for the transaction-related commands are not interruptible, either
while sending the command or while receiving its result. Robert's
suggestion is to fix the latter for transaction commands, so it looks
good to me. Can you please elaborate on what's "too much" about
changing_xact_state?

> By the way, if an fdw connection stalls in do_sql_command while
> waiting for a result, a cancel request doesn't work (in other words,
> pgsql_xact_callback is not called), since EINTR is just ignored in
> libpq. Maybe we should teach libpq to error out on EINTR for some
> additional reasons. PQsetEINTRcallback() or something?

Right, do_sql_command() is uninterruptible, and so is PQexec() as used
for sending "ABORT TRANSACTION" and "ROLLBACK TO SAVEPOINT". Robert's
suggestion seems to be that we make them interruptible along with
changing_xact_state, so that we can notice a previously failed
transaction-related command in pgfdw_xact_callback(). So his approach
scores on that front as well.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, May 16, 2017 at 9:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, May 7, 2017 at 11:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm having second thoughts based on some more experimentation I did
>> this morning. I'll update again once I've had a bit more time to
>> poke at it.
>
> So the issue that I noticed here is that this problem really isn't
> confined to abort processing. If we ROLLBACK TO SAVEPOINT or ABORT
> TRANSACTION on an open connection, we really do not know what the
> state of that connection is until we get an acknowledgement that the
> command completed successfully. The patch handles that. However, the
> same is true if we're sending a SAVEPOINT or RELEASE SAVEPOINT
> command, and the patch that I posted doesn't do anything about those
> cases. I think it would be best to fix all transaction control
> commands in a symmetric manner.

+1. Why not similar behavior for any other statements executed in this
module by do_sql_command?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, May 17, 2017 at 6:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> +1. Why not similar behavior for any other statements executed in
> this module by do_sql_command?

The other cases are not quite the same situation. It would be good to
accept interrupts in all cases, but there's no problem with a session
continuing to be used after a failure in configure_remote_session(),
because the connection hasn't been entered in the hash table at that
point yet, and the TRY/CATCH block in connect_pg_server() ensures that
the connection also gets closed. So we don't need to worry about those
statements leaving behind messed-up sessions; they won't; only the
transaction control commands have that part of the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 18, 2017 at 7:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 17, 2017 at 6:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +1. Why not similar behavior for any other statements executed in
>> this module by do_sql_command?
>
> The other cases are not quite the same situation. It would be good to
> accept interrupts in all cases,

Yes, that's one of the main things I was suggesting in proposing the
same behavior, but again, that could be done as a separate patch.

> but there's no problem with a session continuing to be used after a
> failure in configure_remote_session(), because the connection hasn't
> been entered in the hash table at that point yet, and the TRY/CATCH
> block in connect_pg_server() ensures that the connection also gets
> closed. So we don't need to worry about those statements leaving
> behind messed-up sessions; they won't; only the transaction control
> commands have that part of the problem.

Agreed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 16, 2017 at 9:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So the issue that I noticed here is that this problem really isn't
> confined to abort processing. If we ROLLBACK TO SAVEPOINT or ABORT
> TRANSACTION on an open connection, we really do not know what the
> state of that connection is until we get an acknowledgement that the
> command completed successfully. The patch handles that. However, the
> same is true if we're sending a SAVEPOINT or RELEASE SAVEPOINT
> command, and the patch that I posted doesn't do anything about those
> cases. I think it would be best to fix all transaction control
> commands in a symmetric manner.
>
> Concretely, I think we should replace the abort_cleanup_incomplete
> flag from my previous patch with a changing_xact_state flag and set
> that flag around all transaction state changes, clearing it when such
> changes have succeeded. On error, the flag remains set, so we know
> that the state of that connection is unknown and that we must close
> it (killing outer transaction levels as needed).
>
> Thoughts?

I tried following the sketch stated above, and here's what I added in
v2 of the patch. In begin_remote_xact, when a savepoint is sent to the
remote server through do_sql_command, I set the changing_xact_state
flag, and when it returns successfully the flag is unset. Similar
behaviour is followed in pgfdw_subxact_callback for releasing a
savepoint. Additionally, I added this flag set/unset for the start
transaction command in begin_remote_xact. Enlighten me if I've
missed/misunderstood something here. I ran the regression tests for
postgres_fdw and they passed.

I have one concern here: since there is only one flag required for the
purpose, its name, changing_xact_state, doesn't suit at some places,
particularly in pgfdw_subxact_callback. I am not sure whether I should
change the comment or the variable name:

/* Disarm abort_cleanup_incomplete if it all worked. */
+ entry->changing_xact_state = abort_cleanup_failure;

Also, by any chance should we add a test case for this?

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
On Fri, Jun 2, 2017 at 3:48 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
> I tried following the sketch stated above, and here's what I added in
> v2 of the patch. In begin_remote_xact, when a savepoint is sent to
> the remote server through do_sql_command, I set the
> changing_xact_state flag, and when it returns successfully the flag
> is unset. Similar behaviour is followed in pgfdw_subxact_callback for
> releasing a savepoint. Additionally, I added this flag set/unset for
> the start transaction command in begin_remote_xact. Enlighten me if
> I've missed/misunderstood something here.

You didn't catch all of the places where do_sql_command() is used to
change the transaction state, and you didn't update the comments to
match the code changes.

> I ran the regression tests for postgres_fdw and they passed.

Well, that's good as far as it goes, but it doesn't prove much, since
this patch mostly changes the behavior when there are errors or
interrupts involved, and that's not going to happen in the regression
tests.

> I have one concern here: since there is only one flag required for
> the purpose, its name, changing_xact_state, doesn't suit at some
> places, particularly in pgfdw_subxact_callback. I am not sure whether
> I should change the comment or the variable name:
>
> /* Disarm abort_cleanup_incomplete if it all worked. */
> + entry->changing_xact_state = abort_cleanup_failure;

Seems pretty obvious that would need to be updated, at least insofar
as making the comment match the new structure member name. I mean,
there's often some question regarding the degree to which existing
comments should be updated, but it hardly makes sense to add a new
comment that isn't consistent with the rest of the patch, right?

> Also, by any chance should we add a test case for this?

I don't see how to do that. It could possibly be done with the TAP
framework, but that exceeds my abilities.

Here's an updated patch with a bunch of cosmetic fixes, plus I
adjusted things so that do_sql_command() is more interruptible. I
tested it manually and it seems to work OK. I'll commit and back-patch
this version, barring objections or further suggestions for
improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 3, 2017 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't see how to do that. It could possibly be done with the TAP
> framework, but that exceeds my abilities.
>
> Here's an updated patch with a bunch of cosmetic fixes, plus I
> adjusted things so that do_sql_command() is more interruptible. I
> tested it manually and it seems to work OK. I'll commit and
> back-patch this version, barring objections or further suggestions
> for improvement.

No objections from me; the updated patch looks good.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 7, 2017 at 11:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> No objections from me; the updated patch looks good.

OK, done. Further testing nevertheless appreciated, if anyone feels
the urge.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company