Thread: [BDR] Best practice to automatically abort a DDL operation when one node is down

[BDR] Best practice to automatically abort a DDL operation when one node is down

From
Sylvain MARECHAL
Date:
Hello all,

I am using BDR with two nodes 1 and 2.
If I issue a DDL operation in node 1 when node 2 is down, for example:
   CREATE TABLE test (i int PRIMARY KEY); (1)

all other transactions fail with the following error:
   Database is locked against DDL operations

The problem is that the (1) DDL request will wait indefinitely, meaning
all transactions will continue to fail until the DDL operation is
manually aborted (for example, doing CTRL C in psql to abort the "CREATE
TABLE").

What is the best practice to make sure the DDL operation will fail,
possibly after a timeout, if one of the node is down? I could check the
state of the node before issuing the DDL operation, but this solution is
far from being perfect as the node may fail right after this.

Thanks and Regards,
--
Sylvain


Re: [BDR] Best practice to automatically abort a DDL operation when one node is down

From
Sylvain MARECHAL
Date:
> I am using BDR with two nodes 1 and 2.
> If I issue a DDL operation in node 1 when node 2 is down, for example:
>   CREATE TABLE test (i int PRIMARY KEY); (1)
>
> all other transactions fail with the following error:
>   Database is locked against DDL operations
>
> The problem is that the (1) DDL request will wait indefinitely,
> meaning all transactions will continue to fail until the DDL operation
> is manually aborted (for example, doing CTRL C in psql to abort the
> "CREATE TABLE").
>
> What is the best practice to make sure the DDL operation will fail,
> possibly after a timeout, if one of the node is down? I could check
> the state of the node before issuing the DDL operation, but this
> solution is far from being perfect as the node may fail right after this.
>

Answering to myself, I guess no magic SQL command exists for this, I
have to cancel the request with pg_cancel_backend() (in fact, that what
the does says, I was guessing if something could detect this
automatically and abort the request).

If using a blocking API, this means one should fork the task and monitor
it to decide whether it should be canceled or not if it takes to much
time (check if one of the node is down, then cancel the request and
retry it later when the node will be up again).

--
Sylvain


On 13 January 2016 at 21:45, Sylvain MARECHAL <marechal.sylvain2@gmail.com> wrote:
 
The problem is that the (1) DDL request will wait indefinitely, meaning all transactions will continue to fail until the DDL operation is manually aborted (for example, doing CTRL C in psql to abort the "CREATE TABLE").

Correct, and by design.

I'd like to do a pre-check where we sync up with the peer nodes and see if they're all alive before we take the DDL lock. This would reduce the impact a bit and allow an early ERROR like "ERROR: cannot perform DDL when one or more nodes is unreachable".

However... we have something pretty close already. You can just set a statement_timeout in the session doing the DDL. It'll cancel the operation if it takes too long.

Note that a lock_timeout will NOT work because the BDR global DDL lock is not recognised as a true lock by PostgreSQL.

 
What is the best practice to make sure the DDL operation will fail, possibly after a timeout, if one of the node is down?

statement_timeout
 
I could check the state of the node before issuing the DDL operation, but this solution is far from being perfect as the node may fail right after this.

Correct, but it's still useful to do.

I'd check to see all nodes are connected in pg_stat_replication then I'd issue the DDL with a statement_timeout set.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [BDR] Best practice to automatically abort a DDL operation when one node is down

From
Sylvain MARECHAL
Date:


 
What is the best practice to make sure the DDL operation will fail, possibly after a timeout, if one of the node is down?

statement_timeout

Ok. Thank-you for pointing this. I have just tried it, and this work great even for nodes that are not properly power off (plug removed).
 
I could check the state of the node before issuing the DDL operation, but this solution is far from being perfect as the node may fail right after this.

Correct, but it's still useful to do.

I'd check to see all nodes are connected in pg_stat_replication then I'd issue the DDL with a statement_timeout set.

Ok. For the first check, I was using bdr.bdr_test_remote_connectback(peer_dsn, local_dsn), getting the dsn from the bdr.bdr_nodes table; but using the pg_stat_replication table is problably quicker and simpler.

Thank-you again,

Sylvain