Thread: BDR to ignore table exists error

BDR to ignore table exists error

From
Nikhil
Date:
Hello,


I have a BDR setup with two nodes. If I bring one node down i am seeing that
the replication slot is becoming inactive with below error.

<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:  streaming transactions committing after 0/111A91
48, reading WAL from 0/110F03F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:  logical decoding found consistent point at 0/110F03
F8
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:  Logical decoding will begin using saved snapshot
.
<10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:  unexpected EOF on standby connection
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration: 0.437 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration: 0.462 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration: 0.096 ms
<127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration: 0.101 ms
<3462016-05-25 23:58:20 GMT%LOG:  starting background worker process "bdr (6288505144157102317,1,16384,)->bdr (628851211361
7339435,2,"
<798462016-05-25 23:58:20 GMT%ERROR:  relation "af_npx_device_l3_16_149_10" already exists


Thanks & Regards,
Nikhil

Re: BDR to ignore table exists error

From
Martín Marqués
Date:
El 27/05/16 a las 06:33, Nikhil escribió:
> Hello,
>
>
> I have a BDR setup with two nodes. If I bring one node down i am seeing that
> the replication slot is becoming inactive with below error.

If you take down one of the nodes of a BDR mesh, the replication slots
from each of the upstream nodes it connects to will switch to inactive.
That's how replication slots work.

> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
>  streaming transactions committing after 0/111A91
> 48, reading WAL from 0/110F03F8
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
>  logical decoding found consistent point at 0/110F03
> F8
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
>  Logical decoding will begin using saved snapshot
> .
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
>  unexpected EOF on standby connection

Downstream node got disconnected, which is sensible given that you took
that node down.

> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.437 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.462 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.096 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.101 ms
> <3462016-05-25 23:58:20 GMT%LOG:  starting background worker process "bdr
> (6288505144157102317,1,16384,)->bdr (628851211361
> 7339435,2,"

It seems you brought up postgres on the downstream node again and it
connected to the replication slot.

> <798462016-05-25 23:58:20 GMT%ERROR:  relation "af_npx_device_l3_16_149_10"
> already exists

I'm not sure what happened here. Does that relation exist?

Run \d+ af_npx_device_l3_16_149_10 with psql on both nodes.

Also, did replication resume? Check with the lag query from the BDR
documentation.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR to ignore table exists error

From
Nikhil
Date:

Once the node which was down is brought back the replication slot is not turned active. The reason being replication slot is trying to create a partition table which already exists. Because of this error replication slot is stuck in inactive mode. Is there any way to ignore this error?

On 28-May-2016 4:56 PM, "Martín Marqués" <martin@2ndquadrant.com> wrote:
El 27/05/16 a las 06:33, Nikhil escribió:
> Hello,
>
>
> I have a BDR setup with two nodes. If I bring one node down i am seeing that
> the replication slot is becoming inactive with below error.

If you take down one of the nodes of a BDR mesh, the replication slots
from each of the upstream nodes it connects to will switch to inactive.
That's how replication slots work.

> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
>  streaming transactions committing after 0/111A91
> 48, reading WAL from 0/110F03F8
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
>  logical decoding found consistent point at 0/110F03
> F8
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%DETAIL:
>  Logical decoding will begin using saved snapshot
> .
> <10.106.43.152(43253)nsxpostgres798452016-05-25 23:58:19 GMTnsxdb%LOG:
>  unexpected EOF on standby connection

Downstream node got disconnected, which is sensible given that you took
that node down.

> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.437 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.462 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.096 ms
> <127.0.0.1(31185)nsxroot792492016-05-25 23:58:19 GMTnsxdb%LOG:  duration:
> 0.101 ms
> <3462016-05-25 23:58:20 GMT%LOG:  starting background worker process "bdr
> (6288505144157102317,1,16384,)->bdr (628851211361
> 7339435,2,"

It seems you brought up postgres on the downstream node again and it
connected to the replication slot.

> <798462016-05-25 23:58:20 GMT%ERROR:  relation "af_npx_device_l3_16_149_10"
> already exists

I'm not sure what happened here. Does that relation exist?

Run \d+ af_npx_device_l3_16_149_10 with psql on both nodes.

Also, did replication resume? Check with the lag query from the BDR
documentation.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BDR to ignore table exists error

From
Martín Marqués
Date:
El 28/05/16 a las 08:57, Nikhil escribió:
> Once the node which was down is brought back the replication slot is not
> turned active. The reason being replication slot is trying to create a
> partition table which already exists. Because of this error replication
> slot is stuck in inactive mode. Is there any way to ignore this error?

Could you provide the DDL that's run for the patitioning?

Could you provide logs from the other node? (the one where the partition
was created)

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR to ignore table exists error

From
Martín Marqués
Date:
El 28/05/16 a las 08:57, Nikhil escribió:
> Once the node which was down is brought back the replication slot is not
> turned active. The reason being replication slot is trying to create a
> partition table which already exists. Because of this error replication
> slot is stuck in inactive mode. Is there any way to ignore this error?

BTW, how did you end up in such a state? Did you create the partition
table skipping ddl locking?

At this point the easiest way out is to drop the table on the node where
it's trying to get applied with bdr_replication off or
skip_ddl_replication on, so the table is dropped locally but not
replicated, and the create table from the slot can be consumed.

The other option is to consume the create table statement from the slot
directly.

Be aware of the dangers of changing the default values for such
parameters, (bdr_replication, skip_ddl_replication, skip_ddl_locking)
and when needed they should be used with special care.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR to ignore table exists error

From
Nikhil
Date:
Please see my replies inline.

On Sat, May 28, 2016 at 8:08 PM, Martín Marqués <martin@2ndquadrant.com> wrote:
El 28/05/16 a las 08:57, Nikhil escribió:
> Once the node which was down is brought back the replication slot is not
> turned active. The reason being replication slot is trying to create a
> partition table which already exists. Because of this error replication
> slot is stuck in inactive mode. Is there any way to ignore this error?

BTW, how did you end up in such a state? Did you create the partition
table skipping ddl locking?
​Nik>> skip_ddl_locking is set to True in my configuration. As this was preventing single
​node from doing DDL operation (if one is down majority is not there for doing DDL on available node)
 

At this point the easiest way out is to drop the table on the node where
it's trying to get applied with bdr_replication off or
skip_ddl_replication on, so the table is dropped locally but not
replicated, and the create table from the slot can be consumed.

The other option is to consume the create table statement from the slot
directly.
​Nik>> DDL used is
ERROR:  relation "af_npx_l3_16_146_10" already exists
<596802016-05-29 08:53:07 GMT%CONTEXT:  during DDL replay of ddl statement: CREATE  TABLE  public.af_npx_license_l3_16_146_
10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK (((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS (public.af_npx_l3) WITH (oids=OFF)
<554132016-05-29 08:53:07 GMT%LOG:  worker process: bdr (6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
680) exited with exit code 1
 

Be aware of the dangers of changing the default values for such
parameters, (bdr_replication, skip_ddl_replication, skip_ddl_locking)
and when needed they should be used with special care.
​Nik>>. The DDL replay is started once the node join back to bdr group. I think its started from an old check point causing partition already exists error. Is there any way to ignore replay error ? or ignore DDL errors while replay ?​

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BDR to ignore table exists error

From
Martín Marqués
Date:
Hi,

El 29/05/16 a las 06:01, Nikhil escribió:
>
>     *​Nik>> skip_ddl_locking is set to True in my configuration. As this
>     was preventing single*
>
> *​node from doing DDL operation (if one is down majority is not there
> for doing DDL on available node)*​

Well, you have to be prepared to deal with burn wounds if you play with
fire. ;)

If you decide to have skip_ddl_locking on you have to be sure all DDLs
happen on one node, else you end up with conflicts like this.

I suggest you find out why the table was already created on the
downstream node (as a forensics task so you can avoid bumping into the
same issue).

>     ​Nik>> DDL used is
>
> ​
> ERROR:  relation "af_npx_l3_16_146_10" already exists
> <596802016-05-29 08:53:07 GMT%CONTEXT:  during DDL replay of ddl
> statement: CREATE  TABLE  public.af_npx_license_l3_16_146_
> 10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK
> (((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
> 0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS
> (public.af_npx_l3) WITH (oids=OFF)
> <554132016-05-29 08:53:07 GMT%LOG:  worker process: bdr
> (6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
> 680) exited with exit code 1

On the node where the CREATE TABLE is trying to get applied run this:

BEGIN;
SET LOCAL bdr.skip_ddl_replication TO 'on';
SET LOCAL bdr.skip_ddl_locking TO 'on';
DROP TABLE af_npx_l3_16_146_10;
END;

After that, the DDL that's stuck will get applied and the stream of
changes will continue.

By the looks of what you're dealing with, I wouldn't be surprised if the
replication gets stuck again on another DDL conflict.

I suggest rethinking the locking strategy, because this shows that
there's something fishy there.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR to ignore table exists error

From
Nikhil
Date:
Thanks a lot Martin for your replies.

On Sun, May 29, 2016 at 11:50 PM, Martín Marqués <martin@2ndquadrant.com> wrote:
Hi,

El 29/05/16 a las 06:01, Nikhil escribió:
>
>     *​Nik>> skip_ddl_locking is set to True in my configuration. As this
>     was preventing single*
>
> *​node from doing DDL operation (if one is down majority is not there
> for doing DDL on available node)*​

Well, you have to be prepared to deal with burn wounds if you play with
fire. ;)

If you decide to have skip_ddl_locking on you have to be sure all DDLs
happen on one node, else you end up with conflicts like this.

I suggest you find out why the table was already created on the
downstream node (as a forensics task so you can avoid bumping into the
same issue).

>     ​Nik>> DDL used is
>
> ​
> ERROR:  relation "af_npx_l3_16_146_10" already exists
> <596802016-05-29 08:53:07 GMT%CONTEXT:  during DDL replay of ddl
> statement: CREATE  TABLE  public.af_npx_license_l3_16_146_
> 10 (CONSTRAINT af_npx_license_l3_16_146_10_rpt_sample_time_check CHECK
> (((rpt_sample_time OPERATOR(pg_catalog.>=) 146417040
> 0) AND (rpt_sample_time OPERATOR(pg_catalog.<=) 1464173999))) ) INHERITS
> (public.af_npx_l3) WITH (oids=OFF)
> <554132016-05-29 08:53:07 GMT%LOG:  worker process: bdr
> (6288512113617339435,2,16384,)->bdr (6288505144157102317,1, (PID 59
> 680) exited with exit code 1

On the node where the CREATE TABLE is trying to get applied run this:

BEGIN;
SET LOCAL bdr.skip_ddl_replication TO 'on';
SET LOCAL bdr.skip_ddl_locking TO 'on';
DROP TABLE af_npx_l3_16_146_10;
END;

After that, the DDL that's stuck will get applied and the stream of
changes will continue.

By the looks of what you're dealing with, I wouldn't be surprised if the
replication gets stuck again on another DDL conflict.

I suggest rethinking the locking strategy, because this shows that
there's something fishy there.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services