Thread: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Jānis Pūris

Date:

25 May 2019, 16:49:27

Hello,

I'm working with a 9.4-bdr cluster and want to move away from BDR tech all together. So my idea was to follow instructions on http://bdr-project.org/docs/stable/ to first strip the node from BDR making it into "regular" node and then moving the data from this node to official 9.4 instance.

Environment:

BDR_Node_1

BDR_Node_2

Regular_Node_1

Steps done on BDR_Node_2

1. Part node from BDR cluster http://bdr-project.org/docs/stable/node-management-removing.html

2. Remove BDR from node http://bdr-project.org/docs/stable/node-management-disabling.html

3. Create SR slot select pg_create_physical_replication_slot('new_cluster_node');

After this. Steps done on Regular_Node_1, which is 9.4.22

1. Init the new node with pg_basebackup with source being BDR_Node_1

2. Update recovery.conf and postgres.conf

3. Start it with pg_ctl

All seems to be working well, but for this:

select * from information_schema.sequences;

produces "[XX000] ERROR: cache lookup failed for index 899807" for all of the DBs that were configured with BDR with various OID.

I can not reproduce this error, when running query on any of the DBs on 9.4-bdr node.

Does anyone have any idea what may be causing this ?

Due to this I'm also unable to run pg_dump on new_cluster_node - it also results in the same error.

That said. For more context - If it worked, next steps would have been to

1. Deactivate virtual IP on BDR_Node_2

2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"

3. Promote Regular_Node_1

4. Activate virtual IP on BDR_Node_2

Build the new 9.4 cluster from there on.

Any ideas / comments will be much appreciated!

Thank you in advance.

Best regards, Janis Puris

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Ron

Date:

25 May 2019, 17:15:51

On 5/25/19 11:49 AM, Jānis Pūris wrote:

body{font-family:Helvetica,Arial;font-size:13px}Hello,

I'm working with a 9.4-bdr cluster and want to move away from BDR tech all together. So my idea was to follow instructions on http://bdr-project.org/docs/stable/ to first strip the node from BDR making it into "regular" node and then moving the data from this node to official 9.4 instance.

Environment:

BDR_Node_1
BDR_Node_2

Regular_Node_1

Steps done on BDR_Node_2
1. Part node from BDR cluster http://bdr-project.org/docs/stable/node-management-removing.html
2. Remove BDR from node http://bdr-project.org/docs/stable/node-management-disabling.html
3. Create SR slot select pg_create_physical_replication_slot('new_cluster_node');

After this. Steps done on Regular_Node_1, which is 9.4.22
1. Init the new node with pg_basebackup with source being BDR_Node_1
2. Update recovery.conf and postgres.conf
3. Start it with pg_ctl

All seems to be working well, but for this:

select * from information_schema.sequences;

produces "[XX000] ERROR: cache lookup failed for index 899807" for all of the DBs that were configured with BDR with various OID.

I can not reproduce this error, when running query on any of the DBs on 9.4-bdr node.

Does anyone have any idea what may be causing this ?
Due to this I'm also unable to run pg_dump on new_cluster_node - it also results in the same error.

That said. For more context - If it worked, next steps would have been to
1. Deactivate virtual IP on BDR_Node_2
2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"
3. Promote Regular_Node_1
4. Activate virtual IP on BDR_Node_2

Build the new 9.4 cluster from there on.

Any ideas / comments will be much appreciated!

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

--
Angular momentum makes the world go 'round.

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Adrian Klaver

Date:

25 May 2019, 17:27:38

On 5/25/19 9:49 AM, Jānis Pūris wrote:
> Hello,
> 
> I'm working with a 9.4-bdr cluster and want to move away from BDR tech 
> all together. So my idea was to follow instructions on 
> http://bdr-project.org/docs/stable/ to first strip the node from BDR 
> making it into "regular" node and then moving the data from this node to 
> official 9.4 instance.
> 
> Environment:
> 
> BDR_Node_1
> BDR_Node_2
> 
> Regular_Node_1
> 
> Steps done on BDR_Node_2
> 1. Part node from BDR cluster 
> http://bdr-project.org/docs/stable/node-management-removing.html
> 2. Remove BDR from node 
> http://bdr-project.org/docs/stable/node-management-disabling.html
> 3. Create SR slot select 
> pg_create_physical_replication_slot('new_cluster_node');
> 
> After this. Steps done on Regular_Node_1, which is 9.4.22
> 1. Init the new node with pg_basebackup with source being BDR_Node_1

I am not clear about above:

1) You removed BDR_Node_2 from cluster

2) You took pg_basebackup from BDR_Node_1 to create a new regular cluster.

3) 1) & 2) seem to be at odds with each other.

> 2. Update recovery.conf and postgres.conf
> 3. Start it with pg_ctl
> 
> All seems to be working well, but for this:
> 
> select * from information_schema.sequences;
> 
> produces "[XX000] ERROR: cache lookup failed for index 899807" for all 
> of the DBs that were configured with BDR with various OID.
> 
> I can not reproduce this error, when running query on any of the DBs on 
> 9.4-bdr node.
> 
> Does anyone have any idea what may be causing this ?
> Due to this I'm also unable to run pg_dump on new_cluster_node - it also 
> results in the same error.
> 
> That said. For more context - If it worked, next steps would have been to
> 1. Deactivate virtual IP on BDR_Node_2
> 2. Check for no replication lag on BDR_Node_2 slot "new_cluster_node"
> 3. Promote Regular_Node_1
> 4. Activate virtual IP on BDR_Node_2
> 
> Build the new 9.4 cluster from there on.
> 
> Any ideas / comments will be much appreciated!
> 
> Thank you in advance.
> Best regards, Janis Puris
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Jānis Pūris

Date:

25 May 2019, 18:42:26

Hi, Adrian

Apologies, it is a typo.

pg_basebackup was taken from BDR_Node_2, not BDR_Node_1

Thank you in advance.

Best regards, Janis Puris

On 25 May 2019 at 19:27:42, Adrian Klaver (adrian.klaver@aklaver.com) wrote:

I am not clear about above:

1) You removed BDR_Node_2 from cluster

2) You took pg_basebackup from BDR_Node_1 to create a new regular cluster.

3) 1) & 2) seem to be at odds with each other.

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Jānis Pūris

Date:

25 May 2019, 18:46:02

Hi Ron,

I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as stated before. Typo)

I've been successful in transferring the data with pg_dump on BDR_Node_2 and then restoring it on Regular_Node_1. Then running "select * from information_schema.sequences;" all is OK.

The problem with this approach is that I'm required to have minimal downtime in this transition and we have a lot of data to transfer, which would be lengthy process.

Thank you in advance.

Best regards, Janis Puris

On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com) wrote:

1. Are you sure that you removed all BDR from the node?
2. Is the corruption there in BDR_Node_1?
3. Can you rebuild the indexes?

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Adrian Klaver

Date:

25 May 2019, 21:26:21

On 5/25/19 11:46 AM, Jānis Pūris wrote:
> Hi Ron,
> 
> I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as 
> stated before. Typo)
> 
> I've been successful in transferring the data with pg_dump on BDR_Node_2 
> and then restoring it on Regular_Node_1. Then running "select * from 
> information_schema.sequences;" all is OK.

So the issue is that a binary backup/restore via pg_basebackup fails but 
a logical backup/restore via pg_dump/pg_restore works, correct?

You might want to take a look here:

https://github.com/2ndQuadrant/bdr/issues/140

It might make sense to you, it does not to me. Looks to me something is 
being done on the binary level that makes this difficult. I guessing you 
are going to have to talk to the BDR folks.

> 
> The problem with this approach is that I'm required to have minimal 
> downtime in this transition and we have a lot of data to transfer, which 
> would be lengthy process.
> 
> Thank you in advance.
> Best regards, Janis Puris
> 
> On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com 
> <mailto:ronljohnsonjr@gmail.com>) wrote:
> 
>> 1. Are you sure that you removed all BDR from the node?
>> 2. Is the corruption there in BDR_Node_1?
>> 3. Can you rebuild the indexes?


-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: pg_basebackup from 9.4-bdr to 9.4 results in corrupt index

From

Jānis Pūris

Date:

25 May 2019, 22:31:11

Managed to find following in the announcement of BDR on 9.6:

"BDR has always been an extension, but on 9.4 it required a heavily patched PostgreSQL, one that isn’t fully on-disk-format compatible with stock community PostgreSQL 9.4. The goal all along has been to allow it to run as an extension on an unmodified PostgreSQL … and now we’re there."

Source: https://www.2ndquadrant.com/en/blog/bdr-is-coming-to-postgresql-9-6/

I suppose this answers why logical dump / restore works, but pg_basebackup fails - the stock 9.4 and BDR 9.4 are not compatible on disk level.

On 25 May 2019 at 23:26:26, Adrian Klaver (adrian.klaver@aklaver.com) wrote:

On 5/25/19 11:46 AM, Jānis Pūris wrote:
> Hi Ron,
>
> I can not reproduce this error on BDR_Node_2 (it is not BDR_Node_1 as
> stated before. Typo)
>
> I've been successful in transferring the data with pg_dump on BDR_Node_2
> and then restoring it on Regular_Node_1. Then running "select * from
> information_schema.sequences;" all is OK.

So the issue is that a binary backup/restore via pg_basebackup fails but
a logical backup/restore via pg_dump/pg_restore works, correct?

You might want to take a look here:

https://github.com/2ndQuadrant/bdr/issues/140

It might make sense to you, it does not to me. Looks to me something is
being done on the binary level that makes this difficult. I guessing you
are going to have to talk to the BDR folks.

>
> The problem with this approach is that I'm required to have minimal
> downtime in this transition and we have a lot of data to transfer, which
> would be lengthy process.
>
> Thank you in advance.
> Best regards, Janis Puris
>
> On 25 May 2019 at 19:16:11, Ron (ronljohnsonjr@gmail.com
> <mailto:ronljohnsonjr@gmail.com>) wrote:
>
>> 1. Are you sure that you removed all BDR from the node?
>> 2. Is the corruption there in BDR_Node_1?
>> 3. Can you rebuild the indexes?

--
Adrian Klaver
adrian.klaver@aklaver.com