Thread: Replication failed

Replication failed

From

Sreejith P

Date:

17 December 2020, 22:26:33

Hello all,

We have streaming replication with one master and four slave servers on Postgres 10. It had been working successfully for a long time.

Suddenly all replication servers failed, and they are now logging the following message.

2020-12-17 22:24:32 +04 [1587]: [357-1] db=,user=LOG:  invalid contrecord length 2722 at AEA/A1FFF9E0.

Requesting help for identifying root cause.

Thanks
Sreejith
--
*Solutions for Care Anywhere*
*dWise HealthCare IT Solutions Pvt. Ltd.* | www.lifetrenz.com <http://www.lifetrenz.com>
*Disclaimer*: The information and attachments contained in this email are intended for
exclusive use of the addressee(s) and may contain confidential or
privileged information. If you are not the intended recipient, please
notify the sender immediately and destroy all copies of this message and
any attachments. The views expressed in this email are, unless otherwise
stated, those of the author and not those of dWise HealthCare IT Solutions
or its management.

Re: Replication failed

From

Laurenz Albe

Date:

18 December 2020, 11:43:30

On Fri, 2020-12-18 at 00:56 +0530, Sreejith P wrote:
> We have streaming replication with one master and four slave servers on Postgres 10. It had been working successfully for a long time.
> 
> Suddenly all replication servers failed, and they are now logging the following message.
> 
> 2020-12-17 22:24:32 +04 [1587]: [357-1] db=,user=LOG:  invalid contrecord length 2722 at AEA/A1FFF9E0.
> 
> Requesting help for identifying root cause.

Looks like this problem:
https://postgr.es/m/77734732-44A4-4209-8C2F-3AF36C9D4D18%40amazon.com

Was there a crash on the primary server?

There is a patch under development at this thread:
https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

Yours,
Laurenz Albe

Re: Replication failed

From

Sreejith P

Date:

18 December 2020, 13:19:02

Yes.

There was a crash.

The master recovered automatically, but replication failed.

See the log below from the master.

cp: error writing '/BackupVolume/hisDbBackup/WALs/0000000200000AE6000000B9': No space left on device
2020-12-17 21:55:26 +04 [55822]: user=,db=,app=,client= LOG:  archive command failed with exit code 1
2020-12-17 21:55:26 +04 [55822]: user=,db=,app=,client= DETAIL:  The failed archive command was: test ! -f /BackupVolume/dbWALBack/WALs/0000000200000AE6000000B9 && cp pg_wal/0000000200000AE6000000B9 /BackupVolume/hisDbBackup/WALs/0000000200000AE6000000B9

2020-12-17 21:53:56 +04 [84225]: user=AstDBA,db=AST-PROD,app=[unknown],client=172.18.200.100 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-12-17 21:53:56 +04 [51102]: user=AstDBA,db=AST-PROD,app=[unknown],client=172.18.200.100 WARNING:  terminating connection because of crash of another server process
2020-12-17 21:53:56 +04 [51102]: user=AstDBA,db=AST-PROD,app=[unknown],client=172.18.200.100 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2020-12-17 21:55:20 +04 [55808]: user=replicator,db=[unknown],app=[unknown],client=172.18.200.144 FATAL:  the database system is in recovery mode

2020-12-17 21:55:25 +04 [55828]: user=replicator,db=[unknown],app=walreceiver,client=172.18.200.148 ERROR:  requested starting point AEA/A2000000 is ahead of the WAL flush position of this server AEA/A1FFFA50
2020-12-17 21:55:25 +04 [55827]: user=replicator,db=[unknown],app=walreceiver,client=172.18.200.147 ERROR:  requested starting point AEA/A2000000 is ahead of the WAL flush position of this server AEA/A1FFFA50
2020-12-17 21:55:25 +04 [55829]: user=replicator,db=[unknown],app=walreceiver,client=172.18.200.145 ERROR:  requested starting point AEA/A2000000 is ahead of the WAL flush position of this server AEA/A1FFFA50
2020-12-17 21:55:25 +04 [55830]: user=replicator,db=[unknown],app=walreceiver,client=172.18.200.146 ERROR:  requested starting point AEA/A2000000 is ahead of the WAL flush position of this server AEA/A1FFFA50
2020-12-17 21:55:25 +04 [55831]: user=replicator,db=[unknown],app=walreceiver,client=172.18.200.144 ERROR:  requested starting point AEA/A2000000 is ahead of the WAL flush position of this server AEA/A1FFFA50
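
To make the last five errors easier to read, here is a minimal sketch that turns the two LSNs from the log into integers and compares them. The helper name parse_lsn is hypothetical; the only fact relied on is that an LSN is written as two hexadecimal halves (high 32 bits, then low 32 bits) of a 64-bit byte position.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as 'AEA/A2000000' to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the master's log above.
requested = parse_lsn("AEA/A2000000")  # start point the standbys asked for
flushed = parse_lsn("AEA/A1FFFA50")    # WAL flush position of the master

# How far the standbys are ahead of what the master has on disk.
gap = requested - flushed
print(gap)  # 1456
```

So each standby asks to resume streaming 1456 bytes beyond the point where the master's WAL ends after crash recovery, which is why the master rejects the request.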



Re: Replication failed

From

Laurenz Albe

Date:

18 December 2020, 14:55:34

> > On Fri, 2020-12-18 at 00:56 +0530, Sreejith P wrote:
> > > We have streaming replication with one master and four slave servers on Postgres 10. It had been working successfully for a long time.
> > > 
> > > Suddenly all replication servers failed, and they are now logging the following message.
> > > 
> > > 2020-12-17 22:24:32 +04 [1587]: [357-1] db=,user=LOG:  invalid contrecord length 2722 at AEA/A1FFF9E0.
> > > 
> > > Requesting help for identifying root cause.
> >
> > Looks like this problem:
> > https://postgr.es/m/77734732-44A4-4209-8C2F-3AF36C9D4D18%40amazon.com
> >
> > Was there a crash on the primary server?
> >
> > There is a patch under development at this thread:
> > https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
> 
> There was a crash.
> 
> Master recovered automatically .. But replication failed.

Then I am pretty sure that this is the same problem.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: Replication failed

From

Rambabu V

Date:

21 December 2020, 05:16:02

Hi Laurenz,

Please let us know how to apply the patch. We can see the attachments listed below; once they are downloaded, how do we apply them?

Attachment	Content-Type	Size
repro_helper.patch	application/octet-stream	1.3 KB
ready_file_fix.patch	application/octet-stream


Re: Replication failed

From

Laurenz Albe

Date:

21 December 2020, 06:50:02

On Mon, 2020-12-21 at 07:46 +0530, Rambabu V wrote:
> On Fri, Dec 18, 2020 at 2:13 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> 
> > On Fri, 2020-12-18 at 00:56 +0530, Sreejith P wrote:
> > > We have streaming replication with one master and four slave servers on Postgres 10. It had been working successfully for a long time.
> > > 
> > > Suddenly all replication servers failed, and they are now logging the following message.
> > > 
> > > 2020-12-17 22:24:32 +04 [1587]: [357-1] db=,user=LOG:  invalid contrecord length 2722 at AEA/A1FFF9E0.
> > > 
> > > Requesting help for identifying root cause.
> > 
> > Looks like this problem:
> > https://postgr.es/m/77734732-44A4-4209-8C2F-3AF36C9D4D18%40amazon.com
> > 
> > Was there a crash on the primary server?
> > 
> > There is a patch under development at this thread:
> > https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
> > 
>
> please let us know how to apply the patch, we are able to see below attachments , once downloaded how to apply this patch.
> 
> Attachment    Content-Type    Size
> repro_helper.patch    application/octet-stream    1.3 KB
> ready_file_fix.patch    application/octet-stream

This is still under development, so you should not consider using the patch in
a production system.  The bug is only triggered by a rare coincidence (a crash
while a WAL record that spans more than one segment is being written).
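
As a rough illustration of that condition, the sketch below checks how close the reported position AEA/A1FFF9E0 is to the end of its WAL segment. It assumes the default 16 MB segment size and timeline 2 (the timeline seen in the archived file name in the master's log); the function names and the 2000-byte record length are hypothetical and chosen only for the example.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_lsn(text: str) -> int:
    """Convert an LSN such as 'AEA/A1FFF9E0' to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(lsn_text: str, timeline: int) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn_text) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)


def bytes_left_in_segment(lsn_text: str) -> int:
    """Bytes between the given LSN and the end of its segment."""
    return WAL_SEGMENT_SIZE - parse_lsn(lsn_text) % WAL_SEGMENT_SIZE


def crosses_segment(lsn_text: str, record_length: int) -> bool:
    """True if a record of this length starting here cannot fit in the segment."""
    return record_length > bytes_left_in_segment(lsn_text)


print(wal_file_name("AEA/A1FFF9E0", timeline=2))  # 0000000200000AEA000000A1
print(bytes_left_in_segment("AEA/A1FFF9E0"))      # 1568
print(crosses_segment("AEA/A1FFF9E0", 2000))      # True
```

With only 1568 bytes left before the segment boundary at AEA/A2000000, any longer record that starts at this position has to continue into the next segment, which fits the scenario described above.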

But if you want to review the patch
(see https://wiki.postgresql.org/wiki/Reviewing_a_Patch)
to help get it applied soon, that would be great.

You'd have to read the documentation about building PostgreSQL from source.
The patch is applied with the program "patch", typically with

patch -p1 <../ready_file_fix.patch

while you are inside the git repository.

You need some familiarity with building C programs from source.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com