Thread: Errors with physical replication

Errors with physical replication

From
greigwise
Date:
Hello.  

We are on Postgresql version 9.6.6.  We have 2 EC2 instances in different
Amazon regions and we are doing physical replication via VPN.  It all seems
to work just fine most of the time.   I'm noticing in the logs that we have
recurring erros (maybe 10 or 12 times per day) that look like this:

2018-05-17 06:36:14 UTC  5af0599f.210d  LOG:  invalid resource manager ID 49
at 384/42A4AB00
2018-05-17 06:36:14 UTC  5afd22de.7ac4  LOG:  started streaming WAL from
primary at 384/42000000 on timeline 1
2018-05-17 07:20:17 UTC  5afd22de.7ac4  FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

Or some that also look like this:

2018-05-17 07:20:17 UTC  5af0599f.210d  LOG:  record with incorrect
prev-link 49F07120/9F100C95 at 384/45209FC0
2018-05-17 07:20:18 UTC  5afd2d31.1889  LOG:  started streaming WAL from
primary at 384/45000000 on timeline 1
2018-05-17 08:03:28 UTC  5afd2d31.1889  FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

And some like this:

2018-05-17 23:00:13 UTC  5afd63ec.26fc  LOG:  invalid magic number 0000 in
log segment 00000001000003850000003C, offset 10436608
2018-05-17 23:00:14 UTC  5afe097d.49aa  LOG:  started streaming WAL from
primary at 385/3C000000 on timeline 1

Then, like maybe once every couple months or so, we have a crash with logs
looking like this:

2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING: 
terminating connection because of crash of another server process
2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology HINT:  In a moment
you should be able to reconnect to the database and repeat your command.
2018-05-17 08:03:28 UTC  5af0599f.210a  LOG:  database system is shut down

When this last error occurs, the recovery is to go on the replica and remove
all the WAL logs from the pg_xlog director and then restart Postgresql. 
Everything seems to recover and come up fine.  I've done some tests
comparing counts between the replica and the primary and everything seems
synced just fine from all I can tell.  

So, a couple of questions.  1) Should I be worried that my replica is
corrupt in some way or given that everything *seems* ok, is it reasonable to
believe that things are working correctly in spite of these errors being
reported.  2)  Is there something I should configure differently to avoid
some of these errors?

Thanks in advance for any help.

Greig Wise




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html


Re: Errors with physical replication

From
Kyotaro HORIGUCHI
Date:
Hello.

At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigwise@comcast.net> wrote in
<1526905137308-0.post@n3.nabble.com>
> Hello.  
> 
> We are on Postgresql version 9.6.6.  We have 2 EC2 instances in different
> Amazon regions and we are doing physical replication via VPN.  It all seems
> to work just fine most of the time.   I'm noticing in the logs that we have
> recurring erros (maybe 10 or 12 times per day) that look like this:

<following is digested>

> 2018-05-17 06:36:14 UTC  5af0599f.210d  LOG:  invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC  5afd22de.7ac4  LOG:  started streaming WAL from
> 2018-05-17 07:20:17 UTC  5afd22de.7ac4  FATAL:  could not receive data from
> WAL stream: server closed the connection unexpectedly

> Or some that also look like this:
> 
> 2018-05-17 07:20:17 UTC  5af0599f.210d  LOG:  record with incorrect prev-link
> 2018-05-17 07:20:18 UTC  5afd2d31.1889  LOG:  started streaming WAL from
> 2018-05-17 08:03:28 UTC  5afd2d31.1889  FATAL:  could not receive data from
> WAL stream: server closed the connection unexpectedly

> And some like this:
> 
> 2018-05-17 23:00:13 UTC  5afd63ec.26fc  LOG:  invalid magic number 0000 in
> log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC  5afe097d.49aa  LOG:  started streaming WAL from
> primary at 385/3C000000 on timeline 1

You recplication connection seems quite unstable and disconnected
frequently. After disconnection, you will see several kinds of "I
find a broken record in my WAL file" and they are cues for
standby to switch to streaming. Itself is a normal operation as
PostgreSQL with one known exception.

> Then, like maybe once every couple months or so, we have a crash with logs
> looking like this:


> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING: 
> terminating connection because of crash of another server process

I think the lines follow an error message like "FATAL: invalid
memory alloc request size 3075129344". This is a kind of "broken
record" but it is known to lead standby to crash. It is
disucussed here.

> [bug fix] Cascaded standby cannot start after a clean shutdown


https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05#0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

> When this last error occurs, the recovery is to go on the replica and remove
> all the WAL logs from the pg_xlog director and then restart Postgresql. 
> Everything seems to recover and come up fine.  I've done some tests
> comparing counts between the replica and the primary and everything seems
> synced just fine from all I can tell.  

It is right recovery steps, as far as looking the attached log
messages.

> So, a couple of questions.  1) Should I be worried that my replica is
> corrupt in some way or given that everything *seems* ok, is it reasonable to
> believe that things are working correctly in spite of these errors being
> reported.  2)  Is there something I should configure differently to avoid
> some of these errors?

It doesn't seem worth warrying from the viewpoint of data
integrity, but if walsender/walreceiver timeouts fire too
frequently, you might need to increase them for increased
stability.

> Thanks in advance for any help.
> 
> Greig Wise

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center