Thread: PostgreSQL 9.5: Streaming Replication: Secondaries Fail to Start After WAL Errors

Dear Community,

I am trying to understand why all of the secondary databases failed to start
after a WAL-related error had been appearing for some time.

Timeline:

- 2024-04-19: WAL errors appear on the secondary database nodes

```
LOG: invalid resource manager ID 55 at 40/F46CBCA8
```

- the secondaries did not lag in replication
  - monitored via the query below (a wall-clock lag sketch follows the timeline)

```
SELECT pg_last_xact_replay_timestamp();
```

- 2024-05-02: secondaries reboot and fail to start up

```
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000004100000049 has already been removed
FATAL:  the database system is starting up
```
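
(Aside on the monitoring above: the wall-clock replay delay can be estimated with something like the following; this is a sketch, and only meaningful while the replica is replaying records.)

```
-- run on a secondary; returns NULL until something has been replayed
SELECT now() - pg_last_xact_replay_timestamp() AS estimated_replay_delay;
```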

From my understanding, WAL is streamed over the network (the secondary pulls from the primary) and written to a WAL file on the secondary; a separate process then replays the copied WAL.

For the local WAL to have gone out of sync, either:

1. the primary removed the WAL file the secondary was streaming, or
2. the WAL file on the secondary got corrupted, or
3. ...

Questions:

- What do those error messages mean?
- How can I prevent this from happening?


Any advice/information is highly appreciated.
Thank you,
Mohan
Hi Mohan,

On Tue, May 28, 2024 at 02:26:41PM -0400, Mohan NBSPS wrote:
> Dear Community,
> [...]
> ```
> FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> segment 000000010000004100000049 has already been removed
>  FATAL:  the database system is starting up
> ```
>
> From my understanding, WAL is streamed over the network (the secondary pulls
> from the primary) and written to a WAL file on the secondary; a separate
> process then replays the copied WAL.
>
> For the local WAL to have gone out of sync, either:
>
> 1. the primary removed the WAL file the secondary was streaming, or
> 2. the WAL file on the secondary got corrupted, or
> 3. ...
>
> Questions:
>
> - What do those error messages mean?
> - How can I prevent this from happening?

It means that, unless you have archived the required WAL segments somewhere
and can recover them from there, your replica is now broken, and you will
have to re-create it from scratch.

You can prevent this by correctly configuring streaming replication: either
use replication slots (those exist since 9.4, so 9.5 does have them - still,
you should prioritize upgrading to a supported release while you are working
this problem!), or introduce a WAL archive[0] for replicas to retrieve WAL
from after the primary has evicted it from its kept segments.
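
For illustration only (the slot name is a placeholder, not anything from your setup), the 9.5-era moving parts would look roughly like this:

```
-- On the primary (physical replication slots exist since 9.4):
SELECT pg_create_physical_replication_slot('replica_1');

-- On each replica, recovery.conf would then reference the slot
-- (shown as comments here, since it is not SQL):
--   standby_mode      = 'on'
--   primary_conninfo  = 'host=... user=...'
--   primary_slot_name = 'replica_1'
```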

Hth!

[0]:
https://www.postgresql.org/docs/9.5/runtime-config-wal.html#RUNTIME-CONFIG-WAL-ARCHIVING

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/



On Tue, May 28, 2024 at 2:47 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
> [...]
> It means that, unless you have archived the required WAL segments somewhere
> and can recover them from there, your replica is now broken, and you will
> have to re-create it from scratch.
>
> You can prevent this by correctly configuring streaming replication: either
> use replication slots [...], or introduce a WAL archive[0] for replicas to
> retrieve WAL from after the primary has evicted it from its kept segments.


Thank you, Johannes, for the advice.

We are looking at moving to 16.
We did not implement slots in order to avoid disk-space issues on the primary (a network disconnect could fill up the primary's `pg_xlog`).

We have changed the WAL settings to retain more WAL files, along the lines of the sketch below.
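
Concretely, the change amounts to something like this in postgresql.conf on the primary (the number is just an example, not what we settled on):

```
# 9.5-era setting; each segment is 16 MB, so 512 segments keep ~8 GB of WAL
wal_keep_segments = 512
```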

Since we have not seen this issue before (we have been running PostgreSQL for over 10 years), what kind of scenario would trigger it?

- we do not see any network latency or outages


Thank you again.
 

On Tue, May 28, 2024 at 03:00:23PM -0400, Mohan NBSPS wrote:
> [...]
> Thank you Johannes for the advice.
>
> We are looking at moving to 16.
> We did not implement slots to avoid disk space issues on primary (possible
> network disconnect may fill up primary `pg_xlog`).

Yes, replication slots can interrupt your primary. Relying on
wal_keep_segments alone can kill your replicas. Having a WAL archive can be
the best of both worlds, but also needs careful monitoring and management.


> We have changed the WAL settings to retain more WAL files.
>
> Since we have not seen this issue before, (have been running postgresql for
> over 10 years), what kind
> of scenario would trigger this ?

Every time you interrupt the replication stream (such as when a replica
reboots, or its postgres master process is stopped), you enter a race between
the primary recycling old WAL segments and the replication stream picking up
again once the replica is back up. So if, during your replica's downtime,
enough WAL was produced to exceed wal_keep_segments, the lineage is broken,
and the replica can never catch up again.
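
To put rough numbers on that race (a sketch with 9.5-era function names; the LSN literal is purely illustrative):

```
-- On the primary: bytes of WAL written since some earlier location, e.g.
-- the replica's last-known receive location. If this exceeds
-- wal_keep_segments * 16 MB, the replica can no longer stream the gap.
SELECT pg_xlog_location_diff(pg_current_xlog_location(), '41/49000000');
```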

Also, the "invalid resource manager" log line you reported *might* hint at
data corruption in your WAL segments. I think that data checksums and WAL
compression could both make detection of such conditions more reliable.
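
For the record, both are opt-in, and data checksums can only be chosen when the cluster is initialized, so that one is a rebuild-time decision:

```
# WAL compression is a postgresql.conf setting (new in 9.5):
wal_compression = on

# Data checksums must be selected at initdb time, e.g.:
#   initdb --data-checksums -D /path/to/new/cluster
```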

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/



On Tue, May 28, 2024 at 3:11 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
> [...]
> So if, during your replica's downtime, enough WAL was produced to exceed
> wal_keep_segments, the lineage is broken, and the replica can never catch
> up again.
>
> Also, the "invalid resource manager" log line you reported *might* hint at
> data corruption in your WAL segments.

So there was enough disruption, or enough of a time lag, for the current WAL file to go out of sync. The reason I say this is that we have done a lot of OS patches and reboots over all these years, disrupting the primary/secondary servers; yet the robustness of Postgres seems to reveal itself in recovering and working without issue.

Thank you again.
I will take those recommendations into consideration for implementation.

I might update this thread and probably ask more questions.



On Tue, May 28, 2024 at 3:11 PM Johannes Truschnigg <johannes@truschnigg.info> wrote:
> [...]
> Yes, replication slots can interrupt your primary.

Please define "interrupt".  Using a replication slot, I thought files would just accumulate in pg_wal while the replica is down (or the network is slow, or the replica can't keep up with the primary).

Disaster, of course, when that disk fills up, but that's always been the case.

On Tue, May 28, 2024 at 05:24:56PM -0400, Ron Johnson wrote:
> On Tue, May 28, 2024 at 3:11 PM Johannes Truschnigg <
> >[...]
> > Yes, replication slots can interrupt your primary.
> >
>
> Please define "interrupt".  Using a replication slot, I thought files would
> just accumulate in pg_wal while the replica is down (or the network is
> slow, or the replica can't keep up with the primary).
>
> Disaster, of course, when that disk fills up, but that's always been the
> case.

And that is exactly the scenario I meant when I said "interrupt". If you use
replication slots and your monitoring/alerting isn't set up correctly, and
you're accumulating a lot of WAL, chances are ENOSPC on the primary is just
around the corner for you.
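
The kind of check I mean looks roughly like this on 9.5 (what threshold to alert on is up to you):

```
-- On the primary: how much WAL each slot is currently holding back
SELECT slot_name, active,
       pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
           AS retained_bytes
  FROM pg_replication_slots;
```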

That's why I generally prefer a WAL archive on a separate file system for
replicas to source segments from, because filling that up won't break the
primary (unless the archive_command misbehaves). That also needs proper
monitoring/alerting, of course (and a contingency plan for what to do when/if
the archive fills up) - but anyone whose workload is important enough for a
replication setup to make sense should have that anyway, in my book.
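
As a sketch of that setup (paths are placeholders, and a production archive_command needs to be more careful than a bare cp):

```
# postgresql.conf on the primary:
archive_mode    = on
archive_command = 'test ! -f /walarchive/%f && cp %p /walarchive/%f'

# recovery.conf on each replica:
restore_command = 'cp /walarchive/%f %p'
```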

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/

Update

Regarding the failing secondaries with the error

```
LOG:  invalid resource manager ID
```

it was identified that the server running the secondaries was down for a considerable amount of time (2-3 hours), and the errors started around that date.

Our guess is that the primaries (several of them) kept writing WAL, each according to how busy it was, and the secondaries fell out of sync with the WAL stream.

Thank you for the advice.


On Wed, May 29, 2024 at 1:19 AM Johannes Truschnigg <johannes@truschnigg.info> wrote:
> [...]