Thread: BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG

BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18025
Logged by:          Kishor Hargude
Email address:      hargudekishor@gmail.com
PostgreSQL version: 14.8
Operating system:   rhel
Description:

Hello,

I have just witnessed a data loss scenario.

The scenario is this: checkpoint operations had been failing on the DB
server for the last 8 hours, which means no successful checkpoint had
completed on the server in that time. Then the DB server crashed because
its disk space was exhausted, and it did not come back up through crash
recovery. The victim had moved a few WAL files out of pg_wal to another
location, and putting those WAL files back in their original location did
not solve the problem either. The DB server was not able to find a valid
checkpoint record. The victim had no backup that could have been used,
together with the available archived WALs, to recover the data. So the
victim had only one option left: pg_resetwal.  We have tried every
possible solution, but nothing worked, so we were left with no choice
other than pg_resetwal.

Now the victim has to opt for pg_resetwal, but the cost of this option is
8 hours of data loss, since the last successful checkpoint happened 8
hours earlier. There is no other option available.

The scenario above can happen to many PostgreSQL customers, because
customers often do not bother to monitor the DB logs on a daily or hourly
basis for errors/FATALs/PANICs (from events like checkpoint failures or
archiving failures). Because of neglected log monitoring or an improper
backup and recovery strategy, any customer can end up in the scenario
described above, and that means data loss.

In my opinion, to limit the data loss or damage to the data in the above
scenario, we should think about changing the behaviour on checkpoint
failure. Right now we just log an error when a checkpoint fails and keep
accepting write transactions, which keep generating WAL. Since the
failing checkpoints cannot recycle or remove unneeded WAL segments from
pg_wal, the WAL keeps accumulating, eventually the disk fills up and the
server crashes, and then the scenario above can happen.
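
(For what it is worth, a periodic check along these lines would expose
both the stalled checkpoints and the growing pg_wal long before the disk
fills up. This is only a sketch: how it is wired into monitoring and what
thresholds to alert on are up to the operator, and pg_ls_waldir() is by
default restricted to superusers and members of pg_monitor.)

    -- Age of the last completed checkpoint, from the control file.
    SELECT checkpoint_time,
           now() - checkpoint_time AS checkpoint_age
    FROM pg_control_checkpoint();

    -- Number and total size of WAL segments currently in pg_wal.
    SELECT count(*)                  AS wal_segments,
           pg_size_pretty(sum(size)) AS wal_size
    FROM pg_ls_waldir();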

I am sharing my thoughts on how to limit the data loss in this situation:

I have learned that a failure of the checkpoint operation can lead to
data loss, potentially of several hours or even days.

Customers will not accept their mistakes, but will blame PostgreSQL.

I think we can improve the behaviour of PostgreSQL when checkpoints keep
failing. As soon as Postgres detects a checkpoint failure that cannot be
resolved, it should stop accepting write transactions and allow only
read-only transactions. This would give the customer an opportunity to
export the databases from the problematic cluster and build a new
cluster, and it would definitely limit the data loss for any customer who
runs into the scenario described above. It would also be an immediate
alarm for the customer to fix the issue, if it is fixable, or otherwise
to go through the recovery steps. Having downtime is always better than
losing data.
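
(Until something like that exists in core, the closest approximation I
can think of is an external monitoring job that flips the cluster towards
read-only once checkpoints have been failing for too long. This is only a
rough sketch: default_transaction_read_only is just a session default
that any session can override, so it discourages writes rather than
truly blocking them.)

    -- Issued by an external watchdog once it decides checkpoints are stuck.
    ALTER SYSTEM SET default_transaction_read_only = on;
    SELECT pg_reload_conf();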

Thank You.


On Mon, 2023-07-17 at 05:03 +0000, PG Bug reporting form wrote:
> I have just witnessed a data loss scenario.
>
> The scenario is this: checkpoint operations had been failing on the DB
> server for the last 8 hours, which means no successful checkpoint had
> completed on the server in that time. Then the DB server crashed because
> its disk space was exhausted, and it did not come back up through crash
> recovery.

Mistake #1: you did not monitor disk space.

> The victim had moved a few WAL files out of pg_wal to another location,
> and putting those WAL files back in their original location did not
> solve the problem either.

Mistake #2: manually messing with the database directory.

> The DB server was not able to find a valid checkpoint record.
> The victim had no backup that could have been used, together with the
> available archived WALs, to recover the data.

Mistake #0: no backup.

> So the victim had only one option left: pg_resetwal.  We have tried
> every possible solution, but nothing worked, so we were left with no
> choice other than pg_resetwal.

Mistake #3: running pg_resetwal.

"We have tried every possible solution" sounds a bit like "we tried all the
haphazard things that came to our mind".

Sorry, this is not a bug, this is a pilot error.

If PostgreSQL crashes because "pg_wal" runs out of disk space, you increase
the disk space, start PostgreSQL and let it complete crash recovery.  It is
as simple as that.

Yours,
Laurenz Albe



On Mon, Jul 17, 2023 at 09:53:32AM +0200, Laurenz Albe wrote:
> On Mon, 2023-07-17 at 05:03 +0000, PG Bug reporting form wrote:
>> The scenario is this: checkpoint operations had been failing on the DB
>> server for the last 8 hours, which means no successful checkpoint had
>> completed on the server in that time. Then the DB server crashed because
>> its disk space was exhausted, and it did not come back up through crash
>> recovery.
>
> Mistake #1: you did not monitor disk space.

max_wal_size is a very critical piece to adjust.  It is usually
recommended to put pg_wal/ on its own partition, so that the space
allocated for WAL records is predictable across checkpoints.  This is
not a perfect science, as max_wal_size is a soft limit, so one usually
needs an extra margin on a WAL partition.  There have been some patches
floating around to make that a hard limit as well, but I don't think
we've ever agreed on semantics that would be acceptable when the
authorized upper limit is reached.
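
For example (the values below are illustrative only, not recommendations;
suitable numbers depend on the workload and on how much headroom the WAL
partition has):

    -- max_wal_size and min_wal_size can be changed with a reload,
    -- no restart needed.
    ALTER SYSTEM SET max_wal_size = '4GB';
    ALTER SYSTEM SET min_wal_size = '1GB';
    SELECT pg_reload_conf();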
--
Michael
