Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429 - Mailing list pgsql-general

From Stephen Frost
Subject Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429
Date
Msg-id CAOuzzgo4kaP0KsiG9vXNupUycX8q9C1_zJYCQGhmWeSJgUNbCA@mail.gmail.com
Whole thread Raw
In response to Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429  (Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn>)
List pgsql-general
Greetings,

On Thu, Nov 30, 2023 at 02:51 Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
On Wed, Nov 29, 2023 at 5:53 PM Stephen Frost <sfrost@snowman.net> wrote:
* Abdul Qoyyuum (aqoyyuum@cardaccess.com.bn) wrote:
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that we can have the application
> to continue to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.

This is only going to help if the issue is in an index, which isn't
clear from what's been shared.

That is a good point. Problem is I can't really find out as the logs isn't that verbose to tell me more.

Look in pg_class for a relation with a relfileno matching the filename or for a relation with an oid matching it.

it happened (almost exactly
> a year ago) and I think this needs a more permanent solution. I've looked
> at routine vacuuming and checked the autovacuum is set to on and the
> following configurations:

This isn't something that should ever happen ...

This also doesn't have anything to do with autovacuum, changing settings
there won't make any difference.

Noted but it has been a clean running since a year ago that I ran the vacuum and reindex commands. 

Great to hear but isn’t particularly relevant. Sure, a reindex could “fix” this, but it shouldn’t happen in the first place…

> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block and we are running a High Availability
> cluster set up but we are hoping that there may be a way to mitigate it.

Was there some kind of hardware fault?  Did you do a failover?  Restore
from a backup?  Do you have checksums enabled?  How many times has this
happened before, and how many pages were impacted?  What is the design
of your HA solution, are you using PG replication or something else? 

There have been a few maintenance operations earlier this year but nothing too critical or anything failed that would have caused the database to go corrupt. The HA solution we're using is the pacemaker with the active passive setup.

This really isn’t enough information to assess if what you’re doing in your configuration is leading to corruption, or not. Note that a lot of corruption can happen without invalid pages showing up- you should seriously look into amcheck and using it to see if your indexes are valid.

What exactly is your “pacemaker with the active passive setup”?  What’s the storage under that?  What other maintenance operations have been done..?

Unsure if sharing the relevant WAL settings from postgresql.conf may be useful but here they are:

max_connections = 300
shared_buffers = 128MB

archive_mode = on
archive_command = 'test ! -f /opt/databases/postgres12/wal_archive/%f && cp %p /opt/databases/postgres12/wal_archive/%f'

cp is absolutely not valid as an archive command- it doesn’t ensure that the WAL has been fully sync’d to disk and therefore could lead you to a situation where you lose necessary WAL. I strongly encourage you to look into a proper backup solution.

Thanks,

Stephen

pgsql-general by date:

Previous
From: Loles
Date:
Subject: Re: Two started cluster on the same data directory and port
Next
From: Adrian Klaver
Date:
Subject: Re: Two started cluster on the same data directory and port