Hi all,
Knowing that it's a data corruption issue, the only way to fix this is to vacuum and reindex the database. What was suggested was the following:
SET zero_damaged_pages = 0; # This is so that we can have the application to continue to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.
We're on Postgresql 12. This has worked before it happened (almost exactly a year ago) and I think this needs a more permanent solution. I've looked at routine vacuuming and checked the autovacuum is set to on and the following configurations:
core=> select name, setting from pg_settings where name like 'autovacuum%';
name | setting
-------------------------------------+-----------
autovacuum | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 60
autovacuum_vacuum_cost_delay | 2
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.2
autovacuum_vacuum_threshold | 50
autovacuum_work_mem | -1
(12 rows)
Can anyone advise if there's anything else we can do? We have no clue what causes the invalid page block and we are running a High Availability cluster set up but we are hoping that there may be a way to mitigate it.