Thread: Failures with wal_consistency_checking and 13~

Failures with wal_consistency_checking and 13~

From
Michael Paquier
Date:
Hi all,

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby.  Please note that I have also checked
12, and installcheck passes.

Thanks,
--
Michael

Attachment

Re: Failures with wal_consistency_checking and 13~

From
Ranier Vilela
Date:
Em seg., 15 de jun. de 2020 às 10:14, Michael Paquier <michael@paquier.xyz> escreveu:
Hi all,

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby.  Please note that I have also checked
12, and installcheck passes.
With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
Shutting down, without interrupting the database, consistently, this log has appeared.

2020-06-15 21:40:35.998 -03 [3380] LOG:  database system shutdown was interrupted; last known up at 2020-06-15 21:36:01 -03
2020-06-15 21:40:37.978 -03 [3380] LOG:  database system was not properly shut down; automatic recovery in progress
2020-06-15 21:40:37.992 -03 [3380] LOG:  redo starts at 0/8A809C78
2020-06-15 21:40:37.992 -03 [3380] LOG:  invalid record length at 0/8A809CB0: wanted 24, got 0
2020-06-15 21:40:37.992 -03 [3380] LOG:  redo done at 0/8A809C78
 
regards,
Ranier Vilela

Re: Failures with wal_consistency_checking and 13~

From
Michael Paquier
Date:
On Mon, Jun 15, 2020 at 11:33:42PM -0300, Ranier Vilela wrote:
> With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
> Shutting down, without interrupting the database, consistently, this log
> has appeared.
>
> 2020-06-15 21:40:35.998 -03 [3380] LOG:  database system shutdown was
> interrupted; last known up at 2020-06-15 21:36:01 -03
> 2020-06-15 21:40:37.978 -03 [3380] LOG:  database system was not properly
> shut down; automatic recovery in progress
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  redo starts at 0/8A809C78
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  invalid record length at
> 0/8A809CB0: wanted 24, got 0
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  redo done at 0/8A809C78

Could you please keep discussions on their own thread please?  After
stopping Postgres in a sudden way (immediate mode or just SIGKILL), it
is normal to see crash recovery happen, as well as it is normal to see
an "invalid record length" in the logs when the end of WAL is reached,
meaning the end of recovery.

Thanks.
--
Michael

Attachment

Re: Failures with wal_consistency_checking and 13~

From
Alvaro Herrera
Date:
On 2020-Jun-15, Michael Paquier wrote:

> I have begun my annual study of WAL consistency across replays, and
> wal_consistency_checking = 'all' is pointing out at some issues with
> at least VACUUM and SPGist:
> FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
> blkno 15
> CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
> XID 4619
> 
> It may be possible that there are other failures, I have just run
> installcheck and this is the first failure I saw after replaying all
> the generated WAL on a standby.  Please note that I have also checked
> 12, and installcheck passes.

Umm.  Alexander, do you an idea of what this is about?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Failures with wal_consistency_checking and 13~

From
Alexander Korotkov
Date:
On Fri, Jun 19, 2020 at 10:34 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jun-15, Michael Paquier wrote:
>
> > I have begun my annual study of WAL consistency across replays, and
> > wal_consistency_checking = 'all' is pointing out at some issues with
> > at least VACUUM and SPGist:
> > FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
> > blkno 15
> > CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
> > XID 4619
> >
> > It may be possible that there are other failures, I have just run
> > installcheck and this is the first failure I saw after replaying all
> > the generated WAL on a standby.  Please note that I have also checked
> > 12, and installcheck passes.
>
> Umm.  Alexander, do you an idea of what this is about?

I don't have idea yet, but I'll check this out

------
Regards,
Alexander Korotkov



Re: Failures with wal_consistency_checking and 13~

From
Alexander Korotkov
Date:
On Sat, Jun 20, 2020 at 1:16 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Jun 19, 2020 at 10:34 PM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> >
> > On 2020-Jun-15, Michael Paquier wrote:
> >
> > > I have begun my annual study of WAL consistency across replays, and
> > > wal_consistency_checking = 'all' is pointing out at some issues with
> > > at least VACUUM and SPGist:
> > > FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
> > > blkno 15
> > > CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
> > > XID 4619
> > >
> > > It may be possible that there are other failures, I have just run
> > > installcheck and this is the first failure I saw after replaying all
> > > the generated WAL on a standby.  Please note that I have also checked
> > > 12, and installcheck passes.
> >
> > Umm.  Alexander, do you an idea of what this is about?
>
> I don't have idea yet, but I'll check this out

I have discovered and fixed the issue in a44dd932ff.  spg_mask()
masked unused space only when pagehdr->pd_lower >
SizeOfPageHeaderData.  But during the vacuum regression tests, one
page has been erased completely and pagehdr->pd_lower was set to
SizeOfPageHeaderData.  Actually, 13 didn't introduce any issue, it
just added a test that spotted the issue.  The issue is here since
a507b86900.

------
Regards,
Alexander Korotkov



Re: Failures with wal_consistency_checking and 13~

From
Michael Paquier
Date:
On Sat, Jun 20, 2020 at 05:43:19PM +0300, Alexander Korotkov wrote:
> I have discovered and fixed the issue in a44dd932ff.  spg_mask()
> masked unused space only when pagehdr->pd_lower >
> SizeOfPageHeaderData.  But during the vacuum regression tests, one
> page has been erased completely and pagehdr->pd_lower was set to
> SizeOfPageHeaderData.  Actually, 13 didn't introduce any issue, it
> just added a test that spotted the issue.  The issue is here since
> a507b86900.

Thanks Alexander for looking at this issue!  My runs with
wal_consistency_checking are all clear now.
--
Michael

Attachment