Thread: detailed error message of pg_waldump

detailed error message of pg_waldump

From
Kyotaro Horiguchi
Date:
In a very common operation of accidentally specifying a recycled
segment, pg_waldump often returns the following obscure message.

$ pg_waldump 00000001000000000000002D
pg_waldump: fatal: could not find a valid record after 0/2D000000

The more detailed message is generated internally and we can use it.
That looks like the following.

$ pg_waldump 00000001000000000000002D
pg_waldump: fatal: unexpected pageaddr 0/24000000 in log segment 00000001000000000000002D, offset 0

Is it work doing?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..5a98c523c6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -932,17 +932,28 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
  *
  * This positions the reader, like XLogBeginRead(), so that the next call to
  * XLogReadRecord() will read the next valid record.
+ *
+ * If the page_read callback fails to read the requested data,
+ * InvalidXLogRecPtr is returned.  The callback is expected to have reported
+ * the error; errormsg is set to NULL.
+ *
+ * If the reading fails for some other reason, InvalidXLogRecPtr is also
+ * returned, and *errormsg is set to a string with details of the failure.
+ *
+ * The returned pointer (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogFindNextRecord or XLogReadRecord.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
     XLogRecPtr    tmpRecPtr;
     XLogRecPtr    found = InvalidXLogRecPtr;
     XLogPageHeader header;
-    char       *errormsg;
 
     Assert(!XLogRecPtrIsInvalid(RecPtr));
 
+    *errormsg = NULL;
+
     /*
      * skip over potential continuation data, keeping in mind that it may span
      * multiple pages
@@ -1021,7 +1032,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
      * or we just jumped over the remaining data of a continuation.
      */
     XLogBeginRead(state, tmpRecPtr);
-    while (XLogReadRecord(state, &errormsg) != NULL)
+    while (XLogReadRecord(state, errormsg) != NULL)
     {
         /* past the record we've found, break out */
         if (RecPtr <= state->ReadRecPtr)
@@ -1036,6 +1047,9 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 err:
     XLogReaderInvalReadState(state);
 
+    if (state->errormsg_buf[0] != '\0')
+        *errormsg = state->errormsg_buf;
+
     return InvalidXLogRecPtr;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f8b8afe4a7..d3dbccf22c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1051,11 +1051,17 @@ main(int argc, char **argv)
         fatal_error("out of memory");
 
     /* first find a valid recptr to start from */
-    first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+    first_record = XLogFindNextRecord(xlogreader_state, private.startptr,
+                                      &errormsg);
 
     if (first_record == InvalidXLogRecPtr)
-        fatal_error("could not find a valid record after %X/%X",
-                    LSN_FORMAT_ARGS(private.startptr));
+    {
+        if (errormsg)
+            fatal_error("%s", errormsg);
+        else
+            fatal_error("could not find a valid record after %X/%X",
+                        LSN_FORMAT_ARGS(private.startptr));
+    }
 
     /*
      * Display a message that we're skipping data if `from` wasn't a pointer
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..0f6d920fed 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -267,7 +267,8 @@ extern void XLogReaderFree(XLogReaderState *state);
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+                                     char **errormsg);
 #endif                            /* FRONTEND */
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */

Re: detailed error message of pg_waldump

From
Masahiko Sawada
Date:
On Fri, Jun 4, 2021 at 5:35 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> In a very common operation of accidentally specifying a recycled
> segment, pg_waldump often returns the following obscure message.
>
> $ pg_waldump 00000001000000000000002D
> pg_waldump: fatal: could not find a valid record after 0/2D000000
>
> The more detailed message is generated internally and we can use it.
> That looks like the following.
>
> $ pg_waldump 00000001000000000000002D
> pg_waldump: fatal: unexpected pageaddr 0/24000000 in log segment 00000001000000000000002D, offset 0
>
> Is it work doing?

Perhaps we need both? The current message describes where the error
happened and the message internally generated describes the details.
It seems to me that both are useful. For example, if we find an error
during XLogReadRecord(), we show both as follows:

   if (errormsg)
       fatal_error("error in WAL record at %X/%X: %s",
                   LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
                   errormsg);

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: detailed error message of pg_waldump

From
Kyotaro Horiguchi
Date:
Thanks!

At Wed, 16 Jun 2021 16:52:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> On Fri, Jun 4, 2021 at 5:35 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > In a very common operation of accidentally specifying a recycled
> > segment, pg_waldump often returns the following obscure message.
> >
> > $ pg_waldump 00000001000000000000002D
> > pg_waldump: fatal: could not find a valid record after 0/2D000000
> >
> > The more detailed message is generated internally and we can use it.
> > That looks like the following.
> >
> > $ pg_waldump 00000001000000000000002D
> > pg_waldump: fatal: unexpected pageaddr 0/24000000 in log segment 00000001000000000000002D, offset 0
> >
> > Is it work doing?
> 
> Perhaps we need both? The current message describes where the error
> happened and the message internally generated describes the details.
> It seems to me that both are useful. For example, if we find an error
> during XLogReadRecord(), we show both as follows:
> 
>    if (errormsg)
>        fatal_error("error in WAL record at %X/%X: %s",
>                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
>                    errormsg);

Yeah, I thought that it might be a bit vervous and lengty but actually
we have another place where doing that. One more point is whether we
have a case where first_record is invalid but errormsg is NULL
there. WALDumpReadPage immediately exits so we should always have a
message in that case according to the comment in ReadRecord.

> * We only end up here without a message when XLogPageRead()
> * failed - in that case we already logged something. In
> * StandbyMode that only happens if we have been triggered, so we
> * shouldn't loop anymore in that case.

So that can be an assertion.

Now the messages looks like this.

$ pg_waldump /home/horiguti/data/data_work/pg_wal/000000020000000000000010 
pg_waldump: fatal: could not find a valid record after 0/0: unexpected pageaddr 0/9000000 in log segment
000000020000000000000010,offset 0
 

reagards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..5a98c523c6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -932,17 +932,28 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
  *
  * This positions the reader, like XLogBeginRead(), so that the next call to
  * XLogReadRecord() will read the next valid record.
+ *
+ * If the page_read callback fails to read the requested data,
+ * InvalidXLogRecPtr is returned.  The callback is expected to have reported
+ * the error; errormsg is set to NULL.
+ *
+ * If the reading fails for some other reason, InvalidXLogRecPtr is also
+ * returned, and *errormsg is set to a string with details of the failure.
+ *
+ * The returned pointer (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogFindNextRecord or XLogReadRecord.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
     XLogRecPtr    tmpRecPtr;
     XLogRecPtr    found = InvalidXLogRecPtr;
     XLogPageHeader header;
-    char       *errormsg;
 
     Assert(!XLogRecPtrIsInvalid(RecPtr));
 
+    *errormsg = NULL;
+
     /*
      * skip over potential continuation data, keeping in mind that it may span
      * multiple pages
@@ -1021,7 +1032,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
      * or we just jumped over the remaining data of a continuation.
      */
     XLogBeginRead(state, tmpRecPtr);
-    while (XLogReadRecord(state, &errormsg) != NULL)
+    while (XLogReadRecord(state, errormsg) != NULL)
     {
         /* past the record we've found, break out */
         if (RecPtr <= state->ReadRecPtr)
@@ -1036,6 +1047,9 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 err:
     XLogReaderInvalReadState(state);
 
+    if (state->errormsg_buf[0] != '\0')
+        *errormsg = state->errormsg_buf;
+
     return InvalidXLogRecPtr;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f8b8afe4a7..0dae1ebea7 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1051,11 +1051,17 @@ main(int argc, char **argv)
         fatal_error("out of memory");
 
     /* first find a valid recptr to start from */
-    first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+    first_record = XLogFindNextRecord(xlogreader_state, private.startptr,
+                                      &errormsg);
 
     if (first_record == InvalidXLogRecPtr)
-        fatal_error("could not find a valid record after %X/%X",
-                    LSN_FORMAT_ARGS(private.startptr));
+    {
+        /* XLogFindNextRecord must return errormsg on error */
+        Assert (errormsg);
+        fatal_error("could not find a valid record after %X/%X: %s",
+                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
+                    errormsg);
+    }
 
     /*
      * Display a message that we're skipping data if `from` wasn't a pointer
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..0f6d920fed 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -267,7 +267,8 @@ extern void XLogReaderFree(XLogReaderState *state);
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+                                     char **errormsg);
 #endif                            /* FRONTEND */
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */

Re: detailed error message of pg_waldump

From
Masahiko Sawada
Date:
On Wed, Jun 16, 2021 at 5:36 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Thanks!
>
> At Wed, 16 Jun 2021 16:52:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> > On Fri, Jun 4, 2021 at 5:35 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > In a very common operation of accidentally specifying a recycled
> > > segment, pg_waldump often returns the following obscure message.
> > >
> > > $ pg_waldump 00000001000000000000002D
> > > pg_waldump: fatal: could not find a valid record after 0/2D000000
> > >
> > > The more detailed message is generated internally and we can use it.
> > > That looks like the following.
> > >
> > > $ pg_waldump 00000001000000000000002D
> > > pg_waldump: fatal: unexpected pageaddr 0/24000000 in log segment 00000001000000000000002D, offset 0
> > >
> > > Is it work doing?
> >
> > Perhaps we need both? The current message describes where the error
> > happened and the message internally generated describes the details.
> > It seems to me that both are useful. For example, if we find an error
> > during XLogReadRecord(), we show both as follows:
> >
> >    if (errormsg)
> >        fatal_error("error in WAL record at %X/%X: %s",
> >                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
> >                    errormsg);
>
> Yeah, I thought that it might be a bit vervous and lengty but actually
> we have another place where doing that. One more point is whether we
> have a case where first_record is invalid but errormsg is NULL
> there. WALDumpReadPage immediately exits so we should always have a
> message in that case according to the comment in ReadRecord.
>
> > * We only end up here without a message when XLogPageRead()
> > * failed - in that case we already logged something. In
> > * StandbyMode that only happens if we have been triggered, so we
> > * shouldn't loop anymore in that case.
>
> So that can be an assertion.
>
> Now the messages looks like this.
>
> $ pg_waldump /home/horiguti/data/data_work/pg_wal/000000020000000000000010
> pg_waldump: fatal: could not find a valid record after 0/0: unexpected pageaddr 0/9000000 in log segment
000000020000000000000010,offset 0
 
>

Thank you for updating the patch!

+ *
+ * The returned pointer (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogFindNextRecord or XLogReadRecord.
  */

The comment of XLogReadRecord() also has a similar description. Should
we update it as well?

BTW is this patch registered to the current commitfest? I could not find it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/