Thread: Two noncritical bugs of pg_waldump

Two noncritical bugs of pg_waldump

From
Kyotaro Horiguchi
Date:
Hello.

pg_waldump complains at the end in any case.  I noticed that the LSN
it shows in the finish message is incorrect.  (I faintly thought that
I posted about this but I didn't find it..)

> pg_waldump: fatal: error in WAL record at 0/15073F8: invalid record length at 0/1507470: wanted 24, got 0

xlogreader found the error at the record begins at 1507470, but
pg_waldump tells that error happens at 15073F8, which is actually the
beginning of the last sound record.


If I give an empty file to the tool it complains as the follows.

> pg_waldump: fatal: could not read file "hoge": No such file or directory

No, the file exists.  The cause is it reads uninitialized errno to
detect errors from the system call.  read(2) is defined to set errno
always when it returns -1 and doesn't otherwise. Thus it seems to me
that it is better to check that the return value is less than zero
than to clear errno before the call to read().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 773f14f01c4a5cfd334d0a80778aa819c55c1cb7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 27 Jan 2022 09:47:10 +0900
Subject: [PATCH] Fix errornious messages of pg_waldump

Fix pg_waldump to give the LSN where read error happens instead of the
LSN of the last record successfully read. And fix to give the correct
error message for zero-sized blocks instead of a bogus system-call
error message.
---
 src/bin/pg_waldump/pg_waldump.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..58dc4044b5 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -222,15 +222,12 @@ search_directory(const char *directory, const char *fname)
                                      WalSegSz),
                             fname, WalSegSz);
         }
+        else if (r < 0)
+            fatal_error("could not read file \"%s\": %m",
+                        fname);
         else
-        {
-            if (errno != 0)
-                fatal_error("could not read file \"%s\": %m",
-                            fname);
-            else
-                fatal_error("could not read file \"%s\": read %d of %d",
-                            fname, r, XLOG_BLCKSZ);
-        }
+            fatal_error("could not read file \"%s\": read %d of %d",
+                        fname, r, XLOG_BLCKSZ);
         close(fd);
         return true;
     }
@@ -1177,7 +1174,7 @@ main(int argc, char **argv)
 
     if (errormsg)
         fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
+                    LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
                     errormsg);
 
     XLogReaderFree(xlogreader_state);
-- 
2.27.0


Re: Two noncritical bugs of pg_waldump

From
Nathan Bossart
Date:
On Thu, Jan 27, 2022 at 10:07:38AM +0900, Kyotaro Horiguchi wrote:
> pg_waldump complains at the end in any case.  I noticed that the LSN
> it shows in the finish message is incorrect.  (I faintly thought that
> I posted about this but I didn't find it..)

Is this thread [0] what you are remembering?

[0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/



Re: Two noncritical bugs of pg_waldump

From
Kyotaro Horiguchi
Date:
At Wed, 26 Jan 2022 17:25:14 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Thu, Jan 27, 2022 at 10:07:38AM +0900, Kyotaro Horiguchi wrote:
> > pg_waldump complains at the end in any case.  I noticed that the LSN
> > it shows in the finish message is incorrect.  (I faintly thought that
> > I posted about this but I didn't find it..)
> 
> Is this thread [0] what you are remembering?
> 
> [0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com

Maybe exactly.  I rememberd the discussion.

So the issue there is neither EndRecPtr and ReadRecPtr always points
to the current read LSN. The first proposal from Nathen was to use
currRecPtr but it was a private member.  But after discussion, it
seems to me it is (at least now) exactly what we need here so if we
ofccially exposed the member the problem would be blown away.  Do you
want to conitnue that?

Otherwise, I think we need to add a comment there about the known
issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Two noncritical bugs of pg_waldump

From
Kyotaro Horiguchi
Date:
(this is off-topic)

At Thu, 27 Jan 2022 13:23:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 26 Jan 2022 17:25:14 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> > On Thu, Jan 27, 2022 at 10:07:38AM +0900, Kyotaro Horiguchi wrote:
> > > pg_waldump complains at the end in any case.  I noticed that the LSN
> > > it shows in the finish message is incorrect.  (I faintly thought that
> > > I posted about this but I didn't find it..)
> > 
> > Is this thread [0] what you are remembering?
> > 
> > [0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com
> 
> Maybe exactly.  I rememberd the discussion.
> 
> So the issue there is neither EndRecPtr and ReadRecPtr always points
> to the current read LSN. The first proposal from Nathen was to use

Mmm. Sorry for my fat finger, Nathan.

> currRecPtr but it was a private member.  But after discussion, it
> seems to me it is (at least now) exactly what we need here so if we
> ofccially exposed the member the problem would be blown away.  Do you
> want to conitnue that?
> 
> Otherwise, I think we need to add a comment there about the known
> issue.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Two noncritical bugs of pg_waldump

From
Nathan Bossart
Date:
On Thu, Jan 27, 2022 at 01:23:06PM +0900, Kyotaro Horiguchi wrote:
> So the issue there is neither EndRecPtr and ReadRecPtr always points
> to the current read LSN. The first proposal from Nathen was to use
> currRecPtr but it was a private member.  But after discussion, it
> seems to me it is (at least now) exactly what we need here so if we
> ofccially exposed the member the problem would be blown away.  Do you
> want to conitnue that?

Presumably there is a good reason for keeping it private, but if not, maybe
that is a reasonable approach to consider.  I don't think we wanted to
proceed with the approaches discussed on the old thread, anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/



Re: Two noncritical bugs of pg_waldump

From
Nathan Bossart
Date:
On Thu, Jan 27, 2022 at 01:27:36PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 27 Jan 2022 13:23:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
>> So the issue there is neither EndRecPtr and ReadRecPtr always points
>> to the current read LSN. The first proposal from Nathen was to use
> 
> Mmm. Sorry for my fat finger, Nathan.

It's okay.  :)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/



Re: Two noncritical bugs of pg_waldump

From
Kyotaro Horiguchi
Date:
Hmm..

At Thu, 27 Jan 2022 10:07:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> pg_waldump complains at the end in any case.  I noticed that the LSN
> it shows in the finish message is incorrect.  (I faintly thought that
> I posted about this but I didn't find it..)
> 
> > pg_waldump: fatal: error in WAL record at 0/15073F8: invalid record length at 0/1507470: wanted 24, got 0
> 
> xlogreader found the error at the record begins at 1507470, but
> pg_waldump tells that error happens at 15073F8, which is actually the
> beginning of the last sound record.

It is arguable, but the following is indisputable.

> If I give an empty file to the tool it complains as the follows.
> 
> > pg_waldump: fatal: could not read file "hoge": No such file or directory
> 
> No, the file exists.  The cause is it reads uninitialized errno to
> detect errors from the system call.  read(2) is defined to set errno
> always when it returns -1 and doesn't otherwise. Thus it seems to me
> that it is better to check that the return value is less than zero
> than to clear errno before the call to read().

So I post a patch contains only the indisputable part.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f43cf938b0686dc406a1c81863a6415de1190b40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 14 Feb 2022 18:11:19 +0900
Subject: [PATCH v1] Fix incorrect error handling of pg_waldump

search_directory of pg_waldump.c puts a wrong assumption about the
return value from read(2) and results in a bogus error message.  It
should check (r < 0) instead of (r != XLOG_BLCKSZ) to correctly detect
errors that %m work correctly on.
---
 src/bin/pg_waldump/pg_waldump.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..2340dc247b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -222,15 +222,12 @@ search_directory(const char *directory, const char *fname)
                                      WalSegSz),
                             fname, WalSegSz);
         }
+        else if (r < 0)
+            fatal_error("could not read file \"%s\": %m",
+                        fname);
         else
-        {
-            if (errno != 0)
-                fatal_error("could not read file \"%s\": %m",
-                            fname);
-            else
-                fatal_error("could not read file \"%s\": read %d of %d",
-                            fname, r, XLOG_BLCKSZ);
-        }
+            fatal_error("could not read file \"%s\": read %d of %d",
+                        fname, r, XLOG_BLCKSZ);
         close(fd);
         return true;
     }
-- 
2.27.0


Re: Two noncritical bugs of pg_waldump

From
Andres Freund
Date:
Hi,

On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote:
> > If I give an empty file to the tool it complains as the follows.
> > 
> > > pg_waldump: fatal: could not read file "hoge": No such file or directory
> > 
> > No, the file exists.  The cause is it reads uninitialized errno to
> > detect errors from the system call.  read(2) is defined to set errno
> > always when it returns -1 and doesn't otherwise. Thus it seems to me
> > that it is better to check that the return value is less than zero
> > than to clear errno before the call to read().
> 
> So I post a patch contains only the indisputable part.

Thanks for the report and fix. Pushed. This was surprisingly painful, all but
one branch had conflicts...

Greetings,

Andres Freund



Re: Two noncritical bugs of pg_waldump

From
Kyotaro Horiguchi
Date:
At Fri, 25 Feb 2022 10:48:47 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2022-02-14 18:18:47 +0900, Kyotaro Horiguchi wrote:
> > > If I give an empty file to the tool it complains as the follows.
> > > 
> > > > pg_waldump: fatal: could not read file "hoge": No such file or directory
> > > 
> > > No, the file exists.  The cause is it reads uninitialized errno to
> > > detect errors from the system call.  read(2) is defined to set errno
> > > always when it returns -1 and doesn't otherwise. Thus it seems to me
> > > that it is better to check that the return value is less than zero
> > > than to clear errno before the call to read().
> > 
> > So I post a patch contains only the indisputable part.
> 
> Thanks for the report and fix. Pushed. This was surprisingly painful, all but
> one branch had conflicts...

Ah, I didn't expect that this is committed so quickly.  I should have
created patches for all versions.  Anyway thanks for committing this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center