Thread: fix "Success" error messages

fix "Success" error messages

From
Peter Eisentraut
Date:
In a case of a corrupted database, I saw an error message like

    Could not read from file ...: Success.

from the SLRU module.  This is because it checks that it reads or writes
exactly BLCKSZ, and else goes to the error path.  The attached patch
gives a different error message in this case.

Because of the structure of this code, we don't have the information to
do the usual "read %d of %zu", but at least this is better than
reporting a "success" error.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: fix "Success" error messages

From
Shawn Debnath
Date:
On Tue, Jun 18, 2019 at 02:35:02PM +0200, Peter Eisentraut wrote:
> --- a/src/backend/access/transam/slru.c
> +++ b/src/backend/access/transam/slru.c
> @@ -923,15 +923,19 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
>              ereport(ERROR,
>                      (errcode_for_file_access(),
>                       errmsg("could not access status of transaction %u", xid),
> -                     errdetail("Could not read from file \"%s\" at offset %u: %m.",
> -                               path, offset)));
> +                     errno == 0
> +                     ? errdetail("Short read from file \"%s\" at offset %u.", path, offset)
> +                     : errdetail("Could not read from file \"%s\" at offset %u: %m.",
> +                                 path, offset)));

Perhaps using "Encountered partial read from file ..." would be better 
suited for the error message.

>              break;
>          case SLRU_WRITE_FAILED:
>              ereport(ERROR,
>                      (errcode_for_file_access(),
>                       errmsg("could not access status of transaction %u", xid),
> -                     errdetail("Could not write to file \"%s\" at offset %u: %m.",
> -                               path, offset)));
> +                     errno == 0
> +                     ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
> +                     : errdetail("Could not write to file \"%s\" at offset %u: %m.",
> +                                 path, offset)));

Similarly, "Encountered partial write to file ..." would be better? Not 
a 100% on using "Encountered" but "partial" seems to be the right word 
to use here.

Do note that SlruPhysicalWritePage() will always set errno to ENOSPC if 
errno was 0 during a write attempt, see [0]. Not sure this is a good 
assumption to make since write(2) should return ENOSPC if the storage 
medium is out of space [1].

[0] 
https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L854
[1] https://linux.die.net/man/2/write

-- 
Shawn Debnath
Amazon Web Services (AWS)



Re: fix "Success" error messages

From
Michael Paquier
Date:
On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>>          case SLRU_WRITE_FAILED:
>>              ereport(ERROR,
>>                      (errcode_for_file_access(),
>>                       errmsg("could not access status of transaction %u", xid),
>> -                     errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> -                               path, offset)));
>> +                     errno == 0
>> +                     ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>> +                     : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> +                                 path, offset)));

There is no point to call errcode_for_file_access() if we know that
errno is 0.  Not a big deal, still..  The last time I looked at that,
the best way I could think of would be to replace slru_errcause with a
proper error string generated at error time.  Perhaps the partial
read/write case does not justify this extra facility.  I don't know.
At least that would be more flexible.

> Similarly, "Encountered partial write to file ..." would be better? Not
> a 100% on using "Encountered" but "partial" seems to be the right word
> to use here.

811b6e36 has done some work so as we have more consistency in the
error messages and I don't think we should introduce more flavors of
this stuff as that makes the life of translators harder.

-   if (CloseTransientFile(fd))
+   if (CloseTransientFile(fd) < 0)
Some spots are missing:
$ git grep "CloseTransientFile" | grep "))" | wc -l
30
--
Michael

Attachment

Re: fix "Success" error messages

From
Peter Eisentraut
Date:
On 2019-06-19 04:51, Michael Paquier wrote:
> On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>>>          case SLRU_WRITE_FAILED:
>>>              ereport(ERROR,
>>>                      (errcode_for_file_access(),
>>>                       errmsg("could not access status of transaction %u", xid),
>>> -                     errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> -                               path, offset)));
>>> +                     errno == 0
>>> +                     ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>>> +                     : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> +                                 path, offset)));
> 
> There is no point to call errcode_for_file_access() if we know that
> errno is 0.  Not a big deal, still..  The last time I looked at that,
> the best way I could think of would be to replace slru_errcause with a
> proper error string generated at error time.  Perhaps the partial
> read/write case does not justify this extra facility.  I don't know.
> At least that would be more flexible.

Here is an updated patch set that rearranges this a bit according to
your suggestions, and also fixes some similar issues in pg_checksums.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: fix "Success" error messages

From
Michael Paquier
Date:
On Mon, Aug 26, 2019 at 09:40:23PM +0200, Peter Eisentraut wrote:
> Here is an updated patch set that rearranges this a bit according to
> your suggestions, and also fixes some similar issues in pg_checksums.

Thanks for the new patch, and you are right that pg_checksums has been
slacking here.  There is the same issue with pg_verify_checksums in
11.  Not sure that's worth a back-patch though.  Those parts could
find their way to v12 easily.

> -            ereport(ERROR,
> -                    (errcode_for_file_access(),
> -                     errmsg("could not access status of transaction %u", xid),
> -                     errdetail("Could not read from file \"%s\" at offset %u: %m.",
> -                               path, offset)));
> +            if (errno)
> +                ereport(ERROR,
> +                        (errcode_for_file_access(),
> +                         errmsg("could not access status of transaction %u", xid),
> +                         errdetail("Could not read from file \"%s\" at offset %u: %m.",
> +                                   path, offset)));
> +            else
> +                ereport(ERROR,
> +                        (errmsg("could not access status of transaction %u", xid),
> +                         errdetail("Could not read from file \"%s\" at offset %u: read too few bytes.", path,
offset)));

Last time I worked on that, the following suggestion was made for
error messages with shorter reads or writes:
could not read file \"%s\": read %d of %zu
Still this is clearly an improvement and I that's not worth the extra
complication, so +1 for this way of doing things.

>          if (r == 0)
>              break;
> -        if (r != BLCKSZ)
> +        else if (r < 0)
> +        {
> +            pg_log_error("could not read block %u in file \"%s\": %m",
> +                         blockno, fn);
> +            exit(1);
> +        }
> +        else if (r != BLCKSZ)
>          {
>              pg_log_error("could not read block %u in file \"%s\": read %d of %d",
>                           blockno, fn, r, BLCKSZ);

Other code areas (xlog.c, pg_waldump.c, etc.) prefer doing it this
way, after checking the size read:
if (r != BLCKSZ)
{
    if (r < 0)
        pg_log_error("could not read blah: %m");
    else
        pg_log_error("could not read blah: read %d of %d")
}

>              /* Write block with checksum */
> -            if (write(f, buf.data, BLCKSZ) != BLCKSZ)
> +            w = write(f, buf.data, BLCKSZ);
> +            if (w != BLCKSZ)
>              {
> -                pg_log_error("could not write block %u in file \"%s\": %m",
> -                             blockno, fn);
> +                if (w < 0)
> +                    pg_log_error("could not write block %u in file \"%s\": %m",
> +                                 blockno, fn);
> +                else
> +                    pg_log_error("could not write block %u in file \"%s\": wrote %d of %d",
> +                                 blockno, fn, w, BLCKSZ);
>                  exit(1);
>              }
>          }

This is consistent.
--
Michael

Attachment

Re: fix "Success" error messages

From
Peter Eisentraut
Date:
On 2019-08-27 08:27, Michael Paquier wrote:
> Thanks for the new patch, and you are right that pg_checksums has been
> slacking here.  There is the same issue with pg_verify_checksums in
> 11.  Not sure that's worth a back-patch though.  Those parts could
> find their way to v12 easily.

Committed to master and PG12 with your suggested changes.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fix "Success" error messages

From
Bruce Momjian
Date:
On Tue, Sep  3, 2019 at 08:38:22AM +0200, Peter Eisentraut wrote:
> On 2019-08-27 08:27, Michael Paquier wrote:
> > Thanks for the new patch, and you are right that pg_checksums has been
> > slacking here.  There is the same issue with pg_verify_checksums in
> > 11.  Not sure that's worth a back-patch though.  Those parts could
> > find their way to v12 easily.
> 
> Committed to master and PG12 with your suggested changes.

This "Success" has happened so many times I think we should tell people
to report any such error message as a bug by emitting a special error
message line.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: fix "Success" error messages

From
TAKATSUKA Haruka
Date:
Hello, pgsql-hackers

I'm gathering information about the following error.

 FATAL: could not access status of transaction ..
 DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.

This error has caused the server to fail to start with recovery.
I got a report that it happend repeatedly at the newly generated
standby cluster. I gave them advice to comfirm the low level server
environment.

However, in addition to improving the message, should we retry to read 
the rest of the data in the case reading too few bytes? 
What about a limited number of retries instead of a complete loop?

---
 Haruka Takatsuka / SRA OSS, Inc. Japan

On Fri, 27 Sep 2019 11:52:39 -0400
Bruce Momjian <bruce@momjian.us> wrote:

> On Tue, Sep  3, 2019 at 08:38:22AM +0200, Peter Eisentraut wrote:
> > On 2019-08-27 08:27, Michael Paquier wrote:
> > > Thanks for the new patch, and you are right that pg_checksums has been
> > > slacking here.  There is the same issue with pg_verify_checksums in
> > > 11.  Not sure that's worth a back-patch though.  Those parts could
> > > find their way to v12 easily.
> > 
> > Committed to master and PG12 with your suggested changes.
> 
> This "Success" has happened so many times I think we should tell people
> to report any such error message as a bug by emitting a special error
> message line.
> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
> 
> + As you are, so once was I.  As I am, so you will be. +
> +                      Ancient Roman grave inscription +




Re: fix "Success" error messages

From
Peter Eisentraut
Date:
On 2019-11-21 02:42, TAKATSUKA Haruka wrote:
>   FATAL: could not access status of transaction ..
>   DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.
> 
> This error has caused the server to fail to start with recovery.
> I got a report that it happend repeatedly at the newly generated
> standby cluster. I gave them advice to comfirm the low level server
> environment.
> 
> However, in addition to improving the message, should we retry to read
> the rest of the data in the case reading too few bytes?
> What about a limited number of retries instead of a complete loop?

If we thought that would help, there are probably hundreds or more other 
places where we read files that would need to be fixed up in the same 
way.  That doesn't seem reasonable.

Also, it is my understanding that short reads can in practice only 
happen if the underlying storage is having a serious problem, so 
retrying wouldn't actually help much.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: fix "Success" error messages

From
TAKATSUKA Haruka
Date:
On Thu, 21 Nov 2019 10:40:36 +0100
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

> On 2019-11-21 02:42, TAKATSUKA Haruka wrote:
> >   FATAL: could not access status of transaction ..
> >   DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.
> > 
> > This error has caused the server to fail to start with recovery.
> > I got a report that it happend repeatedly at the newly generated
> > standby cluster. I gave them advice to comfirm the low level server
> > environment.
> > 
> > However, in addition to improving the message, should we retry to read
> > the rest of the data in the case reading too few bytes?
> > What about a limited number of retries instead of a complete loop?
> 
> If we thought that would help, there are probably hundreds or more other 
> places where we read files that would need to be fixed up in the same 
> way.  That doesn't seem reasonable.
> 
> Also, it is my understanding that short reads can in practice only 
> happen if the underlying storage is having a serious problem, so 
> retrying wouldn't actually help much.

OK, I understand.
In our case, the standby DB cluster space is on DRBD.
I will report the clear occurrence condition if it is found.

thanks,
Haruka Takatsuka