Thread: fix "Success" error messages
In the case of a corrupted database, I saw an error message like "Could not read from file ...: Success." from the SLRU module. This is because it checks that it reads or writes exactly BLCKSZ, and otherwise goes to the error path. The attached patch gives a different error message in this case. Because of the structure of this code, we don't have the information to do the usual "read %d of %zu", but at least this is better than reporting a "success" error.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 18, 2019 at 02:35:02PM +0200, Peter Eisentraut wrote:
> --- a/src/backend/access/transam/slru.c
> +++ b/src/backend/access/transam/slru.c
> @@ -923,15 +923,19 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
>  ereport(ERROR,
>  		(errcode_for_file_access(),
>  		 errmsg("could not access status of transaction %u", xid),
> -		 errdetail("Could not read from file \"%s\" at offset %u: %m.",
> -				   path, offset)));
> +		 errno == 0
> +		 ? errdetail("Short read from file \"%s\" at offset %u.", path, offset)
> +		 : errdetail("Could not read from file \"%s\" at offset %u: %m.",
> +					 path, offset)));

Perhaps using "Encountered partial read from file ..." would be better suited for the error message.

>  break;
>  case SLRU_WRITE_FAILED:
>  ereport(ERROR,
>  		(errcode_for_file_access(),
>  		 errmsg("could not access status of transaction %u", xid),
> -		 errdetail("Could not write to file \"%s\" at offset %u: %m.",
> -				   path, offset)));
> +		 errno == 0
> +		 ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
> +		 : errdetail("Could not write to file \"%s\" at offset %u: %m.",
> +					 path, offset)));

Similarly, "Encountered partial write to file ..." would be better? Not a 100% on using "Encountered" but "partial" seems to be the right word to use here.

Do note that SlruPhysicalWritePage() will always set errno to ENOSPC if errno was 0 during a write attempt, see [0]. Not sure this is a good assumption to make since write(2) should return ENOSPC if the storage medium is out of space [1].

[0] https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L854
[1] https://linux.die.net/man/2/write

--
Shawn Debnath
Amazon Web Services (AWS)
On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>> case SLRU_WRITE_FAILED:
>> ereport(ERROR,
>> 		(errcode_for_file_access(),
>> 		 errmsg("could not access status of transaction %u", xid),
>> -		 errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> -				   path, offset)));
>> +		 errno == 0
>> +		 ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>> +		 : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>> +					 path, offset)));

There is no point to call errcode_for_file_access() if we know that errno is 0. Not a big deal, still.. The last time I looked at that, the best way I could think of would be to replace slru_errcause with a proper error string generated at error time. Perhaps the partial read/write case does not justify this extra facility. I don't know. At least that would be more flexible.

> Similarly, "Encountered partial write to file ..." would be better? Not
> a 100% on using "Encountered" but "partial" seems to be the right word
> to use here.

811b6e36 has done some work so as we have more consistency in the error messages, and I don't think we should introduce more flavors of this stuff as that makes the life of translators harder.

- if (CloseTransientFile(fd))
+ if (CloseTransientFile(fd) < 0)

Some spots are missing:

$ git grep "CloseTransientFile" | grep "))" | wc -l
30

--
Michael
On 2019-06-19 04:51, Michael Paquier wrote:
> On Tue, Jun 18, 2019 at 09:13:19AM -0700, Shawn Debnath wrote:
>>> case SLRU_WRITE_FAILED:
>>> ereport(ERROR,
>>> 		(errcode_for_file_access(),
>>> 		 errmsg("could not access status of transaction %u", xid),
>>> -		 errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> -				   path, offset)));
>>> +		 errno == 0
>>> +		 ? errdetail("Short write to file \"%s\" at offset %u.", path, offset)
>>> +		 : errdetail("Could not write to file \"%s\" at offset %u: %m.",
>>> +					 path, offset)));
>
> There is no point to call errcode_for_file_access() if we know that
> errno is 0. Not a big deal, still.. The last time I looked at that,
> the best way I could think of would be to replace slru_errcause with a
> proper error string generated at error time. Perhaps the partial
> read/write case does not justify this extra facility. I don't know.
> At least that would be more flexible.

Here is an updated patch set that rearranges this a bit according to your suggestions, and also fixes some similar issues in pg_checksums.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 26, 2019 at 09:40:23PM +0200, Peter Eisentraut wrote:
> Here is an updated patch set that rearranges this a bit according to
> your suggestions, and also fixes some similar issues in pg_checksums.

Thanks for the new patch, and you are right that pg_checksums has been slacking here. There is the same issue with pg_verify_checksums in 11. Not sure that's worth a back-patch though. Those parts could find their way to v12 easily.

> -	ereport(ERROR,
> -			(errcode_for_file_access(),
> -			 errmsg("could not access status of transaction %u", xid),
> -			 errdetail("Could not read from file \"%s\" at offset %u: %m.",
> -					   path, offset)));
> +	if (errno)
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not access status of transaction %u", xid),
> +				 errdetail("Could not read from file \"%s\" at offset %u: %m.",
> +						   path, offset)));
> +	else
> +		ereport(ERROR,
> +				(errmsg("could not access status of transaction %u", xid),
> +				 errdetail("Could not read from file \"%s\" at offset %u: read too few bytes.", path, offset)));

Last time I worked on that, the following suggestion was made for error messages with shorter reads or writes:

could not read file \"%s\": read %d of %zu

Still this is clearly an improvement, and I think that's not worth the extra complication, so +1 for this way of doing things.

> 	if (r == 0)
> 		break;
> -	if (r != BLCKSZ)
> +	else if (r < 0)
> +	{
> +		pg_log_error("could not read block %u in file \"%s\": %m",
> +					 blockno, fn);
> +		exit(1);
> +	}
> +	else if (r != BLCKSZ)
> 	{
> 		pg_log_error("could not read block %u in file \"%s\": read %d of %d",
> 					 blockno, fn, r, BLCKSZ);

Other code areas (xlog.c, pg_waldump.c, etc.) prefer doing it this way, after checking the size read:

if (r != BLCKSZ)
{
    if (r < 0)
        pg_log_error("could not read blah: %m");
    else
        pg_log_error("could not read blah: read %d of %d");
}

> 	/* Write block with checksum */
> -	if (write(f, buf.data, BLCKSZ) != BLCKSZ)
> +	w = write(f, buf.data, BLCKSZ);
> +	if (w != BLCKSZ)
> 	{
> -		pg_log_error("could not write block %u in file \"%s\": %m",
> -					 blockno, fn);
> +		if (w < 0)
> +			pg_log_error("could not write block %u in file \"%s\": %m",
> +						 blockno, fn);
> +		else
> +			pg_log_error("could not write block %u in file \"%s\": wrote %d of %d",
> +						 blockno, fn, w, BLCKSZ);
> 		exit(1);
> 	}
> }

This is consistent.

--
Michael
On 2019-08-27 08:27, Michael Paquier wrote:
> Thanks for the new patch, and you are right that pg_checksums has been
> slacking here. There is the same issue with pg_verify_checksums in
> 11. Not sure that's worth a back-patch though. Those parts could
> find their way to v12 easily.

Committed to master and PG12 with your suggested changes.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 3, 2019 at 08:38:22AM +0200, Peter Eisentraut wrote:
> On 2019-08-27 08:27, Michael Paquier wrote:
> > Thanks for the new patch, and you are right that pg_checksums has been
> > slacking here. There is the same issue with pg_verify_checksums in
> > 11. Not sure that's worth a back-patch though. Those parts could
> > find their way to v12 easily.
>
> Committed to master and PG12 with your suggested changes.

This "Success" has happened so many times I think we should tell people to report any such error message as a bug by emitting a special error message line.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hello, pgsql-hackers

I'm gathering information about the following error.

FATAL: could not access status of transaction ..
DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.

This error has caused the server to fail to start with recovery. I got a report that it happened repeatedly at the newly generated standby cluster. I gave them advice to confirm the low-level server environment.

However, in addition to improving the message, should we retry reading the rest of the data in the case of reading too few bytes? What about a limited number of retries instead of a complete loop?

---
Haruka Takatsuka / SRA OSS, Inc. Japan

On Fri, 27 Sep 2019 11:52:39 -0400 Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Sep 3, 2019 at 08:38:22AM +0200, Peter Eisentraut wrote:
> > On 2019-08-27 08:27, Michael Paquier wrote:
> > > Thanks for the new patch, and you are right that pg_checksums has been
> > > slacking here. There is the same issue with pg_verify_checksums in
> > > 11. Not sure that's worth a back-patch though. Those parts could
> > > find their way to v12 easily.
> >
> > Committed to master and PG12 with your suggested changes.
>
> This "Success" has happened so many times I think we should tell people
> to report any such error message as a bug by emitting a special error
> message line.
>
> --
> Bruce Momjian <bruce@momjian.us>  http://momjian.us
> EnterpriseDB  http://enterprisedb.com
>
> + As you are, so once was I. As I am, so you will be. +
> + Ancient Roman grave inscription +
On 2019-11-21 02:42, TAKATSUKA Haruka wrote:
> FATAL: could not access status of transaction ..
> DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.
>
> This error has caused the server to fail to start with recovery.
> I got a report that it happened repeatedly at the newly generated
> standby cluster. I gave them advice to confirm the low-level server
> environment.
>
> However, in addition to improving the message, should we retry reading
> the rest of the data in the case of reading too few bytes?
> What about a limited number of retries instead of a complete loop?

If we thought that would help, there are probably hundreds or more other places where we read files that would need to be fixed up in the same way. That doesn't seem reasonable.

Also, it is my understanding that short reads can in practice only happen if the underlying storage is having a serious problem, so retrying wouldn't actually help much.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 21 Nov 2019 10:40:36 +0100 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2019-11-21 02:42, TAKATSUKA Haruka wrote:
> > FATAL: could not access status of transaction ..
> > DETAIL: Could not read from file (pg_clog/.... or pg_xact/....) ...: Success.
> >
> > This error has caused the server to fail to start with recovery.
> > I got a report that it happened repeatedly at the newly generated
> > standby cluster. I gave them advice to confirm the low-level server
> > environment.
> >
> > However, in addition to improving the message, should we retry reading
> > the rest of the data in the case of reading too few bytes?
> > What about a limited number of retries instead of a complete loop?
>
> If we thought that would help, there are probably hundreds or more other
> places where we read files that would need to be fixed up in the same
> way. That doesn't seem reasonable.
>
> Also, it is my understanding that short reads can in practice only
> happen if the underlying storage is having a serious problem, so
> retrying wouldn't actually help much.

OK, I understand. In our case, the standby DB cluster space is on DRBD. I will report the clear occurrence conditions if they are found.

thanks,
Haruka Takatsuka