Thread: pg_restore: could not close data file: Success
I'm trying to restore a backup on a different machine and it terminates with the not really helpful messages: pg_restore: [directory archiver] could not close data file: Success pg_restore: [parallel archiver] a worker process died unexpectedly The backup was made with pg_dump --compress=5 -v -Fd -f "$dirname" -j 4 $db (so it's in directory format) The restore command was pg_restore -c --if-exists -d $db -j 4 -v $dirname (I would use -C, but due to suboptimal partitioning I have to use a different tablspace, so I need to create $db before the restore) Both machines are running Ubuntu 18.04 and PostgreSQL is version 11.7 from the pgdg repo. The error happens while restoring the data for the tables. My guess is that maybe one of the data files is damaged ("Success" probably means that errno is 0, so it wasn't a system call that failed, but something in the application). Does that sound plausible or should I look somewhere else? A web search returned nothing relevant. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
On 2020-04-15 12:01:46 +0200, Peter J. Holzer wrote: > I'm trying to restore a backup on a different machine and it terminates > with the not really helpful messages: > > pg_restore: [directory archiver] could not close data file: Success > pg_restore: [parallel archiver] a worker process died unexpectedly [...] > My guess is that maybe one of the data files is damaged As is often the case the matter became obvious a few minutes after writing the mail. There were indeed two file with length 0 in the dump. That happened because the backup failed because it couldn't obtain a lock on a table. I nicer error message (something like "cannot decompress '13503.dat.gz': Empty file") would have helped. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
Hello. Added -hackers. At Wed, 15 Apr 2020 12:14:25 +0200, "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote in > On 2020-04-15 12:01:46 +0200, Peter J. Holzer wrote: > > I'm trying to restore a backup on a different machine and it terminates > > with the not really helpful messages: > > > > pg_restore: [directory archiver] could not close data file: Success > > pg_restore: [parallel archiver] a worker process died unexpectedly > [...] > > My guess is that maybe one of the data files is damaged > > As is often the case the matter became obvious a few minutes after > writing the mail. > > There were indeed two file with length 0 in the dump. That happened > because the backup failed because it couldn't obtain a lock on a table. > > I nicer error message (something like "cannot decompress '13503.dat.gz': > Empty file") would have helped. Unfortunately, just emptying .dat.gz file doesn't worked for me. Anyway the message is emitted the following way. pg_backup_directoy.c: > if (cfclose(cfp) !=0) > fatal("could not close data file: %m"); %m doesn't work for some kinds of errors about compressed files but cfclose conseals the true cause. I'm surprised to find an old thread about the same issue. https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp But I don't think it's not acceptable that use fake errno for gzclose, but cfclose properly passes-through the error code from gzclose, so it is enought that the caller should recognize the difference. Please find the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c index 1417401086..3a8394d7f2 100644 --- a/src/bin/pg_dump/compress_io.c +++ b/src/bin/pg_dump/compress_io.c @@ -645,6 +645,13 @@ cfgets(cfp *fp, char *buf, int len) return fgets(buf, len, fp->uncompressedfp); } +/* + * cfclose close the stream + * + * Returns 0 if successfully closed the cfp. Most of errors are reported as -1 + * and errno is set. Otherwise the return value is the return value from + * gzclose and errno doesn't hold a meangful value. + */ int cfclose(cfp *fp) { @@ -665,6 +672,11 @@ cfclose(cfp *fp) #endif { result = fclose(fp->uncompressedfp); + + /* normalize error return, just in case EOF is not -1 */ + if (result != 0) + result = -1; + fp->uncompressedfp = NULL; } free_keep_errno(fp); diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index c9cce5ed8a..ecc6aa5fbb 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -108,6 +108,7 @@ void InitArchiveFmt_Directory(ArchiveHandle *AH) { lclContext *ctx; + int ret; /* Assuming static functions, this can be copied for each format. */ AH->ArchiveEntryPtr = _ArchiveEntry; @@ -218,8 +219,14 @@ InitArchiveFmt_Directory(ArchiveHandle *AH) ReadToc(AH); /* Nothing else in the file, so close it again... */ - if (cfclose(tocFH) != 0) - fatal("could not close TOC file: %m"); + ret = cfclose(tocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close TOC file: %m"); + else + fatal("could not close TOC file: zlib error (%d)", ret); + } ctx->dataFH = NULL; } } @@ -378,6 +385,7 @@ _PrintFileData(ArchiveHandle *AH, char *filename) char *buf; size_t buflen; cfp *cfp; + int ret; if (!filename) return; @@ -396,8 +404,15 @@ _PrintFileData(ArchiveHandle *AH, char *filename) } free(buf); - if (cfclose(cfp) !=0) - fatal("could not close data file: %m"); + + ret = cfclose(cfp); + if (ret < 0) + { + if (ret == -1) + fatal("could not close data file: %m"); + else + fatal("could not close data file: zlib error (%d)", ret); + } } /* @@ -429,6 +444,7 @@ _LoadBlobs(ArchiveHandle *AH) lclContext *ctx = (lclContext *) AH->formatData; char fname[MAXPGPATH]; char line[MAXPGPATH]; + int ret; StartRestoreBlobs(AH); @@ -460,9 +476,16 @@ _LoadBlobs(ArchiveHandle *AH) fatal("error reading large object TOC file \"%s\"", fname); - if (cfclose(ctx->blobsTocFH) != 0) - fatal("could not close large object TOC file \"%s\": %m", - fname); + ret = cfclose(ctx->blobsTocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close large object TOC file \"%s\": %m", + fname); + else + fatal("could not close large object TOC file \"%s\": zlib error (%d)", + fname, ret); + } ctx->blobsTocFH = NULL; @@ -555,6 +578,7 @@ _CloseArchive(ArchiveHandle *AH) { cfp *tocFH; char fname[MAXPGPATH]; + int ret; setFilePath(AH, fname, "toc.dat"); @@ -576,8 +600,14 @@ _CloseArchive(ArchiveHandle *AH) WriteHead(AH); AH->format = archDirectory; WriteToc(AH); - if (cfclose(tocFH) != 0) - fatal("could not close TOC file: %m"); + ret = cfclose(tocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close TOC file: %m"); + else + fatal("could not close TOC file: zlib error (%d)", ret); + } WriteDataChunks(AH, ctx->pstate); ParallelBackupEnd(AH, ctx->pstate);
Hello. Added -hackers. At Wed, 15 Apr 2020 12:14:25 +0200, "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote in > On 2020-04-15 12:01:46 +0200, Peter J. Holzer wrote: > > I'm trying to restore a backup on a different machine and it terminates > > with the not really helpful messages: > > > > pg_restore: [directory archiver] could not close data file: Success > > pg_restore: [parallel archiver] a worker process died unexpectedly > [...] > > My guess is that maybe one of the data files is damaged > > As is often the case the matter became obvious a few minutes after > writing the mail. > > There were indeed two file with length 0 in the dump. That happened > because the backup failed because it couldn't obtain a lock on a table. > > I nicer error message (something like "cannot decompress '13503.dat.gz': > Empty file") would have helped. Unfortunately, just emptying .dat.gz file doesn't worked for me. Anyway the message is emitted the following way. pg_backup_directoy.c: > if (cfclose(cfp) !=0) > fatal("could not close data file: %m"); %m doesn't work for some kinds of errors about compressed files but cfclose conseals the true cause. I'm surprised to find an old thread about the same issue. https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp But I don't think it's not acceptable that use fake errno for gzclose, but cfclose properly passes-through the error code from gzclose, so it is enought that the caller should recognize the difference. Please find the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c index 1417401086..3a8394d7f2 100644 --- a/src/bin/pg_dump/compress_io.c +++ b/src/bin/pg_dump/compress_io.c @@ -645,6 +645,13 @@ cfgets(cfp *fp, char *buf, int len) return fgets(buf, len, fp->uncompressedfp); } +/* + * cfclose close the stream + * + * Returns 0 if successfully closed the cfp. Most of errors are reported as -1 + * and errno is set. Otherwise the return value is the return value from + * gzclose and errno doesn't hold a meangful value. + */ int cfclose(cfp *fp) { @@ -665,6 +672,11 @@ cfclose(cfp *fp) #endif { result = fclose(fp->uncompressedfp); + + /* normalize error return, just in case EOF is not -1 */ + if (result != 0) + result = -1; + fp->uncompressedfp = NULL; } free_keep_errno(fp); diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index c9cce5ed8a..ecc6aa5fbb 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -108,6 +108,7 @@ void InitArchiveFmt_Directory(ArchiveHandle *AH) { lclContext *ctx; + int ret; /* Assuming static functions, this can be copied for each format. */ AH->ArchiveEntryPtr = _ArchiveEntry; @@ -218,8 +219,14 @@ InitArchiveFmt_Directory(ArchiveHandle *AH) ReadToc(AH); /* Nothing else in the file, so close it again... */ - if (cfclose(tocFH) != 0) - fatal("could not close TOC file: %m"); + ret = cfclose(tocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close TOC file: %m"); + else + fatal("could not close TOC file: zlib error (%d)", ret); + } ctx->dataFH = NULL; } } @@ -378,6 +385,7 @@ _PrintFileData(ArchiveHandle *AH, char *filename) char *buf; size_t buflen; cfp *cfp; + int ret; if (!filename) return; @@ -396,8 +404,15 @@ _PrintFileData(ArchiveHandle *AH, char *filename) } free(buf); - if (cfclose(cfp) !=0) - fatal("could not close data file: %m"); + + ret = cfclose(cfp); + if (ret < 0) + { + if (ret == -1) + fatal("could not close data file: %m"); + else + fatal("could not close data file: zlib error (%d)", ret); + } } /* @@ -429,6 +444,7 @@ _LoadBlobs(ArchiveHandle *AH) lclContext *ctx = (lclContext *) AH->formatData; char fname[MAXPGPATH]; char line[MAXPGPATH]; + int ret; StartRestoreBlobs(AH); @@ -460,9 +476,16 @@ _LoadBlobs(ArchiveHandle *AH) fatal("error reading large object TOC file \"%s\"", fname); - if (cfclose(ctx->blobsTocFH) != 0) - fatal("could not close large object TOC file \"%s\": %m", - fname); + ret = cfclose(ctx->blobsTocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close large object TOC file \"%s\": %m", + fname); + else + fatal("could not close large object TOC file \"%s\": zlib error (%d)", + fname, ret); + } ctx->blobsTocFH = NULL; @@ -555,6 +578,7 @@ _CloseArchive(ArchiveHandle *AH) { cfp *tocFH; char fname[MAXPGPATH]; + int ret; setFilePath(AH, fname, "toc.dat"); @@ -576,8 +600,14 @@ _CloseArchive(ArchiveHandle *AH) WriteHead(AH); AH->format = archDirectory; WriteToc(AH); - if (cfclose(tocFH) != 0) - fatal("could not close TOC file: %m"); + ret = cfclose(tocFH); + if (ret < 0) + { + if (ret == -1) + fatal("could not close TOC file: %m"); + else + fatal("could not close TOC file: zlib error (%d)", ret); + } WriteDataChunks(AH, ctx->pstate); ParallelBackupEnd(AH, ctx->pstate);
On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote: > I'm surprised to find an old thread about the same issue. > > https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp > > But I don't think it's not acceptable that use fake errno for gzclose, > but cfclose properly passes-through the error code from gzclose, so it > is enought that the caller should recognize the difference. A problem with this patch is that we may forget again to add this special error handling if more code paths use cfclose(). As of HEAD, there are three code paths where cfclose() is called but it does not generate an error: two when ending a blob and one when ending a data file. Perhaps it would make sense to just move all this error within the routine itself? Note that it would also mean registering file names in lclContext or equivalent as that's an important piece of the error message. -- Michael
Attachment
On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote: > I'm surprised to find an old thread about the same issue. > > https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp > > But I don't think it's not acceptable that use fake errno for gzclose, > but cfclose properly passes-through the error code from gzclose, so it > is enought that the caller should recognize the difference. A problem with this patch is that we may forget again to add this special error handling if more code paths use cfclose(). As of HEAD, there are three code paths where cfclose() is called but it does not generate an error: two when ending a blob and one when ending a data file. Perhaps it would make sense to just move all this error within the routine itself? Note that it would also mean registering file names in lclContext or equivalent as that's an important piece of the error message. -- Michael
At Thu, 16 Apr 2020 14:40:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote: > > I'm surprised to find an old thread about the same issue. > > > > https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp > > > > But I don't think it's not acceptable that use fake errno for gzclose, > > but cfclose properly passes-through the error code from gzclose, so it > > is enought that the caller should recognize the difference. > > A problem with this patch is that we may forget again to add this > special error handling if more code paths use cfclose(). Definitely. The reason for the patch is the error codes are diffrent according to callers and some of callers don't even checking the error (seemingly intentionally). > As of HEAD, there are three code paths where cfclose() is called but > it does not generate an error: two when ending a blob and one when > ending a data file. Perhaps it would make sense to just move all this > error within the routine itself? Note that it would also mean > registering file names in lclContext or equivalent as that's an > important piece of the error message. Hmm. Sounds reasonable. I'm going to do that. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Thu, 16 Apr 2020 14:40:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote: > > I'm surprised to find an old thread about the same issue. > > > > https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp > > > > But I don't think it's not acceptable that use fake errno for gzclose, > > but cfclose properly passes-through the error code from gzclose, so it > > is enought that the caller should recognize the difference. > > A problem with this patch is that we may forget again to add this > special error handling if more code paths use cfclose(). Definitely. The reason for the patch is the error codes are diffrent according to callers and some of callers don't even checking the error (seemingly intentionally). > As of HEAD, there are three code paths where cfclose() is called but > it does not generate an error: two when ending a blob and one when > ending a data file. Perhaps it would make sense to just move all this > error within the routine itself? Note that it would also mean > registering file names in lclContext or equivalent as that's an > important piece of the error message. Hmm. Sounds reasonable. I'm going to do that. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center