Thread: Deficient error handling in pg_dump and pg_basebackup

Deficient error handling in pg_dump and pg_basebackup

From
Tom Lane
Date:
Coverity complained about this code in bbstreamer_file.c,
saying that the get_gz_error() call could be accessing freed
memory:

    if (gzclose(mystreamer->gzfile) != 0)
    {
        pg_log_error("could not close compressed file \"%s\": %s",
                     mystreamer->pathname,
                     get_gz_error(mystreamer->gzfile));
        exit(1);
    }

I think it's right.  This certainly isn't sanctioned by the zlib
specification [1], nor does it look like our other gzclose() calls,
which simply consult errno.  (Note that this coding is not at all new,
but Coverity thinks it is because 23a1c6578 moved it to a new file.)

I set out to fix that, intending only to replace the use of
get_gz_error() with %m, but the patch soon metastasized to the
point where I think it needs review :-(.

0001 below addresses places that either didn't check the result of
gzclose() at all, or neglected to print a useful error message.
The zlib spec doesn't quite promise that a failed gzclose() will
set errno, but the cases where it might not seem like can't-happen
cases that we needn't spend much effort on.  So what I've done here
is just to initialize errno to zero before gzclose().  If we ever
get a report of "could not close file: Success" in one of these
places, we'll know to look more closely.

0002 below addresses a rather messier problem, which is that the
error handling in src/bin/pg_basebackup/walmethods.c is just
appallingly bad.  The design is awful, because it relies on errno
to hold still for much longer than we can sanely expect --- notably,
walmethods.c really shouldn't be assuming much about what the callers
might do between calling an error-generating method and calling
getlasterror().  Worse, there are a lot of places that naively expect
errno to hold still across free() and similar operations.  We know
that not to be the case on (at least) macOS.  On top of that,
commit babbbb595 (LZ4 support) broke the design completely, because
liblz4 doesn't use errno to report errors.  Separately from the
question of how to report errors, there were a number of places
that failed to detect errors in the first place, and more that were
sloppy about setting the correct error code for a short write.

What I chose to do to fix the design problem is to use a modified
version of the idea that existed in the tar-methods part of the file,
which is to have both a string field and an errno field in the
DirectoryMethodData and TarMethodData objects.  If the string isn't
NULL then it overrides the errno field, allowing us to report
non-errno problems.  Then, a bunch of places have to remember to copy
errno into the lasterrno field, which is a bit annoying.  On the other
hand, we can get rid of logic that tries to save and restore errno
across functions that might change it, so this does buy back some
complexity too.

There's at least one loose end here, which is that there's one
place in tar_close() that does a summary exit(1) on fsync failure,
without even bothering with an error message.  I added the
missing error report:

     /* Always fsync on close, so the padding gets fsynced */
     if (tar_sync(f) < 0)
+    {
+        /* XXX this seems pretty bogus; why is only this case fatal? */
+        pg_log_fatal("could not fsync file \"%s\": %s",
+                     tf->pathname, tar_getlasterror());
         exit(1);
+    }

but this code seems about as fishy and ill-thought-out as anything
else I've touched here.  Why is this different from the half-dozen
other fsync-error cases in the same file?  Why, if fsync failure
here is so catastrophic, is it okay to just return a normal failure
code when tar_close doesn't even get to this point because of some
earlier error?  At the very least it seems like it'd be preferable
to do the exit(1) at the caller level.

The addition of the exit(1) seems to have been intentional in
1e2fddfa3, so cc'ing Peter for comment.

            regards, tom lane

[1] https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/zlib-gzclose-1.html

diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index eba173f688..5dc828f742 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -303,11 +303,11 @@ bbstreamer_gzip_writer_finalize(bbstreamer *streamer)

     mystreamer = (bbstreamer_gzip_writer *) streamer;

+    errno = 0;                    /* in case gzclose() doesn't set it */
     if (gzclose(mystreamer->gzfile) != 0)
     {
-        pg_log_error("could not close compressed file \"%s\": %s",
-                     mystreamer->pathname,
-                     get_gz_error(mystreamer->gzfile));
+        pg_log_error("could not close compressed file \"%s\": %m",
+                     mystreamer->pathname);
         exit(1);
     }

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 2d4f660daa..d5db519f53 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -70,7 +70,12 @@ mark_file_as_archived(StreamCtl *stream, const char *fname)
         return false;
     }

-    stream->walmethod->close(f, CLOSE_NORMAL);
+    if (stream->walmethod->close(f, CLOSE_NORMAL) != 0)
+    {
+        pg_log_error("could not close archive status file \"%s\": %s",
+                     tmppath, stream->walmethod->getlasterror());
+        return false;
+    }

     return true;
 }
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index f1ba2a828a..5cc10c6eba 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -355,7 +355,10 @@ dir_close(Walfile f, WalCloseMethod method)

 #ifdef HAVE_LIBZ
     if (dir_data->compression_method == COMPRESSION_GZIP)
+    {
+        errno = 0;                /* in case gzclose() doesn't set it */
         r = gzclose(df->gzfp);
+    }
     else
 #endif
 #ifdef HAVE_LIBLZ4
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 2c4cfb9457..59f4fbb2cc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -268,6 +268,7 @@ CloseArchive(Archive *AHX)
     AH->ClosePtr(AH);

     /* Close the output */
+    errno = 0;                    /* in case gzclose() doesn't set it */
     if (AH->gzOut)
         res = GZCLOSE(AH->OF);
     else if (AH->OF != stdout)
@@ -1567,6 +1568,7 @@ RestoreOutput(ArchiveHandle *AH, OutputContext savedContext)
 {
     int            res;

+    errno = 0;                    /* in case gzclose() doesn't set it */
     if (AH->gzOut)
         res = GZCLOSE(AH->OF);
     else
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 8aff6bce38..4e0fb7d2d3 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -369,7 +369,8 @@ _EndData(ArchiveHandle *AH, TocEntry *te)
     lclContext *ctx = (lclContext *) AH->formatData;

     /* Close the file */
-    cfclose(ctx->dataFH);
+    if (cfclose(ctx->dataFH) != 0)
+        fatal("could not close data file: %m");

     ctx->dataFH = NULL;
 }
@@ -680,7 +681,8 @@ _EndBlob(ArchiveHandle *AH, TocEntry *te, Oid oid)
     int            len;

     /* Close the BLOB data file itself */
-    cfclose(ctx->dataFH);
+    if (cfclose(ctx->dataFH) != 0)
+        fatal("could not close blob data file: %m");
     ctx->dataFH = NULL;

     /* register the blob in blobs.toc */
@@ -699,7 +701,8 @@ _EndBlobs(ArchiveHandle *AH, TocEntry *te)
 {
     lclContext *ctx = (lclContext *) AH->formatData;

-    cfclose(ctx->blobsTocFH);
+    if (cfclose(ctx->blobsTocFH) != 0)
+        fatal("could not close blobs TOC file: %m");
     ctx->blobsTocFH = NULL;
 }

diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 65bcb41a2f..5c351acda0 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -438,8 +438,11 @@ tarClose(ArchiveHandle *AH, TAR_MEMBER *th)
      * Close the GZ file since we dup'd. This will flush the buffers.
      */
     if (AH->compression != 0)
+    {
+        errno = 0;                /* in case gzclose() doesn't set it */
         if (GZCLOSE(th->zFH) != 0)
-            fatal("could not close tar member");
+            fatal("could not close tar member: %m");
+    }

     if (th->mode == 'w')
         _tarAddFile(AH, th);    /* This will close the temp file */
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 5cc10c6eba..fe6b034637 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -27,6 +27,7 @@

 #include "common/file_perm.h"
 #include "common/file_utils.h"
+#include "common/logging.h"
 #include "pgtar.h"
 #include "receivelog.h"
 #include "streamutil.h"
@@ -51,6 +52,8 @@ typedef struct DirectoryMethodData
     WalCompressionMethod compression_method;
     int            compression_level;
     bool        sync;
+    const char *lasterrstring;    /* if set, takes precedence over lasterrno */
+    int            lasterrno;
 } DirectoryMethodData;
 static DirectoryMethodData *dir_data = NULL;

@@ -74,11 +77,17 @@ typedef struct DirectoryMethodFile
 #endif
 } DirectoryMethodFile;

+#define dir_clear_error() \
+    (dir_data->lasterrstring = NULL, dir_data->lasterrno = 0)
+#define dir_set_error(msg) \
+    (dir_data->lasterrstring = _(msg))
+
 static const char *
 dir_getlasterror(void)
 {
-    /* Directory method always sets errno, so just use strerror */
-    return strerror(errno);
+    if (dir_data->lasterrstring)
+        return dir_data->lasterrstring;
+    return strerror(dir_data->lasterrno);
 }

 static char *
@@ -98,7 +107,7 @@ dir_get_file_name(const char *pathname, const char *temp_suffix)
 static Walfile
 dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_size)
 {
-    static char tmppath[MAXPGPATH];
+    char        tmppath[MAXPGPATH];
     char       *filename;
     int            fd;
     DirectoryMethodFile *f;
@@ -111,6 +120,8 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
     void       *lz4buf = NULL;
 #endif

+    dir_clear_error();
+
     filename = dir_get_file_name(pathname, temp_suffix);
     snprintf(tmppath, sizeof(tmppath), "%s/%s",
              dir_data->basedir, filename);
@@ -124,7 +135,10 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
      */
     fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode);
     if (fd < 0)
+    {
+        dir_data->lasterrno = errno;
         return NULL;
+    }

 #ifdef HAVE_LIBZ
     if (dir_data->compression_method == COMPRESSION_GZIP)
@@ -132,6 +146,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         gzfp = gzdopen(fd, "wb");
         if (gzfp == NULL)
         {
+            dir_data->lasterrno = errno;
             close(fd);
             return NULL;
         }
@@ -139,6 +154,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         if (gzsetparams(gzfp, dir_data->compression_level,
                         Z_DEFAULT_STRATEGY) != Z_OK)
         {
+            dir_data->lasterrno = errno;
             gzclose(gzfp);
             return NULL;
         }
@@ -153,6 +169,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         ctx_out = LZ4F_createCompressionContext(&ctx, LZ4F_VERSION);
         if (LZ4F_isError(ctx_out))
         {
+            dir_data->lasterrstring = LZ4F_getErrorName(ctx_out);
             close(fd);
             return NULL;
         }
@@ -164,6 +181,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         header_size = LZ4F_compressBegin(ctx, lz4buf, lz4bufsize, NULL);
         if (LZ4F_isError(header_size))
         {
+            dir_data->lasterrstring = LZ4F_getErrorName(header_size);
             (void) LZ4F_freeCompressionContext(ctx);
             pg_free(lz4buf);
             close(fd);
@@ -173,17 +191,12 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         errno = 0;
         if (write(fd, lz4buf, header_size) != header_size)
         {
-            int            save_errno = errno;
-
+            /* If write didn't set errno, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;
             (void) LZ4F_compressEnd(ctx, lz4buf, lz4bufsize, NULL);
             (void) LZ4F_freeCompressionContext(ctx);
             pg_free(lz4buf);
             close(fd);
-
-            /*
-             * If write didn't set errno, assume problem is no disk space.
-             */
-            errno = save_errno ? save_errno : ENOSPC;
             return NULL;
         }
     }
@@ -201,24 +214,17 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
             errno = 0;
             if (write(fd, zerobuf.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
             {
-                int            save_errno = errno;
-
+                /* If write didn't set errno, assume problem is no disk space */
+                dir_data->lasterrno = errno ? errno : ENOSPC;
                 close(fd);
-
-                /*
-                 * If write didn't set errno, assume problem is no disk space.
-                 */
-                errno = save_errno ? save_errno : ENOSPC;
                 return NULL;
             }
         }

         if (lseek(fd, 0, SEEK_SET) != 0)
         {
-            int            save_errno = errno;
-
+            dir_data->lasterrno = errno;
             close(fd);
-            errno = save_errno;
             return NULL;
         }
     }
@@ -234,6 +240,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         if (fsync_fname(tmppath, false) != 0 ||
             fsync_parent_path(tmppath) != 0)
         {
+            dir_data->lasterrno = errno;
 #ifdef HAVE_LIBZ
             if (dir_data->compression_method == COMPRESSION_GZIP)
                 gzclose(gzfp);
@@ -285,10 +292,19 @@ dir_write(Walfile f, const void *buf, size_t count)
     DirectoryMethodFile *df = (DirectoryMethodFile *) f;

     Assert(f != NULL);
+    dir_clear_error();

 #ifdef HAVE_LIBZ
     if (dir_data->compression_method == COMPRESSION_GZIP)
+    {
+        errno = 0;
         r = (ssize_t) gzwrite(df->gzfp, buf, count);
+        if (r != count)
+        {
+            /* If write didn't set errno, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;
+        }
+    }
     else
 #endif
 #ifdef HAVE_LIBLZ4
@@ -315,10 +331,18 @@ dir_write(Walfile f, const void *buf, size_t count)
                                              NULL);

             if (LZ4F_isError(compressed))
+            {
+                dir_data->lasterrstring = LZ4F_getErrorName(compressed);
                 return -1;
+            }

+            errno = 0;
             if (write(df->fd, df->lz4buf, compressed) != compressed)
+            {
+                /* If write didn't set errno, assume problem is no disk space */
+                dir_data->lasterrno = errno ? errno : ENOSPC;
                 return -1;
+            }

             inbuf = ((char *) inbuf) + chunk;
         }
@@ -328,7 +352,15 @@ dir_write(Walfile f, const void *buf, size_t count)
     }
     else
 #endif
+    {
+        errno = 0;
         r = write(df->fd, buf, count);
+        if (r != count)
+        {
+            /* If write didn't set errno, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;
+        }
+    }
     if (r > 0)
         df->currpos += r;
     return r;
@@ -338,6 +370,7 @@ static off_t
 dir_get_current_pos(Walfile f)
 {
     Assert(f != NULL);
+    dir_clear_error();

     /* Use a cached value to prevent lots of reseeks */
     return ((DirectoryMethodFile *) f)->currpos;
@@ -348,10 +381,11 @@ dir_close(Walfile f, WalCloseMethod method)
 {
     int            r;
     DirectoryMethodFile *df = (DirectoryMethodFile *) f;
-    static char tmppath[MAXPGPATH];
-    static char tmppath2[MAXPGPATH];
+    char        tmppath[MAXPGPATH];
+    char        tmppath2[MAXPGPATH];

     Assert(f != NULL);
+    dir_clear_error();

 #ifdef HAVE_LIBZ
     if (dir_data->compression_method == COMPRESSION_GZIP)
@@ -371,10 +405,18 @@ dir_close(Walfile f, WalCloseMethod method)
                                       NULL);

         if (LZ4F_isError(compressed))
+        {
+            dir_data->lasterrstring = LZ4F_getErrorName(compressed);
             return -1;
+        }

+        errno = 0;
         if (write(df->fd, df->lz4buf, compressed) != compressed)
+        {
+            /* If write didn't set errno, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;
             return -1;
+        }

         r = close(df->fd);
     }
@@ -433,6 +475,9 @@ dir_close(Walfile f, WalCloseMethod method)
         }
     }

+    if (r != 0)
+        dir_data->lasterrno = errno;
+
 #ifdef HAVE_LIBLZ4
     pg_free(df->lz4buf);
     /* supports free on NULL */
@@ -451,7 +496,10 @@ dir_close(Walfile f, WalCloseMethod method)
 static int
 dir_sync(Walfile f)
 {
+    int            r;
+
     Assert(f != NULL);
+    dir_clear_error();

     if (!dir_data->sync)
         return 0;
@@ -460,7 +508,10 @@ dir_sync(Walfile f)
     if (dir_data->compression_method == COMPRESSION_GZIP)
     {
         if (gzflush(((DirectoryMethodFile *) f)->gzfp, Z_SYNC_FLUSH) != Z_OK)
+        {
+            dir_data->lasterrno = errno;
             return -1;
+        }
     }
 #endif
 #ifdef HAVE_LIBLZ4
@@ -472,27 +523,41 @@ dir_sync(Walfile f)
         /* Flush any internal buffers */
         compressed = LZ4F_flush(df->ctx, df->lz4buf, df->lz4bufsize, NULL);
         if (LZ4F_isError(compressed))
+        {
+            dir_data->lasterrstring = LZ4F_getErrorName(compressed);
             return -1;
+        }

+        errno = 0;
         if (write(df->fd, df->lz4buf, compressed) != compressed)
+        {
+            /* If write didn't set errno, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;
             return -1;
+        }
     }
 #endif

-    return fsync(((DirectoryMethodFile *) f)->fd);
+    r = fsync(((DirectoryMethodFile *) f)->fd);
+    if (r < 0)
+        dir_data->lasterrno = errno;
+    return r;
 }

 static ssize_t
 dir_get_file_size(const char *pathname)
 {
     struct stat statbuf;
-    static char tmppath[MAXPGPATH];
+    char        tmppath[MAXPGPATH];

     snprintf(tmppath, sizeof(tmppath), "%s/%s",
              dir_data->basedir, pathname);

     if (stat(tmppath, &statbuf) != 0)
+    {
+        dir_data->lasterrno = errno;
         return -1;
+    }

     return statbuf.st_size;
 }
@@ -506,9 +571,11 @@ dir_compression_method(void)
 static bool
 dir_existsfile(const char *pathname)
 {
-    static char tmppath[MAXPGPATH];
+    char        tmppath[MAXPGPATH];
     int            fd;

+    dir_clear_error();
+
     snprintf(tmppath, sizeof(tmppath), "%s/%s",
              dir_data->basedir, pathname);

@@ -522,6 +589,8 @@ dir_existsfile(const char *pathname)
 static bool
 dir_finish(void)
 {
+    dir_clear_error();
+
     if (dir_data->sync)
     {
         /*
@@ -529,7 +598,10 @@ dir_finish(void)
          * directory entry here as well.
          */
         if (fsync_fname(dir_data->basedir, true) != 0)
+        {
+            dir_data->lasterrno = errno;
             return false;
+        }
     }
     return true;
 }
@@ -569,6 +641,7 @@ FreeWalDirectoryMethod(void)
 {
     pg_free(dir_data->basedir);
     pg_free(dir_data);
+    dir_data = NULL;
 }


@@ -594,7 +667,8 @@ typedef struct TarMethodData
     int            compression_level;
     bool        sync;
     TarMethodFile *currentfile;
-    char        lasterror[1024];
+    const char *lasterrstring;    /* if set, takes precedence over lasterrno */
+    int            lasterrno;
 #ifdef HAVE_LIBZ
     z_streamp    zp;
     void       *zlibOut;
@@ -602,19 +676,17 @@ typedef struct TarMethodData
 } TarMethodData;
 static TarMethodData *tar_data = NULL;

-#define tar_clear_error() tar_data->lasterror[0] = '\0'
-#define tar_set_error(msg) strlcpy(tar_data->lasterror, _(msg), sizeof(tar_data->lasterror))
+#define tar_clear_error() \
+    (tar_data->lasterrstring = NULL, tar_data->lasterrno = 0)
+#define tar_set_error(msg) \
+    (tar_data->lasterrstring = _(msg))

 static const char *
 tar_getlasterror(void)
 {
-    /*
-     * If a custom error is set, return that one. Otherwise, assume errno is
-     * set and return that one.
-     */
-    if (tar_data->lasterror[0])
-        return tar_data->lasterror;
-    return strerror(errno);
+    if (tar_data->lasterrstring)
+        return tar_data->lasterrstring;
+    return strerror(tar_data->lasterrno);
 }

 #ifdef HAVE_LIBZ
@@ -642,11 +714,8 @@ tar_write_compressed_data(void *buf, size_t count, bool flush)
             errno = 0;
             if (write(tar_data->fd, tar_data->zlibOut, len) != len)
             {
-                /*
-                 * If write didn't set errno, assume problem is no disk space.
-                 */
-                if (errno == 0)
-                    errno = ENOSPC;
+                /* If write didn't set errno, assume problem is no disk space */
+                tar_data->lasterrno = errno ? errno : ENOSPC;
                 return false;
             }

@@ -683,9 +752,15 @@ tar_write(Walfile f, const void *buf, size_t count)
     /* Tarfile will always be positioned at the end */
     if (!tar_data->compression_level)
     {
+        errno = 0;
         r = write(tar_data->fd, buf, count);
-        if (r > 0)
-            ((TarMethodFile *) f)->currpos += r;
+        if (r != count)
+        {
+            /* If write didn't set errno, assume problem is no disk space */
+            tar_data->lasterrno = errno ? errno : ENOSPC;
+            return -1;
+        }
+        ((TarMethodFile *) f)->currpos += r;
         return r;
     }
 #ifdef HAVE_LIBZ
@@ -698,8 +773,11 @@ tar_write(Walfile f, const void *buf, size_t count)
     }
 #else
     else
+    {
         /* Can't happen - compression enabled with no libz */
+        tar_data->lasterrno = ENOSYS;
         return -1;
+    }
 #endif
 }

@@ -737,7 +815,6 @@ tar_get_file_name(const char *pathname, const char *temp_suffix)
 static Walfile
 tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_size)
 {
-    int            save_errno;
     char       *tmppath;

     tar_clear_error();
@@ -751,7 +828,10 @@ tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
                             O_WRONLY | O_CREAT | PG_BINARY,
                             pg_file_create_mode);
         if (tar_data->fd < 0)
+        {
+            tar_data->lasterrno = errno;
             return NULL;
+        }

 #ifdef HAVE_LIBZ
         if (tar_data->compression_level)
@@ -782,7 +862,6 @@ tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         /* There's no tar header itself, the file starts with regular files */
     }

-    Assert(tar_data->currentfile == NULL);
     if (tar_data->currentfile != NULL)
     {
         tar_set_error("implementation error: tar files can't have more than one open file");
@@ -824,10 +903,9 @@ tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
     tar_data->currentfile->ofs_start = lseek(tar_data->fd, 0, SEEK_CUR);
     if (tar_data->currentfile->ofs_start == -1)
     {
-        save_errno = errno;
+        tar_data->lasterrno = errno;
         pg_free(tar_data->currentfile);
         tar_data->currentfile = NULL;
-        errno = save_errno;
         return NULL;
     }
     tar_data->currentfile->currpos = 0;
@@ -838,11 +916,10 @@ tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         if (write(tar_data->fd, tar_data->currentfile->header,
                   TAR_BLOCK_SIZE) != TAR_BLOCK_SIZE)
         {
-            save_errno = errno;
+            /* If write didn't set errno, assume problem is no disk space */
+            tar_data->lasterrno = errno ? errno : ENOSPC;
             pg_free(tar_data->currentfile);
             tar_data->currentfile = NULL;
-            /* if write didn't set errno, assume problem is no disk space */
-            errno = save_errno ? save_errno : ENOSPC;
             return NULL;
         }
     }
@@ -875,12 +952,16 @@ tar_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_
         if (!tar_data->compression_level)
         {
             /* Uncompressed, so pad now */
-            tar_write_padding_data(tar_data->currentfile, pad_to_size);
+            if (!tar_write_padding_data(tar_data->currentfile, pad_to_size))
+                return NULL;
             /* Seek back to start */
             if (lseek(tar_data->fd,
                       tar_data->currentfile->ofs_start + TAR_BLOCK_SIZE,
                       SEEK_SET) != tar_data->currentfile->ofs_start + TAR_BLOCK_SIZE)
+            {
+                tar_data->lasterrno = errno;
                 return NULL;
+            }

             tar_data->currentfile->currpos = 0;
         }
@@ -895,7 +976,7 @@ tar_get_file_size(const char *pathname)
     tar_clear_error();

     /* Currently not used, so not supported */
-    errno = ENOSYS;
+    tar_data->lasterrno = ENOSYS;
     return -1;
 }

@@ -917,6 +998,8 @@ tar_get_current_pos(Walfile f)
 static int
 tar_sync(Walfile f)
 {
+    int            r;
+
     Assert(f != NULL);
     tar_clear_error();

@@ -930,7 +1013,10 @@ tar_sync(Walfile f)
     if (tar_data->compression_level)
         return 0;

-    return fsync(tar_data->fd);
+    r = fsync(tar_data->fd);
+    if (r < 0)
+        tar_data->lasterrno = errno;
+    return r;
 }

 static int
@@ -957,7 +1043,10 @@ tar_close(Walfile f, WalCloseMethod method)
          * allow writing of the very last file.
          */
         if (ftruncate(tar_data->fd, tf->ofs_start) != 0)
+        {
+            tar_data->lasterrno = errno;
             return -1;
+        }

         pg_free(tf->pathname);
         pg_free(tf);
@@ -1018,10 +1107,7 @@ tar_close(Walfile f, WalCloseMethod method)
     {
         /* Flush the current buffer */
         if (!tar_write_compressed_data(NULL, 0, true))
-        {
-            errno = EINVAL;
             return -1;
-        }
     }
 #endif

@@ -1042,15 +1128,17 @@ tar_close(Walfile f, WalCloseMethod method)

     print_tar_number(&(tf->header[148]), 8, tarChecksum(((TarMethodFile *) f)->header));
     if (lseek(tar_data->fd, tf->ofs_start, SEEK_SET) != ((TarMethodFile *) f)->ofs_start)
+    {
+        tar_data->lasterrno = errno;
         return -1;
+    }
     if (!tar_data->compression_level)
     {
         errno = 0;
         if (write(tar_data->fd, tf->header, TAR_BLOCK_SIZE) != TAR_BLOCK_SIZE)
         {
-            /* if write didn't set errno, assume problem is no disk space */
-            if (errno == 0)
-                errno = ENOSPC;
+            /* If write didn't set errno, assume problem is no disk space */
+            tar_data->lasterrno = errno ? errno : ENOSPC;
             return -1;
         }
     }
@@ -1080,11 +1168,19 @@ tar_close(Walfile f, WalCloseMethod method)

     /* Move file pointer back down to end, so we can write the next file */
     if (lseek(tar_data->fd, 0, SEEK_END) < 0)
+    {
+        tar_data->lasterrno = errno;
         return -1;
+    }

     /* Always fsync on close, so the padding gets fsynced */
     if (tar_sync(f) < 0)
+    {
+        /* XXX this seems pretty bogus; why is only this case fatal? */
+        pg_log_fatal("could not fsync file \"%s\": %s",
+                     tf->pathname, tar_getlasterror());
         exit(1);
+    }

     /* Clean up and done */
     pg_free(tf->pathname);
@@ -1122,9 +1218,8 @@ tar_finish(void)
         errno = 0;
         if (write(tar_data->fd, zerobuf, sizeof(zerobuf)) != sizeof(zerobuf))
         {
-            /* if write didn't set errno, assume problem is no disk space */
-            if (errno == 0)
-                errno = ENOSPC;
+            /* If write didn't set errno, assume problem is no disk space */
+            tar_data->lasterrno = errno ? errno : ENOSPC;
             return false;
         }
     }
@@ -1159,8 +1254,7 @@ tar_finish(void)
                      * If write didn't set errno, assume problem is no disk
                      * space.
                      */
-                    if (errno == 0)
-                        errno = ENOSPC;
+                    tar_data->lasterrno = errno ? errno : ENOSPC;
                     return false;
                 }
             }
@@ -1180,20 +1274,28 @@ tar_finish(void)
     if (tar_data->sync)
     {
         if (fsync(tar_data->fd) != 0)
+        {
+            tar_data->lasterrno = errno;
             return false;
+        }
     }

     if (close(tar_data->fd) != 0)
+    {
+        tar_data->lasterrno = errno;
         return false;
+    }

     tar_data->fd = -1;

     if (tar_data->sync)
     {
-        if (fsync_fname(tar_data->tarfilename, false) != 0)
-            return false;
-        if (fsync_parent_path(tar_data->tarfilename) != 0)
+        if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+            fsync_parent_path(tar_data->tarfilename) != 0)
+        {
+            tar_data->lasterrno = errno;
             return false;
+        }
     }

     return true;
@@ -1250,4 +1352,5 @@ FreeWalTarMethod(void)
         pg_free(tar_data->zlibOut);
 #endif
     pg_free(tar_data);
+    tar_data = NULL;
 }

Re: Deficient error handling in pg_dump and pg_basebackup

From
Michael Paquier
Date:
On Tue, Nov 09, 2021 at 03:20:31PM -0500, Tom Lane wrote:
> I think it's right.  This certainly isn't sanctioned by the zlib
> specification [1], nor does it look like our other gzclose() calls,
> which simply consult errno.  (Note that this coding is not at all new,
> but Coverity thinks it is because 23a1c6578 moved it to a new file.)
>
> I set out to fix that, intending only to replace the use of
> get_gz_error() with %m, but the patch soon metastasized to the
> point where I think it needs review :-(.

This points gzclose_w() to gzwrite.c.  Looking at its end, it could
return Z_ERRNO after freeing everything when failing to close the
state's fd.  So I agree that coverity is right about this
possibility.  That's unlikely going to happen, though.

> 0001 below addresses places that either didn't check the result of
> gzclose() at all, or neglected to print a useful error message.
> The zlib spec doesn't quite promise that a failed gzclose() will
> set errno, but the cases where it might not seem like can't-happen
> cases that we needn't spend much effort on.

The zlib code does not do much regarding that, but system calls
would.  Still that joins with the point you are doing below where
free() may not hold errno, and note that the last thing gzclose() does
is a free().  :p

> So what I've done here
> is just to initialize errno to zero before gzclose().  If we ever
> get a report of "could not close file: Success" in one of these
> places, we'll know to look more closely.

Okay.

> On top of that,
> commit babbbb595 (LZ4 support) broke the design completely, because
> liblz4 doesn't use errno to report errors.  Separately from the
> question of how to report errors, there were a number of places
> that failed to detect errors in the first place, and more that were
> sloppy about setting the correct error code for a short write.

This one's my fault, sorry about that :/

What you are doing here looks fine for the LZ4F_isError() paths.

> What I chose to do to fix the design problem is to use a modified
> version of the idea that existed in the tar-methods part of the file,
> which is to have both a string field and an errno field in the
> DirectoryMethodData and TarMethodData objects.  If the string isn't
> NULL then it overrides the errno field, allowing us to report
> non-errno problems.  Then, a bunch of places have to remember to copy
> errno into the lasterrno field, which is a bit annoying.  On the other
> hand, we can get rid of logic that tries to save and restore errno
> across functions that might change it, so this does buy back some
> complexity too.

This makes the code a bit harder to follow when it comes to cascading
calls with tar_write_*() or tar_close(), but your approach looks
simple enough for this code.  I don't think you have missed a spot
after scanning the whole area.

> There's at least one loose end here, which is that there's one
> place in tar_close() that does a summary exit(1) on fsync failure,
> without even bothering with an error message.  I added the
> missing error report:
>
>      /* Always fsync on close, so the padding gets fsynced */
>      if (tar_sync(f) < 0)
> +    {
> +        /* XXX this seems pretty bogus; why is only this case fatal? */
> +        pg_log_fatal("could not fsync file \"%s\": %s",
> +                     tf->pathname, tar_getlasterror());
>          exit(1);
> +    }

> but this code seems about as fishy and ill-thought-out as anything
> else I've touched here.  Why is this different from the half-dozen
> other fsync-error cases in the same file?  Why, if fsync failure
> here is so catastrophic, is it okay to just return a normal failure
> code when tar_close doesn't even get to this point because of some
> earlier error?

Hmm, I don't think that's fine.  There is an extra one in tar_finish()
that would report a failure, but not exit() immediately.  All the
callers of the sync callback in receivelog.c exit() on sight, but I am
wondering if it would not be saner to just exit() from walmethods.c
each time we see a failure with a fsync().

> At the very least it seems like it'd be preferable
> to do the exit(1) at the caller level.

You mean walmethods.c here, right?
--
Michael

Attachment

Re: Deficient error handling in pg_dump and pg_basebackup

From
Michael Paquier
Date:
On Wed, Nov 10, 2021 at 02:03:11PM +0900, Michael Paquier wrote:
>> but this code seems about as fishy and ill-thought-out as anything
>> else I've touched here.  Why is this different from the half-dozen
>> other fsync-error cases in the same file?  Why, if fsync failure
>> here is so catastrophic, is it okay to just return a normal failure
>> code when tar_close doesn't even get to this point because of some
>> earlier error?
>
> Hmm, I don't think that's fine.  There is an extra one in tar_finish()
> that would report a failure, but not exit() immediately.  All the
> callers of the sync callback in receivelog.c exit() on sight, but I am
> wondering if it would not be saner to just exit() from walmethods.c
> each time we see a failure with a fsync().

Taking the issues with fsync() aside, which could be improved as a
separate patch, is there anything I can do for this thread?  The error
handling problems in walmethods.c introduced by the integration of LZ4
are on me, so I'd like to fix it.  0002 depends on some parts of 0001,
as well for walmethods.c and the new error code paths.  And I have
been through this code quite a lot recently.
--
Michael

Attachment

Re: Deficient error handling in pg_dump and pg_basebackup

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Taking the issues with fsync() aside, which could be improved as a
> separate patch, is there anything I can do for this thread?  The error
> handling problems in walmethods.c introduced by the integration of LZ4
> are on me, so I'd like to fix it.  0002 depends on some parts of 0001,
> as well for walmethods.c and the new error code paths.  And I have
> been through this code quite a lot recently.

I feel like doing an immediate exit() for these errors and not other
ones is a pretty terrible idea, mainly because it doesn't account for
the question of what to do with a failure that prevents us from getting
to the fsync() call in the first place.  So I'd like to see a better
design rather than more quick hacking.  I confess I don't have a clear
idea of what "a better design" would look like.

However, that's largely orthogonal to any of the things my proposed
patches are trying to fix.  If you want to review the patches without
considering the fsync-error-handling problem, that'd be great.

            regards, tom lane



Re: Deficient error handling in pg_dump and pg_basebackup

From
Michael Paquier
Date:
On Tue, Nov 16, 2021 at 10:26:11PM -0500, Tom Lane wrote:
> I feel like doing an immediate exit() for these errors and not other
> ones is a pretty terrible idea, mainly because it doesn't account for
> the question of what to do with a failure that prevents us from getting
> to the fsync() call in the first place.  So I'd like to see a better
> design rather than more quick hacking.  I confess I don't have a
> clear idea of what "a better design" would look like.

[ .. thinks .. ]

We cannot really have something equivalent to data_sync_retry in the
frontends.  But I'd like to think that it would be fine for
pg_basebackup to just exit() on this failure so as callers would be
able to retry a base backup.  pg_receivewal is more tricky though.  An
exit() would allow for flush retries of a previous WAL segment where
things failed, but that stands when using --no-loop (still the code
path triggered by this option would not be used).  When not using
--no-loop, it may be better to actually just retry streaming from the
previous point so as the error should be reported from walmethods.c to
the upper stack anyway.

> However, that's largely orthogonal to any of the things my proposed
> patches are trying to fix.  If you want to review the patches without
> considering the fsync-error-handling problem, that'd be great.

I have looked at them upthread, FWIW:
https://www.postgresql.org/message-id/YYtSj5vlWp5faVXz@paquier.xyz
Your proposals still look rather sane to me, after a second look.
--
Michael

Attachment

Re: Deficient error handling in pg_dump and pg_basebackup

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Nov 16, 2021 at 10:26:11PM -0500, Tom Lane wrote:
>> However, that's largely orthogonal to any of the things my proposed
>> patches are trying to fix.  If you want to review the patches without
>> considering the fsync-error-handling problem, that'd be great.

> I have looked at them upthread, FWIW:
> https://www.postgresql.org/message-id/YYtSj5vlWp5faVXz@paquier.xyz
> Your proposals still look rather sane to me, after a second look.

Pushed then; thanks for reviewing that.  We can consider the fsync
error question at leisure.

            regards, tom lane



Re: Deficient error handling in pg_dump and pg_basebackup

From
Michael Paquier
Date:
On Wed, Nov 17, 2021 at 02:19:20PM -0500, Tom Lane wrote:
> Pushed then; thanks for reviewing that.  We can consider the fsync
> error question at leisure.

Fine by me.  Thanks for the commit.
--
Michael

Attachment

Re: Deficient error handling in pg_dump and pg_basebackup

From
Peter Eisentraut
Date:
On 09.11.21 21:20, Tom Lane wrote:
> Why is this different from the half-dozen
> other fsync-error cases in the same file?  Why, if fsync failure
> here is so catastrophic, is it okay to just return a normal failure
> code when tar_close doesn't even get to this point because of some
> earlier error?  At the very least it seems like it'd be preferable
> to do the exit(1) at the caller level.
> 
> The addition of the exit(1) seems to have been intentional in
> 1e2fddfa3, so cc'ing Peter for comment.

That commit addressed the behavior of fsync() failure in pg_receivewal 
and pg_recvlogical, which are long-running daemon processes, so this 
change was analogous to the server-side changes at the time.  I don't 
know what the behavior of fsync() failure in pg_basebackup should be, so 
calls that are only reachable from pg_basebackup are currently being 
handled differently.