Thread: pread, pwrite, etc return ssize_t not int

pread, pwrite, etc return ssize_t not int

From
Tom Lane
Date:
Coverity whinged this morning about the following bit in
the new pg_combinebackup code:

644                 unsigned    rb;
645
646                 /* Read the block from the correct source, except if dry-run. */
647                 rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
648                 if (rb != BLCKSZ)
649                 {
>>>     CID 1559912:  Control flow issues  (NO_EFFECT)
>>>     This less-than-zero comparison of an unsigned value is never true. "rb < 0U".
650                     if (rb < 0)
651                         pg_fatal("could not read file \"%s\": %m", s->filename);

It's dead right to complain of course.  (I kind of think that the
majority of places where reconstruct.c is using "unsigned" variables
are poorly-thought-through; many of them look like they should be
size_t, and I suspect some other ones beside this one are flat wrong
or at least unnecessarily fragile.  But I digress.)

While looking around for other places that might've made comparable
mistakes, I noted that we have places that are storing the result of
pg_pread[v]/pg_pwrite[v] into an "int" variable even though they are
passing a size_t count argument that there is no obvious reason to
believe must fit in int.  This seems like trouble waiting to happen,
so I fixed some of these in the attached.  The major remaining place
that I think we ought to change is the newly-minted
FileRead[V]/FileWrite[V] functions, which are declared to return int
but really should be returning ssize_t IMO.  I didn't do that here
though.

We could go further by insisting that *all* uses of pg_pread/pg_pwrite
use ssize_t result variables.  I think that's probably overkill --- in
the example above, which is only asking to write BLCKSZ worth of data,
surely an int is sufficient.  But you could argue that allowing this
pattern at all creates risk of copy/paste errors.

Of course the real elephant in the room is that plain old read(2)
and write(2) also return ssize_t.  I've not attempted to vet every
call of those, and I think it'd likely be a waste of effort, as
we're unlikely to ever try to shove more than INT_MAX worth of
data through them.  But it's a bit harder to make that argument
for the iovec-based file APIs.  I think we ought to try to keep
our uses of those functions clean on this point.

Thoughts?

            regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e9019156a..1264849883 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2280,7 +2280,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
             char       *from;
             Size        nbytes;
             Size        nleft;
-            int            written;
+            ssize_t        written;
             instr_time    start;

             /* OK to write the page(s) */
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5ee9628422..84246739ae 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -117,8 +117,8 @@ static void perform_base_backup(basebackup_options *opt, bbsink *sink,
                                 IncrementalBackupInfo *ib);
 static void parse_basebackup_options(List *options, basebackup_options *opt);
 static int    compareWalFileNames(const ListCell *a, const ListCell *b);
-static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
-                                 const char *filename, bool partial_read_ok);
+static ssize_t basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
+                                    const char *filename, bool partial_read_ok);

 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
@@ -525,7 +525,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
         {
             char       *walFileName = (char *) lfirst(lc);
             int            fd;
-            size_t        cnt;
+            ssize_t        cnt;
             pgoff_t        len = 0;

             snprintf(pathbuf, MAXPGPATH, XLOGDIR "/%s", walFileName);
@@ -2079,11 +2079,11 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
  *
  * Returns the number of bytes read.
  */
-static int
+static ssize_t
 basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
                      const char *filename, bool partial_read_ok)
 {
-    int            rc;
+    ssize_t        rc;

     pgstat_report_wait_start(WAIT_EVENT_BASEBACKUP_READ);
     rc = pg_pread(fd, buf, nbytes, offset);
@@ -2096,7 +2096,7 @@ basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
     if (!partial_read_ok && rc > 0 && rc != nbytes)
         ereport(ERROR,
                 (errcode_for_file_access(),
-                 errmsg("could not read file \"%s\": read %d of %zu",
+                 errmsg("could not read file \"%s\": read %zd of %zu",
                         filename, rc, nbytes)));

     return rc;
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 6decdd8934..874e6cd150 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,7 +504,7 @@ make_rfile(char *filename, bool missing_ok)
 static void
 read_bytes(rfile *rf, void *buffer, unsigned length)
 {
-    unsigned    rb = read(rf->fd, buffer, length);
+    int            rb = read(rf->fd, buffer, length);

     if (rb != length)
     {
@@ -512,7 +512,7 @@ read_bytes(rfile *rf, void *buffer, unsigned length)
             pg_fatal("could not read file \"%s\": %m", rf->filename);
         else
             pg_fatal("could not read file \"%s\": read only %d of %d bytes",
-                     rf->filename, (int) rb, length);
+                     rf->filename, rb, length);
     }
 }

@@ -614,7 +614,7 @@ write_reconstructed_file(char *input_filename,
     {
         uint8        buffer[BLCKSZ];
         rfile       *s = sourcemap[i];
-        unsigned    wb;
+        int            wb;

         /* Update accounting information. */
         if (s == NULL)
@@ -641,7 +641,7 @@ write_reconstructed_file(char *input_filename,
         }
         else
         {
-            unsigned    rb;
+            int            rb;

             /* Read the block from the correct source, except if dry-run. */
             rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
@@ -650,9 +650,9 @@ write_reconstructed_file(char *input_filename,
                 if (rb < 0)
                     pg_fatal("could not read file \"%s\": %m", s->filename);
                 else
-                    pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
-                             s->filename, (int) rb, BLCKSZ,
-                             (unsigned) offsetmap[i]);
+                    pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %llu",
+                             s->filename, rb, BLCKSZ,
+                             (unsigned long long) offsetmap[i]);
             }
         }

@@ -663,7 +663,7 @@ write_reconstructed_file(char *input_filename,
                 pg_fatal("could not write file \"%s\": %m", output_filename);
             else
                 pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
-                         output_filename, (int) wb, BLCKSZ);
+                         output_filename, wb, BLCKSZ);
         }

         /* Update the checksum computation. */

Re: pread, pwrite, etc return ssize_t not int

From
Thomas Munro
Date:
On Mon, Dec 25, 2023 at 7:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Coverity whinged this morning about the following bit in
> the new pg_combinebackup code:
>
> 644                 unsigned    rb;
> 645
> 646                 /* Read the block from the correct source, except if dry-run. */
> 647                 rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
> 648                 if (rb != BLCKSZ)
> 649                 {
> >>>     CID 1559912:  Control flow issues  (NO_EFFECT)
> >>>     This less-than-zero comparison of an unsigned value is never true. "rb < 0U".
> 650                     if (rb < 0)
> 651                         pg_fatal("could not read file \"%s\": %m", s->filename);
>
> It's dead right to complain of course.  (I kind of think that the
> majority of places where reconstruct.c is using "unsigned" variables
> are poorly-thought-through; many of them look like they should be
> size_t, and I suspect some other ones beside this one are flat wrong
> or at least unnecessarily fragile.  But I digress.)

Yeah.

> While looking around for other places that might've made comparable
> mistakes, I noted that we have places that are storing the result of
> pg_pread[v]/pg_pwrite[v] into an "int" variable even though they are
> passing a size_t count argument that there is no obvious reason to
> believe must fit in int.  This seems like trouble waiting to happen,
> so I fixed some of these in the attached.  The major remaining place
> that I think we ought to change is the newly-minted
> FileRead[V]/FileWrite[V] functions, which are declared to return int
> but really should be returning ssize_t IMO.  I didn't do that here
> though.

Agreed in theory.  Note that we've only been using size_t in fd.c
functions since:

commit 2d4f1ba6cfc2f0a977f1c30bda9848041343e248
Author: Peter Eisentraut <peter@eisentraut.org>
Date:   Thu Dec 8 08:51:38 2022 +0100

    Update types in File API

    Make the argument types of the File API match stdio better:

    - Change the data buffer to void *, from char *.
    - Change FileWrite() data buffer to const on top of that.
    - Change amounts to size_t, from int.

I guess it was an oversight not to change the return type to match at
the same time.  That said, I think that would only be for tidiness.

Some assorted observations:

1.  We don't yet require "large file" support, meaning that we use
off_t our fd.c and lseek()/p*() replacement functions, but we know it
is only 32 bits on Windows, and we avoid creating large files.  I
think that means that a hypothetical very large write would break that
assumption, creating data whose position cannot be named in those
calls.  We could fix that on Windows by adjusting our wrappers, either
to work with pgoff_t instead off off_t (yuck) or redefining off_t
(yuck), but given that both options are gross, so far we have imagined
that we should move towards using large files only conditionally, on
sizeof(off_t) >= 8.

2.  Windows' native read() and write() functions still have prototypes
like 1988 POSIX, eg int read(int filedes, void *buf, unsigned int
nbyte).  It was 2001 POSIX that changed them to ssize_t read(int
filedesc, void *buf, size_t nbytes).  I doubt there is much that can
be done about that, except perhaps adding a wrapper that caps it.
Seems like overkill for a hypothetical sort of a problem...

3.  Windows' native ReadFile() and WriteFile() functions, which our
'positionified' pg_p*() wrapper functions use, work in terms of DWORD
= unsigned long which is 32 bit on that cursed ABI.  Our wrappers
should probably cap.

I think a number of Unixoid systems implemented the POSIX interface
change by capping internally, which doesn't matter much in practice
because no one really tries to transfer gigabytes at once, and any
non-trivial transfer size probably requires handling short transfers.
For example, man read on Linux:

       On Linux, read() (and similar system calls) will transfer at most
       0x7ffff000 (2,147,479,552) bytes, returning the number of bytes
       actually transferred.  (This is true on both 32-bit and 64-bit
       systems.)

> We could go further by insisting that *all* uses of pg_pread/pg_pwrite
> use ssize_t result variables.  I think that's probably overkill --- in
> the example above, which is only asking to write BLCKSZ worth of data,
> surely an int is sufficient.  But you could argue that allowing this
> pattern at all creates risk of copy/paste errors.

Yeah.

> Of course the real elephant in the room is that plain old read(2)
> and write(2) also return ssize_t.  I've not attempted to vet every
> call of those, and I think it'd likely be a waste of effort, as
> we're unlikely to ever try to shove more than INT_MAX worth of
> data through them.  But it's a bit harder to make that argument
> for the iovec-based file APIs.  I think we ought to try to keep
> our uses of those functions clean on this point.

Yeah I think it's OK for a caller that knows it's passing in an int
value to (implicitly) cast the return to int.  But it'd be nice to
make our I/O functions look and feel like standard functions and
return ssize_t.



Re: pread, pwrite, etc return ssize_t not int

From
Thomas Munro
Date:
Patches attached.

PS Correction to my earlier statement about POSIX: the traditional K&R
interfaces were indeed in the original POSIX.1 1988 but it was the
1990 edition (approximately coinciding with standard C) that adopted
void, size_t, const and invented ssize_t.

Attachment

Re: pread, pwrite, etc return ssize_t not int

From
Peter Eisentraut
Date:
On 27.02.24 12:21, Thomas Munro wrote:
> Patches attached.
> 
> PS Correction to my earlier statement about POSIX: the traditional K&R
> interfaces were indeed in the original POSIX.1 1988 but it was the
> 1990 edition (approximately coinciding with standard C) that adopted
> void, size_t, const and invented ssize_t.

0001-Return-ssize_t-in-fd.c-I-O-functions.patch

This patch looks correct to me.

0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch

I have two comments on that:

For the overflow of the input length (size_t -> DWORD), I don't think we 
actually need to do anything.  The size argument would be truncated, but 
the callers would just repeat the calls with the remaining size, so in 
effect they will read the data in chunks of rest + N * DWORD_MAX.  The 
patch just changes this to chunks of N * 1GB + rest.

The other issue, the possible overflow of size_t -> ssize_t is not 
specific to Windows.  We could install some protection against that on 
some other layer, but it's unclear how widespread that issue is or what 
the appropriate fix is.  POSIX says that passing in a size larger than 
SSIZE_MAX has implementation-defined effect.  The FreeBSD man page says 
that this will result in an EINVAL error.  So if we here truncate 
instead of error, we'd introduce a divergence.




Re: pread, pwrite, etc return ssize_t not int

From
Thomas Munro
Date:
On Sat, Mar 2, 2024 at 3:12 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> 0001-Return-ssize_t-in-fd.c-I-O-functions.patch
>
> This patch looks correct to me.

Thanks, I'll push this one.

> 0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch
>
> I have two comments on that:
>
> For the overflow of the input length (size_t -> DWORD), I don't think we
> actually need to do anything.  The size argument would be truncated, but
> the callers would just repeat the calls with the remaining size, so in
> effect they will read the data in chunks of rest + N * DWORD_MAX.  The
> patch just changes this to chunks of N * 1GB + rest.

But implicit conversion size_t -> DWORD doesn't convert large numbers
to DWORD_MAX, it just cuts off the high bits, and that might leave you
with zero.  Zero has a special meaning (if we assume that kernel
doesn't reject a zero size argument outright, I dunno): if returned by
reads it indicates EOF, and if returned by writes a typical caller
would either loop forever making no progress or (in some of our code)
conjure up a fake ENOSPC.  Hence desire to impose a cap.

I'm on the fence about whether it's worth wasting any more energy on
this, I mean we aren't really going to read/write 4GB, so I'd be OK if
we just left this as an observation in the archives...

> The other issue, the possible overflow of size_t -> ssize_t is not
> specific to Windows.  We could install some protection against that on
> some other layer, but it's unclear how widespread that issue is or what
> the appropriate fix is.  POSIX says that passing in a size larger than
> SSIZE_MAX has implementation-defined effect.  The FreeBSD man page says
> that this will result in an EINVAL error.  So if we here truncate
> instead of error, we'd introduce a divergence.

Yeah, right, that's the caller's job to worry about on all platforms
so I was wrong to mention ssize_t in the comment.



Re: pread, pwrite, etc return ssize_t not int

From
Peter Eisentraut
Date:
On 01.03.24 22:23, Thomas Munro wrote:
>> For the overflow of the input length (size_t -> DWORD), I don't think we
>> actually need to do anything.  The size argument would be truncated, but
>> the callers would just repeat the calls with the remaining size, so in
>> effect they will read the data in chunks of rest + N * DWORD_MAX.  The
>> patch just changes this to chunks of N * 1GB + rest.
> 
> But implicit conversion size_t -> DWORD doesn't convert large numbers
> to DWORD_MAX, it just cuts off the high bits, and that might leave you
> with zero.  Zero has a special meaning (if we assume that kernel
> doesn't reject a zero size argument outright, I dunno): if returned by
> reads it indicates EOF, and if returned by writes a typical caller
> would either loop forever making no progress or (in some of our code)
> conjure up a fake ENOSPC.  Hence desire to impose a cap.

Right, my thinko.  Your patch is correct then.