Thread: pread, pwrite, etc return ssize_t not int
Coverity whinged this morning about the following bit in the new pg_combinebackup code: 644 unsigned rb; 645 646 /* Read the block from the correct source, except if dry-run. */ 647 rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]); 648 if (rb != BLCKSZ) 649 { >>> CID 1559912: Control flow issues (NO_EFFECT) >>> This less-than-zero comparison of an unsigned value is never true. "rb < 0U". 650 if (rb < 0) 651 pg_fatal("could not read file \"%s\": %m", s->filename); It's dead right to complain of course. (I kind of think that the majority of places where reconstruct.c is using "unsigned" variables are poorly-thought-through; many of them look like they should be size_t, and I suspect some other ones beside this one are flat wrong or at least unnecessarily fragile. But I digress.) While looking around for other places that might've made comparable mistakes, I noted that we have places that are storing the result of pg_pread[v]/pg_pwrite[v] into an "int" variable even though they are passing a size_t count argument that there is no obvious reason to believe must fit in int. This seems like trouble waiting to happen, so I fixed some of these in the attached. The major remaining place that I think we ought to change is the newly-minted FileRead[V]/FileWrite[V] functions, which are declared to return int but really should be returning ssize_t IMO. I didn't do that here though. We could go further by insisting that *all* uses of pg_pread/pg_pwrite use ssize_t result variables. I think that's probably overkill --- in the example above, which is only asking to write BLCKSZ worth of data, surely an int is sufficient. But you could argue that allowing this pattern at all creates risk of copy/paste errors. Of course the real elephant in the room is that plain old read(2) and write(2) also return ssize_t. I've not attempted to vet every call of those, and I think it'd likely be a waste of effort, as we're unlikely to ever try to shove more than INT_MAX worth of data through them. But it's a bit harder to make that argument for the iovec-based file APIs. I think we ought to try to keep our uses of those functions clean on this point. Thoughts? regards, tom lane diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1e9019156a..1264849883 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2280,7 +2280,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) char *from; Size nbytes; Size nleft; - int written; + ssize_t written; instr_time start; /* OK to write the page(s) */ diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c index 5ee9628422..84246739ae 100644 --- a/src/backend/backup/basebackup.c +++ b/src/backend/backup/basebackup.c @@ -117,8 +117,8 @@ static void perform_base_backup(basebackup_options *opt, bbsink *sink, IncrementalBackupInfo *ib); static void parse_basebackup_options(List *options, basebackup_options *opt); static int compareWalFileNames(const ListCell *a, const ListCell *b); -static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, - const char *filename, bool partial_read_ok); +static ssize_t basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, + const char *filename, bool partial_read_ok); /* Was the backup currently in-progress initiated in recovery mode? */ static bool backup_started_in_recovery = false; @@ -525,7 +525,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink, { char *walFileName = (char *) lfirst(lc); int fd; - size_t cnt; + ssize_t cnt; pgoff_t len = 0; snprintf(pathbuf, MAXPGPATH, XLOGDIR "/%s", walFileName); @@ -2079,11 +2079,11 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf) * * Returns the number of bytes read. */ -static int +static ssize_t basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, const char *filename, bool partial_read_ok) { - int rc; + ssize_t rc; pgstat_report_wait_start(WAIT_EVENT_BASEBACKUP_READ); rc = pg_pread(fd, buf, nbytes, offset); @@ -2096,7 +2096,7 @@ basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, if (!partial_read_ok && rc > 0 && rc != nbytes) ereport(ERROR, (errcode_for_file_access(), - errmsg("could not read file \"%s\": read %d of %zu", + errmsg("could not read file \"%s\": read %zd of %zu", filename, rc, nbytes))); return rc; diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c index 6decdd8934..874e6cd150 100644 --- a/src/bin/pg_combinebackup/reconstruct.c +++ b/src/bin/pg_combinebackup/reconstruct.c @@ -504,7 +504,7 @@ make_rfile(char *filename, bool missing_ok) static void read_bytes(rfile *rf, void *buffer, unsigned length) { - unsigned rb = read(rf->fd, buffer, length); + int rb = read(rf->fd, buffer, length); if (rb != length) { @@ -512,7 +512,7 @@ read_bytes(rfile *rf, void *buffer, unsigned length) pg_fatal("could not read file \"%s\": %m", rf->filename); else pg_fatal("could not read file \"%s\": read only %d of %d bytes", - rf->filename, (int) rb, length); + rf->filename, rb, length); } } @@ -614,7 +614,7 @@ write_reconstructed_file(char *input_filename, { uint8 buffer[BLCKSZ]; rfile *s = sourcemap[i]; - unsigned wb; + int wb; /* Update accounting information. */ if (s == NULL) @@ -641,7 +641,7 @@ write_reconstructed_file(char *input_filename, } else { - unsigned rb; + int rb; /* Read the block from the correct source, except if dry-run. */ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]); @@ -650,9 +650,9 @@ write_reconstructed_file(char *input_filename, if (rb < 0) pg_fatal("could not read file \"%s\": %m", s->filename); else - pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u", - s->filename, (int) rb, BLCKSZ, - (unsigned) offsetmap[i]); + pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %llu", + s->filename, rb, BLCKSZ, + (unsigned long long) offsetmap[i]); } } @@ -663,7 +663,7 @@ write_reconstructed_file(char *input_filename, pg_fatal("could not write file \"%s\": %m", output_filename); else pg_fatal("could not write file \"%s\": wrote only %d of %d bytes", - output_filename, (int) wb, BLCKSZ); + output_filename, wb, BLCKSZ); } /* Update the checksum computation. */
On Mon, Dec 25, 2023 at 7:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Coverity whinged this morning about the following bit in > the new pg_combinebackup code: > > 644 unsigned rb; > 645 > 646 /* Read the block from the correct source, except if dry-run. */ > 647 rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]); > 648 if (rb != BLCKSZ) > 649 { > >>> CID 1559912: Control flow issues (NO_EFFECT) > >>> This less-than-zero comparison of an unsigned value is never true. "rb < 0U". > 650 if (rb < 0) > 651 pg_fatal("could not read file \"%s\": %m", s->filename); > > It's dead right to complain of course. (I kind of think that the > majority of places where reconstruct.c is using "unsigned" variables > are poorly-thought-through; many of them look like they should be > size_t, and I suspect some other ones beside this one are flat wrong > or at least unnecessarily fragile. But I digress.) Yeah. > While looking around for other places that might've made comparable > mistakes, I noted that we have places that are storing the result of > pg_pread[v]/pg_pwrite[v] into an "int" variable even though they are > passing a size_t count argument that there is no obvious reason to > believe must fit in int. This seems like trouble waiting to happen, > so I fixed some of these in the attached. The major remaining place > that I think we ought to change is the newly-minted > FileRead[V]/FileWrite[V] functions, which are declared to return int > but really should be returning ssize_t IMO. I didn't do that here > though. Agreed in theory. Note that we've only been using size_t in fd.c functions since: commit 2d4f1ba6cfc2f0a977f1c30bda9848041343e248 Author: Peter Eisentraut <peter@eisentraut.org> Date: Thu Dec 8 08:51:38 2022 +0100 Update types in File API Make the argument types of the File API match stdio better: - Change the data buffer to void *, from char *. - Change FileWrite() data buffer to const on top of that. - Change amounts to size_t, from int. I guess it was an oversight not to change the return type to match at the same time. That said, I think that would only be for tidiness. Some assorted observations: 1. We don't yet require "large file" support, meaning that we use off_t our fd.c and lseek()/p*() replacement functions, but we know it is only 32 bits on Windows, and we avoid creating large files. I think that means that a hypothetical very large write would break that assumption, creating data whose position cannot be named in those calls. We could fix that on Windows by adjusting our wrappers, either to work with pgoff_t instead off off_t (yuck) or redefining off_t (yuck), but given that both options are gross, so far we have imagined that we should move towards using large files only conditionally, on sizeof(off_t) >= 8. 2. Windows' native read() and write() functions still have prototypes like 1988 POSIX, eg int read(int filedes, void *buf, unsigned int nbyte). It was 2001 POSIX that changed them to ssize_t read(int filedesc, void *buf, size_t nbytes). I doubt there is much that can be done about that, except perhaps adding a wrapper that caps it. Seems like overkill for a hypothetical sort of a problem... 3. Windows' native ReadFile() and WriteFile() functions, which our 'positionified' pg_p*() wrapper functions use, work in terms of DWORD = unsigned long which is 32 bit on that cursed ABI. Our wrappers should probably cap. I think a number of Unixoid systems implemented the POSIX interface change by capping internally, which doesn't matter much in practice because no one really tries to transfer gigabytes at once, and any non-trivial transfer size probably requires handling short transfers. For example, man read on Linux: On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.) > We could go further by insisting that *all* uses of pg_pread/pg_pwrite > use ssize_t result variables. I think that's probably overkill --- in > the example above, which is only asking to write BLCKSZ worth of data, > surely an int is sufficient. But you could argue that allowing this > pattern at all creates risk of copy/paste errors. Yeah. > Of course the real elephant in the room is that plain old read(2) > and write(2) also return ssize_t. I've not attempted to vet every > call of those, and I think it'd likely be a waste of effort, as > we're unlikely to ever try to shove more than INT_MAX worth of > data through them. But it's a bit harder to make that argument > for the iovec-based file APIs. I think we ought to try to keep > our uses of those functions clean on this point. Yeah I think it's OK for a caller that knows it's passing in an int value to (implicitly) cast the return to int. But it'd be nice to make our I/O functions look and feel like standard functions and return ssize_t.
Patches attached. PS Correction to my earlier statement about POSIX: the traditional K&R interfaces were indeed in the original POSIX.1 1988 but it was the 1990 edition (approximately coinciding with standard C) that adopted void, size_t, const and invented ssize_t.
Attachment
On 27.02.24 12:21, Thomas Munro wrote: > Patches attached. > > PS Correction to my earlier statement about POSIX: the traditional K&R > interfaces were indeed in the original POSIX.1 1988 but it was the > 1990 edition (approximately coinciding with standard C) that adopted > void, size_t, const and invented ssize_t. 0001-Return-ssize_t-in-fd.c-I-O-functions.patch This patch looks correct to me. 0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch I have two comments on that: For the overflow of the input length (size_t -> DWORD), I don't think we actually need to do anything. The size argument would be truncated, but the callers would just repeat the calls with the remaining size, so in effect they will read the data in chunks of rest + N * DWORD_MAX. The patch just changes this to chunks of N * 1GB + rest. The other issue, the possible overflow of size_t -> ssize_t is not specific to Windows. We could install some protection against that on some other layer, but it's unclear how widespread that issue is or what the appropriate fix is. POSIX says that passing in a size larger than SSIZE_MAX has implementation-defined effect. The FreeBSD man page says that this will result in an EINVAL error. So if we here truncate instead of error, we'd introduce a divergence.
On Sat, Mar 2, 2024 at 3:12 AM Peter Eisentraut <peter@eisentraut.org> wrote: > 0001-Return-ssize_t-in-fd.c-I-O-functions.patch > > This patch looks correct to me. Thanks, I'll push this one. > 0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch > > I have two comments on that: > > For the overflow of the input length (size_t -> DWORD), I don't think we > actually need to do anything. The size argument would be truncated, but > the callers would just repeat the calls with the remaining size, so in > effect they will read the data in chunks of rest + N * DWORD_MAX. The > patch just changes this to chunks of N * 1GB + rest. But implicit conversion size_t -> DWORD doesn't convert large numbers to DWORD_MAX, it just cuts off the high bits, and that might leave you with zero. Zero has a special meaning (if we assume that kernel doesn't reject a zero size argument outright, I dunno): if returned by reads it indicates EOF, and if returned by writes a typical caller would either loop forever making no progress or (in some of our code) conjure up a fake ENOSPC. Hence desire to impose a cap. I'm on the fence about whether it's worth wasting any more energy on this, I mean we aren't really going to read/write 4GB, so I'd be OK if we just left this as an observation in the archives... > The other issue, the possible overflow of size_t -> ssize_t is not > specific to Windows. We could install some protection against that on > some other layer, but it's unclear how widespread that issue is or what > the appropriate fix is. POSIX says that passing in a size larger than > SSIZE_MAX has implementation-defined effect. The FreeBSD man page says > that this will result in an EINVAL error. So if we here truncate > instead of error, we'd introduce a divergence. Yeah, right, that's the caller's job to worry about on all platforms so I was wrong to mention ssize_t in the comment.
On 01.03.24 22:23, Thomas Munro wrote: >> For the overflow of the input length (size_t -> DWORD), I don't think we >> actually need to do anything. The size argument would be truncated, but >> the callers would just repeat the calls with the remaining size, so in >> effect they will read the data in chunks of rest + N * DWORD_MAX. The >> patch just changes this to chunks of N * 1GB + rest. > > But implicit conversion size_t -> DWORD doesn't convert large numbers > to DWORD_MAX, it just cuts off the high bits, and that might leave you > with zero. Zero has a special meaning (if we assume that kernel > doesn't reject a zero size argument outright, I dunno): if returned by > reads it indicates EOF, and if returned by writes a typical caller > would either loop forever making no progress or (in some of our code) > conjure up a fake ENOSPC. Hence desire to impose a cap. Right, my thinko. Your patch is correct then.