Thread: [HACKERS] Error while copying a large file in pg_rewind
Hello all,

pg_rewind throws the following error when there is a large file in the slave server's data directory:

unexpected result while sending file list: ERROR: value "2148000000" is out of range for type integer
CONTEXT: COPY fetchchunks, line 2402, column begin: "2148000000"

How to reproduce
----------------------------
1. Set up replication between Server A (master) and Server B (slave).
2. Promote the slave server (Server B).
3. Stop the old master (Server A).
4. Create a large file in the newly promoted master's (Server B) data directory using the command below:

dd if=/dev/zero of=large.file bs=1024 count=4000000

[root@localhost data]# dd if=/dev/zero of=large.file bs=1024 count=4000000
4000000+0 records in
4000000+0 records out
4096000000 bytes (4.1 GB) copied, 8.32263 s, 492 MB/s

5. Execute the pg_rewind command from the old master (Server A):

./pg_rewind -D /home/enterprisedb/master/ --debug --progress --source-server="port=5661 user=enterprisedb dbname=edb"

IMHO, it seems to be a bug in pg_rewind. As mentioned in the pg_rewind documentation, there are a few files which are copied in whole:

"Copy all other files such as pg_xact and configuration files from the source cluster to the target cluster (everything except the relation files)."
-- https://www.postgresql.org/docs/devel/static/app-pgrewind.html

Those files are copied in chunks of at most CHUNKSIZE (default 1000000) bytes. In the process, pg_rewind creates a table with the following schema and loads information about the blocks that need to be copied:

CREATE TEMPORARY TABLE fetchchunks(path text, begin int4, len int4);

postgres=# select * from fetchchunks where begin != 0;
              path               |  begin  |   len
---------------------------------+---------+---------
 pg_wal/000000010000000000000002 | 1000000 | 1000000
 pg_wal/000000010000000000000002 | 2000000 | 1000000
 pg_wal/000000010000000000000002 | 3000000 | 1000000
 pg_wal/000000010000000000000002 | 4000000 | 1000000
...... and so on.

The range for begin is -2147483648 to +2147483647. For a 4GB file, begin definitely goes beyond 2147483647, and pg_rewind throws the following error:

unexpected result while sending file list: ERROR: value "2148000000" is out of range for type integer
CONTEXT: COPY fetchchunks, line 2659, column begin: "2148000000"

I guess we have to change the data type to bigint. Also, we need some implementation of ntohl() for 8-byte data types. I've attached a script to reproduce the error and a draft patch.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
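As an aside, a minimal standalone C sketch (using the 4 GB file size and the default CHUNKSIZE from the report above; this is not pg_rewind code) shows why the first failing begin offset is exactly 2148000000 when the column is an int4:

#include <stdio.h>
#include <stdint.h>

#define CHUNKSIZE 1000000

int
main(void)
{
	/* 4096000000 bytes, the size of the dd-created file in the report. */
	int64_t		filesize = INT64_C(4096000000);
	int64_t		begin;

	for (begin = 0; begin < filesize; begin += CHUNKSIZE)
	{
		/*
		 * With fetchchunks.begin declared as int4, any offset above
		 * INT32_MAX (2147483647) makes the COPY fail; the first such
		 * offset is 2148000000, matching the reported error.
		 */
		if (begin > INT32_MAX)
		{
			printf("first overflowing chunk offset: %lld\n",
				   (long long) begin);
			break;
		}
	}
	return 0;
}

Declaring begin as int8 (bigint) instead makes every offset up to the file size representable, which is the change proposed above.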
On Mon, Jul 3, 2017 at 8:22 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> pg_rewind throws the following error when there is a large file in the
> slave server's data directory.

Oops.

> I guess we have to change the data type to bigint. Also, we need some
> implementation of ntohl() for 8-byte data types. I've attached a
> script to reproduce the error and a draft patch.

pg_basebackup/ with fe_recvint64() has its own way of doing things, as does the large object code in libpq. I would think that at least on HEAD these could be gathered into a small set of routines; it is annoying to have a third copy of the same thing. OK, that's not invasive, but src/common/ would be a nice place to put it.

-   if (PQgetlength(res, 0, 1) != sizeof(int32))
+   if (PQgetlength(res, 0, 1) != sizeof(long long int))

This had better be int64.
--
Michael
Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:
> pg_rewind throws the following error when there is a large file in the
> slave server's data directory.

Hm. Before we add a bunch of code to deal with that, are we sure we *want* it to copy such files? Seems like that's expending a lot of data-transfer work for zero added value --- consider e.g. a server with a bunch of old core files lying about in $PGDATA. Given that it already excludes all database-data-containing files, maybe we should just set a cap on the plausible size of auxiliary files.

			regards, tom lane
On 7/3/17 09:53, Tom Lane wrote:
> Hm. Before we add a bunch of code to deal with that, are we sure we
> *want* it to copy such files? Seems like that's expending a lot of
> data-transfer work for zero added value --- consider e.g. a server
> with a bunch of old core files lying about in $PGDATA. Given that
> it already excludes all database-data-containing files, maybe we
> should just set a cap on the plausible size of auxiliary files.

It seems kind of lame to fail on large files these days, even if they are not often useful in this particular case.

Also, most of the segment and file sizes are configurable, and we have had reports of people venturing into much larger file sizes.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 7/3/17 09:53, Tom Lane wrote:
>> Hm. Before we add a bunch of code to deal with that, are we sure we
>> *want* it to copy such files? Seems like that's expending a lot of
>> data-transfer work for zero added value --- consider e.g. a server
>> with a bunch of old core files lying about in $PGDATA. Given that
>> it already excludes all database-data-containing files, maybe we
>> should just set a cap on the plausible size of auxiliary files.

> It seems kind of lame to fail on large files these days, even if they
> are not often useful in this particular case.

True. But copying useless data is also lame.

> Also, most of the segment and file sizes are configurable, and we have
> had reports of people venturing into much larger file sizes.

But if I understand the context correctly, we're not transferring relation data files this way anyway. If we do transfer WAL files this way, we could make sure to set the cutoff larger than the WAL segment size.

			regards, tom lane
On Tue, Jul 4, 2017 at 4:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> It seems kind of lame to fail on large files these days, even if they
>> are not often useful in this particular case.
>
> True. But copying useless data is also lame.

We don't want to complicate the pg_rewind code with filtering capabilities, so if the fix is simple I think that we should include it and be done. That will avoid future complications as well.

>> Also, most of the segment and file sizes are configurable, and we have
>> had reports of people venturing into much larger file sizes.
>
> But if I understand the context correctly, we're not transferring relation
> data files this way anyway. If we do transfer WAL files this way, we
> could make sure to set the cutoff larger than the WAL segment size.

WAL segments are not transferred. Only the WAL data of the target data folder is scanned to find all the blocks that have been touched since the last checkpoint before WAL forked.

Now, I think that this is broken for relation files larger than 2GB; see fetch_file_range, where the begin location is an int32.
--
Michael
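As an illustration of the int32 narrowing described above, here is a small standalone C sketch; request_range() and the 3000000000 offset are made up for demonstration, and this is not the actual pg_rewind code:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch only: show what happens when a >2GB file offset is narrowed to
 * the int32 "begin" that fetch_file_range() currently takes.
 */
static void
request_range(int32_t begin, int32_t len)
{
	printf("requesting begin=%d len=%d\n", (int) begin, (int) len);
}

int
main(void)
{
	int64_t		off = INT64_C(3000000000);	/* offset inside a >2GB segment */

	/*
	 * On common platforms the narrowing conversion (made explicit here with
	 * a cast) wraps 3000000000 to a negative value, so the wrong range
	 * would be requested.
	 */
	request_range((int32_t) off, 1000000);
	return 0;
}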
On Tue, Jul 4, 2017 at 4:12 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 4:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> True. But copying useless data is also lame.
>
> We don't want to complicate the pg_rewind code with filtering
> capabilities, so if the fix is simple I think that we should include
> it and be done. That will avoid future complications as well.

Yeah, I agree. In the above case, it's a core dump file, so copying it to the master seems to be of no use. But even if we add some filtering capabilities, it's difficult to decide which files to skip and which files to copy.

> WAL segments are not transferred. Only the WAL data of the target
> data folder is scanned to find all the blocks that have been touched
> since the last checkpoint before WAL forked.
>
> Now, I think that this is broken for relation files larger than 2GB;
> see fetch_file_range, where the begin location is an int32.

Okay. So, if a relation file in the source or target directory goes beyond 2GB, we have a problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 3, 2017 at 6:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> pg_basebackup/ with fe_recvint64() has its own way of doing things, as
> does the large object code in libpq. I would think that at least on
> HEAD these could be gathered into a small set of routines; it is
> annoying to have a third copy of the same thing. OK, that's not
> invasive, but src/common/ would be a nice place to put it.
>
> -   if (PQgetlength(res, 0, 1) != sizeof(int32))
> +   if (PQgetlength(res, 0, 1) != sizeof(long long int))
> This had better be int64.

Thank you. I'll make the changes and submit a revised patch. I've added a commitfest entry for it.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 4, 2017 at 3:25 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Thank you. I'll make the changes and submit a revised patch. I've
> added a commitfest entry for it.

Yeah... I was halfway into writing a patch myself. As it seems that you are planning to submit a new version, and because it would not be nice to step on your toes, I'll wait for your updated version.

I am also attaching an updated version of your script, which triggers the error if you use a larger size for --with-segsize. I have mainly played with a segment size of 16GB to check for all potential overflows and with more than 4GB of page offsets. I have spent a couple of hours looking at those overflow issues, and only fetch_file_range is at risk, as the data inserted into the page map itself is a block number, which is fine as a 32-bit integer.

Here are some notes on what the patch should do (a rough sketch of this direction follows below):

1) In fetch_file_range, use uint64 for begin and end. Using int4 for the length calculation is fine as this cannot go past CHUNKSIZE. I would love to be able to add a static assertion on frontend tools though. But this merits a comment.

2) Using int8 for the definition of fetchchunks would be more consistent with the existing code.

3) About the logs:

+ pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %lld, size %d\n", filename, chunkoff, chunksize);

This should not use %lld but UINT64_FORMAT.

4) long long int needs to be replaced by uint64.

5) Using ntohll as a name results in a compilation failure on macOS. Let's avoid any fancy refactoring for now, and just use a static routine with a name similar to what is in receivelog.c. You could as well just use the routine of that file.
--
Michael
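For illustration, a rough, self-contained sketch of the direction notes 1) and 3) point in. This is not the actual patch: the helper below just prints the COPY lines that the real code would send, and a plain %llu with a cast stands in for UINT64_FORMAT:

#include <stdio.h>
#include <stdint.h>

#define CHUNKSIZE 1000000

/*
 * Sketch of a fetch_file_range() with 64-bit begin/end offsets; the chunk
 * length stays a plain int since it never exceeds CHUNKSIZE.
 */
static void
fetch_file_range_sketch(const char *path, uint64_t begin, uint64_t end)
{
	while (begin < end)
	{
		int		len = (end - begin > CHUNKSIZE) ? CHUNKSIZE : (int) (end - begin);

		/* One COPY line per chunk for the fetchchunks table. */
		printf("%s\t%llu\t%d\n", path, (unsigned long long) begin, len);
		begin += (uint64_t) len;
	}
}

int
main(void)
{
	/* A 4 GB file now produces offsets past INT32_MAX without any error. */
	fetch_file_range_sketch("large.file", 0, UINT64_C(4096000000));
	return 0;
}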
On Tue, Jul 4, 2017 at 12:44 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Yeah... I was halfway into writing a patch myself. As it seems that
> you are planning to submit a new version, and because it would not be
> nice to step on your toes, I'll wait for your updated version.

I've not yet started on the patch, and it may take some time for me to understand the problem and write the patch correctly. Since you've almost written the patch, IMHO please go ahead and submit it. I'll happily review and test it. :-)

Thanks for the notes.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I've not yet started on the patch, and it may take some time for me to
> understand the problem and write the patch correctly. Since you've
> almost written the patch, IMHO please go ahead and submit it. I'll
> happily review and test it. :-)
>
> Thanks for the notes.

OK, thanks. Here you go.

Upon further testing, I have discovered as well that libpqProcessFileList suffers an overflow on the file size when it is larger than 2GB, which caused only a portion of the relation blocks to be copied when doing a transfer with libpq in my test case. That could be the origin of all kinds of corruption. In my test case, a sequential scan fetched inconsistent data: only a portion of the tuples were scannable on the rewound standby, because the original blocks remained in place and had an LSN set to a position *newer* than what the master was generating. When doing an offline rewind, things are a bit smarter and no such corruption is possible.

But that's just the tip of the iceberg for such issues. Big files copied in whole actually went through a double overflow, which made the transfer work correctly, but that was plain luck.

The patch attached passes my test case where blocks from a large relation file are copied, as well as when a large raw file is copied. This needs to be back-patched down to 9.5, where pg_rewind was introduced.
--
Michael
On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK, thanks. Here you go.

Thanks for the patch. It looks good and it solves the existing issues.

But I have a small concern/doubt regarding the following part of the code:

+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+pg_recvint64(int64 value)
+{
+	union
+	{
+		int64		i64;
+		uint32		i32[2];
+	}			swap;
+	int64		result;
+
+	swap.i64 = value;
+
+	result = (uint32) ntohl(swap.i32[0]);
+	result <<= 32;
+	result |= (uint32) ntohl(swap.i32[1]);
+
+	return result;
+}

Does this always work correctly irrespective of the endianness of the underlying system?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> But I have a small concern/doubt regarding the following part of the code:
>
> +static int64
> +pg_recvint64(int64 value)
> +{
> +	union
> +	{
> +		int64		i64;
> +		uint32		i32[2];
> +	}			swap;
> +	int64		result;
> +
> +	swap.i64 = value;
> +
> +	result = (uint32) ntohl(swap.i32[0]);
> +	result <<= 32;
> +	result |= (uint32) ntohl(swap.i32[1]);
> +
> +	return result;
> +}
>
> Does this always work correctly irrespective of the endianness of the
> underlying system?

I think this will have a problem; we may need to do something like:

union
{
	int64	i64;
	uint8	i8[8];
}			swap;

... and reverse the complete array if the byte order differs, or we could use something like "be64toh".

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Does this always work correctly irrespective of the endianness of the
>> underlying system?
>
> I think this will have a problem; we may need to do something like:
>
> union
> {
> 	int64	i64;
> 	uint8	i8[8];
> }			swap;
>
> ... and reverse the complete array if the byte order differs

This comes from the implementation of 64-bit large objects, which was first broken on big-endian systems until Tom fixed it with the following commit, and it looks fine to me:

commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Mon, 8 Oct 2012 18:24:32 -0400
Code review for 64-bit-large-object patch.

At some point it would really make sense to group all these things under the same banner (64-bit LO, pg_basebackup, and now pg_rewind).

> or we could use something like "be64toh"

That's even less portable. For example, macOS would need a macro to map to OSSwapBigToHostInt64 or OSSwapLittleToHostInt64 from OSByteOrder.h. Brr.
--
Michael
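For anyone wanting to convince themselves, here is a small standalone program (a sketch, not PostgreSQL code) that encodes a value in network byte order and feeds it through the same two-ntohl conversion as the quoted pg_recvint64(). Because ntohl() fixes up each 32-bit half and the halves are reassembled high word first, it prints the original value back on both little- and big-endian hosts:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static int64_t
recvint64_sketch(int64_t value)
{
	union
	{
		int64_t		i64;
		uint32_t	i32[2];
	}			swap;
	int64_t		result;

	swap.i64 = value;

	result = (uint32_t) ntohl(swap.i32[0]);
	result <<= 32;
	result |= (uint32_t) ntohl(swap.i32[1]);

	return result;
}

int
main(void)
{
	/* Simulate the wire: 2148000000 encoded in network (big-endian) order. */
	int64_t		host = INT64_C(2148000000);
	uint8_t		wire[8];
	int64_t		raw;
	int			i;

	for (i = 0; i < 8; i++)
		wire[i] = (uint8_t) (host >> (56 - 8 * i));

	memcpy(&raw, wire, sizeof(raw));
	printf("decoded: %lld\n", (long long) recvint64_sketch(raw));
	return 0;
}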
On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> This comes from the implementation of 64-bit large objects, which was
> first broken on big-endian systems until Tom fixed it with the
> following commit, and it looks fine to me:
> commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
> author: Tom Lane <tgl@sss.pgh.pa.us>
> date: Mon, 8 Oct 2012 18:24:32 -0400
> Code review for 64-bit-large-object patch.

Great. Besides, the logic of pg_recvint64 looks the same as fe_recvint64() defined in pg_basebackup. I don't have any more input on this patch and it looks good to me, so I'm moving the status to ready for committer.

> At some point it would really make sense to group all these things
> under the same banner (64-bit LO, pg_basebackup, and now pg_rewind).

+1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I don't have any more input on this patch and it looks good to me,
> so I'm moving the status to ready for committer.

Thanks!

>> At some point it would really make sense to group all these things
>> under the same banner (64-bit LO, pg_basebackup, and now pg_rewind).
>
> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.

So do I. That's a matter of taste, I guess.
--
Michael
On Fri, Jul 7, 2017 at 9:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> I don't have any more input on this patch and it looks good to me,
>> so I'm moving the status to ready for committer.
>
> Thanks!
>
>> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.
>
> So do I. That's a matter of taste, I guess.

Heikki, this bug is rather bad for anybody using pg_rewind with relation file sizes larger than 2GB, as it corrupts the data of such instances. I think that you would be the best fit as a committer to look at this patch, as you implemented the tool in the first place, and it would be a bad idea to let it sit for too long. Could you possibly spare a bit of your time at some point to look at it?
--
Michael
On Thu, Jul 20, 2017 at 2:17 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Heikki, this bug is rather bad for anybody using pg_rewind with
> relation file sizes larger than 2GB, as it corrupts the data of such
> instances. I think that you would be the best fit as a committer to
> look at this patch, as you implemented the tool in the first place,
> and it would be a bad idea to let it sit for too long. Could you
> possibly spare a bit of your time at some point to look at it?

The comment header of receiveFileChunks is updated like this:

 /*----
  * Runs a query, which returns pieces of files from the remote source data
  * directory, and overwrites the corresponding parts of target files with
  * the received parts. The result set is expected to be of format:
  *
  * path		text	-- path in the data directory, e.g "base/1/123"
- * begin	int4	-- offset within the file
+ * begin	int8	-- offset within the file
  * chunk	bytea	-- file content
  *----

...but down inside the function there's still this:

	if (PQftype(res, 0) != TEXTOID &&
		PQftype(res, 1) != INT4OID &&
		PQftype(res, 2) != BYTEAOID)
	{
		pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
				 PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
	}

I was initially surprised that your testing managed to pass, but then I noticed that this sanity test is using && where it should really be using ||; it will only fail if ALL of the data types are wrong. Oops.

This code plays pretty fast and loose with off_t vs. size_t vs. uint64 vs. int64, but those definitely don't all agree in signedness and may not even agree in width (size_t, at least, might be 4 bytes). We use stat() to get the size of a file as an off_t, and then store it into a uint64. Perhaps off_t is always uint64 anyway, but I'm not sure. Then, that value gets passed to fetch_file_range(), still as a uint64, and it then gets sent to the server to be stored into an int8 column, which is signed rather than unsigned. That will error out if there's a file size >= 2^63, but filesystems holding more than 8 exabytes are unusual and will probably remain so for some time. The server sends back an int64 which we pass to write_target_range(), whose argument is declared as off_t, which is where we started. I'm not sure there's any real problem with all of that, but it's a little nervous-making that we keep switching types. Similarly, libpqProcessFileList gets the file size as an int64 and then passes it to process_source_file(), which is expecting size_t. So we manage to use 4 different data types to represent a file size. On most systems, all of them except SQL's int8 are likely to be 64-bit unsigned integers, but I'm not sure what will happen on obscure platforms.

Still, it can't be worse than the status quo, where instead of int64 we're using int and int32, so maybe we ought to back-patch it as-is for now and look at any further cleanup that is needed as a master-only improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
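For reference, a sketch of what the corrected sanity check might look like once begin is fetched as int8 (INT8OID), with || so that any single unexpected column type is reported:

	if (PQftype(res, 0) != TEXTOID ||
		PQftype(res, 1) != INT8OID ||
		PQftype(res, 2) != BYTEAOID)
	{
		pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
				 PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
	}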
On Thu, Jul 20, 2017 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I was initially surprised that your testing managed to pass, but then
> I noticed that this sanity test is using && where it should really be
> using ||; it will only fail if ALL of the data types are wrong. Oops.

Oh, oh. If this had been right from the beginning, I would have hit this issue. Yes, that's a clear oversight of the original code.

> This code plays pretty fast and loose with off_t vs. size_t vs. uint64
> vs. int64, but those definitely don't all agree in signedness and may
> not even agree in width (size_t, at least, might be 4 bytes). [...]
> So we manage to use 4 different data types to represent a file size.
> On most systems, all of them except SQL's int8 are likely to be 64-bit
> unsigned integers, but I'm not sure what will happen on obscure
> platforms.

Yes, I felt uneasy as well when hacking on the patch with all the type switches involved, but I kept my focus on producing a minimally invasive patch that fixes the issue and any other holes I found. One thing that I think could be added to the patch is an overflow check in libpqGetFile(), because I think that we still want to use size_t for that code path. What do you think?

Note that I am not much worried about the signedness of the values yet (Postgres still lacks a proper in-core unsigned type, which is an argument for one), as long as the values are correctly stored on 8 bytes, and we still have some margin until we reach that. We could add a comment in the code to underline that assumption, though I am not sure that this is really necessary...

> Still, it can't be worse than the status quo, where instead of int64
> we're using int and int32, so maybe we ought to back-patch it as-is
> for now and look at any further cleanup that is needed as a
> master-only improvement.

Yes. I don't like playing much with variable types on back-branches; as long as the initial number of bytes is large enough, we will be safe for some time.
--
Michael
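A hedged sketch of the kind of libpqGetFile() guard suggested above; the filename/filesize variable names are assumptions for illustration rather than the actual code, while pg_fatal() and INT64_FORMAT are the existing frontend logging helper and format macro:

	/*
	 * Hypothetical guard (not the actual patch): refuse a remote file whose
	 * int64 size cannot be represented in the size_t used for the in-memory
	 * copy, which matters mainly where size_t is 32 bits wide.
	 */
	if (filesize < 0 || (uint64) filesize > (uint64) SIZE_MAX)
		pg_fatal("remote file \"%s\" is too large: " INT64_FORMAT " bytes\n",
				 filename, filesize);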
On Fri, Jul 21, 2017 at 8:22 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jul 20, 2017 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Still, it can't be worse than the status quo, where instead of int64
>> we're using int and int32, so maybe we ought to back-patch it as-is
>> for now and look at any further cleanup that is needed as a
>> master-only improvement.
>
> Yes. I don't like playing much with variable types on back-branches;
> as long as the initial number of bytes is large enough, we will be
> safe for some time.

Note for the archives: the main issue has been fixed as a46fe6e8, and the incorrect condition as 063ff921. Thanks, Robert!
--
Michael