Thread: [HACKERS] Error while copying a large file in pg_rewind

[HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
Hello all,

pg_rewind throws the following error when there is a large file in
the slave server's data directory.

unexpected result while sending file list: ERROR:  value "2148000000"
is out of range for type integer
CONTEXT:  COPY fetchchunks, line 2402, column begin: "2148000000"

How to reproduce
----------------------------
1. Set up replication between Server A(master) and Server B(slave)

2. Promote the slave server(Server B )

3. Stop the old master (Server A)

4. Create a large file in the newly promoted master's (Server B) data
directory using the below command

     dd if=/dev/zero of=large.file bs=1024 count=4000000

     [root@localhost data]# dd if=/dev/zero of=large.file bs=1024 count=4000000
4000000+0 records in
4000000+0 records out
4096000000 bytes (4.1 GB) copied, 8.32263 s, 492 MB/s

5. Execute pg_rewind command from old master(server A)

     ./pg_rewind -D /home/enterprisedb/master/ --debug --progress
--source-server="port=5661 user=enterprisedb dbname=edb"

IMHO, it seems to be a bug in pg_rewind.

As mentioned in the pg_rewind documentation, there are a few files
which are copied in whole.

"Copy all other files such as pg_xact and configuration files from the
source cluster to the target cluster (everything except the relation
files)." -- https://www.postgresql.org/docs/devel/static/app-pgrewind.html

Those files are copied at most CHUNKSIZE (default 1000000) bytes at a
time. In the process, pg_rewind creates a table with the following
schema and loads information about the blocks that need to be copied.

CREATE TEMPORARY TABLE fetchchunks(path text, begin int4, len int4);

postgres=# select * from fetchchunks where begin != 0;
                  path                   |  begin   |   len
-----------------------------------------+----------+---------
 pg_wal/000000010000000000000002         |  1000000 | 1000000
 pg_wal/000000010000000000000002         |  2000000 | 1000000
 pg_wal/000000010000000000000002         |  3000000 | 1000000
 pg_wal/000000010000000000000002         |  4000000 | 1000000
......
and so on.

The range of begin is -2147483648 to +2147483647. For a 4GB file,
begin definitely goes beyond 2147483647, and it throws the following
error:
unexpected result while sending file list: ERROR:  value "2148000000"
is out of range for type integer
CONTEXT:  COPY fetchchunks, line 2659, column begin: "2148000000"

I guess we have to change the data type to bigint. Also, we need some
implementation of ntohl() for 8-byte data types. I've attached a
script to reproduce the error and a draft patch.
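
For illustration, a minimal sketch of the widened definition (assuming
the offset column simply becomes int8; the committed change may differ):

CREATE TEMPORARY TABLE fetchchunks(path text, begin int8, len int4);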

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Mon, Jul 3, 2017 at 8:22 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> pg_rewind throws the following error when there is a file of large
> size available in the Slave server's data directory.

Oops.

> I guess we've to change the data type to bigint. Also, we need some
> implementation of ntohl() for 8-byte data types. I've attached a
> script to reproduce the error and a draft patch.

pg_basebackup/ with fe_recvint64() has its own way to do things, as
does the large-object code in libpq. I would think that, at least on
HEAD, things could be gathered into a small set of routines. It is
annoying to have a third copy of the same thing. OK, that's not
invasive, but src/common/ would be a nice place to put things.

-        if (PQgetlength(res, 0, 1) != sizeof(int32))
+        if (PQgetlength(res, 0, 1) != sizeof(long long int))
This had better be int64.
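
A minimal sketch of how the corrected check could read, assuming the
fixed-width int64 typedef (the error message here is illustrative, not
the exact one in the patch):

        if (PQgetlength(res, 0, 1) != sizeof(int64))
            pg_fatal("unexpected result while fetching remote file size\n");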
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Tom Lane
Date:
Kuntal Ghosh <kuntalghosh.2007@gmail.com> writes:
> pg_rewind throws the following error when there is a file of large
> size available in the Slave server's data directory.

Hm.  Before we add a bunch of code to deal with that, are we sure we
*want* it to copy such files?  Seems like that's expending a lot of
data-transfer work for zero added value --- consider e.g. a server
with a bunch of old core files laying about in $PGDATA.  Given that
it's already excluded all database-data-containing files, maybe we
should just set a cap on the plausible size of auxiliary files.
        regards, tom lane



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Peter Eisentraut
Date:
On 7/3/17 09:53, Tom Lane wrote:
> Hm.  Before we add a bunch of code to deal with that, are we sure we
> *want* it to copy such files?  Seems like that's expending a lot of
> data-transfer work for zero added value --- consider e.g. a server
> with a bunch of old core files laying about in $PGDATA.  Given that
> it's already excluded all database-data-containing files, maybe we
> should just set a cap on the plausible size of auxiliary files.

It seems kind of lame to fail on large files these days, even if they
are not often useful in the particular case.

Also, most of the segment and file sizes are configurable, and we have
had reports of people venturing into much larger file sizes.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 7/3/17 09:53, Tom Lane wrote:
>> Hm.  Before we add a bunch of code to deal with that, are we sure we
>> *want* it to copy such files?  Seems like that's expending a lot of
>> data-transfer work for zero added value --- consider e.g. a server
>> with a bunch of old core files laying about in $PGDATA.  Given that
>> it's already excluded all database-data-containing files, maybe we
>> should just set a cap on the plausible size of auxiliary files.

> It seems kind of lame to fail on large files these days, even if they
> are not often useful in the particular case.

True.  But copying useless data is also lame.

> Also, most of the segment and file sizes are configurable, and we have
> had reports of people venturing into much larger file sizes.

But if I understand the context correctly, we're not transferring relation
data files this way anyway.  If we do transfer WAL files this way, we
could make sure to set the cutoff larger than the WAL segment size.
        regards, tom lane



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Tue, Jul 4, 2017 at 4:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 7/3/17 09:53, Tom Lane wrote:
>>> Hm.  Before we add a bunch of code to deal with that, are we sure we
>>> *want* it to copy such files?  Seems like that's expending a lot of
>>> data-transfer work for zero added value --- consider e.g. a server
>>> with a bunch of old core files laying about in $PGDATA.  Given that
>>> it's already excluded all database-data-containing files, maybe we
>>> should just set a cap on the plausible size of auxiliary files.
>
>> It seems kind of lame to fail on large files these days, even if they
>> are not often useful in the particular case.
>
> True.  But copying useless data is also lame.

We don't want to complicate pg_rewind code with filtering
capabilities, so if the fix is simple I think that we should include
it and be done. That will avoid future complications as well.

>> Also, most of the segment and file sizes are configurable, and we have
>> had reports of people venturing into much larger file sizes.
>
> But if I understand the context correctly, we're not transferring relation
> data files this way anyway.  If we do transfer WAL files this way, we
> could make sure to set the cutoff larger than the WAL segment size.

WAL segments are not transferred. Only the WAL data of the target
data folder is scanned to find all the blocks that have been touched
since the last checkpoint before WAL forked.

Now, I think that this is broken for relation files larger than 2GB;
see fetch_file_range, where the begin location is an int32.
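
To illustrate the overflow (made-up variable names, not pg_rewind's
actual code): with a 32-bit offset, any block past the 2GB boundary is
requested at the wrong position.

        uint32  blkno   = 300000;                  /* a block beyond 2GB with 8kB pages */
        int32   begin32 = blkno * 8192;            /* overflows: 300000 * 8192 > 2^31 - 1 */
        int64   begin64 = (int64) blkno * 8192;    /* widen first, then multiply */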
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
On Tue, Jul 4, 2017 at 4:12 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 4:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> On 7/3/17 09:53, Tom Lane wrote:
>>>> Hm.  Before we add a bunch of code to deal with that, are we sure we
>>>> *want* it to copy such files?  Seems like that's expending a lot of
>>>> data-transfer work for zero added value --- consider e.g. a server
>>>> with a bunch of old core files laying about in $PGDATA.  Given that
>>>> it's already excluded all database-data-containing files, maybe we
>>>> should just set a cap on the plausible size of auxiliary files.
>>
>>> It seems kind of lame to fail on large files these days, even if they
>>> are not often useful in the particular case.
>>
>> True.  But copying useless data is also lame.
>
> We don't want to complicate pg_rewind code with filtering
> capabilities, so if the fix is simple I think that we should include
> it and be done. That will avoid future complications as well.
>
Yeah, I agree. In the above case, it's a core dump file, so copying
it to the master seems to be of no use. But even if we add some
filtering capabilities, it's difficult to decide which files to skip
and which files to copy.

>>> Also, most of the segment and file sizes are configurable, and we have
>>> had reports of people venturing into much larger file sizes.
>>
>> But if I understand the context correctly, we're not transferring relation
>> data files this way anyway.  If we do transfer WAL files this way, we
>> could make sure to set the cutoff larger than the WAL segment size.
>
> WAL segments are not transferred. Only the WAL data of the the target
> data folder is gone through to find all the blocks that have been
> touched from the last checkpoint before WAL forked.
>
> Now, I think that this is broken for relation files higher than 2GB,
> see fetch_file_range where the begin location is an int32.
> --
Okay. So, if the relation file size differs by 2GB or more between
the source and target directory, we have a problem.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
On Mon, Jul 3, 2017 at 6:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> pg_basebackup/ with fe_recvint64() has its own way to do things, as
> does the large object things in libpq. I would think that at least on
> HEAD things could be gathered with a small set of routines. It is
> annoying to have a third copy of the same thing. OK that's not
> invasive but src/common/ would be a nice place to put things.
>
> -        if (PQgetlength(res, 0, 1) != sizeof(int32))
> +        if (PQgetlength(res, 0, 1) != sizeof(long long int))
> This had better be int64.

Thank you. I'll make the changes and submit a revised patch. I've
added a commitfest entry for it.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Tue, Jul 4, 2017 at 3:25 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Mon, Jul 3, 2017 at 6:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> pg_basebackup/ with fe_recvint64() has its own way to do things, as
>> does the large object things in libpq. I would think that at least on
>> HEAD things could be gathered with a small set of routines. It is
>> annoying to have a third copy of the same thing. OK that's not
>> invasive but src/common/ would be a nice place to put things.
>>
>> -        if (PQgetlength(res, 0, 1) != sizeof(int32))
>> +        if (PQgetlength(res, 0, 1) != sizeof(long long int))
>> This had better be int64.
>
> Thank you. I'll do the changes and submit the revised patch. I've
> added an entry in commitfest for the same.

Yeah... I was halfway through writing a patch myself. As it seems that
you are planning to submit a new version, and because it wouldn't be
nice to step on your toes, I'll wait for your updated version.

I am also attaching an updated version of your script, which triggers
the error if you use a larger size for --with-segsize. I have mainly
played with a segment size of 16GB to check for all potential
overflows and with more than 4GB of page offsets. I have spent a
couple of hours looking at those overflow issues, and only
fetch_file_range is at risk, as the data inserted into the page map
itself is a block number, which is fine as a 32-bit integer.

Here are some notes on what the patch should do.
1) In fetch_file_range, use uint64 for begin and end. Using int4 for
the length calculation is fine, as this cannot go past CHUNKSIZE. I
would love to be able to add a static assertion in frontend tools,
though. But this merits a comment.
2) Using int8 for the definition of fetchchunks would be more
consistent with the existing code.
3) About the logs:
+       pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %lld, size %d\n",
               filename, chunkoff, chunksize);
This should not use %lld but UINT64_FORMAT (a sketch follows after
these notes).
4) long long int needs to be replaced by uint64.
5) Using ntohll as the name results in a compilation failure on macOS.
Let's avoid any fancy refactoring for now, and just use a static
routine with a name similar to what is in receivelog.c. You could as
well just reuse the routine from that file.
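
As a sketch of note 3 (assuming chunkoff becomes a uint64; this is not
necessarily the exact committed line):

        pg_log(PG_DEBUG, "received chunk for file \"%s\", offset " UINT64_FORMAT ", size %d\n",
               filename, chunkoff, chunksize);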
-- 
Michael


Attachment

Re: [HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
On Tue, Jul 4, 2017 at 12:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 3:25 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> On Mon, Jul 3, 2017 at 6:50 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> pg_basebackup/ with fe_recvint64() has its own way to do things, as
>>> does the large object things in libpq. I would think that at least on
>>> HEAD things could be gathered with a small set of routines. It is
>>> annoying to have a third copy of the same thing. OK that's not
>>> invasive but src/common/ would be a nice place to put things.
>>>
>>> -        if (PQgetlength(res, 0, 1) != sizeof(int32))
>>> +        if (PQgetlength(res, 0, 1) != sizeof(long long int))
>>> This had better be int64.
>>
>> Thank you. I'll do the changes and submit the revised patch. I've
>> added an entry in commitfest for the same.
>
> Yeah... I was halfway into writing a patch myself. As it seems that
> you are planning to submit a new version, and because it won't be nice
> to step on your toes, I'll wait for your updated version.
>
I've not yet started on the patch, and it may take some time for me
to understand and write it correctly. Since you've almost written the
patch, IMHO, please go ahead and submit it. I'll happily review and
test it. :-)

Thanks for the notes.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I've not yet started the patch and it may take some time for me to
> understand and write
> the patch in a correct way. Since, you've almost written the patch,
> IMHO, please go ahead
> and submit the patch. I'll happily review and test it. :-)
>
> Thanks for the notes.

OK, thanks. Here you go.

Upon further testing, I have also discovered that
libpqProcessFileList suffers from an overflow on the file size when it
is larger than 2GB, which caused only a portion of the relation blocks
to be copied when doing a transfer with libpq in my test case. That
could be the origin of all kinds of corruption: with my test case, a
sequential scan on the rewound standby returned inconsistent data,
with only a portion of the tuples visible, because the original blocks
remained in place and had an LSN *newer* than what the master was
generating. When doing an offline rewind, things are a bit smarter and
no such corruption is possible. But that's just the tip of the iceberg
for such issues. Big files copied in whole were actually able to
overflow twice, which made the transfer work correctly, but that was
plain luck.
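
As an illustration of the failure mode (a made-up snippet, not
pg_rewind's code): a size above 2GB reported as text gets mangled once
it is parsed into a 32-bit integer, but survives in a 64-bit one.

        const char *reported = "4096000000";                 /* size reported by the source */
        int         size32   = atoi(reported);               /* exceeds INT_MAX: result is unreliable */
        int64       size64   = strtoll(reported, NULL, 10);  /* keeps the full value */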

The attached patch passes my test cases, where blocks from a large
relation file are copied as well as a large raw file. This needs to be
back-patched down to 9.5, where pg_rewind was introduced.
-- 
Michael


Attachment

Re: [HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> I've not yet started the patch and it may take some time for me to
>> understand and write
>> the patch in a correct way. Since, you've almost written the patch,
>> IMHO, please go ahead
>> and submit the patch. I'll happily review and test it. :-)
>>
>> Thanks for the notes.
>
> OK, thanks. Here you go.
>
Thanks for the patch. It looks good and it solves the existing issues.

But I have a small concern/doubt regarding the following part of the code.
+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+pg_recvint64(int64 value)
+{
+   union
+   {
+       int64   i64;
+       uint32  i32[2];
+   } swap;
+   int64   result;
+
+   swap.i64 = value;
+
+   result = (uint32) ntohl(swap.i32[0]);
+   result <<= 32;
+   result |= (uint32) ntohl(swap.i32[1]);
+
+   return result;
+}
Does this always work correctly irrespective of the endianness of the
underlying system?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Dilip Kumar
Date:
On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>> I've not yet started the patch and it may take some time for me to
>>> understand and write
>>> the patch in a correct way. Since, you've almost written the patch,
>>> IMHO, please go ahead
>>> and submit the patch. I'll happily review and test it. :-)
>>>
>>> Thanks for the notes.
>>
>> OK, thanks. Here you go.
>>
> Thanks for the patch. It looks good and it solves the existing issues.
>
> But, I'm little concerned/doubt regarding the following part of the code.
> +/*
> + * Converts an int64 from network byte order to native format.
> + */
> +static int64
> +pg_recvint64(int64 value)
> +{
> +   union
> +   {
> +       int64   i64;
> +       uint32  i32[2];
> +   } swap;
> +   int64   result;
> +
> +   swap.i64 = value;
> +
> +   result = (uint32) ntohl(swap.i32[0]);
> +   result <<= 32;
> +   result |= (uint32) ntohl(swap.i32[1]);
> +
> +   return result;
> +}
> Does this always work correctly irrespective of the endianess of the
> underlying system?

I think this will have a problem; we may need to do something like:

union { int64 i64; uint8 i8[8]; } swap;
....

and reverse the complete array if the byte order differs,

or we can use something like "be64toh".
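
A rough, endian-independent sketch in the spirit of the byte-array
idea (not the exact union-and-reverse form above, and not the committed
patch; it assumes PostgreSQL's int64/uint64 typedefs):

static int64
recv_int64_bytewise(int64 value)
{
    unsigned char   b[8];
    uint64          result = 0;
    int             i;

    memcpy(b, &value, 8);               /* bytes are still in network (big-endian) order */
    for (i = 0; i < 8; i++)
        result = (result << 8) | b[i];  /* accumulate most-significant byte first */

    return (int64) result;
}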

>
>
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>
>



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> But, I'm little concerned/doubt regarding the following part of the code.
>> +/*
>> + * Converts an int64 from network byte order to native format.
>> + */
>> +static int64
>> +pg_recvint64(int64 value)
>> +{
>> +   union
>> +   {
>> +       int64   i64;
>> +       uint32  i32[2];
>> +   } swap;
>> +   int64   result;
>> +
>> +   swap.i64 = value;
>> +
>> +   result = (uint32) ntohl(swap.i32[0]);
>> +   result <<= 32;
>> +   result |= (uint32) ntohl(swap.i32[1]);
>> +
>> +   return result;
>> +}
>> Does this always work correctly irrespective of the endianess of the
>> underlying system?
>
> I think this will have problem, we may need to do like
>
> and reverse complete array if byte order is changed

This comes from the implementation of 64-bit large objects, which was
at first broken on big-endian systems until Tom fixed it with the
following commit, and this looks fine to me:
commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Mon, 8 Oct 2012 18:24:32 -0400
Code review for 64-bit-large-object patch.

At some point it would really make sense to group all things under the
same banner (64-b LO, pg_basebackup, and now pg_rewind).

> or we can use something like "be64toh"

That's even less portable. For example, macOS would need a macro to map
to OSSwapBigToHostInt64 or OSSwapLittleToHostInt64 from OSByteOrder.h.
Brr.
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Kuntal Ghosh
Date:
On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>> But, I'm little concerned/doubt regarding the following part of the code.
>>> +/*
>>> + * Converts an int64 from network byte order to native format.
>>> + */
>>> +static int64
>>> +pg_recvint64(int64 value)
>>> +{
>>> +   union
>>> +   {
>>> +       int64   i64;
>>> +       uint32  i32[2];
>>> +   } swap;
>>> +   int64   result;
>>> +
>>> +   swap.i64 = value;
>>> +
>>> +   result = (uint32) ntohl(swap.i32[0]);
>>> +   result <<= 32;
>>> +   result |= (uint32) ntohl(swap.i32[1]);
>>> +
>>> +   return result;
>>> +}
>>> Does this always work correctly irrespective of the endianess of the
>>> underlying system?
>>
>> I think this will have problem, we may need to do like
>>
>> and reverse complete array if byte order is changed
>
> This comes from the the implementation of 64b-large objects, which was
> first broken on big-endian systems, until Tom fixed it with the
> following commit, and this looks fine to me:
> commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
> author: Tom Lane <tgl@sss.pgh.pa.us>
> date: Mon, 8 Oct 2012 18:24:32 -0400
> Code review for 64-bit-large-object patch.
>
Great. Besides, the logic of pg_recvint64 looks the same as
fe_recvint64() defined in pg_basebackup.

I don't have any more input on this patch, and it looks good to me.
So, I'm moving the status to Ready for Committer.

> At some point it would really make sense to group all things under the
> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>
+1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> I don't have any more inputs on this patch and it looks good to me.
> So, I'm moving the status to ready for committer.

Thanks!

>> At some point it would really make sense to group all things under the
>> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>>
> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.

So do I. That's a matter of taste I guess.
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Fri, Jul 7, 2017 at 9:33 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> I don't have any more inputs on this patch and it looks good to me.
>> So, I'm moving the status to ready for committer.
>
> Thanks!
>
>>> At some point it would really make sense to group all things under the
>>> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>>>
>> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.
>
> So do I. That's a matter of taste I guess.

Heikki, this bug is rather bad for anybody using pg_rewind with
relation file sizes larger than 2GB, as it corrupts the data of
rewound instances. I think that you would be the best fit as a
committer to look at this patch, as you implemented the tool in the
first place, and it would be a bad idea to let this sit for too long.
Would it be possible for you to spare a bit of your time at some point
to look at it?
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Robert Haas
Date:
On Thu, Jul 20, 2017 at 2:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Heikki, this bug is rather bad for anybody using pg_rewind with
> relation file sizes larger than 2GB as this corrupts data of
> instances. I think that you would be the best fit as a committer to
> look at this patch as you implemented the tool first, and it would be
> a bad idea to let that sit for a too long time. Could it be possible
> to spare a bit of your time at some point to look at it?

The comment header of receiveFileChunks is updated like this:
/*----
 * Runs a query, which returns pieces of files from the remote source data
 * directory, and overwrites the corresponding parts of target files with
 * the received parts. The result set is expected to be of format:
 *
 * path        text    -- path in the data directory, e.g "base/1/123"
- * begin       int4    -- offset within the file
+ * begin       int8    -- offset within the file
 * chunk       bytea   -- file content
 *----

...but down inside the function there's still this:

        if (PQftype(res, 0) != TEXTOID &&
            PQftype(res, 1) != INT4OID &&
            PQftype(res, 2) != BYTEAOID)
        {
            pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
                     PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
        }

I was initially surprised that your testing managed to pass, but then
I noticed that this sanity test is using && where it should really be
using ||; it will only fail if ALL of the data types are wrong.  Oops.
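
A sketch of the corrected check (with the offset column type also
bumped to int8, hence INT8OID; not necessarily the exact committed
form):

        if (PQftype(res, 0) != TEXTOID ||
            PQftype(res, 1) != INT8OID ||
            PQftype(res, 2) != BYTEAOID)
        {
            pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
                     PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
        }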

This code plays pretty fast and loose with off_t vs. size_t vs. uint64
vs. int64, but those definitely don't all agree in signedness and may
not even agree in width (size_t, at least, might be 4 bytes).  We use
stat() to get the size of a file as an off_t, and then store it into a
uint64. Perhaps off_t is always uint64 anyway, but I'm not sure.
Then, that value gets passed to fetch_file_range(), still as a uint64,
and it then gets sent to the server to be stored into an int8 column,
which is signed rather than unsigned.  That will error out if there's
a file size >= 2^63, but filesystems holding more than 8 exabytes are
unusual and will probably remain so for some time.  The server sends
back an int64 which we pass to write_target_range(), whose argument is
declared as off_t, which is where we started.  I'm not sure there's
any real problem with all of that, but it's a little nervous-making
that we keep switching types.  Similarly, libpqProcessFileList gets
the file size as an int64 and then passes it to process_source_file()
which is expecting size_t.  So we manage to use 4 different data types
to represent a file size.  On most systems, all of them except SQL's
int8 are likely to be 64-bit unsigned integers, but I'm not sure what
will happen on obscure platforms.

Still, it can't be worse than the status quo, where instead of int64
we're using int and int32, so maybe we ought to back-patch it as-is
for now and look at any further cleanup that is needed as a
master-only improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Thu, Jul 20, 2017 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I was initially surprised that your testing managed to pass, but then
> I noticed that this sanity test is using && where it should really be
> using ||; it will only fail if ALL of the data types are wrong.  Oops.

Oh, oh. If this had been right from the beginning, I would have hit
this issue. Yes, that's a clear oversight in the original code.

> This code plays pretty fast and loose with off_t vs. size_t vs. uint64
> vs. int64, but those definitely don't all agree in signedness and may
> not even agree in width (size_t, at least, might be 4 bytes).  We use
> stat() to get the size of a file as an off_t, and then store it into a
> uint64. Perhaps off_t is always uint64 anyway, but I'm not sure.
> Then, that value gets passed to fetch_file_range(), still as a uint64,
> and it then gets sent to the server to be stored into an int8 column,
> which is signed rather than unsigned.  That will error out if there's
> a file size >= 2^63, but filesystem holding more than 8 exabytes are
> unusual and will probably remain so for some time.  The server sends
> back an int64 which we pass to write_target_range(), whose argument is
> declared as off_t, which is where we started.  I'm not sure there's
> any real problem with all of that, but it's a little nervous-making
> that we keep switching types.  Similarly, libpqProcessFileList gets
> the file size as an int64 and then passes it to process_source_file()
> which is expecting size_t.  So we manage to use 4 different data types
> to represent a file size.  On most systems, all of them except SQL's
> int8 are likely to be 64-bit unsigned integers, but I'm not sure what
> will happen on obscure platforms.

Yes, I felt uneasy as well when hacking on the patch with all the type
switches involved, but I kept my focus on bringing out a minimally
invasive patch to fix the issue and any other holes I found. One thing
that I think could be added to the patch is an overflow check in
libpqGetFile(), because I think that we still want to use size_t for
this code path. What do you think?

Note that I am not worrying much about the signedness of the values
yet (Postgres still lacks a proper in-core unsigned type, which gives
an argument for one); as long as the values are correctly stored in 8
bytes, we still have some margin before we reach that limit. We could
add a comment in the code to underline that assumption, though I am
not sure that this is really necessary...

> Still, it can't be worse than the status quo, where instead of int64
> we're using int and int32, so maybe we ought to back-patch it as-is
> for now and look at any further cleanup that is needed as a
> master-only improvement.

Yes. I don't like playing much with variable types on back-branches;
as long as the initial width in bytes is large enough, we will be safe
for some time.
-- 
Michael



Re: [HACKERS] Error while copying a large file in pg_rewind

From
Michael Paquier
Date:
On Fri, Jul 21, 2017 at 8:22 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jul 20, 2017 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Still, it can't be worse than the status quo, where instead of int64
>> we're using int and int32, so maybe we ought to back-patch it as-is
>> for now and look at any further cleanup that is needed as a
>> master-only improvement.
>
> Yes. I don't like playing much with the variable types on
> back-branches, as long as the initial amount of bytes is large enough
> we will be safe for some time.

Note for the archives: the main issue has been fixed as a46fe6e8, and
the incorrect condition as 063ff921. Thanks Robert!
-- 
Michael