Thread: [HACKERS] pg_resetwal is broken if run from v10 against older version of PGdata directory

Hi,

I have installed PG v9.6 / v9.5 , if we run pg_resetwal from v10 
binaries against data directory of v9.6/9.5 ,getting this error -

centos@centos-cpula bin]$ ./pg_resetwal -D /tmp/pg9.6/bin/data/
pg_resetwal: pg_control exists but is broken or unknown version; ignoring it
pg_resetwal: could not open directory "pg_wal": No such file or directory
[centos@centos-cpula bin]$

Steps to reproduce-
installed PG v9.6
installed PG v10
go to bin directory of v10 and run pg_resetwal , provide -D = data 
directory of v9.6.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




On Mon, May 29, 2017 at 1:48 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> Hi,
>
> I have installed PG v9.6 / v9.5 , if we run pg_resetwal from v10 binaries
> against data directory of v9.6/9.5 ,getting this error -
>
> centos@centos-cpula bin]$ ./pg_resetwal -D /tmp/pg9.6/bin/data/
> pg_resetwal: pg_control exists but is broken or unknown version; ignoring it
>

This appears to be an expected error as you are trying to use a
version of binaries that doesn't match the version stored in
pg_control.  What makes you think above is a valid usage and should
pass?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On 05/29/2017 03:10 PM, Amit Kapila wrote:
>   What makes you think above is a valid usage and should
> pass?
with earlier versions ,for instance - v.96 v/s v9.5 ,pg_resetwal was 
giving pg_control values .

Installed v9.6 and v9.5  and run pg_resetwal of v9.6 against data 
directory of v9.5.

[centos@centos-cpula ~]$ /tmp/pg9.6/bin/pg_resetxlog -D /tmp/pg9.5/bin/data/
pg_resetxlog: pg_control exists but is broken or unknown version; 
ignoring it
Guessed pg_control values:

pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6425491233437069295
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0


Values to be changed:

First log segment after reset:        000000010000000000000002

If these values seem acceptable, use -f to force reset.
[centos@centos-cpula ~]$

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




On Mon, May 29, 2017 at 3:20 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> On 05/29/2017 03:10 PM, Amit Kapila wrote:
>>
>>   What makes you think above is a valid usage and should
>> pass?
>
> with earlier versions ,for instance - v.96 v/s v9.5 ,pg_resetwal was giving
> pg_control values .
>
> Installed v9.6 and v9.5  and run pg_resetwal of v9.6 against data directory
> of v9.5.
>
> [centos@centos-cpula ~]$ /tmp/pg9.6/bin/pg_resetxlog -D /tmp/pg9.5/bin/data/
> pg_resetxlog: pg_control exists but is broken or unknown version; ignoring
> it
> Guessed pg_control values:
>

I think this happens due to commit
f82ec32ac30ae7e3ec7c84067192535b2ff8ec0e which renames pg_xlog to
pg_wal.  It does take care of making some of the modules like
pg_basebackup to understand both old and new directory structures, but
not done for all the modules.  I think we should make similar changes
in pg_resetwal or at the very least update the docs to indicate
pg_resetwal can give an error if used against pre-10 data directory.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



tushar <tushar.ahuja@enterprisedb.com> writes:
> I have installed PG v9.6 / v9.5 , if we run pg_resetwal from v10 
> binaries against data directory of v9.6/9.5 ,getting this error -

> centos@centos-cpula bin]$ ./pg_resetwal -D /tmp/pg9.6/bin/data/
> pg_resetwal: pg_control exists but is broken or unknown version; ignoring it
> pg_resetwal: could not open directory "pg_wal": No such file or directory

This is pretty much exactly what I'd expect.  Why would you think
that pg_resetwal should work with other major versions?

(Maybe we should have it look at PG_VERSION and refuse to do anything
if that doesn't have the expected contents.)
        regards, tom lane



On Mon, May 29, 2017 at 3:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think this happens due to commit
> f82ec32ac30ae7e3ec7c84067192535b2ff8ec0e which renames pg_xlog to
> pg_wal.  It does take care of making some of the modules like
> pg_basebackup to understand both old and new directory structures, but
> not done for all the modules.  I think we should make similar changes
> in pg_resetwal or at the very least update the docs to indicate
> pg_resetwal can give an error if used against pre-10 data directory.

Contrary to pg_basebackup which makes clear in its documentation that
it is compatible with past server versions down to 9.1, pg_resetwal
does not mention such guarantees. And actually, it seems to me that it
is rather unsafe to use it across major versions as the size of
ControlFileData varies across major versions so you can write junk
bytes in the control file by using pg_resetwal from v10 on a 9.6
server.
-- 
Michael



Amit Kapila <amit.kapila16@gmail.com> writes:
> I think this happens due to commit
> f82ec32ac30ae7e3ec7c84067192535b2ff8ec0e which renames pg_xlog to
> pg_wal.  It does take care of making some of the modules like
> pg_basebackup to understand both old and new directory structures, but
> not done for all the modules.

check

> I think we should make similar changes
> in pg_resetwal or at the very least update the docs to indicate
> pg_resetwal can give an error if used against pre-10 data directory.

I think it's just horribly dangerous to run any version of
pg_resetxlog/pg_resetwal against any major version other than its
own.  If people have been doing what the OP tried to do, it's only
been sheerest luck if it didn't destroy their database beyond recall.
The result will certainly fail to start up, because both pg_control
and WAL page header versions will not match the server version.  The
*best* case scenario is that redoing the reset with the correct version
of pg_resetxlog/pg_resetwal gets you out of that with no new damage
done.  The worst case is pretty awful --- for instance, a truly
bullheaded user might try to get out of it by starting the newer server
version in the old DB, likely causing irrecoverable catalog corruption.

So we need to prevent this, not try to make it work.  I don't think
we can insist on a version match in pg_control, because part of the
point of pg_resetxlog/pg_resetwal is to recover if pg_control is
unreadable.  But I think we could look at PG_VERSION, which is only a
text file.  In bad corruption scenarios, if that somehow got corrupted
(which seems unlikely since it's never written to post-initdb),
you could fill in the correct contents by hand and then
pg_resetxlog/pg_resetwal would run.
        regards, tom lane



On Mon, May 29, 2017 at 9:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So we need to prevent this, not try to make it work.  I don't think
> we can insist on a version match in pg_control, because part of the
> point of pg_resetxlog/pg_resetwal is to recover if pg_control is
> unreadable.  But I think we could look at PG_VERSION, which is only a
> text file.  In bad corruption scenarios, if that somehow got corrupted
> (which seems unlikely since it's never written to post-initdb),
> you could fill in the correct contents by hand and then
> pg_resetxlog/pg_resetwal would run.

Agreed. Shouldn't this be back-patched? PG_CONTROL_VERSION has not
been bumped between 9.4 and 9.5. Attached is a patch for HEAD.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, May 29, 2017 at 9:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So we need to prevent this, not try to make it work.  I don't think
>> we can insist on a version match in pg_control, because part of the
>> point of pg_resetxlog/pg_resetwal is to recover if pg_control is
>> unreadable.  But I think we could look at PG_VERSION, which is only a
>> text file.

> Agreed. Shouldn't this be back-patched? PG_CONTROL_VERSION has not
> been bumped between 9.4 and 9.5. Attached is a patch for HEAD.

Yeah, I'm thinking it would be a good idea to enforce this in all
branches.  Your patch looks sane in a quick once-over, but I didn't
test it.
        regards, tom lane



On Mon, May 29, 2017 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Mon, May 29, 2017 at 9:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So we need to prevent this, not try to make it work.  I don't think
>>> we can insist on a version match in pg_control, because part of the
>>> point of pg_resetxlog/pg_resetwal is to recover if pg_control is
>>> unreadable.  But I think we could look at PG_VERSION, which is only a
>>> text file.
>
>> Agreed. Shouldn't this be back-patched? PG_CONTROL_VERSION has not
>> been bumped between 9.4 and 9.5. Attached is a patch for HEAD.
>
> Yeah, I'm thinking it would be a good idea to enforce this in all
> branches.  Your patch looks sane in a quick once-over, but I didn't
> test it.

Thanks. I can provide patches for back-branches as needed.
-- 
Michael



Michael Paquier <michael.paquier@gmail.com> writes:
> Agreed. Shouldn't this be back-patched? PG_CONTROL_VERSION has not
> been bumped between 9.4 and 9.5. Attached is a patch for HEAD.

Pushed with minor adjustments.  Notably, I didn't take your addition
of canonicalize_path() and referencing the file by absolute path.
That seems like an independent and rather debatable behavioral change.
All the other files that pg_resetwal touches are referenced with
relative paths; if we did want to change the reporting, shouldn't we
change it for all of them?
        regards, tom lane



On Mon, May 29, 2017 at 9:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> I think this happens due to commit
>> f82ec32ac30ae7e3ec7c84067192535b2ff8ec0e which renames pg_xlog to
>> pg_wal.  It does take care of making some of the modules like
>> pg_basebackup to understand both old and new directory structures, but
>> not done for all the modules.
>
> check
>
>> I think we should make similar changes
>> in pg_resetwal or at the very least update the docs to indicate
>> pg_resetwal can give an error if used against pre-10 data directory.
>
> I think it's just horribly dangerous to run any version of
> pg_resetxlog/pg_resetwal against any major version other than its
> own.  If people have been doing what the OP tried to do, it's only
> been sheerest luck if it didn't destroy their database beyond recall.
> The result will certainly fail to start up, because both pg_control
> and WAL page header versions will not match the server version.  The
> *best* case scenario is that redoing the reset with the correct version
> of pg_resetxlog/pg_resetwal gets you out of that with no new damage
> done.  The worst case is pretty awful --- for instance, a truly
> bullheaded user might try to get out of it by starting the newer server
> version in the old DB, likely causing irrecoverable catalog corruption.
>
> So we need to prevent this, not try to make it work.
>

Yeah, I was wrong in suggesting to try to make it work.  An explicit
error is a better way to deal with this.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On 30 May 2017 at 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I think it's just horribly dangerous to run any version of
> pg_resetxlog/pg_resetwal

You can pretty much stop there ;)

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



On 29 May 2017 at 17:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> I think this happens due to commit
>> f82ec32ac30ae7e3ec7c84067192535b2ff8ec0e which renames pg_xlog to
>> pg_wal.  It does take care of making some of the modules like
>> pg_basebackup to understand both old and new directory structures, but
>> not done for all the modules.
>
> check
>
>> I think we should make similar changes
>> in pg_resetwal or at the very least update the docs to indicate
>> pg_resetwal can give an error if used against pre-10 data directory.
>
> I think it's just horribly dangerous to run any version of
> pg_resetxlog/pg_resetwal against any major version other than its
> own.

Just check the name of the directory so that pg_resetwal will refuse
to run against pg_xlog directory

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, May 30, 2017 at 4:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Just check the name of the directory so that pg_resetwal will refuse
> to run against pg_xlog directory

That's a strictly weaker integrity check than what Tom already
committed.  That only distinguishes pre-10 from post-10, whereas Tom
linked it to the specific major version, which is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company