Thread: BUG #13876: pg_xlogdump give an error on timeline switch

BUG #13876: pg_xlogdump give an error on timeline switch

From
postgresql.org@gclough.com
Date:
The following bug has been logged on the website:

Bug reference:      13876
Logged by:          Greg Clough
Email address:      postgresql.org@gclough.com
PostgreSQL version: 9.3.10
Operating system:   CentOS v7.2
Description:

When pg_xlogdump encounters a timeline switch in WAL, it marks it as an
"error in WAL record".  I'm just guessing that there is an end-of-timeline
marker in the WAL that pg_xlogdump hasn't been programmed to recognize, so
it marks it as an error.

Here is a dump of WAL file that's had a timeline switch as the first thing
after forcing an pg_swtich_xlog():

-bash-4.2$ pwd
/opt/postgres/9.3/db2/pg_xlog

-bash-4.2$ ls -al
total 114704
drwx------.  3 postgres postgres     4096 Jan 19 09:15 .
drwx------. 16 postgres postgres     4096 Jan 19 09:33 ..
-rw-------.  1 postgres postgres 16777216 Jan 19 08:50
000000010000000000000002
-rw-------.  1 postgres postgres 16777216 Jan 19 08:53
000000010000000000000003
-rw-------.  1 postgres postgres 16777216 Jan 19 09:07
000000020000000000000003
-rw-------.  1 postgres postgres 16777216 Jan 19 09:07
000000020000000000000004
-rw-------.  1 postgres postgres 16777216 Jan 19 09:10
000000020000000000000005
-rw-------.  1 postgres postgres 16777216 Jan 19 09:15
000000020000000000000006
-rw-------.  1 postgres postgres       41 Jan 19 08:55 00000002.history
-rw-------.  1 postgres postgres 16777216 Jan 19 09:33
000000030000000000000006
-rw-------.  1 postgres postgres       83 Jan 19 09:15 00000003.history
drwx------.  2 postgres postgres       78 Jan 19 09:15 archive_status

-bash-4.2$ pg_xlogdump 000000020000000000000006
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn:
0/06000028, prev 0/056370D0, bkp: 0000, desc: checkpoint: redo 0/6000028;
tli 2; prev tli 2; fpw true; xid 0/1894; oid 24576; multi 1; offset 0;
oldest xid 1879 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
shutdown
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn:
0/06000090, prev 0/06000028, bkp: 0000, desc: checkpoint: redo 0/6000090;
tli 2; prev tli 2; fpw true; xid 0/1894; oid 24576; multi 1; offset 0;
oldest xid 1879 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
shutdown
pg_xlogdump: FATAL:  error in WAL record at 0/6000090: record with zero
length at 0/60000F8

That final message was mis-interpreted as corruption in the WAL, but without
knowing the internals of either the WAL or pg_xlogdump I can't say for sure
that it's not corruption.  It's repeatable, so I doubt it's a problem in
PostgreSQL core.

Could someone confirm my understanding, and if so, then could we please get
pg_xlogdump updated to recognize an end-of-timeline?

Re: BUG #13876: pg_xlogdump give an error on timeline switch

From
Andres Freund
Date:
On 2016-01-19 15:34:48 +0000, postgresql.org@gclough.com wrote:
> That final message was mis-interpreted as corruption in the WAL, but without
> knowing the internals of either the WAL or pg_xlogdump I can't say for sure
> that it's not corruption.  It's repeatable, so I doubt it's a problem in
> PostgreSQL core.
>
> Could someone confirm my understanding, and if so, then could we please get
> pg_xlogdump updated to recognize an end-of-timeline?

Yes, that's pretty much correct. The "old" timeline doesn't even really
know about it, so it's not generally clear how we go forward. There
could be WAL from several parallel timelines present.

You can manually choose the timeline you want to be displayed. Either by
specifying -t or by specifying files from the other timeline.

I don't think we really need to update pg_xlogdump to do any of this
automatically - for which use case is that required?

Regards,

Andres

Re: BUG #13876: pg_xlogdump give an error on timeline switch

From
Michael Paquier
Date:
On Wed, Jan 20, 2016 at 12:44 AM, Andres Freund <andres@anarazel.de> wrote:
> You can manually choose the timeline you want to be displayed. Either by
> specifying -t or by specifying files from the other timeline.
>
> I don't think we really need to update pg_xlogdump to do any of this
> automatically - for which use case is that required?

Yes, I am not sure it is worth adding a dependency with the history
file just to debug the WAL contents.
--
Michael

Re: BUG #13876: pg_xlogdump give an error on timeline switch

From
Greg Clough
Date:
>
> (please don't top post on postgres lists)
>

Top posting... sorry, I blame GMail defaults... and mobile phones. :-/


If pg_xlogdump were a tool aimed at normal endusers, maybe. But it's
> clearly not, so ...
>
> What are you using pg_xlogdump for?
>

The issue came about as one of our other DBA's did a role swap
Primary/Standby, but didn't have the recovery_target_timeline = 'latest'
set.  He couldn't figure out why the old Primary (now a Standby) wouldn't
follow the new Primary onto the new timeline, and deduced it was a corrupt
WAL due to the output of pg_xlogdump.

I can understand his confusion, as the error is quite alarming with that
big "FATAL:"... but ultimately, yes, it was our own mistake that led us
down this avenue of investigation


The timeline switch will often happen on a different machine, so you
> can't just do that.
>

It's not a huge issue that I doubt we'll ever have again, but if it was an
oversight and easy to fix then I thought it would be worthwhile doing, so
as to stop others having the same problem.

Cheers.
Greg