Re: 9.3.9 and pg_multixact corruption - Mailing list pgsql-hackers

From	Christoph Berg
Subject	Re: 9.3.9 and pg_multixact corruption
Date	September 11, 2015 12:25:46
Msg-id	20150911122538.GA2672@msg.df7cb.de
In response to	9.3.9 and pg_multixact corruption (Bernd Helmle <bernd@oopsware.de>)
Responses	Re: 9.3.9 and pg_multixact corruption
List	pgsql-hackers

Re: Bernd Helmle 2015-09-10 <7E3C7F8D210AC9A423E96F3A@eje.local>
> 2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
> "pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
> 2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
> offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 
> 2015-09-08 11:40:59 CEST [27045] LOG:  startup process (PID 27047) exited
> with exit code 1
> 2015-09-08 11:40:59 CEST [27045] LOG:  aborting startup due to startup
> process failure
> 
> Some side notes:
> 
> An additional recovery from a base backup and archive recovery yield to the
> same error, as soon as the affected tuple was touched with a DELETE. The
> affected table was fully dumpable via pg_dump, though.
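
The numbers in the quoted log can be cross-checked with a little arithmetic. The sketch below is an illustration, not code from PostgreSQL: it assumes 1636 multixact members per 8192-byte page and 32 pages per SLRU segment (which I believe matches a default 9.3 build), and it assumes the member offset ends up being treated as a signed 32-bit integer with C-style truncating division. All helper names are made up for this example. Under those assumptions, offset 2147483648 maps to exactly the segment name and seek position shown in the error.

```python
# Assumed constants for a default 9.3 build (not taken from the message itself).
MEMBERS_PER_PAGE = 1636      # multixact members per 8192-byte page
PAGES_PER_SEGMENT = 32       # SLRU pages per segment file
BLCKSZ = 8192                # page size in bytes


def to_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def c_div(a, b):
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def c_mod(a, b):
    """Remainder matching c_div, so the sign follows the dividend."""
    return a - b * c_div(a, b)


def locate_signed(member_offset):
    """Hypothetical helper: segment name and seek offset if the member
    offset is handled as a signed 32-bit integer."""
    signed = to_int32(member_offset)
    page = c_div(signed, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    page_in_segment = c_mod(page, PAGES_PER_SEGMENT)
    seek = page_in_segment * BLCKSZ
    # Show both values the way they would appear as unsigned 32-bit numbers.
    return "%04X" % (segment & 0xFFFFFFFF), seek & 0xFFFFFFFF


print(locate_signed(2147483648))  # ('FFFF5FC4', 4294950912)
```

That the result matches the log suggests, but does not prove, a signed overflow once the member offset reaches 2^31. With purely unsigned arithmetic the same offset would fall into segment A03C instead.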

A few more words here: the archive recovery was a PITR to 00:45, so
well before the problem. The cluster initially worked well, but
crashed shortly afterwards with the same mxid 1068235595 message. The
crash was triggered by a DELETE on a different table (which was
related schema-wise, but IIRC neither of these tables has any FKs).

We then rewound the system to a ZFS snapshot taken when the archive
recovery had finished (db shut down cleanly), and brought it up again,
whereupon it crashed again with mxid 1068235595, this time on a third
table.

The original crash and the first post-recovery crash happened a few
minutes after pg_start_backup(), though the next crash was without
that.


(While the archive recovery was running, I had run pg_resetxlog on the
original cluster. It was possible to isolate the ctid of an affected
tuple, but it wasn't possible to DELETE it: that yielded an error
message similar to the above, though the database kept running. I then
zeroed the bad block using dd (zero_damaged_pages didn't help), only to
find that at least one more tuple in that table was affected (with a
different mxid).)

Christoph
