Re: recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers

From Chris Travers
Subject Re: recovering from "found xmin ... from before relfrozenxid ..."
Date
Msg-id CAN-RpxD8k_T4B8tgCWuu=0GnUEnN6NaepiUx8WMUPwUgHEgeuQ@mail.gmail.com
Whole thread Raw
In response to Re: recovering from "found xmin ... from before relfrozenxid ..."  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: recovering from "found xmin ... from before relfrozenxid ..."
List pgsql-hackers


On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.

Also even if you do trace the problem you still have to recover.

And sometimes I have found latent corruption from times when dbs were running on older versions and older servers, making debugging largely a futile exercise.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.

> So I have these questions:
>
> - Do people think it would me smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

+1

Would be worth extending it with some functions to grab rows that have various TOAST oids too.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.

I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.

Agreed.


--
Peter Geoghegan




--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

pgsql-hackers by date:

Previous
From: Ashutosh Sharma
Date:
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Next
From: Greg Nancarrow
Date:
Subject: Re: Libpq support to connect to standby server as priority