Thread: recovering from "found xmin ... from before relfrozenxid ..."

recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
Hi,

A number of EDB customers have had this error crop up on their tables for
reasons that we have usually not been able to determine. In many
cases, it's probably down to things like running buggy old releases
for a long time before upgrading, or bad backup and recovery
procedures. It's more than possible that there are still-unfixed
server bugs, but I do not have any compelling evidence of such bugs at
this time. Unfortunately, once you're in this situation, it's kind of
hard to find your way out of it. There are a few problems:

1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.

2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.

3. We've had some success with using a PL/pgSQL loop with an
EXCEPTION block to extract all the accessible tuples from the table.
Then you can truncate the original table and reinsert the data. But
this is slow, so it stinks if the table is big, and it's not a viable
approach if the table in question is a system catalog table -- at
least not if it's something critical like pg_class.
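
The salvage loop mentioned above can be sketched roughly as follows. This is
only an illustration, assuming an 8 kB block size (at most 291 line pointers
per heap page) and hypothetical table names broken_table / rescue_table; the
rescue table must have been created beforehand, e.g. with CREATE TABLE
rescue_table (LIKE broken_table):

```sql
-- Sketch only: copy every readable tuple, block by block, skipping
-- blocks where reading raises an error.  A real script would retry
-- tuple-by-tuple within a failed block so good tuples there aren't lost.
DO $$
DECLARE
    blk bigint;
    nblocks bigint := pg_relation_size('broken_table') /
                      current_setting('block_size')::bigint;
BEGIN
    FOR blk IN 0 .. nblocks - 1 LOOP
        BEGIN
            -- Fetch every possible line pointer on this block via a TID scan.
            INSERT INTO rescue_table
                SELECT t.* FROM broken_table t
                 WHERE t.ctid = ANY (ARRAY(
                       SELECT format('(%s,%s)', blk, lp)::tid
                         FROM generate_series(1, 291) AS lp));
        EXCEPTION WHEN OTHERS THEN
            RAISE WARNING 'skipping block %: %', blk, SQLERRM;
        END;
    END LOOP;
END;
$$;
```

This is slow precisely because every block is probed through single-tuple TID
fetches inside its own subtransaction, which matches the complaint above.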

I realize somebody's probably going to say "well, you shouldn't try to
repair a database that's in this state, you shouldn't let it happen in
the first place, and if it does happen, you should track the root
cause to the ends of the earth." But I think that's a completely
impractical approach. I at least have no idea how I'm supposed to
figure out when and how a bad relfrozenxid ended up in the table, and
by the time the problem is discovered after an upgrade, whatever
caused it may be quite old. Moreover, not everyone is as
interested in an extended debugging exercise as they are in getting
the system working again, and VACUUM failing repeatedly is a pretty
serious problem.

Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs. The former truncates them all to dead line
pointers. The latter resets the infomask and xmin to make the xmin
frozen. (It should probably handle the xmax too; not sure that the
current version does that, but it's easily fixed if not.) The
intention is that you can use these to either get rid of, or get
access to, tuples whose visibility information is corrupted for
whatever reason. These are pretty sharp tools; you could corrupt a
perfectly-good table by incautious use of them, or destroy a large
amount of data. You could, for example, force-freeze a tuple created
by a transaction which added a column, inserted data, and rolled back;
that would likely be disastrous. However, in the cases that I'm
thinking about, disaster has already struck, and something that you
can use to get things back to a saner state is better than just
leaving the table perpetually broken. Without something like this, the
backup plan is probably to shut down the server and try to edit the
pages using a perl script or something, but that seems clearly worse.
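
For illustration, usage might look something like this; the exact function
signatures are hypothetical, since the patch hasn't been posted yet, and the
table name and TIDs are made up:

```sql
-- The TIDs would come from the VACUUM error message (block number)
-- plus inspection with something like pg_visibility.
SELECT heap_force_kill('broken_table'::regclass,
                       ARRAY['(1234,5)', '(1234,6)']::tid[]);

-- Or, to make the tuples readable again instead of discarding them:
SELECT heap_force_freeze('broken_table'::regclass,
                         ARRAY['(1234,5)', '(1234,6)']::tid[]);
```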

So I have these questions:

- Do people think it would be smart/good/useful to include something
like this in PostgreSQL?

- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.

- Any ideas for additional things we should include, or improvements
on the sketch above?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?

Absolutely, yes.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

While I agree that this would be a good and useful new contrib module to
have, I don't think it would be appropriate to back-patch it into PG
formally.

Unfortunately, that gets into the discussion that's cropped up on a few
other threads of late- that we don't have a good place to put extensions
which are well maintained/recommended by core PG hackers, and which are
able to work with lots of different versions of PG, and are versioned
and released independently of PG (and, ideally, built for all the
versions of PG that we distribute through our packages).

Given the lack of such a place today, I'd at least suggest starting with
proposing it as a new contrib module for v14.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Not right off-hand, but I will think about it; there could certainly be
a lot of very interesting tools in such a toolbox.

Thanks!

Stephen


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> - If so, how? I would propose a new contrib module that we back-patch
>> all the way, because the VACUUM errors were back-patched all the way,
>> and there seems to be no advantage in making people wait 5 years for a
>> new version that has some kind of tooling in this area.

> While I agree that this would be a good and useful new contrib module to
> have, I don't think it would be appropriate to back-patch it into PG
> formally.

Yeah, I don't care for that either.  That's a pretty huge violation of our
normal back-patching rules, and I'm not convinced that it's justified.

No objection to adding it as a new contrib module.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Peter Geoghegan
Date:
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.

> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.

I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.
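
At the SQL level that guard might be sketched like this (the sentinel file
name is made up, and the real check would presumably live in the C code of
the module itself):

```sql
-- Refuse to proceed unless the administrator has deliberately created
-- a marker file in the data directory.  pg_stat_file() raises an error
-- when the file does not exist.
DO $$
BEGIN
    PERFORM pg_stat_file('heap_surgery_sentinel');
EXCEPTION WHEN OTHERS THEN
    RAISE EXCEPTION
        'refusing to run: create heap_surgery_sentinel in the data directory first';
END;
$$;
```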


--
Peter Geoghegan



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block number, right? We probably should extend
it to also include the offset...


> 2. In some other, similar situations, e.g. where the tuple data is
> garbled, it's often possible to get out from under the problem by
> deleting the tuple at issue. But I think that doesn't necessarily fix
> anything in this case.

Huh, why not? That worked in the cases I saw.


> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs. The former truncates them all to dead line
> pointers. The latter resets the infomask and xmin to make the xmin
> frozen. (It should probably handle the xmax too; not sure that the
> current version does that, but it's easily fixed if not.)

xmax is among the problematic cases IIRC, so yes, it'd be good to fix
that.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I don't care for that either.  That's a pretty huge violation of our
> normal back-patching rules, and I'm not convinced that it's justified.

I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today. I wouldn't
propose to back-patch something on those grounds just as a way of
delivering a new feature more quickly, but that's not the intention
here. At least in my experience, un-VACUUM-able tables have gotten
several orders of magnitude more common since Andres put those changes
in. As far as I can recall, EDB has not had this many instances of
different customers reporting the same problem since the 9.3-era
multixact issues. So far, this does not rise to that level, but it is
by no means a negligible issue, either. I believe it deserves to be
taken quite seriously, especially because the existing options for
helping customers with this kind of problem are so limited.

Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block number, right? We probably should extend
> it to also include the offset...

Oh, I hadn't realized that limitation. That would be good to fix. It
would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward. We probably should refuse to update
relfrozenxid/relminmxid when this is happening, but I *think* it would
be better to still proceed with dead tuple cleanup as far as we can,
or at least have an option to enable that behavior. I'm not positive
about that, but not being able to complete VACUUM at all is a FAR more
urgent problem than not being able to freeze, even though in the long
run the latter is more severe.

> > 2. In some other, similar situations, e.g. where the tuple data is
> > garbled, it's often possible to get out from under the problem by
> > deleting the tuple at issue. But I think that doesn't necessarily fix
> > anything in this case.
>
> Huh, why not? That worked in the cases I saw.

I'm not sure I've seen a case where that didn't work, but I don't see
a reason why it couldn't happen. Do you think the code is structured
in such a way that a deleted tuple is guaranteed to be pruned even if
the XID is old? What if clog has been truncated so that the xmin can't
be looked up?

> xmax is among the problematic cases IIRC, so yes, it'd be good to fix
> that.

Thanks for the input.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, I hadn't realized that limitation. That would be good to fix. It
> would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward. We probably should refuse to update
> relfrozenxid/relminmxid when this is happening, but I *think* it would
> be better to still proceed with dead tuple cleanup as far as we can,
> or at least have an option to enable that behavior. I'm not positive
> about that, but not being able to complete VACUUM at all is a FAR more
> urgent problem than not being able to freeze, even though in the long
> run the latter is more severe.

+1 for proceeding in this direction, rather than handing users tools
that they *will* hurt themselves with.

The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all.  We lack a better place to
put them, but that doesn't mean they should be there.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Oh, I hadn't realized that limitation. That would be good to fix. It
> > would be even better, I think, if we could have VACUUM proceed with
> > the rest of vacuuming the table, emitting warnings about each
> > instance, instead of blowing up when it hits the first bad tuple, but
> > I think you may have told me sometime that doing so would be, uh, less
> > than straightforward. We probably should refuse to update
> > relfrozenxid/relminmxid when this is happening, but I *think* it would
> > be better to still proceed with dead tuple cleanup as far as we can,
> > or at least have an option to enable that behavior. I'm not positive
> > about that, but not being able to complete VACUUM at all is a FAR more
> > urgent problem than not being able to freeze, even though in the long
> > run the latter is more severe.
>
> +1 for proceeding in this direction, rather than handing users tools
> that they *will* hurt themselves with.
>
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all.  We lack a better place to
> put them, but that doesn't mean they should be there.

It's not an either/or; it's a both/and. To recover from this problem,
you need to:

1. Be able to tell which tuples are affected.
2. Do something about it.

I think there are a number of strategies that we could pursue around
either of those things, and there are better and worse ways of
accomplishing them, but having one without the other isn't too great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-13 20:47:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block number, right? We probably should extend
> > it to also include the offset...
>
> Oh, I hadn't realized that limitation. That would be good to fix.

Yea. And it'd even be good if we were to end up implementing your
suggestion below about continuing vacuuming other tuples.


> It would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward.

Yea, it's not that simple to implement. Not impossible either.


> We probably should refuse to update relfrozenxid/relminmxid when this
> is happening, but I *think* it would be better to still proceed with
> dead tuple cleanup as far as we can, or at least have an option to
> enable that behavior. I'm not positive about that, but not being able
> to complete VACUUM at all is a FAR more urgent problem than not being
> able to freeze, even though in the long run the latter is more severe.

I'm hesitant to default to removing tuples once we've figured out that
something is seriously wrong. That could easily enough make us plow
ahead and delete valuable data in other tuples, even after we'd
already detected there's a problem. But I also see the problem you
raise. That's not academic: IIRC, a number of the multixact corruption
issues the checks detected weren't guaranteed to be caught.


> > > 2. In some other, similar situations, e.g. where the tuple data is
> > > garbled, it's often possible to get out from under the problem by
> > > deleting the tuple at issue. But I think that doesn't necessarily fix
> > > anything in this case.
> >
> > Huh, why not? That worked in the cases I saw.
>
> I'm not sure I've seen a case where that didn't work, but I don't see
> a reason why it couldn't happen. Do you think the code is structured
> in such a way that a deleted tuple is guaranteed to be pruned even if
> the XID is old?

I think so, leaving aside some temporary situations perhaps.


> What if clog has been truncated so that the xmin can't be looked up?

That's possible, but probably only in cases where xmin actually
committed.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all.  We lack a better place to
> put them, but that doesn't mean they should be there.

Also, I want to clarify that in a typical situation in which a
customer is facing this problem, I don't have any access to their
system. I basically never touch customer systems directly. Typically,
the customer sends us log files and a description of the problem and
their goals, and we send them back advice or instructions. So it's
impractical to imagine that this can be something where you have to
know the secret magic wizard password to get access to it. We'd just
have to give the customers who need to use this tool said password,
and then the jig is up - they can redistribute that password to all
the non-wizards on the Internet, if they so choose.

I understand that it's not too great when we give people access to
sharp tools and they hurt themselves with said tools. But this is open
source. That's how it goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
> > What if clog has been truncated so that the xmin can't be looked up?
>
> That's possible, but probably only in cases where xmin actually
> committed.

Isn't that the normal case? I'm imagining something like:

- Tuple gets inserted. Transaction commits.
- VACUUM processes table.
- Mischievous fairies mark page all-visible in the visibility map.
- VACUUM runs lots more times, relfrozenxid advances, but without ever
looking at the page in question, because it's all-visible.
- clog is truncated, rendering xmin no longer accessible.
- User runs VACUUM disabling page skipping, gets ERROR.
- User deletes offending tuple.
- At this point, I think the tuple is both invisible and unprunable?
- Fairies happy, user sad.
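
In terms of the commands the user actually runs, the tail end of that
sequence looks something like this (table name, TID, and XIDs made up):

```sql
VACUUM (DISABLE_PAGE_SKIPPING) broken_table;
-- ERROR:  found xmin 1234 from before relfrozenxid 56789

-- Deleting the offending tuple hides it from queries...
DELETE FROM broken_table WHERE ctid = '(10,5)';

-- ...but pruning it later still requires checking its xmin, which may
-- already be gone from clog, so the dead tuple can remain unprunable.
VACUUM (DISABLE_PAGE_SKIPPING) broken_table;  -- can still fail
```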

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The more that I think about it, the more I think that the proposed
>> functions are tools for wizards only, and so I'm getting hesitant
>> about having them in contrib at all.  We lack a better place to
>> put them, but that doesn't mean they should be there.

> I understand that it's not too great when we give people access to
> sharp tools and they hurt themselves with said tools. But this is open
> source. That's how it goes.

I think you're attacking a straw man.  I'm well aware of how open source
works, thanks.  What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff.  Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write.  But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function.  That statement does not apply
to the functions being proposed here.  It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe".  So I wish we had some
other way to distribute them than via contrib.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Fujii Masao
Date:

On 2020/07/14 9:41, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I don't care for that either.  That's a pretty huge violation of our
>> normal back-patching rules, and I'm not convinced that it's justified.
> 
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today. I wouldn't
> propose to back-patch something on those grounds just as a way of
> delivering a new feature more quickly, but that's not the intention
> here. At least in my experience, un-VACUUM-able tables have gotten
> several orders of magnitude more common since Andres put those changes
> in. As far as I can recall, EDB has not had this many instances of
> different customers reporting the same problem since the 9.3-era
> multixact issues. So far, this does not rise to that level, but it is
> by no means a negligible issue, either. I believe it deserves to be
> taken quite seriously, especially because the existing options for
> helping customers with this kind of problem are so limited.
> 
> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

But can updates to this tool fit the release schedule and
policy of PostgreSQL?

While investigating a problem using this tool, we may find that we
need to add a new feature to it for the investigation. But users would
then need to wait for the next minor release to use that new feature.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
"Andrey M. Borodin"
Date:
Hi!

> On 14 July 2020, at 02:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?
>
> - If so, how? I would propose a new contrib module that we back-patch
> all the way


My 0.05₽.

At Yandex we used to fix similar corruption with our pg_dirty_hands extension [0].
But then we developed our internal pg_heapcheck module (unfortunately we did
not publish it) and incorporated aggressive recovery into heapcheck.

Now that the community has an official heapcheck, I think it's worth keeping
the detection and fixing tools together.

Best regards, Andrey Borodin.

[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c




Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Peter Eisentraut
Date:
On 2020-07-14 02:41, Robert Haas wrote:
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today.

I think that all feature code ought to go through a beta cycle.  So if 
this code makes it to 14.0 or 14.1, then I'd consider backpatching it.

> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

In the meantime, if you're wizard enough to deal with this kind of 
thing, you could also clone the module from the PG14 tree and build it 
against older versions manually.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Magnus Hagander
Date:


On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The more that I think about it, the more I think that the proposed
>> functions are tools for wizards only, and so I'm getting hesitant
>> about having them in contrib at all.  We lack a better place to
>> put them, but that doesn't mean they should be there.

> I understand that it's not too great when we give people access to
> sharp tools and they hurt themselves with said tools. But this is open
> source. That's how it goes.

I think you're attacking a straw man.  I'm well aware of how open source
works, thanks.  What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff.  Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write.  But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function.  That statement does not apply
to the functions being proposed here.  It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe".  So I wish we had some
other way to distribute them than via contrib.

The counterexample to this is pg_resetwal. The number of people who have broken their database with that tool is not small.

That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.
 
--

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Jochem van Dieten
Date:
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you're attacking a straw man.  I'm well aware of how open source
> works, thanks.  What I'm saying is that contrib is mostly seen to be
> reasonably harmless stuff.  Sure, you can overwrite data you didn't want
> to with adminpack's pg_file_write.  But that's the price of having such a
> capability at all, and in general it's not hard for users to understand
> both the uses and risks of that function.  That statement does not apply
> to the functions being proposed here.  It doesn't seem like they could
> possibly be safe to use without very specific expert advice --- and even
> then, we're talking rather small values of "safe".

Would it be possible to make them safe(r)? For instance, truncate
only, don't freeze; only tuples whose visibility information is
corrupted; and only in non-catalog tables. What exactly is the risk in
that case? Foreign keys might not be satisfied, which might make it
impossible to restore a dump, but is that worse than what a DBA can do
anyway? I would think that it is not and would leave the database in a
state DBAs are much better equipped to deal with.
Or would it be possible to create a table like the original table
(minus any constraints) and copy all tuples with corrupted visibility
there before truncating to a dead line pointer?
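
That last idea might be sketched like this, with hypothetical names, and with
the caveat that reading the corrupted tuples may itself fail:

```sql
-- Quarantine copy before destroying the line pointers.  The TID list
-- is made up; it would come from whatever detection tool reports them.
CREATE TABLE broken_table_quarantine (LIKE broken_table);

INSERT INTO broken_table_quarantine
    SELECT t.* FROM broken_table t
     WHERE t.ctid = ANY ('{"(10,5)","(10,6)"}'::tid[]);
```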

Jochem



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> In the meantime, if you're wizard enough to deal with this kind of
> thing, you could also clone the module from the PG14 tree and build it
> against older versions manually.

But what if you are NOT a wizard, and a wizard is giving you
directions? Then having to build from source is a real pain. And
that's normally the situation I'm in when a customer has this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <magnus@hagander.net> wrote:
> The counterexample to this is pg_resetwal. The number of people who have broken their database with that tool is not
> small.

Very true.

> That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack
> them up as separate packages for example. Technically they don't have to be in the same source repository at all of
> course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just
> have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at
> least the knives would be in a closed drawer until intentionally picked up.

I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Magnus Hagander
Date:
On Tue, Jul 14, 2020 at 1:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <magnus@hagander.net> wrote:
> The counterexample to this is pg_resetwal. The number of people who have broken their database with that tool is not small.

Very true.

> That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.

I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.

I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.

--

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jul 13, 2020 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> But updating this tool can fit to the release schedule and
> policy of PostgreSQL?
>
> While investigating the problem by using this tool, we may want to
> add new feature into the tool because it's necessary for the investigation.
> But users would need to wait for next minor version release, to use this
> new feature.

Yeah, that's a point that needs careful thought. I don't think it
means that we shouldn't have something in core; after all, this is a
problem that is created in part by the way that PostgreSQL itself
works, and I think it would be quite unfriendly if we refused to do
anything about that in the core distribution. On the other hand, it
might be a good reason not to back-patch, which is something most
people don't seem enthusiastic about anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
> I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.

I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Magnus Hagander
Date:


On Tue, Jul 14, 2020 at 4:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net> wrote:
> I don't think that it necessarily has to be. As long as we're talking about adding something and not actually changing their existing packages, getting this into both yum and apt shouldn't be *that* hard, if it's coordinated well with Christoph and Devrim (obviously that's based on my experience and they will have to give a more complete answer themselves). It would be a lot more complicated if it involved changing an existing package.

I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?

Probably and eventually. But that can be done for 14+ (or 13+ depending on how "done" the packaging is there -- we should just make sure that hits the biggest platform in the same release). 

--

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Stephen Frost
Date:
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Tue, Jul 14, 2020 at 4:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander <magnus@hagander.net>
> > wrote:
> > > I don't think that it necessarily has to be. As long as we're talking
> > about adding something and not actually changing their existing packages,
> > getting this into both yum and apt shouldn't be *that* hard, if it's
> > coordinated well with Christoph and Devrim (obviously that's based on my
> > experience and they will have to give a more complete answer themselves).
> > It would be a lot more complicated if it involved changing an existing
> > package.
> >
> > I mean, you presumably could not move pg_resetwal to this new package
> > in existing branches, right?
>
> Probably and eventually. But that can be done for 14+ (or 13+ depending on
> how "done" the packaging is there -- we should just make sure that hits the
> biggest platform in the same release).

Considering we just got rid of the -contrib independent package on at
least Debian-based systems, it doesn't really seem likely that the
packagers are going to be anxious to create a new one- they are not
without costs.

Also, in such dire straits as this thread is contemplating, I would
think we'd *really* like to have access to these tools with as small an
amount of change as absolutely possible to the system: what if
pg_extension itself got munged and we aren't able to install this new
contrib module, for example?

I would suggest that, instead, we make this part of core, but have it be
in a relatively clearly marked special schema that isn't part of
search_path by default- eg: pg_hacks, or pg_dirty_hands (I kinda like
the latter though it seems a bit unprofessional for us).

I'd also certainly be in support of having a contrib module with the
same functions that's independent from core and available and able to be
installed on pre-v14 systems.  I'd further support having another repo
that's "core maintained" or however we want to phrase it which includes
this proposed module (and possibly all of contrib) and which has a
different release cadence and requirements for what gets into it, has
its own packages, etc.

Thanks,

Stephen

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Alvaro Herrera
Date:
On 2020-Jul-13, Andres Freund wrote:

> Hi,
> 
> On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
> 
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block number, right? We probably should extend
> it to also include the offset...

Just having the block number is already a tremendous step forward; with
that you can ask the customer to send a pageinspect dump of tuple
headers, and then the problem is obvious.  Now if you want to add block
number to that, by all means do so.

FWIW I do support the idea of backpatching the vacuum errcontext commit.

One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.
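Schematically, the shortcut amounts to something like this self-contained sketch (the constants mirror PostgreSQL's htup_details.h, but the function is illustrative, not the actual 9.6 patch):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch, NOT the real patch: a tuple whose xmin-committed
 * hint bit is already set is marked frozen without consulting pg_clog.
 * Values mirror htup_details.h. */
#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

static bool
freeze_if_hinted(uint16_t *infomask)
{
    if ((*infomask & HEAP_XMIN_COMMITTED) != 0)
    {
        /* committed|invalid together mean "frozen" (since 9.4) */
        *infomask |= HEAP_XMIN_FROZEN;
        return true;            /* froze without a clog lookup */
    }
    return false;               /* caller must consult pg_clog as usual */
}
```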

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-13 21:18:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <andres@anarazel.de> wrote:
> > > What if clog has been truncated so that the xmin can't be looked up?
> >
> > That's possible, but probably only in cases where xmin actually
> > committed.
> 
> Isn't that the normal case? I'm imagining something like:
> 
> - Tuple gets inserted. Transaction commits.
> - VACUUM processes table.
> - Mischievous fairies mark page all-visible in the visibility map.
> - VACUUM runs lots more times, relfrozenxid advances, but without ever
> looking at the page in question, because it's all-visible.
> - clog is truncated, rendering xmin no longer accessible.
> - User runs VACUUM disabling page skipping, gets ERROR.
> - User deletes offending tuple.
> - At this point, I think the tuple is both invisible and unprunable?
> - Fairies happy, user sad.

I'm not saying it's impossible that that happens, but the cases I did
investigate didn't look like this. If something just roguely wrote to
the VM I'd expect a lot more "is not marked all-visible but visibility
map bit is set in relation" type WARNINGs, and I've not seen much of
those (they're WARNINGs though, so maybe we wouldn't). Presumably this
wouldn't always just happen with tuples that'd trigger an error first
during hot pruning.

I've definitely seen indications of both datfrozenxid and relfrozenxid
getting corrupted (in particular vac_update_datfrozenxid being racy as
hell), xid wraparound, indications of multixact problems (although it's
possible we've now fixed those) and some signs of corrupted relcache
entries for shared relations leading to vacuums being skipped.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
> On 2020-Jul-13, Andres Freund wrote:
> 
> > Hi,
> > 
> > On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > > 1. There's nothing to identify the tuple that has the problem, and no
> > > way to know how many more of them there might be. Back-patching
> > > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > > part of this.
> > 
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block number, right? We probably should extend
> > it to also include the offset...
> 
> Just having the block number is already a tremendous step forward; with
> that you can ask the customer to send a pageinspect dump of tuple
> headers, and then the problem is obvious.  Now if you want to add block
> number to that, by all means do so.

offset number I assume?


> One useful thing to do is to mark a tuple frozen unconditionally if it's
> marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> The attached (for 9.6) does that; IIRC it would have helped in a couple
> of cases.

I think it might also have hidden corruption in at least one case where
we subsequently fixed a bug (and helped detect at least one unfixed
bug). That should only be possible if either required clog has been
removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-14 07:51:27 -0400, Robert Haas wrote:
> On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > In the meantime, if you're wizard enough to deal with this kind of
> > thing, you could also clone the module from the PG14 tree and build it
> > against older versions manually.
> 
> But what if you are NOT a wizard, and a wizard is giving you
> directions? Then having to build from source is a real pain. And
> that's normally the situation I'm in when a customer has this issue.

The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
.... or something like that.
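For a single known-bad tid, the rewrite being described would presumably look like this (table name and tid are placeholders):

```sql
-- Hypothetical repair of one damaged tuple at a known ctid: re-insert
-- the row's data as a fresh tuple (whose xmin comes from the current
-- transaction), then delete the old version.  Run in one transaction.
BEGIN;
INSERT INTO damaged_table
    SELECT * FROM damaged_table WHERE ctid = '(0,1)';
DELETE FROM damaged_table WHERE ctid = '(0,1)';
COMMIT;
```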

ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUMs error message), not "fixing" those
tuples.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Alvaro Herrera
Date:
On 2020-Jul-14, Andres Freund wrote:

> Hi,
> 
> On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:

> > Just having the block number is already a tremendous step forward; with
> > that you can ask the customer to send a pageinspect dump of tuple
> > headers, and then the problem is obvious.  Now if you want to add block
> > number to that, by all means do so.
> 
> offset number I assume?

Eh, yeah, that.

> > One useful thing to do is to mark a tuple frozen unconditionally if it's
> > marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> > The attached (for 9.6) does that; IIRC it would have helped in a couple
> > of cases.
> 
> I think it might also have hidden corruption in at least one case where
> we subsequently fixed a bug (and helped detect at least one unfixed
> bug). That should only be possible if either required clog has been
> removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Yes, that's precisely the reason I never submitted it :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> The "found xmin ... from before relfrozenxid ..." cases should all be
> fixable without needing such a function, and without it making fixing
> them significantly easier, no? As far as I understand your suggested
> solution, you need the tid(s) of these tuples, right? If you have those,
> I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
> .... or something like that.
>
> ISTM that the hard part is finding all problematic tuples in an
> efficient manner (i.e. that doesn't require one manual VACUUM for each
> individual block + parsing VACUUMs error message), not "fixing" those
> tuples.

I haven't tried the INSERT ... DELETE approach, but I've definitely
seen a case where a straight UPDATE did not fix the problem; VACUUM
continued failing afterwards. In that case, it was a system catalog
that was affected, and not one where TRUNCATE + re-INSERT was remotely
practical. The only solution I could come up with was to drop the
database and recreate it. Fortunately in that case the affected
database didn't seem to have any actual data in it, but if it had been
a 1TB database I think we would have been in really bad trouble.

Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-07-14 15:59:21 -0400, Robert Haas wrote:
> On Tue, Jul 14, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> > The "found xmin ... from before relfrozenxid ..." cases should all be
> > fixable without needing such a function, and without it making fixing
> > them significantly easier, no? As far as I understand your suggested
> > solution, you need the tid(s) of these tuples, right? If you have those,
> > I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
> > .... or something like that.
> >
> > ISTM that the hard part is finding all problematic tuples in an
> > efficient manner (i.e. that doesn't require one manual VACUUM for each
> > individual block + parsing VACUUMs error message), not "fixing" those
> > tuples.
> 
> I haven't tried the INSERT ... DELETE approach, but I've definitely
> seen a case where a straight UPDATE did not fix the problem; VACUUM
> continued failing afterwards.

The only way I can see that to happen is for the old tuple's multixact
being copied forward. That'd not happen with INSERT ... DELETE.


> In that case, it was a system catalog
> that was affected, and not one where TRUNCATE + re-INSERT was remotely
> practical.

FWIW, a rewriting ALTER TABLE would likely also fix it. But obviously
that'd require allow_system_table_mods...



> Do you have a reason for believing that INSERT ... DELETE is going to
> be better than UPDATE? It seems to me that either way you can end up
> with a deleted and thus invisible tuple that you still can't get rid
> of.

None of the "new" checks around freezing would apply to deleted
tuples. So we shouldn't fail with an error like $subject.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Jul 15, 2020 at 11:41 AM Andres Freund <andres@anarazel.de> wrote:
> > Do you have a reason for believing that INSERT ... DELETE is going to
> > be better than UPDATE? It seems to me that either way you can end up
> > with a deleted and thus invisible tuple that you still can't get rid
> > of.
>
> None of the "new" checks around freezing would apply to deleted
> tuples. So we shouldn't fail with an error like $subject.

It can definitely happen at least transiently:

S1:
rhaas=# create table wubble (a int, b text);
CREATE TABLE
rhaas=# insert into wubble values (1, 'glumpf');
INSERT 0 1

S2:
rhaas=# begin transaction isolation level repeatable read;
BEGIN
rhaas=*# select * from wubble;
 a |   b
---+--------
 1 | glumpf
(1 row)

S1:
rhaas=# delete from wubble;
DELETE 1
rhaas=# update pg_class set relfrozenxid =
(relfrozenxid::text::integer + 1000000)::text::xid where relname =
'wubble';
UPDATE 1
rhaas=# vacuum verbose wubble;
INFO:  vacuuming "public.wubble"
ERROR:  found xmin 528 from before relfrozenxid 1000527
CONTEXT:  while scanning block 0 of relation "public.wubble"

S2:
rhaas=*# commit;
COMMIT

S1:
rhaas=# vacuum verbose wubble;
INFO:  vacuuming "public.wubble"
INFO:  "wubble": removed 1 row versions in 1 pages
INFO:  "wubble": found 1 removable, 0 nonremovable row versions in 1
out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 531
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "wubble": truncated 1 to 0 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  vacuuming "pg_toast.pg_toast_16415"
INFO:  index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_16415": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 532
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

I see your point, though: the tuple has to be able to survive
HOT-pruning in order to cause a problem when we check whether it needs
freezing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I see your point, though: the tuple has to be able to survive
> HOT-pruning in order to cause a problem when we check whether it needs
> freezing.

Here's an example where the new sanity checks fail on an invisible
tuple without any concurrent transactions:

$ initdb
$ pg_ctl start -l ~/logfile
$ createdb
$ psql

create table simpsons (a int, b text);
vacuum freeze;

$ cat > txid.sql
select txid_current();
$ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
$ psql

insert into simpsons values (1, 'homer');

$ pg_ctl stop
$ pg_resetwal -x 1000 $PGDATA
$ pg_ctl start -l ~/logfile
$ psql

update pg_class set relfrozenxid = (relfrozenxid::text::integer +
2000000)::text::xid where relname = 'simpsons';

rhaas=# select * from simpsons;
 a | b
---+---
(0 rows)

rhaas=# vacuum simpsons;
ERROR:  found xmin 1049082 from before relfrozenxid 2000506
CONTEXT:  while scanning block 0 of relation "public.simpsons"

This is a fairly insane situation, because we should have relfrozenxid
< tuple xid < xid counter, but instead we have xid counter < tuple xid
< relfrozenxid, but it demonstrates that it's possible to have a
database which is sufficiently corrupt that you can't escape from the
new sanity checks using only INSERT, UPDATE, and DELETE.
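One quick way to spot this inverted ordering is to check the age of relfrozenxid (relation name taken from the example above):

```sql
-- age(xid) is the distance from the current xid counter back to xid;
-- for a healthy table it is a modest positive number.  A negative age
-- suggests relfrozenxid sits "ahead of" the xid counter, i.e. corruption.
SELECT relname, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE relname = 'simpsons';
```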

Now, an even easier way to create a table with a tuple that prevents
vacuuming and also can't just be deleted is to simply remove a
required pg_clog file (and maybe restart the server to clear out any
cached data in the SLRUs). What we typically do with customers who
need to recover from that situation today is give them a script to
fabricate a bogus CLOG file that shows all transactions as committed
(or, perhaps, aborted). But I think that the tools proposed on this
thread might be a better approach in certain cases. If the problem is
that a pg_clog file vanished, then recreating it with whatever content
you think is closest to what was probably there before is likely the
best you can do. But if you've got some individual tuples with crazy
xmin values, you don't really want to drop matching files in pg_clog;
it's better to fix the tuples.
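The fabricated-CLOG trick can be sketched as follows. Each transaction's status takes 2 bits, with 01 meaning committed, so a byte of 0x55 (octal 125) marks four consecutive xids committed; a clog segment is 32 pages of BLCKSZ bytes, i.e. 256kB with the default 8kB block size. The segment name 0000 here is illustrative; the file to recreate is whichever one the server complains about:

```shell
# Sketch only: build a 256 kB clog segment in which every transaction
# reads as committed (2 status bits per xid, 01 = committed, so each
# 0x55 byte covers four xids).  Place it in pg_clog/ (pg_xact/ on v10+)
# under the segment name of the missing file.
dd if=/dev/zero bs=262144 count=1 2>/dev/null | tr '\0' '\125' > 0000
```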

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi All,

Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions.  Please have a look and let me know your feedback. Thank you.
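Given those signatures, usage would presumably look like this (table name and tids invented for illustration):

```sql
CREATE EXTENSION pg_surgery;

-- Force-freeze two damaged tuples so VACUUM stops erroring out on them:
SELECT heap_force_freeze('broken_table'::regclass,
                         ARRAY['(0,1)', '(0,2)']::tid[]);

-- Or force-kill a tuple outright if its contents are garbage:
SELECT heap_force_kill('broken_table'::regclass,
                       ARRAY['(0,3)']::tid[]);
```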

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


On Thu, Jul 16, 2020 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I see your point, though: the tuple has to be able to survive
> HOT-pruning in order to cause a problem when we check whether it needs
> freezing.

Here's an example where the new sanity checks fail on an invisible
tuple without any concurrent transactions:

$ initdb
$ pg_ctl start -l ~/logfile
$ createdb
$ psql

create table simpsons (a int, b text);
vacuum freeze;

$ cat > txid.sql
select txid_current();
$ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
$ psql

insert into simpsons values (1, 'homer');

$ pg_ctl stop
$ pg_resetwal -x 1000 $PGDATA
$ pg_ctl start -l ~/logfile
$ psql

update pg_class set relfrozenxid = (relfrozenxid::text::integer +
2000000)::text::xid where relname = 'simpsons';

rhaas=# select * from simpsons;
 a | b
---+---
(0 rows)

rhaas=# vacuum simpsons;
ERROR:  found xmin 1049082 from before relfrozenxid 2000506
CONTEXT:  while scanning block 0 of relation "public.simpsons"

This is a fairly insane situation, because we should have relfrozenxid
< tuple xid < xid counter, but instead we have xid counter < tuple xid
< relfrozenxid, but it demonstrates that it's possible to have a
database which is sufficiently corrupt that you can't escape from the
new sanity checks using only INSERT, UPDATE, and DELETE.

Now, an even easier way to create a table with a tuple that prevents
vacuuming and also can't just be deleted is to simply remove a
required pg_clog file (and maybe restart the server to clear out any
cached data in the SLRUs). What we typically do with customers who
need to recover from that situation today is give them a script to
fabricate a bogus CLOG file that shows all transactions as committed
(or, perhaps, aborted). But I think that the tools proposed on this
thread might be a better approach in certain cases. If the problem is
that a pg_clog file vanished, then recreating it with whatever content
you think is closest to what was probably there before is likely the
best you can do. But if you've got some individual tuples with crazy
xmin values, you don't really want to drop matching files in pg_clog;
it's better to fix the tuples.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
"Andrey M. Borodin"
Date:

> 24 июля 2020 г., в 14:05, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
>
> Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions.  Please have a look and let me know your feedback. Thank you.

Thanks for the patch!
I have just few random thoughts.

I think here we should report that we haven't done what was asked.
+            /* Nothing to do if the itemid is unused or already dead. */
+            if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+                continue;

Also, should we try to fix VM along the way?
Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
It would be good to have some checks for interrupts in safe places.

I think we should not trust the user entirely here. I'd prefer validation and a graceful exit, not a core dump.
+        Assert(noffs <= PageGetMaxOffsetNumber(page));

For some reason we had unlogged versions of these functions. But I do not recall exact rationale..
Also, I'd be happy if we had something like "Restore this tuple iff this does not break unique constraint". To do so we need to sort tids by xmin\xmax, to revive most recent data first.

Thanks!

Best regards, Andrey Borodin.


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi,

Thanks for sharing your thoughts. Please find my comments inline below:
 

I think here we should report that we haven't done what was asked.
+                       /* Nothing to do if the itemid is unused or already dead. */
+                       if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+                               continue;


Okay. Will add a log message saying "skipping tid ... because ..."
 
Also, should we try to fix VM along the way?

I think we should let VACUUM do that.
 
Are there any caveats with concurrent VACUUM? (I do not see any, just asking)

As of now, I don't see any.
 
It would be good to have some checks for interrupts in safe places.

I think I have already added those wherever I felt it was required. If you feel it's missing somewhere, it could be good if you could point it out.
 
I think we should not trust the user entirely here. I'd prefer validation and a graceful exit, not a core dump.
+               Assert(noffs <= PageGetMaxOffsetNumber(page));


Yeah, sounds reasonable. Will do that.
 
For some reason we had unlogged versions of these functions. But I do not recall exact rationale..
Also, I'd be happy if we had something like "Restore this tuple iff this does not break unique constraint". To do so we need to sort tids by xmin\xmax, to revive most recent data first.

I didn't get this point. Could you please elaborate. 

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
"Andrey M. Borodin"
Date:

> 27 июля 2020 г., в 09:36, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
>
> > Also, should we try to fix VM along the way?
>
> I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
And possibly some tuples will get stale on this page again.

> > Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
>
> As of now, I don't see any.
VACUUM has collection of dead item pointers. We will not resurrect any of them, right?

> > It would be good to have some checks for interrupts in safe places.
>
> I think I have already added those wherever I felt it was required. If you feel it's missing somewhere, it could be good if you could point it out.
Sorry, I just overlooked that call, everything is fine here.

> > Also, I'd be happy if we had something like "Restore this tuple iff this does not break unique constraint". To do so we need to sort tids by xmin\xmax, to revive most recent data first.
>
> I didn't get this point. Could you please elaborate.
You may have 10 corrupted tuples for the same record, that was updated 9 times. And if you have a unique constraint on the table you may want to have only the latest version of the row. So you want to kill 9 tuples and freeze 1.

Thanks!

Best regards, Andrey Borodin.


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
MBeena Emerson
Date:
Hello Ashutosh,

On Fri, 24 Jul 2020 at 14:35, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> Attached is the patch that adds heap_force_kill(regclass, tid[]) and heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the first email in this thread. The patch basically adds an extension named pg_surgery that contains these functions.  Please have a look and let me know your feedback. Thank you.
>

Thanks for the patch.

1. We would be marking the buffer dirty and writing WAL even if we have
not made any changes (e.g. if we pass invalid/dead tids). Maybe we can
handle this better?

cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"

> On Thu, Jul 16, 2020 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jul 16, 2020 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> > I see your point, though: the tuple has to be able to survive
>> > HOT-pruning in order to cause a problem when we check whether it needs
>> > freezing.
>>
>> Here's an example where the new sanity checks fail on an invisible
>> tuple without any concurrent transactions:
>>
>> $ initdb
>> $ pg_ctl start -l ~/logfile
>> $ createdb
>> $ psql
>>
>> create table simpsons (a int, b text);
>> vacuum freeze;
>>
>> $ cat > txid.sql
>> select txid_current();
>> $ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
>> $ psql
>>
>> insert into simpsons values (1, 'homer');
>>
>> $ pg_ctl stop
>> $ pg_resetwal -x 1000 $PGDATA
>> $ pg_ctl start -l ~/logfile
>> $ psql
>>
>> update pg_class set relfrozenxid = (relfrozenxid::text::integer +
>> 2000000)::text::xid where relname = 'simpsons';
>>
>> rhaas=# select * from simpsons;
>>  a | b
>> ---+---
>> (0 rows)
>>
>> rhaas=# vacuum simpsons;
>> ERROR:  found xmin 1049082 from before relfrozenxid 2000506
>> CONTEXT:  while scanning block 0 of relation "public.simpsons"
>>
>> This is a fairly insane situation, because we should have relfrozenxid
>> < tuple xid < xid counter, but instead we have xid counter < tuple xid
>> < relfrozenxid, but it demonstrates that it's possible to have a
>> database which is sufficiently corrupt that you can't escape from the
>> new sanity checks using only INSERT, UPDATE, and DELETE.
>>
>> Now, an even easier way to create a table with a tuple that prevents
>> vacuuming and also can't just be deleted is to simply remove a
>> required pg_clog file (and maybe restart the server to clear out any
>> cached data in the SLRUs). What we typically do with customers who
>> need to recover from that situation today is give them a script to
>> fabricate a bogus CLOG file that shows all transactions as committed
>> (or, perhaps, aborted). But I think that the tools proposed on this
>> thread might be a better approach in certain cases. If the problem is
>> that a pg_clog file vanished, then recreating it with whatever content
>> you think is closest to what was probably there before is likely the
>> best you can do. But if you've got some individual tuples with crazy
>> xmin values, you don't really want to drop matching files in pg_clog;
>> it's better to fix the tuples.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>


--

M Beena Emerson

Sr. Software Engineer


edbpostgres.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:

> I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
And possibly some tuples will get stale on this page again.

Hmm, okay, will have a look into this. Thanks.
 

> > Are there any caveats with concurrent VACUUM? (I do not see any, just asking)
>
> As of now, I don't see any.
VACUUM has a collection of dead item pointers. We will not resurrect any of them, right?

We won't be doing any such things.
 
> > It would be good to have some checks for interrupts in safe places.
>
> I think I have already added those wherever I felt it was required. If you feel it's missing somewhere, it would be good if you could point it out.
Sorry, I just overlooked that call, everything is fine here.

Okay, thanks for confirming.
 
> > Also, I'd be happy if we had something like "Restore this tuple iff this does not break the unique constraint". To do so we need to sort tids by xmin/xmax, to revive the most recent data first.
>
> I didn't get this point. Could you please elaborate.
You may have 10 corrupted tuples for the same record, which was updated 9 times. And if you have a unique constraint on the table you may want to keep only the latest version of the row. So you want to kill 9 tuples and freeze 1.

Okay, in that case, users need to pass the tids of the 9 tuples they want to kill to the heap_force_kill function, and the tid of the tuple they want marked as frozen to the heap_force_freeze function. Just to inform you, this tool is not meant to detect data corruption; it is only meant to remove/clean the corrupted data in a table so that operations like vacuum and pg_dump/restore succeed. It's the user's responsibility to identify the corrupted data and pass its tid to either of these functions as per the requirement.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Beena,

Thanks for the review.

1. We would be marking the buffer dirty and writing WAL even if we have
not made any changes (e.g. if we pass invalid/dead tids). Maybe we can
handle this better?

Yeah, we can skip this when nothing has changed. Will take care of it in the next version of patch.
 
cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.

I think that looks fine. I would rather prefer the word "Force", just because all the enum options contain the word "Force" in them.
 
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"

Okay, will fix that typo.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Thomas Munro
Date:
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> A number of EDB customers have had this error crop on their tables for
> reasons that we have usually not been able to determine. In many

<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > A number of EDB customers have had this error crop on their tables for
> > reasons that we have usually not been able to determine. In many
>
> <long-shot>Do you happen to know if they ever used the
> snapshot-too-old feature?</long-shot>

I don't have any reason to believe that they did. Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Thomas Munro
Date:
On Thu, Jul 30, 2020 at 1:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > A number of EDB customers have had this error crop on their tables for
> > > reasons that we have usually not been able to determine. In many
> >
> > <long-shot>Do you happen to know if they ever used the
> > snapshot-too-old feature?</long-shot>
>
> I don't have any reason to believe that they did. Why?

Nothing specific, I was just contemplating the problems with that
feature and the patches[1] proposed so far to fix some of them, and
what types of corruption might be possible due to that stuff, and it
occurred to me to ask if you'd thought about that in connection to
these reports.

[1]
https://www.postgresql.org/message-id/flat/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:

On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

> I think we should let VACUUM do that.
Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
And possibly some tuples will get stale on this page again.

Hmm, okay, will have a look into this. Thanks.

I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its page-skipping behavior.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Attached is the new version of patch that addresses the comments from Andrey and Beena.

On Wed, Jul 29, 2020 at 10:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Beena,

Thanks for the review.

1. We would be marking the buffer dirty and writing WAL even if we have
not made any changes (e.g. if we pass invalid/dead tids). Maybe we can
handle this better?

Yeah, we can skip this when nothing has changed. Will take care of it in the next version of patch.
 
cosmetic changes
1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
also the variable names could use surgery instead.

I think that looks fine. I would rather prefer the word "Force", just because all the enum options contain the word "Force" in them.
 
2. extension comment pg_surgery.control "extension to perform surgery
the damaged heap table" -> "extension to perform surgery on the
damaged heap table"

Okay, will fix that typo.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
"Andrey M. Borodin"
Date:

> On 31 July 2020, at 17:32, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
>
> On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> > I think we should let VACUUM do that.
> Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
> And possibly some tuples will get stale on this page again.
>
> Hmm, okay, will have a look into this. Thanks.
>
> I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its
> page-skipping behavior.

Oh, wow, I didn't know that. Thanks! This actually will do the trick.
I'll try to review your patch again next week.

Thanks!

Best regards, Andrey Borodin.


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Attached is the new version of patch that addresses the comments from Andrey and Beena.

+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"

the -> a

I also suggest: heap table -> relation, because we might want to
extend this to other cases later.

+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int +
65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+

I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.

+-- test pg_surgery functions with the unsupported relations. Should fail.

Please name the specific functions being tested here in case we add
more in the future that are tested separately.

+++ b/contrib/pg_surgery/heap_surgery_funcs.c

I think we could drop _funcs from the file name.

+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.

I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.

Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.

+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;

I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE. Also, how about
option -> operation?

+ return heap_force_common(fcinfo, FORCE_KILL);

I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.

+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);

I think this is unnecessary. It's an enum with 2 values.

+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));

I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.

+ rel = relation_open(relid, AccessShareLock);

Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.

+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));

This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.

+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);

Seems like these checks are in the wrong order. Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.

+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max
offset number %u",
+ blkno, maxoffset)));

Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.

The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem. You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section.  So you could:

1. If you encounter a TID that's unused or dead, skip it silently.
-or-
2. Loop over offsets twice. The first time, ERROR if you find any one
that is unused or dead. Then start a critical section. Loop again and
do the real work.
-or-
3. Like #2, but emit a NOTICE about an unused or dead item rather than
an ERROR, and skip the critical section and the second loop if you did
that >0 times.
-or-
4. Like #3, but don't skip anything just because you emitted a NOTICE
about the page.

#3 is closest to the behavior you have now, but I'm not sure what else
it has going for it. It doesn't seem like particularly intuitive
behavior that finding a dead or unused TID should cause other item
TIDs on the same page not to get processed while still permitting TIDs
on other pages to get processed. I don't think that's the behavior
users will be expecting. I think my vote is for #4, which will emit a
NOTICE about any TID that is dead or unused -- and I guess also about
any TID whose offset number is out of range -- but won't actually skip
any operations that can be performed. But there are decent arguments
for #1 or #2 too.

+ (errmsg("skipping tid (%u, %u) because it is already marked %s",
+ blkno, offnos[i],
+ ItemIdIsDead(itemid) ? "dead" : "unused")));

I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":

"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"

The point of this is that it makes it easier for translators.

I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Robert,

Thanks for the review.

I've gone through all your review comments and understood all of them except this one:

You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. 

Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified, but won't be marked as dirty or a WAL won't be written for it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a
> buffer would be modified, but won't be marked as dirty or a WAL won't be written for it.
 

Whenever this branch is taken:

+               if (nskippedItems == noffs)
+                       goto skip_wal;

At this point you have already modified the page, using ItemIdSetDead,
HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
skipped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Mon, Aug 3, 2020 at 7:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Could you please explain this point once more in detail? I am not quite able to understand under what circumstances
> > a buffer would be modified, but won't be marked as dirty or a WAL won't be written for it.
 
>
> Whenever this branch is taken:
>
> +               if (nskippedItems == noffs)
> +                       goto skip_wal;
>

If the above path is taken that means none of the items in the page
got changed. As per the following if-check whenever an item in the
offnos[] array is found dead or unused, it is skipped (due to continue
statement) which means the item is neither marked dead nor it is
marked frozen. Now, if this happens for all the items in a page, then
the above condition (nskippedItems == noffs) would be true and hence
the buffer would remain unchanged, so, we don't mark such a buffer as
dirty and neither do any WAL logging for it. This is my understanding,
please let me know if I am missing something here. Thank you.

if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
{
        nskippedItems++;
        ereport(NOTICE,
                     (errmsg("skipping tid (%u, %u) because it is
already marked %s",
                                   blkno, offnos[i],
                                   ItemIdIsDead(itemid) ? "dead" : "unused")));
        continue;
}

> At this point you have already modified the page, using ItemIdSetDead,
> HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
> branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
> skipped.
>

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Robert,

Thanks for the review. Please find my comments inline:

On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Attached is the new version of patch that addresses the comments from Andrey and Beena.
>
> +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>
> the -> a
>
> I also suggest: heap table -> relation, because we might want to
> extend this to other cases later.
>

Corrected.

> +-- toast table.
> +begin;
> +create table ttab(a text);
> +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> 65), '') from generate_series(1,10000);
> +select * from ttab where xmin = 2;
> + a
> +---
> +(0 rows)
> +
> +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> + heap_force_freeze
> +-------------------
> +
> +(1 row)
> +
>
> I don't understand the point of this. You're not testing the function
> on the TOAST table; you're testing it on the main table when there
> happens to be a TOAST table that is probably getting used for
> something. But that's not really relevant to what is being tested
> here, so as written this seems redundant with the previous cases.
>

Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has an external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

> +-- test pg_surgery functions with the unsupported relations. Should fail.
>
> Please name the specific functions being tested here in case we add
> more in the future that are tested separately.
>

Done.

> +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>
> I think we could drop _funcs from the file name.
>

Done.

> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif
>
> The #ifdef here is not required, and if you look at other contrib
> modules you'll see that they don't have it.
>

Okay, done.

> I don't like all the macros at the top of the file much. CHECKARRVALID
> is only used in one place, so it seems to me that you might as well
> just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>

Done.

> Once you do that, heap_force_common() can just compute the number of
> array elements once, instead of doing it once inside ARRISEMPTY, then
> again inside SORT, and then a third time to initialize ntids. You can
> just have a local variable in that function that is set once and then
> used as needed. Then you are only doing ARRNELEMS once, so you can get
> rid of that macro too. The same technique can be used to get rid of
> ARRPTR. So then all the macros go away without introducing any code
> duplication.
>

Done.

> +/* Options to forcefully change the state of a heap tuple. */
> +typedef enum HTupleForceOption
> +{
> + FORCE_KILL,
> + FORCE_FREEZE
> +} HTupleForceOption;
>
> I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.

Done.

> Also, how about
> option -> operation?
>

I think both look okay to me.

> + return heap_force_common(fcinfo, FORCE_KILL);
>
> I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> know it's the same thing, though, and perhaps I'm even wrong about the
> prevailing style.
>

Done.

> + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>
> I think this is unnecessary. It's an enum with 2 values.
>

Removed.

> + if (ARRISEMPTY(ta))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("empty tid array")));
>
> I don't see why this should be an error. Why can't we just continue
> normally and if it does nothing, it does nothing? You'd need to change
> the do..while loop to a while loop so that the end condition is tested
> at the top, but that seems fine.
>

I think it's okay to have this check. So, just left it as-is. We do
have such checks in other contrib modules as well wherever the array
is being passed as an input to the function.

> + rel = relation_open(relid, AccessShareLock);
>
> Maybe we should take RowExclusiveLock, since we are going to modify
> rows. Not sure how much it matters, though.
>

I don't know how much difference it makes, but still, as you said, I've
replaced AccessShare with RowExclusive.

> + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> + errmsg("must be superuser or object owner to use %s.",
> + force_opt == FORCE_KILL ? "heap_force_kill()" :
> + "heap_force_freeze()")));
>
> This is the wrong way to do a permissions check, and it's also the
> wrong way to write an error message about having failed one. To see
> the correct method, grep for pg_class_aclcheck. However, I think that
> we shouldn't in general trust the object owner to do this, unless the
> super-user gave permission. This is a data-corrupting operation, and
> only the boss is allowed to authorize it. So I think you should also
> add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> have this check as a backup. Then, the superuser is always allowed,
> and if they choose to GRANT EXECUTE on this function to some users,
> those users can do it for their own relations, but not others.
>

Done.

> + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("only heap AM is supported")));
> +
> + check_relation_relkind(rel);
>
> Seems like these checks are in the wrong order.

I don't think there is anything wrong with the order. I can see the
same order in other contrib modules as well.

Also, maybe you could
> call the function something like check_relation_ok() and put the
> permissions test, the relkind test, and the relam test all inside of
> it, just to tighten up the code in this main function a bit.
>

Yeah, I've added a couple of functions named sanity_check_relation and
sanity_check_tid_array and shifted all the sanity checks there.

> + if (noffs > maxoffset)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("number of offsets specified for block %u exceeds the max
> offset number %u",
> + blkno, maxoffset)));
>
> Hmm, this doesn't seem quite right. The actual problem is if an
> individual item pointer's offset number is greater than maxoffset,
> which can be true even if the total number of offsets is less than
> maxoffset. So I think you need to remove this check and add a check
> inside the loop which follows that offnos[i] is in range.
>

Agreed and done.

> The way you've structured that loop is actually problematic -- I don't
> think we want to be calling elog() or ereport() inside a critical
> section. You could fix the case that checks for an invalid force_opt
> by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> NOTICE case you have here is a bigger problem.

Done.

> You really cannot
> modify the buffer like this and then decide, oops, never mind, I think
> I won't mark it dirty or write WAL for the changes. If you do that,
> the buffer is still in memory, but it's now been modified. A
> subsequent operation that modifies it will start with the altered
> state you created here, quite possibly leading to WAL that cannot be
> correctly replayed on the standby. In other words, you've got to
> decide for certain whether you want to proceed with the operation
> *before* you enter the critical section. You also need to emit any
> messages before or after the critical section.  So you could:
>

This is still not clear. I think Robert needs to respond to my earlier comment.

> I believe this violates our guidelines on message construction. Have
> two completely separate messages -- and maybe lose the word "already":
>
> "skipping tid (%u, %u) because it is dead"
> "skipping tid (%u, %u) because it is unused"
>
> The point of this is that it makes it easier for translators.
>

Done.

> I see very little point in what verify_tid() is doing. Before using
> each block number, we should check that it's less than or equal to a
> cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> any case to avoid funny errors; and then the check here against
> specifically InvalidBlockNumber is redundant. For the offset number,
> same thing: we need to check each offset against the page's
> PageGetMaxOffsetNumber(page); and if we do that then we don't need
> these checks.
>

Done.

Please check the attached patch for the changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> If the above path is taken that means none of the items in the page
> got changed.

Oops. I didn't realize that, sorry. Maybe it would be a little more
clear if instead of "int nSkippedItems" you had "bool
did_modify_page"? Then you could initialize it to false and set it to
true just before doing the page modifications.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).

I agree with removing the test case, but I disagree with restricting
this from being used on the TOAST table. These are tools for experts,
who may use them as they see fit. It's unlikely that it would be
useful to use this on a TOAST table, I think, but not impossible.

> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.

I don't think VACUUM will do anything for us automatically -- it isn't
going to know that we force-killed the tuple in the main table.
Normally, a tuple delete would have to set xmax on the TOAST tuples
and then VACUUM would do its thing, but in this case that won't
happen. So if you force-kill a tuple in the main table you would end
up with a space leak in the TOAST table.

The problem here is that one reason you might force-kill a tuple in
the main table is that it's full of garbage. If so, trying to
decode the tuple so that you can find the TOAST pointers might crash
or error out, or maybe that part will work but then you'll error out
trying to look up the corresponding TOAST tuples, either because the
values are not valid or because the TOAST table itself is generally
hosed in some way. So I think it is probably best if we keep this tool
as simple as possible, with as few dependencies as we can, and
document the possible negative outcomes of using it. It's not
impossible to recover from a space-leak like this; you can always move
the data into a new table with CTAS and then drop the old one. Not
sure whether CLUSTER or VACUUM FULL would also be sufficient.

Separately, we might want to add a TOAST-checker to amcheck, or
enhance the heap-checker Mark is working on, and one of the things it
could do is check for TOAST entries to which nothing points. Then if
you force-kill tuples in the main table you could also use that tool
to look for things in the TOAST table that ought to also be
force-killed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > If the above path is taken that means none of the items in the page
> > got changed.
>
> Oops. I didn't realize that, sorry. Maybe it would be a little more
> clear if instead of "int nSkippedItems" you had "bool
> did_modify_page"? Then you could initialize it to false and set it to
> true just before doing the page modifications.
>

Okay, np. In that case, as you suggested, I will replace "int
nSkippedItems" with "bool did_modify_page" to improve clarity. I will
make this change in the next version of the patch. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Yeah, it's being tested on the main table, not on a toast table. I've
> > removed this test-case and also restricted direct access to the toast
> > table using heap_force_kill/freeze functions. I think we shouldn't be
> > using these functions to do any changes in the toast table. We will
> > only use these functions with the main table and let VACUUM remove the
> > corresponding data chunks (pointed by the tuple that got removed from
> > the main table).
>
> I agree with removing the test case, but I disagree with restricting
> this from being used on the TOAST table. These are tools for experts,
> who may use them as they see fit. It's unlikely that it would be
> useful to use this on a TOAST table, I think, but not impossible.
>

Okay, if you want, I can remove the restriction on a toast table. But
then a user can directly remove data chunks from the toast table
without changing anything in the main table, which means querying the
main table would fail with an error like "ERROR:  unexpected chunk
number ...". So we would have to find some way to identify the pointer
to the chunks that were deleted from the toast table and remove that
pointer from the main table. We would also need to make sure that,
before we remove a tuple (pointer) from the main table, we identify
all the remaining data chunks pointed to by that tuple and remove them
completely; only then would the table be in a good state. Now, I am
not sure if we can currently do all these things.

> > Another option would be to identify all the data chunks corresponding
> > to the tuple (ctid) being killed from the main table and remove them
> > one by one. We will only do this if the tuple from the main table that
> > has been marked as killed has an external storage. We will have to add
> > a bunch of code for this otherwise we can let VACUUM do this for us.
> > Let me know your thoughts on this.
>
> I don't think VACUUM will do anything for us automatically -- it isn't
> going to know that we force-killed the tuple in the main table.
> Normally, a tuple delete would have to set xmax on the TOAST tuples
> and then VACUUM would do its thing, but in this case that won't
> happen. So if you force-kill a tuple in the main table you would end
> up with a space leak in the TOAST table.
>
> The problem here is that one reason you might force-kill a tuple in
> the main table is that it's full of garbage. If so, trying to
> decode the tuple so that you can find the TOAST pointers might crash
> or error out, or maybe that part will work but then you'll error out
> trying to look up the corresponding TOAST tuples, either because the
> values are not valid or because the TOAST table itself is generally
> hosed in some way. So I think it is probably best if we keep this tool
> as simple as possible, with as few dependencies as we can, and
> document the possible negative outcomes of using it.

I completely agree with you.

> It's not
> impossible to recover from a space-leak like this; you can always move
> the data into a new table with CTAS and then drop the old one. Not
> sure whether CLUSTER or VACUUM FULL would also be sufficient.
>

Yeah, I think, we can either use CTAS or VACUUM FULL, both look fine.

> Separately, we might want to add a TOAST-checker to amcheck, or
> enhance the heap-checker Mark is working on, and one of the things it
> could do is check for TOAST entries to which nothing points. Then if
> you force-kill tuples in the main table you could also use that tool
> to look for things in the TOAST table that ought to also be
> force-killed.
>

Okay, good to know that. Thanks for sharing this info.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,10000);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +-------------------
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.
>
> Done.
>
> > Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take RowExclusiveLock, since we are going to modify
> > rows. Not sure how much it matters, though.
> >
>
> I don't know how it would make a difference, but still as you said
> replaced AccessShare with RowExclusive.
>
> > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > + errmsg("must be superuser or object owner to use %s.",
> > + force_opt == FORCE_KILL ? "heap_force_kill()" :
> > + "heap_force_freeze()")));
> >
> > This is the wrong way to do a permissions check, and it's also the
> > wrong way to write an error message about having failed one. To see
> > the correct method, grep for pg_class_aclcheck. However, I think that
> > we shouldn't in general trust the object owner to do this, unless the
> > super-user gave permission. This is a data-corrupting operation, and
> > only the boss is allowed to authorize it. So I think you should also
> > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> > have this check as a backup. Then, the superuser is always allowed,
> > and if they choose to GRANT EXECUTE on this function to some users,
> > those users can do it for their own relations, but not others.
> >
>
> Done.
>
> > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > + errmsg("only heap AM is supported")));
> > +
> > + check_relation_relkind(rel);
> >
> > Seems like these checks are in the wrong order.
>
> I don't think there is anything wrong with the order. I can see the
> same order in other contrib modules as well.
>
> > Also, maybe you could
> > call the function something like check_relation_ok() and put the
> > permissions test, the relkind test, and the relam test all inside of
> > it, just to tighten up the code in this main function a bit.
> >
>
> Yeah, I've added a couple of functions named sanity_check_relation and
> sanity_check_tid_array and shifted all the sanity checks there.
>
> > + if (noffs > maxoffset)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("number of offsets specified for block %u exceeds the max
> > offset number %u",
> > + blkno, maxoffset)));
> >
> > Hmm, this doesn't seem quite right. The actual problem is if an
> > individual item pointer's offset number is greater than maxoffset,
> > which can be true even if the total number of offsets is less than
> > maxoffset. So I think you need to remove this check and add a check
> > inside the loop which follows that offnos[i] is in range.
> >
>
> Agreed and done.
>
> > The way you've structured that loop is actually problematic -- I don't
> > think we want to be calling elog() or ereport() inside a critical
> > section. You could fix the case that checks for an invalid force_opt
> > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> > NOTICE case you have here is a bigger problem.
>
> Done.
>
> > You really cannot
> > modify the buffer like this and then decide, oops, never mind, I think
> > I won't mark it dirty or write WAL for the changes. If you do that,
> > the buffer is still in memory, but it's now been modified. A
> > subsequent operation that modifies it will start with the altered
> > state you created here, quite possibly leading to WAL that cannot be
> > correctly replayed on the standby. In other words, you've got to
> > decide for certain whether you want to proceed with the operation
> > *before* you enter the critical section. You also need to emit any
> > messages before or after the critical section.  So you could:
> >
>
> This is still not clear. I think Robert needs to respond to my earlier comment.
>
> > I believe this violates our guidelines on message construction. Have
> > two completely separate messages -- and maybe lose the word "already":
> >
> > "skipping tid (%u, %u) because it is dead"
> > "skipping tid (%u, %u) because it is unused"
> >
> > The point of this is that it makes it easier for translators.
> >
>
> Done.
>
> > I see very little point in what verify_tid() is doing. Before using
> > each block number, we should check that it's less than or equal to a
> > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> > any case to avoid funny errors; and then the check here against
> > specifically InvalidBlockNumber is redundant. For the offset number,
> > same thing: we need to check each offset against the page's
> > PageGetMaxOffsetNumber(page); and if we do that then we don't need
> > these checks.
> >
>
> Done.
>
> Please check the attached patch for the changes.

I also looked at this version of the patch and have some small comments:

+   Oid             relid = PG_GETARG_OID(0);
+   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+   ItemPointer     tids;
+   int             ntids;
+   Relation        rel;
+   Buffer          buf;
+   Page            page;
+   ItemId          itemid;
+   BlockNumber     blkno;
+   OffsetNumber   *offnos;
+   OffsetNumber    offno,
+                   noffs,
+                   curr_start_ptr,
+                   next_start_ptr,
+                   maxoffset;
+   int             i,
+                   nskippedItems,
+                   nblocks;

You declare all variables at the top of the heap_force_common()
function, but I think we can declare some variables, such as buf and
page, inside the do loop.

---
+           if (offnos[i] > maxoffset)
+           {
+               ereport(NOTICE,
+                        errmsg("skipping tid (%u, %u) because it
contains an invalid offset",
+                               blkno, offnos[i]));
+               continue;
+           }

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.

---
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);

I think we need to set the LSN returned by log_newpage_buffer() as the page LSN.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Rajkumar Raghuwanshi
Date:

I have been doing some user-level testing of this feature, apart from sanity tests for the extension and its functions.

I was able to corrupt tuples and then fix the table using the heap_force_freeze/kill functions.


--corrupt relfrozenxid, causing vacuum to fail.
update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
UPDATE 1
insert into test_tbl values (2, 'BBB');

postgres=# vacuum test_tbl;
ERROR:  found xmin 507 from before relfrozenxid 516
CONTEXT:  while scanning block 0 of relation "public.test_tbl"

postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |  507 |    0
(2 rows)

--fixed using heap_force_freeze
postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
 heap_force_freeze
-------------------

(1 row)

postgres=# vacuum test_tbl;
VACUUM
postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |    2 |    0
(2 rows)


--corrupt table headers in base/oid. file, causing table access to fail.
postgres=# select ctid, * from test_tbl;
ERROR:  could not access status of transaction 4294967295
DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.

--removed corrupted tuple using heap_force_kill
postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
 heap_force_kill
-----------------

(1 row)

postgres=# select ctid, * from test_tbl;
 ctid  | a |  b
-------+---+-----
 (0,1) | 1 | AAA
(1 row)


I will continue my testing with the latest patch and update here if I find anything.


Thanks & Regards,
Rajkumar Raghuwanshi


On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hello Masahiko-san,

Thanks for looking into the patch. Please find my comments inline below:

On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please check the attached patch for the changes.
>
> I also looked at this version patch and have some small comments:
>
> +   Oid             relid = PG_GETARG_OID(0);
> +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> +   ItemPointer     tids;
> +   int             ntids;
> +   Relation        rel;
> +   Buffer          buf;
> +   Page            page;
> +   ItemId          itemid;
> +   BlockNumber     blkno;
> +   OffsetNumber   *offnos;
> +   OffsetNumber    offno,
> +                   noffs,
> +                   curr_start_ptr,
> +                   next_start_ptr,
> +                   maxoffset;
> +   int             i,
> +                   nskippedItems,
> +                   nblocks;
>
> You declare all variables at the top of heap_force_common() function
> but I think we can declare some variables such as buf, page inside of
> the do loop.
>

Sure, I will do this in the next version of the patch.

> ---
> +           if (offnos[i] > maxoffset)
> +           {
> +               ereport(NOTICE,
> +                        errmsg("skipping tid (%u, %u) because it
> contains an invalid offset",
> +                               blkno, offnos[i]));
> +               continue;
> +           }
>
> If all tids on a page take the above path, we will end up logging FPI
> in spite of modifying nothing on the page.
>

Yeah, that's right. I've taken care of this in the new version of the
patch, which I have yet to share.

> ---
> +       /* XLOG stuff */
> +       if (RelationNeedsWAL(rel))
> +           log_newpage_buffer(buf, true);
>
> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
>

I think we are already setting the page LSN in log_newpage(), which is
called from log_newpage_buffer(). So, AFAIU, no change would be
required here. Please let me know if I am missing something.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> Thanks for looking into the patch. Please find my comments inline below:
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Please check the attached patch for the changes.
> >
> > I also looked at this version patch and have some small comments:
> >
> > +   Oid             relid = PG_GETARG_OID(0);
> > +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > +   ItemPointer     tids;
> > +   int             ntids;
> > +   Relation        rel;
> > +   Buffer          buf;
> > +   Page            page;
> > +   ItemId          itemid;
> > +   BlockNumber     blkno;
> > +   OffsetNumber   *offnos;
> > +   OffsetNumber    offno,
> > +                   noffs,
> > +                   curr_start_ptr,
> > +                   next_start_ptr,
> > +                   maxoffset;
> > +   int             i,
> > +                   nskippedItems,
> > +                   nblocks;
> >
> > You declare all variables at the top of heap_force_common() function
> > but I think we can declare some variables such as buf, page inside of
> > the do loop.
> >
>
> Sure, I will do this in the next version of patch.
>
> > ---
> > +           if (offnos[i] > maxoffset)
> > +           {
> > +               ereport(NOTICE,
> > +                        errmsg("skipping tid (%u, %u) because it
> > contains an invalid offset",
> > +                               blkno, offnos[i]));
> > +               continue;
> > +           }
> >
> > If all tids on a page take the above path, we will end up logging FPI
> > in spite of modifying nothing on the page.
> >
>
> Yeah, that's right. I've taken care of this in the new version of
> patch which I am yet to share.
>
> > ---
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> >
> > I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
> >
>
> I think we are already setting the page lsn in the log_newpage() which
> is being called from log_newpage_buffer(). So, AFAIU, no change would
> be required here. Please let me know if I am missing something.

You're right. I'd missed the comment on log_newpage_buffer(). Thanks!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
The attached v4 patch addresses the latest comments from Robert and Masahiko-san.

Changes:
1) Allow the heap_force_kill and heap_force_freeze functions to be used
with toast tables.
2) Replace "int nskippedItems" with a "bool did_modify_page" flag to
track whether any modification happened on the page.
3) Declare some variables, such as buf and page, inside the do loop
instead of at the top of the heap_force_common function.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for looking into the patch. Please find my comments inline below:
> >
> > On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > Please check the attached patch for the changes.
> > >
> > > I also looked at this version patch and have some small comments:
> > >
> > > +   Oid             relid = PG_GETARG_OID(0);
> > > +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > > +   ItemPointer     tids;
> > > +   int             ntids;
> > > +   Relation        rel;
> > > +   Buffer          buf;
> > > +   Page            page;
> > > +   ItemId          itemid;
> > > +   BlockNumber     blkno;
> > > +   OffsetNumber   *offnos;
> > > +   OffsetNumber    offno,
> > > +                   noffs,
> > > +                   curr_start_ptr,
> > > +                   next_start_ptr,
> > > +                   maxoffset;
> > > +   int             i,
> > > +                   nskippedItems,
> > > +                   nblocks;
> > >
> > > You declare all variables at the top of heap_force_common() function
> > > but I think we can declare some variables such as buf, page inside of
> > > the do loop.
> > >
> >
> > Sure, I will do this in the next version of patch.
> >
> > > ---
> > > +           if (offnos[i] > maxoffset)
> > > +           {
> > > +               ereport(NOTICE,
> > > +                        errmsg("skipping tid (%u, %u) because it
> > > contains an invalid offset",
> > > +                               blkno, offnos[i]));
> > > +               continue;
> > > +           }
> > >
> > > If all tids on a page take the above path, we will end up logging FPI
> > > in spite of modifying nothing on the page.
> > >
> >
> > Yeah, that's right. I've taken care of this in the new version of
> > patch which I am yet to share.
> >
> > > ---
> > > +       /* XLOG stuff */
> > > +       if (RelationNeedsWAL(rel))
> > > +           log_newpage_buffer(buf, true);
> > >
> > > I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
> > >
> >
> > I think we are already setting the page lsn in the log_newpage() which
> > is being called from log_newpage_buffer(). So, AFAIU, no change would
> > be required here. Please let me know if I am missing something.
>
> You're right. I'd missed the comment of log_newpage_buffer(). Thanks!
>
> Regards,
>
> --
> Masahiko Sawada            http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Okay, If you want I can remove the restriction on a toast table, but,
> then that means a user can directly remove the data chunks from the
> toast table without changing anything in the main table. This means we
> won't be able to query the main table as it will fail with an error
> like "ERROR:  unexpected chunk number ...". So, we will have to find
> some way to identify the pointer to the chunks that got deleted from
> the toast table and remove that pointer from the main table. We also
> need to make sure that before we remove a tuple (pointer) from the
> main table, we identify all the remaining data chunks pointed by this
> tuple and remove them completely only then that table would be
> considered to be in a good state. Now, I am not sure if we can
> currently do all these things.

That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Okay, If you want I can remove the restriction on a toast table, but,
> > then that means a user can directly remove the data chunks from the
> > toast table without changing anything in the main table. This means we
> > won't be able to query the main table as it will fail with an error
> > like "ERROR:  unexpected chunk number ...". So, we will have to find
> > some way to identify the pointer to the chunks that got deleted from
> > the toast table and remove that pointer from the main table. We also
> > need to make sure that before we remove a tuple (pointer) from the
> > main table, we identify all the remaining data chunks pointed by this
> > tuple and remove them completely only then that table would be
> > considered to be in a good state. Now, I am not sure if we can
> > currently do all these things.
>
> That's the user's problem. If they don't have a plan for that, they
> shouldn't use this tool. I don't think we can or should try to handle
> it in this code.
>

Okay, thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Thanks, Rajkumar, for testing the patch.

Here are some additional test-cases that I would suggest you execute,
if possible:

1) Try running the test-cases you have executed so far with a streaming
replication setup and check whether the changes are reflected on the
standby.

2) Try running some concurrent test-cases, e.g. run these functions in
parallel with VACUUM or other SQL commands (preferably DML commands).

3) See what happens when you pass invalid tids (containing an invalid
block or offset number) to these functions. Also try running these
functions on the same tuple repeatedly and observe the behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> I have been doing some user-level testing of this feature, apart from sanity tests for the extension and its functions.
>
> I have tried to corrupt tuples and was then able to fix them using the heap_force_freeze/kill functions.
>
> --corrupt relfrozenxid, causing vacuum to fail.
> update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
> UPDATE 1
> insert into test_tbl values (2, 'BBB');
>
> postgres=# vacuum test_tbl;
> ERROR:  found xmin 507 from before relfrozenxid 516
> CONTEXT:  while scanning block 0 of relation "public.test_tbl"
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>  a |  b  | ctid  | xmin | xmax
> ---+-----+-------+------+------
>  1 | AAA | (0,1) |  505 |    0
>  2 | BBB | (0,2) |  507 |    0
> (2 rows)
>
> --fixed using heap_force_freeze
> postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>  heap_force_freeze
> -------------------
>
> (1 row)
>
> postgres=# vacuum test_tbl;
> VACUUM
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>  a |  b  | ctid  | xmin | xmax
> ---+-----+-------+------+------
>  1 | AAA | (0,1) |  505 |    0
>  2 | BBB | (0,2) |    2 |    0
> (2 rows)
>
> --corrupt table headers in base/oid. file, causing table access to fail.
> postgres=# select ctid, * from test_tbl;
> ERROR:  could not access status of transaction 4294967295
> DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.
>
> --removed corrupted tuple using heap_force_kill
> postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>  heap_force_kill
> -----------------
>
> (1 row)
>
> postgres=# select ctid, * from test_tbl;
>  ctid  | a |  b
> -------+---+-----
>  (0,1) | 1 | AAA
> (1 row)
>
> I will be continuing my testing with the latest patch and will update here if I find anything.
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for the review. Please find my comments inline:
>> >
>> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> > >
>> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> > > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
>> > >
>> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>> > >
>> > > the -> a
>> > >
>> > > I also suggest: heap table -> relation, because we might want to
>> > > extend this to other cases later.
>> > >
>> >
>> > Corrected.
>> >
>> > > +-- toast table.
>> > > +begin;
>> > > +create table ttab(a text);
>> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
>> > > 65), '') from generate_series(1,10000);
>> > > +select * from ttab where xmin = 2;
>> > > + a
>> > > +---
>> > > +(0 rows)
>> > > +
>> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
>> > > + heap_force_freeze
>> > > +-------------------
>> > > +
>> > > +(1 row)
>> > > +
>> > >
>> > > I don't understand the point of this. You're not testing the function
>> > > on the TOAST table; you're testing it on the main table when there
>> > > happens to be a TOAST table that is probably getting used for
>> > > something. But that's not really relevant to what is being tested
>> > > here, so as written this seems redundant with the previous cases.
>> > >
>> >
>> > Yeah, it's being tested on the main table, not on a toast table. I've
>> > removed this test-case and also restricted direct access to the toast
>> > table using heap_force_kill/freeze functions. I think we shouldn't be
>> > using these functions to do any changes in the toast table. We will
>> > only use these functions with the main table and let VACUUM remove the
>> > corresponding data chunks (pointed by the tuple that got removed from
>> > the main table).
>> >
>> > Another option would be to identify all the data chunks corresponding
>> > to the tuple (ctid) being killed from the main table and remove them
>> > one by one. We will only do this if the tuple from the main table that
>> > has been marked as killed has an external storage. We will have to add
>> > a bunch of code for this otherwise we can let VACUUM do this for us.
>> > Let me know your thoughts on this.
>> >
>> > > +-- test pg_surgery functions with the unsupported relations. Should fail.
>> > >
>> > > Please name the specific functions being tested here in case we add
>> > > more in the future that are tested separately.
>> > >
>> >
>> > Done.
>> >
>> > > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>> > >
>> > > I think we could drop _funcs from the file name.
>> > >
>> >
>> > Done.
>> >
>> > > +#ifdef PG_MODULE_MAGIC
>> > > +PG_MODULE_MAGIC;
>> > > +#endif
>> > >
>> > > The #ifdef here is not required, and if you look at other contrib
>> > > modules you'll see that they don't have it.
>> > >
>> >
>> > Okay, done.
>> >
>> > > I don't like all the macros at the top of the file much. CHECKARRVALID
>> > > is only used in one place, so it seems to me that you might as well
>> > > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>> > >
>> >
>> > Done.
>> >
>> > > Once you do that, heap_force_common() can just compute the number of
>> > > array elements once, instead of doing it once inside ARRISEMPTY, then
>> > > again inside SORT, and then a third time to initialize ntids. You can
>> > > just have a local variable in that function that is set once and then
>> > > used as needed. Then you are only doing ARRNELEMS once, so you can get
>> > > rid of that macro too. The same technique can be used to get rid of
>> > > ARRPTR. So then all the macros go away without introducing any code
>> > > duplication.
>> > >
>> >
>> > Done.
>> >
>> > > +/* Options to forcefully change the state of a heap tuple. */
>> > > +typedef enum HTupleForceOption
>> > > +{
>> > > + FORCE_KILL,
>> > > + FORCE_FREEZE
>> > > +} HTupleForceOption;
>> > >
>> > > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
>> > > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.
>> >
>> > Done.
>> >
>> > Also, how about
>> > > option -> operation?
>> > >
>> >
>> > I think both look okay to me.
>> >
>> > > + return heap_force_common(fcinfo, FORCE_KILL);
>> > >
>> > > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
>> > > know it's the same thing, though, and perhaps I'm even wrong about the
>> > > prevailing style.
>> > >
>> >
>> > Done.
>> >
>> > > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>> > >
>> > > I think this is unnecessary. It's an enum with 2 values.
>> > >
>> >
>> > Removed.
>> >
>> > > + if (ARRISEMPTY(ta))
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("empty tid array")));
>> > >
>> > > I don't see why this should be an error. Why can't we just continue
>> > > normally and if it does nothing, it does nothing? You'd need to change
>> > > the do..while loop to a while loop so that the end condition is tested
>> > > at the top, but that seems fine.
>> > >
>> >
>> > I think it's okay to have this check. So, just left it as-is. We do
>> > have such checks in other contrib modules as well wherever the array
>> > is being passed as an input to the function.
>> >
>> > > + rel = relation_open(relid, AccessShareLock);
>> > >
>> > > Maybe we should take RowExclusiveLock, since we are going to modify
>> > > rows. Not sure how much it matters, though.
>> > >
>> >
>> > I don't know how it would make a difference, but still as you said
>> > replaced AccessShare with RowExclusive.
>> >
>> > > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>> > > + errmsg("must be superuser or object owner to use %s.",
>> > > + force_opt == FORCE_KILL ? "heap_force_kill()" :
>> > > + "heap_force_freeze()")));
>> > >
>> > > This is the wrong way to do a permissions check, and it's also the
>> > > wrong way to write an error message about having failed one. To see
>> > > the correct method, grep for pg_class_aclcheck. However, I think that
>> > > we shouldn't in general trust the object owner to do this, unless the
>> > > super-user gave permission. This is a data-corrupting operation, and
>> > > only the boss is allowed to authorize it. So I think you should also
>> > > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
>> > > have this check as a backup. Then, the superuser is always allowed,
>> > > and if they choose to GRANT EXECUTE on this function to some users,
>> > > those users can do it for their own relations, but not others.
>> > >
>> >
>> > Done.
>> >
>> > > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> > > + errmsg("only heap AM is supported")));
>> > > +
>> > > + check_relation_relkind(rel);
>> > >
>> > > Seems like these checks are in the wrong order.
>> >
>> > I don't think there is anything wrong with the order. I can see the
>> > same order in other contrib modules as well.
>> >
>> > Also, maybe you could
>> > > call the function something like check_relation_ok() and put the
>> > > permissions test, the relkind test, and the relam test all inside of
>> > > it, just to tighten up the code in this main function a bit.
>> > >
>> >
>> > Yeah, I've added a couple of functions named sanity_check_relation and
>> > sanity_check_tid_array and shifted all the sanity checks there.
>> >
>> > > + if (noffs > maxoffset)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("number of offsets specified for block %u exceeds the max
>> > > offset number %u",
>> > > + blkno, maxoffset)));
>> > >
>> > > Hmm, this doesn't seem quite right. The actual problem is if an
>> > > individual item pointer's offset number is greater than maxoffset,
>> > > which can be true even if the total number of offsets is less than
>> > > maxoffset. So I think you need to remove this check and add a check
>> > > inside the loop which follows that offnos[i] is in range.
>> > >
>> >
>> > Agreed and done.
>> >
>> > > The way you've structured that loop is actually problematic -- I don't
>> > > think we want to be calling elog() or ereport() inside a critical
>> > > section. You could fix the case that checks for an invalid force_opt
>> > > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
>> > > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
>> > > NOTICE case you have here is a bigger problem.
>> >
>> > Done.
>> >
>> > You really cannot
>> > > modify the buffer like this and then decide, oops, never mind, I think
>> > > I won't mark it dirty or write WAL for the changes. If you do that,
>> > > the buffer is still in memory, but it's now been modified. A
>> > > subsequent operation that modifies it will start with the altered
>> > > state you created here, quite possibly leading to WAL that cannot be
>> > > correctly replayed on the standby. In other words, you've got to
>> > > decide for certain whether you want to proceed with the operation
>> > > *before* you enter the critical section. You also need to emit any
>> > > messages before or after the critical section.  So you could:
>> > >
>> >
>> > This is still not clear. I think Robert needs to respond to my earlier comment.
>> >
>> > > I believe this violates our guidelines on message construction. Have
>> > > two completely separate messages -- and maybe lose the word "already":
>> > >
>> > > "skipping tid (%u, %u) because it is dead"
>> > > "skipping tid (%u, %u) because it is unused"
>> > >
>> > > The point of this is that it makes it easier for translators.
>> > >
>> >
>> > Done.
>> >
>> > > I see very little point in what verify_tid() is doing. Before using
>> > > each block number, we should check that it's less than or equal to a
>> > > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
>> > > any case to avoid funny errors; and then the check here against
>> > > specifically InvalidBlockNumber is redundant. For the offset number,
>> > > same thing: we need to check each offset against the page's
>> > > PageGetMaxOffsetNumber(page); and if we do that then we don't need
>> > > these checks.
>> > >
>> >
>> > Done.
>> >
>> > Please check the attached patch for the changes.
>>
>> I also looked at this version patch and have some small comments:
>>
>> +   Oid             relid = PG_GETARG_OID(0);
>> +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
>> +   ItemPointer     tids;
>> +   int             ntids;
>> +   Relation        rel;
>> +   Buffer          buf;
>> +   Page            page;
>> +   ItemId          itemid;
>> +   BlockNumber     blkno;
>> +   OffsetNumber   *offnos;
>> +   OffsetNumber    offno,
>> +                   noffs,
>> +                   curr_start_ptr,
>> +                   next_start_ptr,
>> +                   maxoffset;
>> +   int             i,
>> +                   nskippedItems,
>> +                   nblocks;
>>
>> You declare all variables at the top of heap_force_common() function
>> but I think we can declare some variables such as buf, page inside of
>> the do loop.
>>
>> ---
>> +           if (offnos[i] > maxoffset)
>> +           {
>> +               ereport(NOTICE,
>> +                        errmsg("skipping tid (%u, %u) because it
>> contains an invalid offset",
>> +                               blkno, offnos[i]));
>> +               continue;
>> +           }
>>
>> If all tids on a page take the above path, we will end up logging FPI
>> in spite of modifying nothing on the page.
>>
>> ---
>> +       /* XLOG stuff */
>> +       if (RelationNeedsWAL(rel))
>> +           log_newpage_buffer(buf, true);
>>
>> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
>>
>> Regards,
>>
>> --
>> Masahiko Sawada            http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>>



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Compiler warning:

heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
                if (blkno < 0 || blkno >= nblocks)
                    ~~~~~ ^ ~

There's a certain inconsistency to these messages:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE:  skipping tid (0, 2) because it contains an invalid offset
 heap_force_kill
-----------------

(1 row)

rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR:  invalid item pointer
LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR:  block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead

I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.

I think the relkind sanity check should permit RELKIND_MATVIEW also.

It's unclear to me why the freeze logic here shouldn't do this part of
what heap_prepare_freeze_tuple() does when freezing xmax:

        frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
        frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;

Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -----------------
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Thank you for your suggestions. To make this consistent, I am planning
to make the following changes:

Remove the "invalid item pointer" error message from
tids_same_page_fetch_offnums() and expand the if-check that detects
invalid offset numbers in the critical section so that it not only
checks whether the offset number is greater than maxoffset, but also
whether it is equal to 0. That way it would also do the job that the
"if (!ItemPointerIsValid)" check was doing for us.

Further, if an invalid block number is detected, I am planning to skip
all the tids associated with that block and move on to the next block.
Hence, instead of reporting an error, I would report a NOTICE message
to the user.

The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:

NOTICE:  skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused

Please let me know whether you are okay with the above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> The other two messages for reporting unused items and dead items
> remain the same. Hence, with above change, we would be reporting the
> following 4 messages:
>
> NOTICE:  skipping all the tids in block %u for relation "%s" because
> the block number is out of range
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> number is out of range for this block
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
>
> Please let me know if you are okay with the above changes or not?

That seems broadly reasonable, but I would suggest phrasing the first
message like this:

skipping block %u for relation "%s" because the block number is out of range

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > The other two messages for reporting unused items and dead items
> > remain the same. Hence, with above change, we would be reporting the
> > following 4 messages:
> >
> > NOTICE:  skipping all the tids in block %u for relation "%s" because
> > the block number is out of range
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> > number is out of range for this block
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
> >
> > Please let me know if you are okay with the above changes or not?
>
> That seems broadly reasonable, but I would suggest phrasing the first
> message like this:
>
> skipping block %u for relation "%s" because the block number is out of range
>

Okay, thanks for the confirmation. I'll do that.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Thanks Robert for the review. Please find my comments inline below:

On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
>
> Compiler warning:
>
> heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> always false [-Werror,-Wtautological-compare]
>                 if (blkno < 0 || blkno >= nblocks)
>                     ~~~~~ ^ ~
>

Fixed.

> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -----------------
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Corrected.

> I think I misled you when I said to use pg_class_aclcheck. I think it
> should actually be pg_class_ownercheck.
>

okay, I've changed it to pg_class_ownercheck.

> I think the relkind sanity check should permit RELKIND_MATVIEW also.
>

Yeah, actually we should allow MATVIEW; I don't know why I thought of
blocking it earlier.

> It's unclear to me why the freeze logic here shouldn't do this part
> what heap_prepare_freeze_tuple() does when freezing xmax:
>
>         frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
>         frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>

Yeah, we should have these changes when freezing the xmax.

> Likewise, why should we not freeze or invalidate xvac in the case
> where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> would do?
>

Again, we should have this as well.

Apart from the above, this time I've also added documentation for the
pg_surgery module and added a few more test-cases.

Attached is the patch with the above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Asim Praveen
Date:
Hi Ashutosh

I stumbled upon this thread today, went through your patch and it looks good.  A minor suggestion in
sanity_check_relation():

    if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("only heap AM is supported")));

Instead of checking the access method OID, it seems better to check the handler OID like so:

    if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)

The reason is that the current version of sanity_check_relation() would
emit an error for the following case even when the table structure is
actually heap:

    create access method myam type table handler heap_tableam_handler;
    create table mytable (…) using myam;

Asim

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Asim,

Thanks for having a look into the patch and for sharing your feedback.
Please find my comments inline below:

On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen <pasim@vmware.com> wrote:
>
> Hi Ashutosh
>
> I stumbled upon this thread today, went through your patch and it looks good.  A minor suggestion in
> sanity_check_relation():
>
>         if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                                  errmsg("only heap AM is supported")));
>
> Instead of checking the access method OID, it seems better to check the handler OID like so:
>
>         if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
>
> The reason is that the current version of sanity_check_relation() would emit an error for the following case even
> when the table structure is actually heap:
>
>         create access method myam type table handler heap_tableam_handler;
>         create table mytable (…) using myam;
>

This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Thu, Aug 13, 2020 at 3:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> This looks like a very good suggestion to me. I will do this change in
> the next version. Just wondering if we should be doing similar changes
> in other contrib modules (like pgrowlocks, pageinspect and
> pgstattuple) as well?

It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thanks Robert for the review. Please find my comments inline below:
>
> On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
> >
> > Compiler warning:
> >
> > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> > always false [-Werror,-Wtautological-compare]
> >                 if (blkno < 0 || blkno >= nblocks)
> >                     ~~~~~ ^ ~
> >
>
> Fixed.
>
> > There's a certain inconsistency to these messages:
> >
> > rhaas=# create table foo (a int);
> > CREATE TABLE
> > rhaas=# insert into foo values (1);
> > INSERT 0 1
> > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> > NOTICE:  skipping tid (0, 2) because it contains an invalid offset
> >  heap_force_kill
> > -----------------
> >
> > (1 row)
> >
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> > ERROR:  invalid item pointer
> > LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> > ERROR:  block number 1 is out of range for relation "foo"
> >
> > From a user perspective it seems like I've made three very similar
> > mistakes: in the first case, the offset is too high, in the second
> > case it's too low, and in the third case the block number is out of
> > range. But in one case I get a NOTICE and in the other two cases I get
> > an ERROR. In one case I get the relation name and in the other two
> > cases I don't. The two complaints about an invalid offset are phrased
> > completely differently from each other. For example, suppose you do
> > this:
> >
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> > number is out of range (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> > number is out of range for this block (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
> >
>
> Corrected.
>
> > I think I misled you when I said to use pg_class_aclcheck. I think it
> > should actually be pg_class_ownercheck.
> >
>
> okay, I've changed it to pg_class_ownercheck.
>
> > I think the relkind sanity check should permit RELKIND_MATVIEW also.
> >
>
> Yeah, actually we should allow MATVIEW, don't know why I thought of
> blocking it earlier.
>
> > It's unclear to me why the freeze logic here shouldn't do this part
> > what heap_prepare_freeze_tuple() does when freezing xmax:
> >
> >         frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
> >         frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
> >
>
> Yeah, we should have these changes when freezing the xmax.
>
> > Likewise, why should we not freeze or invalidate xvac in the case
> > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> > would do?
> >
>
> Again, we should have this as well.
>
> Apart from above, this time I've also added the documentation on
> pg_surgery module and added a few more test-cases.
>
> Attached patch with above changes.
>

Thank you for updating the patch! Here are my comments on v5 patch:

--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
        pg_standby  \
        pg_stat_statements \
        pg_trgm     \
+       pg_surgery  \
        pgcrypto    \

I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.

---
+           if (heap_force_opt == HEAP_FORCE_KILL)
+               ItemIdSetDead(itemid);

I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.

---
+       /*
+        * We do not mark the buffer dirty or do WAL logging for unmodifed
+        * pages.
+        */
+       if (!did_modify_page)
+           goto skip_wal;
+
+       /* Mark buffer dirty before we write WAL. */
+       MarkBufferDirty(buf);
+
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);
+
+skip_wal:
+       END_CRIT_SECTION();
+

s/unmodifed/unmodified/

Do we really need to use goto? I think we can modify it like follows:

    if (did_modify_page)
    {
       /* Mark buffer dirty before we write WAL. */
       MarkBufferDirty(buf);

       /* XLOG stuff */
       if (RelationNeedsWAL(rel))
           log_newpage_buffer(buf, true);
    }

    END_CRIT_SECTION();

---
pg_force_freeze() can revive a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.

set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);

update tbl set a = a + 100 where a = 1;

explain analyze select * from tbl where a < 200;

-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);

-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;

-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;

Also, if a tuple that was updated and moved to another partition is
revived by heap_force_freeze(), its ctid still has the special values
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either resetting the ctid
value as well or not freezing already-deleted tuples.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hello Masahiko-san,

Thanks for the review. Please check the comments inline below:

On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

> Thank you for updating the patch! Here are my comments on v5 patch:
>
> --- a/contrib/Makefile
> +++ b/contrib/Makefile
> @@ -35,6 +35,7 @@ SUBDIRS = \
>         pg_standby  \
>         pg_stat_statements \
>         pg_trgm     \
> +       pg_surgery  \
>         pgcrypto    \
>
> I guess we use alphabetical order here. So pg_surgery should be placed
> before pg_trgm.
>

Okay, will take care of this in the next version of patch.

> ---
> +           if (heap_force_opt == HEAP_FORCE_KILL)
> +               ItemIdSetDead(itemid);
>
> I think that if the page is an all-visible page, we should clear an
> all-visible bit on the visibility map corresponding to the page and
> PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> return the wrong results.
>

I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by a VACUUM with the DISABLE_PAGE_SKIPPING option on that
relation. This is necessary to bring a damaged relation back to a sane
state once surgery has been performed on it. I will try to add this
note in the documentation for this module.

> ---
> +       /*
> +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> +        * pages.
> +        */
> +       if (!did_modify_page)
> +           goto skip_wal;
> +
> +       /* Mark buffer dirty before we write WAL. */
> +       MarkBufferDirty(buf);
> +
> +       /* XLOG stuff */
> +       if (RelationNeedsWAL(rel))
> +           log_newpage_buffer(buf, true);
> +
> +skip_wal:
> +       END_CRIT_SECTION();
> +
>
> s/unmodifed/unmodified/
>

okay, will fix this typo.

> Do we really need to use goto? I think we can modify it like follows:
>
>     if (did_modify_page)
>     {
>        /* Mark buffer dirty before we write WAL. */
>        MarkBufferDirty(buf);
>
>        /* XLOG stuff */
>        if (RelationNeedsWAL(rel))
>            log_newpage_buffer(buf, true);
>     }
>
>     END_CRIT_SECTION();
>

No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.

> ---
> pg_force_freeze() can revive a tuple that is already deleted but not
> vacuumed yet. Therefore, the user might need to reindex indexes after
> using that function. For instance, with the following script, the last
> two queries: index scan and seq scan, will return different results.
>
> set enable_seqscan to off;
> set enable_bitmapscan to off;
> set enable_indexonlyscan to off;
> create table tbl (a int primary key);
> insert into tbl values (1);
>
> update tbl set a = a + 100 where a = 1;
>
> explain analyze select * from tbl where a < 200;
>
> -- revive deleted tuple on heap
> select heap_force_freeze('tbl', array['(0,1)'::tid]);
>
> -- index scan returns 2 tuples
> explain analyze select * from tbl where a < 200;
>
> -- seq scan returns 1 tuple
> set enable_seqscan to on;
> explain analyze select * from tbl;
>

I am not sure if this is the right use-case for the pg_force_freeze
function. I think we should only be running pg_force_freeze on a
tuple for which VACUUM reports a "found xmin ABC from before
relfrozenxid PQR" sort of error; otherwise it might make things worse
instead of better. Now, the question is: can VACUUM report this type
of error for a deleted tuple, or would it only report it for a live
tuple? AFAIU, this won't be reported for deleted tuples because
VACUUM wouldn't consider freezing a tuple that has been deleted.

> Also, if a tuple updated and moved to another partition is revived by
> heap_force_freeze(), its ctid still has special values:
> MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> see a problem yet caused by a visible tuple having the special ctid
> value, but it might be worth considering either to reset ctid value as
> well or to not freezing already-deleted tuple.
>

For this as well, the answer remains the same as above.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hello Masahiko-san,

I've spent some more time trying to understand the code in the
lazy_scan_heap function to know under what circumstances VACUUM can
fail with a "found xmin ... before relfrozenxid ..." error for a tuple
whose xmin is behind relfrozenxid. Here are my observations:

1) It can fail with this error for a live tuple

OR,

2) It can also fail with this error if a tuple (that went through
update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.

OR,

3) If there are any concurrent transactions, then the tuple might be
marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
this error.

Now, AFAIU, as we will be dealing with a damaged table, point #3 being
the cause of this error looks impossible in our case, because I don't
think we will be doing anything in parallel when performing surgery on
a damaged table; in fact, we shouldn't be doing any such thing.
However, it is quite possible that reason #2 could cause VACUUM to
fail with this sort of error, but, as we are already skipping
redirected item pointers in heap_force_common(), I think we would
never be marking a HEAP_HOT_UPDATED tuple as frozen, and I don't see
any problem in marking a HEAP_ONLY_TUPLE as frozen. So, probably, we
may not need to handle point #2 either.

Further, I also don't see VACUUM reporting this error for a tuple that
has been moved from one partition to another. So, I think we might not
need to do any special handling for a tuple that got updated and its
new version was moved to another partition.

If you feel I am missing something here, please correct me. Thank you.

Moreover, while I was exploring the above, I noticed that in
lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum(), we check
for redirected item pointers, and if one is detected we do not call
HeapTupleSatisfiesVacuum(). So, I am not sure how
HeapTupleSatisfiesVacuum() would ever return a dead tuple that is
marked with HEAP_HOT_UPDATED. I am referring to the following code in
lazy_scan_heap():

        for (offnum = FirstOffsetNumber;
             offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid;

            itemid = PageGetItemId(page, offnum);

.............
.............


            /* Redirect items mustn't be touched */ <-- this check
would bypass the redirected item pointers from being checked for
HeapTupleSatisfiesVacuum.
            if (ItemIdIsRedirected(itemid))
            {
                hastup = true;  /* this page won't be truncatable */
                continue;
            }

..............
..............

            switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
            {
                case HEAPTUPLE_DEAD:

                    if (HeapTupleIsHotUpdated(&tuple) ||
                        HeapTupleIsHeapOnly(&tuple) ||
                        params->index_cleanup == VACOPT_TERNARY_DISABLED)
                        nkeep += 1;
                    else
                        tupgone = true; /* we can delete the tuple */
..............
..............
             }


So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> >         pg_standby  \
> >         pg_stat_statements \
> >         pg_trgm     \
> > +       pg_surgery  \
> >         pgcrypto    \
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +       /*
> > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > +        * pages.
> > +        */
> > +       if (!did_modify_page)
> > +           goto skip_wal;
> > +
> > +       /* Mark buffer dirty before we write WAL. */
> > +       MarkBufferDirty(buf);
> > +
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +       END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> >     if (did_modify_page)
> >     {
> >        /* Mark buffer dirty before we write WAL. */
> >        MarkBufferDirty(buf);
> >
> >        /* XLOG stuff */
> >        if (RelationNeedsWAL(rel))
> >            log_newpage_buffer(buf, true);
> >     }
> >
> >     END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > pg_force_freeze() can revive a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case of pg_force_freeze
> function. I think we should only be running pg_force_freeze function
> on a tuple for which VACUUM reports "found xmin ABC from before
> relfrozenxid PQR" sort of error otherwise it might worsen the things
> instead of making it better. Now, the question is - can VACUUM report
> this type of error for a deleted tuple or it would only report it for
> a live tuple? AFAIU this won't be reported for the deleted tuples
> because VACUUM wouldn't consider freezing a tuple that has been
> deleted.
>
> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Attached is a new version of the patch that addresses the comments
from Asim Praveen and Masahiko-san. It also improves the documentation
to some extent.


On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid. Here are my observations:
>
> 1) It can fail with this error for a live tuple
>
> OR,
>
> 2) It can also fail with this error if a tuple (that went through
> update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
>
> OR,
>
> 3) If there are any concurrent transactions, then the tuple might be
> marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
> or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
> this error.
>
> Now, AFAIU, as we will be dealing with a damaged table, the chances of
> point #3 being the cause of this error looks impossible in our case
> because I don't think we will be doing anything in parallel when
> performing surgery on a damaged table, in fact we shouldn't be doing
> any such things. However, it is quite possible that reason #2 could
> cause VACUUM to fail with this sort of error, but, as we are already
> skipping redirected item pointers in heap_force_common(), I think, we
> would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
> see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
> may not need to handle point #2 as well.
>
> Further, I also don't see VACUUM reporting this error for a tuple that
> has been moved from one partition to another. So, I think we might not
> need to do any special handling for a tuple that got updated and its
> new version was moved to another partition.
>
> If you feel I am missing something here, please correct me. Thank you.
>
> Moreover, while I was exploring on above, I noticed that in
> lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
> for a redirected item pointers and if any redirected item pointer is
> detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
> HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
> with HEAP_HOT_UPDATED. I am referring to the following code in
> lazy_scan_heap().
>
>         for (offnum = FirstOffsetNumber;
>              offnum <= maxoff;
>              offnum = OffsetNumberNext(offnum))
>         {
>             ItemId      itemid;
>
>             itemid = PageGetItemId(page, offnum);
>
> .............
> .............
>
>
>             /* Redirect items mustn't be touched */ <-- this check
> would bypass the redirected item pointers from being checked for
> HeapTupleSatisfiesVacuum.
>             if (ItemIdIsRedirected(itemid))
>             {
>                 hastup = true;  /* this page won't be truncatable */
>                 continue;
>             }
>
> ..............
> ..............
>
>             switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
>             {
>                 case HEAPTUPLE_DEAD:
>
>                     if (HeapTupleIsHotUpdated(&tuple) ||
>                         HeapTupleIsHeapOnly(&tuple) ||
>                         params->index_cleanup == VACOPT_TERNARY_DISABLED)
>                         nkeep += 1;
>                     else
>                         tupgone = true; /* we can delete the tuple */
> ..............
> ..............
>              }
>
>
> So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
> On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for the review. Please check the comments inline below:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > > Thank you for updating the patch! Here are my comments on v5 patch:
> > >
> > > --- a/contrib/Makefile
> > > +++ b/contrib/Makefile
> > > @@ -35,6 +35,7 @@ SUBDIRS = \
> > >         pg_standby  \
> > >         pg_stat_statements \
> > >         pg_trgm     \
> > > +       pg_surgery  \
> > >         pgcrypto    \
> > >
> > > I guess we use alphabetical order here. So pg_surgery should be placed
> > > before pg_trgm.
> > >
> >
> > Okay, will take care of this in the next version of patch.
> >
> > > ---
> > > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > > +               ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> > >
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
> >
> > > ---
> > > +       /*
> > > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > > +        * pages.
> > > +        */
> > > +       if (!did_modify_page)
> > > +           goto skip_wal;
> > > +
> > > +       /* Mark buffer dirty before we write WAL. */
> > > +       MarkBufferDirty(buf);
> > > +
> > > +       /* XLOG stuff */
> > > +       if (RelationNeedsWAL(rel))
> > > +           log_newpage_buffer(buf, true);
> > > +
> > > +skip_wal:
> > > +       END_CRIT_SECTION();
> > > +
> > >
> > > s/unmodifed/unmodified/
> > >
> >
> > okay, will fix this typo.
> >
> > > Do we really need to use goto? I think we can modify it like follows:
> > >
> > >     if (did_modify_page)
> > >     {
> > >        /* Mark buffer dirty before we write WAL. */
> > >        MarkBufferDirty(buf);
> > >
> > >        /* XLOG stuff */
> > >        if (RelationNeedsWAL(rel))
> > >            log_newpage_buffer(buf, true);
> > >     }
> > >
> > >     END_CRIT_SECTION();
> > >
> >
> > No, we don't need it. We can achieve the same by checking the status
> > of did_modify_page flag as you suggested. I will do this change in the
> > next version.
> >
> > > ---
> > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > using that function. For instance, with the following script, the last
> > > two queries: index scan and seq scan, will return different results.
> > >
> > > set enable_seqscan to off;
> > > set enable_bitmapscan to off;
> > > set enable_indexonlyscan to off;
> > > create table tbl (a int primary key);
> > > insert into tbl values (1);
> > >
> > > update tbl set a = a + 100 where a = 1;
> > >
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- revive deleted tuple on heap
> > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > >
> > > -- index scan returns 2 tuples
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- seq scan returns 1 tuple
> > > set enable_seqscan to on;
> > > explain analyze select * from tbl;
> > >
> >
> > I am not sure if this is the right use-case of the pg_force_freeze
> > function. I think we should only be running the pg_force_freeze function
> > on a tuple for which VACUUM reports a "found xmin ABC from before
> > relfrozenxid PQR" sort of error; otherwise it might worsen things
> > instead of making them better. Now, the question is - can VACUUM report
> > this type of error for a deleted tuple, or would it only report it for
> > a live tuple? AFAIU this won't be reported for deleted tuples,
> > because VACUUM wouldn't consider freezing a tuple that has been
> > deleted.
> >
> > > Also, if a tuple updated and moved to another partition is revived by
> > > heap_force_freeze(), its ctid still has special values:
> > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > see a problem yet caused by a visible tuple having the special ctid
> > > value, but it might be worth considering either to reset ctid value as
> > > well or to not freezing already-deleted tuple.
> > >
> >
> > For this as well, the answer remains the same as above.
> >
> > --
> > With Regards,
> > Ashutosh Sharma
> > EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Chris Travers
Date:


On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.

Also, even if you do trace the problem, you still have to recover.

And sometimes I have found latent corruption from times when dbs were running on older versions and older servers, making debugging largely a futile exercise.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.

> So I have these questions:
>
> - Do people think it would be smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

+1

It would be worth extending it with some functions to grab rows that have various TOAST OIDs too.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.

I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.

Agreed.
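The sentinel-file guard suggested above could be sketched roughly as follows. This is only an illustration: the file name `allow_heap_surgery` and the function `surgery_allowed` are hypothetical, not part of any posted patch.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the safety interlock Peter suggests: refuse to run the
 * surgery functions unless the administrator has explicitly created a
 * sentinel file in the data directory.  File name is hypothetical.
 */
static int
surgery_allowed(const char *datadir)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s/allow_heap_surgery", datadir);

    /* access() with F_OK just tests for the file's existence */
    return access(path, F_OK) == 0;
}
```

An administrator would then have to create the file deliberately before the functions agree to modify anything, making accidental misuse much harder.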


--
Peter Geoghegan




--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Rajkumar Raghuwanshi
Date:
Thanks for the suggestions, Ashutosh. I have done testing around these suggestions
and found no issues. I will continue the same testing with the updated patch posted
on this thread.

On Fri, Aug 7, 2020 at 12:45 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Thanks Rajkumar for testing the patch.

Here are some of the additional test-cases that I would suggest you to
execute, if possible:

1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.

2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...
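The invalid-TID scenario in test-case 3 comes down to a bounds check before the functions touch a caller-supplied TID. A standalone model of that check might look like the following; `tid_in_bounds`, `nblocks`, and `maxoff` are illustrative stand-ins (for what `RelationGetNumberOfBlocks()` and `PageGetMaxOffsetNumber()` would supply), not names from the patch, and the real code must additionally cope with unused or redirected line pointers.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-ins for PostgreSQL's BlockNumber and OffsetNumber typedefs */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

/*
 * A TID (blkno, offnum) is only plausibly valid if the block exists in
 * the relation and the offset points at an allocated line pointer.
 * Offset numbers are 1-based in PostgreSQL pages.
 */
static bool
tid_in_bounds(BlockNumber blkno, OffsetNumber offnum,
              BlockNumber nblocks, OffsetNumber maxoff)
{
    return blkno < nblocks && offnum >= 1 && offnum <= maxoff;
}
```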

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
 
Thanks & Regards,
Rajkumar Raghuwanshi



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Alvaro Herrera
Date:
On 2020-Aug-17, Ashutosh Sharma wrote:

> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> 
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.

It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in.  We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.
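The state change being requested here can be shown with a toy model: when a line pointer on an all-visible page is forcibly killed, clear the page-level flag and the visibility map bit together. The structs below are simplified stand-ins, not the real PageHeaderData or visibility map layout; only the PD_ALL_VISIBLE bit value (0x0004) matches PostgreSQL's bufpage.h.

```c
#include <stdint.h>

#define PD_ALL_VISIBLE 0x0004   /* same bit value as in bufpage.h */

/* Toy stand-in pairing a page's flag word with its (stub) VM bit */
typedef struct
{
    uint16_t pd_flags;          /* page header flag bits */
    uint8_t  vm_bit;            /* 1 = all-visible in the stub VM */
} PageStub;

/*
 * After killing a tuple on an all-visible page, the page can no longer
 * be trusted by index-only scans: clear both indicators atomically with
 * respect to each other (in the real code, under the buffer lock).
 */
static void
clear_all_visible(PageStub *page)
{
    page->pd_flags &= ~PD_ALL_VISIBLE;
    page->vm_bit = 0;
}
```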

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> >         pg_standby  \
> >         pg_stat_statements \
> >         pg_trgm     \
> > +       pg_surgery  \
> >         pgcrypto    \
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +       /*
> > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > +        * pages.
> > +        */
> > +       if (!did_modify_page)
> > +           goto skip_wal;
> > +
> > +       /* Mark buffer dirty before we write WAL. */
> > +       MarkBufferDirty(buf);
> > +
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +       END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> > >     if (did_modify_page)
> >     {
> >        /* Mark buffer dirty before we write WAL. */
> >        MarkBufferDirty(buf);
> >
> >        /* XLOG stuff */
> >        if (RelationNeedsWAL(rel))
> >            log_newpage_buffer(buf, true);
> >     }
> >
> >     END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > > pg_force_freeze() can revive a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case of pg_force_freeze
> function. I think we should only be running pg_force_freeze function
> on a tuple for which VACUUM reports "found xmin ABC from before
> relfrozenxid PQR" sort of error otherwise it might worsen the things
> instead of making it better.

Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation, but we need to
at least show when to use it.

> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.

Perhaps the same is true when a tuple header is corrupted, including
xmin and ctid, for some reason and the user wants to fix it? I'm
concerned that a live tuple having a wrong ctid will cause a SEGV or
PANIC error in the future.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Tue, Aug 18, 2020 at 9:44 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Aug-17, Ashutosh Sharma wrote:
>
> > > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > > +               ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
>
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.

Sure, I will try to make that change, but I would still recommend
always running VACUUM with the DISABLE_PAGE_SKIPPING option on the
relation that underwent surgery.

> We
> should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> but if users fail to do so, then leaving the VM bit set just means that
> we know *for certain* that there will be further corruption as soon as
> the XID counter advances sufficiently.
>

Yeah, I've already added a note for this in the documentation:

Note: "After a surgery is performed on a damaged relation using this
module, we must run VACUUM with DISABLE_PAGE_SKIPPING option on that
relation to bring it back into a sane state."

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > using that function. For instance, with the following script, the last
> > > two queries: index scan and seq scan, will return different results.
> > >
> > > set enable_seqscan to off;
> > > set enable_bitmapscan to off;
> > > set enable_indexonlyscan to off;
> > > create table tbl (a int primary key);
> > > insert into tbl values (1);
> > >
> > > update tbl set a = a + 100 where a = 1;
> > >
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- revive deleted tuple on heap
> > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > >
> > > -- index scan returns 2 tuples
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- seq scan returns 1 tuple
> > > set enable_seqscan to on;
> > > explain analyze select * from tbl;
> > >
> >
> > I am not sure if this is the right use-case of pg_force_freeze
> > function. I think we should only be running pg_force_freeze function
> > on a tuple for which VACUUM reports "found xmin ABC from before
> > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > instead of making it better.
>
> Should this also be documented? I think that it's hard to force the
> user to always use this module in the right situation but we need to
> show at least when to use.
>

I've already added some examples in the documentation explaining the
use-case of force_freeze function. If required, I will also add a note
about it.

> > > Also, if a tuple updated and moved to another partition is revived by
> > > heap_force_freeze(), its ctid still has special values:
> > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > see a problem yet caused by a visible tuple having the special ctid
> > > value, but it might be worth considering either to reset ctid value as
> > > well or to not freezing already-deleted tuple.
> > >
> >
> > For this as well, the answer remains the same as above.
>
> Perhaps the same is true when a tuple header is corrupted including
> xmin and ctid for some reason and the user wants to fix it? I'm
> concerned that a live tuple having the wrong ctid will cause SEGV or
> PANIC error in the future.
>

If a tuple header itself is corrupted, then I think we must kill that
tuple. If only the xmin and t_ctid fields are corrupted, then probably we
can think of resetting the ctid value of that tuple. However, it won't
always be possible to detect a corrupted ctid value. It's quite
possible that the corrupted ctid field has valid values for the block
number and offset number in it, but is actually corrupted, and it
would be difficult to consider such a ctid as corrupted. Hence, we can't
do anything about such types of corruption. Probably in such cases we
need to run VACUUM FULL on such tables so that a new ctid gets generated
for each tuple in the table.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > using that function. For instance, with the following script, the last
> > > > two queries: index scan and seq scan, will return different results.
> > > >
> > > > set enable_seqscan to off;
> > > > set enable_bitmapscan to off;
> > > > set enable_indexonlyscan to off;
> > > > create table tbl (a int primary key);
> > > > insert into tbl values (1);
> > > >
> > > > update tbl set a = a + 100 where a = 1;
> > > >
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- revive deleted tuple on heap
> > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > >
> > > > -- index scan returns 2 tuples
> > > > explain analyze select * from tbl where a < 200;
> > > >
> > > > -- seq scan returns 1 tuple
> > > > set enable_seqscan to on;
> > > > explain analyze select * from tbl;
> > > >
> > >
> > > I am not sure if this is the right use-case of pg_force_freeze
> > > function. I think we should only be running pg_force_freeze function
> > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > instead of making it better.
> >
> > Should this also be documented? I think that it's hard to force the
> > user to always use this module in the right situation but we need to
> > show at least when to use.
> >
>
> I've already added some examples in the documentation explaining the
> use-case of force_freeze function. If required, I will also add a note
> about it.
>
> > > > Also, if a tuple updated and moved to another partition is revived by
> > > > heap_force_freeze(), its ctid still has special values:
> > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > see a problem yet caused by a visible tuple having the special ctid
> > > > value, but it might be worth considering either to reset ctid value as
> > > > well or to not freezing already-deleted tuple.
> > > >
> > >
> > > For this as well, the answer remains the same as above.
> >
> > Perhaps the same is true when a tuple header is corrupted including
> > xmin and ctid for some reason and the user wants to fix it? I'm
> > concerned that a live tuple having the wrong ctid will cause SEGV or
> > PANIC error in the future.
> >
>
> If a tuple header itself is corrupted, then I think we must kill that
> tuple. If only xmin and t_ctid fields are corrupted, then probably we
> can think of resetting the ctid value of that tuple. However, it won't
> be always possible to detect the corrupted ctid value. It's quite
> possible that the corrupted ctid field has valid values for block
> number and offset number in it, but it's actually corrupted and it
> would be difficult to consider such ctid as corrupted. Hence, we can't
> do anything about such types of corruption. Probably in such cases we
> need to run VACUUM FULL on such tables so that new ctid gets generated
> for each tuple in the table.

Understood.

Perhaps such corruption will be detectable by the new heapam check
functions discussed on another thread. My point was that it might be
better to attempt to make the tuple header as sane as possible when
fixing a live tuple, in order to prevent further problems such as
database crashes caused by malicious attackers.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > >
> > > > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > using that function. For instance, with the following script, the last
> > > > > two queries: index scan and seq scan, will return different results.
> > > > >
> > > > > set enable_seqscan to off;
> > > > > set enable_bitmapscan to off;
> > > > > set enable_indexonlyscan to off;
> > > > > create table tbl (a int primary key);
> > > > > insert into tbl values (1);
> > > > >
> > > > > update tbl set a = a + 100 where a = 1;
> > > > >
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- revive deleted tuple on heap
> > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > >
> > > > > -- index scan returns 2 tuples
> > > > > explain analyze select * from tbl where a < 200;
> > > > >
> > > > > -- seq scan returns 1 tuple
> > > > > set enable_seqscan to on;
> > > > > explain analyze select * from tbl;
> > > > >
> > > >
> > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > function. I think we should only be running pg_force_freeze function
> > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > instead of making it better.
> > >
> > > Should this also be documented? I think that it's hard to force the
> > > user to always use this module in the right situation but we need to
> > > show at least when to use.
> > >
> >
> > I've already added some examples in the documentation explaining the
> > use-case of force_freeze function. If required, I will also add a note
> > about it.
> >
> > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > heap_force_freeze(), its ctid still has special values:
> > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > value, but it might be worth considering either to reset ctid value as
> > > > > well or to not freezing already-deleted tuple.
> > > > >
> > > >
> > > > For this as well, the answer remains the same as above.
> > >
> > > Perhaps the same is true when a tuple header is corrupted including
> > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > PANIC error in the future.
> > >
> >
> > If a tuple header itself is corrupted, then I think we must kill that
> > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > can think of resetting the ctid value of that tuple. However, it won't
> > be always possible to detect the corrupted ctid value. It's quite
> > possible that the corrupted ctid field has valid values for block
> > number and offset number in it, but it's actually corrupted and it
> > would be difficult to consider such ctid as corrupted. Hence, we can't
> > do anything about such types of corruption. Probably in such cases we
> > need to run VACUUM FULL on such tables so that new ctid gets generated
> > for each tuple in the table.
>
> Understood.
>
> Perhaps such corruption will be able to be detected by new heapam
> check functions discussed on another thread. My point was that it
> might be better to attempt making the tuple header sane state as much
> as possible when fixing a live tuple in order to prevent further
> problems such as databases crash by malicious attackers.
>

Agreed. So, to handle the ctid-related concern that you raised, I'm
planning to make the following changes to ensure that the tuple being
marked as frozen contains the correct item pointer value. Please let
me know if you are okay with these changes.

                               HeapTupleHeader htup;
+                              ItemPointerData ctid;

                               Assert(heap_force_opt == HEAP_FORCE_FREEZE);

+                              ItemPointerSet(&ctid, blkno, offnos[i]);
+
                               htup = (HeapTupleHeader) PageGetItem(page, itemid);

+                              /*
+                               * Make sure that this tuple holds the correct
+                               * item pointer value.
+                               */
+                              if (!HeapTupleHeaderIndicatesMovedPartitions(htup) &&
+                                  !ItemPointerEquals(&ctid, &htup->t_ctid))
+                                  ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
                               HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
                               HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
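The fix-up in the diff above can be modeled in isolation roughly as follows. The struct and the moved-partitions sentinel here are simplified stand-ins for the real ItemPointerData and the MovedPartitionsOffsetNumber/MovedPartitionsBlockNumber markers, not the actual PostgreSQL definitions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for ItemPointerData (block, offset) */
typedef struct
{
    uint32_t blkno;
    uint16_t offnum;
} TidStub;

/* Stand-in for the moved-partitions marker; the real value differs */
#define MOVED_PARTITIONS_OFFSET UINT16_MAX

static bool
tid_equals(const TidStub *a, const TidStub *b)
{
    return a->blkno == b->blkno && a->offnum == b->offnum;
}

/*
 * While force-freezing the tuple at (blkno, offnum), make its t_ctid
 * point back at itself unless it carries the moved-partitions marker,
 * mirroring the condition in the diff above.
 */
static void
fix_ctid_for_freeze(TidStub *t_ctid, uint32_t blkno, uint16_t offnum)
{
    TidStub self = { blkno, offnum };

    if (t_ctid->offnum != MOVED_PARTITIONS_OFFSET &&
        !tid_equals(t_ctid, &self))
        *t_ctid = self;
}
```

As Sawada-san notes just below, the moved-partitions condition is arguably unnecessary for a live tuple; the sketch keeps it only to match the posted diff.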

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > >
> > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > >
> > > > > > pg_force_freeze() can revive a tuple that is already deleted but not
> > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > > using that function. For instance, with the following script, the last
> > > > > > two queries: index scan and seq scan, will return different results.
> > > > > >
> > > > > > set enable_seqscan to off;
> > > > > > set enable_bitmapscan to off;
> > > > > > set enable_indexonlyscan to off;
> > > > > > create table tbl (a int primary key);
> > > > > > insert into tbl values (1);
> > > > > >
> > > > > > update tbl set a = a + 100 where a = 1;
> > > > > >
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- revive deleted tuple on heap
> > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > >
> > > > > > -- index scan returns 2 tuples
> > > > > > explain analyze select * from tbl where a < 200;
> > > > > >
> > > > > > -- seq scan returns 1 tuple
> > > > > > set enable_seqscan to on;
> > > > > > explain analyze select * from tbl;
> > > > > >
> > > > >
> > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > function. I think we should only be running pg_force_freeze function
> > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > instead of making it better.
> > > >
> > > > Should this also be documented? I think that it's hard to force the
> > > > user to always use this module in the right situation but we need to
> > > > show at least when to use.
> > > >
> > >
> > > I've already added some examples in the documentation explaining the
> > > use-case of force_freeze function. If required, I will also add a note
> > > about it.
> > >
> > > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > > value, but it might be worth considering either to reset ctid value as
> > > > > > well or to not freezing already-deleted tuple.
> > > > > >
> > > > >
> > > > > For this as well, the answer remains the same as above.
> > > >
> > > > Perhaps the same is true when a tuple header is corrupted including
> > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > PANIC error in the future.
> > > >
> > >
> > > If a tuple header itself is corrupted, then I think we must kill that
> > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > can think of resetting the ctid value of that tuple. However, it won't
> > > be always possible to detect the corrupted ctid value. It's quite
> > > possible that the corrupted ctid field has valid values for block
> > > number and offset number in it, but it's actually corrupted and it
> > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > do anything about such types of corruption. Probably in such cases we
> > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > for each tuple in the table.
> >
> > Understood.
> >
> > Perhaps such corruption will be able to be detected by new heapam
> > check functions discussed on another thread. My point was that it
> > might be better to attempt making the tuple header sane state as much
> > as possible when fixing a live tuple in order to prevent further
> > problems such as databases crash by malicious attackers.
> >
>
> Agreed. So, to handle the ctid related concern that you raised, I'm
> planning to do the following changes to ensure that the tuple being
> marked as frozen contains the correct item pointer value. Please let
> me know if you are okay with these changes.

Given that a live tuple can never be marked as moved to another partition, I
guess the first condition in the if statement is not necessary. The
rest looks good to me, although other hackers might think differently.
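
For illustration, here is a minimal standalone sketch of the kind of
check being discussed (the type, sentinel values, and helper name are
simplified stand-ins, not the actual patch code). As noted above, the
moved-partitions test is redundant: a sentinel ctid can never equal the
tuple's own location, so the self-reference test alone already catches
it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-ins for PostgreSQL's ItemPointerData and the
 * moved-to-another-partition sentinel values (the real definitions
 * live in storage/itemptr.h and related headers); this is a sketch of
 * the proposed repair, not the patch code.
 */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} ItemPtr;

#define MOVED_PARTITIONS_BLOCK  0xFFFFFFFFU
#define MOVED_PARTITIONS_OFFSET 0xFFFDU

/*
 * If the tuple's t_ctid carries the moved-partitions sentinel or does
 * not point back at the tuple's own (block, offset), reset it to the
 * tuple's own location before force-freezing. Returns true if the
 * ctid was rewritten.
 */
static bool
reset_ctid_if_suspect(ItemPtr *ctid, uint32_t blkno, uint16_t offnum)
{
    if ((ctid->block == MOVED_PARTITIONS_BLOCK &&
         ctid->offset == MOVED_PARTITIONS_OFFSET) ||
        ctid->block != blkno ||
        ctid->offset != offnum)
    {
        ctid->block = blkno;
        ctid->offset = offnum;
        return true;
    }
    return false;
}
```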

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > > >
> > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > > >
> > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > > > using that function. For instance, with the following script, the last
> > > > > > > two queries: index scan and seq scan, will return different results.
> > > > > > >
> > > > > > > set enable_seqscan to off;
> > > > > > > set enable_bitmapscan to off;
> > > > > > > set enable_indexonlyscan to off;
> > > > > > > create table tbl (a int primary key);
> > > > > > > insert into tbl values (1);
> > > > > > >
> > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > >
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- revive deleted tuple on heap
> > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > >
> > > > > > > -- index scan returns 2 tuples
> > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > >
> > > > > > > -- seq scan returns 1 tuple
> > > > > > > set enable_seqscan to on;
> > > > > > > explain analyze select * from tbl;
> > > > > > >
> > > > > >
> > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > function. I think we should only be running pg_force_freeze function
> > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > > instead of making it better.
> > > > >
> > > > > Should this also be documented? I think that it's hard to force the
> > > > > user to always use this module in the right situation but we need to
> > > > > show at least when to use.
> > > > >
> > > >
> > > > I've already added some examples in the documentation explaining the
> > > > use-case of force_freeze function. If required, I will also add a note
> > > > about it.
> > > >
> > > > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > > > value, but it might be worth considering either to reset ctid value as
> > > > > > > well or to not freezing already-deleted tuple.
> > > > > > >
> > > > > >
> > > > > > For this as well, the answer remains the same as above.
> > > > >
> > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > PANIC error in the future.
> > > > >
> > > >
> > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > be always possible to detect the corrupted ctid value. It's quite
> > > > possible that the corrupted ctid field has valid values for block
> > > > number and offset number in it, but it's actually corrupted and it
> > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > do anything about such types of corruption. Probably in such cases we
> > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > for each tuple in the table.
> > >
> > > Understood.
> > >
> > > Perhaps such corruption will be able to be detected by new heapam
> > > check functions discussed on another thread. My point was that it
> > > might be better to attempt making the tuple header sane state as much
> > > as possible when fixing a live tuple in order to prevent further
> > > problems such as databases crash by malicious attackers.
> > >
> >
> > Agreed. So, to handle the ctid related concern that you raised, I'm
> > planning to do the following changes to ensure that the tuple being
> > marked as frozen contains the correct item pointer value. Please let
> > me know if you are okay with these changes.
>
> Given that a live tuple can never be marked as moved to another partition, I
> guess the first condition in the if statement is not necessary. The
> rest looks good to me, although other hackers might think differently.
>

Okay, thanks for confirming. I am planning to go ahead with this
approach. Will later see what others have to say about it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Masahiko-san,

Please find the updated patch with the following new changes:

1) It adds code changes in the heap_force_kill function to clear the
all-visible bit in the visibility map for any page that is marked
all-visible. Along the way, it also clears the PD_ALL_VISIBLE flag on
the page header.

2) It adds code changes in the heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.

3) It adds several notes and examples to the documentation stating
when and how to use the functions provided by this module.

Please have a look and let me know if you have any other concerns.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 20, 2020 at 11:43 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > >
> > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > > > >
> > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > > > >
> > > > > > > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > > > > > > using that function. For instance, with the following script, the last
> > > > > > > > two queries: index scan and seq scan, will return different results.
> > > > > > > >
> > > > > > > > set enable_seqscan to off;
> > > > > > > > set enable_bitmapscan to off;
> > > > > > > > set enable_indexonlyscan to off;
> > > > > > > > create table tbl (a int primary key);
> > > > > > > > insert into tbl values (1);
> > > > > > > >
> > > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > > >
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- revive deleted tuple on heap
> > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > > >
> > > > > > > > -- index scan returns 2 tuples
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- seq scan returns 1 tuple
> > > > > > > > set enable_seqscan to on;
> > > > > > > > explain analyze select * from tbl;
> > > > > > > >
> > > > > > >
> > > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > > function. I think we should only be running pg_force_freeze function
> > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > > > > > > instead of making it better.
> > > > > >
> > > > > > Should this also be documented? I think that it's hard to force the
> > > > > > user to always use this module in the right situation but we need to
> > > > > > show at least when to use.
> > > > > >
> > > > >
> > > > > I've already added some examples in the documentation explaining the
> > > > > use-case of force_freeze function. If required, I will also add a note
> > > > > about it.
> > > > >
> > > > > > > > Also, if a tuple updated and moved to another partition is revived by
> > > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > > > > > > see a problem yet caused by a visible tuple having the special ctid
> > > > > > > > value, but it might be worth considering either to reset ctid value as
> > > > > > > > well or to not freezing already-deleted tuple.
> > > > > > > >
> > > > > > >
> > > > > > > For this as well, the answer remains the same as above.
> > > > > >
> > > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > > PANIC error in the future.
> > > > > >
> > > > >
> > > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > > be always possible to detect the corrupted ctid value. It's quite
> > > > > possible that the corrupted ctid field has valid values for block
> > > > > number and offset number in it, but it's actually corrupted and it
> > > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > > do anything about such types of corruption. Probably in such cases we
> > > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > > for each tuple in the table.
> > > >
> > > > Understood.
> > > >
> > > > Perhaps such corruption will be able to be detected by new heapam
> > > > check functions discussed on another thread. My point was that it
> > > > might be better to attempt making the tuple header sane state as much
> > > > as possible when fixing a live tuple in order to prevent further
> > > > problems such as databases crash by malicious attackers.
> > > >
> > >
> > > Agreed. So, to handle the ctid related concern that you raised, I'm
> > > planning to do the following changes to ensure that the tuple being
> > > marked as frozen contains the correct item pointer value. Please let
> > > me know if you are okay with these changes.
> >
> > Given that a live tuple can never be marked as moved to another partition, I
> > guess the first condition in the if statement is not necessary. The
> > rest looks good to me, although other hackers might think differently.
> >
>
> Okay, thanks for confirming. I am planning to go ahead with this
> approach. Will later see what others have to say about it.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> It makes sense to recommend VACUUM after fixing the page, but I agree
> with Sawada-san that it would be sensible to reset the VM bit while
> doing surgery, since that's the state that the page would be in.  We
> should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> but if users fail to do so, then leaving the VM bit set just means that
> we know *for certain* that there will be further corruption as soon as
> the XID counter advances sufficiently.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
"Andrey M. Borodin"
Date:
Hi!

> 21 авг. 2020 г., в 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
>
> Please find the updated patch with the following new changes:

Do you have plans to support pg_surgery as an external extension, for example for earlier versions of Postgres and for
new features, the way amcheck_next is maintained?
ISTM that I'll have to use something like that tomorrow and I'm in doubt: should I resurrect our pg_dirty_hands or try
your new pg_surgery...

Thanks!

Best regards, Andrey Borodin.


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Mon, Aug 24, 2020 at 7:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 18, 2020 at 12:14 PM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > It makes sense to recommend VACUUM after fixing the page, but I agree
> > with Sawada-san that it would be sensible to reset the VM bit while
> > doing surgery, since that's the state that the page would be in.  We
> > should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
> > but if users fail to do so, then leaving the VM bit set just means that
> > we know *for certain* that there will be further corruption as soon as
> > the XID counter advances sufficiently.
>
> +1.
>

This has been taken care of in the latest v7 patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Mon, Aug 24, 2020 at 11:00 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi!
>
> > 21 авг. 2020 г., в 18:24, Ashutosh Sharma <ashu.coek88@gmail.com> написал(а):
> >
> > Please find the updated patch with the following new changes:
>
> Do you have plans to support pg_surgery as an external extension, for example for earlier versions of Postgres and
> for new features, the way amcheck_next is maintained?
> ISTM that I'll have to use something like that tomorrow and I'm in doubt: should I resurrect our pg_dirty_hands or
> try your new pg_surgery...
>

AFAICS, we don't have any plans to support pg_surgery as an external
extension as of now. Based on the discussion that has happened earlier
in this thread, I think we might also back-patch this contrib module.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Masahiko-san,
>
> Please find the updated patch with the following new changes:
>

Thank you for updating the patch!

> 1) It adds the code changes in heap_force_kill function to clear an
> all-visible bit on the visibility map corresponding to the page that
> is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> flag on the page header.

I think we need to clear all the visibility map bits by using
VISIBILITYMAP_VALID_BITS. Otherwise, the page could end up with the
all-frozen bit set but not the all-visible bit, which is not a valid
state.
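
To illustrate the point, here is a small standalone sketch (the flag
values match my reading of visibilitymap.h, but the names and helper
are illustrative stand-ins, not the patch code):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the two per-page visibility map bits; treat this as
 * an illustrative sketch, not the actual patch code. */
#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02
#define VM_VALID_BITS  (VM_ALL_VISIBLE | VM_ALL_FROZEN)

/* Clearing both bits at once avoids the invalid "all-frozen but not
 * all-visible" state that clearing only the all-visible bit can leave
 * behind. */
static uint8_t
clear_vm_bits(uint8_t vmbits)
{
    return (uint8_t) (vmbits & ~VM_VALID_BITS);
}
```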

>
> 2) It adds the code changes in heap_force_freeze function to reset the
> ctid value in a tuple header if it is corrupted.
>
> 3) It adds several notes and examples in the documentation stating
> when and how we need to use the functions provided by this module.
>
> Please have a look and let me know for any other concern.
>

And here are small comments on the heap_surgery.c:

+       /*
+        * Get the offset numbers from the tids belonging to one particular page
+        * and process them one by one.
+        */
+       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+                                            offnos);
+
+       /* Calculate the number of offsets stored in offnos array. */
+       noffs = next_start_ptr - curr_start_ptr;
+
+       /*
+        * Update the current start pointer so that next time when
+        * tids_same_page_fetch_offnums() is called, we can calculate the number
+        * of offsets present in the offnos array.
+        */
+       curr_start_ptr = next_start_ptr;
+
+       /* Check whether the block number is valid. */
+       if (blkno >= nblocks)
+       {
+           ereport(NOTICE,
+                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                    errmsg("skipping block %u for relation \"%s\"
because the block number is out of range",
+                           blkno, RelationGetRelationName(rel))));
+           continue;
+       }
+
+       CHECK_FOR_INTERRUPTS();

I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
of the do loop for safety. I think it's unlikely to happen but the
user might mistakenly specify a lot of wrong block numbers.

----
+   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+   noffs = curr_start_ptr = next_start_ptr = 0;
+   nblocks = RelationGetNumberOfBlocks(rel);
+
+   do
+   {

(snip)

+
+       /*
+        * Get the offset numbers from the tids belonging to one particular page
+        * and process them one by one.
+        */
+       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+                                            offnos);
+
+       /* Calculate the number of offsets stored in offnos array. */
+       noffs = next_start_ptr - curr_start_ptr;
+

(snip)

+       /* No ereport(ERROR) from here until all the changes are logged. */
+       START_CRIT_SECTION();
+
+       for (i = 0; i < noffs; i++)

You copy all offset numbers belonging to the same page to palloc'd
array, offnos, and iterate it while processing the tuples. I might be
missing something but I think we can do that without allocating the
space for offset numbers. Is there any reason for this? I guess we can
do that by just iterating the sorted tids array.
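
A standalone sketch of what I have in mind (hypothetical types and
helper names, not the patch code): with the tids sorted by block
number, each page's tids form a contiguous run, so two indexes are
enough to walk them in place and no offnos array is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tid; the real code works with
 * ItemPointerData. This sketch only illustrates the iteration. */
typedef struct
{
    uint32_t blkno;
    uint16_t offnum;
} Tid;

/*
 * Process a (blkno, offnum)-sorted tids array one page-run at a time.
 * process_offnum is a stand-in for the per-tuple force-kill or
 * force-freeze work (may be NULL). Returns the number of distinct
 * pages visited.
 */
static int
process_sorted_tids(const Tid *tids, size_t ntids,
                    void (*process_offnum)(uint32_t blkno, uint16_t offnum))
{
    size_t  cur = 0;
    int     npages = 0;

    while (cur < ntids)
    {
        uint32_t blkno = tids[cur].blkno;
        size_t   next = cur;

        /* In the real function, CHECK_FOR_INTERRUPTS() would go here,
         * at the top of the per-page loop. */

        /* Find the end of this page's run. */
        while (next < ntids && tids[next].blkno == blkno)
            next++;

        /* Per-page work: in the real code, pin and lock the buffer
         * once, then walk the run of offsets. */
        for (size_t i = cur; i < next; i++)
        {
            if (process_offnum)
                process_offnum(blkno, tids[i].offnum);
        }

        cur = next;
        npages++;
    }
    return npages;
}
```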

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Tue, 25 Aug 2020 at 17:08, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hi Masahiko-san,
> >
> > Please find the updated patch with the following new changes:
> >
>
> Thank you for updating the patch!
>
> > 1) It adds the code changes in heap_force_kill function to clear an
> > all-visible bit on the visibility map corresponding to the page that
> > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> > flag on the page header.
>
> I think we need to clear all visibility map bits by using
> VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
> not all-visible bit, which is not a valid state.
>
> >
> > 2) It adds the code changes in heap_force_freeze function to reset the
> > ctid value in a tuple header if it is corrupted.
> >
> > 3) It adds several notes and examples in the documentation stating
> > when and how we need to use the functions provided by this module.
> >
> > Please have a look and let me know for any other concern.
> >
>
> And here are small comments on the heap_surgery.c:
>
> +       /*
> +        * Get the offset numbers from the tids belonging to one particular page
> +        * and process them one by one.
> +        */
> +       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +                                            offnos);
> +
> +       /* Calculate the number of offsets stored in offnos array. */
> +       noffs = next_start_ptr - curr_start_ptr;
> +
> +       /*
> +        * Update the current start pointer so that next time when
> +        * tids_same_page_fetch_offnums() is called, we can calculate the number
> +        * of offsets present in the offnos array.
> +        */
> +       curr_start_ptr = next_start_ptr;
> +
> +       /* Check whether the block number is valid. */
> +       if (blkno >= nblocks)
> +       {
> +           ereport(NOTICE,
> +                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                    errmsg("skipping block %u for relation \"%s\"
> because the block number is out of range",
> +                           blkno, RelationGetRelationName(rel))));
> +           continue;
> +       }
> +
> +       CHECK_FOR_INTERRUPTS();
>
> I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
> of the do loop for safety. I think it's unlikely to happen but the
> user might mistakenly specify a lot of wrong block numbers.
>
> ----
> +   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
> +   noffs = curr_start_ptr = next_start_ptr = 0;
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +
> +   do
> +   {
>
> (snip)
>
> +
> +       /*
> +        * Get the offset numbers from the tids belonging to one particular page
> +        * and process them one by one.
> +        */
> +       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +                                            offnos);
> +
> +       /* Calculate the number of offsets stored in offnos array. */
> +       noffs = next_start_ptr - curr_start_ptr;
> +
>
> (snip)
>
> +       /* No ereport(ERROR) from here until all the changes are logged. */
> +       START_CRIT_SECTION();
> +
> +       for (i = 0; i < noffs; i++)
>
> You copy all offset numbers belonging to the same page to palloc'd
> array, offnos, and iterate it while processing the tuples. I might be
> missing something but I think we can do that without allocating the
> space for offset numbers. Is there any reason for this? I guess we can
> do that by just iterating the sorted tids array.
>

Let me share other comments on the latest version patch:

Some words need to be tagged. For instance, I found the following words:

VACUUM
DISABLE_PAGE_SKIPPING
HEAP_XMIN_FROZEN
HEAP_XMAX_INVALID

---
+test=# select ctid from t1 where xmin = 507;
+ ctid
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze |

I think it's better to use a consistent output format. The former uses
the normal format whereas the latter uses the expanded format.

---
+ <note>
+ <para>
+  While performing surgery on a damaged relation, we must not be doing anything
+  else on that relation in parallel. This is to ensure that when we are
+  operating on a damaged tuple there is no other transaction trying to modify
+  that tuple.
+ </para>
+ </note>

If we prefer to avoid concurrent operations on the target relation, why
don't we use AccessExclusiveLock?

---
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;

+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;

I think these functions should be PARALLEL UNSAFE.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Masahiko-san,

Thank you for the review. Please check my comments inline below:

On Tue, Aug 25, 2020 at 1:39 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 21 Aug 2020 at 22:25, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hi Masahiko-san,
> >
> > Please find the updated patch with the following new changes:
> >
>
> Thank you for updating the patch!
>
> > 1) It adds the code changes in heap_force_kill function to clear an
> > all-visible bit on the visibility map corresponding to the page that
> > is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
> > flag on the page header.
>
> I think we need to clear all visibility map bits by using
> VISIBILITYMAP_VALID_BITS. Otherwise, the page has all-frozen bit but
> not all-visible bit, which is not a valid state.
>

Yeah, makes sense. I will make that change in the next version of the patch.

> >
> > 2) It adds the code changes in heap_force_freeze function to reset the
> > ctid value in a tuple header if it is corrupted.
> >
> > 3) It adds several notes and examples in the documentation stating
> > when and how we need to use the functions provided by this module.
> >
> > Please have a look and let me know for any other concern.
> >
>
> And here are small comments on the heap_surgery.c:
>
> +       /*
> +        * Get the offset numbers from the tids belonging to one particular page
> +        * and process them one by one.
> +        */
> +       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +                                            offnos);
> +
> +       /* Calculate the number of offsets stored in offnos array. */
> +       noffs = next_start_ptr - curr_start_ptr;
> +
> +       /*
> +        * Update the current start pointer so that next time when
> +        * tids_same_page_fetch_offnums() is called, we can calculate the number
> +        * of offsets present in the offnos array.
> +        */
> +       curr_start_ptr = next_start_ptr;
> +
> +       /* Check whether the block number is valid. */
> +       if (blkno >= nblocks)
> +       {
> +           ereport(NOTICE,
> +                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                    errmsg("skipping block %u for relation \"%s\"
> because the block number is out of range",
> +                           blkno, RelationGetRelationName(rel))));
> +           continue;
> +       }
> +
> +       CHECK_FOR_INTERRUPTS();
>
> I guess it would be better to call CHECK_FOR_INTERRUPTS() at the top
> of the do loop for safety. I think it's unlikely to happen but the
> user might mistakenly specify a lot of wrong block numbers.
>

Okay, np, will shift it to top of the do loop.

> ----
> +   offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
> +   noffs = curr_start_ptr = next_start_ptr = 0;
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +
> +   do
> +   {
>
> (snip)
>
> +
> +       /*
> +        * Get the offset numbers from the tids belonging to one particular page
> +        * and process them one by one.
> +        */
> +       blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
> +                                            offnos);
> +
> +       /* Calculate the number of offsets stored in offnos array. */
> +       noffs = next_start_ptr - curr_start_ptr;
> +
>
> (snip)
>
> +       /* No ereport(ERROR) from here until all the changes are logged. */
> +       START_CRIT_SECTION();
> +
> +       for (i = 0; i < noffs; i++)
>
> You copy all offset numbers belonging to the same page to palloc'd
> array, offnos, and iterate it while processing the tuples. I might be
> missing something but I think we can do that without allocating the
> space for offset numbers. Is there any reason for this? I guess we can
> do that by just iterating the sorted tids array.
>

Hmmm.. okay, I see your point. I think what you are suggesting here is
to make use of the current and next start pointers to get the tids
belonging to the same page and process them one by one, instead of
fetching the offset numbers of all tids belonging to one page into the
offnos array and then iterating through that array. I think that is
probably possible and I will try to do it in the next version of the
patch. If there is something else that you have in mind, please let me
know.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> + <note>
> + <para>
> +  While performing surgery on a damaged relation, we must not be doing anything
> +  else on that relation in parallel. This is to ensure that when we are
> +  operating on a damaged tuple there is no other transaction trying to modify
> +  that tuple.
> + </para>
> + </note>
>
> If we prefer to avoid concurrent operations on the target relation why
> don't we use AccessExclusiveLock?

I disagree with the content of the note. It's up to the user whether
to perform any concurrent operations on the target relation, and in
many cases it would be fine to do so. Users who can afford to take the
table off-line to repair the problem don't really need this tool in
the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Tue, Aug 25, 2020 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 8:17 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > + <note>
> > + <para>
> > +  While performing surgery on a damaged relation, we must not be doing anything
> > +  else on that relation in parallel. This is to ensure that when we are
> > +  operating on a damaged tuple there is no other transaction trying to modify
> > +  that tuple.
> > + </para>
> > + </note>
> >
> > If we prefer to avoid concurrent operations on the target relation why
> > don't we use AccessExclusiveLock?
>
> I disagree with the content of the note. It's up to the user whether
> to perform any concurrent operations on the target relation, and in
> many cases it would be fine to do so. Users who can afford to take the
> table off-line to repair the problem don't really need this tool in
> the first place.
>

The only reason I added this note was to ensure that we do not revive
a tuple that is deleted but not yet vacuumed. There is one corner-case
scenario, reported by you in [1], where you explained how VACUUM can
report a "found xmin ... from before relfrozenxid ..." sort of error
for deleted tuples. As per the explanation provided there, it can
happen when there are multiple transactions operating on the same
tuple. However, I think we can take care of this scenario by making
some code changes in heap_force_freeze to identify such deleted tuples
and skip them. So, yeah, I will make the code changes to handle this
and remove the note from the documentation. Thank you Robert and
Masahiko-san for pointing this out.

[1] - https://www.postgresql.org/message-id/CA%2BTgmobfJ8CkabKJZ-1FGfvbSz%2Bb8bBX807Y6hHEtVfzVe%2Bg6A%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Thanks for the review. Please find my comments inline below:

On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> Let me share other comments on the latest version patch:
>
> Some words need to be tagged. For instance, I found the following words:
>
> VACUUM
> DISABLE_PAGE_SKIPPING
> HEAP_XMIN_FROZEN
> HEAP_XMAX_INVALID
>

Okay, done.

> ---
> +test=# select ctid from t1 where xmin = 507;
> + ctid
> +-------
> + (0,3)
> +(1 row)
> +
> +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
> +-[ RECORD 1 ]-----+-
> +heap_force_freeze |
>
> I think it's better to use a consistent output format. The former uses
> the normal format whereas the latter uses the expanded format.
>

Yep, makes sense, done.

> ---
> + <note>
> + <para>
> +  While performing surgery on a damaged relation, we must not be doing anything
> +  else on that relation in parallel. This is to ensure that when we are
> +  operating on a damaged tuple there is no other transaction trying to modify
> +  that tuple.
> + </para>
> + </note>
>
> If we prefer to avoid concurrent operations on the target relation why
> don't we use AccessExclusiveLock?
>

Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."

> ---
> +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_kill'
> +LANGUAGE C STRICT;
>
> +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_freeze'
> +LANGUAGE C STRICT;
>
> I think these functions should be PARALLEL UNSAFE.
>

By default the functions are marked PARALLEL UNSAFE, so I think there
is nothing to do here.

Attached patch with above changes.

This patch also takes care of all the other review comments from - [1].

[1] - https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com


--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Removed this note from the documentation and added a note saying: "The
> user needs to ensure that they do not operate pg_force_freeze function
> on a deleted tuple because it may revive the deleted tuple."

I do not agree with that note, either. I believe that trying to tell
people what things specifically they should do or avoid doing with the
tool is the wrong approach. Instead, the thrust of the message should
be to tell people that if you use this, it may corrupt your database,
and that's your problem. The difficulty with telling people what
specifically they ought to avoid doing is that experts will be annoyed
to be told that something is not safe when they know that it is fine,
and non-experts will think that some uses are safer than they really
are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Removed this note from the documentation and added a note saying: "The
> > user needs to ensure that they do not operate pg_force_freeze function
> > on a deleted tuple because it may revive the deleted tuple."
>
> I do not agree with that note, either. I believe that trying to tell
> people what things specifically they should do or avoid doing with the
> tool is the wrong approach. Instead, the thrust of the message should
> be to tell people that if you use this, it may corrupt your database,
> and that's your problem. The difficulty with telling people what
> specifically they ought to avoid doing is that experts will be annoyed
> to be told that something is not safe when they know that it is fine,
> and non-experts will think that some uses are safer than they really
> are.
>

Okay, point noted.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Okay, point noted.

I spent some time today working on this patch. I'm fairly happy with
it now and intend to commit it if nobody sees a big problem with that.
Per discussion, I do not intend to back-patch at this time. The two
most significant changes I made to your version are:

1. I changed things around to avoid using any form of ereport() in a
critical section. I'm not actually sure whether it is project policy
to avoid ereport(NOTICE, ...) or similar in a critical section, but it
seems prudent, because if anything fails in a critical section, we
will PANIC, so the less we do there the better.

2. I changed the code so that it does not try to follow redirected
line pointers; instead, it skips them with an appropriate message, as
we were already doing for dead and unused line pointers. I think the
way you had it coded might've been my suggestion originally, but the
more I looked into it the less I liked it. One problem is that it
didn't match the docs. A second is that following a corrupted line
pointer might index off the end of the line pointer array, and while
that probably shouldn't happen, we are talking about corruption
recovery here. Then I realized that, as you coded it, if the line
pointer was redirected to a line pointer that is in turn dead (or
unused, if there's corruption) the user would get a NOTICE complaining
about a TID they hadn't specified, which seems like it would be very
confusing. I thought about trying to fix all that stuff, but it just
didn't seem worth it, because I can't think of a good reason to pass
this function the TID of a redirected line pointer in the first place.
If you're doing surgery, you should probably specify the exact thing
upon which you want to operate, not some other thing that points to
it.

Here is a list of other changes I made:

* Added a .gitignore file.
* Renamed the regression test file from pg_surgery to heap_surgery to
match the name of the single C source file we currently have.
* Capitalized TID in a few places.
* Ran pgindent.
* Adjusted various comments.
* Removed the check for an empty TID array. I don't see any reason why
this should be an error case and I don't see much precedent for having
such a check.
* Fixed the code to work properly with that change and added a test case.
* Added a check that the array is not multi-dimensional.
* Put the AM type check before the relkind check, following existing precedent.
* Adjusted the check to use the AM OID rather than the handler OID,
following existing precedent. Fixed the message wording accordingly.
* Changed the documentation wording to say less about specific
recovery procedures and focus more on the general idea that this is
dangerous.
* Removed all but one of the test cases that checked what happens if
you use this on a non-heap; three tests for basically the same thing
seemed excessive.
* Added some additional tests to improve code coverage. There are now
only a handful of lines not covered.
* Reorganized the test cases somewhat.

New patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Mark Dilger
Date:

> On Aug 26, 2020, at 4:36 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> This patch also takes care of all the other review comments from - [1].
>
> [1] - https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com
>
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
> <v8-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch>


Hi Ashutosh,

I took a look at the v8 patch, created a commitfest entry [1] because I did not find an existing one, and have the
following review comments:


HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.


The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:

-               qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+               qsort((void*) tids, ntids, sizeof(ItemPointerData),
+                         (int (*) (const void *, const void *)) ItemPointerCompare);


sanity_check_tid_array() has two error messages:

  "array must not contain nulls"
  "empty tid array"

I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being
checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue
for removing the second check.  I don't see why an empty array should draw an error.  It seems more reasonable to just
return early since there is no work to do.  Consider if somebody uses a function that returns the tids for all corrupt
tuples in a table, aggregates that into an array, and hands that to this function.  It doesn't seem like an error for
that aggregated array to have zero elements in it.  I suppose you could emit a NOTICE in this case?


Upthread:
> On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> This looks like a very good suggestion to me. I will do this change in
>> the next version. Just wondering if we should be doing similar changes
>> in other contrib modules (like pgrowlocks, pageinspect and
>> pgstattuple) as well?
>
> It seems like it should be consistent, but I'm not sure the proposed
> change is really an improvement.

You have used Asim's proposed check:

    if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("only the relation using heap_tableam_handler is supported")));

which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the
error message should be changed. I recommend something like:

        if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
                ereport(ERROR,
                                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                                errmsg("only the relation using heap_tableam_handler is supported")));
+                                errmsg("\"%s\" does not use a heap access method",
+                                               RelationGetRelationName(rel))));

where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible
access method", and so forth.  There doesn't seem to be enough precedent to dictate exactly how to phrase this, or
perhaps I'm just not looking in the right place.


The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.


The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section.  I don't think that is
normal practice.  Can those reports be buffered until after the critical section is exited?  You also have a
CHECK_FOR_INTERRUPTS within the critical section.

[1] https://commitfest.postgresql.org/29/2700/
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Fri, Aug 28, 2020 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Okay, point noted.
>
> I spent some time today working on this patch. I'm fairly happy with
> it now and intend to commit it if nobody sees a big problem with that.
> Per discussion, I do not intend to back-patch at this time. The two
> most significant changes I made to your version are:
>
> 1. I changed things around to avoid using any form of ereport() in a
> critical section. I'm not actually sure whether it is project policy
> to avoid ereport(NOTICE, ...) or similar in a critical section, but it
> seems prudent, because if anything fails in a critical section, we
> will PANIC, so doing fewer things there seems prudent.
>
> 2. I changed the code so that it does not try to follow redirected
> line pointers; instead, it skips them with an appropriate message, as
> we were already doing for dead and unused line pointers. I think the
> way you had it coded might've been my suggestion originally, but the
> more I looked into it the less I liked it. One problem is that it
> didn't match the docs. A second is that following a corrupted line
> pointer might index off the end of the line pointer array, and while
> that probably shouldn't happen, we are talking about corruption
> recovery here. Then I realized that, as you coded it, if the line
> pointer was redirected to a line pointer that is in turn dead (or
> unused, if there's corruption) the user would get a NOTICE complaining
> about a TID they hadn't specified, which seems like it would be very
> confusing. I thought about trying to fix all that stuff, but it just
> didn't seem worth it, because I can't think of a good reason to pass
> this function the TID of a redirected line pointer in the first place.
> If you're doing surgery, you should probably specify the exact thing
> upon which you want to operate, not some other thing that points to
> it.
>
> Here is a list of other changes I made:
>
> * Added a .gitignore file.
> * Renamed the regression test file from pg_surgery to heap_surgery to
> match the name of the single C source file we currently have.
> * Capitalized TID in a few places.
> * Ran pgindent.
> * Adjusted various comments.
> * Removed the check for an empty TID array. I don't see any reason why
> this should be an error case and I don't see much precedent for having
> such a check.
> * Fixed the code to work properly with that change and added a test case.
> * Added a check that the array is not multi-dimensional.
> * Put the AM type check before the relkind check, following existing precedent.
> * Adjusted the check to use the AM OID rather than the handler OID,
> following existing precedent. Fixed the message wording accordingly.
> * Changed the documentation wording to say less about specific
> recovery procedures and focus more on the general idea that this is
> dangerous.
> * Removed all but one of the test cases that checked what happens if
> you use this on a non-heap; three tests for basically the same thing
> seemed excessive.
> * Added some additional tests to improve code coverage. There are now
> only a handful of lines not covered.
> * Reorganized the test cases somewhat.
>
> New patch attached.
>

Thank you Robert for the patch. I've looked into the changes you've
made to the v8 patch and they all look good to me.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Mark,

Thanks for the review. Please find my comments inline below:

> HeapTupleForceOption should be added to src/tools/pgindent/typedefs.list.
>

This has been fixed in the v9 patch.

>
> The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:
>
> -               qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
> +               qsort((void*) tids, ntids, sizeof(ItemPointerData),
> +                         (int (*) (const void *, const void *)) ItemPointerCompare);
>

Will have a look into this.

>
> sanity_check_tid_array() has two error messages:
>
>   "array must not contain nulls"
>   "empty tid array"
>
> I would change the first to say "tid array must not contain nulls", as "tid" is the name of the parameter being
checked. It is also more consistent with the second error message, but that doesn't matter to me so much, as I'd argue
for removing the second check.  I don't see why an empty array should draw an error.  It seems more reasonable to just
return early since there is no work to do.  Consider if somebody uses a function that returns the tids for all corrupt
tuples in a table, aggregates that into an array, and hands that to this function.  It doesn't seem like an error for
that aggregated array to have zero elements in it.  I suppose you could emit a NOTICE in this case?
>

This comment is no longer valid given the changes in the v9 patch.

>
> Upthread:
> > On Aug 13, 2020, at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> >> This looks like a very good suggestion to me. I will do this change in
> >> the next version. Just wondering if we should be doing similar changes
> >> in other contrib modules (like pgrowlocks, pageinspect and
> >> pgstattuple) as well?
> >
> > It seems like it should be consistent, but I'm not sure the proposed
> > change is really an improvement.
>
> You have used Asim's proposed check:
>
>     if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("only the relation using heap_tableam_handler is supported")));
>
> which Robert seems unenthusiastic about, but if you are going that direction, I think at least the language of the
error message should be changed. I recommend something like:
>
>         if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                                errmsg("only the relation using heap_tableam_handler is supported")));
> +                                errmsg("\"%s\" does not use a heap access method",
> +                                               RelationGetRelationName(rel))));
>
> where "a heap access method" could also be written as "a heap table type access method", "a heap table compatible
access method", and so forth.  There doesn't seem to be enough precedent to dictate exactly how to phrase this, or
perhaps I'm just not looking in the right place.
>

Same here. This is also no longer applicable given the changes in the v9 patch.

>
> The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.
>

Okay, will add a comment for this.

>
> The heap_force_common function contains multiple ereport(NOTICE,...) within a critical section.  I don't think that
is normal practice.  Can those reports be buffered until after the critical section is exited?  You also have a
CHECK_FOR_INTERRUPTS within the critical section.
>

This has been fixed in the v9 patch.

> [1] https://commitfest.postgresql.org/29/2700/
> —

Thanks for adding a commitfest entry for this.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Fri, 28 Aug 2020 at 05:14, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 10:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Okay, point noted.
>
> I spent some time today working on this patch. I'm fairly happy with
> it now and intend to commit it if nobody sees a big problem with that.
> Per discussion, I do not intend to back-patch at this time. The two
> most significant changes I made to your version are:

Thank you for updating the patch.

>
> 1. I changed things around to avoid using any form of ereport() in a
> critical section. I'm not actually sure whether it is project policy
> to avoid ereport(NOTICE, ...) or similar in a critical section, but it
> seems prudent, because if anything fails in a critical section, we
> will PANIC, so doing fewer things there seems prudent.
>
> 2. I changed the code so that it does not try to follow redirected
> line pointers; instead, it skips them with an appropriate message, as
> we were already doing for dead and unused line pointers. I think the
> way you had it coded might've been my suggestion originally, but the
> more I looked into it the less I liked it. One problem is that it
> didn't match the docs. A second is that following a corrupted line
> pointer might index off the end of the line pointer array, and while
> that probably shouldn't happen, we are talking about corruption
> recovery here. Then I realized that, as you coded it, if the line
> pointer was redirected to a line pointer that is in turn dead (or
> unused, if there's corruption) the user would get a NOTICE complaining
> about a TID they hadn't specified, which seems like it would be very
> confusing. I thought about trying to fix all that stuff, but it just
> didn't seem worth it, because I can't think of a good reason to pass
> this function the TID of a redirected line pointer in the first place.
> If you're doing surgery, you should probably specify the exact thing
> upon which you want to operate, not some other thing that points to
> it.
>
> Here is a list of other changes I made:
>
> * Added a .gitignore file.
> * Renamed the regression test file from pg_surgery to heap_surgery to
> match the name of the single C source file we currently have.
> * Capitalized TID in a few places.
> * Ran pgindent.
> * Adjusted various comments.
> * Removed the check for an empty TID array. I don't see any reason why
> this should be an error case and I don't see much precedent for having
> such a check.
> * Fixed the code to work properly with that change and added a test case.
> * Added a check that the array is not multi-dimensional.
> * Put the AM type check before the relkind check, following existing precedent.
> * Adjusted the check to use the AM OID rather than the handler OID,
> following existing precedent. Fixed the message wording accordingly.
> * Changed the documentation wording to say less about specific
> recovery procedures and focus more on the general idea that this is
> dangerous.

You've removed the description about executing VACUUM with
DISABLE_PAGE_SKIPPING option on the target relation after using
pg_surgery functions from the doc but I guess it’s better to recommend
that in the doc for safety. Could you please tell me the reason for
removing that?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
> > The tidcmp function can be removed, and ItemPointerCompare used directly by qsort as:
> >
> > -               qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
> > +               qsort((void*) tids, ntids, sizeof(ItemPointerData),
> > +                         (int (*) (const void *, const void *)) ItemPointerCompare);
> >
>
> Will have a look into this.
>

We can certainly do it that way, but I would still prefer having a comparator function (tidcmp) here: it makes the code look a bit cleaner, and it is consistent with the way comparator functions are passed to qsort at several other places in postgres, which improves readability. For example, there is a comparator function for gin that does the same thing as tidcmp is doing here. See below:

static int
qsortCompareItemPointers(const void *a, const void *b)
{
    int         res = ginCompareItemPointers((ItemPointer) a, (ItemPointer) b);
 
    /* Assert that there are no equal item pointers being sorted */
    Assert(res != 0);
    return res;
}


In this case as well, it could have been done the way you are suggesting, but writing a small comparator function with the prototype that qsort accepts seemed like the better option. Considering this, I am leaving it as is. Please let me know if you feel otherwise.

> > The header comment for function find_tids_one_page should state the requirement that the tids array must be sorted.
> >
>
> Okay, will add a comment for this.
>

Added a comment for this in the attached patch.

Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> We can certainly do this way, but I would still prefer having a comparator function (tidcmp) here for the reasons
thatit makes the code look a bit cleaner, it also makes us more consistent with the way the comparator function
argumentis being passed to qsort at several other places in postgres which kinda of increases the code readability and
simplicity.For e.g. there is a comparator function for gin that does the same thing as tidcmp is doing here. 

Me too. Casting one kind of function pointer to another kind of
function pointer assumes that the compiler is using the same
argument-passing conventions in both cases, which seems slightly
risky. It also means that if the signature for the function were to
diverge further from the signature that we need in the future, the
compiler might not warn us about it. Perhaps there is some case where
the performance gains would be sufficient to justify those risks,
but this is certainly not that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> You've removed the description about executing VACUUM with
> DISABLE_PAGE_SKIPPING option on the target relation after using
> pg_surgery functions from the doc but I guess it’s better to recommend
> that in the doc for safety. Could you please tell me the reason for
> removing that?

Well, I think that was added because there wasn't code to clear the
visibility map bits, either the page-level bit or the bit in the map itself, but we added code
for that, so now I don't really see why it's necessary or even
desirable.

Here are a few example scenarios:

1. My table is not corrupt. For no particular reason, I force-freeze
or force-kill a tuple which is neither dead nor all-visible.
Concurrent queries might return wrong answers, but the table is not
corrupt. It does not require VACUUM and would not benefit from it.
Actually, it doesn't need anything at all.

2. My table is not corrupt. For no particular reason, I force-freeze a
tuple which is dead. I believe it's possible that the index entries
for that tuple might be gone already, but VACUUM won't fix that.
REINDEX or a table rewrite would, though. It's also possible, if the
dead tuple was added by an aborted transaction which added columns to
the table, that the tuple might have been created using a tuple
descriptor that differs from the table's current tuple descriptor. If
so, I think scanning the table could produce a crash. VACUUM won't fix
this, either. I would need to delete or force-kill the offending
tuple.

3. I have one or more tuples in my table that are intact except that
they have garbage values for xmin, resulting in VACUUM failure or
possibly even SELECT failure if the CLOG entries are also missing. I
force-kill or force-freeze them. If by chance the affected tuples were
also omitted from one or more indexes, a REINDEX or table rewrite is
needed to fix them, but a VACUUM will not help. On the other hand, if
those tuples are present in the indexes, there's no remaining problem
and VACUUM is not needed for the purpose of restoring the integrity of
the table. If the problem has been ongoing for a while, VACUUM might
be needed to advance relfrozenxid, but that doesn't require
DISABLE_PAGE_SKIPPING.

4. I have some pages in my table that have incorrect visibility map
bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
don't need the functions we're talking about here at all unless I also
have tuples with corrupted visibility information. If I do happen to
have both tuples with corrupted visibility information and also pages
with incorrect visibility map bits, then I suppose I need both these
tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
do the VACUUM second. But, if I happened to do the VACUUM first and
then use these functions afterward, the worst thing that could happen
is that I might end up with some dead tuples that could've gotten
removed faster if I'd switched the order. And that's not a disaster.

Basically, I can see no real reason to recommend VACUUM
(DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
with that command, and there are problems that can be fixed by this
method, but they are mostly independent of each other. We should not
recommend that people run VACUUM "just in case." That kind of fuzzy
thinking seems relatively prevalent already, and it leads to people
spending a lot of time running slow maintenance commands that do
nothing to help them, and which occasionally make things worse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Masahiko Sawada
Date:
On Fri, 28 Aug 2020 at 23:39, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > You've removed the description about executing VACUUM with
> > DISABLE_PAGE_SKIPPING option on the target relation after using
> > pg_surgery functions from the doc but I guess it’s better to recommend
> > that in the doc for safety. Could you please tell me the reason for
> > removing that?
>
> Well, I think that was added because there wasn't code to clear the
> visibility map bits, either the page-level bit or the bit in the map itself, but we added code
> for that, so now I don't really see why it's necessary or even
> desirable.
>
> Here are a few example scenarios:
>
> 1. My table is not corrupt. For no particular reason, I force-freeze
> or force-kill a tuple which is neither dead nor all-visible.
> Concurrent queries might return wrong answers, but the table is not
> corrupt. It does not require VACUUM and would not benefit from it.
> Actually, it doesn't need anything at all.
>
> 2. My table is not corrupt. For no particular reason, I force-freeze a
> tuple which is dead. I believe it's possible that the index entries
> for that tuple might be gone already, but VACUUM won't fix that.
> REINDEX or a table rewrite would, though. It's also possible, if the
> dead tuple was added by an aborted transaction which added columns to
> the table, that the tuple might have been created using a tuple
> descriptor that differs from the table's current tuple descriptor. If
> so, I think scanning the table could produce a crash. VACUUM won't fix
> this, either. I would need to delete or force-kill the offending
> tuple.
>
> 3. I have one or more tuples in my table that are intact except that
> they have garbage values for xmin, resulting in VACUUM failure or
> possibly even SELECT failure if the CLOG entries are also missing. I
> force-kill or force-freeze them. If by chance the affected tuples were
> also omitted from one or more indexes, a REINDEX or table rewrite is
> needed to fix them, but a VACUUM will not help. On the other hand, if
> those tuples are present in the indexes, there's no remaining problem
> and VACUUM is not needed for the purpose of restoring the integrity of
> the table. If the problem has been ongoing for a while, VACUUM might
> be needed to advance relfrozenxid, but that doesn't require
> DISABLE_PAGE_SKIPPING.
>
> 4. I have some pages in my table that have incorrect visibility map
> bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
> don't need the functions we're talking about here at all unless I also
> have tuples with corrupted visibility information. If I do happen to
> have both tuples with corrupted visibility information and also pages
> with incorrect visibility map bits, then I suppose I need both these
> tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
> do the VACUUM second. But, if I happened to do the VACUUM first and
> then use these functions afterward, the worst thing that could happen
> is that I might end up with a some dead tuples that could've gotten
> removed faster if I'd switched the order. And that's not a disaster.
>
> Basically, I can see no real reason to recommend VACUUM
> (DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
> with that command, and there are problems that can be fixed by this
> method, but they are mostly independent of each other. We should not
> recommend that people run VACUUM "just in case." That kind of fuzzy
> thinking seems relatively prevalent already, and it leads to people
> spending a lot of time running slow maintenance commands that do
> nothing to help them, and which occasionally make things worse.
>

Thank you for your explanation. That makes a lot of sense to me.

If VACUUM, used together with pg_surgery, could fix some particular
kind of problem, we could recommend running it. But I agree that heap
table corruption is not such a case.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you.

I have committed this version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Justin Pryzby
Date:
On Thu, Sep 10, 2020 at 11:21:02AM -0400, Robert Haas wrote:
> On Fri, Aug 28, 2020 at 5:55 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please have a look into the attached patch for the changes and let me know for any other concerns. Thank you.
> 
> I have committed this version.

Thanks; I marked it as such in the CF app.
https://commitfest.postgresql.org/29/2700/

-- 
Justin



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I have committed this version.

This failure says that the test case is not entirely stable:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-12%2005%3A13%3A12

diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
--- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out    2020-09-11 06:31:36.000000000 +0000
+++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out    2020-09-12 11:40:26.000000000 +0000
@@ -116,7 +116,6 @@
  vacuum freeze htab2;
  -- unused TIDs should be skipped
  select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
- NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked unused
   heap_force_kill
  -----------------


sungazer's first run after pg_surgery went in was successful, so it's
not a hard failure.  I'm guessing that it's timing dependent.

The most obvious theory for the cause is that what VACUUM does with
a tuple depends on whether the tuple's xmin is below global xmin,
and a concurrent autovacuum could very easily be holding back global
xmin.  While I can't easily get autovac to run at just the right
time, I did verify that a concurrent regular session holding back
global xmin produces the symptom seen above.  (To replicate, insert
"select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
line pointer"; run make installcheck; and use the delay to connect
to the database manually, start a serializable transaction, and do
any query to acquire a snapshot.)
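
For anyone retracing this by hand, the recipe above amounts to roughly
the following; the sleep duration and the session-2 query are arbitrary
placeholders, not part of the actual test script:

```sql
-- Added temporarily to contrib/pg_surgery/sql/heap_surgery.sql, just
-- before the "-- now create an unused line pointer" comment:
select pg_sleep(30);  -- window in which to set up the concurrent session

-- Meanwhile, in a second psql session connected to the same database,
-- hold back the global xmin for the duration of the test:
begin transaction isolation level serializable;
select 1;  -- any query will do; the point is just to acquire a snapshot
-- ...leave this transaction open until heap_surgery.sql finishes
```

With the second transaction pinning global xmin, the VACUUM in the test
no longer considers the just-deleted tuple removable, and the expected
NOTICE about the unused TID disappears from the output.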

I suggest that the easiest way to make this test reliable is to
make the test tables be temp tables (which allows dropping the
autovacuum_enabled = off property, too).  In the wake of commit
a7212be8b, that should guarantee that vacuum has stable tuple-level
behavior regardless of what is happening concurrently.
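
Concretely, that suggestion comes down to a one-line change in the test
setup, sketched here (the "before" DDL is paraphrased from the
description above, not quoted from the script):

```sql
-- before: an ordinary table shielded from autovacuum
-- create table htab (a int) with (autovacuum_enabled = off);

-- after: a temp table is invisible to other sessions, so (per commit
-- a7212be8b) VACUUM can use an aggressive cutoff regardless of any
-- concurrent transactions, and the autovacuum_enabled option is moot
create temp table htab (a int);
```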

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Sun, Sep 13, 2020 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > I have committed this version.
>
> This failure says that the test case is not entirely stable:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-12%2005%3A13%3A12
>
> diff -U3 /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out
> --- /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/expected/heap_surgery.out   2020-09-11 06:31:36.000000000 +0000
> +++ /home/nm/farm/gcc64/HEAD/pgsql.build/contrib/pg_surgery/results/heap_surgery.out    2020-09-12 11:40:26.000000000 +0000
> @@ -116,7 +116,6 @@
>   vacuum freeze htab2;
>   -- unused TIDs should be skipped
>   select heap_force_kill('htab2'::regclass, ARRAY['(0, 2)']::tid[]);
> - NOTICE:  skipping tid (0, 2) for relation "htab2" because it is marked unused
>    heap_force_kill
>   -----------------
>
>
> sungazer's first run after pg_surgery went in was successful, so it's
> not a hard failure.  I'm guessing that it's timing dependent.
>
> The most obvious theory for the cause is that what VACUUM does with
> a tuple depends on whether the tuple's xmin is below global xmin,
> and a concurrent autovacuum could very easily be holding back global
> xmin.  While I can't easily get autovac to run at just the right
> time, I did verify that a concurrent regular session holding back
> global xmin produces the symptom seen above.  (To replicate, insert
> "select pg_sleep(...)" in heap_surgery.sql before "-- now create an unused
> line pointer"; run make installcheck; and use the delay to connect
> to the database manually, start a serializable transaction, and do
> any query to acquire a snapshot.)
>

Thanks for reporting. I'm able to reproduce the issue by adding some
delay just before "-- now create an unused line pointer", and using
that delay to start a new session at either the repeatable read or
serializable transaction isolation level and run a query on the test
table. To fix this, as you suggested, I've converted the test table
to a temp table. Attached is the patch with the changes. Please have
a look and let me know about any concerns.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Thanks for reporting. I'm able to reproduce the issue by creating some
> delay just before "-- now create an unused line pointer" and use the
> delay to start a new session either with repeatable read or
> serializable transaction isolation level and run some query on the
> test table. To fix this, as you suggested I've converted the test
> table to the temp table. Attached is the patch with the changes.
> Please have a look and let me know about any concerns.

Tom, do you have any concerns about this fix?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> Thanks for reporting. I'm able to reproduce the issue by creating some
>> delay just before "-- now create an unused line pointer" and use the
>> delay to start a new session either with repeatable read or
>> serializable transaction isolation level and run some query on the
>> test table. To fix this, as you suggested I've converted the test
>> table to the temp table. Attached is the patch with the changes.
>> Please have a look and let me know about any concerns.

> Tom, do you have any concerns about this fix?

It seems OK as far as it goes.  Two thoughts:

* Do we need a comment in the test pointing out that the table must be
temp to ensure that we get stable vacuum results?  Or will the commit
log message be enough documentation?

* Should any of the other tables in the test be converted to temp?
I see that the other test cases are kluging around related issues
by dint of never committing their tables at all.  It's not clear
to me how badly those test cases have been distorted by that, or
whether it means they're testing less-than-typical situations.

Anyway, if you're satisfied with leaving the other cases as-is,
I have no objections.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >> Thanks for reporting. I'm able to reproduce the issue by creating some
> >> delay just before "-- now create an unused line pointer" and use the
> >> delay to start a new session either with repeatable read or
> >> serializable transaction isolation level and run some query on the
> >> test table. To fix this, as you suggested I've converted the test
> >> table to the temp table. Attached is the patch with the changes.
> >> Please have a look and let me know about any concerns.
>
> > Tom, do you have any concerns about this fix?
>
> It seems OK as far as it goes.  Two thoughts:
>
> * Do we need a comment in the test pointing out that the table must be
> temp to ensure that we get stable vacuum results?  Or will the commit
> log message be enough documentation?
>

I'll add a note for this.

> * Should any of the other tables in the test be converted to temp?
> I see that the other test cases are kluging around related issues
> by dint of never committing their tables at all.  It's not clear
> to me how badly those test cases have been distorted by that, or
> whether it means they're testing less-than-typical situations.
>

Are you trying to say that we can achieve what test cases 1 and 2 do
with a single temp table, and that we should aim for that because it
will make the tests more efficient and easier to maintain? If so, I
will make the necessary changes and submit a new patch. Please confirm.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Ashutosh Sharma <ashu.coek88@gmail.com> writes:
> On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * Should any of the other tables in the test be converted to temp?

> Are you trying to say that we can achieve the things being done in
> test-case 1 and 2 by having a single temp table and we should aim for
> it because it will make the test-case more efficient and easy to
> maintain?

Well, I'm just looking at the comment that says the reason for the
begin/rollback structure is to keep autovacuum's hands off the table.
In most if not all of the other places where we need that, the preferred
method is to make the table temp or mark it with (autovacuum = off).
While this way isn't wrong exactly, nor inefficient, it does seem
a little restrictive.  For instance, you can't easily test cases that
involve intentional errors.

Another point is that we have a few optimizations that apply to tables
created in the current transaction.  I'm not sure whether any of them
would fire in this test case, but if they do (now or in the future)
that might mean you aren't testing the usual scenario for use of
pg_surgery, which is surely not going to be a new-in-transaction
table.  (That might be an argument for preferring autovacuum = off
over a temp table, too.)

Like I said, I don't have a big problem with leaving the rest of the
test as-is.  It just seems to be doing things in an unusual way for
no very good reason.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ashutosh Sharma <ashu.coek88@gmail.com> writes:
> > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> * Should any of the other tables in the test be converted to temp?
>
> > Are you trying to say that we can achieve the things being done in
> > test-case 1 and 2 by having a single temp table and we should aim for
> > it because it will make the test-case more efficient and easy to
> > maintain?
>
> Well, I'm just looking at the comment that says the reason for the
> begin/rollback structure is to keep autovacuum's hands off the table.
> In most if not all of the other places where we need that, the preferred
> method is to make the table temp or mark it with (autovacuum = off).
> While this way isn't wrong exactly, nor inefficient, it does seem
> a little restrictive.  For instance, you can't easily test cases that
> involve intentional errors.
>
> Another point is that we have a few optimizations that apply to tables
> created in the current transaction.  I'm not sure whether any of them
> would fire in this test case, but if they do (now or in the future)
> that might mean you aren't testing the usual scenario for use of
> pg_surgery, which is surely not going to be a new-in-transaction
> table.  (That might be an argument for preferring autovacuum = off
> over a temp table, too.)
>

I agree with you on both the above points. I'll try to make the
necessary changes to address all your comments. Also, I'd prefer
creating a normal heap table with autovacuum = off over the temp table
for the reasons you mentioned in the second point.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Wed, Sep 16, 2020 at 10:40 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 9:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Ashutosh Sharma <ashu.coek88@gmail.com> writes:
> > > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> * Should any of the other tables in the test be converted to temp?
> >
> > > Are you trying to say that we can achieve the things being done in
> > > test-case 1 and 2 by having a single temp table and we should aim for
> > > it because it will make the test-case more efficient and easy to
> > > maintain?
> >
> > Well, I'm just looking at the comment that says the reason for the
> > begin/rollback structure is to keep autovacuum's hands off the table.
> > In most if not all of the other places where we need that, the preferred
> > method is to make the table temp or mark it with (autovacuum = off).
> > While this way isn't wrong exactly, nor inefficient, it does seem
> > a little restrictive.  For instance, you can't easily test cases that
> > involve intentional errors.
> >
> > Another point is that we have a few optimizations that apply to tables
> > created in the current transaction.  I'm not sure whether any of them
> > would fire in this test case, but if they do (now or in the future)
> > that might mean you aren't testing the usual scenario for use of
> > pg_surgery, which is surely not going to be a new-in-transaction
> > table.  (That might be an argument for preferring autovacuum = off
> > over a temp table, too.)
> >
>
> I agree with you on both the above points. I'll try to make the
> necessary changes to address all your comments. Also, I'd prefer
> creating a normal heap table with autovacuum = off over the temp table
> for the reasons you mentioned in the second point.
>

Attached is the patch with the changes suggested here. I've tried to
use a normal heap table with (autovacuum = off) wherever possible.
Please have a look and let me know of any other issues.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Sep 16, 2020 at 1:48 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Attached is the patch with the changes suggested here. I've tried to
> use a normal heap table with (autovacuum = off) wherever possible.
> Please have a look and let me know for any other issues.

I think the comment needs some wordsmithing -- "unlike other cases" is
not that informative, and "we get a stable vacuum results" isn't
either very clear or all that grammatical. If we're going to add a
comment here, why not just "use a temp table, so autovacuum can't
interfere"?

Tom, I know that you often have strong feelings about the exact
wording and details of this kind of stuff, so if you feel moved to
commit something that is fine with me. If not, I will take my best
shot at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Tom, I know that you often have strong feelings about the exact
> wording and details of this kind of stuff, so if you feel moved to
> commit something that is fine with me. If not, I will take my best
> shot at it.

I'm not feeling terribly picky about it --- so it's all yours.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Sep 16, 2020 at 11:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Tom, I know that you often have strong feelings about the exact
> > wording and details of this kind of stuff, so if you feel moved to
> > commit something that is fine with me. If not, I will take my best
> > shot at it.
>
> I'm not feeling terribly picky about it --- so it's all yours.

OK. After some more study of the thread and some more experimentation,
I came up with the attached. I'll go ahead and commit this if nobody
objects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> OK. After some more study of the thread and some more experimentation,
> I came up with the attached. I'll go ahead and commit this if nobody
> objects.

This is OK by me.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
On Thu, Sep 17, 2020 at 12:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > OK. After some more study of the thread and some more experimentation,
> > I came up with the attached. I'll go ahead and commit this if nobody
> > objects.
>
> This is OK by me.
>

Looks good to me too.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Wed, Sep 16, 2020 at 10:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > This is OK by me.
>
> Looks good to me too.

Cool, thanks to both of you for looking. Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Cool, thanks to both of you for looking. Committed.

Hmph, according to whelk this is worse not better:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11

I'm at a loss to understand what's going on there.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Amit Kapila
Date:
On Sat, Sep 19, 2020 at 4:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Cool, thanks to both of you for looking. Committed.
>
> Hmph, according to whelk this is worse not better:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-09-18%2017%3A42%3A11
>
> I'm at a loss to understand what's going on there.
>

I think our assumption that changing the tests to use temp tables
makes them safe w.r.t. concurrent activity doesn't seem to be
correct. We do set OldestXmin for temp tables aggressively enough to
allow us to remove all dead tuples, but the test case behavior relies
on whether we are able to prune the chain. AFAICS, we are using
different cut-offs in heap_page_prune when it is called via
lazy_scan_heap. So that seems to be causing both the failures.

-- 
With Regards,
Amit Kapila.



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> I think our assumption that changing the tests to have temp tables
> will make them safe w.r.t concurrent activity doesn't seem to be
> correct. We do set OldestXmin for temp tables aggressive enough that
> it allows us to remove all dead tuples but the test case behavior lies
> on whether we are able to prune the chain. AFAICS, we are using
> different cut-offs in heap_page_prune when it is called via
> lazy_scan_heap. So that seems to be causing both the failures.

Hm, reasonable theory.

I was able to partially reproduce whelk's failure here.  I got a
couple of cases of "cannot freeze committed xmax", which then leads
to the second NOTICE diff; but I couldn't reproduce the first
NOTICE diff.  That was out of about a thousand tries :-( so it's not
looking like a promising thing to reproduce without modifying the test.

I wonder whether "cannot freeze committed xmax" doesn't represent an
actual bug, ie is a7212be8b setting the cutoff *too* aggressively?
But if so, why's it so hard to reproduce?

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
I wrote:
> I was able to partially reproduce whelk's failure here.  I got a
> couple of cases of "cannot freeze committed xmax", which then leads
> to the second NOTICE diff; but I couldn't reproduce the first
> NOTICE diff.  That was out of about a thousand tries :-( so it's not
> looking like a promising thing to reproduce without modifying the test.

... however, it's trivial to reproduce via manual interference,
using the same strategy discussed recently for another case:
add a pg_sleep at the start of the heap_surgery.sql script,
run "make installcheck", and while that's running start another
session in which you begin a serializable transaction, execute
any old SELECT, and wait.  AFAICT this reproduces all of whelk's
symptoms with 100% reliability.

With a little more effort, this could be automated by putting
some long-running transaction (likely, it needn't be any more
complicated than "select pg_sleep(10)") in a second test script
launched in parallel with heap_surgery.sql.

So this confirms the suspicion that the cause of the buildfarm
failures is a concurrently-open transaction, presumably from
autovacuum.  I don't have time to poke further right now.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
I wrote:
> So this confirms the suspicion that the cause of the buildfarm
> failures is a concurrently-open transaction, presumably from
> autovacuum.  I don't have time to poke further right now.

I spent some more time analyzing this, and there seem to be two distinct
issues:

1. My patch a7212be8b does indeed have a problem.  It will allow
vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
any concurrent transactions, this falls foul of
heap_prepare_freeze_tuple's expectation that

 * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
 * XID older than it could neither be running nor seen as running by any
 * open transaction.  This ensures that the replacement will not change
 * anyone's idea of the tuple state.

The "cannot freeze committed xmax" error message appears to be banking on
the assumption that we'd not reach heap_prepare_freeze_tuple for any
committed-dead tuple unless its xmax is past the specified cutoff_xid.

2. As Amit suspected, there's an inconsistency between pruneheap.c's
rules for which tuples are removable and vacuum.c's rules for that.
This seems like a massive bug in its own right: what's the point of
pruneheap.c going to huge effort to decide whether it should keep a
tuple if vacuum will then kill it anyway?  I do not understand why
whoever put in the GlobalVisState stuff only applied it in pruneheap.c
and not VACUUM proper.

These two points interact, in that we don't get to the "cannot freeze"
failure except when pruneheap has decided not to remove something that
is removable according to VACUUM's rules.  (VACUUM doesn't actually
remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
a7212be8b then the pg_surgery test still fails in the presence of a
concurrent transaction (both of the expected "skipping TID" notices
disappear).  So reverting that patch wouldn't get us out of trouble.

I think to move forward, we need to figure out what the freezing
behavior ought to be for temp tables.  We could make it the same
as it was before a7212be8b, which'd just require some more complexity
in vacuum_set_xid_limits.  However, that negates the idea that we'd
like VACUUM's behavior on a temp table to be fully independent of
whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
behavior to stand, but then it seems we need to lobotomize the error
check in heap_prepare_freeze_tuple to some extent.

Independently of that, it seems like we need to fix things so that
when pruneheap.c is called by vacuum, it makes EXACTLY the same
dead-or-not-dead decisions that the main vacuum code makes.  This
business with applying some GlobalVisState rule or other instead
seems just as unsafe as can be.

AFAICS, there is no chance of the existing pg_surgery regression test
being fully stable if we don't fix both things.

BTW, attached is a quick-hack patch to allow automated testing
of this scenario, along the lines I sketched yesterday.  This
test passes if you run the two scripts serially, but not when
you run them in parallel.  I'm not proposing this for commit;
it's a hack and its timing behavior is probably not stable enough
for the buildfarm.  But it's pretty useful for poking at these
issues.

            regards, tom lane

diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
index a66776c4c4..a7bcfca0fb 100644
--- a/contrib/pg_surgery/Makefile
+++ b/contrib/pg_surgery/Makefile
@@ -9,7 +9,7 @@ EXTENSION = pg_surgery
 DATA = pg_surgery--1.0.sql
 PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"

-REGRESS = heap_surgery
+REGRESS = --schedule=surgery_schedule

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index d4a757ffa0..6ef5597e7c 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,4 +1,10 @@
 create extension pg_surgery;
+select pg_sleep(0.1);  -- ensure concurrent transaction is ready
+ pg_sleep
+----------
+
+(1 row)
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/expected/sleep.out b/contrib/pg_surgery/expected/sleep.out
new file mode 100644
index 0000000000..208e31163d
--- /dev/null
+++ b/contrib/pg_surgery/expected/sleep.out
@@ -0,0 +1,6 @@
+select pg_sleep(1.0);  -- long enough for whole heap_surgery test
+ pg_sleep
+----------
+
+(1 row)
+
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 6526b27535..212c657f3c 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,5 +1,7 @@
 create extension pg_surgery;

+select pg_sleep(0.1);  -- ensure concurrent transaction is ready
+
 -- create a normal heap table and insert some rows.
 -- use a temp table so that vacuum behavior doesn't depend on global xmin
 create temp table htab (a int);
diff --git a/contrib/pg_surgery/sql/sleep.sql b/contrib/pg_surgery/sql/sleep.sql
new file mode 100644
index 0000000000..a85c25bca2
--- /dev/null
+++ b/contrib/pg_surgery/sql/sleep.sql
@@ -0,0 +1 @@
+select pg_sleep(1.0);  -- long enough for whole heap_surgery test
diff --git a/contrib/pg_surgery/surgery_schedule b/contrib/pg_surgery/surgery_schedule
new file mode 100644
index 0000000000..1bb3aa5e53
--- /dev/null
+++ b/contrib/pg_surgery/surgery_schedule
@@ -0,0 +1 @@
+test: heap_surgery sleep

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Ashutosh Sharma
Date:
Hi Tom,

On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum.  I don't have time to poke further right now.
>
> I spent some more time analyzing this, and there seem to be two distinct
> issues:
>
> 1. My patch a7212be8b does indeed have a problem.  It will allow
> vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
> table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
> any concurrent transactions, this falls foul of
> heap_prepare_freeze_tuple's expectation that
>
>  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
>  * XID older than it could neither be running nor seen as running by any
>  * open transaction.  This ensures that the replacement will not change
>  * anyone's idea of the tuple state.
>
> The "cannot freeze committed xmax" error message appears to be banking on
> the assumption that we'd not reach heap_prepare_freeze_tuple for any
> committed-dead tuple unless its xmax is past the specified cutoff_xid.
>
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules.  (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear).  So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>
> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

From the explanation you provided above, it seems like the test-case
for the pg_surgery module is failing because there is some issue with
the changes done in commit a7212be8b (shown below). In other words, I
believe that the pg_surgery test-case has actually detected an
issue in that commit.

commit a7212be8b9e0885ee769e8c55f99ef742cda487b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Sep 1 18:37:12 2020 -0400

    Set cutoff xmin more aggressively when vacuuming a temporary table.

    ....

So, do you mean to say that if the issues related to temp tables
induced by the above commit are fixed, the regression test for
pg_surgery will become stable?

Please let me know if I am missing something here. Thank you.

> BTW, attached is a quick-hack patch to allow automated testing
> of this scenario, along the lines I sketched yesterday.  This
> test passes if you run the two scripts serially, but not when
> you run them in parallel.  I'm not proposing this for commit;
> it's a hack and its timing behavior is probably not stable enough
> for the buildfarm.  But it's pretty useful for poking at these
> issues.
>

Yeah, understood, thanks for sharing this.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Amit Kapila
Date:
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum.  I don't have time to poke further right now.
>
..
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules.  (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear).  So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>

Yeah, on a quick look it seems that before commit dc7420c2c9,
pruneheap.c and the main vacuum code used to make the same decision,
and that is the commit which introduced the GlobalVisState stuff.

> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

What if we ensure that it runs with autovacuum = off and with no
parallel test running? I am not sure about the second part, but if we
can do that then the test will probably be stable.

-- 
With Regards,
Amit Kapila.



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAICS, there is no chance of the existing pg_surgery regression test
>> being fully stable if we don't fix both things.

> What if ensure that it runs with autovacuum = off and there is no
> parallel test running? I am not sure about the second part but if we
> can do that then the test will be probably stable.

Then it'll not be usable under "make installcheck", which is not
very nice.  It's also arguable that you aren't testing pg_surgery
under real-world conditions if you do it like that.

Moreover, I think that both of these points need to be addressed
anyway, as they represent bugs that are reachable independently
of pg_surgery.  Admittedly, we do not have a test case that
proves that the inconsistency between pruneheap and vacuum has
any bad effects in the absence of a7212be8b.  But do you really
want to bet that there are none?

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. My patch a7212be8b does indeed have a problem.  It will allow
> vacuum_set_xid_limits to compute freezeLimit = nextXid for a temp
> table if freeze_min_age is zero (ie VACUUM FREEZE).  If there's
> any concurrent transactions, this falls foul of
> heap_prepare_freeze_tuple's expectation that
>
>  * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
>  * XID older than it could neither be running nor seen as running by any
>  * open transaction.  This ensures that the replacement will not change
>  * anyone's idea of the tuple state.
>
> The "cannot freeze committed xmax" error message appears to be banking on
> the assumption that we'd not reach heap_prepare_freeze_tuple for any
> committed-dead tuple unless its xmax is past the specified cutoff_xid.
>
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

I am not sure I fully understand why you're contrasting pruneheap.c
with vacuum here, because vacuum just does HOT pruning to remove dead
tuples - maybe calling the relevant functions with different
arguments, but it doesn't have its own independent logic for that.

The key point is that the freezing code isn't, or at least
historically wasn't, very smart about dead tuples. For example, I
think if you told it to freeze something that was dead it would just
do it, which is obviously bad. And that's why Andres stuck those
sanity checks in there. But it's still pretty fragile. I think perhaps
the pruning code should be rewritten in such a way that it can be
combined with the code that freezes and marks pages all-visible, so
that there's not so much action at a distance, but such an endeavor is
in itself pretty scary, and certainly not back-patchable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Sep 20, 2020 at 1:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
>> rules for which tuples are removable and vacuum.c's rules for that.
>> This seems like a massive bug in its own right: what's the point of
>> pruneheap.c going to huge effort to decide whether it should keep a
>> tuple if vacuum will then kill it anyway?  I do not understand why
>> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
>> and not VACUUM proper.

> I am not sure I fully understand why you're contrasting pruneheap.c
> with vacuum here, because vacuum just does HOT pruning to remove dead
> tuples - maybe calling the relevant functions with different
> arguments, but it doesn't have its own independent logic for that.

Right, but what we end up with is that the very same tuple xmin and
xmax might result in pruning/deletion, or not, depending on whether
it's part of a HOT chain or not.  That's at best pretty weird, and
at worst it means that corner-case bugs in other places are triggered
in only one of the two scenarios ... which is what we have here.

> The key point is that the freezing code isn't, or at least
> historically wasn't, very smart about dead tuples. For example, I
> think if you told it to freeze something that was dead it would just
> do it, which is obviously bad. And that's why Andres stuck those
> sanity checks in there. But it's still pretty fragile. I think perhaps
> the pruning code should be rewritten in such a way that it can be
> combined with the code that freezes and marks pages all-visible, so
> that there's not so much action at a distance, but such an endeavor is
> in itself pretty scary, and certainly not back-patchable.

Not sure.  The pruning code is trying to serve two masters, that is
both VACUUM and on-the-fly cleanup during ordinary queries.  If you
try to merge it with other tasks that VACUUM does, you're going to
have a mess for the second usage.  I fear there's going to be pretty
strong conservation of cruft either way.

FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
*not* my preferred fix here.  But it'll take some work in other
places to preserve them.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, but what we end up with is that the very same tuple xmin and
> xmax might result in pruning/deletion, or not, depending on whether
> it's part of a HOT chain or not.  That's at best pretty weird, and
> at worst it means that corner-case bugs in other places are triggered
> in only one of the two scenarios ... which is what we have here.

I'm not sure I really understand how that's happening, because surely
HOT chains and non-HOT chains are pruned by the same code, but it
doesn't sound good.

> FWIW, weakening the sanity checks in heap_prepare_freeze_tuple is
> *not* my preferred fix here.  But it'll take some work in other
> places to preserve them.

Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-09-21 16:02:29 -0400, Robert Haas wrote:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Right, but what we end up with is that the very same tuple xmin and
> > xmax might result in pruning/deletion, or not, depending on whether
> > it's part of a HOT chain or not.  That's at best pretty weird, and
> > at worst it means that corner-case bugs in other places are triggered
> > in only one of the two scenarios ... which is what we have here.
> 
> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

Not necessarily, unfortunately:

        case HEAPTUPLE_DEAD:

            /*
             * Ordinarily, DEAD tuples would have been removed by
             * heap_page_prune(), but it's possible that the tuple
             * state changed since heap_page_prune() looked.  In
             * particular an INSERT_IN_PROGRESS tuple could have
             * changed to DEAD if the inserter aborted.  So this
             * cannot be considered an error condition.
             *
             * If the tuple is HOT-updated then it must only be
             * removed by a prune operation; so we keep it just as if
             * it were RECENTLY_DEAD.  Also, if it's a heap-only
             * tuple, we choose to keep it, because it'll be a lot
             * cheaper to get rid of it in the next pruning pass than
             * to treat it like an indexed tuple. Finally, if index
             * cleanup is disabled, the second heap pass will not
             * execute, and the tuple will not get removed, so we must
             * treat it like any other dead tuple that we choose to
             * keep.
             *
             * If this were to happen for a tuple that actually needed
             * to be deleted, we'd be in trouble, because it'd
             * possibly leave a tuple below the relation's xmin
             * horizon alive.  heap_prepare_freeze_tuple() is prepared
             * to detect that case and abort the transaction,
             * preventing corruption.
             */
            if (HeapTupleIsHotUpdated(&tuple) ||
                HeapTupleIsHeapOnly(&tuple) ||
                params->index_cleanup == VACOPT_TERNARY_DISABLED)
                nkeep += 1;
            else
                tupgone = true; /* we can delete the tuple */
            all_visible = false;


So if e.g. a transaction aborts between the heap_page_prune and this
check the pruning behaviour depends on whether the tuple is part of a
HOT chain or not.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 21, 2020 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Right, but what we end up with is that the very same tuple xmin and
>> xmax might result in pruning/deletion, or not, depending on whether
>> it's part of a HOT chain or not.  That's at best pretty weird, and
>> at worst it means that corner-case bugs in other places are triggered
>> in only one of the two scenarios ... which is what we have here.

> I'm not sure I really understand how that's happening, because surely
> HOT chains and non-HOT chains are pruned by the same code, but it
> doesn't sound good.

No, they're not.  lazy_scan_heap() will never remove a tuple that
is HeapTupleIsHotUpdated or HeapTupleIsHeapOnly, even if it thinks
it's DEAD -- cf. vacuumlazy.c, about line 1350.  So tuples in
a HOT chain are deleted exactly when pruneheap.c sees fit to do so.
OTOH, for tuples not in a HOT chain, the decision is (I believe)
entirely on lazy_scan_heap().  And the core of my complaint is that
pruneheap.c's decisions about what is DEAD are not reliably identical
to what HeapTupleSatisfiesVacuum thinks.

I don't mind if a free-standing prune operation has its own rules,
but when it's invoked by VACUUM it ought to follow VACUUM's rules about
what is dead or alive.  What remains unclear at this point is whether
we ought to import some of the intelligence added by the GlobalVisState
patch into VACUUM's behavior, or just dumb down pruneheap.c so that
it applies exactly the HeapTupleSatisfiesVacuum rules when invoked
by VACUUM.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-09-20 13:13:16 -0400, Tom Lane wrote:
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.

The reason for that is that the GlobalVisState stuff is computed
heuristically (and then re-checked if that's not sufficient to prune a
tuple, unless that has already been done). That's done so GetSnapshotData()
doesn't have to look at each backend's ->xmin, which is quite a massive
speedup at higher connection counts, as each backend's ->xmin changes much
more often than each backend's xid.

But for VACUUM we need to do the accurate scan of the procarray anyway,
because we need an accurate value for things other than HOT pruning
decisions.

What do you exactly mean with the "going to huge effort to decide" bit?


> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.

I think that's an argument for what I suggested elsewhere, which is that
we should move the logic for a different horizon for temp tables out of
vacuum_set_xid_limits, and into procarray.


> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.

It's not great, I agree. Not sure there is a super nice answer
though. Note that, even before my changes, vacuumlazy can decide
differently than pruning whether a tuple is live. E.g. when an inserting
transaction aborts. That's pretty much unavoidable as long as we have
multiple HTSV calls for a tuple, since none of our locking can (nor
should) prevent concurrent transactions from aborting.

Before your new code avoiding the GetOldestNonRemovableTransactionId()
call for temp tables, the GlobalVis* can never be more pessimistic than
decisions based on a prior GetOldestNonRemovableTransactionId call (as
that internally updates the heuristic horizons used by GlobalVis).

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> The reason for that is that the GlobalVisState stuff is computed
> heuristically (and then re-checked if that's not sufficient to prune a
> tuple, unless that has already been done). That's done so GetSnapshotData()
> doesn't have to look at each backend's ->xmin, which is quite a massive
> speedup at higher connection counts, as each backend's ->xmin changes much
> more often than each backend's xid.

OK.

> What do you exactly mean with the "going to huge effort to decide" bit?

I'd looked at all the complexity around GlobalVisState, but failed to
register that it should be pretty cheap on a per-tuple basis.  So never
mind that complaint.  The point that remains is just that it's different
from HeapTupleSatisfiesVacuum's rules.

>> I think to move forward, we need to figure out what the freezing
>> behavior ought to be for temp tables.  We could make it the same
>> as it was before a7212be8b, which'd just require some more complexity
>> in vacuum_set_xid_limits.  However, that negates the idea that we'd
>> like VACUUM's behavior on a temp table to be fully independent of
>> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
>> behavior to stand, but then it seems we need to lobotomize the error
>> check in heap_prepare_freeze_tuple to some extent.

> I think that's an argument for what I suggested elsewhere, which is that
> we should move the logic for a different horizon for temp tables out of
> vacuum_set_xid_limits, and into procarray.

But procarray does not seem like a great place for
table-persistence-dependent decisions either?

>> Independently of that, it seems like we need to fix things so that
>> when pruneheap.c is called by vacuum, it makes EXACTLY the same
>> dead-or-not-dead decisions that the main vacuum code makes.  This
>> business with applying some GlobalVisState rule or other instead
>> seems just as unsafe as can be.

> It's not great, I agree. Not sure there is a super nice answer
> though. Note that, even before my changes, vacuumlazy can decide
> differently than pruning whether a tuple is live. E.g. when an inserting
> transaction aborts. That's pretty much unavoidable as long as we have
> multiple HTSV calls for a tuple, since none of our locking can (nor
> should) prevent concurrent transactions from aborting.

It's clear that if the environment changes between test A and test B,
we might get different results.  What I'm not happy about is that the
rules are different, so we might get different results even if the
environment did not change.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> I think to move forward, we need to figure out what the freezing
> >> behavior ought to be for temp tables.  We could make it the same
> >> as it was before a7212be8b, which'd just require some more complexity
> >> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> >> like VACUUM's behavior on a temp table to be fully independent of
> >> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> >> behavior to stand, but then it seems we need to lobotomize the error
> >> check in heap_prepare_freeze_tuple to some extent.
> 
> > I think that's an argument for what I suggested elsewhere, which is that
> > we should move the logic for a different horizon for temp tables out of
> > vacuum_set_xid_limits, and into procarray.
> 
> But procarray does not seem like a great place for
> table-persistence-dependent decisions either?

That ship has sailed a long long time ago though. GetOldestXmin() has
looked at the passed-in relation for quite a while, and even before
that we had logic about 'allDbs' etc.  It doesn't easily seem possible
to avoid that, given how intimately that's coupled with how snapshots
are built and used, database & vacuumFlags checks etc.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> I think that's an argument for what I suggested elsewhere, which is that
>>> we should move the logic for a different horizon for temp tables out of
>>> vacuum_set_xid_limits, and into procarray.

>> But procarray does not seem like a great place for
>> table-persistence-dependent decisions either?

> That ship has sailed a long long time ago though. GetOldestXmin() has
> looked at the passed-in relation for quite a while, and even before
> that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> to avoid that, given how intimately that's coupled with how snapshots
> are built and used, database & vacuumFlags checks etc.

OK.  Given that you've got strong feelings about this, do you want to
propose a patch?  I'm happy to fix it, since it's at least in part my
bug, but I probably won't do it exactly like you would.

            regards, tom lane



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
On 2020-09-21 17:03:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-09-21 16:40:40 -0400, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> I think that's an argument for what I suggested elsewhere, which is that
> >>> we should move the logic for a different horizon for temp tables out of
> >>> vacuum_set_xid_limits, and into procarray.
> 
> >> But procarray does not seem like a great place for
> >> table-persistence-dependent decisions either?
> 
> > That ship has sailed a long long time ago though. GetOldestXmin() has
> > looked at the passed-in relation for quite a while, and even before
> > that we had logic about 'allDbs' etc.  It doesn't easily seem possible
> > to avoid that, given how intimately that's coupled with how snapshots
> > are built and used, database & vacuumFlags checks etc.
> 
> OK.  Given that you've got strong feelings about this, do you want to
> propose a patch?  I'm happy to fix it, since it's at least in part my
> bug, but I probably won't do it exactly like you would.

I can give it a try. I can see several paths of varying invasiveness,
not sure yet what the best approach is. Let me think about it for a bit.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-09-21 14:20:03 -0700, Andres Freund wrote:
> I can give it a try. I can see several paths of varying invasiveness,
> not sure yet what the best approach is. Let me think about it for a bit.

Ugh, sorry for taking so long to get around to this.

Attached is a *prototype* implementation of this concept, which clearly is
lacking some comment work (and is intentionally lacking some
re-indentation).

I described my thoughts about how to limit the horizons for temp tables in
https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Besides comments this probably mainly needs a bit more tests around temp
table vacuuming. Should have at least an isolation test that verifies
that temp table rows can be a) vacuumed b) pruned away in the presence
of other sessions with xids.

Greetings,

Andres Freund

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
> Attached is a *prototype* implementation of this concept, which clearly is
> lacking some comment work (and is intentionally lacking some
> re-indentation).
> 
> I described my thoughts about how to limit the horizons for temp tables in
> https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Attached is an updated version of this patch. Quite a bit of polish,
added removal of the isTopLevel arguments added by a7212be8b9e that are now
unnecessary, and changed the initialization of the temp table horizons
to be latestCompletedXid + 1 instead of just latestCompletedXid when no
xid is assigned.


> Besides comments this probably mainly needs a bit more tests around temp
> table vacuuming. Should have at least an isolation test that verifies
> that temp table rows can be a) vacuumed b) pruned away in the presence
> of other sessions with xids.

I added an isolationtester test for this. It verifies that dead rows in
temp tables get vacuumed and pruned despite concurrent sessions having
older snapshots. It does so by forcing an IOS and checking the number of
heap fetches reported by EXPLAIN. I also added a companion test for
permanent relations, ensuring that such rows do not get removed.


Any comments? Otherwise I'll push that patch tomorrow.

Greetings,

Andres Freund

Attachment

Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-10-27 20:51:10 -0700, Andres Freund wrote:
> On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
> > Attached is a *prototype* implementation of this concept, which clearly is
> > lacking some comment work (and is intentionally lacking some
> > re-indentation).
> > 
> > I described my thoughts about how to limit the horizons for temp tables in
> > https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de
> 
> Attached is an updated version of this patch. Quite a bit of polish,
> added removal of the isTopLevel arguments added by a7212be8b9e that are now
> unnecessary, and changed the initialization of the temp table horizons
> to be latestCompletedXid + 1 instead of just latestCompletedXid when no
> xid is assigned.
> 
> 
> > Besides comments this probably mainly needs a bit more tests around temp
> > table vacuuming. Should have at least an isolation test that verifies
> > that temp table rows can be a) vacuumed b) pruned away in the presence
> > of other sessions with xids.
> 
> I added an isolationtester test for this. It verifies that dead rows in
> temp tables get vacuumed and pruned despite concurrent sessions having
> older snapshots. It does so by forcing an IOS and checking the number of
> heap fetches reported by EXPLAIN. I also added a companion test for
> permanent relations, ensuring that such rows do not get removed.
> 
> 
> Any comments? Otherwise I'll push that patch tomorrow.

Just pushed this. Let's see what the BF says...

It's kinda cool how much more aggressive hot pruning / killtuples now is
for temp tables.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> Just pushed this. Let's see what the BF says...

It says that apparently something is unstable about my new test. It
first passed on a few animals, but then failed a lot in a row. Looking.



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
> On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> > Just pushed this. Let's see what the BF says...
> 
> It says that apparently something is unstable about my new test. It
> first passed on a few animals, but then failed a lot in a row. Looking.

The differentiating factor is force_parallel_mode=regress.

Ugh, this is nasty: the problem is that we can end up computing the
horizons for the first time before MyDatabaseId is even set. That leads
us to compute a too-aggressive horizon for plain tables, because we skip
over other backends' xmins while MyDatabaseId is still InvalidOid:

        /*
         * Normally queries in other databases are ignored for anything but
         * the shared horizon. But in recovery we cannot compute an accurate
         * per-database horizon as all xids are managed via the
         * KnownAssignedXids machinery.
         */
        if (in_recovery ||
            proc->databaseId == MyDatabaseId ||
            proc->databaseId == 0)    /* always include WalSender */
            h->data_oldest_nonremovable =
                TransactionIdOlder(h->data_oldest_nonremovable, xmin);

That then leads us to consider a row fully dead in
heap_hot_search_buffers(), triggering the killtuples logic and causing
the test to fail.

With force_parallel_mode=regress we constantly start parallel workers,
which makes it much more likely that this case is hit.

It's trivial to fix luckily...

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andres Freund
Date:
Hi,

On 2020-10-28 21:00:30 -0700, Andres Freund wrote:
> On 2020-10-28 19:09:14 -0700, Andres Freund wrote:
> > On 2020-10-28 18:13:44 -0700, Andres Freund wrote:
> > > Just pushed this. Let's see what the BF says...
> > 
> > It says that apparently something is unstable about my new test. It
> > first passed on a few animals, but then failed a lot in a row. Looking.
> 
> The differentiating factor is force_parallel_mode=regress.
> 
> Ugh, this is nasty: the problem is that we can end up computing the
> horizons for the first time before MyDatabaseId is even set. That leads
> us to compute a too-aggressive horizon for plain tables, because we skip
> over other backends' xmins while MyDatabaseId is still InvalidOid:
> 
>         /*
>          * Normally queries in other databases are ignored for anything but
>          * the shared horizon. But in recovery we cannot compute an accurate
>          * per-database horizon as all xids are managed via the
>          * KnownAssignedXids machinery.
>          */
>         if (in_recovery ||
>             proc->databaseId == MyDatabaseId ||
>             proc->databaseId == 0)    /* always include WalSender */
>             h->data_oldest_nonremovable =
>                 TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> 
> That then leads us to consider a row fully dead in
> heap_hot_search_buffers(), triggering the killtuples logic and causing
> the test to fail.
> 
> With force_parallel_mode=regress we constantly start parallel workers,
> which makes it much more likely that this case is hit.
> 
> It's trivial to fix luckily...

Pushed that fix, hopefully that calms the BF.

Greetings,

Andres Freund



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andrey Borodin
Date:
Hi hackers!

Does anyone maintain open-source pg_surgery analogs for released versions of PG?
It seems to me I'll have to use something like this, and I just thought that I
should consider pg_surgery in favour of our pg_dirty_hands.

Thanks!

Best regards, Andrey Borodin.
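For the archives: the pg_surgery module that was committed to contrib exposes two functions, heap_force_freeze() and heap_force_kill(), each taking a relation and an array of TIDs. Usage looks like the following sketch, where 'mytab' and the TIDs are placeholders; the TIDs would come from inspecting the table, e.g. with the ctid system column or pageinspect:

```sql
CREATE EXTENSION pg_surgery;

-- Force-freeze a tuple whose xmin predates relfrozenxid, making it
-- visible to everyone regardless of the corrupt xmin:
SELECT heap_force_freeze('mytab'::regclass, ARRAY['(0,1)']::tid[]);

-- Or force-kill it, marking it dead so a later VACUUM can remove it:
SELECT heap_force_kill('mytab'::regclass, ARRAY['(0,2)']::tid[]);
```

Both operations bypass normal MVCC rules, so they are strictly a last resort on already-corrupted data.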


Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Does anyone maintain open-source pg_surgery analogs for released versions of PG?
> It seems to me I'll have to use something like this, and I just thought that I
> should consider pg_surgery in favour of our pg_dirty_hands.
 

I do not. I'm still of the opinion that we ought to back-patch
pg_surgery. This didn't attract a consensus before, and it's hard to
dispute that it's a new feature in what would be a back branch. But
it's unclear to me how users are otherwise supposed to recover from
some of the bugs that are or have been present in those back branches.
I'm not sure that I see the logic in telling people we'll try to
prevent them from getting hosed in the future but if they're already
hosed they can wait for v14 to fix it. They can't wait that long, and
a dump-and-restore cycle is awfully painful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Michael Banck
Date:
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote:
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > Does anyone maintain opensource pg_surgery analogs for released
> > versions of PG?  It seems to me I'll have to use something like this
> > and I just though that I should consider pg_surgery in favour of our
> > pg_dirty_hands.
> 
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery. This didn't attract a consensus before, and it's hard to
> dispute that it's a new feature in what would be a back branch. But
> it's unclear to me how users are otherwise supposed to recover from
> some of the bugs that are or have been present in those back branches.

One other possibility would be to push a version of pg_surgery that is
compatible with the back-branches somewhere external (e.g. to
git.postgresql.org and/or GitHub), so that it can be picked up by
distributions and/or individual users in need.

That is assuming it does not need assorted server changes to go with it; I
did not read the thread in detail, but I was under the impression it is a
client program?


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, commercial register (HRB) Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Robert Haas
Date:
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck <michael.banck@credativ.de> wrote:
> One other possiblity would be to push a version of pg_surgery that is
> compatible with the back-branches somewhere external (e.g. either
> git.postgresql.org and/or Github), so that it can be picked up by
> distributions and/or individual users in need.

Sure, but I don't see how that's better.

> That is Assuming it does not need assorted server changes to go with; I
> did not read the thread in detail but I was under the assumption it is a
> client program?

It's a server extension. It does not require core changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

From
Andrey Borodin
Date:

> On 18 Jan 2021, at 18:54, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Does anyone maintain open-source pg_surgery analogs for released versions of PG?
>> It seems to me I'll have to use something like this, and I just thought that I
>> should consider pg_surgery in favour of our pg_dirty_hands.
>
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery.
+1.
Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW, it works fine on 12).
It's a kind of 911 tool: one doesn't think they'll need it until they actually do, and by then the clock is ticking.
OTOH, it opens up new ways to shoot yourself in the foot.

Best regards, Andrey Borodin.