Thread: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Dilip Kumar
Date:
The attached patch allows vacuum to continue by emitting a WARNING
for the corrupted tuple instead of immediately erroring out, as discussed
at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to
control whether the vacuum continues or stops when a corrupted tuple is
encountered.  So if vacuum_tolerate_damage is set, then in all the cases
in heap_prepare_freeze_tuple where a corrupted xid is detected, it will
emit a warning, return false to indicate that nothing is changed in the
tuple, and set 'tuple_totally_frozen' to false.  Since we are returning
false, the caller will not try to freeze such a tuple, and because
tuple_totally_frozen is also false the page will not be marked all-frozen
even if all other tuples on the page are frozen.
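
To illustrate the idea, here is a rough sketch of the pattern used in
heap_prepare_freeze_tuple (simplified and only illustrative; the exact
names and messages in the attached patch may differ):

    if (TransactionIdPrecedes(xid, relfrozenxid))
    {
        /* downgrade to WARNING when vacuum_tolerate_damage is set */
        ereport(vacuum_tolerate_damage ? WARNING : ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmin %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));
        *totally_frozen_p = false;  /* keep the page from being marked all-frozen */
        return false;               /* leave the tuple unchanged */
    }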

Alternatively, we could try to freeze the other XIDs in the tuple that
are not corrupted, but I don't think we would gain anything from that,
because if one of xmin or xmax is wrong then the next vacuum run will hit
the same WARNING or ERROR anyway.  Is there any other opinion on this?

[1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
"Andrey M. Borodin"
Date:
Hi Dilip!


> On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The attached patch allows the vacuum to continue by emitting WARNING
> for the corrupted tuple instead of immediately error out as discussed
> at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage, to
> control whether to continue the vacuum or to stop on the occurrence of
> a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> detected, it will emit a warning and return that nothing is changed in
> the tuple and the 'tuple_totally_frozen' will also be set to false.
> Since we are returning false the caller will not try to freeze such
> tuple and the tuple_totally_frozen is also set to false so that the
> page will not be marked to all frozen even if all other tuples in the
> page are frozen.
>
> Alternatively,  we can try to freeze other XIDs in the tuple which is
> not corrupted but I don't think we will gain anything from this,
> because if one of the xmin or xmax is wrong then next time also if we
> run the vacuum then we are going to get the same WARNING or the ERROR.
> Is there any other opinion on this?

FWIW AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit[0].
So +1 from me.
But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will probably have to deal with an absent CLOG.
I think this GUC is only a part of an incomplete solution.
Personally I'd be happy if this is backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

Thanks!

Best regards, Andrey Borodin.


[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c#L443





On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The attached patch allows the vacuum to continue by emitting WARNING
> for the corrupted tuple instead of immediately error out as discussed
> at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage, to
> control whether to continue the vacuum or to stop on the occurrence of
> a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> detected, it will emit a warning and return that nothing is changed in
> the tuple and the 'tuple_totally_frozen' will also be set to false.
> Since we are returning false the caller will not try to freeze such
> tuple and the tuple_totally_frozen is also set to false so that the
> page will not be marked to all frozen even if all other tuples in the
> page are frozen.
>
> Alternatively,  we can try to freeze other XIDs in the tuple which is
> not corrupted but I don't think we will gain anything from this,
> because if one of the xmin or xmax is wrong then next time also if we
> run the vacuum then we are going to get the same WARNING or the ERROR.
> Is there any other opinion on this?

Robert has mentioned at [1] that we should probably refuse to update
'relfrozenxid/relminmxid' when we encounter such a tuple and emit a
WARNING instead of an error.  I think we should do that in some cases,
but IMHO it's not a very good idea in all cases.  Basically, if the xmin
precedes the relfrozenxid then we could probably still allow relfrozenxid
to be updated, whereas if the xmin precedes the cutoff xid and is still
uncommitted then we should probably stop relfrozenxid from being updated
so that CLOG does not get truncated.  I will make these changes if we
agree on the idea.  Or should we keep it simple and never allow
'relfrozenxid/relminmxid' to be updated in such cases?

[1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sun, Jul 19, 2020 at 4:56 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Dilip!
>
>
> > On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > The attached patch allows the vacuum to continue by emitting WARNING
> > for the corrupted tuple instead of immediately error out as discussed
> > at [1].
> >
> > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > control whether to continue the vacuum or to stop on the occurrence of
> > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > detected, it will emit a warning and return that nothing is changed in
> > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > Since we are returning false the caller will not try to freeze such
> > tuple and the tuple_totally_frozen is also set to false so that the
> > page will not be marked to all frozen even if all other tuples in the
> > page are frozen.
> >
> > Alternatively,  we can try to freeze other XIDs in the tuple which is
> > not corrupted but I don't think we will gain anything from this,
> > because if one of the xmin or xmax is wrong then next time also if we
> > run the vacuum then we are going to get the same WARNING or the ERROR.
> > Is there any other opinion on this?
>
> FWIW AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit[0].
> So +1 from me.

Thanks for showing interest in this patch.

> But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will probably have to deal with an absent CLOG.
> I think this GUC is only a part of an incomplete solution.
> Personally I'd be happy if this is backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

I agree that this solves only one part of the problem, and in some
cases it may not work if the CLOG itself is corrupted, i.e. does not
exist for the xids that are not yet frozen.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Alvaro Herrera
Date:
On 2020-Jul-20, Dilip Kumar wrote:

> On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> > So if the vacuum_tolerate_damage is set then in
> > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > detected, it will emit a warning and return that nothing is changed in
> > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > Since we are returning false the caller will not try to freeze such
> > tuple and the tuple_totally_frozen is also set to false so that the
> > page will not be marked to all frozen even if all other tuples in the
> > page are frozen.

> Robert has mentioned at [1] that we probably should refuse to update
> 'relfrozenxid/relminmxid' when we encounter such tuple and emit
> WARNING instead of an error.

Isn't this already happening per your description above?

> I think we shall do that in some cases
> but IMHO it's not a very good idea in all the cases.  Basically, if
> the xmin precedes the relfrozenxid then probably we should allow to
> update the relfrozenxid whereas if the xmin precedes cutoff xid and
> still uncommitted then probably we might stop relfrozenxid from being
> updated so that we can stop CLOG from getting truncated.

I'm not sure I understand 100% what you're talking about here (the first
half seems dangerous unless you misspoke), but in any case it seems a
pointless optimization.  I mean, if the heap is corrupted, you can hope
to complete the vacuum (which will hopefully return which *other* tuples
are similarly corrupt) but trying to advance relfrozenxid is a lost
cause.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
"Andrey M. Borodin"
Date:

> On 20 July 2020, at 21:44, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> I think we shall do that in some cases
>> but IMHO it's not a very good idea in all the cases.  Basically, if
>> the xmin precedes the relfrozenxid then probably we should allow to
>> update the relfrozenxid whereas if the xmin precedes cutoff xid and
>> still uncommitted then probably we might stop relfrozenxid from being
>> updated so that we can stop CLOG from getting truncated.
>
> I'm not sure I understand 100% what you're talking about here (the first
> half seems dangerous unless you misspoke), but in any case it seems a
> pointless optimization.  I mean, if the heap is corrupted, you can hope
> to complete the vacuum (which will hopefully return which *other* tuples
> are similarly corrupt) but trying to advance relfrozenxid is a lost
> cause.

I think the point here is to actually move relfrozenxid back. But the mince can't be turned back. If CLOG is rotated -
the table is corrupted beyond easy repair.

I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.

We were observing this kind of corruption in three cases:
1. With a bug in a patched Linux kernel page cache we could lose an FS page write
2. With a bug in WAL-G block-level incremental backup we could lose an update of the page.
3. With a firmware bug in SSD drives from one vendor, a write to the block storage device was lost
One page in the database is of some non-latest version (but with a correct checksum; it's just an old version). In our
case usually the VACUUMing of a page was lost (with freezing of all tuples). Some tuples are not marked as frozen, while
the VM has the frozen bit set for the page. Everything works just fine until someone updates a tuple on the same page: the
VM bit is reset and eventually the user will try to consult CLOG, which is already truncated.

This is why we may need to defer CLOG truncation or even move relfrozenxid back.

FWIW we coped with this by actively monitoring this kind of corruption with this amcheck patch [0]. One can observe
these lost page updates cheaply in indexes and act at the first sign of corruption: identify the source of the buggy behaviour.

Dilip, does this sound like a corruption case you are working on?

Thanks!

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/24/2254/


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Alvaro Herrera
Date:
On 2020-Jul-20, Andrey M. Borodin wrote:

> I think the point here is to actually move relfrozenxid back. But the
> mince can't be turned back. If CLOG is rotated - the table is
> corrupted beyond easy repair.

Oh, I see.  Hmm.  Well, if you discover relfrozenxid that's newer and
the pg_clog files are still there, then yeah it might make sense to move
relfrozenxid back. But it seems difficult to do correctly ... you have
to move datfrozenxid back too ... frankly, I'd rather not go there.

> I'm not sure it's Dilip's case, but I'll try to describe what I was encountering.
> 
> We were observing this kind of corruption in three cases:
> 1. With a bug in patched Linux kernel page cache we could loose FS page write

I think I've seen this too.  (Or possibly your #3, which from Postgres
POV is the same thing -- a write was claimed done but actually not
done.)

> FWIW we coped with this by actively monitoring this kind of corruption
> with this amcheck patch [0]. One can observe this lost page updates
> cheaply in indexes and act on first sight of corruption: identify
> source of the buggy behaviour.

Right.

I wish we had some way to better protect against this kind of problems,
but I don't have any ideas.  Some things can be protected against with
checksums, but if you just lose a write, there's nothing to indicate
that.  We don't have a per-page write counter, or a central repository
of per-page LSNs or checksums, and it seems very expensive to maintain
such things.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> The attached patch allows the vacuum to continue by emitting WARNING
> for the corrupted tuple instead of immediately error out as discussed
> at [1].
> 
> Basically, it provides a new GUC called vacuum_tolerate_damage, to
> control whether to continue the vacuum or to stop on the occurrence of
> a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> detected, it will emit a warning and return that nothing is changed in
> the tuple and the 'tuple_totally_frozen' will also be set to false.
> Since we are returning false the caller will not try to freeze such
> tuple and the tuple_totally_frozen is also set to false so that the
> page will not be marked to all frozen even if all other tuples in the
> page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will
just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid
getting *further* out of sync with the actual table contents. And you
cannot just freeze such tuples, because that has a good chance of making
deleted tuples suddenly visible, leading to unique constraint violations
etc. Which will then subsequently lead to clog lookup errors and such.

At the very least you'd need to signal up that relfrozenxid/relminmxid
cannot be increased. Without that I think it's entirely unacceptable to
do this.


If we really were to do something like this the option would need to be
called vacuum_allow_making_corruption_worse or such. It needs to be
*exceedingly* clear that it will likely lead to making everything much
worse.


> @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>      frz->t_infomask = tuple->t_infomask;
>      frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple()
but not FreezeMultiXactId().

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
"Andrey M. Borodin"
Date:

> On 21 July 2020, at 00:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>
>> FWIW we coped with this by actively monitoring this kind of corruption
>> with this amcheck patch [0]. One can observe this lost page updates
>> cheaply in indexes and act on first sight of corruption: identify
>> source of the buggy behaviour.
>
> Right.
>
> I wish we had some way to better protect against this kind of problems,
> but I don't have any ideas.  Some things can be protected against with
> checksums, but if you just lose a write, there's nothing to indicate
> that.  We don't have a per-page write counter, or a central repository
> of per-page LSNs or checksums, and it seems very expensive to maintain
> such things.

If we had data checksums in another fork we could flush them on checkpoint.
These checksums could protect against lost page updates.
And it would be much easier to maintain such checksums for SLRUs.

Best regards, Andrey Borodin.


On Mon, Jul 20, 2020 at 10:14 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jul-20, Dilip Kumar wrote:
>
> > On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > So if the vacuum_tolerate_damage is set then in
> > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > > detected, it will emit a warning and return that nothing is changed in
> > > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > > Since we are returning false the caller will not try to freeze such
> > > tuple and the tuple_totally_frozen is also set to false so that the
> > > page will not be marked to all frozen even if all other tuples in the
> > > page are frozen.
>
> > Robert has mentioned at [1] that we probably should refuse to update
> > 'relfrozenxid/relminmxid' when we encounter such tuple and emit
> > WARNING instead of an error.
>
> Isn't this already happening per your description above?

As per the above description, we avoid setting the page as all-frozen.
But the vacrelstats->scanned_pages count has already been incremented
for this page.  Now, right after lazy_scan_heap, we will still update
the pg_class tuple with the new FreezeLimit and MultiXactCutoff.

>
> > I think we shall do that in some cases
> > but IMHO it's not a very good idea in all the cases.  Basically, if
> > the xmin precedes the relfrozenxid then probably we should allow to
> > update the relfrozenxid whereas if the xmin precedes cutoff xid and
> > still uncommitted then probably we might stop relfrozenxid from being
> > updated so that we can stop CLOG from getting truncated.
>
> I'm not sure I understand 100% what you're talking about here (the first
> half seems dangerous unless you misspoke), but in any case it seems a
> pointless optimization.  I mean, if the heap is corrupted, you can hope
> to complete the vacuum (which will hopefully return which *other* tuples
> are similarly corrupt) but trying to advance relfrozenxid is a lost
> cause.

I agree with your point.  I think we just need to avoid advancing the
relfrozenxid in all such cases.
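
Just to spell out what I mean, something along these lines at the end of
heap_vacuum_rel (an illustrative sketch only; "found_corruption" is a
hypothetical flag the patch would have to carry up from the freeze code):

    /*
     * If any corrupted xid/multixact was seen, pass invalid values so
     * that vac_update_relstats() leaves relfrozenxid and relminmxid
     * untouched.
     */
    if (vacrelstats->found_corruption)
    {
        new_frozen_xid = InvalidTransactionId;
        new_min_multi = InvalidMultiXactId;
    }

    vac_update_relstats(onerel, new_rel_pages, new_live_tuples,
                        new_rel_allvisible, nindexes > 0,
                        new_frozen_xid, new_min_multi, false);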

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > The attached patch allows the vacuum to continue by emitting WARNING
> > for the corrupted tuple instead of immediately error out as discussed
> > at [1].
> >
> > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > control whether to continue the vacuum or to stop on the occurrence of
> > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > detected, it will emit a warning and return that nothing is changed in
> > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > Since we are returning false the caller will not try to freeze such
> > tuple and the tuple_totally_frozen is also set to false so that the
> > page will not be marked to all frozen even if all other tuples in the
> > page are frozen.
>
> I'm extremely doubtful this is a good idea. In all likelihood this will
> just exascerbate corruption.
>
> You cannot just stop freezing tuples, that'll lead to relfrozenxid
> getting *further* out of sync with the actual table contents. And you
> cannot just freeze such tuples, because that has a good chance of making
> deleted tuples suddenly visible, leading to unique constraint violations
> etc. Which will then subsequently lead to clog lookup errors and such.

I agree with the point.  But if we keep giving the ERROR in such cases,
the situation is not any better: we are not freezing such tuples and we
also cannot advance the relfrozenxid.  So we can follow the same rule --
don't freeze those tuples and don't advance the relfrozenxid.  The only
difference is that we allow the vacuum to continue with the other tuples.

> At the very least you'd need to signal up that relfrozenxid/relminmxid
> cannot be increased. Without that I think it's entirely unacceptable to
> do this.

I agree with that point.  I was just unsure whether we should disallow
advancing the relfrozenxid in all such cases, or whether in the cases
where the xid already precedes the relfrozenxid we could allow it to
advance, since it cannot become any worse.  But, as Alvaro pointed out,
there is no point in optimizing such cases.  I will update the patch to
stop advancing the relfrozenxid if we find any corrupted xid during
tuple freezing.

> If we really were to do something like this the option would need to be
> called vacuum_allow_making_corruption_worse or such. Its need to be
> *exceedingly* clear that it will likely lead to making everything much
> worse.
>
Maybe we can clearly describe this in the documentation.

> > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >       frz->t_infomask = tuple->t_infomask;
> >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
>
> I don't think it can be right to just update heap_prepare_freeze_tuple()
> but not FreezeMultiXactId().

oh, I missed this part.  I will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > The attached patch allows the vacuum to continue by emitting WARNING
> > > for the corrupted tuple instead of immediately error out as discussed
> > > at [1].
> > >
> > > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > > control whether to continue the vacuum or to stop on the occurrence of
> > > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > > detected, it will emit a warning and return that nothing is changed in
> > > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > > Since we are returning false the caller will not try to freeze such
> > > tuple and the tuple_totally_frozen is also set to false so that the
> > > page will not be marked to all frozen even if all other tuples in the
> > > page are frozen.
> >
> > I'm extremely doubtful this is a good idea. In all likelihood this will
> > just exascerbate corruption.
> >
> > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > getting *further* out of sync with the actual table contents. And you
> > cannot just freeze such tuples, because that has a good chance of making
> > deleted tuples suddenly visible, leading to unique constraint violations
> > etc. Which will then subsequently lead to clog lookup errors and such.
>
> I agree with the point.  But, if we keep giving the ERROR in such
> cases then also the situation is not any better.  Basically, we are
> not freezing such tuple as well as we can not advance the
> relfrozenxid.  So if we follow the same rule that we don't freeze
> those tuples and also don't advance the relfrozenxid.  The only
> difference is, allow the vacuum to continue with other tuples.
>
> > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > cannot be increased. Without that I think it's entirely unacceptable to
> > do this.
>
> I agree with that point.  I was just confused that shall we disallow
> to advance the relfrozenxid in all such cases or in some cases where
> the xid already precedes the relfrozenxid, we can allow it to advance
> as it can not become any worse.  But, as Alvaro pointed out that there
> is no point in optimizing such cases.   I will update the patch to
> stop advancing the relfrozenxid if we find any corrupted xid during
> tuple freeze.
>
> > If we really were to do something like this the option would need to be
> > called vacuum_allow_making_corruption_worse or such. Its need to be
> > *exceedingly* clear that it will likely lead to making everything much
> > worse.
> >
> Maybe we can clearly describe this in the document.
>
> > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > >       frz->t_infomask = tuple->t_infomask;
> > >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> >
> > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > but not FreezeMultiXactId().
>
> oh, I missed this part.  I will fix it.

Please find the updated patch.  In this version, we don't allow the
relfrozenxid and relminmxid to advance if corruption is detected, and I
have also added the handling in FreezeMultiXactId.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > > The attached patch allows the vacuum to continue by emitting WARNING
> > > > for the corrupted tuple instead of immediately error out as discussed
> > > > at [1].
> > > >
> > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > > > control whether to continue the vacuum or to stop on the occurrence of
> > > > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > > > detected, it will emit a warning and return that nothing is changed in
> > > > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > > > Since we are returning false the caller will not try to freeze such
> > > > tuple and the tuple_totally_frozen is also set to false so that the
> > > > page will not be marked to all frozen even if all other tuples in the
> > > > page are frozen.
> > >
> > > I'm extremely doubtful this is a good idea. In all likelihood this will
> > > just exascerbate corruption.
> > >
> > > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > > getting *further* out of sync with the actual table contents. And you
> > > cannot just freeze such tuples, because that has a good chance of making
> > > deleted tuples suddenly visible, leading to unique constraint violations
> > > etc. Which will then subsequently lead to clog lookup errors and such.
> >
> > I agree with the point.  But, if we keep giving the ERROR in such
> > cases then also the situation is not any better.  Basically, we are
> > not freezing such tuple as well as we can not advance the
> > relfrozenxid.  So if we follow the same rule that we don't freeze
> > those tuples and also don't advance the relfrozenxid.  The only
> > difference is, allow the vacuum to continue with other tuples.
> >
> > > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > > cannot be increased. Without that I think it's entirely unacceptable to
> > > do this.
> >
> > I agree with that point.  I was just confused that shall we disallow
> > to advance the relfrozenxid in all such cases or in some cases where
> > the xid already precedes the relfrozenxid, we can allow it to advance
> > as it can not become any worse.  But, as Alvaro pointed out that there
> > is no point in optimizing such cases.   I will update the patch to
> > stop advancing the relfrozenxid if we find any corrupted xid during
> > tuple freeze.
> >
> > > If we really were to do something like this the option would need to be
> > > called vacuum_allow_making_corruption_worse or such. Its need to be
> > > *exceedingly* clear that it will likely lead to making everything much
> > > worse.
> > >
> > Maybe we can clearly describe this in the document.
> >
> > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > > >       frz->t_infomask = tuple->t_infomask;
> > > >       frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> > >
> > > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > > but not FreezeMultiXactId().
> >
> > oh, I missed this part.  I will fix it.
>
> Please find the updated patch.  In this version, we don't allow the
> relfrozenxid and relminmxid to advance if the corruption is detected
> and also added the handling in FreezeMultiXactId.

In the previous version, the feature was enabled for the CLUSTER/VACUUM
FULL commands as well.  In the attached patch I have enabled it only if
we are running the VACUUM command; it will not be enabled during a table
rewrite.  If we think it should be enabled for VACUUM FULL then we would
need to pass a flag from cluster_rel all the way down to
heap_freeze_tuple.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:
> I'm extremely doubtful this is a good idea. In all likelihood this will
> just exascerbate corruption.
>
> You cannot just stop freezing tuples, that'll lead to relfrozenxid
> getting *further* out of sync with the actual table contents. And you
> cannot just freeze such tuples, because that has a good chance of making
> deleted tuples suddenly visible, leading to unique constraint violations
> etc. Which will then subsequently lead to clog lookup errors and such.

I think that the behavior ought to be:

- If we encounter any damaged tuples (e.g. tuple xid < relfrozenxid),
we give up on advancing relfrozenxid and relminmxid. This vacuum won't
change them at all.

- We do nothing to the damaged tuples themselves.

- We can still prune pages, and we can still freeze tuples that do not
appear to be damaged.

This amounts to an assumption that relfrozenxid is probably sane, and
that there are individual tuples that are messed up. It's probably not
the right thing if relfrozenxid got overwritten with a nonsense value
without changing the table contents. But, I think it's difficult to
cater to all contingencies. In my experience, the normal problem here
is that there are a few tuples or pages in the table that somehow
escaped vacuuming for long enough that they contain references to XIDs
from before the last time relfrozenxid was advanced - so continuing to
do what we can to the rest of the table is the right thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> In the previous version, the feature was enabled for cluster/vacuum
> full command as well.   in the attached patch I have enabled it only
> if we are running vacuum command.  It will not be enabled during a
> table rewrite.  If we think that it should be enabled for the 'vacuum
> full' then we might need to pass a flag from the cluster_rel, all the
> way down to the heap_freeze_tuple.

I think this is a somewhat clunky way of accomplishing this. The
caller passes down a flag to heap_prepare_freeze_tuple() which decides
whether or not an error is forced, and then that function and
FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
that flag with the value of the vacuum_tolerate_damage GUC. But that
means that a decision that could be made in one place is instead made
in many places. If we just had heap_freeze_tuple() and
FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
arrange to pass WARNING or ERROR based on the value of
vacuum_tolerate_damage. I think that would likely end up cleaner. What
do you think?
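
Something like this, just to illustrate the calling convention I have in
mind (hypothetical signatures; the last two arguments are the ones the
patch would add, and the surrounding code is simplified):

    /* lazy_scan_heap() decides the severity once, up front */
    int     vacuum_damage_elevel = vacuum_tolerate_damage ? WARNING : ERROR;

    if (heap_prepare_freeze_tuple(tuple.t_data, relfrozenxid, relminmxid,
                                  FreezeLimit, MultiXactCutoff,
                                  &frozen[nfrozen], &tuple_totally_frozen,
                                  vacuum_damage_elevel, &found_corruption))
        frozen[nfrozen++].offset = offnum;

    /* heap_freeze_tuple(), used by table rewrites, would simply pass ERROR */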

I also suggest renaming is_corrupted_xid to found_corruption. With the
current name, it's not very clear which XID we're saying is corrupted;
in fact, the problem might be a MultiXactId rather than an XID, and
the real issue might be with the table's pg_class entry or something.

The new arguments to heap_prepare_freeze_tuple() need to be documented
in its header comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:
> If we really were to do something like this the option would need to be
> called vacuum_allow_making_corruption_worse or such. Its need to be
> *exceedingly* clear that it will likely lead to making everything much
> worse.

I don't really understand this objection. How does letting VACUUM
continue after problems have been detected make anything worse? I
agree that if it does, it shouldn't touch relfrozenxid or relminmxid,
but the patch has been adjusted to work that way. Assuming you don't
touch relfrozenxid or relminmxid, what harm is done if you continue
freezing undamaged tuples and continue removing dead tuples after
finding a bad tuple? You may have already done an arbitrary amount of
that before encountering the damage, and doing it afterward is no
different. Doing the index vacuuming step is different, but I don't
see how that would exacerbate corruption either.

The point is that when you make VACUUM fail, you not only don't
advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
the long run, either thing will kill you, but it is not difficult to
have a situation where failing to remove dead tuples kills you a lot
faster. The table can just bloat until performance tanks, and then the
application goes down, even if you still had 100+ million XIDs before
you needed a wraparound vacuum.

Honestly, I wonder why continuing (but without advancing relfrozenxid
or relminmxid) shouldn't be the default behavior. I mean, if it
actually corrupts your data, then it clearly shouldn't be, and
probably shouldn't even be an optional behavior, but I still don't see
why it would do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-08-28 12:37:17 -0400, Robert Haas wrote:
> On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:
> > If we really were to do something like this the option would need to be
> > called vacuum_allow_making_corruption_worse or such. Its need to be
> > *exceedingly* clear that it will likely lead to making everything much
> > worse.
> 
> I don't really understand this objection. How does letting VACUUM
> continue after problems have been detected make anything worse?

It can break HOT chains, plain ctid chains etc, for example. Which, if
earlier / follower tuples are removed can't be detected anymore at a
later time.


> The point is that when you make VACUUM fail, you not only don't
> advance relfrozenxid/relminmxid, but also don't remove dead tuples. In
> the long run, either thing will kill you, but it is not difficult to
> have a situation where failing to remove dead tuples kills you a lot
> faster. The table can just bloat until performance tanks, and then the
> application goes down, even if you still had 100+ million XIDs before
> you needed a wraparound vacuum.
> 
> Honestly, I wonder why continuing (but without advancing relfrozenxid
> or relminmxid) shouldn't be the default behavior. I mean, if it
> actually corrupts your data, then it clearly shouldn't be, and
> probably shouldn't even be an optional behavior, but I still don't see
> why it would do that.

I think it's an EXTREMELY bad idea to enable anything like this by
default. It'll make bugs entirely undiagnosable, because we'll remove a
lot of the evidence of what the problem is. And we've had many long
standing bugs in this area, several only found because we actually
started to bleat about them. And quite evidently, we have more bugs to
fix in the area.

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> It can break HOT chains, plain ctid chains etc, for example. Which, if
> earlier / follower tuples are removed can't be detected anymore at a
> later time.

I think I need a more specific example here to understand the problem.
If the xmax of one tuple matches the xmin of the next, then either
both values precede relfrozenxid or both follow it. In the former
case, neither tuple should be frozen and the chain should not get
broken; in the latter case, everything's normal anyway. If the xmax
and xmin don't match, then the chain was already broken. Perhaps we
are removing important evidence, though it seems like that might've
happened anyway prior to reaching the damaged page, but we're not
making whatever corruption may exist any worse. At least, not as far
as I can see.

> And we've had many long
> standing bugs in this area, several only found because we actually
> started to bleat about them. And quite evidently, we have more bugs to
> fix in the area.

I agree with all of this, but I do not think that it establishes that
we should abandon the entire VACUUM. "Bleating" about something
usually means logging it, and I think you understand that I am not now
nor have I ever complained about the logging we are doing here. I also
think you understand why I don't like the current behavior, and that
EDB has actual customers who have actually been damaged by it. All the
same, I don't expect to win an argument about changing the default,
but I hope to win one about at least providing an option. And if we're
not even going to do that much, then I hope to come out of this
discussion with a clear understanding of exactly why that's a bad
idea. I don't think "we need the data for forensics" is a sufficient
justification for "if you end up with one corrupted XID in a
billion-row table, your entire table will bloat out the wazoo, and
there is no option to get any other behavior."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Dilip Kumar
Date:
On Fri, Aug 28, 2020 at 9:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In the previous version, the feature was enabled for cluster/vacuum
> > full command as well.   in the attached patch I have enabled it only
> > if we are running vacuum command.  It will not be enabled during a
> > table rewrite.  If we think that it should be enabled for the 'vacuum
> > full' then we might need to pass a flag from the cluster_rel, all the
> > way down to the heap_freeze_tuple.
>
> I think this is a somewhat clunky way of accomplishing this. The
> caller passes down a flag to heap_prepare_freeze_tuple() which decides
> whether or not an error is forced, and then that function and
> FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
> that flag with the value of the vacuum_tolerate_damage GUC. But that
> means that a decision that could be made in one place is instead made
> in many places. If we just had heap_freeze_tuple() and
> FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
> heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
> arrange to pass WARNING or ERROR based on the value of
> vacuum_tolerate_damage. I think that would likely end up cleaner. What
> do you think?

I agree this way it is much more cleaner.  I have changed in the attached patch.

> I also suggest renaming is_corrupted_xid to found_corruption. With the
> current name, it's not very clear which XID we're saying is corrupted;
> in fact, the problem might be a MultiXactId rather than an XID, and
> the real issue might be with the table's pg_class entry or something.

Okay, changed to found_corruption.

> The new arguments to heap_prepare_freeze_tuple() need to be documented
> in its header comment.

Done.

I have also done a few more cosmetic changes to the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Dilip Kumar
Date:
On Sat, Aug 29, 2020 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> > It can break HOT chains, plain ctid chains etc, for example. Which, if
> > earlier / follower tuples are removed can't be detected anymore at a
> > later time.
>
> I think I need a more specific example here to understand the problem.
> If the xmax of one tuple matches the xmin of the next, then either
> both values precede relfrozenxid or both follow it. In the former
> case, neither tuple should be frozen and the chain should not get
> broken; in the latter case, everything's normal anyway. If the xmax
> and xmin don't match, then the chain was already broken. Perhaps we
> are removing important evidence, though it seems like that might've
> happened anyway prior to reaching the damaged page, but we're not
> making whatever corruption may exist any worse. At least, not as far
> as I can see.

One example: suppose during vacuum there are 2 tuples in the HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller
than relfrozenxid), while the xmax of this tuple (which is the same as
the xmin of the second tuple) is smaller than the cutoff_xid when we
try to freeze the tuple.  As a result, it will freeze the second tuple
but leave the first tuple untouched.

Now, if we then come to heap_hot_search_buffer, the xmax of the
first tuple will not match the xmin of the second tuple, as we have
frozen the second tuple.  But I feel this is easily fixable, right?  I
mean, instead of doing nothing at all to the corrupted tuple we could
partially freeze it: leave the corrupted xid alone but mark the other
xid as frozen if it is smaller than the cutoff_xid.
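
Roughly like this (an illustrative sketch only; the real code would of
course also have to check commit/in-progress status and handle multixacts
before freezing the sane xid):

    /* xmin and xmax are examined independently */
    if (TransactionIdPrecedes(xmin, relfrozenxid))
        found_corruption = true;            /* warn; leave corrupted xmin alone */
    else if (TransactionIdPrecedes(xmin, cutoff_xid))
        frz->t_infomask |= HEAP_XMIN_FROZEN;

    if (TransactionIdPrecedes(xmax, relfrozenxid))
        found_corruption = true;            /* warn; leave corrupted xmax alone */
    else if (TransactionIdPrecedes(xmax, cutoff_xid))
    {
        frz->xmax = InvalidTransactionId;   /* "freeze" a sane old xmax */
        frz->t_infomask |= HEAP_XMAX_INVALID;
    }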

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> One example is,  suppose during vacuum, there are 2 tuples in the hot
> chain,  and the xmin of the first tuple is corrupted (i.e. smaller
> than relfrozenxid).  And the xmax of this tuple (which is same as the
> xmin of the second tuple) is smaller than the cutoff_xid while trying
> to freeze the tuple.  As a result, it will freeze the second tuple but
> the first tuple will be left untouched.
>
> Now,  if we come for the heap_hot_search_buffer, then the xmax of the
> first tuple will not match the xmin of the second tuple as we have
> frozen the second tuple.  But, I feel this is easily fixable right? I
> mean instead of not doing anything to the corrupted tuple we can
> partially freeze it?  I mean we can just leave the corrupted xid alone
> but mark the other xid as frozen if that is smaller then cutoff_xid.

That seems reasonable to me. Andres, what do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-09-14 13:26:27 -0400, Robert Haas wrote:
> On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > One example is,  suppose during vacuum, there are 2 tuples in the hot
> > chain,  and the xmin of the first tuple is corrupted (i.e. smaller
> > than relfrozenxid).  And the xmax of this tuple (which is same as the
> > xmin of the second tuple) is smaller than the cutoff_xid while trying
> > to freeze the tuple.  As a result, it will freeze the second tuple but
> > the first tuple will be left untouched.
> >
> > Now,  if we come for the heap_hot_search_buffer, then the xmax of the
> > first tuple will not match the xmin of the second tuple as we have
> > frozen the second tuple.  But, I feel this is easily fixable right? I
> > mean instead of not doing anything to the corrupted tuple we can
> > partially freeze it?  I mean we can just leave the corrupted xid alone
> > but mark the other xid as frozen if that is smaller then cutoff_xid.
> 
> That seems reasonable to me. Andres, what do you think?

It seems pretty dangerous to me. What exactly are you going to put into
xmin/xmax here? And how would anything you put into the first tuple not
break index lookups? There's no such thing as a frozen xmax (so far), so
what are you going to put in there? A random different xid?
FrozenTransactionId? HEAP_XMAX_INVALID?

This whole approach just seems likely to exacerbate corruption while
also making it impossible to debug. That's OK enough if it's an explicit
user action, but doing it based on a config variable setting seems
absurdly dangerous to me.

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Alvaro Herrera
Date:
On 2020-Sep-14, Andres Freund wrote:

> It seems pretty dangerous to me. What exactly are you going to put into
> xmin/xmax here? And how would anything you put into the first tuple not
> break index lookups? There's no such thing as a frozen xmax (so far), so
> what are you going to put in there? A random different xid?
> FrozenTransactionId? HEAP_XMAX_INVALID?
> 
> This whole approach just seems likely to exascerbate corruption while
> also making it impossible to debug. That's ok enough if it's an explicit
> user action, but doing it based on a config variable setting seems
> absurdly dangerous to me.

FWIW I agree with Andres' stance on this.  The current system is *very*
complicated and bugs are obscure already.  If we hide them, what we'll
be getting is a system where data can become corrupted for no apparent
reason.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> FWIW I agree with Andres' stance on this.  The current system is *very*
> complicated and bugs are obscure already.  If we hide them, what we'll
> be getting is a system where data can become corrupted for no apparent
> reason.

I think I might have to give up on this proposal given the level of
opposition to it, but the nature of the opposition doesn't make any
sense to me on a technical level. Suppose a tuple with tid A has been
updated, producing a new version at tid B. The argument that is now
being offered is that if A has been found to be corrupt then we'd
better stop vacuuming the table altogether lest we reach B and vacuum
it too, further corrupting the table and destroying forensic evidence.
But even ignoring the fact that many users want to get the database
running again more than they want to do forensics, it's entirely
possible that B < A, in which case the damage has already been done.
Therefore, I can't see any argument that this patch creates any
scenario that can't happen already. It seems entirely reasonable to me
to say, as a review comment, hey, you haven't sufficiently considered
this particular scenario, that still needs work. But the argument here
is much more about whether this is a reasonable thing to do in general
and under any circumstances, and it feels to me like you guys are
saying "no" without offering any really convincing evidence that there
are unfixable problems here. IOW, I agree that having a GUC
corrupt_my_tables_more=true is not a reasonable thing, but I disagree
that the proposal on the table is tantamount to that.

The big picture here is that people have terabyte-scale tables, 1 or 2
tuples get corrupted, and right now the only real fix is to dump and
restore the whole table, which leads to prolonged downtime. The
pg_surgery stuff should help with that, and the work to make VACUUM
report the exact TID will also help, and if we can get the heapcheck
stuff Mark Dilger is working on committed, that will provide an
alternative and probably better way of finding this kind of
corruption, which is all to the good. However, I disagree with the
idea that a typical user who has a 2TB with one corrupted tuple on
page 0 probably wants VACUUM to fail over and over again, letting the
table bloat like crazy, instead of bleating loudly but still vacuuming
the other 0.999999% of the table. I mean, somebody probably wants
that, and that's fine. But I have a hard time imagining it as a
typical view. Am I just lacking in imagination?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-09-14 15:50:49 -0400, Robert Haas wrote:
> On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > FWIW I agree with Andres' stance on this.  The current system is *very*
> > complicated and bugs are obscure already.  If we hide them, what we'll
> > be getting is a system where data can become corrupted for no apparent
> > reason.
> 
> I think I might have to give up on this proposal given the level of
> opposition to it, but the nature of the opposition doesn't make any
> sense to me on a technical level. Suppose a tuple with tid A has been
> updated, producing a new version at tid B. The argument that is now
> being offered is that if A has been found to be corrupt then we'd
> better stop vacuuming the table altogether lest we reach B and vacuum
> it too, further corrupting the table and destroying forensic evidence.
> But even ignoring the fact that many users want to get the database
> running again more than they want to do forensics, it's entirely
> possible that B < A, in which case the damage has already been done.

My understanding of the case we're discussing is that it's corruption
(e.g. relfrozenxid being different than table contents) affecting a HOT
chain. I.e. by definition all within a single page.  We won't have
modified part of it independent of B < A, because freezing is
all-or-nothing.  Just breaking the HOT chain into two or something like
that will just make things worse, because indexes won't find tuples, and
because reindexing might then get confused e.g. by HOT chains without a
valid start, or by having two visible tuples for the same PK.


> But even ignoring the fact that many users want to get the database
> running again more than they want to do forensics

The user isn't always right. And I am not objecting against providing a
tool to get things running. I'm objecting to VACUUM doing so, especially
when it's a system wide config option triggering that behaviour.


> Therefore, I can't see any argument that this patch creates any
> scenario that can't happen already. It seems entirely reasonable to me
> to say, as a review comment, hey, you haven't sufficiently considered
> this particular scenario, that still needs work. But the argument here
> is much more about whether this is a reasonable thing to do in general
> and under any circumstances, and it feels to me like you guys are
> saying "no" without offering any really convincing evidence that there
> are unfixable problems here.

I don't think that's quite the calculation. You're suggesting to make
already really complicated and failure prone code even more complicated
by adding heuristic error recovery to it. That has substantial cost,
even if we were to get it perfectly right (which I don't believe we
will).


> The big picture here is that people have terabyte-scale tables, 1 or 2
> tuples get corrupted, and right now the only real fix is to dump and
> restore the whole table, which leads to prolonged downtime. The
> pg_surgery stuff should help with that, and the work to make VACUUM
> report the exact TID will also help, and if we can get the heapcheck
> stuff Mark Dilger is working on committed, that will provide an
> alternative and probably better way of finding this kind of
> corruption, which is all to the good.

Agreed.


> However, I disagree with the idea that a typical user who has a 2TB
> with one corrupted tuple on page 0 probably wants VACUUM to fail over
> and over again, letting the table bloat like crazy, instead of
> bleating loudly but still vacuuming the other 0.999999% of the
> table. I mean, somebody probably wants that, and that's fine. But I
> have a hard time imagining it as a typical view. Am I just lacking in
> imagination?

I know that that kind of user exists, but yea, I disagree extremely
strongly that that's a reasonable thing that the majority of users
want. And I don't think that that's something we should encourage. Those
cases indicate that either postgres has a bug, or their storage / memory
/ procedures have an issue. Reacting by making it harder to diagnose is
just a bad idea.

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> My understanding of the case we're discussing is that it's corruption
> (e.g. relfrozenxid being different than table contents) affecting a HOT
> chain. I.e. by definition all within a single page.  We won't have
> modified part of it independent of B < A, because freezing is
> all-or-nothing.  Just breaking the HOT chain into two or something like
> that will just make things worse, because indexes won't find tuples, and
> because reindexing might then get confused e.g. by HOT chains without a
> valid start, or by having two visible tuples for the same PK.

If we adopt the proposal made by Dilip, we will not do that. We must
have a.xmax = b.xmin, and that value is either less than relfrozenxid
or it is not. If we skip an entire tuple because one XID is bad, then
we could break the HOT chain when a.xmin is bad and the remaining
values are OK. But if we decide separately for xmin and xmax then we
should be alright. Alternately, if we're only concerned about HOT
chains, we could skip the entire page if any tuple on the page shows
evidence of damage.

> I don't think that's quite the calculation. You're suggesting to make
> already really complicated and failure prone code even more complicated
> by adding heuristic error recovery to it. That has substantial cost,
> even if we were to get it perfectly right (which I don't believe we
> will).

That's a legitimate concern, but I think it would make more sense to
first make the design as good as we can and then decide whether it's
adequate than to decide ab initio that there's no way to make it good
enough.

> > However, I disagree with the idea that a typical user who has a 2TB
> > with one corrupted tuple on page 0 probably wants VACUUM to fail over
> > and over again, letting the table bloat like crazy, instead of
> > bleating loudly but still vacuuming the other 0.999999% of the
> > table. I mean, somebody probably wants that, and that's fine. But I
> > have a hard time imagining it as a typical view. Am I just lacking in
> > imagination?
>
> I know that that kind of user exists, but yea, I disagree extremely
> strongly that that's a reasonable thing that the majority of users
> want. And I don't think that that's something we should encourage. Those
> cases indicate that either postgres has a bug, or their storage / memory
> / procedures have an issue. Reacting by making it harder to diagnose is
> just a bad idea.

Well, the people I tend to deal with are not going to let me conduct a
lengthy investigation almost no matter what, and the more severe the
operational consequences of the problem are, the less likely it is
that I'm going to have time to figure anything out. Being able to
create some kind of breathing room is pretty valuable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-09-14 17:00:48 -0400, Robert Haas wrote:
> On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> > My understanding of the case we're discussing is that it's corruption
> > (e.g. relfrozenxid being different than table contents) affecting a HOT
> > chain. I.e. by definition all within a single page.  We won't have
> > modified part of it independent of B < A, because freezing is
> > all-or-nothing.  Just breaking the HOT chain into two or something like
> > that will just make things worse, because indexes won't find tuples, and
> > because reindexing might then get confused e.g. by HOT chains without a
> > valid start, or by having two visible tuples for the same PK.
> 
> If we adopt the proposal made by Dilip, we will not do that. We must
> have a.xmax = b.xmin, and that value is either less than relfrozenxid
> or it is not. If we skip an entire tuple because one XID is bad, then
> we could break the HOT chain when a.xmin is bad and the remaining
> values are OK. But if we decide separately for xmin and xmax then we
> should be alright.

I thought I precisely addressed this case:

> What exactly are you going to put into xmin/xmax here? And how would
> anything you put into the first tuple not break index lookups? There's
> no such thing as a frozen xmax (so far), so what are you going to put
> in there? A random different xid?  FrozenTransactionId?
> HEAP_XMAX_INVALID?

What am I missing?

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Dilip Kumar
Date:
On Tue, Sep 15, 2020 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-14 17:00:48 -0400, Robert Haas wrote:
> > On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > My understanding of the case we're discussing is that it's corruption
> > > (e.g. relfrozenxid being different than table contents) affecting a HOT
> > > chain. I.e. by definition all within a single page.  We won't have
> > > modified part of it independent of B < A, because freezing is
> > > all-or-nothing.  Just breaking the HOT chain into two or something like
> > > that will just make things worse, because indexes won't find tuples, and
> > > because reindexing might then get confused e.g. by HOT chains without a
> > > valid start, or by having two visible tuples for the same PK.
> >
> > If we adopt the proposal made by Dilip, we will not do that. We must
> > have a.xmax = b.xmin, and that value is either less than relfrozenxid
> > or it is not. If we skip an entire tuple because one XID is bad, then
> > we could break the HOT chain when a.xmin is bad and the remaining
> > values are OK. But if we decide separately for xmin and xmax then we
> > should be alright.
>
> I thought I precisely addressed this case:
>
> > What exactly are you going to put into xmin/xmax here? And how would
> > anything you put into the first tuple not break index lookups? There's
> > no such thing as a frozen xmax (so far), so what are you going to put
> > in there? A random different xid?  FrozenTransactionId?
> > HEAP_XMAX_INVALID?
>
> What am I missing?

What problem do you see if we set xmax to the InvalidTransactionId and
HEAP_XMAX_INVALID flag in the infomask ?  I mean, even now, if the xmax
is older than the cutoff xid we do the same thing, i.e.
if (freeze_xmax)
{
    ..
    frz->xmax = InvalidTransactionId;
    ..
    frz->t_infomask &= ~HEAP_XMAX_BITS;
    frz->t_infomask |= HEAP_XMAX_INVALID;
    frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
    frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
    changed = true;
}

So if we do that, the tuple will not be part of the HOT chain anymore.  I
might be missing something, but I could not see how this can be more
broken than what we have without the change.  I agree that in the case of
a corrupted xmin we would now mark the tuple with HEAP_XMAX_INVALID
without freezing the xmin, but that is anyway a valid status for a tuple.
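
Just to make the shape of it concrete, the following is only a
standalone toy with made-up flag values, not the actual heapam
definitions and not the patch itself; it shows the idea of warning
and then clearing a damaged xmax the same way the freeze_xmax branch
quoted above does:

#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins, not the real infomask bits. */
#define TOY_INVALID_XID   0
#define TOY_XMAX_INVALID  0x0800
#define TOY_XMAX_BITS     0x1FC0
#define TOY_HOT_UPDATED   0x4000

typedef struct ToyTuple
{
    uint32_t xmax;
    uint16_t infomask;
    uint16_t infomask2;
} ToyTuple;

/* Where the ERROR fires today; the idea is to warn instead and then
 * perform the same clearing the freeze_xmax branch performs. */
static void
toy_handle_damaged_xmax(ToyTuple *tup, uint32_t relfrozenxid)
{
    if (tup->xmax != TOY_INVALID_XID && tup->xmax < relfrozenxid)
    {
        fprintf(stderr, "WARNING: xmax %u precedes relfrozenxid %u\n",
                (unsigned) tup->xmax, (unsigned) relfrozenxid);
        tup->xmax = TOY_INVALID_XID;
        tup->infomask &= ~TOY_XMAX_BITS;
        tup->infomask |= TOY_XMAX_INVALID;
        tup->infomask2 &= ~TOY_HOT_UPDATED;
    }
}

int
main(void)
{
    ToyTuple tup = { .xmax = 42, .infomask = 0, .infomask2 = TOY_HOT_UPDATED };

    toy_handle_damaged_xmax(&tup, 1000);
    printf("xmax=%u infomask=0x%04x infomask2=0x%04x\n",
           (unsigned) tup.xmax, (unsigned) tup.infomask,
           (unsigned) tup.infomask2);
    return 0;
}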

However, if we think it still can cause some issues then I feel that
we can skip the whole page as Robert suggested.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> What problem do you see if we set xmax to the InvalidTransactionId and
> HEAP_XMAX_INVALID flag in the infomask ?

1) It'll make a dead tuple appear live. You cannot do this for tuples
   with an xid below the horizon.
2) it'll break HOT chain following / indexes.



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Dilip Kumar
Date:
On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> > What problem do you see if we set xmax to the InvalidTransactionId and
> > HEAP_XMAX_INVALID flag in the infomask ?
>
> 1) It'll make a dead tuple appear live. You cannot do this for tuples
>    with an xid below the horizon.

How is it possible?  A tuple whose committed xmax is older than the
oldestXmin should not come up for freezing unless the xmax is a
lock_only xid (because such tuples are already gone).  So if the xmax
is smaller than the cutoff xid, then either it is lock_only or it is
aborted.  If the xmax is lock_only then I don't see any issue; OTOH,
if it is an aborted xid that is already smaller than the cutoff xid,
then the tuple is anyway live.

>2) it'll break HOT chain following / indexes.

If my above theory in point 1 is correct then I don't see this issue as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Andres Freund
Date:
Hi,

On 2020-09-15 12:52:25 +0530, Dilip Kumar wrote:
> On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> > > What problem do you see if we set xmax to the InvalidTransactionId and
> > > HEAP_XMAX_INVALID flag in the infomask ?
> >
> > 1) It'll make a dead tuple appear live. You cannot do this for tuples
> >    with an xid below the horizon.
> 
> How is it possible?  A tuple whose committed xmax is older than the
> oldestXmin should not come up for freezing unless the xmax is a
> lock_only xid (because such tuples are already gone).

There've been several cases of this in the past. A fairly easy way is a
corrupted relfrozenxid (of which there are many examples).

You simply cannot just assume that everything is OK and argue that
that's why it's ok to fix data corruption in some approximate manner. By
definition everything *is not ok* if you ever come here.

Greetings,

Andres Freund



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

From
Robert Haas
Date:
On Tue, Sep 15, 2020 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> > How is it possible?  A tuple whose committed xmax is older than the
> > oldestXmin should not come up for freezing unless the xmax is a
> > lock_only xid (because such tuples are already gone).
>
> There've been several cases of this in the past. A fairly easy way is a
> corrupted relfrozenxid (of which there are many examples).

Hmm, so is the case you're worried about here the case where the
freezing threshold is greater than the pruning threshold? I.e., the
relfrozenxid has been updated to a value greater than the xmin we
derive from the procarray?

If that's not the case, then I don't see what problem there can be
here. To reach heap_prepare_freeze_tuple the tuple has to survive
pruning. If xmin < freezing-threshold and freezing-threshold <
pruning-threshold and the tuple survived pruning, then xmin must be a
committed transaction visible to everyone so setting xmin to
FrozenTransactionId is fine. If xmax < freezing-threshold and
freezing-threshold < pruning-threshold and the tuple survived pruning,
xmax must be visible to everyone and can't be running, so it must have
aborted, and setting xmax to InvalidTransactionId is fine.

On the other hand if, somehow, freezing-threshold > pruning-threshold,
then freezing seems categorically unsafe. Doing so would change
visibility decisions of transactions that are still running, or that
were running at the time when we computed the pruning threshold. The
sanity checks in heap_prepare_freeze_tuple() seem like they would
catch many such cases, but I'm not sure they're all water-tight. It
might be better to skip calling heap_prepare_freeze_tuple() altogether
if the freezing threshold does not precede the pruning threshold.
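
To illustrate that last suggestion, here is a tiny standalone model
(toy names and a naive comparison; the real code would compare the
cutoffs vacuum actually computes, using wraparound-aware XID
comparisons):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */

/* Attempt freezing only when the freeze threshold is at or before the
 * pruning threshold; otherwise skip it, per the suggestion above. */
static bool
freezing_threshold_is_sane(ToyXid freeze_limit, ToyXid oldest_xmin)
{
    return freeze_limit <= oldest_xmin;
}

int
main(void)
{
    ToyXid oldest_xmin = 5000;      /* pruning threshold */
    ToyXid freeze_limit = 4000;     /* freezing threshold */

    if (freezing_threshold_is_sane(freeze_limit, oldest_xmin))
        printf("ok to call heap_prepare_freeze_tuple()\n");
    else
        printf("skip freezing for this relation\n");
    return 0;
}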

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company