Thread: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
The attached patch allows the vacuum to continue by emitting a WARNING for the corrupted tuple instead of immediately erroring out, as discussed at [1].

Basically, it provides a new GUC called vacuum_tolerate_damage to control whether the vacuum continues or stops when it encounters a corrupted tuple. If vacuum_tolerate_damage is set, then in all the cases in heap_prepare_freeze_tuple where a corrupted xid is detected, it will emit a warning and return that nothing was changed in the tuple, and 'tuple_totally_frozen' will also be set to false. Since we return false, the caller will not try to freeze such a tuple, and because tuple_totally_frozen is also false, the page will not be marked all-frozen even if every other tuple on the page is frozen.

Alternatively, we could try to freeze the XIDs in the tuple that are not corrupted, but I don't think we would gain anything from that: if one of xmin or xmax is wrong, the next vacuum will hit the same WARNING or ERROR anyway. Is there any other opinion on this?

[1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
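For illustration, a minimal sketch of the kind of check being described (this is not the actual patch; it assumes a boolean GUC named vacuum_tolerate_damage and abbreviates the surrounding code of heap_prepare_freeze_tuple()):

    /* Sketch only: abbreviated xmin handling in heap_prepare_freeze_tuple() */
    xid = HeapTupleHeaderGetXmin(tuple);
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, relfrozenxid))
    {
        if (vacuum_tolerate_damage)      /* proposed GUC */
        {
            ereport(WARNING,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("found xmin %u from before relfrozenxid %u",
                            xid, relfrozenxid)));
            *totally_frozen_p = false;   /* page must not be marked all-frozen */
            return false;                /* caller will not freeze this tuple */
        }
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmin %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));
    }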
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: "Andrey M. Borodin"
Hi Dilip!

> On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The attached patch allows the vacuum to continue by emitting WARNING for the corrupted tuple instead of immediately error out as discussed at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage, to control whether to continue the vacuum or to stop on the occurrence of a corrupted tuple. So if the vacuum_tolerate_damage is set then in all the cases in heap_prepare_freeze_tuple where the corrupted xid is detected, it will emit a warning and return that nothing is changed in the tuple and the 'tuple_totally_frozen' will also be set to false. Since we are returning false the caller will not try to freeze such tuple and the tuple_totally_frozen is also set to false so that the page will not be marked to all frozen even if all other tuples in the page are frozen.
>
> Alternatively, we can try to freeze other XIDs in the tuple which is not corrupted but I don't think we will gain anything from this, because if one of the xmin or xmax is wrong then next time also if we run the vacuum then we are going to get the same WARNING or the ERROR. Is there any other opinion on this?

FWIW, AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0]. So +1 from me.

But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will, probably, have to deal with absent CLOG. I think this GUC is only a part of an incomplete solution.

Personally I'd be happy if this were backported - our recovery kit would be much smaller. But that does not seem like a valid reason.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c#L443
On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > The attached patch allows the vacuum to continue by emitting WARNING > for the corrupted tuple instead of immediately error out as discussed > at [1]. > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > control whether to continue the vacuum or to stop on the occurrence of > a corrupted tuple. So if the vacuum_tolerate_damage is set then in > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > detected, it will emit a warning and return that nothing is changed in > the tuple and the 'tuple_totally_frozen' will also be set to false. > Since we are returning false the caller will not try to freeze such > tuple and the tuple_totally_frozen is also set to false so that the > page will not be marked to all frozen even if all other tuples in the > page are frozen. > > Alternatively, we can try to freeze other XIDs in the tuple which is > not corrupted but I don't think we will gain anything from this, > because if one of the xmin or xmax is wrong then next time also if we > run the vacuum then we are going to get the same WARNING or the ERROR. > Is there any other opinion on this? Robert has mentioned at [1] that we probably should refuse to update 'relfrozenxid/relminmxid' when we encounter such tuple and emit WARNING instead of an error. I think we shall do that in some cases but IMHO it's not a very good idea in all the cases. Basically, if the xmin precedes the relfrozenxid then probably we should allow to update the relfrozenxid whereas if the xmin precedes cutoff xid and still uncommitted then probably we might stop relfrozenxid from being updated so that we can stop CLOG from getting truncated. I will make these changes if we agree with the idea? Or we should keep it simple and never allow to update 'relfrozenxid/relminmxid' in such cases? [1] http://postgr.es/m/CA+TgmoaZwZHtFFU6NUJgEAp6adDs-qWfNOXpZGQpZMmm0VTDfg@mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Jul 19, 2020 at 4:56 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Dilip!
>
> > On 17 July 2020, at 15:46, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > The attached patch allows the vacuum to continue by emitting WARNING for the corrupted tuple instead of immediately error out as discussed at [1].
> >
> > Basically, it provides a new GUC called vacuum_tolerate_damage, to control whether to continue the vacuum or to stop on the occurrence of a corrupted tuple. So if the vacuum_tolerate_damage is set then in all the cases in heap_prepare_freeze_tuple where the corrupted xid is detected, it will emit a warning and return that nothing is changed in the tuple and the 'tuple_totally_frozen' will also be set to false. Since we are returning false the caller will not try to freeze such tuple and the tuple_totally_frozen is also set to false so that the page will not be marked to all frozen even if all other tuples in the page are frozen.
> >
> > Alternatively, we can try to freeze other XIDs in the tuple which is not corrupted but I don't think we will gain anything from this, because if one of the xmin or xmax is wrong then next time also if we run the vacuum then we are going to get the same WARNING or the ERROR. Is there any other opinion on this?
>
> FWIW AFAIK this ERROR was the reason why we had to use older versions of heap_prepare_freeze_tuple() in our recovery kit [0].
> So +1 from me.

Thanks for showing interest in this patch.

> But I do not think that just ignoring corruption here is sufficient. Soon after this freeze problem the user will, probably, have to deal with absent CLOG.
> I think this GUC is only a part of an incomplete solution.
> Personally I'd be happy if this is backported - our recovery kit would be much smaller. But this does not seem like a valid reason.

I agree that this is just solving one part of the problem, and in some cases it may not work if the CLOG itself is corrupted, i.e. does not exist for the xids which are not yet frozen.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: Alvaro Herrera
On 2020-Jul-20, Dilip Kumar wrote: > On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > So if the vacuum_tolerate_damage is set then in > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > detected, it will emit a warning and return that nothing is changed in > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > Since we are returning false the caller will not try to freeze such > > tuple and the tuple_totally_frozen is also set to false so that the > > page will not be marked to all frozen even if all other tuples in the > > page are frozen. > Robert has mentioned at [1] that we probably should refuse to update > 'relfrozenxid/relminmxid' when we encounter such tuple and emit > WARNING instead of an error. Isn't this already happening per your description above? > I think we shall do that in some cases > but IMHO it's not a very good idea in all the cases. Basically, if > the xmin precedes the relfrozenxid then probably we should allow to > update the relfrozenxid whereas if the xmin precedes cutoff xid and > still uncommitted then probably we might stop relfrozenxid from being > updated so that we can stop CLOG from getting truncated. I'm not sure I understand 100% what you're talking about here (the first half seems dangerous unless you misspoke), but in any case it seems a pointless optimization. I mean, if the heap is corrupted, you can hope to complete the vacuum (which will hopefully return which *other* tuples are similarly corrupt) but trying to advance relfrozenxid is a lost cause. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: "Andrey M. Borodin"
> On 20 July 2020, at 21:44, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> I think we shall do that in some cases but IMHO it's not a very good idea in all the cases. Basically, if the xmin precedes the relfrozenxid then probably we should allow to update the relfrozenxid whereas if the xmin precedes cutoff xid and still uncommitted then probably we might stop relfrozenxid from being updated so that we can stop CLOG from getting truncated.
>
> I'm not sure I understand 100% what you're talking about here (the first half seems dangerous unless you misspoke), but in any case it seems a pointless optimization. I mean, if the heap is corrupted, you can hope to complete the vacuum (which will hopefully return which *other* tuples are similarly corrupt) but trying to advance relfrozenxid is a lost cause.

I think the point here is to actually move relfrozenxid back. But the mince can't be turned back: if CLOG is rotated, the table is corrupted beyond easy repair.

I'm not sure it's Dilip's case, but I'll try to describe what I was encountering. We were observing this kind of corruption in three cases:
1. With a bug in a patched Linux kernel page cache, we could lose an FS page write.
2. With a bug in WAL-G block-level incremental backup, we could lose an update of the page.
3. With a firmware bug in SSD drives from one vendor, one write to the block storage device was lost.

One page in the database is of some non-latest version (but with a correct checksum; it's just an old version). In our case, usually the VACUUMing of a page was lost (including the freezing of all its tuples). Some tuples are not marked as frozen, while the VM has the frozen bit set for the page. Everything works just fine until someone updates a tuple on the same page: the VM bit is reset and eventually a user will try to consult CLOG, which is already truncated. This is why we may need to defer CLOG truncation or even move relfrozenxid back.

FWIW we coped with this by actively monitoring this kind of corruption with this amcheck patch [0]. One can observe these lost page updates cheaply in indexes and act at first sight of corruption: identify the source of the buggy behaviour.

Dilip, does this sound like the corruption case you are working on?

Thanks!

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/24/2254/
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: Alvaro Herrera
On 2020-Jul-20, Andrey M. Borodin wrote: > I think the point here is to actually move relfrozenxid back. But the > mince can't be turned back. If CLOG is rotated - the table is > corrupted beyond easy repair. Oh, I see. Hmm. Well, if you discover relfrozenxid that's newer and the pg_clog files are still there, then yeah it might make sense to move relfrozenxid back. But it seems difficult to do correctly ... you have to move datfrozenxid back too ... frankly, I'd rather not go there. > I'm not sure it's Dilip's case, but I'll try to describe what I was encountering. > > We were observing this kind of corruption in three cases: > 1. With a bug in patched Linux kernel page cache we could loose FS page write I think I've seen this too. (Or possibly your #3, which from Postgres POV is the same thing -- a write was claimed done but actually not done.) > FWIW we coped with this by actively monitoring this kind of corruption > with this amcheck patch [0]. One can observe this lost page updates > cheaply in indexes and act on first sight of corruption: identify > source of the buggy behaviour. Right. I wish we had some way to better protect against this kind of problems, but I don't have any ideas. Some things can be protected against with checksums, but if you just lose a write, there's nothing to indicate that. We don't have a per-page write counter, or a central repository of per-page LSNs or checksums, and it seems very expensive to maintain such things. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> The attached patch allows the vacuum to continue by emitting WARNING for the corrupted tuple instead of immediately error out as discussed at [1].
>
> Basically, it provides a new GUC called vacuum_tolerate_damage, to control whether to continue the vacuum or to stop on the occurrence of a corrupted tuple. So if the vacuum_tolerate_damage is set then in all the cases in heap_prepare_freeze_tuple where the corrupted xid is detected, it will emit a warning and return that nothing is changed in the tuple and the 'tuple_totally_frozen' will also be set to false. Since we are returning false the caller will not try to freeze such tuple and the tuple_totally_frozen is also set to false so that the page will not be marked to all frozen even if all other tuples in the page are frozen.

I'm extremely doubtful this is a good idea. In all likelihood this will just exacerbate corruption.

You cannot just stop freezing tuples, that'll lead to relfrozenxid getting *further* out of sync with the actual table contents. And you cannot just freeze such tuples, because that has a good chance of making deleted tuples suddenly visible, leading to unique constraint violations etc. Which will then subsequently lead to clog lookup errors and such.

At the very least you'd need to signal up that relfrozenxid/relminmxid cannot be increased. Without that I think it's entirely unacceptable to do this.

If we really were to do something like this the option would need to be called vacuum_allow_making_corruption_worse or such. It needs to be *exceedingly* clear that it will likely lead to making everything much worse.

> @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>     frz->t_infomask = tuple->t_infomask;
>     frz->xmax = HeapTupleHeaderGetRawXmax(tuple);

I don't think it can be right to just update heap_prepare_freeze_tuple() but not FreezeMultiXactId().

Greetings,

Andres Freund
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: "Andrey M. Borodin"
> On 21 July 2020, at 00:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> FWIW we coped with this by actively monitoring this kind of corruption with this amcheck patch [0]. One can observe this lost page updates cheaply in indexes and act on first sight of corruption: identify source of the buggy behaviour.
>
> Right.
>
> I wish we had some way to better protect against this kind of problems, but I don't have any ideas. Some things can be protected against with checksums, but if you just lose a write, there's nothing to indicate that. We don't have a per-page write counter, or a central repository of per-page LSNs or checksums, and it seems very expensive to maintain such things.

If we had data checksums in another fork we could flush them on checkpoint. These checksums could protect against a lost page update. And it would be much easier to maintain these checksums for SLRUs.

Best regards, Andrey Borodin.
On Mon, Jul 20, 2020 at 10:14 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2020-Jul-20, Dilip Kumar wrote: > > > On Fri, Jul 17, 2020 at 4:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > So if the vacuum_tolerate_damage is set then in > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > > detected, it will emit a warning and return that nothing is changed in > > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > > Since we are returning false the caller will not try to freeze such > > > tuple and the tuple_totally_frozen is also set to false so that the > > > page will not be marked to all frozen even if all other tuples in the > > > page are frozen. > > > Robert has mentioned at [1] that we probably should refuse to update > > 'relfrozenxid/relminmxid' when we encounter such tuple and emit > > WARNING instead of an error. > > Isn't this already happening per your description above? As per the above description, we are avoiding to set the page as all frozen. But the vacrelstats->scanned_pages count has already been increased for this page. Now, right after the lazy_scan_heap, we will update the pg_class tuple with the new FreezeLimit and MultiXactCutoff. > > > I think we shall do that in some cases > > but IMHO it's not a very good idea in all the cases. Basically, if > > the xmin precedes the relfrozenxid then probably we should allow to > > update the relfrozenxid whereas if the xmin precedes cutoff xid and > > still uncommitted then probably we might stop relfrozenxid from being > > updated so that we can stop CLOG from getting truncated. > > I'm not sure I understand 100% what you're talking about here (the first > half seems dangerous unless you misspoke), but in any case it seems a > pointless optimization. I mean, if the heap is corrupted, you can hope > to complete the vacuum (which will hopefully return which *other* tuples > are similarly corrupt) but trying to advance relfrozenxid is a lost > cause. I agree with your point. I think we just need to avoid advancing the relfrozenxid in all such cases. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
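For context, the advancement Dilip refers to happens via the vac_update_relstats() call at the end of heap_vacuum_rel(), which skips the update when it is handed InvalidTransactionId/InvalidMultiXactId. A sketch of the kind of change being discussed (found_corruption is a hypothetical flag that lazy_scan_heap() would set; the other names follow PostgreSQL 13):

    /* Sketch only: keep relfrozenxid/relminmxid unchanged when damage was seen */
    new_frozen_xid = (scanned_all_unfrozen && !found_corruption)
        ? FreezeLimit : InvalidTransactionId;
    new_min_multi = (scanned_all_unfrozen && !found_corruption)
        ? MultiXactCutoff : InvalidMultiXactId;

    vac_update_relstats(onerel,
                        new_rel_pages,
                        new_live_tuples,
                        new_rel_allvisible,
                        nindexes > 0,
                        new_frozen_xid,    /* InvalidTransactionId => not advanced */
                        new_min_multi,     /* InvalidMultiXactId  => not advanced */
                        false);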
On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote: > > The attached patch allows the vacuum to continue by emitting WARNING > > for the corrupted tuple instead of immediately error out as discussed > > at [1]. > > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > > control whether to continue the vacuum or to stop on the occurrence of > > a corrupted tuple. So if the vacuum_tolerate_damage is set then in > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > detected, it will emit a warning and return that nothing is changed in > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > Since we are returning false the caller will not try to freeze such > > tuple and the tuple_totally_frozen is also set to false so that the > > page will not be marked to all frozen even if all other tuples in the > > page are frozen. > > I'm extremely doubtful this is a good idea. In all likelihood this will > just exascerbate corruption. > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > getting *further* out of sync with the actual table contents. And you > cannot just freeze such tuples, because that has a good chance of making > deleted tuples suddenly visible, leading to unique constraint violations > etc. Which will then subsequently lead to clog lookup errors and such. I agree with the point. But, if we keep giving the ERROR in such cases then also the situation is not any better. Basically, we are not freezing such tuple as well as we can not advance the relfrozenxid. So if we follow the same rule that we don't freeze those tuples and also don't advance the relfrozenxid. The only difference is, allow the vacuum to continue with other tuples. > At the very least you'd need to signal up that relfrozenxid/relminmxid > cannot be increased. Without that I think it's entirely unacceptable to > do this. I agree with that point. I was just confused that shall we disallow to advance the relfrozenxid in all such cases or in some cases where the xid already precedes the relfrozenxid, we can allow it to advance as it can not become any worse. But, as Alvaro pointed out that there is no point in optimizing such cases. I will update the patch to stop advancing the relfrozenxid if we find any corrupted xid during tuple freeze. > If we really were to do something like this the option would need to be > called vacuum_allow_making_corruption_worse or such. Its need to be > *exceedingly* clear that it will likely lead to making everything much > worse. > Maybe we can clearly describe this in the document. > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > > frz->t_infomask = tuple->t_infomask; > > frz->xmax = HeapTupleHeaderGetRawXmax(tuple); > > I don't think it can be right to just update heap_prepare_freeze_tuple() > but not FreezeMultiXactId(). oh, I missed this part. I will fix it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote: > > > The attached patch allows the vacuum to continue by emitting WARNING > > > for the corrupted tuple instead of immediately error out as discussed > > > at [1]. > > > > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > > > control whether to continue the vacuum or to stop on the occurrence of > > > a corrupted tuple. So if the vacuum_tolerate_damage is set then in > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > > detected, it will emit a warning and return that nothing is changed in > > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > > Since we are returning false the caller will not try to freeze such > > > tuple and the tuple_totally_frozen is also set to false so that the > > > page will not be marked to all frozen even if all other tuples in the > > > page are frozen. > > > > I'm extremely doubtful this is a good idea. In all likelihood this will > > just exascerbate corruption. > > > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > > getting *further* out of sync with the actual table contents. And you > > cannot just freeze such tuples, because that has a good chance of making > > deleted tuples suddenly visible, leading to unique constraint violations > > etc. Which will then subsequently lead to clog lookup errors and such. > > I agree with the point. But, if we keep giving the ERROR in such > cases then also the situation is not any better. Basically, we are > not freezing such tuple as well as we can not advance the > relfrozenxid. So if we follow the same rule that we don't freeze > those tuples and also don't advance the relfrozenxid. The only > difference is, allow the vacuum to continue with other tuples. > > > At the very least you'd need to signal up that relfrozenxid/relminmxid > > cannot be increased. Without that I think it's entirely unacceptable to > > do this. > > I agree with that point. I was just confused that shall we disallow > to advance the relfrozenxid in all such cases or in some cases where > the xid already precedes the relfrozenxid, we can allow it to advance > as it can not become any worse. But, as Alvaro pointed out that there > is no point in optimizing such cases. I will update the patch to > stop advancing the relfrozenxid if we find any corrupted xid during > tuple freeze. > > > If we really were to do something like this the option would need to be > > called vacuum_allow_making_corruption_worse or such. Its need to be > > *exceedingly* clear that it will likely lead to making everything much > > worse. > > > Maybe we can clearly describe this in the document. > > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > > > frz->t_infomask = tuple->t_infomask; > > > frz->xmax = HeapTupleHeaderGetRawXmax(tuple); > > > > I don't think it can be right to just update heap_prepare_freeze_tuple() > > but not FreezeMultiXactId(). > > oh, I missed this part. I will fix it. Please find the updated patch. In this version, we don't allow the relfrozenxid and relminmxid to advance if the corruption is detected and also added the handling in FreezeMultiXactId. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote: > > > > The attached patch allows the vacuum to continue by emitting WARNING > > > > for the corrupted tuple instead of immediately error out as discussed > > > > at [1]. > > > > > > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to > > > > control whether to continue the vacuum or to stop on the occurrence of > > > > a corrupted tuple. So if the vacuum_tolerate_damage is set then in > > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is > > > > detected, it will emit a warning and return that nothing is changed in > > > > the tuple and the 'tuple_totally_frozen' will also be set to false. > > > > Since we are returning false the caller will not try to freeze such > > > > tuple and the tuple_totally_frozen is also set to false so that the > > > > page will not be marked to all frozen even if all other tuples in the > > > > page are frozen. > > > > > > I'm extremely doubtful this is a good idea. In all likelihood this will > > > just exascerbate corruption. > > > > > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > > > getting *further* out of sync with the actual table contents. And you > > > cannot just freeze such tuples, because that has a good chance of making > > > deleted tuples suddenly visible, leading to unique constraint violations > > > etc. Which will then subsequently lead to clog lookup errors and such. > > > > I agree with the point. But, if we keep giving the ERROR in such > > cases then also the situation is not any better. Basically, we are > > not freezing such tuple as well as we can not advance the > > relfrozenxid. So if we follow the same rule that we don't freeze > > those tuples and also don't advance the relfrozenxid. The only > > difference is, allow the vacuum to continue with other tuples. > > > > > At the very least you'd need to signal up that relfrozenxid/relminmxid > > > cannot be increased. Without that I think it's entirely unacceptable to > > > do this. > > > > I agree with that point. I was just confused that shall we disallow > > to advance the relfrozenxid in all such cases or in some cases where > > the xid already precedes the relfrozenxid, we can allow it to advance > > as it can not become any worse. But, as Alvaro pointed out that there > > is no point in optimizing such cases. I will update the patch to > > stop advancing the relfrozenxid if we find any corrupted xid during > > tuple freeze. > > > > > If we really were to do something like this the option would need to be > > > called vacuum_allow_making_corruption_worse or such. Its need to be > > > *exceedingly* clear that it will likely lead to making everything much > > > worse. > > > > > Maybe we can clearly describe this in the document. > > > > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, > > > > frz->t_infomask = tuple->t_infomask; > > > > frz->xmax = HeapTupleHeaderGetRawXmax(tuple); > > > > > > I don't think it can be right to just update heap_prepare_freeze_tuple() > > > but not FreezeMultiXactId(). > > > > oh, I missed this part. I will fix it. > > Please find the updated patch. 
> In this version, we don't allow the relfrozenxid and relminmxid to advance if the corruption is detected and also added the handling in FreezeMultiXactId.

In the previous version, the feature was enabled for the cluster/vacuum full command as well. In the attached patch I have enabled it only if we are running the vacuum command. It will not be enabled during a table rewrite. If we think that it should be enabled for 'vacuum full' then we might need to pass a flag from cluster_rel all the way down to heap_freeze_tuple.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote: > I'm extremely doubtful this is a good idea. In all likelihood this will > just exascerbate corruption. > > You cannot just stop freezing tuples, that'll lead to relfrozenxid > getting *further* out of sync with the actual table contents. And you > cannot just freeze such tuples, because that has a good chance of making > deleted tuples suddenly visible, leading to unique constraint violations > etc. Which will then subsequently lead to clog lookup errors and such. I think that the behavior ought to be: - If we encounter any damaged tuples (e.g. tuple xid < relfrozenxid), we give up on advancing relfrozenxid and relminmxid. This vacuum won't change them at all. - We do nothing to the damaged tuples themselves. - We can still prune pages, and we can still freeze tuples that do not appear to be damaged. This amounts to an assumption that relfrozenxid is probably sane, and that there are individual tuples that are messed up. It's probably not the right thing if relfrozenxid got overwritten with a nonsense value without changing the table contents. But, I think it's difficult to cater to all contingencies. In my experience, the normal problem here is that there are a few tuples or pages in the table that somehow escaped vacuuming for long enough that they contain references to XIDs from before the last time relfrozenxid was advanced - so continuing to do what we can to the rest of the table is the right thing to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > In the previous version, the feature was enabled for cluster/vacuum > full command as well. in the attached patch I have enabled it only > if we are running vacuum command. It will not be enabled during a > table rewrite. If we think that it should be enabled for the 'vacuum > full' then we might need to pass a flag from the cluster_rel, all the > way down to the heap_freeze_tuple. I think this is a somewhat clunky way of accomplishing this. The caller passes down a flag to heap_prepare_freeze_tuple() which decides whether or not an error is forced, and then that function and FreezeMultiXactId use vacuum_damage_elevel() to combine the results of that flag with the value of the vacuum_tolerate_damage GUC. But that means that a decision that could be made in one place is instead made in many places. If we just had heap_freeze_tuple() and FreezeMultiXactId() take an argument int vacuum_damage_elevel, then heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could arrange to pass WARNING or ERROR based on the value of vacuum_tolerate_damage. I think that would likely end up cleaner. What do you think? I also suggest renaming is_corrupted_xid to found_corruption. With the current name, it's not very clear which XID we're saying is corrupted; in fact, the problem might be a MultiXactId rather than an XID, and the real issue might be with the table's pg_class entry or something. The new arguments to heap_prepare_freeze_tuple() need to be documented in its header comment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
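In code form, the suggested shape would be roughly this (a sketch only; the extra elevel argument is hypothetical, and the existing argument list is taken from lazy_scan_heap() as of PostgreSQL 13):

    /* lazy_scan_heap(): decide the severity once ... */
    int     vacuum_damage_elevel = vacuum_tolerate_damage ? WARNING : ERROR;

    /* ... and pass it down alongside the existing freeze arguments */
    if (heap_prepare_freeze_tuple(tuple.t_data,
                                  relfrozenxid, relminmxid,
                                  FreezeLimit, MultiXactCutoff,
                                  &frozen[nfrozen],
                                  &tuple_totally_frozen,
                                  vacuum_damage_elevel))
        frozen[nfrozen++].offset = offnum;

    /* heap_freeze_tuple() and other non-VACUUM callers would simply pass ERROR */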
On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote: > If we really were to do something like this the option would need to be > called vacuum_allow_making_corruption_worse or such. Its need to be > *exceedingly* clear that it will likely lead to making everything much > worse. I don't really understand this objection. How does letting VACUUM continue after problems have been detected make anything worse? I agree that if it does, it shouldn't touch relfrozenxid or relminmxid, but the patch has been adjusted to work that way. Assuming you don't touch relfrozenxid or relminmxid, what harm befalls if you continue freezing undamaged tuples and continue removing dead tuples after finding a bad tuple? You may have already done an arbitrary amount of that before encountering the damage, and doing it afterward is no different. Doing the index vacuuming step is different, but I don't see how that would exacerbate corruption either. The point is that when you make VACUUM fail, you not only don't advance relfrozenxid/relminmxid, but also don't remove dead tuples. In the long run, either thing will kill you, but it is not difficult to have a situation where failing to remove dead tuples kills you a lot faster. The table can just bloat until performance tanks, and then the application goes down, even if you still had 100+ million XIDs before you needed a wraparound vacuum. Honestly, I wonder why continuing (but without advancing relfrozenxid or relminmxid) shouldn't be the default behavior. I mean, if it actually corrupts your data, then it clearly shouldn't be, and probably shouldn't even be an optional behavior, but I still don't see why it would do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-08-28 12:37:17 -0400, Robert Haas wrote: > On Mon, Jul 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote: > > If we really were to do something like this the option would need to be > > called vacuum_allow_making_corruption_worse or such. Its need to be > > *exceedingly* clear that it will likely lead to making everything much > > worse. > > I don't really understand this objection. How does letting VACUUM > continue after problems have been detected make anything worse? It can break HOT chains, plain ctid chains etc, for example. Which, if earlier / follower tuples are removed can't be detected anymore at a later time. > The point is that when you make VACUUM fail, you not only don't > advance relfrozenxid/relminmxid, but also don't remove dead tuples. In > the long run, either thing will kill you, but it is not difficult to > have a situation where failing to remove dead tuples kills you a lot > faster. The table can just bloat until performance tanks, and then the > application goes down, even if you still had 100+ million XIDs before > you needed a wraparound vacuum. > > Honestly, I wonder why continuing (but without advancing relfrozenxid > or relminmxid) shouldn't be the default behavior. I mean, if it > actually corrupts your data, then it clearly shouldn't be, and > probably shouldn't even be an optional behavior, but I still don't see > why it would do that. I think it's an EXTREMELY bad idea to enable anything like this by default. It'll make bugs entirely undiagnosable, because we'll remove a lot of the evidence of what the problem is. And we've had many long standing bugs in this area, several only found because we actually started to bleat about them. And quite evidently, we have more bugs to fix in the area. Greetings, Andres Freund
On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote: > It can break HOT chains, plain ctid chains etc, for example. Which, if > earlier / follower tuples are removed can't be detected anymore at a > later time. I think I need a more specific example here to understand the problem. If the xmax of one tuple matches the xmin of the next, then either both values precede relfrozenxid or both follow it. In the former case, neither tuple should be frozen and the chain should not get broken; in the latter case, everything's normal anyway. If the xmax and xmin don't match, then the chain was already broken. Perhaps we are removing important evidence, though it seems like that might've happened anyway prior to reaching the damaged page, but we're not making whatever corruption may exist any worse. At least, not as far as I can see. > And we've had many long > standing bugs in this area, several only found because we actually > started to bleat about them. And quite evidently, we have more bugs to > fix in the area. I agree with all of this, but I do not think that it establishes that we should abandon the entire VACUUM. "Bleating" about something usually means logging it, and I think you understand that I am not now nor have I ever complained about the logging we are doing here. I also think you understand why I don't like the current behavior, and that EDB has actual customers who have actually been damaged by it. All the same, I don't expect to win an argument about changing the default, but I hope to win one about at least providing an option. And if we're not even going to do that much, then I hope to come out of this discussion with a clear understanding of exactly why that's a bad idea. I don't think "we need the data for forensics" is a sufficient justification for "if you end up with one corrupted XID in a billion-row table, your entire table will bloat out the wazoo, and there is no option to get any other behavior." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 9:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > In the previous version, the feature was enabled for cluster/vacuum > > full command as well. in the attached patch I have enabled it only > > if we are running vacuum command. It will not be enabled during a > > table rewrite. If we think that it should be enabled for the 'vacuum > > full' then we might need to pass a flag from the cluster_rel, all the > > way down to the heap_freeze_tuple. > > I think this is a somewhat clunky way of accomplishing this. The > caller passes down a flag to heap_prepare_freeze_tuple() which decides > whether or not an error is forced, and then that function and > FreezeMultiXactId use vacuum_damage_elevel() to combine the results of > that flag with the value of the vacuum_tolerate_damage GUC. But that > means that a decision that could be made in one place is instead made > in many places. If we just had heap_freeze_tuple() and > FreezeMultiXactId() take an argument int vacuum_damage_elevel, then > heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could > arrange to pass WARNING or ERROR based on the value of > vacuum_tolerate_damage. I think that would likely end up cleaner. What > do you think? I agree this way it is much more cleaner. I have changed in the attached patch. > I also suggest renaming is_corrupted_xid to found_corruption. With the > current name, it's not very clear which XID we're saying is corrupted; > in fact, the problem might be a MultiXactId rather than an XID, and > the real issue might be with the table's pg_class entry or something. Okay, changed to found_corruption. > The new arguments to heap_prepare_freeze_tuple() need to be documented > in its header comment. Done. I have also done a few more cosmetic changes to the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 29, 2020 at 1:46 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
> > It can break HOT chains, plain ctid chains etc, for example. Which, if earlier / follower tuples are removed can't be detected anymore at a later time.
>
> I think I need a more specific example here to understand the problem. If the xmax of one tuple matches the xmin of the next, then either both values precede relfrozenxid or both follow it. In the former case, neither tuple should be frozen and the chain should not get broken; in the latter case, everything's normal anyway. If the xmax and xmin don't match, then the chain was already broken. Perhaps we are removing important evidence, though it seems like that might've happened anyway prior to reaching the damaged page, but we're not making whatever corruption may exist any worse. At least, not as far as I can see.

One example: suppose during vacuum there are 2 tuples in the HOT chain, and the xmin of the first tuple is corrupted (i.e. smaller than relfrozenxid), while the xmax of this tuple (which is the same as the xmin of the second tuple) is smaller than the cutoff_xid when we try to freeze the tuple. As a result, it will freeze the second tuple but the first tuple will be left untouched.

Now, if we come to heap_hot_search_buffer, the xmax of the first tuple will not match the xmin of the second tuple, as we have frozen the second tuple. But I feel this is easily fixable, right? I mean, instead of not doing anything to the corrupted tuple we can partially freeze it: we can just leave the corrupted xid alone but mark the other xid as frozen if it is smaller than cutoff_xid.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
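To put illustrative numbers on that scenario (values invented for the example):

    /*
     * Illustrative values only.  Assume relfrozenxid = 1000 and
     * cutoff_xid = 2000, with a two-member HOT chain on one page:
     *
     *   tuple A (chain root):       xmin = 500   (precedes relfrozenxid: "corrupt")
     *                               xmax = 1500  (precedes cutoff_xid)
     *   tuple B (HOT update of A):  xmin = 1500
     *
     * If A is skipped entirely while B's xmin is frozen, A's raw xmax (1500)
     * no longer matches B's xmin, so chain-following code such as
     * heap_hot_search_buffer() loses the link.  Freezing A's undamaged xmax
     * as well, as suggested above, keeps the chain consistent.
     */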
On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > One example is, suppose during vacuum, there are 2 tuples in the hot > chain, and the xmin of the first tuple is corrupted (i.e. smaller > than relfrozenxid). And the xmax of this tuple (which is same as the > xmin of the second tuple) is smaller than the cutoff_xid while trying > to freeze the tuple. As a result, it will freeze the second tuple but > the first tuple will be left untouched. > > Now, if we come for the heap_hot_search_buffer, then the xmax of the > first tuple will not match the xmin of the second tuple as we have > frozen the second tuple. But, I feel this is easily fixable right? I > mean instead of not doing anything to the corrupted tuple we can > partially freeze it? I mean we can just leave the corrupted xid alone > but mark the other xid as frozen if that is smaller then cutoff_xid. That seems reasonable to me. Andres, what do you think? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-09-14 13:26:27 -0400, Robert Haas wrote: > On Sat, Aug 29, 2020 at 4:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > One example is, suppose during vacuum, there are 2 tuples in the hot > > chain, and the xmin of the first tuple is corrupted (i.e. smaller > > than relfrozenxid). And the xmax of this tuple (which is same as the > > xmin of the second tuple) is smaller than the cutoff_xid while trying > > to freeze the tuple. As a result, it will freeze the second tuple but > > the first tuple will be left untouched. > > > > Now, if we come for the heap_hot_search_buffer, then the xmax of the > > first tuple will not match the xmin of the second tuple as we have > > frozen the second tuple. But, I feel this is easily fixable right? I > > mean instead of not doing anything to the corrupted tuple we can > > partially freeze it? I mean we can just leave the corrupted xid alone > > but mark the other xid as frozen if that is smaller then cutoff_xid. > > That seems reasonable to me. Andres, what do you think? It seems pretty dangerous to me. What exactly are you going to put into xmin/xmax here? And how would anything you put into the first tuple not break index lookups? There's no such thing as a frozen xmax (so far), so what are you going to put in there? A random different xid? FrozenTransactionId? HEAP_XMAX_INVALID? This whole approach just seems likely to exascerbate corruption while also making it impossible to debug. That's ok enough if it's an explicit user action, but doing it based on a config variable setting seems absurdly dangerous to me. Greetings, Andres Freund
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
From: Alvaro Herrera
On 2020-Sep-14, Andres Freund wrote: > It seems pretty dangerous to me. What exactly are you going to put into > xmin/xmax here? And how would anything you put into the first tuple not > break index lookups? There's no such thing as a frozen xmax (so far), so > what are you going to put in there? A random different xid? > FrozenTransactionId? HEAP_XMAX_INVALID? > > This whole approach just seems likely to exascerbate corruption while > also making it impossible to debug. That's ok enough if it's an explicit > user action, but doing it based on a config variable setting seems > absurdly dangerous to me. FWIW I agree with Andres' stance on this. The current system is *very* complicated and bugs are obscure already. If we hide them, what we'll be getting is a system where data can become corrupted for no apparent reason. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> FWIW I agree with Andres' stance on this. The current system is *very* complicated and bugs are obscure already. If we hide them, what we'll be getting is a system where data can become corrupted for no apparent reason.

I think I might have to give up on this proposal given the level of opposition to it, but the nature of the opposition doesn't make any sense to me on a technical level. Suppose a tuple with tid A has been updated, producing a new version at tid B. The argument that is now being offered is that if A has been found to be corrupt then we'd better stop vacuuming the table altogether lest we reach B and vacuum it too, further corrupting the table and destroying forensic evidence. But even ignoring the fact that many users want to get the database running again more than they want to do forensics, it's entirely possible that B < A, in which case the damage has already been done. Therefore, I can't see any argument that this patch creates any scenario that can't happen already.

It seems entirely reasonable to me to say, as a review comment, hey, you haven't sufficiently considered this particular scenario, that still needs work. But the argument here is much more about whether this is a reasonable thing to do in general and under any circumstances, and it feels to me like you guys are saying "no" without offering any really convincing evidence that there are unfixable problems here. IOW, I agree that having a GUC corrupt_my_tables_more=true is not a reasonable thing, but I disagree that the proposal on the table is tantamount to that.

The big picture here is that people have terabyte-scale tables, 1 or 2 tuples get corrupted, and right now the only real fix is to dump and restore the whole table, which leads to prolonged downtime. The pg_surgery stuff should help with that, and the work to make VACUUM report the exact TID will also help, and if we can get the heapcheck stuff Mark Dilger is working on committed, that will provide an alternative and probably better way of finding this kind of corruption, which is all to the good. However, I disagree with the idea that a typical user who has a 2TB table with one corrupted tuple on page 0 probably wants VACUUM to fail over and over again, letting the table bloat like crazy, instead of bleating loudly but still vacuuming the other 99.9999% of the table. I mean, somebody probably wants that, and that's fine. But I have a hard time imagining it as a typical view. Am I just lacking in imagination?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi, On 2020-09-14 15:50:49 -0400, Robert Haas wrote: > On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > FWIW I agree with Andres' stance on this. The current system is *very* > > complicated and bugs are obscure already. If we hide them, what we'll > > be getting is a system where data can become corrupted for no apparent > > reason. > > I think I might have to give up on this proposal given the level of > opposition to it, but the nature of the opposition doesn't make any > sense to me on a technical level. Suppose a tuple with tid A has been > updated, producing a new version at tid B. The argument that is now > being offered is that if A has been found to be corrupt then we'd > better stop vacuuming the table altogether lest we reach B and vacuum > it too, further corrupting the table and destroying forensic evidence. > But even ignoring the fact that many users want to get the database > running again more than they want to do forensics, it's entirely > possible that B < A, in which case the damage has already been done. My understanding of the case we're discussing is that it's corruption (e.g. relfrozenxid being different than table contents) affecting a HOT chain. I.e. by definition all within a single page. We won't have modified part of it independent of B < A, because freezing is all-or-nothing. Just breaking the HOT chain into two or something like that will just make things worse, because indexes won't find tuples, and because reindexing might then get confused e.g. by HOT chains without a valid start, or by having two visible tuples for the same PK. > But even ignoring the fact that many users want to get the database > running again more than they want to do forensics The user isn't always right. And I am not objecting against providing a tool to get things running. I'm objecting to VACUUM doing so, especially when it's a system wide config option triggering that behaviour. > Therefore, I can't see any argument that this patch creates any > scenario that can't happen already. It seems entirely reasonable to me > to say, as a review comment, hey, you haven't sufficiently considered > this particular scenario, that still needs work. But the argument here > is much more about whether this is a reasonable thing to do in general > and under any circumstances, and it feels to me like you guys are > saying "no" without offering any really convincing evidence that there > are unfixable problems here. I don't think that's quite the calculation. You're suggesting to make already really complicated and failure prone code even more complicated by adding heuristic error recovery to it. That has substantial cost, even if we were to get it perfectly right (which I don't believe we will). > The big picture here is that people have terabyte-scale tables, 1 or 2 > tuples get corrupted, and right now the only real fix is to dump and > restore the whole table, which leads to prolonged downtime. The > pg_surgery stuff should help with that, and the work to make VACUUM > report the exact TID will also help, and if we can get the heapcheck > stuff Mark Dilger is working on committed, that will provide an > alternative and probably better way of finding this kind of > corruption, which is all to the good. Agreed. 
> However, I disagree with the idea that a typical user who has a 2TB > with one corrupted tuple on page 0 probably wants VACUUM to fail over > and over again, letting the table bloat like crazy, instead of > bleating loudly but still vacuuming the other 0.999999% of the > table. I mean, somebody probably wants that, and that's fine. But I > have a hard time imagining it as a typical view. Am I just lacking in > imagination? I know that that kind of user exists, but yea, I disagree extremely strongly that that's a reasonable thing that the majority of users want. And I don't think that that's something we should encourage. Those cases indicate that either postgres has a bug, or their storage / memory / procedures have an issue. Reacting by making it harder to diagnose is just a bad idea. Greetings, Andres Freund
On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote: > My understanding of the case we're discussing is that it's corruption > (e.g. relfrozenxid being different than table contents) affecting a HOT > chain. I.e. by definition all within a single page. We won't have > modified part of it independent of B < A, because freezing is > all-or-nothing. Just breaking the HOT chain into two or something like > that will just make things worse, because indexes won't find tuples, and > because reindexing might then get confused e.g. by HOT chains without a > valid start, or by having two visible tuples for the same PK. If we adopt the proposal made by Dilip, we will not do that. We must have a.xmax = b.xmin, and that value is either less than relfrozenxid or it is not. If we skip an entire tuple because one XID is bad, then we could break the HOT chain when a.xmin is bad and the remaining values are OK. But if we decide separately for xmin and xmax then we should be alright. Alternately, if we're only concerned about HOT chains, we could skip the entire page if any tuple on the page shows evidence of damage. > I don't think that's quite the calculation. You're suggesting to make > already really complicated and failure prone code even more complicated > by adding heuristic error recovery to it. That has substantial cost, > even if we were to get it perfectly right (which I don't believe we > will). That's a legitimate concern, but I think it would make more sense to first make the design as good as we can and then decide whether it's adequate than to decide ab initio that there's no way to make it good enough. > > However, I disagree with the idea that a typical user who has a 2TB > > with one corrupted tuple on page 0 probably wants VACUUM to fail over > > and over again, letting the table bloat like crazy, instead of > > bleating loudly but still vacuuming the other 0.999999% of the > > table. I mean, somebody probably wants that, and that's fine. But I > > have a hard time imagining it as a typical view. Am I just lacking in > > imagination? > > I know that that kind of user exists, but yea, I disagree extremely > strongly that that's a reasonable thing that the majority of users > want. And I don't think that that's something we should encourage. Those > cases indicate that either postgres has a bug, or their storage / memory > / procedures have an issue. Reacting by making it harder to diagnose is > just a bad idea. Well, the people I tend to deal with are not going to let me conduct a lengthy investigation almost no matter what, and the more severe the operational consequences of the problem are, the less likely it is that I'm going to have time to figure anything out. Being able to create some kind of breathing room is pretty valuable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-09-14 17:00:48 -0400, Robert Haas wrote: > On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote: > > My understanding of the case we're discussing is that it's corruption > > (e.g. relfrozenxid being different than table contents) affecting a HOT > > chain. I.e. by definition all within a single page. We won't have > > modified part of it independent of B < A, because freezing is > > all-or-nothing. Just breaking the HOT chain into two or something like > > that will just make things worse, because indexes won't find tuples, and > > because reindexing might then get confused e.g. by HOT chains without a > > valid start, or by having two visible tuples for the same PK. > > If we adopt the proposal made by Dilip, we will not do that. We must > have a.xmax = b.xmin, and that value is either less than relfrozenxid > or it is not. If we skip an entire tuple because one XID is bad, then > we could break the HOT chain when a.xmin is bad and the remaining > values are OK. But if we decide separately for xmin and xmax then we > should be alright. I thought I precisely addressed this case: > What exactly are you going to put into xmin/xmax here? And how would > anything you put into the first tuple not break index lookups? There's > no such thing as a frozen xmax (so far), so what are you going to put > in there? A random different xid? FrozenTransactionId? > HEAP_XMAX_INVALID? What am I missing? Greetings, Andres Freund
On Tue, Sep 15, 2020 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-14 17:00:48 -0400, Robert Haas wrote:
> > On Mon, Sep 14, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > My understanding of the case we're discussing is that it's corruption (e.g. relfrozenxid being different than table contents) affecting a HOT chain. I.e. by definition all within a single page. We won't have modified part of it independent of B < A, because freezing is all-or-nothing. Just breaking the HOT chain into two or something like that will just make things worse, because indexes won't find tuples, and because reindexing might then get confused e.g. by HOT chains without a valid start, or by having two visible tuples for the same PK.
> >
> > If we adopt the proposal made by Dilip, we will not do that. We must have a.xmax = b.xmin, and that value is either less than relfrozenxid or it is not. If we skip an entire tuple because one XID is bad, then we could break the HOT chain when a.xmin is bad and the remaining values are OK. But if we decide separately for xmin and xmax then we should be alright.
>
> I thought I precisely addressed this case:
>
> > What exactly are you going to put into xmin/xmax here? And how would anything you put into the first tuple not break index lookups? There's no such thing as a frozen xmax (so far), so what are you going to put in there? A random different xid? FrozenTransactionId? HEAP_XMAX_INVALID?
>
> What am I missing?

What problem do you see if we set xmax to InvalidTransactionId and the HEAP_XMAX_INVALID flag in the infomask? I mean, even now, if the xmax is older than the cutoff xid then we do the same thing, i.e.

    if (freeze_xmax)
    {
        ..
        frz->xmax = InvalidTransactionId;
        ..
        frz->t_infomask &= ~HEAP_XMAX_BITS;
        frz->t_infomask |= HEAP_XMAX_INVALID;
        frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
        frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
        changed = true;
    }

So if we do that it will not be part of the hot chain anymore. I might be missing something, but I could not see how it can be more broken than what it is without our change. I agree that in the case of a corrupted xmin it can now mark the tuple with HEAP_XMAX_INVALID without freezing the xmin, but that is anyway a valid status for a tuple.

However, if we think it still can cause some issues then I feel that we can skip the whole page as Robert suggested.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote: > What problem do you see if we set xmax to the InvalidTransactionId and > HEAP_XMAX_INVALID flag in the infomask ? 1) It'll make a dead tuple appear live. You cannot do this for tuples with an xid below the horizon. 2) it'll break HOT chain following / indexes.
On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> > What problem do you see if we set xmax to the InvalidTransactionId and HEAP_XMAX_INVALID flag in the infomask ?
>
> 1) It'll make a dead tuple appear live. You cannot do this for tuples with an xid below the horizon.

How is that possible? A tuple which has a committed xmax that is older than oldestXmin should not come up for freezing unless it is a lock-only xid (because such tuples are already gone). So if the xmax is smaller than the cutoff xid, then either it is lock-only or it is aborted. If the xmax is lock-only then I don't see any issue; OTOH if it is an aborted xid and it is already smaller than the cutoff xid then it is anyway a live tuple.

> 2) it'll break HOT chain following / indexes.

If my above theory in point 1 is correct then I don't see this issue either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, On 2020-09-15 12:52:25 +0530, Dilip Kumar wrote: > On Tue, Sep 15, 2020 at 11:14 AM Andres Freund <andres@anarazel.de> wrote: > > > > On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote: > > > What problem do you see if we set xmax to the InvalidTransactionId and > > > HEAP_XMAX_INVALID flag in the infomask ? > > > > 1) It'll make a dead tuple appear live. You cannot do this for tuples > > with an xid below the horizon. > > How is it possible? Because tuple which has a committed xmax and the > xmax is older than the oldestXmin, should not come for freezing unless > it is lock_only xid (because those tuples are already gone). There've been several cases of this in the past. A fairly easy way is a corrupted relfrozenxid (of which there are many examples). You simply cannot just assume that everything is OK and argue that that's why it's ok to fix data corruption in some approximate manner. By definition everything *is not ok* if you ever come here. Greetings, Andres Freund
On Tue, Sep 15, 2020 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > > How is it possible? Because tuple which has a committed xmax and the > > xmax is older than the oldestXmin, should not come for freezing unless > > it is lock_only xid (because those tuples are already gone). > > There've been several cases of this in the past. A fairly easy way is a > corrupted relfrozenxid (of which there are many examples). Hmm, so is the case you're worried about here the case where the freezing threshold is greater than the pruning threshold? i.e. The relfrozenxid has been updated to a value greater than the xmin we derive from the procarray? If that's not the case, then I don't see what problem there can be here. To reach heap_prepare_freeze_tuple the tuple has to survive pruning. If xmin < freezing-threshold and freezing-threshold < pruning-threshold and the tuple survived pruning, then xmin must be a committed transaction visible to everyone so setting xmin to FrozenTransactionId is fine. If xmax < freezing-threshold and freezing-threshold < pruning-threshold and the tuple survived pruning, xmax must be visible to everyone and can't be running so it must have aborted, so setting xmax to InvalidTransactionId is fine. On the other hand if, somehow, freezing-threshold > pruning-threshold, then freezing seems categorically unsafe. Doing so would change visibility decisions of transactions that are still running, or that were running at the time when we computed the pruning threshold. But the sanity checks in heap_prepare_freeze_tuple() seem like they would catch many such cases, but I'm not sure if they're all water-tight. It might be better to skip calling heap_prepare_freeze_tuple() altogether if the freezing threshold does not precede the pruning threshold. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
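A sketch of that last guard, following the wording above (FreezeLimit and OldestXmin are the freezing and pruning thresholds computed in vacuumlazy.c; this is not committed code):

    /*
     * Sketch only: don't attempt freezing unless the freezing threshold
     * precedes the pruning threshold; otherwise freezing could change
     * visibility decisions for transactions that may still be running.
     * (An equal threshold would arguably still be safe, since only xids
     * strictly older than the cutoff are frozen.)
     */
    if (TransactionIdPrecedes(FreezeLimit, OldestXmin))
    {
        if (heap_prepare_freeze_tuple(tuple.t_data,
                                      relfrozenxid, relminmxid,
                                      FreezeLimit, MultiXactCutoff,
                                      &frozen[nfrozen],
                                      &tuple_totally_frozen))
            frozen[nfrozen++].offset = offnum;
    }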