Re: Backpatch b61d161c14 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Backpatch b61d161c14
Date
Msg-id 20200622200939.jkuniq3vtiumeszj@alap3.anarazel.de
Whole thread Raw
In response to Backpatch b61d161c14  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
Hi,

On 2020-06-22 10:35:47 +0530, Amit Kapila wrote:
> I propose to backpatch b61d161c14 [1] (Introduce vacuum errcontext to
> display additional information.).  In the recent past, we have seen an
> error report similar to "ERROR: found xmin 2157740646 from before
> relfrozenxid 1197" from multiple EDB customers.  A similar report is
> seen on pgsql-bugs as well [2] which I think has triggered the
> implementation of this feature for v13.  Such reports mostly indicate
> database corruption rather than any postgres bug which is also
> indicated by the error-code (from before relfrozenxid) for this
> message. I think there is a good reason to back-patch this as multiple
> users are facing similar issues.  This patch won't fix this issue but
> it will help us in detecting the problematic part of the heap/index
> and then if users wish they can delete the portion of data that
> appeared to be corrupted and resume the operations on that relation.
> 
> I have tried to back-patch this for v12 and attached is the result.
> The attached patch passes make check-world but I have yet to test it
> manually and also prepare the patch for other branches once we agree
> on this proposal.

I think having the additional information in the back branches would be
good. But on the other hand I think this is a somewhat large change
to backpatch, and it hasn't yet much real world exposure.

I'm also uncomfortable with the approach of just copying all of
LVRelStats in several places:

>  /*
> @@ -1580,9 +1648,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
>      int            uncnt = 0;
>      TransactionId visibility_cutoff_xid;
>      bool        all_frozen;
> +    LVRelStats    olderrinfo;
>  
>      pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
>  
> +    /* Update error traceback information */
> +    olderrinfo = *vacrelstats;
> +    update_vacuum_error_info(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
> +                             blkno, NULL);
> +
>      START_CRIT_SECTION();
>  
>      for (; tupindex < vacrelstats->num_dead_tuples; tupindex++)
> @@ -1659,6 +1733,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
>                                *vmbuffer, visibility_cutoff_xid, flags);
>      }
>  
> +    /* Revert to the previous phase information for error traceback */
> +    update_vacuum_error_info(vacrelstats,
> +                             olderrinfo.phase,
> +                             olderrinfo.blkno,
> +                             olderrinfo.indname);
>      return tupindex;
>  }

To me that's a very weird approach. It's fragile because we need to be
sure that there's no updates to the wrong LVRelStats for important
things, and it has a good bit of potential to be inefficient because
LVRelStats isn't exactly small. This pretty much relies on the compiler
doing good enough escape analysis to realize that most parts of
olderrinfo aren't touched.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: More tzdb fun: POSIXRULES is being deprecated upstream
Next
From: Jeff Davis
Date:
Subject: Re: Default setting for enable_hashagg_disk