Re: recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers

From Rajkumar Raghuwanshi
Subject Re: recovering from "found xmin ... from before relfrozenxid ..."
Date
Msg-id CAKcux6kBQxHN1FRGU8RkPWuuu+4pFvkoTMMMczYJQGrVwCtXEg@mail.gmail.com
Whole thread Raw
In response to Re: recovering from "found xmin ... from before relfrozenxid ..."  (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>)
Responses Re: recovering from "found xmin ... from before relfrozenxid ..."
List pgsql-hackers

I have been doing some user-level testing of this feature, apart from sanity test for extension and it's functions

I have tried to corrupt tuples and then able to fix it using heap_force_freeze/kill functions.


--corrupt relfrozenxid, cause vacuum failed.

update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';

UPDATE 1

insert into test_tbl values (2, 'BBB');


postgres=# vacuum test_tbl;

ERROR:  found xmin 507 from before relfrozenxid 516

CONTEXT:  while scanning block 0 of relation "public.test_tbl"


postgres=# select *, ctid, xmin, xmax from test_tbl;

 a |  b  | ctid  | xmin | xmax 

---+-----+-------+------+------

 1 | AAA | (0,1) |  505 |    0

 2 | BBB | (0,2) |  507 |    0

(2 rows)


--fixed using heap_force_freeze

postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);

 heap_force_freeze 

-------------------


postgres=# vacuum test_tbl;

VACUUM

postgres=# select *, ctid, xmin, xmax from test_tbl;

 a |  b  | ctid  | xmin | xmax 

---+-----+-------+------+------

 1 | AAA | (0,1) |  505 |    0

 2 | BBB | (0,2) |    2 |    0

(2 rows)


--corrupt table headers in base/oid. file, cause table access failed.

postgres=# select ctid, * from test_tbl;

ERROR:  could not access status of transaction 4294967295

DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.


--removed corrupted tuple using heap_force_kill

postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);

 heap_force_kill 

-----------------

 

(1 row)


postgres=# select ctid, * from test_tbl;

 ctid  | a |  b  

-------+---+-----

 (0,1) | 1 | AAA

(1 row)


I will be continuing with my testing with the latest patch and update here if found anything.


Thanks & Regards,
Rajkumar Raghuwanshi


On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,10000);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +-------------------
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.
>
> Done.
>
> Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take RowExclusiveLock, since we are going to modify
> > rows. Not sure how much it matters, though.
> >
>
> I don't know how it would make a difference, but still as you said
> replaced AccessShare with RowExclusive.
>
> > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > + errmsg("must be superuser or object owner to use %s.",
> > + force_opt == FORCE_KILL ? "heap_force_kill()" :
> > + "heap_force_freeze()")));
> >
> > This is the wrong way to do a permissions check, and it's also the
> > wrong way to write an error message about having failed one. To see
> > the correct method, grep for pg_class_aclcheck. However, I think that
> > we shouldn't in general trust the object owner to do this, unless the
> > super-user gave permission. This is a data-corrupting operation, and
> > only the boss is allowed to authorize it. So I think you should also
> > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> > have this check as a backup. Then, the superuser is always allowed,
> > and if they choose to GRANT EXECUTE on this function to some users,
> > those users can do it for their own relations, but not others.
> >
>
> Done.
>
> > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > + errmsg("only heap AM is supported")));
> > +
> > + check_relation_relkind(rel);
> >
> > Seems like these checks are in the wrong order.
>
> I don't think there is anything wrong with the order. I can see the
> same order in other contrib modules as well.
>
> Also, maybe you could
> > call the function something like check_relation_ok() and put the
> > permissions test, the relkind test, and the relam test all inside of
> > it, just to tighten up the code in this main function a bit.
> >
>
> Yeah, I've added a couple of functions named sanity_check_relation and
> sanity_check_tid_array and shifted all the sanity checks there.
>
> > + if (noffs > maxoffset)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("number of offsets specified for block %u exceeds the max
> > offset number %u",
> > + blkno, maxoffset)));
> >
> > Hmm, this doesn't seem quite right. The actual problem is if an
> > individual item pointer's offset number is greater than maxoffset,
> > which can be true even if the total number of offsets is less than
> > maxoffset. So I think you need to remove this check and add a check
> > inside the loop which follows that offnos[i] is in range.
> >
>
> Agreed and done.
>
> > The way you've structured that loop is actually problematic -- I don't
> > think we want to be calling elog() or ereport() inside a critical
> > section. You could fix the case that checks for an invalid force_opt
> > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> > NOTICE case you have here is a bigger problem.
>
> Done.
>
> You really cannot
> > modify the buffer like this and then decide, oops, never mind, I think
> > I won't mark it dirty or write WAL for the changes. If you do that,
> > the buffer is still in memory, but it's now been modified. A
> > subsequent operation that modifies it will start with the altered
> > state you created here, quite possibly leading to WAL that cannot be
> > correctly replayed on the standby. In other words, you've got to
> > decide for certain whether you want to proceed with the operation
> > *before* you enter the critical section. You also need to emit any
> > messages before or after the critical section.  So you could:
> >
>
> This is still not clear. I think Robert needs to respond to my earlier comment.
>
> > I believe this violates our guidelines on message construction. Have
> > two completely separate messages -- and maybe lose the word "already":
> >
> > "skipping tid (%u, %u) because it is dead"
> > "skipping tid (%u, %u) because it is unused"
> >
> > The point of this is that it makes it easier for translators.
> >
>
> Done.
>
> > I see very little point in what verify_tid() is doing. Before using
> > each block number, we should check that it's less than or equal to a
> > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> > any case to avoid funny errors; and then the check here against
> > specifically InvalidBlockNumber is redundant. For the offset number,
> > same thing: we need to check each offset against the page's
> > PageGetMaxOffsetNumber(page); and if we do that then we don't need
> > these checks.
> >
>
> Done.
>
> Please check the attached patch for the changes.

I also looked at this version patch and have some small comments:

+   Oid             relid = PG_GETARG_OID(0);
+   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+   ItemPointer     tids;
+   int             ntids;
+   Relation        rel;
+   Buffer          buf;
+   Page            page;
+   ItemId          itemid;
+   BlockNumber     blkno;
+   OffsetNumber   *offnos;
+   OffsetNumber    offno,
+                   noffs,
+                   curr_start_ptr,
+                   next_start_ptr,
+                   maxoffset;
+   int             i,
+                   nskippedItems,
+                   nblocks;

You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.

---
+           if (offnos[i] > maxoffset)
+           {
+               ereport(NOTICE,
+                        errmsg("skipping tid (%u, %u) because it
contains an invalid offset",
+                               blkno, offnos[i]));
+               continue;
+           }

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.

---
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);

I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: PROC_IN_ANALYZE stillborn 13 years ago
Next
From: Ashutosh Sharma
Date:
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."