Re: recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: recovering from "found xmin ... from before relfrozenxid ..."
Date
Msg-id CA+fd4k7mgHGHa2UxPsAONR9WaMNJo-7ybB5iuc63J77nNzF=Mw@mail.gmail.com
Whole thread Raw
In response to Re: recovering from "found xmin ... from before relfrozenxid ..."  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Responses Re: recovering from "found xmin ... from before relfrozenxid ..."
List pgsql-hackers
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thanks Robert for the review. Please find my comments inline below:
>
> On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
> >
> > Compiler warning:
> >
> > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> > always false [-Werror,-Wtautological-compare]
> >                 if (blkno < 0 || blkno >= nblocks)
> >                     ~~~~~ ^ ~
> >
>
> Fixed.
>
> > There's a certain inconsistency to these messages:
> >
> > rhaas=# create table foo (a int);
> > CREATE TABLE
> > rhaas=# insert into foo values (1);
> > INSERT 0 1
> > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> > NOTICE:  skipping tid (0, 2) because it contains an invalid offset
> >  heap_force_kill
> > -----------------
> >
> > (1 row)
> >
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> > ERROR:  invalid item pointer
> > LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> > ERROR:  block number 1 is out of range for relation "foo"
> >
> > From a user perspective it seems like I've made three very similar
> > mistakes: in the first case, the offset is too high, in the second
> > case it's too low, and in the third case the block number is out of
> > range. But in one case I get a NOTICE and in the other two cases I get
> > an ERROR. In one case I get the relation name and in the other two
> > cases I don't. The two complaints about an invalid offset are phrased
> > completely differently from each other. For example, suppose you do
> > this:
> >
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> > number is out of range (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> > number is out of range for this block (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
> >
>
> Corrected.
>
> > I think I misled you when I said to use pg_class_aclcheck. I think it
> > should actually be pg_class_ownercheck.
> >
>
> okay, I've changed it to pg_class_ownercheck.
>
> > I think the relkind sanity check should permit RELKIND_MATVIEW also.
> >
>
> Yeah, actually we should allow MATVIEW, don't know why I thought of
> blocking it earlier.
>
> > It's unclear to me why the freeze logic here shouldn't do this part
> > what heap_prepare_freeze_tuple() does when freezing xmax:
> >
> >         frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
> >         frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
> >
>
> Yeah, we should have these changes when freezing the xmax.
>
> > Likewise, why should we not freeze or invalidate xvac in the case
> > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> > would do?
> >
>
> Again, we should have this as well.
>
> Apart from above, this time I've also added the documentation on
> pg_surgery module and added a few more test-cases.
>
> Attached patch with above changes.
>

Thank you for updating the patch! Here are my comments on v5 patch:

--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
        pg_standby  \
        pg_stat_statements \
        pg_trgm     \
+       pg_surgery  \
        pgcrypto    \

I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.

---
+           if (heap_force_opt == HEAP_FORCE_KILL)
+               ItemIdSetDead(itemid);

I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.

---
+       /*
+        * We do not mark the buffer dirty or do WAL logging for unmodifed
+        * pages.
+        */
+       if (!did_modify_page)
+           goto skip_wal;
+
+       /* Mark buffer dirty before we write WAL. */
+       MarkBufferDirty(buf);
+
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);
+
+skip_wal:
+       END_CRIT_SECTION();
+

s/unmodifed/unmodified/

Do we really need to use goto? I think we can modify it like follows:

    if (did_modity_page)
    {
       /* Mark buffer dirty before we write WAL. */
       MarkBufferDirty(buf);

       /* XLOG stuff */
       if (RelationNeedsWAL(rel))
           log_newpage_buffer(buf, true);
    }

    END_CRIT_SECTION();

---
pg_force_freeze() can revival a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.

set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);

update tbl set a = a + 100 where a = 1;

explain analyze select * from tbl where a < 200;

-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);

-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;

-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;

Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset ctid value as
well or to not freezing already-deleted tuple.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: massive FPI_FOR_HINT load after promote
Next
From: Thomas Munro
Date:
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions