Re: should there be a hard-limit on the number of transactions pending undo? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: should there be a hard-limit on the number of transactions pending undo?
Date
Msg-id CAH2-Wzkc6MYi=ffELG-0zNrKisxL0ampMLXMc4vrEppNkRFrBw@mail.gmail.com
In response to Re: should there be a hard-limit on the number of transactions pending undo?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Jul 29, 2019 at 1:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think there's a cost-benefit analysis here.  You're completely
> correct that inserting new index tuples causes write amplification
> and, yeah, that's bad. On the other hand, row forwarding has its own
> costs. If a row ends up persistently moved to someplace else, then
> every subsequent access to that row has an extra level of indirection.

The devil is in the details. It doesn't seem that optimistic to assume
that a good implementation could practically always avoid it, by being
clever about heap fillfactor. It can work a bit like external TOAST
pointers. The oversized datums can go on the other heap page, which
presumably will not be in the SELECT list of most queries. It won't be one
of the indexed columns in typical cases, so index scans will generally
only have to visit one heap page.

It occurs to me that the zheap design is still sensitive to heap
fillfactor in much the same way as it would be with reliably-stable
TIDs, combined with some amount of row forwarding. It's not essential
for correctness that you avoid creating a new HOT chain (or whatever
it's called in zheap) with new index tuples, but it is still quite
preferable on performance grounds. It's still worth going to a lot of
work to avoid having that happen, such as using external TOAST
pointers with some of the larger datums on the existing heap page.

> If it ends up split between two places, every read of that row incurs
> two reads. The "someplace else" where moved rows or ends of split rows
> are stored has to be skipped by sequential scans, which is complex and
> possibly inefficient if it breaks up a sequential I/O pattern. Those
> things are bad, too.
>
> It's a little difficult to compare the kinds of badness.

I would say that it's extremely difficult. I'm not going to speculate
about how the two approaches might compare today.

> I haven't run across the "ghost bit" terminology before.  Is there a
> good place to read about the technique you're assuming here?

"5.2 Key Range Locking and Ghost Records" from "A Survey of B-Tree
Locking Techniques" seems like a good place to start. As I said
earlier, the paper is available from:
https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf

This description won't define the term ghost record/bit in a precise
way that you can just adopt, since the details will vary somewhat
based on considerations like whether or not MVCC is used. But you'll
get the general idea from the paper, I think.

> To put that another way, there seems to be pretty clearly a need for a
> bit, but what does the bit mean?  It could mean "please check the undo
> log," in which case it'd have to be set on insert, eventually cleared,
> and then reset on delete, but I think that's likely to suck.  I think
> therefore that the bit should mean
> is-deleted-but-not-necessarily-all-visible-yet, which avoids that
> problem.

That sounds about right to me.

-- 
Peter Geoghegan


