Re: Heap WARM Tuples - Design Draft - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: Heap WARM Tuples - Design Draft
Date
Msg-id CABOikdPgsknwYcWaPkU6ow5Zz-MZ644n_e-+bnOWqO2poeB2Cw@mail.gmail.com
Whole thread Raw
In response to Re: Heap WARM Tuples - Design Draft  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: Heap WARM Tuples - Design Draft  (Bruce Momjian <bruce@momjian.us>)
Re: Heap WARM Tuples - Design Draft  (Claudio Freire <klaussfreire@gmail.com>)
Re: Heap WARM Tuples - Design Draft  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers


On Fri, Aug 5, 2016 at 8:23 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Aug 4, 2016 at 11:15 PM, Bruce Momjian <bruce@momjian.us> wrote:

>
> OK, that's a lot of text, and I am confused.  Please tell me the
> advantage of having an index point to a string of HOT chains, rather
> than a single one?  Is the advantage that the index points into the
> middle of the HOT chain rather than at the beginning?  I can see some
> value to that, but having more ctid HOT headers that can't be removed
> except by VACUUM seems bad, plus the need for index rechecks as you
> cross to the next HOT chain seems bad.
>
> The value of WARM is to avoid index bloat --- right now we traverse the
> HOT chain on a single page just fine with no complaints about speed so I
> don't see a value to optimizing that traversal, and I think the key
> rechecks and ctid bloat will make it all much slower.
>
> It also seems much more complicated.

The point is avoiding duplicate rows in the output of index scans.

I don't think you can avoid it simply by applying index condition
rechecks as the original proposal implies, in this case:

CREATE TABLE t (id integer not null primary key, someid integer, dat integer);
CREATE INDEX i1 ON t (someid);

INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);
UPDATE t SET dat = dat + 1 where id = 1;
UPDATE t SET dat = dat + 1, id = 2 where id = 1;
UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
SELECT * FROM t WHERE someid = 2;

That, I believe, will cause the problematic chains where the condition
recheck passes both times the last visible tuple is visited during the
scan. It will return more than one tuple when in reality there is only
one.

Hmm. That seems like a real problem. The only way to address that is to ensure that for a given WARM chain and given index key, there must exists only a single index entry. There can and will be multiple entries if the index key changes, but if the index key changes to the original value (or any other value already in the WARM chain) again, we must not add another index entry. Now that does not seem like a very common scenario and skipping a duplicate index entry does have its own benefit, so may be its not a terrible idea to try that. You're right, it may be expensive to search for an existing matching index key for the same root line pointer. But if we could somehow short-circuit the more common case, it might still be worth trying.

The other idea you mentioned is to index intermediate tuples but somehow restrict WARM chain following knowing that the subsequent tuples are reachable via different index tuple. I see two major problems with that approach:

1. When HOT or WARM chains are pruned and dead tuples removed, we may lose the knowledge about key changes between intermediate updates. Without that its seems difficult to know when to stop while following chain starting from the old index tuple.

2. The existence of index pointers to intermediate tuples will lead to index bloat. This is the same problem that we found in Simon's original proposal (discussed internally). None of the intermediate line pointers can be vacuumed until the entire chain becomes DEAD. Event if the a duplicate index key is inserted for every index, knowing that and reclaiming to the index pointers to the original root line pointer, will be difficult.
 

Not to mention the cost of scanning the chain of N tuples N times,
making it quadratic - bound by the number of tuples that can fit a
page, but still high.

Solving that, I believe, will remove all the simplicity of pointing to
root tuples.


You're right. But having all indexes point to the root line pointer makes things simpler to manage and less buggy. So if we can somehow solve the duplicate tuples problem, even at additional cost at update time, it might still be worth. I would suggest that we should think more and we can find some solution to the problem.
 

I don't really see how you'd do that on yours. You seem to assume
finding a particular key-item pointer pair in an index would be
efficient, but IMHO that's not the case.

That's true. But we could somehow short-circuit the more common pattern, that might still be worth. For corner cases, we can fall back to non-HOT update and keep things simple. It will still be a huge improvement over what we have currently.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: multivariate statistics (v19)
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Re: Possible duplicate release of buffer lock.