Home > mailing lists

Re: Indirect indexes - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: Indirect indexes
Date	October 19, 2016 22:30:12
Msg-id	CAPpHfdsdaggcEjQaRhuD1yRK8KKe=kRDYrikW8QCPbzsDTZL8A@mail.gmail.com Whole thread
In response to	Re: Indirect indexes (Simon Riggs <simon@2ndquadrant.com>)
List	pgsql-hackers

Tree view

On Wed, Oct 19, 2016 at 12:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 18 October 2016 at 23:46, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

> Then vacuum removes (0;1) from heap, reference to (0;1) from tbl_pk_idx.
> But how will it remove (1,1) tuple from tbl_val_indirect_idx? Thus, before
> vacuuming tbl_val_indirect_idx we should know not only values of id which
> are being removed, but actually (id, val) pairs which are being removed.
> Should we collect those paris while scanning heap? But we should also take
> into account that multiple heap tuples might have same (id, val) pair values
> (assuming there could be other columns being updated). Therefore, we should
> take into account when last pair of particular (id, val) pair value was
> deleted from heap. That would be very huge change to vacuum, may be even
> writing way more complex vacuum algorithm from scratch. Probably, you see
> the better solution of this problem.

The best way to sum up the problem is to consider how we deal with
repeated updates to a single tuple that flip the value from A to B
then back to A then to B then A etc.. Any value in the index can point
to multiple versions of the same tuple and multiple index values can
point to the same tuple (PK value). This problem behaviour was already
known to me from Claudio's earlier analysis of WARM (thanks Claudio).

Thank you for pointing. I didn't follow details of WARM discussion.

Yes, VACUUMing that is likely to be a complex issue, as you say. At
the moment I don't have a plan for that, but am not worried.

AFAICS, the main goal of indirect indexes is to reduce their maintenance cost. Indirect indexes are much easier to maintain during UPDATEs and this is good. But it's harder to VACUUM them. So, we need to figure out how much maintenance cost would be reduced for indirect indexes. This is why I think digging into VACUUM problems is justified for now.

Indirect indexes produce less index entries in general than current,
so the problem is by-design much smaller than current situation.
Indirect indexes can support killed-tuple interface, so scanning the
index by users will result in natural index maintenance, further
reducing the problem.

That makes sense. But that is not necessary true for any workload. For instance, keys, which are frequently updated, are not necessary same that keys, which are frequently selected. Thus, there is still some risk of bloat.

So there will be a much reduced need for bulk maintenance. Bulk
maintainence of the index, when needed, can be performed by scanning
the whole table via the index, after the PK index has been vacuumed.

That's possible, but such vacuum is going to be very IO consuming when heap doesn't fit cache. It's even possible that rebuilding of index would be cheaper.

That can be optimized using an index-only scan of the PK to avoid
touching the heap, which should be effective since the VM has been so
recently refreshed.

But we can't get which of indirect index keys still persist in heap by using index only scan by PK, because PK doesn't contain those keys. So, we still need to scan heap for it.

For correctness it would require the index blocks
to be locked against write while checking for removal, so bulk
collection of values to optimize the underlying index doesn't seem
useful. The index scan could also be further optimized by introducing
a visibility map for the index, which is something that would also
optimize normal index VACUUMs as well, but that is a later project and
not something for 10.x

Visibility map for indexes sounds interesting. And that means including visibility information into index. It's important property of current MVCC implementation of PostgreSQL, that while updating heap tuple, we don't have to find location of old index tuples referring it, we only have to insert new index tuples. Finding location of old index tuples, even for barely updating index visibility map, would be a substantial change.

At this stage, the discussion should be 1) can it work? 2) do we want
it?

I think that we definitely need indirect indexes and they might work. The question is design. PostgreSQL MVCC is designed so that index contain no visibility information. So, currently we're discussing approach which implies expanding of this design to indirect indexes. The downsides of this approach are (at least): 1) we should always recheck results obtained from index, 2) VACUUM becomes very difficult.

There is also alternative approach: include visibility information into indirect index. In this approach we should include fields required for visibility (xmin, xmax, etc) into indirect index tuple and keep them up to date. Then while updating indexed column we would have to update old index tuple as well. This is the downside. But we would be able to scan without recheck and VACUUM will be much more easier. We would be even able to VACUUM indirect index independently from heap. Implementation of this approach would be way more intrusive. But in my opinion it's much more clear design.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-hackers by date:

From: Claudio Freire
Date: 19 October 2016, 22:08:47
Subject: Re: Indirect indexes

From: Alvaro Herrera
Date: 19 October 2016, 22:30:57
Subject: Re: Avoiding pin scan during btree vacuum

Re: Indirect indexes - Mailing list pgsql-hackers

Previous

Next