
From Claudio Freire
Subject [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date
Msg-id CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
Responses Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Robert Haas <robertmhaas@gmail.com>)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Amit Kapila <amit.kapila16@gmail.com>)
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
The attached patch tries to maintain, throughout the whole life of a
B-Tree, the initial state of the index, which is created with
equal-key runs in heap physical order, and to make key-tid pairs
efficiently searchable in the process.
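
To make that concrete, here is a minimal, self-contained sketch of the
ordering this aims at; the types and names below are invented for
illustration and are not taken from the patch:

    /*
     * Sketch only: equal keys are further ordered by heap location
     * (block, offset), so the heap TID effectively becomes a trailing
     * tie-breaker column and key-tid pairs can be binary-searched.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct ExampleItemPointer
    {
        uint32_t    block;      /* heap block number */
        uint16_t    offset;     /* line pointer offset within the block */
    } ExampleItemPointer;

    typedef struct ExampleIndexItem
    {
        int64_t             key;    /* stand-in for the real (possibly multi-column) key */
        ExampleItemPointer  tid;    /* heap location the index item points to */
    } ExampleIndexItem;

    /* Compare two index items: key first, heap location as tie-breaker. */
    static int
    example_item_compare(const ExampleIndexItem *a, const ExampleIndexItem *b)
    {
        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;
        if (a->tid.block != b->tid.block)
            return (a->tid.block < b->tid.block) ? -1 : 1;
        if (a->tid.offset != b->tid.offset)
            return (a->tid.offset < b->tid.offset) ? -1 : 1;
        return 0;
    }

    int
    main(void)
    {
        ExampleIndexItem a = { 42, { 10, 3 } };
        ExampleIndexItem b = { 42, { 11, 1 } };

        /* Same key, but 'a' points to an earlier heap block, so it sorts first. */
        printf("compare = %d\n", example_item_compare(&a, &b));
        return 0;
    }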

The patch doesn't add an interface to perform key-tid lookups, nor
does it address all potential issues that could arise from the
implementation, or backward compatibility with earlier-format indexes,
but it passes all tests and seems to work well with the workloads I
have tried so far, which include:

 - dump/restore/compare (with lots of indexes, both unique and non)
 - sanity check of index lookups
 - running a complex application on top of it, no crashes or
corruption (detected)

Recovery hasn't been tested at all, and I'd expect it to have issues;
though I have tried to make the necessary changes, I may have missed
more than one. Nor have I tested high-concurrency workloads (although
the application mentioned above does produce some), so this is an
early WIP in search of feedback on the overall approach. Still, it
should be enough to measure the merits of the idea, namely index size,
performance and code impact.

A couple of points make me uneasy about this patch, yet I can think of
no better alternative, so I seek feedback:

 - introducing a different format for inner index tuples makes for an
invasive patch and code that is quite difficult to reason about (it's
easy to forget whether a page is a leaf or an inner page, and that now
causes assertion failures or segfaults); a toy illustration of the two
tuple formats follows this list
 - the ascent-descent needed to check for uniqueness when there are
large runs of dead tuples could have locking subtleties that escape
me. Perhaps there are better ways to solve this.
 - the patch changes the output of regression tests in
nondeterministic ways, since index scan results now always follow
physical order. While that's not incorrect, and in fact desirable,
lots of tests depended on the almost-deterministic order that resulted
from the way a suitable insertion point used to be selected. The first
patch addresses that, but at the expense of some "coverage" (ignoring
some lines of reference output, the minimum I could ignore).
 - while I don't see why it would cause more bloat under normal usage
than the previous method, there are probably new/different
pathological cases to account for, and maybe I'm missing one that is a
common pattern
 - it's btree, and it doesn't get any more critical than that: it
really needs a heapload of testing before going in, and some of it
(crash recovery) may be tough to get right
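
As a toy illustration of the first point above (all types and names
here are hypothetical, not patch code): once inner tuples carry the
heap TID as part of their key plus a downlink, every access has to
know which tuple layout it is looking at, and forgetting the
leaf/inner distinction is exactly the kind of mistake that now trips
assertions:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum ExamplePageKind { EXAMPLE_LEAF_PAGE, EXAMPLE_INNER_PAGE } ExamplePageKind;

    typedef struct ExampleLeafTuple
    {
        int64_t     key;
        uint32_t    heap_block;     /* heap location the entry points to */
        uint16_t    heap_offset;
    } ExampleLeafTuple;

    typedef struct ExampleInnerTuple
    {
        int64_t     key;
        uint32_t    heap_block;     /* tie-breaker carried up from the leaf level */
        uint16_t    heap_offset;
        uint32_t    child_block;    /* downlink to the child page */
    } ExampleInnerTuple;

    /*
     * Fetch the downlink of an inner tuple.  Calling this on a leaf tuple
     * would silently read garbage, so the page kind has to be checked
     * everywhere a tuple is interpreted.
     */
    static uint32_t
    example_get_downlink(ExamplePageKind kind, const void *tuple)
    {
        assert(kind == EXAMPLE_INNER_PAGE);
        return ((const ExampleInnerTuple *) tuple)->child_block;
    }

    int
    main(void)
    {
        ExampleInnerTuple t = { 42, 10, 3, 1234 };

        printf("downlink = %u\n", (unsigned) example_get_downlink(EXAMPLE_INNER_PAGE, &t));
        return 0;
    }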

On the performance side, I don't see a significant impact. If
anything, it has the potential to speed things up, since scans over
long runs of equal keys will now be done in physical order, and
insertions will be spread evenly across the index leaf pages
(following the spread of heap tuples), but I haven't done any
benchmarks beyond pgbench to back that up.

It is also an important enabling feature for making vacuum more
efficient and for implementing WARM tuples in a more general way, so
that has to be accounted for when evaluating the potential benefit of
this patch.

So, the numbers.

I measured the size impact on a dump of a test database, and it seems
to be under the noise, which is what I expected since the change only
affects non-leaf pages:

patched

db: 23G
hourly_full: 562M + 790M (118M + 118M + 554M)
bid_logs: 591M + 268M (134M + 134M)
uas: 285M + 496M (20M + 138M + 159M + 20M + 159M)

unpatched

db: 23G
hourly_full: 562M + 789M (118M + 118M + 553M)
bid_logs: 591M + 268M (134M + 134M)
uas: 285M + 496M (20M + 138M + 159M + 20M + 159M)

Here, numbers are HEAP + INDEXES (i1 + i2 + i3 ...), for tables that
are a mix of wide and narrow rows, with both wide (varchar) and narrow
(int) indexed columns. Only a few indexes seem to have grown, but this
is before any bloat accumulates (I haven't yet devised a test that
produces the same amount of bloat on both databases to make them
comparable).
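
As a rough sanity check on why a change confined to non-leaf pages
should stay under the noise, here is a back-of-envelope computation;
the leaf page count, fanout and per-tuple growth below are assumptions
for illustration, not measurements from these tables:

    #include <stdio.h>

    int
    main(void)
    {
        double leaf_pages = 100000.0;   /* hypothetical index with 100k leaf pages */
        double fanout = 300.0;          /* assumed separator tuples per inner page */
        double inner_pages = leaf_pages / fanout;   /* higher levels are negligible */
        double growth = 0.25;           /* assume inner tuples grow by 25% */

        /* Inner pages are roughly a 1/fanout fraction of the leaf level... */
        printf("inner share of the index: %.3f%%\n",
               100.0 * inner_pages / (leaf_pages + inner_pages));
        /* ...so even a sizable per-tuple overhead there barely moves the total. */
        printf("total index growth at +25%% per inner tuple: %.4f%%\n",
               100.0 * (inner_pages * growth) / (leaf_pages + inner_pages));
        return 0;
    }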

pgbench, in-memory, patched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13271118
latency average: 0.181 ms
tps = 44237.026822 (including connections establishing)
tps = 44237.227678 (excluding connections establishing)

pgbench, in-memory, unpatched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13336128
latency average: 0.180 ms
tps = 44453.718954 (including connections establishing)
tps = 44453.895436 (excluding connections establishing)

pgbench, bigger-than-RAM, patched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 71194
latency average: 33.711 ms
tps = 237.269144 (including connections establishing)
tps = 237.270122 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 26748
latency average: 89.726 ms
tps = 89.043800 (including connections establishing)
tps = 89.044218 (excluding connections establishing)


pgbench, bigger-than-RAM, unpatched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 86495
latency average: 27.747 ms
tps = 288.256959 (including connections establishing)
tps = 288.258202 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 27145
latency average: 88.414 ms
tps = 90.461600 (including connections establishing)
tps = 90.462105 (excluding connections establishing)



This is expected, since pgbench exercises none of the potentially
improved paths. I/O here (just a laptop) is horrible, so the
difference is probably just noise. The drop in select-only seems
strangely high; I think that's also noise, but I haven't re-run the
tests yet.

