Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

From Claudio Freire
Subject Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date
Msg-id CAGTBQpY8BS7SfR-DHVh45TW+4CXuM04bP28-eijtt=dyAD5YQA@mail.gmail.com
In response to Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, the
>>> original heap pointer in the leaf index tuple is copied into the
>>> downlink index tuple when splitting pages into an additional
>>> IndexTupleData header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative, without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.
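
(To restate the quoted description in picture form, this is roughly how a
non-leaf tuple is laid out in the current WIP; my shorthand of the
description above, not the patch's actual declarations:)

/*
 * Non-leaf (downlink) tuple, roughly:
 *
 *   +----------------------+  additional IndexTupleData header (8 bytes),
 *   | t_tid = heap TID     |  prepended at page split time, carrying the
 *   | t_info               |  heap pointer copied from the leaf tuple
 *   +----------------------+  usual IndexTupleData header (8 bytes):
 *   | t_tid = downlink     |  t_tid points to the child page
 *   | t_info               |
 *   +----------------------+
 *   | key datums ...       |
 *   +----------------------+
 */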

So, I did some tests.

TL;DR version is, as expected, there is a big impact on narrow
indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to **End LONG version** if not interested)

Regardless of the fanout (or fan-in, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how the depth of
indexes changed after patching.

This is all on pristine, recently built indexes. I will try to measure
the same on bloated indexes later on, but those tests are painfully
slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needs schema-qualified relnames or something like that).
Anyway, user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)  -- bt_metap reports the root level
where relkind = 'i'               -- indexes only
  and relnamespace = 2200         -- OID of the 'public' namespace
  and relam = 403                 -- OID of the btree access method
group by level
order by level desc;
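
If it's just a qualification problem, passing a schema-qualified name
should cover toast indexes too. Something along these lines might work
(untested; the regclass cast yields a properly qualified name, and
bt_metap does accept qualified names):

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(oid::regclass::text)
where relkind = 'i' and relam = 403
group by level
order by level desc;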

I tested patched and unpatched versions on my test database (roughly
23GB), pgbench scale 1000 (15GB), and pgbench scale 10000 (146GB).

I believe pgbench is a worst-case example, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched

level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched

level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched

level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched

level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched

level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched

level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size for
needing a root split is slightly pushed downwards.

It totally agrees with the math. The index tuple header is 8 bytes,
and the datums need to be maxaligned so they will also be 8 bytes at
least. Prepending another such header thus grows the minimum inner
tuple by 50% (from 16 to 24 bytes), cutting the B in B-tree to 2/3 of
its former value at the worst, but only for inner tuples. So, if a is
the factor by which tree depth grows, you get

a * log_(B)(N/B) = log_((2/3)B)(N/B)

and since log_((2/3)B)(x) = log(x) / (log(B) - log(3/2)), for any N

a = log(B) / (log(B) - log(3/2))
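
To put a number on that, a quick back-of-envelope check; this assumes a
pre-patch inner fanout of roughly 8192/16 = 512 tuples per page,
ignoring line pointers and the page header:

-- depth growth factor a = log(B) / (log(B) - log(3/2)) for B = 512
select ln(512) / (ln(512) - ln(1.5)) as depth_growth_factor;

That returns about 1.07, i.e. the asymptotic depth grows by roughly 7%
in this narrow-index worst case.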

That factor being small, rather than making huge B-trees noticeably
deeper, the extra header mostly lowers the size threshold at which an
index needs one more level. The effect only begins to be seen at level
2, which is what we have here, and level 2 for such a narrow index
seems to be about 2GB. So we're talking about very big indexes
already. If level 2 can hold 2GB of index, level 3 can hold around
2GB x (8k / 16) = 1TB of index, and I have yet to see (even in very
big databases) a 1TB index on a PK.

If anyone has such an index, then yes, this patch will hurt their
lookup performance by 25% (5 page reads instead of 4, from the extra
level).

Bad, but since it only affects very extreme cases, it doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a
"has aux data" flag to signal the presence of a second set of flags,
from which variable-size attributes can be read. That would enable
implementing suffix truncation and prefix compression in the future
with minimal data format changes, and I was surprised to find that it
seems not only more compact, but cleaner. So I will probably go that
way.
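
To make that concrete, here is a rough sketch of the kind of layout I
mean. Names and values here are provisional illustrations, not the
actual patch; the only hard fact is that 0x2000 is the last unused bit
in t_info (0x1FFF is the size mask, 0x8000/0x4000 are the null/var
masks):

/*
 * Provisional sketch, not the actual patch.  When INDEX_HAS_AUX_DATA
 * is set in IndexTupleData.t_info, an extra flags word follows the
 * fixed 8-byte header and says which optional, variable-size pieces
 * are physically present after it.
 */
#define INDEX_HAS_AUX_DATA     0x2000  /* last free bit in t_info */

#define AUX_HAS_HEAP_TID       0x0001  /* tuple carries the leaf heap TID */
#define AUX_SUFFIX_TRUNCATED   0x0002  /* key suffix was truncated away */
#define AUX_PREFIX_COMPRESSED  0x0004  /* key shares a page-wide prefix */

typedef struct IndexTupleAuxData
{
    uint16      aux_flags;  /* which optional fields follow (uint16 as in c.h) */
    /* optional fields follow, in a fixed order implied by aux_flags */
} IndexTupleAuxData;

A plain leaf tuple pays nothing for this, and an inner tuple only pays
for the pieces it actually carries.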


