Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date
Msg-id CAH2-WzkyWGpsGffErjqJ866VTEPHX0PGo3n1N-Xr7ycvooNF7g@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1            215    88
> 1000        215    90
> 100000        215    71
> 10000000    214    214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1] https://https:/postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: "Kumar, Pawan (Nokia - IN/Bangalore)"
Date:
Subject: RE: duplicate key entries for primary key -- need urgent help
Next
From: Tomas Vondra
Date:
Subject: Re: Optimize partial TOAST decompression