Home > mailing lists

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From	Peter Geoghegan
Subject	Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date	July 4, 2019 20:38:21
Msg-id	CAH2-WzkyWGpsGffErjqJ866VTEPHX0PGo3n1N-Xr7ycvooNF7g@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses	Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
List	pgsql-hackers

Tree view

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1            215    88
> 1000        215    90
> 100000        215    71
> 10000000    214    214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1] https://https:/postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan

pgsql-hackers by date:

From: "Kumar, Pawan (Nokia - IN/Bangalore)"
Date: 04 July 2019, 20:34:21
Subject: RE: duplicate key entries for primary key -- need urgent help

From: Tomas Vondra
Date: 04 July 2019, 20:46:54
Subject: Re: Optimize partial TOAST decompression

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

Previous

Next