An inverted index using roaring bitmaps - Mailing list pgsql-hackers
From | Chinmay Kanchi |
---|---|
Subject | An inverted index using roaring bitmaps |
Date | |
Msg-id | CAJqPDh9o=oxRP=xZhbCzzAs4Vr5hNUkYLMspGee+=mgbqG3tAg@mail.gmail.com |
Responses | Re: An inverted index using roaring bitmaps |
List | pgsql-hackers |
I've been doing some preliminary prep work to see how an inverted index using roaring bitmaps (https://roaringbitmap.org/) would perform. I'm presenting some early work using SQL code with the roaring bitmap Postgres extension (https://github.com/ChenHuajun/pg_roaringbitmap) to simulate a hypothetical index using this approach.
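For anyone who wants to reproduce this: once the extension's shared library is built and installed on the server (the usual make && make install), enabling it is a one-liner. This is just the standard setup step, not part of the experiment itself:

```sql
-- Enable the roaringbitmap type and functions
-- (assumes pg_roaringbitmap is already installed on the server).
create extension roaringbitmap;
```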
I'd like to solicit feedback from the community to see if this is something worth pursuing or if there are potential issues that I'm not aware of (or if this approach has been considered in the past and discarded for whatever reason). For context, my experience with Postgres is primarily as a user and I'm not at all familiar with the code base, so please be gentle :).
That said, here's a quick and dirty demo:
I have a table "cities"
Table "public.cities"
Column | Type | Collation | Nullable | Default
---------+------+-----------+----------+---------
city | text | | |
country | text | | |
Indexes:
"cities_country_idx" btree (country)
select count(*) from cities;
count
--------
139739
(1 row)
And just some sample rows:
select * from cities order by random() limit 10;
city | country
------------------------+------------------------------
Alcalá de la Selva | Spain
Bekirhan | Turkey
Ceggia | Italy
Châtillon-en-Vendelais | France
Hohenfelde | Germany
Boedo | Argentina
Saint-Vith | Belgium
Gaggenau | Germany
Lake Ozark | United States
Igunga | Tanzania, United Republic of
(10 rows)
Since a bitmap requires you to convert your inputs into integers, I created a function as a hack to convert our TIDs to integers. It's ugly as hell, but it serves. 2048 is 2^11; according to the GIN index source code, 11 bits is a safe upper bound for MaxHeapTuplesPerPage, so a (block, offset) pair packs into a single integer without collisions.
create function ctid_to_int(ctid tid) returns integer as $$
select (ctid::text::point)[0] * 2048 + (ctid::text::point)[1];
$$
language sql returns null on null input;
And the reverse:
create function int_to_ctid(i integer) returns tid as $$
select point(i/2048, i%2048)::text::tid;
$$
language sql returns null on null input;
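As a sanity check that the two functions really are inverses, a round trip through the integer encoding should reproduce the original TID (illustrative query only, assuming both functions above are defined):

```sql
-- Every row's ctid should survive the integer round trip unchanged.
select ctid, int_to_ctid(ctid_to_int(ctid)) as round_trip
from cities
limit 5;
```

A mismatching row here would mean the 2^11 offset assumption doesn't hold for this table.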
In addition, I created a table "cities_rb" to roughly represent an "index" on the country column:
create table cities_rb as (select country, roaringbitmap(array_agg(ctid_to_int(ctid))::text) idx from cities group by country);
Table "public.cities_rb"
Column | Type | Collation | Nullable | Default
---------+---------------+-----------+----------+---------
country | text | | |
idx | roaringbitmap | | |
Now for the fun stuff - to simulate the "index" I will be running some queries against the cities_rb table using bitmap aggregations and comparing them to functionally the same queries using the BTree index on cities.
explain analyze select ctid from cities where country = 'Japan';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on cities (cost=18.77..971.83 rows=1351 width=6) (actual time=0.041..0.187 rows=1322 loops=1)
Recheck Cond: (country = 'Japan'::text)
Heap Blocks: exact=65
-> Bitmap Index Scan on cities_country_idx (cost=0.00..18.43 rows=1351 width=0) (actual time=0.031..0.031 rows=1322 loops=1)
Index Cond: (country = 'Japan'::text)
Planning Time: 0.055 ms
Execution Time: 0.233 ms
(7 rows)
explain analyze select rb_to_array(idx) from cities_rb where country = 'Japan';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on cities_rb (cost=0.00..14.88 rows=1 width=32) (actual time=0.050..0.067 rows=1 loops=1)
Filter: (country = 'Japan'::text)
Rows Removed by Filter: 229
Planning Time: 0.033 ms
Execution Time: 0.076 ms
(5 rows)
explain analyze select count(*) from cities where country = 'Japan';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=35.31..35.32 rows=1 width=8) (actual time=0.151..0.151 rows=1 loops=1)
-> Index Only Scan using cities_country_idx on cities (cost=0.29..31.94 rows=1351 width=0) (actual time=0.026..0.103 rows=1322 loops=1)
Index Cond: (country = 'Japan'::text)
Heap Fetches: 0
Planning Time: 0.056 ms
Execution Time: 0.180 ms
(6 rows)
explain analyze select rb_cardinality(idx) from cities_rb where country = 'Japan';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on cities_rb (cost=0.00..14.88 rows=1 width=8) (actual time=0.037..0.053 rows=1 loops=1)
Filter: (country = 'Japan'::text)
Rows Removed by Filter: 229
Planning Time: 0.037 ms
Execution Time: 0.063 ms
(5 rows)
explain analyze select country, count(*) from cities group by country;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2990.09..2992.22 rows=214 width=17) (actual time=34.054..34.076 rows=230 loops=1)
Group Key: country
Batches: 1 Memory Usage: 77kB
-> Seq Scan on cities (cost=0.00..2291.39 rows=139739 width=9) (actual time=0.005..8.552 rows=139739 loops=1)
Planning Time: 0.051 ms
Execution Time: 34.103 ms
(6 rows)
explain analyze select country, rb_cardinality(idx) from cities_rb;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on cities_rb (cost=0.00..14.88 rows=230 width=19) (actual time=0.008..0.184 rows=230 loops=1)
Planning Time: 0.030 ms
Execution Time: 0.200 ms
(3 rows)
The simulated index in this case is outrageously fast, up to ~150x on the GROUP BY.
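The part I find most promising, though I haven't benchmarked it yet, is multi-key queries: roaring bitmaps can be ANDed and ORed together very cheaply. A sketch of what that could look like, assuming the extension's rb_or_agg aggregate:

```sql
-- Rows matching country in ('Japan', 'Spain'): OR the per-country
-- bitmaps together and take the cardinality, with no heap access at all.
select rb_cardinality(rb_or_agg(idx))
from cities_rb
where country in ('Japan', 'Spain');
```

The same shape with rb_and_agg would cover conjunctions across bitmaps from several such "indexes", which is where inverted indexes usually shine.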
Cheers,
Chinmay