Thread: many to many performance

many to many performance

From
"Chavdar Kopoev"
Date:
Hello,

I have the following common situation:

Category IDs: about 50 000
Document IDs: about 3 000 000
Many-to-many relationship.
A document id has relations with anywhere from 10 up to 1000 category ids.
One query: given an input set of document ids, return the set of category ids related to them (very often this query
is called with more than 500k input document ids).

I currently use a custom-written data warehouse with file storage and in-memory "id offset" style indexes. A query over
all 3 million document ids, returning almost all category ids, takes less than a second on a desktop machine.
Abstracting from the concrete implementation, the query plan looks like this (a rough SQL equivalent follows the steps):
1. For each input document id: look up an array by document id and retrieve the list of related category ids.
1.1 For each category id in that list: look up an array value by category id and mark it as found.
2. Traverse the category array and extract the category ids marked as found.
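
In SQL terms, against the test tables I give in my follow-up below (doc_to_cat and query_ids), the same operation is
roughly a semi-join plus DISTINCT over the link table:

select distinct cat_id
from doc_to_cat
where doc_id in (select doc_id from query_ids);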

I want to use PostgreSQL as the data storage. I tried several data structures, testing btree, gin and gist indexes over
them, but the best performance I achieved, for a dataset 10 times smaller (10k cat ids, 100k doc ids, 1m relations), is
still more than 5 times slower.

I read about PostgreSQL bitmap indexes and about "data lookup" when scanning indexes to get a value for the current
transaction. I downloaded the latest v8.4 snapshot and compiled it, but as far as I can see there is no bitmap index in
it. Maybe if I download the HEAD revision I will find them there, I don't know.

Anyway, I want to ask: has anyone faced a similar situation, and is there any way to achieve closer-to-optimal
performance using PostgreSQL functionality and extensibility?

Regards,
Chavdar Kopoev


Re: many to many performance

From
"Chavdar Kopoev"
Date:
Hi Craig,

Thank you for your answer.

Here are my test table and index definitions:

-- document id to category id
create table doc_to_cat (
  doc_id integer not null,
  cat_id integer not null
) with (oids=false);

-- Total 1m rows. 500k unique document ids. 20k unique category ids. Each doc_id refers to exactly two cat_ids.
insert into doc_to_cat
select i/50*25 + i%50 as doc_id, i/50 as cat_id from generate_series(0, 1000000) i;
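
-- To double-check the generated distribution, an illustrative verification
-- query (not part of the benchmark itself):
select count(*) as relations,
       count(distinct doc_id) as docs,
       count(distinct cat_id) as cats
from doc_to_cat;
-- expected: roughly 1m relations, ~500k docs, ~20k cats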

create index doc_to_cat__doc_id on doc_to_cat using btree (doc_id);
create index doc_to_cat__cat_id on doc_to_cat using btree (cat_id);

-- document id to array of category ids
create table doc_to_cat_arr (
  doc_id integer not null,
  cat_arr integer[] not null
) with (oids=false);

insert into doc_to_cat_arr
select doc_id, int_array_aggregate(cat_id) as cat_arr
from doc_to_cat
group by doc_id;

create index doc_to_cat_arr__doc_id on doc_to_cat_arr using btree (doc_id);

-- category id to array of document ids
create table cat_to_doc_arr (
  cat_id integer not null,
  doc_arr integer[] not null
) with (oids=false);

insert into cat_to_doc_arr
select cat_id, int_array_aggregate(doc_id) as doc_arr
from doc_to_cat
group by cat_id;

create index cat_to_doc_arr__doc_arr__gin on cat_to_doc_arr using gin (doc_arr gin__int_ops);
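
-- Note: int_array_aggregate(), int_array_enum() and gin__int_ops above are not
-- built-ins; they come from the contrib/intarray module, which has to be loaded
-- into the test database first. On 8.3 that is something like the following
-- (the install path is an example only and differs per system):
\i /usr/local/pgsql/share/contrib/_int.sql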

-- Query Ids
create table query_ids (
  doc_id integer not null
) with (oids=false);

-- 200k test doc ids to query with
insert into query_ids
select generate_series(100000, 300000);

create index query_ids__doc_id on query_ids using btree (doc_id);

Now the query plans. We are looking for the cat_ids having relations with 200k doc ids:

explain analyze
select distinct cat_id from doc_to_cat join query_ids using (doc_id)

"Unique  (cost=19303.84..19602.03 rows=20544 width=4) (actual time=1006.745..1190.472 rows=8002 loops=1)"
"  ->  Sort  (cost=19303.84..19452.93 rows=372735 width=4) (actual time=1006.743..1094.908 rows=400002 loops=1)"
"        Sort Key: doc_to_cat.cat_id"
"        Sort Method:  quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2972.22..13785.04 rows=372735 width=4) (actual time=167.591..750.177 rows=400002
loops=1)"
"              Merge Cond: (query_ids.doc_id = doc_to_cat.doc_id)"
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual
time=0.021..81.291rows=200001 loops=1)" 
"              ->  Index Scan using doc_to_cat_doc_id on doc_to_cat  (cost=0.00..14543.09 rows=1000001 width=8) (actual
time=0.017..281.769rows=599978 loops=1)" 
"Total runtime: 1195.397 ms"

explain analyze
select distinct int_array_enum(cat_arr) from doc_to_cat_arr join query_ids using (doc_id)
"Unique  (cost=13676.57..13836.57 rows=19732 width=29) (actual time=1061.490..1246.595 rows=8002 loops=1)"
"  ->  Sort  (cost=13676.57..13756.57 rows=200001 width=29) (actual time=1061.488..1150.451 rows=400002 loops=1)"
"        Sort Key: (int_array_enum(doc_to_cat_arr.cat_arr))"
"        Sort Method:  quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2318.98..10859.01 rows=200001 width=29) (actual time=163.840..816.697 rows=400002
loops=1)"
"              Merge Cond: (doc_to_cat_arr.doc_id = query_ids.doc_id)"
"              ->  Index Scan using doc_to_cat_arr_doc_id on doc_to_cat_arr  (cost=0.00..11311.10 rows=500025 width=33)
(actualtime=0.022..359.673 rows=300002 loops=1)" 
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual
time=0.016..81.370rows=200001 loops=1)" 
"Total runtime: 1251.613 ms"

explain analyze
select cat_id from cat_to_doc_arr
where doc_arr && (select int_array_aggregate(q.doc_id) from (select doc_id from query_ids limit 20000) as q);

This query should never be run - it is too slow even with a limit of 20k input ids.

So ... the best result so far is more than 1 second (on a fast machine) for a test dataset 5 times smaller than the
real one, so I am far away from achieving good results.
I have to ask again: has anyone faced a similar situation, and is there any way to achieve closer-to-optimal
performance using PostgreSQL functionality and extensibility?

Chavdar Kopoev

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: 2008-11-26, 19:40:47
To: Chavdar Kopoev [mailto:chavdar@kopoev.com]
Subject: Re: [PERFORM] many to many performance

Chavdar Kopoev wrote:

> I want to use PostgreSQL as the data storage. I tried several data structures, testing btree, gin and gist indexes
> over them, but the best performance I achieved, for a dataset 10 times smaller (10k cat ids, 100k doc ids, 1m
> relations), is still more than 5 times slower.

Can you post your queries and table definitions so people trying to help
you know what you did / tried to do? A downloadable, self-contained
example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about PostgreSQL bitmap indexes and about "data lookup" when scanning indexes to get a value for the current
> transaction. I downloaded the latest v8.4 snapshot and compiled it, but as far as I can see there is no bitmap index
> in it. Maybe if I download the HEAD revision I will find them there, I don't know.

Bitmap index scans are an internal function that's used to combine two
indexes on the fly during a query (or at least use both of them in one
table scan). You don't make a bitmap index, you just make two ordinary
btree indexes and let Pg take care of this for you.
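
For example, a minimal sketch on a hypothetical two-column link table (the names are made up for illustration):

-- hypothetical link table with one ordinary btree index per column
create table link (a integer not null, b integer not null);
create index link_a on link (a);
create index link_b on link (b);

-- a query restricting both columns lets the planner build a bitmap from each
-- index and combine them (BitmapAnd) in a single scan of the table, if it
-- decides that is the cheapest plan
explain select * from link where a = 42 and b = 7;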

If you query on the columns of interest a lot, you might want to use a
multi-column index instead.
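
For the sketch above that would be something like:

-- one composite index instead of two single-column ones, if queries almost
-- always filter on both columns together
create index link_a_b on link (a, b);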

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance