Thread: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)

KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)

From
Stefan Keller
Date:
Hi,

I am conducting a benchmark to compare KVP table vs. hstore and got
bad hstore performance results when the no. of records is greater than
about 500'000.

CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
-- with index on key
CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
-- with GIST index on obj

Does anyone have experience with that?

Yours, Stefan

On 14/05/11 18:10, Stefan Keller wrote:
> Hi,
>
> I am conducting a benchmark to compare KVP table vs. hstore and got
> bad hstore performance results when the no. of records is greater than
> about 500'000.
>
> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
> -- with index on key
> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
> -- with GIST index on obj
>
> Does anyone have experience with that?

What are your queries?

What does EXPLAIN ANALYZE report on those queries?

Did you ANALYZE your tables after populating them?

--
Craig Ringer

On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi,
>
> I am conducting a benchmark to compare KVP table vs. hstore and got
> bad hstore performance results when the no. of records is greater than
> about 500'000.
>
> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
> -- with index on key
> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
> -- with GIST index on obj
>
> Does anyone have experience with that?

hstore is not really designed for large-ish sets like that.

merlin

On May 16, 2011, at 8:47 AM, Merlin Moncure wrote:
> On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote:
>> Hi,
>>
>> I am conducting a benchmark to compare KVP table vs. hstore and got
>> bad hstore performance results when the no. of records is greater than
>> about 500'000.
>>
>> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
>> -- with index on key
>> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
>> -- with GIST index on obj
>>
>> Does anyone have experience with that?
>
> hstore is not really designed for large-ish sets like that.

And KVP is? ;)

IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no real
indexinginside hstore, so basically the only thing it can do is scan the entire hstore. 

That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely*
inefficient.Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a
certainamount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or
hstorefor things that you really didn't anticipate. 

Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap, plus
thelength of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of the
key.The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per
value,and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized
strings,the overhead is going to be many times larger than the actual data! 
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Hi Jim

You actually made me think about the schema Michel and I are using:

> And KVP is? ;)

CREATE TABLE mykvpstore( id bigint PRIMARY KEY )
CREATE TABLE kvp ( id bigint REFERENCES mykvpstore(id), key text NOT
NULL, value text, );
-- with index on key

And the table with the associative array type (hstore) is:
CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
-- with GIST index on obj

It seems to me that in the mykvpstore-kvp there is also some overhead.

And yes, we have no clue what keys to anticipate, except for some
common ones like 'name': The use case is coming from OpenStreetMap
(http://wiki.openstreetmap.org/wiki/Database_schema ).

Yours, Stefan


2011/5/17 Jim Nasby <jim@nasby.net>:
> On May 16, 2011, at 8:47 AM, Merlin Moncure wrote:
>> On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote:
>>> Hi,
>>>
>>> I am conducting a benchmark to compare KVP table vs. hstore and got
>>> bad hstore performance results when the no. of records is greater than
>>> about 500'000.
>>>
>>> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
>>> -- with index on key
>>> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
>>> -- with GIST index on obj
>>>
>>> Does anyone have experience with that?
>>
>> hstore is not really designed for large-ish sets like that.
>
> And KVP is? ;)
>
> IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no
realindexing inside hstore, so basically the only thing it can do is scan the entire hstore. 
>
> That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely*
inefficient.Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a
certainamount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or
hstorefor things that you really didn't anticipate. 
>
> Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap,
plusthe length of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of
thekey. The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per
value,and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized
strings,the overhead is going to be many times larger than the actual data! 
> --
> Jim C. Nasby, Database Architect                   jim@nasby.net
> 512.569.9461 (cell)                         http://jim.nasby.net
>
>
>

Hi Merlin

The analyze command gave the following result:

On the KVP table:
Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180) (actual time=0.037..0.038 rows=1 loops=1)
Index Cond: (bench_id = '200000_200000'::text)
Total runtime: 0.057 ms

And on the Hstore table:
Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265) (actual time=145.040..256.173 rows=1
loops=1)
Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
-> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual time=114.748..114.748 rows=30605 loops=1)
Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
Total runtime: 256.211 ms

For Hstore I'm using a GIST index.

Table analysis returned no message.

Michel


Von: Merlin Moncure mmoncure@gmail.com
An: Stefan Keller <sfkeller@gmail.com>
cc: pgsql-performance@postgresql.org
Datum: 16. Mai 2011 15:47
Betreff: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was:
Postgres NoSQL emulation)

Merlin Moncure

hstore is not really designed for large-ish sets like that.

merlin

2011/5/16 Stefan Keller <sfkeller@gmail.com>:
> Hoi Michel
>
> Hast du die EXPLAIN ANALYZE information?
>
> LG, Stefan
>
>
> ---------- Forwarded message ----------
> From: Craig Ringer <craig@postnewspapers.com.au>
> Date: 2011/5/16
> Subject: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was:
> Postgres NoSQL emulation)
> To: Stefan Keller <sfkeller@gmail.com>
> Cc: pgsql-performance@postgresql.org
>
>
> On 14/05/11 18:10, Stefan Keller wrote:
>> Hi,
>>
>> I am conducting a benchmark to compare KVP table vs. hstore and got
>> bad hstore performance results when the no. of records is greater than
>> about 500'000.
>>
>> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
>> -- with index on key
>> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
>> -- with GIST index on obj
>>
>> Does anyone have experience with that?
>
> What are your queries?
>
> What does EXPLAIN ANALYZE report on those queries?
>
> Did you ANALYZE your tables after populating them?
>
> --
> Craig Ringer
>

> Hi Merlin
>
> The analyze command gave the following result:
>
> On the KVP table:
> Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180)
> (actual time=0.037..0.038 rows=1 loops=1)
> Index Cond: (bench_id = '200000_200000'::text)
> Total runtime: 0.057 ms
>
> And on the Hstore table:
> Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000
> width=265) (actual time=145.040..256.173 rows=1 loops=1)
> Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
> -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0)
> (actual time=114.748..114.748 rows=30605 loops=1)
> Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
> Total runtime: 256.211 ms
>
> For Hstore I'm using a GIST index.
>

Try to create a btree index on "(bench_hstore->bench_id) WHERE
(bench_hstore->bench_id) IS NOT NULL".


On Tue, May 17, 2011 at 11:10 AM,  <m1ott@hsr.ch> wrote:
> For Hstore I'm using a GIST index.

I would have thought that GIN would be a better choice for this workload.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Salut Pierre

You wrote
> Try to create a btree index on "(bench_hstore->bench_id) WHERE
> (bench_hstore->bench_id) IS NOT NULL".

What  do you mean exactly?
=> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
??? IS NOT NULL;

My table's def is:
> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
So I'm doing something like:
CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);

Stefan


2011/5/23 Pierre C <lists@peufeu.com>:
>
>> Hi Merlin
>>
>> The analyze command gave the following result:
>>
>> On the KVP table:
>> Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180)
>> (actual time=0.037..0.038 rows=1 loops=1)
>> Index Cond: (bench_id = '200000_200000'::text)
>> Total runtime: 0.057 ms
>>
>> And on the Hstore table:
>> Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265)
>> (actual time=145.040..256.173 rows=1 loops=1)
>> Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
>> -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual
>> time=114.748..114.748 rows=30605 loops=1)
>> Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
>> Total runtime: 256.211 ms
>>
>> For Hstore I'm using a GIST index.
>>
>
> Try to create a btree index on "(bench_hstore->bench_id) WHERE
> (bench_hstore->bench_id) IS NOT NULL".
>
>

> You wrote
>> Try to create a btree index on "(bench_hstore->bench_id) WHERE
>> (bench_hstore->bench_id) IS NOT NULL".
>
> What  do you mean exactly?
> => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
> ??? IS NOT NULL;
>
> My table's def is:
>> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
> So I'm doing something like:
> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);

Hello ;

I meant a plain old btree index like this :

CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
(kvps->'yourkeyname') IS NOT NULL;

The idea is that :

- The reason to use hstore is to have an arbitrary number of keys and use
the keys you want, not have a fixed set of columns like in a table
- Therefore, no hstore key is present in all rows (if it was, you'd make
it a table column, and maybe index it)
- You'll probably only want to index some of the keys/values (avoiding to
index values that contain serialized data or other stuff that never
appears in a WHERE clause)

So, for each key that corresponds to a searchable attribute, I'd use a
conditional index on that key, which only indexes the relevant rows. For
keys that never appear in a WHERE, no index is needed.

gist is good if you want the intersecton of a hstore with another one (for
instance), btree is good if you want simple search or range search.

On Wed, May 25, 2011 at 11:59 AM, Pierre C <lists@peufeu.com> wrote:
>> You wrote
>>>
>>> Try to create a btree index on "(bench_hstore->bench_id) WHERE
>>> (bench_hstore->bench_id) IS NOT NULL".
>>
>> What  do you mean exactly?
>> => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
>> ??? IS NOT NULL;
>>
>> My table's def is:
>>>
>>> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
>>
>> So I'm doing something like:
>> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);
>
> Hello ;
>
> I meant a plain old btree index like this :
>
> CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
> (kvps->'yourkeyname') IS NOT NULL;
>
> The idea is that :
>
> - The reason to use hstore is to have an arbitrary number of keys and use
> the keys you want, not have a fixed set of columns like in a table
> - Therefore, no hstore key is present in all rows (if it was, you'd make
> it a table column, and maybe index it)
> - You'll probably only want to index some of the keys/values (avoiding to
> index values that contain serialized data or other stuff that never
> appears in a WHERE clause)
>
> So, for each key that corresponds to a searchable attribute, I'd use a
> conditional index on that key, which only indexes the relevant rows. For
> keys that never appear in a WHERE, no index is needed.
>
> gist is good if you want the intersecton of a hstore with another one (for
> instance), btree is good if you want simple search or range search.

+1 on this approach.  it works really well (unless, of course, you
need 50 indexes...)

merlin

Hi all

Thank you to all who answered: That worked:

CREATE INDEX planet_osm_point_tags_amenity
ON planet_osm_point ((tags->'amenity'))
WHERE (tags->'amenity') IS NOT NULL;

My problem is, that in fact I don't know which tag to index since I'm
running a web admin application where users can enter arbitrary
queries.

Yours, Stefan

2011/5/25 Pierre C <lists@peufeu.com>:
>> You wrote
>>>
>>> Try to create a btree index on "(bench_hstore->bench_id) WHERE
>>> (bench_hstore->bench_id) IS NOT NULL".
>>
>> What  do you mean exactly?
>> => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
>> ??? IS NOT NULL;
>>
>> My table's def is:
>>>
>>> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
>>
>> So I'm doing something like:
>> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);
>
> Hello ;
>
> I meant a plain old btree index like this :
>
> CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
> (kvps->'yourkeyname') IS NOT NULL;
>
> The idea is that :
>
> - The reason to use hstore is to have an arbitrary number of keys and use
> the keys you want, not have a fixed set of columns like in a table
> - Therefore, no hstore key is present in all rows (if it was, you'd make
> it a table column, and maybe index it)
> - You'll probably only want to index some of the keys/values (avoiding to
> index values that contain serialized data or other stuff that never
> appears in a WHERE clause)
>
> So, for each key that corresponds to a searchable attribute, I'd use a
> conditional index on that key, which only indexes the relevant rows. For
> keys that never appear in a WHERE, no index is needed.
>
> gist is good if you want the intersecton of a hstore with another one (for
> instance), btree is good if you want simple search or range search.
>

> My problem is, that in fact I don't know which tag to index since I'm
> running a web admin application where users can enter arbitrary
> queries.

For a tag cloud, try this :

- table tags ( tag_id, tag_name )
- table articles ( article_id )
- table articles_to_tags( article_id, tag_id )

now this is the classical approach, which doesn't work so well when you
want to get an article that has several tags (tag intersection).

So, materialize the list of tag_ids for each article in an INTEGER[] array
in the articles table, kept up to date with triggers.

Create a gist index on that, and use indexed array vs array operators.