Thread: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
Hi, I am conducting a benchmark to compare KVP table vs. hstore and got bad hstore performance results when the no. of records is greater than about 500'000. CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); -- with index on key CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); -- with GIST index on obj Does anyone have experience with that? Yours, Stefan
Re: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Craig Ringer
Date:
On 14/05/11 18:10, Stefan Keller wrote: > Hi, > > I am conducting a benchmark to compare KVP table vs. hstore and got > bad hstore performance results when the no. of records is greater than > about 500'000. > > CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); > -- with index on key > CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); > -- with GIST index on obj > > Does anyone have experience with that? What are your queries? What does EXPLAIN ANALYZE report on those queries? Did you ANALYZE your tables after populating them? -- Craig Ringer
Re: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Merlin Moncure
Date:
On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote: > Hi, > > I am conducting a benchmark to compare KVP table vs. hstore and got > bad hstore performance results when the no. of records is greater than > about 500'000. > > CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); > -- with index on key > CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); > -- with GIST index on obj > > Does anyone have experience with that? hstore is not really designed for large-ish sets like that. merlin
On May 16, 2011, at 8:47 AM, Merlin Moncure wrote: > On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote: >> Hi, >> >> I am conducting a benchmark to compare KVP table vs. hstore and got >> bad hstore performance results when the no. of records is greater than >> about 500'000. >> >> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); >> -- with index on key >> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); >> -- with GIST index on obj >> >> Does anyone have experience with that? > > hstore is not really designed for large-ish sets like that. And KVP is? ;) IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no real indexinginside hstore, so basically the only thing it can do is scan the entire hstore. That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely* inefficient.Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a certainamount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or hstorefor things that you really didn't anticipate. Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap, plus thelength of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of the key.The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per value,and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized strings,the overhead is going to be many times larger than the actual data! -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Stefan Keller
Date:
Hi Jim You actually made me think about the schema Michel and I are using: > And KVP is? ;) CREATE TABLE mykvpstore( id bigint PRIMARY KEY ) CREATE TABLE kvp ( id bigint REFERENCES mykvpstore(id), key text NOT NULL, value text, ); -- with index on key And the table with the associative array type (hstore) is: CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL ); -- with GIST index on obj It seems to me that in the mykvpstore-kvp there is also some overhead. And yes, we have no clue what keys to anticipate, except for some common ones like 'name': The use case is coming from OpenStreetMap (http://wiki.openstreetmap.org/wiki/Database_schema ). Yours, Stefan 2011/5/17 Jim Nasby <jim@nasby.net>: > On May 16, 2011, at 8:47 AM, Merlin Moncure wrote: >> On Sat, May 14, 2011 at 5:10 AM, Stefan Keller <sfkeller@gmail.com> wrote: >>> Hi, >>> >>> I am conducting a benchmark to compare KVP table vs. hstore and got >>> bad hstore performance results when the no. of records is greater than >>> about 500'000. >>> >>> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); >>> -- with index on key >>> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); >>> -- with GIST index on obj >>> >>> Does anyone have experience with that? >> >> hstore is not really designed for large-ish sets like that. > > And KVP is? ;) > > IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no realindexing inside hstore, so basically the only thing it can do is scan the entire hstore. > > That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely* inefficient.Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a certainamount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or hstorefor things that you really didn't anticipate. > > Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap, plusthe length of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of thekey. The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per value,and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized strings,the overhead is going to be many times larger than the actual data! > -- > Jim C. Nasby, Database Architect jim@nasby.net > 512.569.9461 (cell) http://jim.nasby.net > > >
Hi Merlin The analyze command gave the following result: On the KVP table: Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180) (actual time=0.037..0.038 rows=1 loops=1) Index Cond: (bench_id = '200000_200000'::text) Total runtime: 0.057 ms And on the Hstore table: Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265) (actual time=145.040..256.173 rows=1 loops=1) Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual time=114.748..114.748 rows=30605 loops=1) Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) Total runtime: 256.211 ms For Hstore I'm using a GIST index. Table analysis returned no message. Michel Von: Merlin Moncure mmoncure@gmail.com An: Stefan Keller <sfkeller@gmail.com> cc: pgsql-performance@postgresql.org Datum: 16. Mai 2011 15:47 Betreff: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation) Merlin Moncure hstore is not really designed for large-ish sets like that. merlin 2011/5/16 Stefan Keller <sfkeller@gmail.com>: > Hoi Michel > > Hast du die EXPLAIN ANALYZE information? > > LG, Stefan > > > ---------- Forwarded message ---------- > From: Craig Ringer <craig@postnewspapers.com.au> > Date: 2011/5/16 > Subject: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was: > Postgres NoSQL emulation) > To: Stefan Keller <sfkeller@gmail.com> > Cc: pgsql-performance@postgresql.org > > > On 14/05/11 18:10, Stefan Keller wrote: >> Hi, >> >> I am conducting a benchmark to compare KVP table vs. hstore and got >> bad hstore performance results when the no. of records is greater than >> about 500'000. >> >> CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text ); >> -- with index on key >> CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL ); >> -- with GIST index on obj >> >> Does anyone have experience with that? > > What are your queries? > > What does EXPLAIN ANALYZE report on those queries? > > Did you ANALYZE your tables after populating them? > > -- > Craig Ringer >
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
"Pierre C"
Date:
> Hi Merlin > > The analyze command gave the following result: > > On the KVP table: > Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180) > (actual time=0.037..0.038 rows=1 loops=1) > Index Cond: (bench_id = '200000_200000'::text) > Total runtime: 0.057 ms > > And on the Hstore table: > Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 > width=265) (actual time=145.040..256.173 rows=1 loops=1) > Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) > -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) > (actual time=114.748..114.748 rows=30605 loops=1) > Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) > Total runtime: 256.211 ms > > For Hstore I'm using a GIST index. > Try to create a btree index on "(bench_hstore->bench_id) WHERE (bench_hstore->bench_id) IS NOT NULL".
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Robert Haas
Date:
On Tue, May 17, 2011 at 11:10 AM, <m1ott@hsr.ch> wrote: > For Hstore I'm using a GIST index. I would have thought that GIN would be a better choice for this workload. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Stefan Keller
Date:
Salut Pierre You wrote > Try to create a btree index on "(bench_hstore->bench_id) WHERE > (bench_hstore->bench_id) IS NOT NULL". What do you mean exactly? => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE ??? IS NOT NULL; My table's def is: > CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL ); So I'm doing something like: CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps); Stefan 2011/5/23 Pierre C <lists@peufeu.com>: > >> Hi Merlin >> >> The analyze command gave the following result: >> >> On the KVP table: >> Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180) >> (actual time=0.037..0.038 rows=1 loops=1) >> Index Cond: (bench_id = '200000_200000'::text) >> Total runtime: 0.057 ms >> >> And on the Hstore table: >> Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265) >> (actual time=145.040..256.173 rows=1 loops=1) >> Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) >> -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual >> time=114.748..114.748 rows=30605 loops=1) >> Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore) >> Total runtime: 256.211 ms >> >> For Hstore I'm using a GIST index. >> > > Try to create a btree index on "(bench_hstore->bench_id) WHERE > (bench_hstore->bench_id) IS NOT NULL". > >
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
"Pierre C"
Date:
> You wrote >> Try to create a btree index on "(bench_hstore->bench_id) WHERE >> (bench_hstore->bench_id) IS NOT NULL". > > What do you mean exactly? > => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE > ??? IS NOT NULL; > > My table's def is: >> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL ); > So I'm doing something like: > CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps); Hello ; I meant a plain old btree index like this : CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE (kvps->'yourkeyname') IS NOT NULL; The idea is that : - The reason to use hstore is to have an arbitrary number of keys and use the keys you want, not have a fixed set of columns like in a table - Therefore, no hstore key is present in all rows (if it was, you'd make it a table column, and maybe index it) - You'll probably only want to index some of the keys/values (avoiding to index values that contain serialized data or other stuff that never appears in a WHERE clause) So, for each key that corresponds to a searchable attribute, I'd use a conditional index on that key, which only indexes the relevant rows. For keys that never appear in a WHERE, no index is needed. gist is good if you want the intersecton of a hstore with another one (for instance), btree is good if you want simple search or range search.
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Merlin Moncure
Date:
On Wed, May 25, 2011 at 11:59 AM, Pierre C <lists@peufeu.com> wrote: >> You wrote >>> >>> Try to create a btree index on "(bench_hstore->bench_id) WHERE >>> (bench_hstore->bench_id) IS NOT NULL". >> >> What do you mean exactly? >> => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE >> ??? IS NOT NULL; >> >> My table's def is: >>> >>> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL ); >> >> So I'm doing something like: >> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps); > > Hello ; > > I meant a plain old btree index like this : > > CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE > (kvps->'yourkeyname') IS NOT NULL; > > The idea is that : > > - The reason to use hstore is to have an arbitrary number of keys and use > the keys you want, not have a fixed set of columns like in a table > - Therefore, no hstore key is present in all rows (if it was, you'd make > it a table column, and maybe index it) > - You'll probably only want to index some of the keys/values (avoiding to > index values that contain serialized data or other stuff that never > appears in a WHERE clause) > > So, for each key that corresponds to a searchable attribute, I'd use a > conditional index on that key, which only indexes the relevant rows. For > keys that never appear in a WHERE, no index is needed. > > gist is good if you want the intersecton of a hstore with another one (for > instance), btree is good if you want simple search or range search. +1 on this approach. it works really well (unless, of course, you need 50 indexes...) merlin
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
Stefan Keller
Date:
Hi all Thank you to all who answered: That worked: CREATE INDEX planet_osm_point_tags_amenity ON planet_osm_point ((tags->'amenity')) WHERE (tags->'amenity') IS NOT NULL; My problem is, that in fact I don't know which tag to index since I'm running a web admin application where users can enter arbitrary queries. Yours, Stefan 2011/5/25 Pierre C <lists@peufeu.com>: >> You wrote >>> >>> Try to create a btree index on "(bench_hstore->bench_id) WHERE >>> (bench_hstore->bench_id) IS NOT NULL". >> >> What do you mean exactly? >> => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE >> ??? IS NOT NULL; >> >> My table's def is: >>> >>> CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL ); >> >> So I'm doing something like: >> CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps); > > Hello ; > > I meant a plain old btree index like this : > > CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE > (kvps->'yourkeyname') IS NOT NULL; > > The idea is that : > > - The reason to use hstore is to have an arbitrary number of keys and use > the keys you want, not have a fixed set of columns like in a table > - Therefore, no hstore key is present in all rows (if it was, you'd make > it a table column, and maybe index it) > - You'll probably only want to index some of the keys/values (avoiding to > index values that contain serialized data or other stuff that never > appears in a WHERE clause) > > So, for each key that corresponds to a searchable attribute, I'd use a > conditional index on that key, which only indexes the relevant rows. For > keys that never appear in a WHERE, no index is needed. > > gist is good if you want the intersecton of a hstore with another one (for > instance), btree is good if you want simple search or range search. >
Re: FW: KVP table vs. hstore - hstore performance (Was: Postgres NoSQL emulation)
From
"Pierre C"
Date:
> My problem is, that in fact I don't know which tag to index since I'm > running a web admin application where users can enter arbitrary > queries. For a tag cloud, try this : - table tags ( tag_id, tag_name ) - table articles ( article_id ) - table articles_to_tags( article_id, tag_id ) now this is the classical approach, which doesn't work so well when you want to get an article that has several tags (tag intersection). So, materialize the list of tag_ids for each article in an INTEGER[] array in the articles table, kept up to date with triggers. Create a gist index on that, and use indexed array vs array operators.