Thread: Using a GIN index on an integer array to model sets of tags

Using a GIN index on an integer array to model sets of tags

From

Mike Jarmy

Date:

17 November 2012, 16:01:46

I am researching how to set up a schema for querying a set of tags
associated with an object.  There are approximately 100 distinct tags
in my application (these will be pre-populated), and I am expecting a
fairly low number of distinct sets of these tags  -- in other words, a
given set of tags will most likely be reused by many objects.  There
will be on the order of thousands to hundreds-of-thousands of records
in the 'objects' table.

I could use the standard many-to-many approach for this, that I've
used many times in the past:

CREATE TABLE TAGS (
    ID SMALLINT PRIMARY KEY,
    NAME VARCHAR);

CREATE TABLE OBJECTS (
    ID SERIAL PRIMARY KEY,
    FOO VARCHAR);

CREATE TABLE OBJECT_TAGS (
    OBJECT_ID INT,
    TAG_ID SMALLINT,
    PRIMARY KEY (OBJECT_ID, TAG_ID));

However, I've run across a couple of articles suggesting that I could
use an array field to store the tags on the object, and use a GIN
index to do set comparisons, e.g:
http://sjsnyder.com/using-postgresql-arrays-with-gin-indexes-for.
Though note that in that article they are storing each tag in the
'tags' field as text, whereas I think it would make more sense to
store the id of each tag in the 'tags' field, and cache the mapping
from tag names to tag ids in my application:

CREATE TABLE TAGS (
    ID SMALLINT PRIMARY KEY,
    NAME VARCHAR);

CREATE TABLE OBJECTS (
    ID SERIAL PRIMARY KEY,
    TAGS SMALLINT[],
    FOO VARCHAR);

CREATE INDEX OBJECTS_TAGS_INDEX ON OBJECTS USING GIN (TAGS);

I like this array-plus-gin-index approach because the object id is not
repeated over and over again as it is in the many-to-many link table.
It seems like it would be much more efficient.  Another advantage is
that the queries are so much more legible -- it just looks a lot
cleaner to me.

However, the PostgreSQL manual's chapter on arrays
(http://www.postgresql.org/docs/9.2/static/arrays.html) does not
mention GIN indexes.  Instead, it has a tip saying "Arrays are not
sets; searching for specific array elements can be a sign of database
misdesign." Of course the manual is correct if the array field is
unindexed, but I find the fact that the GIN-indexing approach is not
mentioned to be puzzling.

So, my question is: Is it in fact a good idea to use the
integer-array-plus-GIN-index approach as a way to store sets in-line?
Does anyone actually do this in production? Or is the old-school
many-to-many relationship still the way to go?  Of course I'm going to
try both approaches out myself and see how it works, but I'm
interested in the communities opinions on this subject (particularly
since I'm new to PostgreSQL, having mostly used commercial databases
in the past).

Thanks, Mike Jarmy

Re: Using a GIN index on an integer array to model sets of tags

From

"T. E. Lawrence"

Date:

17 November 2012, 16:39:31

> I am researching how to set up a schema for querying a set of tags
> associated with an object.

I would be interested in hearing your conclusions.

I am currently researching in a similar direction.

We have streaming replication where the slaves are used for data mining, storing currently about 100m objects and
expectedto grow in direction of 1-2b or more. 

While part of the data is stored in a classic RDBM model, there are "soft" properties, for which we are looking for
"soft"and in the same time, if possible, searchable/indexable storage. 

So far I have mostly looked into hstore and JSON, but arrays are an interesting option, as well.

The only conclusion which we have reached so far is that we should divorce the core storage from search optimization
andfrom versioning. 

T.E.L.

Re: Using a GIN index on an integer array to model sets of tags

From

Ryan Kelly

Date:

17 November 2012, 21:15:12

On Sat, Nov 17, 2012 at 11:01:41AM -0500, Mike Jarmy wrote:
> I am researching how to set up a schema for querying a set of tags
> associated with an object.  There are approximately 100 distinct tags
> in my application (these will be pre-populated), and I am expecting a
> fairly low number of distinct sets of these tags  -- in other words, a
> given set of tags will most likely be reused by many objects.  There
> will be on the order of thousands to hundreds-of-thousands of records
> in the 'objects' table.
>
> I could use the standard many-to-many approach for this, that I've
> used many times in the past:
>
> CREATE TABLE TAGS (
>     ID SMALLINT PRIMARY KEY,
>     NAME VARCHAR);
>
> CREATE TABLE OBJECTS (
>     ID SERIAL PRIMARY KEY,
>     FOO VARCHAR);
>
> CREATE TABLE OBJECT_TAGS (
>     OBJECT_ID INT,
>     TAG_ID SMALLINT,
>     PRIMARY KEY (OBJECT_ID, TAG_ID));
>
> However, I've run across a couple of articles suggesting that I could
> use an array field to store the tags on the object, and use a GIN
> index to do set comparisons, e.g:
> http://sjsnyder.com/using-postgresql-arrays-with-gin-indexes-for.
> Though note that in that article they are storing each tag in the
> 'tags' field as text, whereas I think it would make more sense to
> store the id of each tag in the 'tags' field, and cache the mapping
> from tag names to tag ids in my application:
We actually do exactly what the link above suggests and store the text
directly, but we also have thousands of distinct tags. On the other
hand, with a large number of objects, you (we?) could save a tremendous
amount of space by "decoding" the tags either application-side or using
a stored procedure, which is something I've been investigating recently.

> CREATE TABLE TAGS (
>     ID SMALLINT PRIMARY KEY,
>     NAME VARCHAR);
>
> CREATE TABLE OBJECTS (
>     ID SERIAL PRIMARY KEY,
>     TAGS SMALLINT[],
>     FOO VARCHAR);
>
> CREATE INDEX OBJECTS_TAGS_INDEX ON OBJECTS USING GIN (TAGS);
>
> I like this array-plus-gin-index approach because the object id is not
> repeated over and over again as it is in the many-to-many link table.
> It seems like it would be much more efficient.  Another advantage is
> that the queries are so much more legible -- it just looks a lot
> cleaner to me.
>
> However, the PostgreSQL manual's chapter on arrays
> (http://www.postgresql.org/docs/9.2/static/arrays.html) does not
> mention GIN indexes.  Instead, it has a tip saying "Arrays are not
> sets; searching for specific array elements can be a sign of database
> misdesign." Of course the manual is correct if the array field is
> unindexed, but I find the fact that the GIN-indexing approach is not
> mentioned to be puzzling.
Arrays are not sets, but we use them just like sets. It makes me wish
SQL standard sets were implemented in postgres.

> So, my question is: Is it in fact a good idea to use the
> integer-array-plus-GIN-index approach as a way to store sets in-line?
> Does anyone actually do this in production? Or is the old-school
> many-to-many relationship still the way to go?  Of course I'm going to
> try both approaches out myself and see how it works, but I'm
> interested in the communities opinions on this subject (particularly
> since I'm new to PostgreSQL, having mostly used commercial databases
> in the past).
One-to-many and many-to-many our still our go-to options, but if the
number of items participating in the join is going to be large, arrays
are far superior.  Especially when then join does no filtering. A common
operation for us is to download a very large number - millions - of
items filtered by a certain set of tags. The original design used
one-to-many to store the association between items and tags. When it was
replaced with an approach using arrays with gin indexes, the speed up
for finding large numbers of items was 20x, and finding smaller subsets
was 60x. This was about 18 months ago on postgres 8.4, so I may not be
100% correct here. But that was the ballpark.

> Thanks, Mike Jarmy

-Ryan Kelly