Thread: Optmal tags design?

Optmal tags design?

From

lists@on-track.ca

Date:

18 July 2007, 18:30:49

I am planning to add a tags (as in the "web 2.0" thing) feature to my web
based application. I would like some feedback from the experts here on
what the best database design for that would be.

The possibilities I have come up with are:
* A tags table containing the tag and id number of what it links to.
select pid from tags where tag='bla'
select tag from tags where pid=xxx.

* a tags table where each tag exists only once, and a table with the tag
ID and picture ID to link them together.

select pid from tags inner join picture_tags using(tag_id) where tag='bla'
select tag from tags inner join picture_tags using(tag_id) where pid='xxx'

* A full text index in the picture table containing the tags

select pid from pictures where tags @@ to_tsquery('bla')
(or the non-fti version)
select pid from pictures where tags ~* '.*bla.*'

select tags from pictures where pid=xxx;

Re: Optmal tags design?

From

Mark Lewis

Date:

18 July 2007, 18:52:13

On Wed, 2007-07-18 at 14:26 -0700, lists@on-track.ca wrote:
> I am planning to add a tags (as in the "web 2.0" thing) feature to my web
> based application. I would like some feedback from the experts here on
> what the best database design for that would be.
>
> The possibilities I have come up with are:
> * A tags table containing the tag and id number of what it links to.
> select pid from tags where tag='bla'
> select tag from tags where pid=xxx.

Properly indexed, this schema can handle common lookups such as 'show me
all pictures with tag X'.

The problem here is that any operation involving all tags (for example,
'show me a list of all tags in the database') may be slow and/or
awkward.

> * a tags table where each tag exists only once, and a table with the tag
> ID and picture ID to link them together.

This sounds the most reasonable, and is the "right way" to do it in the
relational model.  Can handle common queries such as 'show me all
pictures with tag X'.  Can also easily perform queries such as 'show me
a list of all tags in the database'.

This also gives you a logical place to store additional information for
each tag, such as the user and timestamp of the first usage of the tag,
or a cache of the approximate number of pictures with that tag (for a
future performance optimization, maybe), or whatever else you can think
up that might be useful to store on a per-tag level.

> select pid from tags inner join picture_tags using(tag_id) where tag='bla'
> select tag from tags inner join picture_tags using(tag_id) where pid='xxx'
>
> * A full text index in the picture table containing the tags
>
> select pid from pictures where tags @@ to_tsquery('bla')
> (or the non-fti version)
> select pid from pictures where tags ~* '.*bla.*'
>
> select tags from pictures where pid=xxx;

I'm not experienced with full text indexing so perhaps I'm wrong about
this, but it seems like it would give you approximately the same
flexibility as #1 in terms of your data model.  The only reason I can
think of why you might want this over #1 would be for a performance
improvement, but if there's a reasonably small number of distinct tags
and/or distinct tags per picture I can't imagine it being much faster
than #1.

-- Mark Lewis

Re: Optmal tags design?

From

"Bryan Murphy"

Date:

18 July 2007, 18:54:13

We store tags on our items like this like this:

Tag.ID INT NOT NULL PRIMARY KEY
Tag.Value TEXT LCASE NOT NULL UNIQUE

Item.ID INT NOT NULL PRIMARY KEY

ItemTagBinding.ItemID INT NOT NULL REFERENCES Item.ID
ItemTagBinding.TagID INT NOT NULL REFERENCES Tag.ID
ItemTagBinding.ItemID + ItemTagBinding.TagID UNIQUE

with appropriate indexes on the columns we need to frequently query.

We have about 3 million tag bindings right now, and have not run into any performance issues related to tagging other than generating tag clouds (which we pre-calculate anyway).

I'll have to get back to you when we get up to 10's, or even 100's of millions and let you know how it scaled.

Bryan

On 7/18/07, lists@on-track.ca <lists@on-track.ca> wrote:

I am planning to add a tags (as in the "web 2.0" thing) feature to my web
based application. I would like some feedback from the experts here on
what the best database design for that would be.

The possibilities I have come up with are:
* A tags table containing the tag and id number of what it links to.
select pid from tags where tag='bla'
select tag from tags where pid=xxx.

* a tags table where each tag exists only once, and a table with the tag
ID and picture ID to link them together.

select pid from tags inner join picture_tags using(tag_id) where tag='bla'
select tag from tags inner join picture_tags using(tag_id) where pid='xxx'

* A full text index in the picture table containing the tags

select pid from pictures where tags @@ to_tsquery('bla')
(or the non-fti version)
select pid from pictures where tags ~* '.*bla.*'

select tags from pictures where pid=xxx;

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match