On Wed, 2007-07-18 at 14:26 -0700, lists@on-track.ca wrote:
> I am planning to add a tags (as in the "web 2.0" thing) feature to my web
> based application. I would like some feedback from the experts here on
> what the best database design for that would be.
>
> The possibilities I have come up with are:
> * A tags table containing the tag and id number of what it links to.
> select pid from tags where tag='bla'
> select tag from tags where pid=xxx.
Properly indexed, this schema can handle common lookups such as 'show me
all pictures with tag X'.
The problem here is that any operation involving all tags (for example,
'show me a list of all tags in the database') may be slow and/or
awkward.
> * a tags table where each tag exists only once, and a table with the tag
> ID and picture ID to link them together.
This sounds the most reasonable, and is the "right way" to do it in the
relational model. Can handle common queries such as 'show me all
pictures with tag X'. Can also easily perform queries such as 'show me
a list of all tags in the database'.
This also gives you a logical place to store additional information for
each tag, such as the user and timestamp of the first usage of the tag,
or a cache of the approximate number of pictures with that tag (for a
future performance optimization, maybe), or whatever else you can think
up that might be useful to store on a per-tag level.
> select pid from tags inner join picture_tags using(tag_id) where tag='bla'
> select tag from tags inner join picture_tags using(tag_id) where pid='xxx'
>
> * A full text index in the picture table containing the tags
>
> select pid from pictures where tags @@ to_tsquery('bla')
> (or the non-fti version)
> select pid from pictures where tags ~* '.*bla.*'
>
> select tags from pictures where pid=xxx;
I'm not experienced with full text indexing so perhaps I'm wrong about
this, but it seems like it would give you approximately the same
flexibility as #1 in terms of your data model. The only reason I can
think of why you might want this over #1 would be for a performance
improvement, but if there's a reasonably small number of distinct tags
and/or distinct tags per picture I can't imagine it being much faster
than #1.
-- Mark Lewis