Thread: Optmal tags design?
I am planning to add a tags (as in the "web 2.0" thing) feature to my web based application. I would like some feedback from the experts here on what the best database design for that would be. The possibilities I have come up with are: * A tags table containing the tag and id number of what it links to. select pid from tags where tag='bla' select tag from tags where pid=xxx. * a tags table where each tag exists only once, and a table with the tag ID and picture ID to link them together. select pid from tags inner join picture_tags using(tag_id) where tag='bla' select tag from tags inner join picture_tags using(tag_id) where pid='xxx' * A full text index in the picture table containing the tags select pid from pictures where tags @@ to_tsquery('bla') (or the non-fti version) select pid from pictures where tags ~* '.*bla.*' select tags from pictures where pid=xxx;
On Wed, 2007-07-18 at 14:26 -0700, lists@on-track.ca wrote: > I am planning to add a tags (as in the "web 2.0" thing) feature to my web > based application. I would like some feedback from the experts here on > what the best database design for that would be. > > The possibilities I have come up with are: > * A tags table containing the tag and id number of what it links to. > select pid from tags where tag='bla' > select tag from tags where pid=xxx. Properly indexed, this schema can handle common lookups such as 'show me all pictures with tag X'. The problem here is that any operation involving all tags (for example, 'show me a list of all tags in the database') may be slow and/or awkward. > * a tags table where each tag exists only once, and a table with the tag > ID and picture ID to link them together. This sounds the most reasonable, and is the "right way" to do it in the relational model. Can handle common queries such as 'show me all pictures with tag X'. Can also easily perform queries such as 'show me a list of all tags in the database'. This also gives you a logical place to store additional information for each tag, such as the user and timestamp of the first usage of the tag, or a cache of the approximate number of pictures with that tag (for a future performance optimization, maybe), or whatever else you can think up that might be useful to store on a per-tag level. > select pid from tags inner join picture_tags using(tag_id) where tag='bla' > select tag from tags inner join picture_tags using(tag_id) where pid='xxx' > > * A full text index in the picture table containing the tags > > select pid from pictures where tags @@ to_tsquery('bla') > (or the non-fti version) > select pid from pictures where tags ~* '.*bla.*' > > select tags from pictures where pid=xxx; I'm not experienced with full text indexing so perhaps I'm wrong about this, but it seems like it would give you approximately the same flexibility as #1 in terms of your data model. The only reason I can think of why you might want this over #1 would be for a performance improvement, but if there's a reasonably small number of distinct tags and/or distinct tags per picture I can't imagine it being much faster than #1. -- Mark Lewis
We store tags on our items like this like this:
Tag.ID INT NOT NULL PRIMARY KEY
Tag.Value TEXT LCASE NOT NULL UNIQUE
Item.ID INT NOT NULL PRIMARY KEY
ItemTagBinding.ItemID INT NOT NULL REFERENCES Item.ID
ItemTagBinding.TagID INT NOT NULL REFERENCES Tag.ID
ItemTagBinding.ItemID + ItemTagBinding.TagID UNIQUE
with appropriate indexes on the columns we need to frequently query.
We have about 3 million tag bindings right now, and have not run into any performance issues related to tagging other than generating tag clouds (which we pre-calculate anyway).
I'll have to get back to you when we get up to 10's, or even 100's of millions and let you know how it scaled.
Bryan
Tag.ID INT NOT NULL PRIMARY KEY
Tag.Value TEXT LCASE NOT NULL UNIQUE
Item.ID INT NOT NULL PRIMARY KEY
ItemTagBinding.ItemID INT NOT NULL REFERENCES Item.ID
ItemTagBinding.TagID INT NOT NULL REFERENCES Tag.ID
ItemTagBinding.ItemID + ItemTagBinding.TagID UNIQUE
with appropriate indexes on the columns we need to frequently query.
We have about 3 million tag bindings right now, and have not run into any performance issues related to tagging other than generating tag clouds (which we pre-calculate anyway).
I'll have to get back to you when we get up to 10's, or even 100's of millions and let you know how it scaled.
Bryan
On 7/18/07, lists@on-track.ca <lists@on-track.ca> wrote:
I am planning to add a tags (as in the "web 2.0" thing) feature to my web
based application. I would like some feedback from the experts here on
what the best database design for that would be.
The possibilities I have come up with are:
* A tags table containing the tag and id number of what it links to.
select pid from tags where tag='bla'
select tag from tags where pid=xxx.
* a tags table where each tag exists only once, and a table with the tag
ID and picture ID to link them together.
select pid from tags inner join picture_tags using(tag_id) where tag='bla'
select tag from tags inner join picture_tags using(tag_id) where pid='xxx'
* A full text index in the picture table containing the tags
select pid from pictures where tags @@ to_tsquery('bla')
(or the non-fti version)
select pid from pictures where tags ~* '.*bla.*'
select tags from pictures where pid=xxx;
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match