Thread: Questions on Tags table schema

Questions on Tags table schema

From
"Jay Kang"
Date:
Hello,

I'm currently trying to decide on a database design for tags in my web 2.0 application. The problem I'm facing is that I have 3 separate tables i.e. cars, planes, and schools. All three tables need to interact with the tags, so there will only be one universal set of tags for the three tables.

I read a lot about tags and the best articles I found were:

Road to Web 2.0 ( http://wyome.com/docs/Road_to_Web_2.0:_The_Database_Design  )
tags: database schema ( http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html  )

and a forum discussion on tags with a very similar problem:
http://www.webmasterworld.com/forum112/502.htm 
But I don't like the solution, would like to stick with serial integer for Cars, Planes and Schools tables.

Currently, this is my DB design:

Cars (carid, carname, text, etc.)
Planes (planeid, planename, text, etc.)
Schools (schoolname, text, etc.) <------ School does not take int as primary key but a varchar.

Tags (tagid, tagname, etc)
 
--- Now here is where I have the question. I have to link up three separate tables to use Tags
--- So when a new car is created in the Cars table, should I insert that carID into the TagsItems table
--- as itemID? So something like this?

TagsItems
(
  tagid INT NOT NULL REFERENCES Tags.TagID,
  itemid INT NULL,  <---- really references Cars.carID and Planes.planeID
  schoolname varchar NULL  <---- Saves the Schools.schoolname
  itemid + tagId as Unique
)

I also have a question on the schoolname field, because it accepts varchar not integer. There seems to be some design that would better fit my needs. I'm asking  you guys for a little assistance.

--
Regards,
Jay Kang

Re: Questions on Tags table schema

From
Richard Huxton
Date:
Jay Kang wrote:
> Hello,
>
> I'm currently trying to decide on a database design for tags in my web
> 2.0application. The problem I'm facing is that I have 3 separate
> tables
> i.e. cars, planes, and schools. All three tables need to interact with the
> tags, so there will only be one universal set of tags for the three tables.
>
> I read a lot about tags and the best articles I found were:
>
> Road to Web 2.0 ( http://wyome.com/docs/Road_to_Web_2.0:_The_Database_Design )

And what in particular recommended this to you?

> Currently, this is my DB design:
>
> Cars (carid, carname, text, etc.)
> Planes (planeid, planename, text, etc.)
> Schools (schoolname, text, etc.) <------ School does not take int as primary
> key but a varchar.

You don't mention a primary-key here at all. You're not thinking of
using "schoolname" are you?

> Tags (tagid, tagname, etc)
>
> --- Now here is where I have the question. I have to link up three separate
> tables to use Tags
> --- So when a new car is created in the Cars table, should I insert that
> carID into the TagsItems table
> --- as itemID? So something like this?
>
> TagsItems
> (
>   tagid INT NOT NULL REFERENCES Tags.TagID,
>   itemid INT NULL,  <---- really references Cars.carID and Planes.planeID
>   schoolname varchar NULL  <---- Saves the Schools.schoolname
>   itemid + tagId as Unique
> )

What's wrong with the completely standard:
   car_tags (carid, tagid)
   plane_tags (planeid, tagid)
   school_tags (schoolid, tagid)

> I also have a question on the schoolname field, because it accepts varchar
> not integer. There seems to be some design that would better fit my needs.
> I'm asking  you guys for a little assistance.

Sorry, don't understand this question.

--
   Richard Huxton
   Archonet Ltd

Re: Questions on Tags table schema

From
"Jay Kang"
Date:
Thanks for the reply Richard, but I guess I didn't explain myself well. I have three tables that needs to be mapped to the Tags table. Most of the web references that I mentioned only maps one table to the Tags table. Here is my Tags table:

CREATE TABLE Tags
(
   TagID serial NOT NULL,
   TagName varchar(64) NOT NULL,
   AddedBy varchar(256) NOT NULL,
   AddedDate timestamp NOT NULL,
   Status int NOT NULL,
   ViewCount int NOT NULL CONSTRAINT DF_tm_Tags_ViewCount  DEFAULT (('0'))
);

Is it your opinion that the most standard solution for my problem would be to create three separate tables called car_tags, plane_tags and school_tags, which maps to each of the tables:

CREATE TABLE car_tags
(
   CarID integer NOT NULL,
   TagID integer NOT NULL
);

CREATE TABLE plane_tags
(
   PlaneID integer NOT NULL,
   TagID integer NOT NULL
);

CREATE TABLE school_tags
(
   SchoolID integer NOT NULL,
   TagID integer NOT NULL
);

Would TagID for each of these three tables be a foreign key for the Tags table? Also would each CarID, PlaneID, and SchoolID be a foreign for each corresponding tables? Also won't getting tags for three tables be more complicated? Isn't there a better solution or is this wishful thinking?

On 7/30/07, Richard Huxton <dev@archonet.com > wrote:
Jay Kang wrote:
> Hello,
>
> I'm currently trying to decide on a database design for tags in my web
> 2.0application. The problem I'm facing is that I have 3 separate
> tables
> i.e. cars, planes, and schools. All three tables need to interact with the
> tags, so there will only be one universal set of tags for the three tables.
>
> I read a lot about tags and the best articles I found were:
>
> Road to Web 2.0 ( http://wyome.com/docs/Road_to_Web_2.0:_The_Database_Design )

And what in particular recommended this to you?

The Road to Web 2.0 is an example of tag implementation, just thought it would be helpful to someone with the same problem that I have.

> Currently, this is my DB design:
>
> Cars (carid, carname, text, etc.)
> Planes (planeid, planename, text, etc.)
> Schools (schoolname, text, etc.) <------ School does not take int as primary
> key but a varchar.

You don't mention a primary-key here at all. You're not thinking of
using "schoolname" are you?

Yes, I used school name varchar(64) as primary key for the school tables. You can consider schoolName as Pagename for a wiki.

> Tags (tagid, tagname, etc)
>
> --- Now here is where I have the question. I have to link up three separate
> tables to use Tags
> --- So when a new car is created in the Cars table, should I insert that
> carID into the TagsItems table
> --- as itemID? So something like this?
>
> TagsItems
> (
>   tagid INT NOT NULL REFERENCES Tags.TagID,
>   itemid INT NULL,  <---- really references Cars.carID and Planes.planeID
>   schoolname varchar NULL  <---- Saves the Schools.schoolname
>   itemid + tagId as Unique
> )

What's wrong with the completely standard:
   car_tags (carid, tagid)
   plane_tags (planeid, tagid)
   school_tags (schoolid, tagid)
> I also have a question on the schoolname field, because it accepts varchar
> not integer. There seems to be some design that would better fit my needs.
> I'm asking  you guys for a little assistance.

Sorry, don't understand this question.

--
   Richard Huxton
   Archonet Ltd



--
Regards,
Jay Kang

Re: Questions on Tags table schema

From
Richard Huxton
Date:
Jay Kang wrote:
> Thanks for the reply Richard, but I guess I didn't explain myself well. I
> have three tables that needs to be mapped to the Tags table. Most of the web
> references that I mentioned only maps one table to the Tags table. Here is
> my Tags table:

One quick point. SQL is case-insensitive unless you double-quote
identifiers. This means CamelCase tend not to be used. So instead of
AddedBy you'd more commonly see added_by.

> CREATE TABLE Tags
> (
>    TagID serial NOT NULL,
>    TagName varchar(64) NOT NULL,
>    AddedBy varchar(256) NOT NULL,

This is supposed to be a user? But it's not a foreign-key, and you've
decided that 255 characters will be a good length, but 257 is impossible.

>    AddedDate timestamp NOT NULL,

You probably want "timestamp with time zone" (which represents an
absolute time) rather than without time-zone (which means 1pm in London
is different from 1pm in New York).

Also, if it's "AddedDate" why isn't it a date?

>    Status int NOT NULL,
>    ViewCount int NOT NULL CONSTRAINT DF_tm_Tags_ViewCount  DEFAULT (('0'))
> );

You might not want to mix in details about number of views with details
of the tag. Particularly if you might record more details later (when
viewed, by whom etc).

> Is it your opinion that the most standard solution for my problem would be
> to create three separate tables called car_tags, plane_tags and school_tags,
> which maps to each of the tables:

Well, yes.

> CREATE TABLE car_tags
> (
>    CarID integer NOT NULL,
>    TagID integer NOT NULL
> );
[snip other table defs]

Don't forget CarID isn't really an integer (I mean, you're not going to
be doing sums with car id's are you?) it's actually just a unique code.
Of course, computers are particularly fast at dealing with 32-bit integers.

> Would TagID for each of these three tables be a foreign key for the Tags
> table? Also would each CarID, PlaneID, and SchoolID be a foreign for each
> corresponding tables? Also won't getting tags for three tables be more
> complicated? Isn't there a better solution or is this wishful thinking?

Yes, yes, and no.

You have cars which have tags and planes which have tags. Tagging a
plane is not the same as tagging a car. Either you confuse that issue,
or you want separate tables to track each relationship.

Fetching a list of everything with a specific tag is straightforward enough:

SELECT 'car'::text AS item_type, car_id AS item_id, carname AS item_name
FROM cars JOIN car_tags WHERE tag_id = <x>
UNION ALL
SELECT 'plane'::text AS item_type, plane_id AS item_id, planename AS
item_name
FROM planes JOIN plane_tags WHERE tag_id = <x>
...

--
   Richard Huxton
   Archonet Ltd

Re: Questions on Tags table schema

From
Richard Huxton
Date:
Richard Huxton wrote:
>
>> CREATE TABLE car_tags
>> (
>>    CarID integer NOT NULL,
>>    TagID integer NOT NULL
>> );
> [snip other table defs]
>
> Don't forget CarID isn't really an integer (I mean, you're not going to
> be doing sums with car id's are you?) it's actually just a unique code.
> Of course, computers are particularly fast at dealing with 32-bit integers.

Just realised I haven't explained what I meant by that.

CarID is a different type from PlaneID and TagID. As it happens, we are
using integers to represent them all, but a CarID = 1 is different from
a PlaneID = 1 and although you can numerically compare the two it is an
error to do so.

--
   Richard Huxton
   Archonet Ltd

Re: Questions on Tags table schema

From
"Jay Kang"
Date:
Hey Richard,

Thanks again for the reply, its great to hear some feedback. So once again, here we go:

On 7/30/07, Richard Huxton < dev@archonet.com> wrote:
Jay Kang wrote:
> Thanks for the reply Richard, but I guess I didn't explain myself well. I
> have three tables that needs to be mapped to the Tags table. Most of the web
> references that I mentioned only maps one table to the Tags table. Here is
> my Tags table:

One quick point. SQL is case-insensitive unless you double-quote
identifiers. This means CamelCase tend not to be used. So instead of
AddedBy you'd more commonly see added_by.


Yes, I am aware that postgre is case-insensitive, but I write all query with case so its easier for me to read later on.

> CREATE TABLE Tags
> (
>    TagID serial NOT NULL,
>    TagName varchar(64) NOT NULL,
>    AddedBy varchar(256) NOT NULL,

This is supposed to be a user? But it's not a foreign-key, and you've
decided that 255 characters will be a good length, but 257 is impossible.

I'm developing in c# with asp.net 2.0 which as a membership provider. I'm using ASP.NET 2.0 Website Programming / Problem - Design - Solution" (Wrox Press) as a reference, so not having AddedBy as a foreign key within each of the tables was taken directly from the text. I do not understand your comment about 255 character with 257 being impossible? Could you elaborate, if you feel it warrants further elaboration.

>    AddedDate timestamp NOT NULL,

You probably want "timestamp with time zone" (which represents an
absolute time) rather than without time-zone (which means 1pm in London
is different from 1pm in New York).

OK, timestamp with time zone it is. To be honest, I've been using postgresql for a while now, but never tried using timestamp with time zone.
 

Also, if it's "AddedDate" why isn't it a date?

I had this first as a date, but asp.net 2.0 didn't like it, and changing it to a timestamp fixed the problem.

>    Status int NOT NULL,
>    ViewCount int NOT NULL CONSTRAINT DF_tm_Tags_ViewCount  DEFAULT (('0'))
> );

You might not want to mix in details about number of views with details
of the tag. Particularly if you might record more details later (when
viewed, by whom etc).

Are you suggesting to separate the Tags table into Tags and TagDetails? Because ViewCount within Tags table would represent how many times that tag was clicked, I think others would call this field Popularity. I've been reading alot about tags and I am fascinated at all the information about user tags can provide. Where would I put information such as ViewCount, AddedBy, Status, etc if not within the Tags table? Sorry, if I'm totally missing your point.

> Is it your opinion that the most standard solution for my problem would be
> to create three separate tables called car_tags, plane_tags and school_tags,
> which maps to each of the tables:

Well, yes.

> CREATE TABLE car_tags
> (
>    CarID integer NOT NULL,
>    TagID integer NOT NULL
> );
[snip other table defs]

Don't forget CarID isn't really an integer (I mean, you're not going to
be doing sums with car id's are you?) it's actually just a unique code.
Of course, computers are particularly fast at dealing with 32-bit integers.

Yes, within the Cars table CarID would be a serial so it would auto increment with each row. I understand your concern.

> Would TagID for each of these three tables be a foreign key for the Tags
> table? Also would each CarID, PlaneID, and SchoolID be a foreign for each
> corresponding tables? Also won't getting tags for three tables be more
> complicated? Isn't there a better solution or is this wishful thinking?

Yes, yes, and no.

You have cars which have tags and planes which have tags. Tagging a
plane is not the same as tagging a car. Either you confuse that issue,
or you want separate tables to track each relationship.

Hmm, so if I have a tag called "Saab" and a user clicks on Saab, then information from both Cars and Planes table would appear. If I'm inserting a new row for a tag, wouldn't I need to check if that tagname already appears within the Tags table or would I just create a new row with that tag name. Sorry, I'm not sure what " 'car'::text " this is doing, but I'm guessing its used to group the cars, planes, etc. so it knows which item_type it is. Brilliant!

Fetching a list of everything with a specific tag is straightforward enough:

SELECT 'car'::text AS item_type, car_id AS item_id, carname AS item_name
FROM cars JOIN car_tags WHERE tag_id = <x>
UNION ALL
SELECT 'plane'::text AS item_type, plane_id AS item_id, planename AS
item_name
FROM planes JOIN plane_tags WHERE tag_id = <x>


Thanks for the query, I'm going to start programming so I can figure it out as I go along.

...

--
   Richard Huxton
   Archonet Ltd



--
Regards,
Jay Kang

Re: Questions on Tags table schema

From
Richard Huxton
Date:
Jay Kang wrote:
>> One quick point. SQL is case-insensitive unless you double-quote
>> identifiers. This means CamelCase tend not to be used. So instead of
>> AddedBy you'd more commonly see added_by.
>
> Yes, I am aware that postgre is case-insensitive, but I write all query with
> case so its easier for me to read later on.

It's SQL that's case insensitive. Pretty much any SQL-based database
system you use will do case-folding in some way.

>> CREATE TABLE Tags
>>> (
>>>    TagID serial NOT NULL,
>>>    TagName varchar(64) NOT NULL,
>>>    AddedBy varchar(256) NOT NULL,
>> This is supposed to be a user? But it's not a foreign-key, and you've
>> decided that 255 characters will be a good length, but 257 is impossible.
>
>
> I'm developing in c# with asp.net 2.0 which as a membership provider. I'm
> using ASP.NET 2.0 Website Programming / Problem - Design - Solution" (Wrox
> Press) <http://www.amazon.com/gp/product/0764584642> as a reference, so not
> having AddedBy as a foreign key within each of the tables was taken directly
> from the text. I do not understand your comment about 255 character with 257
> being impossible? Could you elaborate, if you feel it warrants further
> elaboration.

What is AddedBy - a name, a user-id?
If it's an ID, then it seems very long.
If it's a name, then 256 characters sounds a bit arbitrary as a length.
Why choose 256?

The advantage of *not* having AddedBy as a foreign-key is that you can
delete users and not have to update tags with their user-id. The
disadvantage is the same thing. You can end up with tags added by
non-existent users.

>>    AddedDate timestamp NOT NULL,
>>
>> You probably want "timestamp with time zone" (which represents an
>> absolute time) rather than without time-zone (which means 1pm in London
>> is different from 1pm in New York).
>
> OK, timestamp with time zone it is. To be honest, I've been using postgresql
> for a while now, but never tried using timestamp with time zone.

You can get away with it as long as the time-zone setting on your client
  stays the same. Then it changes, and you're left wondering why all
your comparisons are hours out.

> Also, if it's "AddedDate" why isn't it a date?
>
> I had this first as a date, but asp.net 2.0 didn't like it, and changing it
> to a timestamp fixed the problem.

Surely asp.net has a date type? If not, I'd suggest AddedTimestamp as a
name (or AddedTS if you don't enjoy lots of typing :-). It won't matter
to you now, but 12 months from now it'll save you looking up data types.

>>    Status int NOT NULL,
>>>    ViewCount int NOT NULL CONSTRAINT DF_tm_Tags_ViewCount  DEFAULT
>> (('0'))
>>> );
>> You might not want to mix in details about number of views with details
>> of the tag. Particularly if you might record more details later (when
>> viewed, by whom etc).
>
> Are you suggesting to separate the Tags table into Tags and TagDetails?
> Because ViewCount within Tags table would represent how many times that tag
> was clicked, I think others would call this field Popularity. I've been
> reading alot about tags and I am fascinated at all the information about
> user tags can provide. Where would I put information such as ViewCount,
> AddedBy, Status, etc if not within the Tags table? Sorry, if I'm totally
> missing your point.

Well, it all depends on your use analysis. You could make a good
argument that there are two sets of fact data:
1. "Identity" data - id, name, added-by, added-ts, status
2. "Activity" data - number of views, last time clicked etc

Depending on how you intend to use the tags, it might make sense to
separate these. Particularly if you find yourself with no sensible
values in activity data until a tag is used.

 From a separate performance-related point of view, you would expect
activity data to be updated much more often than identity data.

>> You have cars which have tags and planes which have tags. Tagging a
>> plane is not the same as tagging a car. Either you confuse that issue,
>> or you want separate tables to track each relationship.
>
> Hmm, so if I have a tag called "Saab" and a user clicks on Saab, then
> information from both Cars and Planes table would appear.

Well, if you UNION them, yes. Of course you'll need to find columns that
make sense across all types. If planes have e.g. "wingspan" then you'll
need to add 'not applicable'::text in the car-related subquery.

 > If I'm inserting a
> new row for a tag, wouldn't I need to check if that tagname already appears
> within the Tags table or would I just create a new row with that tag name.
> Sorry, I'm not sure what " 'car'::text " this is doing, but I'm guessing its
> used to group the cars, planes, etc. so it knows which item_type it is.
> Brilliant!

Yes, if you're giving a list of all "Saab"s, I'm assuming your users
will want to know if it's a plane or car. The ::text is just PostgreSQL
shorthand for a cast - it's good practice to specify precise types for
literal values.

--
   Richard Huxton
   Archonet Ltd

Re: Questions on Tags table schema

From
Richard Huxton
Date:
Jay Kang wrote:
> Hey Richard,
>
> Sorry for the late reply, I was just making my first test version of the DB
> closely resembling you suggested design. Just wanted to write you back
> answering your questions. So here we go:

No problem - it's email and what with different timezones it's common to
have gaps.

>> What is AddedBy - a name, a user-id?
>> If it's an ID, then it seems very long.
>> If it's a name, then 256 characters sounds a bit arbitrary as a length.
>> Why choose 256?
>
> No, AddedBy is the username of the individual. Why is 256 characters
> arbitrary as a length? Would 255 be better or 32? I guess, your saying its
> too long, shorten it, I'm just going with what the books says, but I really
> welcome any comments you have^^

The book doesn't know what you're trying to do. You do. The important
thing is not whether you choose 256 or 32 or 100, it's that you've
thought about it first.

Obvious thoughts:
1. Is this going to be a real name "Richard Huxton" or an identifier
"rhuxton123"? You'll want more characters for the real name than an
identifier.
2. Where will this be displayed and will it take up too much space? If I
pick a username of WWW...250 repeats...WWW does that mess up any formatting?
3. Do we allow HTML-unsafe characters ('<', '&') and escape them when
used, or just not allow them?

No wrong or right, the process of thinking about it is important.

> The advantage of *not* having AddedBy as a foreign-key is that you can
>> delete users and not have to update tags with their user-id. The
>> disadvantage is the same thing. You can end up with tags added by
>> non-existent users.
>
> Thanks, I would like anonymous users to be able to add tags, so I guess I'll
> leave it the way it is^^

You would normally use NULL to indicate "unknown", which in the case of
an anonymous user would be true. A NULL foreign-key is allowed (unless
you define the column not-null of course).

[snip]
>> Well, it all depends on your use analysis. You could make a good
>> argument that there are two sets of fact data:
>> 1. "Identity" data - id, name, added-by, added-ts, status
>> 2. "Activity" data - number of views, last time clicked etc
>
> If I were to create Identity and Activity for the Tags table, would I be
> creating two separate tables called TagActivities and TagIdentities?

That's what I'm talking about, and you'd have a foreign-key constraint
to make sure activity refers to a real tag identity. Again, I'm not
saying you *do* want to do this, just that you'll need to think about it.

> Currently, I'm not sure how I'll analysis the data for Tags. I know that I
> want to do the bigger font if it is popular and smaller font if its not.
> Hmm, would like to see examples of other websites that utilized Tags tables
> to see how they implemented this function. I was thinking of adding tagcount
> (popularity) for each user within the user definition table.

For this particular case, you'll almost certainly want to cache the
results anyway. The popularity isn't going to change that fast, and
presumably you'll only want to categorise them as VERY BIG, Big, normal
etc. I assume asp.net allows you to cache this sort of information somehow.

>>>> You have cars which have tags and planes which have tags. Tagging a
>>>> plane is not the same as tagging a car. Either you confuse that issue,
>>>> or you want separate tables to track each relationship.
>>> Hmm, so if I have a tag called "Saab" and a user clicks on Saab, then
>>> information from both Cars and Planes table would appear.
>> Well, if you UNION them, yes. Of course you'll need to find columns that
>> make sense across all types. If planes have e.g. "wingspan" then you'll
>> need to add 'not applicable'::text in the car-related subquery.
>
> Hmm, currently I can't visualize the query, again, it would help if I can
> see the data to see what you mean. If planes table had a tag called
> wingspan, wouldn't the query just not show any value for the field so it
> wouldn't need 'not applicable' in the car-related subquery? Not sure really.

Sorry - I'm trying to say that if you UNION together several queries
they all need to have the same columns. So - if one subquery doesn't
have that column you'll need to provide a "not applicable" value instead.

Best of luck with the application, and don't forget to cache query
results when they don't change often. It'll boost performance quite a bit.

P.S. - try the "general" mailing list if you want to discuss this sort
of thing some more. This one is really supposed to be
performance-related questions only.

--
   Richard Huxton
   Archonet Ltd

Re: Questions on Tags table schema

From
Ron Mayer
Date:
Jay Kang wrote:
> Hello,
>
> I'm currently trying to decide on a database design for tags in my web
> 2.0 application. The problem I'm facing is that I have 3 separate tables
> i.e. cars, planes, and schools. All three tables need to interact with
> the tags, so there will only be one universal set of tags for the three
> tables.

It strikes me that some tsearch2 ts_vector like datatype might
work well for this; depending on the types of queries you're doing.