query structure for selecting row by tags - Mailing list pgsql-sql
From | Samuel Gendler |
---|---|
Subject | query structure for selecting row by tags |
Date | |
Msg-id | CAEV0TzDNMbukP+QikJ8PH8tmti3OJCG2Crx5wF+W1prQbD65-g@mail.gmail.com Whole thread Raw |
List | pgsql-sql |
I need to tag entities in my db with arbitrary strings. That part is simple enough. The difficulty is that I need to select entities based on whether they match an expression built via boolean combinations of tags:
(tag1 || tag2 || tag3) && tag4
I can roll my own tagging mechanism, in which case it'd probably look like this:
create table entity (
id bigserial not null primary key,
...
)
create table entity_tags (
entity_fk bigint not null,
tag varchar
)
Or I can use a generic tagging plugin in my webapp framework (grails, taggable plugin, hibernate under the hood), in which case the table structure looks like this:
create table entity (
id bigserial not null primary key,
....
)
create table tag (
id bigserial,
tag varchar
)
create table taglink (
entity_ref bigint,
entity_type varchar, -- stores the type of entity since this table is used for all entities in the system
tag_ref bigint
)
In the grails-native version, there is no explicit foreign key to a single entity table, since entity_ref is used as a foreign key into as many entity tables as there are taggable entities, but within any given query, its role is to act as a foreign key to the table specified in entity_type. tag_ref is always a foreign key to the tags table, which does eliminate the replication of the tag string in every row, but given that most tag names are likely to be pretty short, it probably doesn't save much.
I don't see a very strong reason to support one version over the other, as the problem of constructing a query that can resolve entity_ids based on the expression given at the top of this message seems to have similar difficulty in either context. My question is, is there a sneaky way to turn my expression into a query other than the following:
For every tag element in the expression, join to the tag table one time (via taglink, if necessary), and then express a where condition that looks similar to the expression. If each tag table join is aliased as t1, t2, t3, ..., the where clause would look like this:
select distinct e.id
from entity e
join entity_tags t1 on t1.entity_fk = e.id
join entity_tags t2 on t2.entity_fk = e.id
join entity_tags t3 on t3.entity_fk = e.id
join entity_tags t4 on t4.entity_fk = e.id
where ((t1.tag = 'tag1' or t2.tag = 'tag2' or t3.tag = 'tag3') and t4.tag = 'tag4') and [rest of where clause goes here]
This has the downside of creating a huge cross multiple of the entity_tags table with itself, though I could keep it from getting out of hand by limiting the join to only those tags that are used in the expression -
join entity_tags t1 on t1.entity_fk = e.id and t1.tag in ('tag1', 'tag2', 'tag3', tag4')
join entity_tags t2 on t2.entity_fk = e.id and t2.tag in ('tag1', 'tag2', 'tag3', tag4')
join entity_tags t3 on t3.entity_fk = e.id and t3.tag in ('tag1', 'tag2', 'tag3', tag4')
join entity_tags t4 on t4.entity_fk = e.id and t4.tag in ('tag1', 'tag2', 'tag3', tag4')
There is no reason for a tag expression to ever include more than 10 tag elements, so I'd have no trouble limiting the number of tags allowed in an expression to a low number like that in my expression parser.
But maybe there is a better mechanism, one which aggregates all tags into an array and then builds a where clause that looks for values in a single array column? In my app, there will be a far higher number of entities (milions) than tags (hundreds), with many tags shared amongst many entities, but no entity having more than maybe 5-10 tags (and no entity having less than 4 tags). An array mechanism would look like this:
select e.id
from entity e
join entity_tags t on t.entity_fk = e.id and t1.tag in ('tag1', 'tag2', 'tag3', 'tag4')
group by e.id
having ((array_agg(t.tag) && ARRAY['tag1','tag2','tag3']::varchar[]) and array_agg(t.tag) @> ARRAY['tag4'])
though simplifying the having clause that way would be difficult, so it would probably be more like this:
having ((array_agg(t.tag) @> ARRAY['tag1']::varchar[] or
array_agg(t.tag) @> ARRAY['tag2']::varchar[] or
array_agg(t.tag) @> ARRAY['tag3']::varchar[]) and
array_agg(t.tag) @> ARRAY['tag4']::varchar[])
Hopefully, the query planer would simplify that?
Additionally, there are actually 4 columns of the entity that I want to treat as tags in the context of a query that selects based on tag expression. My intent was to just create a tag that contains the same value as each of the 4 columns and deal with the maintenance of ensuring that whenever the columns are updated, the tags assigned to the entity reflect the change. This spares me from having to write a where condition that checks tag column and 4 separate entity columns for each expression element. However, maybe it is possible to roll the 4 columns up into the array of tag values somehow, eliminating the requirement of storing the string values both in the tags table and in the columns of the entity table. That would be a big normalization/maintenance win.
I'm open to using hstore or array as a column type in the entity table for tags, but not for the 4 properties of the entity object that are separate from tags, because doing so would be really difficult to implement in GORM/Hibernate. I don't need hibernate for dealing with tag expression queries (though it'd be nice if I could). I suspect I could just build my array as array_agg(t.tags) || e.col1 || e.col2 || e.col3
Basically, I'm looking for feedback as to what would be the best strategy for solving this particular problem. I'm about 99% certain that both strategies outlined above would function correctly, but I don't have a sense of which is best (probably the array solution) or if there is an even better solution that I haven't considered.
Thanks