Thread: Best approach for query with optional constraints

Best approach for query with optional constraints

From
Jon Smark
Date:
Hi,

I have a problem with multiple solutions, and my question concerns which
solution is preferred given whatever optimizations happen inside the query
planner.

In a bug tracking system, I have the tables "users", "bugs" and "tags".
One bug may have one user (the reporter) but multiple tags.  Moreover, one
tag may be associated with multiple bugs.  There is therefore a many-to-many
relation between bugs and tags.  I use an extra table "bug_tags" to keep
track of this relation, as shown below:

CREATE TABLE bugs (
  bid int4 UNIQUE NOT NULL,
  uid int4 REFERENCES users(uid) NOT NULL,
  title text NOT NULL,
  PRIMARY KEY (bid));

CREATE TABLE bug_tags (
  bid int4 REFERENCES bugs(bid) NOT NULL,
  tid int4 REFERENCES tags(tid) NOT NULL,
  PRIMARY KEY (bid, tid));


Here's the problem: I want to retrieve a list of bugs (possibly) matching
certain constraints.  One possible constraint is a user ID: if given, only
those bugs reported by the user will be returned.  Another constraint is
a set of tags: only those bugs that contain *all* the tags in the set must
be returned (if the set is empty, then all bugs match).  Moreover, the user
and tags constraints are conjunctive: if both set, that means I want only
the bugs by that user having also the given set of tags.

Below is one possible solution using a CTE to fetch all bugs that match
the tags constraint.  (Note that $IDENTIFIER will be replaced by the actual
contents of the variable using prepared statements).

with tagged as
  (select bid
  from bug_tags inner join tags on bug_tags.tid = tags.tid
  group by bid
  having array_agg (tags.tag) @> $TAGS)
select *
from bugs
where (($UID :: int4) is null or uid = $UID) and
($TAGS = (array[] :: text[]) or bid in (select bid from tagged))

The first alternative is shown below.  The basic principle is the same,
except that instead of a CTE there is a subquery.  The idea is that the
subquery will only be performed if the set of tags is non-empty.  Or is
the query planner smart enough to never execute the CTE if this is the
case?  If so, then both these solutions should be identical.

select *
from bugs
where (($UID :: int4) is null or uid = $UID) and
($TAGS = (array[] :: text[]) or
bid in (
  select bid
  from bug_tags inner join tags on bug_tags.tid = tags.tid
  group by bid
  having array_agg (tags.tag) @> $TAGS))


The second alternative is shown below.  This one uses a different approach:
instead of constructing a set of bug IDs matching the tag constraints, it
goes one by one through the bugs, verifying if each one matches the tag
constraints.

select *
from bugs
where (($UID :: int4) is null or uid = $UID) and
($TAGS = (array[] :: text[]) or
$TAGSs <@ array (select tags.tag from bug_tags inner join tags on bug_tags.tid = tags.tid where bug_tags.bid =
bugs.bid))


To conclude: is it worth worrying about such details, or can I assume the
query optimizer will automagically choose the best approach.  And if not,
from which solution should I expect better results when a) there are no
constraints, b) there are only user constraints, c) there are only tag
constraints, and d) there are both user and tag constraints?

Thank you in advance for any light you might be able to shed!
Best,
Jon

P.S. Other solutions to a similar problem can be found on this thread:
http://stackoverflow.com/questions/14527080/selecting-matching-subset-in-many-to-many-relation


Re: Best approach for query with optional constraints

From
Kevin Grittner
Date:
Jon Smark <jon.smark@yahoo.com> wrote:

> Here's the problem: I want to retrieve a list of bugs (possibly) matching
> certain constraints.  One possible constraint is a user ID: if given, only
> those bugs reported by the user will be returned.  Another constraint is
> a set of tags: only those bugs that contain *all* the tags in the set must
> be returned (if the set is empty, then all bugs match).  Moreover, the user
> and tags constraints are conjunctive: if both set, that means I want only
> the bugs by that user having also the given set of tags.
>
> Below is one possible solution using a CTE to fetch all bugs that match
> the tags constraint.  (Note that $IDENTIFIER will be replaced by the actual
> contents of the variable using prepared statements).

The best thing to do in such a situation is to mock up some data
and try.

I'll show some timings for your alternatives and one I thought of,
using one million rows in bugs and about 10 million rows in
bug_tags, with about on hundred thousand tags, randomly assigned.
Searches were for a user and three tags.  I eliminated the
apparently unneeded joins to tags.

> with tagged as
>   (select bid
>   from bug_tags inner join tags on bug_tags.tid = tags.tid
>   group by bid
>   having array_agg (tags.tag) @> $TAGS)
> select *
> from bugs
> where (($UID :: int4) is null or uid = $UID) and
> ($TAGS = (array[] :: text[]) or bid in (select bid from tagged))

About 1550 ms.

> The first alternative is shown below.  The basic principle is the same,
> except that instead of a CTE there is a subquery.  The idea is that the
> subquery will only be performed if the set of tags is non-empty.  Or is
> the query planner smart enough to never execute the CTE if this is the
> case?  If so, then both these solutions should be identical.
>
> select *
> from bugs
> where (($UID :: int4) is null or uid = $UID) and
> ($TAGS = (array[] :: text[]) or
> bid in (
>   select bid
>   from bug_tags inner join tags on bug_tags.tid = tags.tid
>   group by bid
>   having array_agg (tags.tag) @> $TAGS))

About 1550 ms.

> The second alternative is shown below.  This one uses a different approach:
> instead of constructing a set of bug IDs matching the tag constraints, it
> goes one by one through the bugs, verifying if each one matches the tag
> constraints.
>
> select *
> from bugs
> where (($UID :: int4) is null or uid = $UID) and
> ($TAGS = (array[] :: text[]) or
> $TAGSs <@ array (select tags.tag from bug_tags inner join tags on
> bug_tags.tid = tags.tid where bug_tags.bid = bugs.bid))

About 90 ms.

My idea:

SELECT *
  FROM bugs b
  WHERE (($UID :: int4) is null or uid = $UID)
    AND NOT EXISTS
          (
            SELECT * FROM (SELECT unnest($TAGSs)) t(tid)
              WHERE NOT EXISTS
                      (
                        SELECT * FROM bug_tags bt
                          WHERE bt.bid = b.bid
                            AND bt.tid = t.tid
                      )
          )

About 85 ms.

> To conclude: is it worth worrying about such details, or can I assume the
> query optimizer will automagically choose the best approach.

The optimizer gets smarter with every major release in terms of
recognizing logically equivalent plans, so that it can pick the one
with the lowest estimated cost; but there will always be constructs
which it does not recognize as logically equivalent either because
it would cost to much to test for equivalence or we just haven't
gotten to that yet.  So if you can see different ways to write a
query, it's not a bad idea to compare them on performance.

-Kevin