Thread: SQL schema and query optimisation for fast cross-table query execution
We are developing a social web site for athletes, and I am facing a technical question on how to implement a complex query on an efficient way.
The current DB schema can be adapted, if it may help...
The concept is the following : we have different users registered on the web-site, and we have several challenges ongoing on the site. Each user can be registered to one or several challenges. Each user performs sport activities. Those activities are not linked to a challenge : they count to all challenges to which the user is registered.
The problem is the following : we would want to add on the website a news feed panel where the users connected on the website can see the most recent activities of the other users. But the constraint is the following : a user can only see the activities of another user if they are both registered to at least one common challenge.
On a performance point of view, this query should be very fast : we plan to refresh automatically the news feed panel of each registered user very often..
The current DB schema is as follows:
user table : this table records all users registered on the web-site
- user_id
- ...
challenge table: this table records all ongoing challenges
- challenge_id
- ...
user_challenge_association table : this table records which user is registered to which challenge
- user_id
- challenge_id
activity table : this table records all activities of the users
- user_id
- date
- ...
a naive implementation of my query could be something like this:
SELECT * FROM activity WHERE user_id IN (SELECT user_id FROM user_challenge_association WHERE challenge_id IN (SELECT challenge_id FROM user_challenge_association WHERE user_id = ||current_user||)) ORDER BY date
But I really fear whatever index I put on the tables, the two nested IN operators will have a huge performance impact on the query...
Any help would be very welcome,
Many thanks in advance,
From: Brice André <brice@famille-andre.be> Sent: Saturday, January 25, 2020 12:49 AM
wrote:
The current DB schema is as follows:
user table : this table records all users registered on the web-site
- user_id
- ...
challenge table: this table records all ongoing challenges
- challenge_id
- ...
user_challenge_association table : this table records which user is registered to which challenge
- user_id
- challenge_id
activity table : this table records all activities of the users
- user_id
- date
- ...
a naive implementation of my query could be something like this:
SELECT * FROM activity WHERE user_id IN (SELECT user_id FROM user_challenge_association WHERE challenge_id IN (SELECT challenge_id FROM user_challenge_association WHERE user_id = ||current_user||)) ORDER BY date
Brice
---------------------------------------------------------------
Brice, you are correct, “IN” clauses are horrific performers because they turn every member of the IN into an OR, so the larger your IN list, the slower your query runs.
But every IN can be replaced with a join, and that leverages the power of the relational engine. With standard indexes on the primary and foreign keys, this will be extremely fast and scalable to many hundreds of millions of rows.
Taking your query and rewriting it with joins:
SELECT u.user_name, c.challenge_name, a.activity_name, a.activity_date
FROM activity a
Join user_challenge_association uca on (user_id)
Join challenge c on (challenge_id)
Join user u on (user_id)
Where u.user_id = ||current_user||)) ORDER BY date
Mike Sofen