Thread: slow self-join query
Disclaimer: this is a re-post, since I wasn't subscribed the first time I
posted. Pardon if this is a duplicate.

The following query is abysmally slow (e.g. 7+ hours). The goal is to find
"among the users that follow user #1, who do they also follow?" and to
count the latter.

SELECT L2.followed_id AS followed_id, COUNT(U1.id) AS count
  FROM users AS U1
  INNER JOIN links AS L1 ON L1.follower_id = U1.id
  INNER JOIN links AS L2 ON L2.follower_id = U1.id
  WHERE U1.type = 'User::Twitter'
  AND L1.followed_id = 1
  GROUP BY L2.followed_id

Here's the rest of the info.

=== versions

psql (9.1.2, server 8.3.14)

=== schema

create_table "users", :force => true do |t|
  t.string "type"
end

create_table "links", :force => true do |t|
  t.integer "followed_id"
  t.integer "follower_id"
end

add_index "links", ["follower_id"], :name => "index_links_on_follower_id"
add_index "links", ["followed_id", "follower_id"], :name =>
  "index_links_on_followed_id_and_follower_id", :unique => true
add_index "links", ["followed_id"], :name => "index_links_on_followed_id"

=== # of rows

users: 2,525,539
links: 4,559,390

=== explain

HashAggregate  (cost=490089.52..490238.78 rows=11941 width=8)
  ->  Hash Join  (cost=392604.44..483190.22 rows=1379860 width=8)
        Hash Cond: (f1.follower_id = u1.id)
        ->  Bitmap Heap Scan on links f1  (cost=14589.95..55597.70 rows=764540 width=4)
              Recheck Cond: (followed_id = 1)
              ->  Bitmap Index Scan on index_links_on_followed_id  (cost=0.00..14398.82 rows=764540 width=0)
                    Index Cond: (followed_id = 1)
        ->  Hash  (cost=300976.98..300976.98 rows=4559881 width=12)
              ->  Hash Join  (cost=94167.40..300976.98 rows=4559881 width=12)
                    Hash Cond: (f2.follower_id = u1.id)
                    ->  Seq Scan on links f2  (cost=0.00..77049.81 rows=4559881 width=8)
                    ->  Hash  (cost=53950.20..53950.20 rows=2526496 width=4)
                          ->  Seq Scan on users u1  (cost=0.00..53950.20 rows=2526496 width=4)
                                Filter: ((type)::text = 'User::Twitter'::text)

=== other comments

I'm assuming I'm doing something obviously stupid and that the above info
will be sufficient for anyone skilled in the art to detect the problem.
However, if needed I will gladly invest the time to create a subset of the
data in order to run EXPLAIN ANALYZE. (With the whole dataset, it requires
> 7 hours to complete the query. I don't want to go down that path again!)

- rdp
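For readers following along, the semantics of the query (not its
performance) can be sketched on a toy dataset. This is a minimal sketch
with made-up rows, using sqlite3 as a stand-in for PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE links (followed_id INTEGER, follower_id INTEGER);
""")
# Hypothetical data: users 2 and 3 follow user 1; user 2 also follows 4,
# user 3 also follows 4 and 5; user 9 follows 4 but does NOT follow user 1.
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, "User::Twitter") for i in (1, 2, 3, 4, 5, 9)])
con.executemany("INSERT INTO links VALUES (?, ?)",
                [(1, 2), (1, 3), (4, 2), (4, 3), (5, 3), (4, 9)])

# The original query: L1 restricts to followers of user 1,
# L2 re-joins to tally everyone those followers follow.
rows = con.execute("""
    SELECT L2.followed_id AS followed_id, COUNT(U1.id) AS count
    FROM users AS U1
    INNER JOIN links AS L1 ON L1.follower_id = U1.id
    INNER JOIN links AS L2 ON L2.follower_id = U1.id
    WHERE U1.type = 'User::Twitter'
      AND L1.followed_id = 1
    GROUP BY L2.followed_id
    ORDER BY followed_id
""").fetchall()
print(rows)  # -> [(1, 2), (4, 2), (5, 1)]; user 9's follow is excluded
```

Note how only the follows made by users 2 and 3 (the followers of user 1)
are counted, which matches the stated goal.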
On Sat, Mar 17, 2012 at 2:56 PM, Robert Poor <rdpoor@gmail.com> wrote:
> The following query is abysmally slow (e.g. 7 hours+). The goal is to
> find "among the users that follow user #1, who do they also follow?"
> and to count the latter.
> [...]

Do you have an index on users.type?
On Sat, Mar 17, 2012 at 23:07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Yeah try setting [work_mem] to something absurd like 500MB and see if the plan changes.
Suweet! Sorting now runs in-memory, and that makes a big difference, even
when groveling over 1M records (under 12 seconds rather than 7 hours).
Results in

http://explain.depesz.com/s/hNO
On Sat, Mar 17, 2012 at 23:09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Also it looks like you're still not using the index on this:
Subquery Scan u1 (cost=0.00..313.55 rows=50 width=4) (actual
time=0.030..147.136 rows=10000 loops=1)
Filter: ((u1.type)::text = 'User::Twitter'::text)
Are you sure you're using an indexable condition?
I know that users.type is indexed -- what would keep that from being honored? FWIW, I believe that all user.type fields are set to User::Twitter, but that will change in the future.
On Sat, Mar 17, 2012 at 23:12, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Also also this looks like it's the most expensive operation:
Seq Scan on followings f2 (cost=0.00..93523.95 rows=5534395 width=8)
(actual time=0.041..19365.834 rows=5535964 loops=1)
I'm guessing the f2.follower_id isn't very selective?
Not 100% sure what you mean -- f2.follower_id is very sparse (compared to f1.follower_id), but that's the point of this particular query. But since upping work_mem makes it run really fast, I'm not overly concerned about this one. Thanks for your help!
One last thought: I could re-cast this as a subquery / query pair, each with a single join. Am I correct in thinking that could make it really easy on the planner (especially if the tables were properly indexed)?
Thanks again.
- r
On 03/18/2012 10:37 AM, Robert Poor wrote:
> I know that users.type is indexed -- what would keep that from being
> honored? FWIW, I believe that all user.type fields are set to
> User::Twitter, but that will change in the future.

If all the rows have that value, then using the index would be silly.
Postgres knows from the stats that ANALYZE calculates whether or not
using an index is likely to be more efficient, and avoids doing so in
cases where it isn't.

cheers

andrew
On Sun, Mar 18, 2012 at 8:37 AM, Robert Poor <rdpoor@gmail.com> wrote:
> Not 100% sure what you mean -- f2.follower_id is very sparse (compared
> to f1.follower_id), but that's the point of this particular query. But
> since upping work_mem makes it run really fast, I'm not overly
> concerned about this one. Thanks for your help!

Selectivity is how selective a single value is likely to be. So if
f2.follower_id has 5000 entries and there's only 2 values, it's not
likely to be very selective, as most of the table will match one of two
values. If it's 1M rows and 1M distinct follower_ids, then its
selectivity is 1.0, because one value will only ever match one row.

> One last thought: I could re-cast this as a subquery / query pair, each
> with a single join. Am I correct in thinking that could make it really
> easy on the planner (especially if the tables were properly indexed)?

Why are you joining twice to the parent table? If you're trying to
recurse without a WITH clause, then wouldn't you join the last table to
the one before it?
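The selectivity notion Scott describes boils down to distinct values
divided by total rows. A quick sketch with the hypothetical figures from
his message:

```python
def selectivity(n_distinct: int, n_rows: int) -> float:
    """Rough selectivity of a column: 1.0 means each value identifies a
    single row; values near 0 mean a single value matches much of the
    table, so an index on it buys little."""
    return n_distinct / n_rows

# 2 distinct values across 5000 rows: each value matches ~2500 rows,
# so an index scan would touch half the table on average.
print(selectivity(2, 5000))                 # -> 0.0004

# 1M rows, 1M distinct follower_ids: each value matches exactly one row.
print(selectivity(1_000_000, 1_000_000))    # -> 1.0
```

(PostgreSQL estimates this per column via the n_distinct statistic that
ANALYZE gathers; the function above is just the back-of-envelope version.)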
also also wik

On Sun, Mar 18, 2012 at 8:37 AM, Robert Poor <rdpoor@gmail.com> wrote:
> Suweet! Sorting now runs in-memory, and that makes a big difference,
> even when groveling over 1M records (under 12 seconds rather than 7
> hours). Results in
>
> http://explain.depesz.com/s/hNO

Well that's better. Test various sizes of work_mem to see what you need,
then maybe double it. How many simultaneous connections do you have to
this db? Different accounts? Different apps? While it might be worth
setting for a user or a db, it might or might not be a good thing to set
it to something like 512MB world-wide. On servers with hundreds to
thousands of connections, 16 or 32MB is often all you'd want to set it
to, since it's additive across all active sorts in the db. A thousand
users suddenly sorting 512MB in memory at once can take down your db
server in seconds.

Still seems like it's doing a lot of work.
On Sun, Mar 18, 2012 at 08:30, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Why are you joining twice to the parent table? If you're trying to
> recurse without a with clause, then wouldn't you join the last table
> to the one before it?

I'm FAR from being an SQL expert; there's a significant chance that I'm
not thinking about this right. My intention for this query (slightly
renamed since the original post):

SELECT F2.leader_id as leader_id, COUNT(U1.id) AS count
  FROM users AS U1
  INNER JOIN user_associations AS F1 ON F1.follower_id = U1.id
  INNER JOIN user_associations AS F2 ON F2.follower_id = U1.id
  WHERE F1.leader_id = 321
  GROUP BY F2.leader_id

is "among users that follow leader 321, who are the most widely followed
leaders?", or more formally: find all the users that are followers of
user 321 (via the inner join on F1); of those users, tally up their
leaders so we know which leaders are most popular. Recall that the
user_associations table is simply a user-to-user association:

create_table "user_associations", :force => true do |t|
  t.integer "follower_id"
  t.integer "leader_id"
end

Is there a better way to do this?
On Sun, Mar 18, 2012 at 10:57 PM, Robert Poor <rdpoor@gmail.com> wrote:
> [...]
> Is there a better way to do this?

hm. Something does not seem right with your query. You're joining in the
same table twice with the same clause:

INNER JOIN user_associations AS F1 ON F1.follower_id = U1.id
INNER JOIN user_associations AS F2 ON F2.follower_id = U1.id

I think you meant to cascade through the follower back to the leader.
(maybe not... it's early Monday and the coffee hasn't worked its way
through the fog yet). Also, do you really need to involve the users
table? You're counting U1.id, which is equivalent to F2.follower_id.
Try this and see what pops out (I may not have the F1/F2 join quite right):

SELECT F2.leader_id as leader_id, COUNT(*) AS count
  FROM user_associations AS F1
  INNER JOIN user_associations AS F2 ON F1.follower_id = F2.leader_id
  WHERE F1.leader_id = 321
  GROUP BY 1;

merlin
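Merlin flags himself that the F1/F2 join condition may be off. For the
stated goal ("tally the leaders of the followers of 321"), the two
association rows share a follower, so joining on follower_id -- rather
than F1.follower_id = F2.leader_id -- matches the intent while still
dropping the users table. A toy check with made-up rows, sqlite3 standing
in for PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_associations (leader_id INTEGER, follower_id INTEGER)")
# Hypothetical data: users 10 and 11 follow leader 321; 10 also follows
# 400, 11 also follows 400 and 500; user 12 follows 400 but not 321.
con.executemany("INSERT INTO user_associations VALUES (?, ?)",
                [(321, 10), (321, 11), (400, 10), (400, 11), (500, 11), (400, 12)])

# Two-table version: F1 picks out followers of 321, F2 tallies their leaders.
rows = con.execute("""
    SELECT F2.leader_id AS leader_id, COUNT(*) AS count
    FROM user_associations AS F1
    INNER JOIN user_associations AS F2 ON F2.follower_id = F1.follower_id
    WHERE F1.leader_id = 321
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # -> [(321, 2), (400, 2), (500, 1)]; user 12's follow is excluded
```

This matches what the original three-way join returns on the same data,
without touching users at all.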
Robert Poor <rdpoor@gmail.com> wrote:
> among users that follow leader 321, who are the most widely followed
> leaders?", or more formally: find all the users that are followers
> of user 321 (via the inner join on F1); of those users, tally up
> their leaders so we know which leaders are most popular.

It sounds like you're looking for something like this:

SELECT leader_id, count(*) as count
  FROM user_associations x
  WHERE exists (
    SELECT * FROM user_associations y
    WHERE y.follower_id = x.follower_id
    AND y.leader_id = 321
  )
  GROUP BY leader_id;

> create_table "user_associations", :force => true do |t|
>   t.integer "follower_id"
>   t.integer "leader_id"
> end

I don't really know what that means. In the future, it would make things
easier on those who are trying to help if you either post the SQL form or
go into psql and type `\d tablename`.

-Kevin
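Kevin's EXISTS form can be checked against the same kind of toy data
(made-up rows, sqlite3 standing in for PostgreSQL); it produces the same
tallies as the original three-way join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_associations (leader_id INTEGER, follower_id INTEGER)")
# Hypothetical data: users 10 and 11 follow leader 321; 10 also follows
# 400, 11 also follows 400 and 500; user 12 follows 400 but not 321.
con.executemany("INSERT INTO user_associations VALUES (?, ?)",
                [(321, 10), (321, 11), (400, 10), (400, 11), (500, 11), (400, 12)])

# Semi-join version: keep only association rows whose follower also
# follows leader 321, then tally the leaders.
rows = con.execute("""
    SELECT leader_id, COUNT(*) AS count
    FROM user_associations x
    WHERE EXISTS (
        SELECT * FROM user_associations y
        WHERE y.follower_id = x.follower_id
          AND y.leader_id = 321
    )
    GROUP BY leader_id
    ORDER BY leader_id
""").fetchall()
print(rows)  # -> [(321, 2), (400, 2), (500, 1)]
```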
@merlin, @kevin: Thank you both -- I'll try your suggestions as soon as I
get back to the mothership.

@kevin: I hear you. (I'm deeply steeped in Ruby on Rails and foolishly
assume that it's easy to read.) With that in mind:

\d user_associations

                Table "public.user_associations"
   Column    |            Type             |                        Modifiers
-------------+-----------------------------+----------------------------------------------------------
 id          | integer                     | not null default nextval('followings_id_seq'::regclass)
 leader_id   | integer                     |
 follower_id | integer                     |
 created_at  | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "followings_pkey" PRIMARY KEY, btree (id)
    "index_followings_on_leader_id_and_follower_id" UNIQUE, btree (leader_id, follower_id)
    "index_followings_on_follower_id" btree (follower_id)
    "index_followings_on_leader_id" btree (leader_id)
Robert Poor <rdpoor@gmail.com> wrote:
> @kevin: I hear you. (I'm deeply steeped in Ruby on Rails and foolishly
> assume that it's easy to read.) With that in mind:
>
> \d user_associations
>
> id | integer | not null default nextval('followings_id_seq'::regclass)

I assume that this is needed to keep RoR happy. Since a row seems
meaningless without both leader_id and follower_id, and that pair is
unique, the synthetic key here is totally redundant. Performance (both
modifying the table and querying against it) would be faster without this
column, but I understand that many ORMs (including, as I recall, RoR) are
more difficult to work with unless you have it.

> leader_id   | integer |
> follower_id | integer |

I'm surprised you didn't declare both of these as NOT NULL.

> created_at  | timestamp without time zone | not null
> updated_at  | timestamp without time zone | not null

I can understand tracking when the follow was initiated, but what would
you ever update here? (Or is this part of a generalized optimistic
concurrency control scheme?)

> Indexes:
>     "followings_pkey" PRIMARY KEY, btree (id)
>     "index_followings_on_leader_id_and_follower_id" UNIQUE, btree
> (leader_id, follower_id)
>     "index_followings_on_follower_id" btree (follower_id)
>     "index_followings_on_leader_id" btree (leader_id)

This last index is of dubious value when you already have an index which
starts with leader_id. It will be of even more dubious benefit when we
have index-only scans in 9.2.

-Kevin
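Kevin's last point -- that the single-column leader_id index is redundant
next to the composite (leader_id, follower_id) index -- is easy to see in
any B-tree database, since an index can serve lookups on any prefix of
its columns. A sketch with sqlite3 standing in for PostgreSQL (the index
name is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_associations (leader_id INTEGER, follower_id INTEGER)")
# Only the composite index, mirroring the UNIQUE (leader_id, follower_id)
# index from the schema above; no separate index on leader_id alone.
con.execute("""CREATE UNIQUE INDEX idx_leader_follower
               ON user_associations (leader_id, follower_id)""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM user_associations WHERE leader_id = 321
""").fetchall()
# The detail column shows the composite index being used for the
# leader_id-only lookup -- no single-column index required.
print(plan)
```

The reverse is not true: a query filtering only on follower_id cannot use
the (leader_id, follower_id) index efficiently, which is why the separate
follower_id index still earns its keep.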