Thread: slow self-join query

From: Robert Poor
[Disclaimer: this is a re-post, since I wasn't subscribed the first
time I posted.  Pardon if this is a duplicate.]

The following query is abysmally slow (e.g. 7 hours+).  The goal is to
find "among the users that follow user #1, who do they also follow?"
and to count the latter.

   SELECT L2.followed_id as followed_id, COUNT(U1.id) AS count
     FROM users AS U1
INNER JOIN links AS L1 ON L1.follower_id = U1.id
INNER JOIN links AS L2 ON L2.follower_id = U1.id
    WHERE U1.type = 'User::Twitter'
      AND L1.followed_id = 1
 GROUP BY L2.followed_id

Here's the rest of the info.

=== versions

psql (9.1.2, server 8.3.14)

=== schema

 create_table "users", :force => true do |t|
   t.string   "type"
 end

 create_table "links", :force => true do |t|
   t.integer  "followed_id"
   t.integer  "follower_id"
 end

 add_index "links", ["follower_id"], :name => "index_links_on_follower_id"
 add_index "links", ["followed_id", "follower_id"], :name => "index_links_on_followed_id_and_follower_id", :unique => true
 add_index "links", ["followed_id"], :name => "index_links_on_followed_id"

=== # of rows

users: 2,525,539
links: 4,559,390

=== explain

 HashAggregate  (cost=490089.52..490238.78 rows=11941 width=8)
   ->  Hash Join  (cost=392604.44..483190.22 rows=1379860 width=8)
         Hash Cond: (f1.follower_id = u1.id)
         ->  Bitmap Heap Scan on links f1  (cost=14589.95..55597.70 rows=764540 width=4)
               Recheck Cond: (followed_id = 1)
               ->  Bitmap Index Scan on index_links_on_followed_id  (cost=0.00..14398.82 rows=764540 width=0)
                     Index Cond: (followed_id = 1)
         ->  Hash  (cost=300976.98..300976.98 rows=4559881 width=12)
               ->  Hash Join  (cost=94167.40..300976.98 rows=4559881 width=12)
                     Hash Cond: (f2.follower_id = u1.id)
                     ->  Seq Scan on links f2  (cost=0.00..77049.81 rows=4559881 width=8)
                     ->  Hash  (cost=53950.20..53950.20 rows=2526496 width=4)
                           ->  Seq Scan on users u1  (cost=0.00..53950.20 rows=2526496 width=4)
                                 Filter: ((type)::text = 'User::Twitter'::text)

=== other comments

I'm assuming I'm doing something obviously stupid and that the above
info will be sufficient for anyone skilled in the art to detect the
problem.  However, if needed I will gladly invest the time
to create a subset of the data in order to run EXPLAIN ANALYZE.  (With
the whole dataset, it requires > 7 hours to complete the query.  I
don't want to go down that path again!)

- rdp

Re: slow self-join query

From: Scott Marlowe
On Sat, Mar 17, 2012 at 2:56 PM, Robert Poor <rdpoor@gmail.com> wrote:
> The following query is abysmally slow (e.g. 7 hours+).  The goal is to
> find "among the users that follow user #1, who do they also follow?"
> and to count the latter.
>
> [snip: full query, schema, row counts, and EXPLAIN output as above]

Do you have an index on users.type?

Re: slow self-join query

From: Robert Poor
On Sat, Mar 17, 2012 at 23:07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Yeah try setting [work_mem] to something absurd like 500MB and see if the plan changes.

Suweet!  Sorting now runs in-memory, and that makes a big difference, even when groveling over 1M records (under 12 seconds rather than 7 hours).  Results in

   http://explain.depesz.com/s/hNO

On Sat, Mar 17, 2012 at 23:09, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Also it looks like you're still not using the index on this:

Subquery Scan u1  (cost=0.00..313.55 rows=50 width=4) (actual time=0.030..147.136 rows=10000 loops=1)
   Filter: ((u1.type)::text = 'User::Twitter'::text)

Are you sure you're using an indexable condition?

I know that users.type is indexed -- what would keep that from being honored?  FWIW, I believe that all users.type fields are currently set to 'User::Twitter', but that will change in the future.

On Sat, Mar 17, 2012 at 23:12, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Also also this looks like it's the most expensive operation:

Seq Scan on followings f2  (cost=0.00..93523.95 rows=5534395 width=8) (actual time=0.041..19365.834 rows=5535964 loops=1)

I'm guessing the f2.follower_id isn't very selective?

Not 100% sure what you mean -- f2.follower_id is very sparse (compared to f1.follower_id), but that's the point of this particular query.  But since upping work_mem makes it run really fast, I'm not overly concerned about this one.  Thanks for your help!

One last thought: I could re-cast this as a subquery / query pair, each with a single join.  Am I correct in thinking that could make it really easy on the planner (especially if the tables were properly indexed)?

Thanks again.

- r

Re: slow self-join query

From: Andrew Dunstan

On 03/18/2012 10:37 AM, Robert Poor wrote:
>
> On Sat, Mar 17, 2012 at 23:09, Scott Marlowe <scott.marlowe@gmail.com
> <mailto:scott.marlowe@gmail.com>> wrote:
>
>     Also it looks like you're still not using the index on this:
>
>     Subquery Scan u1 (cost=0.00..313.55 rows=50 width=4) (actual
>     time=0.030..147.136 rows=10000 loops=1)
>
>        Filter: ((u1.type)::text = 'User::Twitter'::text)
>
>     Are you sure you're using an indexable condition?
>
>
> I know that users.type is indexed -- what would keep that from being
> honored?  FWIW, I believe that all user.type fields are set to
> User::Twitter, but that will change in the future.
>
>

If all the rows have that value, then using the index would be silly.
Postgres knows, from the stats that ANALYZE calculates, whether or not
using an index is likely to be more efficient, and it avoids doing so
in cases where it isn't.

cheers

andrew

Re: slow self-join query

From: Scott Marlowe
On Sun, Mar 18, 2012 at 8:37 AM, Robert Poor <rdpoor@gmail.com> wrote:
> On Sat, Mar 17, 2012 at 23:12, Scott
> Marlowe <scott.marlowe@gmail.com> wrote:
>
>>
>> Also also this looks like it's the most expensive operation:
>>
>> Seq Scan on followings f2 (cost=0.00..93523.95 rows=5534395 width=8)
>> (actual time=0.041..19365.834 rows=5535964 loops=1)
>>
>> I'm guessing the f2.follower_id isn't very selective?
>
>
> Not 100% sure what you mean -- f2.follower_id is very sparse (compared to
> f1.follower_id), but that's the point of this particular query.  But since
> upping work_mem makes it run really fast, I'm not overly concerned about
> this one.  Thanks for your help!

Selectivity is how selective a single value is likely to be.  So if
f2.follower_id has 5000 entries but only 2 distinct values, it's not
likely to be very selective, since most of the table will match one of
those two values.  If it's 1M rows and 1M distinct follower_ids, then
its selectivity is 1.0, because any one value will only ever match a
single row.
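
The definition above can be sketched numerically.  A toy illustration
(distinct values divided by total rows -- not how the planner actually
computes its statistics):

```python
# Toy illustration of column selectivity: distinct values / total rows.
# A value near 1.0 means a lookup for one value returns few rows (an index
# helps); a value near 0 means most of the table matches (a seq scan is
# often cheaper).
def selectivity(values):
    return len(set(values)) / len(values)

low = [1, 2] * 2500            # 5000 rows, only 2 distinct follower_ids
high = list(range(1_000_000))  # 1M rows, 1M distinct follower_ids

print(selectivity(low))   # 0.0004 -> a typical value matches ~2500 rows
print(selectivity(high))  # 1.0    -> each value matches exactly one row
```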

> One last thought: I could re-cast this as a subquery / query pair, each with
> a single join.  Am I correct in thinking that could make it really easy on
> the planner (especially if the tables were properly indexed)?

Why are you joining twice to the parent table?  If you're trying to
recurse without a with clause, then wouldn't you join the last table
to the one before it?

Re: slow self-join query

From: Scott Marlowe
also also wik

On Sun, Mar 18, 2012 at 8:37 AM, Robert Poor <rdpoor@gmail.com> wrote:
> On Sat, Mar 17, 2012 at 23:07, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> Yeah try setting [work_mem] to something absurd like 500MB and see if the
>> plan changes.
>
>
> Suweet!  Sorting now runs in-memory, and that makes a big difference, even
> when groveling over 1M records (under 12 seconds rather than 7 hours).
>  Results in
>
>    http://explain.depesz.com/s/hNO

Well that's better.  Test various sizes of work_mem to see what you
need, then maybe double it.  How many simultaneous connections do you
have to this db?  Different accounts?  Different apps?  While it might
be worth setting for a particular user or db, it might or might not be
a good idea to set it to something like 512MB server-wide.  On servers
with hundreds to thousands of connections, 16 or 32MB is often all
you'd want, since work_mem is additive across all active sorts in the
db.  A thousand users suddenly sorting 512MB in memory at once can
take down your db server in seconds.

Still seems like it's doing a lot of work.
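
A side note on scoping: work_mem doesn't have to be raised server-wide
in postgresql.conf; it can be set per session, per role, or per
database, which limits the many-connections risk described above.  A
sketch in standard PostgreSQL syntax (the role and database names are
made up):

```sql
-- Per session: affects only the current connection.
SET work_mem = '256MB';

-- Per role: applies whenever the (hypothetical) "reporting" role connects.
ALTER ROLE reporting SET work_mem = '256MB';

-- Per database: applies to new connections to the (hypothetical) "analytics" db.
ALTER DATABASE analytics SET work_mem = '64MB';
```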

Re: slow self-join query

From: Robert Poor
On Sun, Mar 18, 2012 at 08:30, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Why are you joining twice to the parent table?  If you're trying to
> recurse without a with clause, then wouldn't you join the last table
> to the one before it?

I'm FAR from being an SQL expert; there's a significant chance that
I'm not thinking about this right.  My intention for this query
(slightly renamed since the original post):

    SELECT F2.leader_id as leader_id, COUNT(U1.id) AS count
      FROM users AS U1
INNER JOIN user_associations AS F1 ON F1.follower_id = U1.id
INNER JOIN user_associations AS F2 ON F2.follower_id = U1.id
     WHERE F1.leader_id = 321
  GROUP BY F2.leader_id

is "among users that follow leader 321, who are the most widely
followed leaders?", or more formally, find all the users that are
followers of user 321 (via the inner join on F1).  Of those users, tally up
their leaders so we know which leaders are most popular.  Recall that
the user_associations table is simply a user-to-user association:

 create_table "user_associations", :force => true do |t|
  t.integer  "follower_id"
  t.integer  "leader_id"
 end

Is there a better way to do this?

Re: slow self-join query

From: Merlin Moncure
On Sun, Mar 18, 2012 at 10:57 PM, Robert Poor <rdpoor@gmail.com> wrote:
> On Sun, Mar 18, 2012 at 08:30, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> Why are you joining twice to the parent table?  If you're trying to
>> recurse without a with clause, then wouldn't you join the last table
>> to the one before it?
>
> I'm FAR from being an SQL expert; there's a significant chance that
> I'm not thinking about this right.  My intention for this query
> (slightly renamed since the original post):
>
>    SELECT F2.leader_id as leader_id, COUNT(U1.id) AS count
>      FROM users AS U1
> INNER JOIN user_associations AS F1 ON F1.follower_id = U1.id
> INNER JOIN user_associations AS F2 ON F2.follower_id = U1.id
>     WHERE F1.leader_id = 321
>  GROUP BY F2.leader_id
>
> is "among users that follow leader 321, who are the most widely
> followed leaders?", or more formally, find all the users that are
> followers of user 321 (via inner join on F1)  Of those users, tally up
> their leaders so we know which leaders are most popular.  Recall that
> the user_associations table is simply a user-to-user association:
>
>  create_table "user_associations", :force => true do |t|
>  t.integer  "follower_id"
>  t.integer  "leader_id"
>  end
>
> Is there a better way to do this?

hm. Something does not seem right with your query.  You're joining in
the same table twice with the same clause:

INNER JOIN user_associations AS F1 ON F1.follower_id = U1.id
INNER JOIN user_associations AS F2 ON F2.follower_id = U1.id

I think you meant to cascade through the follower back to the leader
(maybe not... it's early Monday and the coffee hasn't worked its way
through the fog yet).

Also, do you really need to involve the user table?  You're counting
U1.Id which is equivalent to F2.follower_id.

try this and see what pops out (i may not have the F1/F2 join quite right):
SELECT F2.leader_id as leader_id, COUNT(*) AS count
  FROM user_associations AS F1
  INNER JOIN user_associations AS F2 ON F2.follower_id = F1.follower_id
  WHERE F1.leader_id = 321
  GROUP BY 1;

merlin

Re: slow self-join query

From: "Kevin Grittner"
Robert Poor <rdpoor@gmail.com> wrote:
> among users that follow leader 321, who are the most widely
> followed leaders?", or more formally, find all the users that are
> followers of user 321 (via inner join on F1)  Of those users,
> tally up their leaders so we know which leaders are most popular.

It sounds like you're looking for something like this:

SELECT leader_id, count(*) as count
  FROM user_associations x
  WHERE exists
        (
          SELECT * FROM user_associations y
            WHERE y.follower_id = x.follower_id
              AND y.leader_id = 321
        )
  GROUP BY leader_id
;
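
The EXISTS form above can be sanity-checked on a toy dataset.  A
minimal sketch using Python's sqlite3 (the table and column names
follow the thread; the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user_associations (leader_id INTEGER, follower_id INTEGER);
-- Followers 10 and 11 follow leader 321; they also follow 400 and 500.
-- Follower 12 does not follow 321, so leader 999 must not appear.
INSERT INTO user_associations (leader_id, follower_id) VALUES
  (321, 10), (321, 11),
  (400, 10), (400, 11),
  (500, 10),
  (999, 12);
""")

rows = con.execute("""
    SELECT leader_id, count(*) AS count
      FROM user_associations x
     WHERE EXISTS (SELECT * FROM user_associations y
                    WHERE y.follower_id = x.follower_id
                      AND y.leader_id = 321)
     GROUP BY leader_id
     ORDER BY leader_id
""").fetchall()

print(rows)  # -> [(321, 2), (400, 2), (500, 1)]
```

Note that leader 321 itself shows up in the result (its followers do,
after all, follow it); excluding it would take an extra
`AND x.leader_id <> 321`.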

>  create_table "user_associations", :force => true do |t|
>   t.integer  "follower_id"
>   t.integer  "leader_id"
>  end

I don't really know what that means.  In the future, it would make
things easier on those who are trying to help if you either post the
SQL form or go into psql and type `\d tablename`.

-Kevin

Re: slow self-join query

From: Robert Poor
@merlin, @kevin:  Thank you both -- I'll try your suggestions as soon
as I get back to the mothership.

@kevin: I hear you.  (I'm deeply steeped in Ruby on Rails and
foolishly assume that it's easy to read.)  With that in mind:

\d user_associations
                              Table "public.user_associations"
    Column    |            Type             |                        Modifiers
 -------------+-----------------------------+----------------------------------------------------------
  id          | integer                     | not null default nextval('followings_id_seq'::regclass)
  leader_id   | integer                     |
  follower_id | integer                     |
  created_at  | timestamp without time zone | not null
  updated_at  | timestamp without time zone | not null
 Indexes:
     "followings_pkey" PRIMARY KEY, btree (id)
     "index_followings_on_leader_id_and_follower_id" UNIQUE, btree (leader_id, follower_id)
     "index_followings_on_follower_id" btree (follower_id)
     "index_followings_on_leader_id" btree (leader_id)

Re: slow self-join query

From: "Kevin Grittner"
Robert Poor <rdpoor@gmail.com> wrote:

> @kevin: I hear you.  (I'm deeply steeped in Ruby on Rails and
> foolishly assume that it's easy to read.)  With that in mind:
>
> \d user_associations

>  id          | integer                     | not null default
> nextval('followings_id_seq'::regclass)

I assume that this is needed to keep RoR happy.  Since a row seems
meaningless without both leader_id and follower_id, and that is
unique, the synthetic key here is totally redundant.  Performance
(both modifying the table and querying against it) would be faster
without this column, but I understand that many ORMs (including, as
I recall, RoR) are more difficult to work with unless you have this.

>  leader_id   | integer                     |
>  follower_id | integer                     |

I'm surprised you didn't declare both of these as NOT NULL.

>  created_at  | timestamp without time zone | not null
>  updated_at  | timestamp without time zone | not null

I can understand tracking when the follow was initiated, but what
would you ever update here?  (Or is this part of a generalized
optimistic concurrency control scheme?)

> Indexes:
>     "followings_pkey" PRIMARY KEY, btree (id)
>     "index_followings_on_leader_id_and_follower_id" UNIQUE, btree
> (leader_id, follower_id)
>     "index_followings_on_follower_id" btree (follower_id)
>     "index_followings_on_leader_id" btree (leader_id)

This last index is of dubious value when you already have an index
which starts with leader_id.  It will be of even more dubious
benefit when we have index-only scans in 9.2.

-Kevin