Re: Slow query with backwards index scan - Mailing list pgsql-performance

From Tilmann Singer
Subject Re: Slow query with backwards index scan
Msg-id 20070728205450.GI19392@tils.net
In response to Re: Slow query with backwards index scan  (Craig James <craig_james@emolecules.com>)
List pgsql-performance
* Craig James <craig_james@emolecules.com> [20070728 22:00]:
> >>SELECT * FROM (
> >> (SELECT * FROM large_table lt
> >> WHERE lt.user_id = 12345
> >> ORDER BY created_at DESC LIMIT 10) AS q1
> >> UNION
> >> (SELECT * FROM large_table lt
> >> WHERE user_id IN (SELECT contact_id FROM relationships WHERE
> >> user_id=12345)
> >> ORDER BY created_at DESC LIMIT 10) AS q2
> >>ORDER BY created_at DESC LIMIT 10;
> >
> >It's not possible to use ORDER BY or LIMIT within unioned queries.
> >
> >http://www.postgresql.org/docs/8.2/static/sql-select.html#SQL-UNION
>
> If I'm reading this documentation correctly, it *is* possible, as long as
> they're inside of a sub-select, as in this case.

I completely overlooked that note in the documentation,
sorry. I had tried it only with aliases on the branches, which fooled
me into thinking it doesn't work at all:

testdb=# (select 1 limit 1) as q1 union (select 2) as q2;
ERROR:  syntax error at or near "as"
LINE 1: (select 1 limit 1) as q1 union (select 2) as q2;
                           ^
but this works:

testdb=# (select 1 limit 1) union (select 2);
 ?column?
----------
        1
        2


Great - that works!
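
For the archives, something like this should be the working form of
the combined query - a sketch only, with the branch aliases from the
quoted version dropped (they caused the syntax error above) and the
alias that PostgreSQL requires on a subquery in FROM put on the outer
derived table instead:

SELECT * FROM (
  (SELECT * FROM large_table lt
   WHERE lt.user_id = 12345
   ORDER BY created_at DESC LIMIT 10)
  UNION
  (SELECT * FROM large_table lt
   WHERE user_id IN (SELECT contact_id FROM relationships
                     WHERE user_id = 12345)
   ORDER BY created_at DESC LIMIT 10)
) AS q
ORDER BY created_at DESC LIMIT 10;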

What I didn't realize in the original post is that the actual problem
is how to retrieve the rows from large_table for the correlated
relationships in an efficient way - the second of the two queries to
be UNIONed.

Using a subselect is efficient for the user with few relationships,
whose matched rows are at the end of the sorted large_table:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 WHERE user_id IN (SELECT contact_id FROM relationships WHERE
 user_id=12345)
 ORDER BY created_at DESC LIMIT 10;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6963.94..6963.96 rows=10 width=621) (actual time=94.598..94.629 rows=4 loops=1)
   ->  Sort  (cost=6963.94..6966.96 rows=1211 width=621) (actual time=94.592..94.602 rows=4 loops=1)
         Sort Key: lt.created_at
         ->  Nested Loop  (cost=39.52..6901.92 rows=1211 width=621) (actual time=85.670..94.547 rows=4 loops=1)
               ->  HashAggregate  (cost=39.52..39.53 rows=1 width=4) (actual time=23.549..23.552 rows=1 loops=1)
                     ->  Bitmap Heap Scan on relationships  (cost=4.33..39.49 rows=10 width=4) (actual time=23.526..23.530 rows=1 loops=1)
                           Recheck Cond: (user_id = 12345)
                           ->  Bitmap Index Scan on relationships_user_id_contact_id_index  (cost=0.00..4.33 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)
                                 Index Cond: (user_id = 12345)
               ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6834.04 rows=2268 width=621) (actual time=62.108..70.952 rows=4 loops=1)
                     Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 94.875 ms


But the subselect is slow for the user with many relationships, whose
matched rows are at the beginning of the sorted large_table:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 WHERE user_id IN (SELECT contact_id FROM relationships WHERE
 user_id=55555)
 ORDER BY created_at DESC LIMIT 10;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6963.94..6963.96 rows=10 width=621) (actual time=53187.349..53187.424 rows=10 loops=1)
   ->  Sort  (cost=6963.94..6966.96 rows=1211 width=621) (actual time=53187.341..53187.360 rows=10 loops=1)
         Sort Key: lt.created_at
         ->  Nested Loop  (cost=39.52..6901.92 rows=1211 width=621) (actual time=201.728..52673.800 rows=69018 loops=1)
               ->  HashAggregate  (cost=39.52..39.53 rows=1 width=4) (actual time=178.777..178.966 rows=40 loops=1)
                     ->  Bitmap Heap Scan on relationships  (cost=4.33..39.49 rows=10 width=4) (actual time=47.049..178.560 rows=40 loops=1)
                           Recheck Cond: (user_id = 55555)
                           ->  Bitmap Index Scan on relationships_user_id_contact_id_index  (cost=0.00..4.33 rows=10 width=0) (actual time=28.721..28.721 rows=40 loops=1)
                                 Index Cond: (user_id = 55555)
               ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6834.04 rows=2268 width=621) (actual time=21.994..1301.375 rows=1725 loops=40)
                     Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 53188.584 ms



Using a join instead, the query for the user with little data is slow,
because the backward index scan has to walk almost the entire table
before it finds the four matching rows:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 JOIN relationships r ON lt.user_id=r.contact_id WHERE
 r.user_id=12345
 ORDER BY lt.created_at DESC LIMIT 10;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..24116.65 rows=10 width=645) (actual time=100348.436..145552.633 rows=4 loops=1)
   ->  Nested Loop  (cost=0.00..28751864.52 rows=11922 width=645) (actual time=100348.429..145552.602 rows=4 loops=1)
         ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=0.00..824833.09 rows=4384343 width=621) (actual time=28.961..82448.167 rows=4384064 loops=1)
         ->  Index Scan using relationships_user_id_contact_id_index on relationships r  (cost=0.00..6.36 rows=1 width=24) (actual time=0.009..0.009 rows=0 loops=4384064)
               Index Cond: ((r.user_id = 12345) AND (lt.user_id = r.contact_id))
 Total runtime: 145552.809 ms


And for the user with a lot of data it's fast, because the first ten
matches turn up right at the start of the backward scan:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 JOIN relationships r ON lt.user_id=r.contact_id WHERE
 r.user_id=55555
 ORDER BY lt.created_at DESC LIMIT 10;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..24116.65 rows=10 width=645) (actual time=0.068..0.428 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..28751864.52 rows=11922 width=645) (actual time=0.063..0.376 rows=10 loops=1)
         ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=0.00..824833.09 rows=4384343 width=621) (actual time=0.028..0.064 rows=13 loops=1)
         ->  Index Scan using relationships_user_id_contact_id_index on relationships r  (cost=0.00..6.36 rows=1 width=24) (actual time=0.010..0.013 rows=1 loops=13)
               Index Cond: ((r.user_id = 55555) AND (lt.user_id = r.contact_id))
 Total runtime: 0.609 ms
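
The direction this seems to point in: since a query for a single
user_id can stop after ten rows, the application could expand the
contact list into one UNION ALL branch per contact. A sketch only -
the contact ids 1001 and 1002 are made up, and it assumes an index on
(user_id, created_at) so each branch can avoid a sort:

SELECT * FROM (
  (SELECT * FROM large_table WHERE user_id = 1001  -- hypothetical contact_id
   ORDER BY created_at DESC LIMIT 10)
  UNION ALL
  (SELECT * FROM large_table WHERE user_id = 1002  -- hypothetical contact_id
   ORDER BY created_at DESC LIMIT 10)
  -- ... one branch per contact_id from relationships ...
) AS q
ORDER BY created_at DESC LIMIT 10;

I haven't verified that the planner actually uses such an index for
each branch, though.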



Any ideas?



tia, Til
