Thread: Slow index scan on B-Tree index over timestamp field
Caio Casimiro <casimiro.listas@gmail.com> wrote:

> I have one query running at ~ 7 seconds and I would like to know
> if it's possible to make it run faster, once this query runs lots
> of time in my experiment.

> Buffers: shared hit=2390 read=32778
> Total runtime: 24066.145 ms

> effective_cache_size = 2GB

> it seems the problem is with the 'tweet' table.

The EXPLAIN ANALYZE output shows it taking 24 seconds, 8.9 seconds
of which is in accessing the tweet_topic table and 15.1 seconds in
accessing the tweet table. It looks like you have a painfully low
cache hit ratio. The plan looks reasonable to me; it looks like
you need more RAM to cache data if you want better speed.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-11-04 13:56, Kevin Grittner wrote:
> Caio Casimiro <casimiro.listas@gmail.com> wrote:
>> I have one query running at ~ 7 seconds and I would like to know
>> if it's possible to make it run faster, once this query runs lots
>> of time in my experiment.
>
> The EXPLAIN ANALYZE output shows it taking 24 seconds, 8.9 seconds
> of which is in accessing the tweet_topic table and 15.1 seconds in
> accessing the tweet table. It looks like you have a painfully low
> cache hit ratio. The plan looks reasonable to me; it looks like
> you need more RAM to cache data if you want better speed.

There's also an index scan that turns up 1.8 million rows, but only
1,600 of them wind up in the final output. I'd start with restating
the "user_id in (select followed_id ...)" as a join against the
relationship table. The planner is filtering first on the tweet time,
but that doesn't reduce the set of tweets down very well. Assuming
that the user being looked up doesn't follow a large proportion of
other users, I'd figure that reducing the set first by followed users
should be quicker.
Hello all,

I have one query running at ~ 7 seconds and I would like to know if
it's possible to make it run faster, once this query runs lots of
time in my experiment.
Basically the query returns the topics of tweets published by users
that the user N follows and that are published between D1 and D2.

Query:

SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N)
ORDER BY tt.tweet_id;
Indexes:
    "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
    "tweet_creation_time_index" btree (creation_time)
    "tweet_id_index" hash (id)
    "tweet_ios_index" btree (id, user_id, creation_time)
    "tweet_retweeted_idx" hash (retweeted)
    "tweet_user_id_creation_time_index" btree (creation_time, user_id)
    "tweet_user_id_index" hash (user_id)
On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.listas@gmail.com> wrote:

> Hello all,
> I have one query running at ~ 7 seconds and I would like to know if
> it's possible to make it run faster, once this query runs lots of
> time in my experiment.

Do you mean you want it to be fast because it runs many times, or
that you want it to become fast after it runs many times (i.e. once
the data is fully cached)? The plan you show takes 24 seconds, not 7
seconds.
> Basically the query return the topics of tweets published by users
> that the user N follows and that are published between D1 and D2.
>
> Query:
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
> (SELECT followed_id FROM relationship WHERE follower_id = N)
> ORDER BY tt.tweet_id;

I don't know if this affects the plan at all, but it is silly to do a
left join to "tweet" when the WHERE clause has conditions that can't
be satisfied with a null row. Also, you could try changing the
IN-list to an EXISTS subquery.
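For reference, the rewrite described above might look something like
this (a sketch using the table and column names from the original
query; untested against the actual schema):

```sql
-- Sketch: plain inner join (the WHERE conditions reject
-- null-extended rows anyway) and EXISTS in place of the IN-list.
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
JOIN tweet AS t ON tt.tweet_id = t.id
WHERE t.creation_time BETWEEN 'D1' AND 'D2'
  AND EXISTS (SELECT 1
              FROM relationship r
              WHERE r.follower_id = N
                AND r.followed_id = t.user_id)
ORDER BY tt.tweet_id;
```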
Is there some pattern to D1 and D2 that could help the caching? For
example, are they both usually in the just-recent past?
> Indexes:
>     "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
>     "tweet_creation_time_index" btree (creation_time)
>     "tweet_id_index" hash (id)
>     "tweet_ios_index" btree (id, user_id, creation_time)
>     "tweet_retweeted_idx" hash (retweeted)
>     "tweet_user_id_creation_time_index" btree (creation_time, user_id)
>     "tweet_user_id_index" hash (user_id)

Are all of those indexes important? If your table is heavily
updated/inserted, which I assume it is, maintaining those indexes is
going to take up precious RAM that could probably be better used
elsewhere.
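If any of those indexes prove unused, a cleanup could start with the
hash indexes, which duplicate btree coverage here (a sketch; it would
be prudent to check pg_stat_user_indexes.idx_scan before dropping
anything):

```sql
-- Candidates: hash indexes that duplicate existing btree indexes on
-- the same columns. (On 9.3, hash indexes are also not WAL-logged.)
DROP INDEX tweet_id_index;       -- hash (id): redundant with the PK btree
DROP INDEX tweet_user_id_index;  -- hash (user_id)
```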
Cheers,

Jeff
Thank you very much for your answers guys!

On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> Do you mean you want it to be fast because it runs many times, or
> that you want it to become fast after it runs many times (i.e. once
> the data is fully cached)? The plan you show takes 24 seconds, not 7
> seconds.

I want it to be fast because it runs many times. I have an experiment
that evaluates recommendation algorithms for a set of twitter users.
This query returns recommendation candidates so it is called a lot of
times for different users and time intervals.

> I don't know if this affects the plan at all, but it is silly to do
> a left join to "tweet" when the WHERE clause has conditions that
> can't be satisfied with a null row. Also, you could try changing the
> IN-list to an EXISTS subquery.

I'm sorry the ignorance, but I don't understand the issue with the
left join, could you explain more?

> Is there some pattern to D1 and D2 that could help the caching? For
> example, are they both usually in the just-recent past?

The only pattern is that it is always a one day interval, e.g.
D1 = '2013-05-01' and D2 = '2013-05-02'.

> Are all of those indexes important? If your table is heavily
> updated/inserted, which I assume it is, maintaining those indexes is
> going to take up precious RAM that could probably be better used
> elsewhere.

Probably not. But since this database is read only, the quantity of
indexes grew following my desperation. =)

Thank you very much again!
Caio
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Caio Casimiro
Sent: Monday, November 04, 2013 3:44 PM
To: Jeff Janes
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field
> Thank you very much for your answers guys!
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Do you mean you want it to be fast because it runs many times, or
>> that you want it to become fast after it runs many times (i.e. once
>> the data is fully cached)? The plan you show takes 24 seconds, not 7
>> seconds.
>
> I want it to be fast because it runs many times. I have an experiment
> that evaluates recommendation algorithms for a set of twitter users.
> This query returns recommendation candidates so it is called a lot of
> times for different users and time intervals.
>
> Query:
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
> (SELECT followed_id FROM relationship WHERE follower_id = N)
> ORDER BY tt.tweet_id;
>
>> I don't know if this affects the plan at all, but it is silly to do
>> a left join to "tweet" when the WHERE clause has conditions that
>> can't be satisfied with a null row. Also, you could try changing the
>> IN-list to an EXISTS subquery.
> I'm sorry the ignorance, but I don't understand the issue with the
> left join, could you explain more?
>
> Thank you very much again!
> Caio

Just try the following:
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt JOIN tweet AS t ON (tt.tweet_id = t.id
AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N))
ORDER BY tt.tweet_id;
And see if it helps with performance.
Regards,
Igor Neyman
On 2013-11-04 16:10, Caio Casimiro wrote:
> Hi Neyman, thank you for your answer.
> Unfortunately this query runs almost at the same time:
> [EXPLAIN ANALYZE output snipped - Total runtime: 25823.386 ms]
> I have noticed that in both queries the index scan on
> tweet_creation_time_index is very expensive. Is there anything I can
> do to make the planner choose a index only scan?

Yes, because that part of the query is kicking back so many rows, many
of which are totally unnecessary anyway - you're first getting all the
tweets in a particular time range, then limiting them down to just
users that are followed. Here's clarification on the approach I
mentioned earlier. All you should really need are basic (btree)
indexes on your different keys (tweet_topic.tweet_id, tweet.id,
tweet.user_id, relationship.follower_id, relationship.followed_id). I
also changed the left join to an inner join as somebody pointed out
that your logic amounted to reducing the match to an inner join
anyway.
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
JOIN tweet AS t
ON tt.tweet_id = t.id
join relationship
on t.user_id = relationship.followed_id
WHERE creation_time BETWEEN 'D1' AND 'D2'
AND relationship.follower_id = N
ORDER BY tt.tweet_id
;
From: Caio Casimiro [mailto:casimiro.listas@gmail.com]
Sent: Monday, November 04, 2013 4:10 PM
To: Igor Neyman
Cc: Jeff Janes; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field

Hi Neyman, thank you for your answer.
Unfortunately this query runs almost at the same time:
Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)
  Sort Key: tt.tweet_id
  Sort Method: quicksort  Memory: 97kB
  Buffers: shared hit=1849 read=32788
  ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)
        Buffers: shared hit=1849 read=32788
        ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)
              Hash Cond: (t.user_id = relationship.followed_id)
              Buffers: shared hit=3 read=31870
              ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)
                    Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
                    Buffers: shared hit=1 read=31867
              ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 3kB
                    Buffers: shared hit=2 read=3
                    ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)
                          Index Cond: (follower_id = 335093362)
                          Heap Fetches: 0
                          Buffers: shared hit=2 read=3
        ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
              Recheck Cond: (tweet_id = t.id)
              Buffers: shared hit=1846 read=918
              ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
                    Index Cond: (tweet_id = t.id)
                    Buffers: shared hit=1763 read=632
Total runtime: 25823.386 ms
I have noticed that in both queries the index scan on
tweet_creation_time_index is very expensive. Is there anything I can
do to make the planner choose an index only scan?
Thank you,
Caio

> Just try the following:
>
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt JOIN tweet AS t ON (tt.tweet_id = t.id
> AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
> (SELECT followed_id FROM relationship WHERE follower_id = N))
> ORDER BY tt.tweet_id;
>
> And see if it helps with performance.

What is your hardware configuration, and Postgres config parameters
modified from default values?

Regards,
Igor Neyman
Hi Elliot, thank you for your answer.

I tried this query but it still suffers with the index scan on
tweet_creation_time_index:

Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual time=37560.938..37562.503 rows=1640 loops=1)
  Sort Key: tt.tweet_id
  Sort Method: quicksort  Memory: 97kB
  Buffers: shared hit=1849 read=32788
  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20) (actual time=19151.036..37555.227 rows=1640 loops=1)
        Buffers: shared hit=1849 read=32788
        ->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8) (actual time=19099.848..19127.606 rows=597 loops=1)
              Hash Cond: (relationship.followed_id = t.user_id)
              Buffers: shared hit=3 read=31870
              ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106 loops=1)
              ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16) (actual time=19031.916..19031.916 rows=1759645 loops=1)
                    Buckets: 262144  Batches: 1  Memory Usage: 61863kB
                    Buffers: shared hit=1 read=31867
                    ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=48.595..13759.768 rows=1759645 loops=1)
                          Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
                          Buffers: shared hit=1 read=31867
        ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)
              Recheck Cond: (tweet_id = t.id)
              Buffers: shared hit=1846 read=918
              ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)
                    Index Cond: (tweet_id = t.id)
                    Buffers: shared hit=1763 read=632

You said that I would need B-Tree indexes on the fields that I want
the planner to use index only scan, and I think I have them already
on the tweet table:

"tweet_ios_index" btree (id, user_id, creation_time)

Shouldn't the tweet_ios_index be enough to make the scan over
tweet_creation_time_index be an index only scan? And, more important,
would it be really faster?

Thank you very much,
Caio

On Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields.falsehood@gmail.com> wrote:
> Yes, because that part of the query is kicking back so many rows,
> many of which are totally unnecessary anyway - you're first getting
> all the tweets in a particular time range, then limiting them down
> to just users that are followed. [snip]
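A side note on the index-only-scan question above: tweet_ios_index is
btree (id, user_id, creation_time), so it is ordered by id first and
cannot efficiently serve a range predicate on creation_time alone. An
index that leads with creation_time and also carries the other
referenced columns could in principle support an index-only scan here
(a sketch; the index name is made up, and on 9.3 the table must be
well vacuumed so the visibility map lets the planner skip heap
fetches):

```sql
-- Sketch: same columns as tweet_ios_index, reordered so the range
-- predicate on creation_time matches the leading column.
CREATE INDEX tweet_ct_covering_idx ON tweet (creation_time, user_id, id);
```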
Hello,I think you could try with an index on tweet table columns "user_id, creation_time" [in this order , because the first argument is for the equality predicate and the second with the range scan predicate, the index tweet_user_id_creation_time_index is not ok because it has the reverse order ] so the Hash Join between relationship and tweet will become in theory a netsted loop and so the filter relationship.followed_id = t.user_id will be pushed on the new index search condition with also the creation_time > .. and creation_time < ... . In this manner you will reduce the random i/o of the scanning of 1759645 rows from tweet that are filter later now in hash join to 1679.I hope it will work, if not, I hope you could attach the DDL of the table ( with constraints and indexes) to better understand the problem.Bye2013/11/4 Caio Casimiro <casimiro.listas@gmail.com>Hi Elliot, thank you for your answer.I tried this query but it still suffer with index scan on tweet_creation_time_index:"Sort (cost=4899904.57..4899913.19 rows=3447 width=20) (actual time=37560.938..37562.503 rows=1640 loops=1)"" Sort Key: tt.tweet_id"" Sort Method: quicksort Memory: 97kB"" Buffers: shared hit=1849 read=32788"" -> Nested Loop (cost=105592.06..4899702.04 rows=3447 width=20) (actual time=19151.036..37555.227 rows=1640 loops=1)"" Buffers: shared hit=1849 read=32788"" -> Hash Join (cost=105574.10..116461.68 rows=1679 width=8) (actual time=19099.848..19127.606 rows=597 loops=1)"" Hash Cond: (relationship.followed_id = t.user_id)"" Buffers: shared hit=3 read=31870"" -> Index Only Scan using relationship_id on relationship (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106 loops=1)"" -> Hash (cost=83308.25..83308.25 rows=1781234 width=16) (actual time=19031.916..19031.916 rows=1759645 loops=1)"" Buckets: 262144 Batches: 1 Memory Usage: 61863kB"" Buffers: shared hit=1 read=31867"" -> Index Scan using tweet_creation_time_index on tweet t (cost=0.57..83308.25 rows=1781234 
width=16) (actual time=48.595..13759.768 rows=1759645 loops=1)"" Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))"" Buffers: shared hit=1 read=31867"" -> Bitmap Heap Scan on tweet_topic tt (cost=17.96..2841.63 rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"" Recheck Cond: (tweet_id = t.id)"" Buffers: shared hit=1846 read=918"" -> Bitmap Index Scan on tweet_topic_pk (cost=0.00..17.78 rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"" Index Cond: (tweet_id = t.id)"" Buffers: shared hit=1763 read=632"You said that I would need B-Tree indexes on the fields that I want the planner to use index only scan, and I think I have them already on the tweet table:"tweet_ios_index" btree (id, user_id, creation_time)Shouldn't the tweet_ios_index be enough to make the scan over tweet_creation_time_index be a index only scan? And, more important, would it be really faster?Thank you very much,CaioOn Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields.falsehood@gmail.com> wrote:Yes, because that part of the query is kicking back so many rows, many of which are totally unnecessary anyway - you're first getting all the tweets in a particular time range, then limiting them down to just users that are followed. Here's clarification on the approach I mentioned earlier. All you should really need are basic (btree) indexes on your different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id, relationship.follower_id, relationship.followed_id). 
I also changed the left join to an inner join as somebody pointed out that your logic amounted to reducing the match to an inner join anyway.On 2013-11-04 16:10, Caio Casimiro wrote:Hi Neyman, thank you for your answer.Unfortunately this query runs almost at the same time:Sort (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)Sort Key: tt.tweet_idSort Method: quicksort Memory: 97kBBuffers: shared hit=1849 read=32788-> Nested Loop (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)Buffers: shared hit=1849 read=32788-> Hash Semi Join (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)Hash Cond: (t.user_id = relationship.followed_id)Buffers: shared hit=3 read=31870-> Index Scan using tweet_creation_time_index on tweet t (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))Buffers: shared hit=1 read=31867-> Hash (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)Buckets: 1024 Batches: 1 Memory Usage: 3kBBuffers: shared hit=2 read=3-> Index Only Scan using relationship_id on relationship (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)Index Cond: (follower_id = 335093362)Heap Fetches: 0Buffers: shared hit=2 read=3-> Bitmap Heap Scan on tweet_topic tt (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)Recheck Cond: (tweet_id = t.id)Buffers: shared hit=1846 read=918-> Bitmap Index Scan on tweet_topic_pk (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)Index Cond: (tweet_id = t.id)Buffers: shared hit=1763 read=632Total runtime: 25823.386 msI have noticed that in both queries the index scan on 
tweet_creation_time_index is very expensive. Is there anything I can do to make the planner choose an index-only scan?

SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
JOIN tweet AS t
  ON tt.tweet_id = t.id
JOIN relationship
  ON t.user_id = relationship.followed_id
 AND relationship.follower_id = N
WHERE creation_time BETWEEN 'D1' AND 'D2'
ORDER BY tt.tweet_id
;
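Jeff Janes also suggested changing the IN-list to an EXISTS subquery, but that form was never posted in the thread. A sketch against the same assumed schema (untested here):

```sql
-- Hypothetical EXISTS form (not posted in the thread); it expresses the
-- same "followed user" filter as the IN-list, but the planner may choose
-- a different join strategy for it.
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt
JOIN tweet AS t ON tt.tweet_id = t.id
WHERE t.creation_time BETWEEN 'D1' AND 'D2'
  AND EXISTS (SELECT 1
              FROM relationship r
              WHERE r.follower_id = N
                AND r.followed_id = t.user_id)
ORDER BY tt.tweet_id;
```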
From: Caio Casimiro [mailto:casimiro.listas@gmail.com]
Sent: Monday, November 04, 2013 4:33 PM
To: Igor Neyman
Cc: Jeff Janes; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field
These are the parameters I have set in postgresql.conf:
work_mem = 128MB
shared_buffers = 1GB
maintenance_work_mem = 1536MB
fsync = off
synchronous_commit = off
effective_cache_size = 2GB
The hardware is a modest one:
CPU: Intel(R) Atom(TM) CPU 230 @ 1.60GHz
RAM: 2GB
HD: 1TB 7200 RPM (WDC WD10EZEX-00RKKA0)
This machine runs a slackware 14.0 dedicated to the Postgresql.
Thank you,
Caio
With just 2GB RAM, this:
shared_buffers = 1GB
and this:
effective_cache_size = 2GB
is too high.
You should lower those:
shared_buffers = 256MB
effective_cache_size = 1GB
and see how your execution plan changes.
Oh, and this:
maintenance_work_mem = 1536MB
is also too high.
Turning off fsync and synchronous_commit is not a very good idea.
Regards,
Igor Neyman
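Collected as a postgresql.conf fragment, Igor's suggestions for this 2GB machine would look roughly like this (the maintenance_work_mem value is an assumed lower figure; he only said 1536MB is too high):

```ini
# Sketch of postgresql.conf changes per the advice above (2GB total RAM)
shared_buffers = 256MB         # down from 1GB
effective_cache_size = 1GB     # down from 2GB
maintenance_work_mem = 256MB   # assumed value, not specified by Igor
fsync = on                     # fsync = off risks data corruption on a crash
synchronous_commit = on        # off trades durability for commit speed
```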
On Sun, Nov 3, 2013 at 4:05 PM, Caio Casimiro <casimiro.listas@gmail.com> wrote: > System Information: > OS: Slackware 14.0 > Postgresql Version: 9.3 Beta2 This probably doesn't have anything to do with your problem, but it's long past time to migrate from the beta to the production 9.3. merlin
Thank you very much for your answers guys!

On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.listas@gmail.com> wrote:
>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
>> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
>> (SELECT followed_id FROM relationship WHERE follower_id = N)
>> ORDER BY tt.tweet_id;
>
> I don't know if this affects the plan at all, but it is silly to do a left join to "tweet" when the WHERE clause has conditions that can't be satisfied with a null row. Also, you could try changing the IN-list to an EXISTS subquery.

I'm sorry for the ignorance, but I don't understand the issue with the left join; could you explain more?
> Is there some pattern to D1 and D2 that could help the caching? For example, are they both usually in the just-recent past?

The only pattern is that it is always a one-day interval, e.g. D1 = '2013-05-01' and D2 = '2013-05-02'.
Casimiro
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt JOIN tweet AS t ON (tt.tweet_id = t.id
AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N))
"Nested Loop (cost=1.57..5580452.55 rows=3961 width=20) (actual time=33.737..5106.898 rows=3058 loops=1)"
" Buffers: shared hit=3753 read=1278"
" -> Nested Loop (cost=1.00..1005.35 rows=1930 width=8) (actual time=0.070..77.244 rows=978 loops=1)"
" Buffers: shared hit=484 read=5"
" -> Index Only Scan using relationship_id on relationship (cost=0.42..231.12 rows=154 width=8) (actual time=0.034..0.314 rows=106 loops=1)"
" Index Cond: (follower_id = 335093362)"
" Heap Fetches: 0"
" Buffers: shared hit=5"
" -> Index Only Scan using tweet_ios_index on tweet t (cost=0.57..4.90 rows=13 width=16) (actual time=0.025..0.695 rows=9 loops=106)"
" Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-21 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-22 00:00:00-03'::timestamp with time zone))"
" Heap Fetches: 0"
" Buffers: shared hit=479 read=5"
" -> Index Scan using tweet_topic_tweet_id_index on tweet_topic tt (cost=0.57..2883.60 rows=731 width=20) (actual time=5.119..5.128 rows=3 loops=978)"
" Index Cond: (tweet_id = t.id)"
" Buffers: shared hit=3269 read=1273"
"Total runtime: 5110.217 ms"
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT t.id FROM tweet AS t WHERE t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N)
"Nested Loop (cost=1.00..1012.13 rows=2244 width=8) (actual time=0.074..51.855 rows=877 loops=1)"
" Buffers: shared hit=432 read=4"
" -> Index Only Scan using relationship_id on relationship (cost=0.42..227.12 rows=154 width=8) (actual time=0.034..0.218 rows=106 loops=1)"
" Index Cond: (follower_id = 335093362)"
" Heap Fetches: 0"
" Buffers: shared hit=5"
" -> Index Only Scan using tweet_ios_index on tweet t (cost=0.57..4.95 rows=15 width=16) (actual time=0.021..0.468 rows=8 loops=106)"
" Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-22 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-23 00:00:00-03'::timestamp with time zone))"
" Heap Fetches: 0"
" Buffers: shared hit=427 read=4"
"Total runtime: 52.692 ms"
On Mon, Nov 4, 2013 at 12:44 PM, Caio Casimiro <casimiro.listas@gmail.com> wrote:
> Thank you very much for your answers guys!
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I don't know if this affects the plan at all, but it is silly to do a left join to "tweet" when the WHERE clause has conditions that can't be satisfied with a null row. Also, you could try changing the IN-list to an EXISTS subquery.
>
> I'm sorry for the ignorance, but I don't understand the issue with the left join, could you explain more?

A left join means you are telling it to make up an all-NULL tweet row for any tweet_topic that does not have a corresponding tweet. But then once it did so, it would just filter out that row later, because the null creation_time and user_id cannot pass the WHERE criteria--so doing a left join can't change the answer, but it can fool the planner into making a worse choice.

>> Is there some pattern to D1 and D2 that could help the caching? For example, are they both usually in the just-recent past?
>
> The only pattern is that it is always a one-day interval, e.g. D1 = '2013-05-01' and D2 = '2013-05-02'.

If you only compare creation_time to dates, rather than ever using date+time, then it would probably be better to store them in the table as date, not timestamp. This might make the index smaller, and can also lead to better estimates and index usage.

But why would you want to offer suggestions to someone based on tweets that were made on exactly one day, over 5 months ago? I can see why you would want a brief period in the immediate past, or a long period; but a brief period that is not the recent past just seems like a strange thing to want to do. (And it is going to be hard to get good performance with that requirement.)

Cheers,
Jeff
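Jeff's date-column suggestion was never shown as SQL; a hypothetical sketch (the column and index names here are invented for illustration, not from the thread):

```sql
-- Hypothetical: store the day as a date and index it together with the
-- columns the query reads, so a one-day lookup can stay index-only.
ALTER TABLE tweet ADD COLUMN creation_date date;
UPDATE tweet SET creation_date = creation_time::date;
CREATE INDEX tweet_user_date_idx ON tweet (user_id, creation_date, id);
-- The one-day BETWEEN filter then becomes a simple equality test:
--   WHERE t.creation_date = 'D1' AND t.user_id = relationship.followed_id
```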