Home > mailing lists

Fw: Getting rid of a seq scan in query on a large table - Mailing list pgsql-performance

From	Denis de Bernardy
Subject	Fw: Getting rid of a seq scan in query on a large table
Date	June 27, 2011 20:01:03
Msg-id	1309215653.11968.YahooMailNeo@web112401.mail.gq1.yahoo.com Whole thread Raw
In response to	Getting rid of a seq scan in query on a large table (Jens Hoffrichter <jens.hoffrichter@gmail.com>)
List	pgsql-performance

Tree view

----- Forwarded Message -----
From: Denis de Bernardy <ddebernardy@yahoo.com>
To: Jens Hoffrichter <jens.hoffrichter@gmail.com>
Sent: Tuesday, June 28, 2011 12:59 AM
Subject: Re: [PERFORM] Getting rid of a seq scan in query on a large table

> Hash Cond: (posts.poster_id = posters.poster_id)
> -> Seq Scan on posts (cost=0.00..11862.12 rows=112312 width=24) (actual time=0.019..60.092 rows=112312 loops=1)

Unless I am mistaking, you've very few poster ids in there (since the two rows arguments are equal). The Postgres planner will identify this and just seq scan the whole thing instead of bothering to randomly access the rows one by one using the index. This looks like a case where you actually do not want it to use an index scan -- doing so will be slower.

D

From: Jens Hoffrichter <jens.hoffrichter@gmail.com>
To: pgsql-performance@postgresql.org
Sent: Monday, June 27, 2011 2:46 PM
Subject: [PERFORM] Getting rid of a seq scan in query on a large table

Hi everyone,

I'm having trouble getting rid of a sequential scan on a table with roughly 120k entries it. Creation of an index on that particular column which triggers the sequential scan doesn't do anything, VACUUM and ANALYZE has been done on the table.

The table in question has the following definition:

Column | Type | Modifiers
--------------------+--------------------------+------------------------------------------------------------------
post_id | bigint | not null default nextval('posts_post_id_seq'::regclass)
forum_id | bigint | not null
threadlink | character varying(255) | not null
timestamp | timestamp with time zone | not null
poster_id | bigint |
thread_id | bigint | not null
subject | text | not null
text | text | not null
postername | character varying(255) |
internal_post_id | bigint | not null default nextval('posts_internal_post_id_seq'::regclass)
internal_thread_id | bigint |
Indexes:
"posts_pkey" PRIMARY KEY, btree (internal_post_id)
"posts_forum_id_key" UNIQUE, btree (forum_id, post_id)
"idx_internal_thread_id" btree (internal_thread_id)
"idx_posts_poster_id" btree (poster_id)
Foreign-key constraints:
"posts_forum_id_fkey" FOREIGN KEY (forum_id) REFERENCES forums(forum_id)
"posts_internal_thread_id_fkey" FOREIGN KEY (internal_thread_id) REFERENCES threads(internal_thread_id)
"posts_poster_id_fkey" FOREIGN KEY (poster_id) REFERENCES posters(poster_id)

The query is this:

SELECT threads.internal_thread_id AS threads_internal_thread_id, threads.forum_id AS threads_forum_id, threads.thread_id AS threads_thread_id, threads.title AS threads_title, threads.poster_id AS threads_poster_id, threads.postername AS threads_postername, threads.category AS threads_category, threads.posttype AS threads_posttype FROM threads JOIN posts ON threads.internal_thread_id = posts.internal_thread_id JOIN posters ON posts.poster_id = posters.poster_id JOIN posters_groups AS posters_groups_1 ON posters.poster_id = posters_groups_1.poster_id JOIN groups ON groups.group_id = posters_groups_1.group_id WHERE groups.group_id = 4 ORDER BY posts.timestamp DESC;

The query plan (with an explain analyze) gives me the following:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=13995.93..14006.63 rows=4279 width=108) (actual time=79.927..79.947 rows=165 loops=1)
Sort Key: posts."timestamp"
Sort Method: quicksort Memory: 50kB
-> Nested Loop (cost=6.97..13737.84 rows=4279 width=108) (actual time=0.605..79.693 rows=165 loops=1)
-> Seq Scan on groups (cost=0.00..1.05 rows=1 width=8) (actual time=0.013..0.014 rows=1 loops=1)
Filter: (group_id = 4)
-> Nested Loop (cost=6.97..13694.00 rows=4279 width=116) (actual time=0.587..79.616 rows=165 loops=1)
-> Hash Join (cost=6.97..12343.10 rows=4279 width=24) (actual time=0.568..78.230 rows=165 loops=1)
Hash Cond: (posts.poster_id = posters.poster_id)
-> Seq Scan on posts (cost=0.00..11862.12 rows=112312 width=24) (actual time=0.019..60.092 rows=112312 loops=1)
-> Hash (cost=6.79..6.79 rows=14 width=24) (actual time=0.101..0.101 rows=14 loops=1)
-> Hash Join (cost=2.14..6.79 rows=14 width=24) (actual time=0.060..0.093 rows=14 loops=1)
Hash Cond: (posters.poster_id = posters_groups_1.poster_id)
-> Seq Scan on posters (cost=0.00..3.83 rows=183 width=8) (actual time=0.006..0.023 rows=185 loops=1)
-> Hash (cost=1.96..1.96 rows=14 width=16) (actual time=0.025..0.025 rows=14 loops=1)
-> Seq Scan on posters_groups posters_groups_1 (cost=0.00..1.96 rows=14 width=16) (actual time=0.016..0.021 rows=14 loops=1)
Filter: (group_id = 4)
-> Index Scan using threads_pkey on threads (cost=0.00..0.30 rows=1 width=100) (actual time=0.006..0.007 rows=1 loops=165)
Index Cond: (threads.internal_thread_id = posts.internal_thread_id)
Total runtime: 80.137 ms
(20 rows)

So the big time lost is in this line:

Seq Scan on posts (cost=0.00..11862.12 rows=112312 width=24) (actual time=0.019..60.092 rows=112312 loops=1)

which I can understand why it slow ;)

But I haven't yet managed to convert the Seq Scan into an Index Scan, and I'm not sure how to continue there.

As I am not a big expert on psql optimization, any input would be greatly appreciated.

Best regards,
Jens

pgsql-performance by date:

From: "Kevin Grittner"
Date: 27 June 2011, 17:56:56
Subject: Re: Performance issue with Insert

From: Greg Smith
Date: 27 June 2011, 21:37:18
Subject: Re: Long Running Update - My Solution

Fw: Getting rid of a seq scan in query on a large table - Mailing list pgsql-performance

Previous

Next