Index "misbehavior" in PostgreSQL 8.2.2? - Mailing list pgsql-general

From Tonnerre LOMBARD
Subject Index "misbehavior" in PostgreSQL 8.2.2?
Date
Msg-id 20070406175755.GC4527@thebsh.sygroup-int.ch
Whole thread Raw
List pgsql-general
Salut,

I have four tables, two of which are relevant for this:

searchengine=# \d websites
                                    Table "public.websites"
   Column   |           Type           |                       Modifiers
------------+--------------------------+-------------------------------------------------------
 id         | bigint                   | not null default nextval('websites_id_seq'::regclass)
 url        | text                     | not null
 title      | text                     |
 abstract   | text                     |
 lastindex  | timestamp with time zone |
 authority  | bigint                   | not null default 0
 failed     | boolean                  | not null default false
 spamminess | smallint                 | not null default 0
Indexes:
    "websites_pkey" PRIMARY KEY, btree (id)
    "websites_url_key" UNIQUE, btree (url)
    "websites_authority_key" btree (authority)
    "websites_failed_key" btree (failed)
    "websites_lastindex_key" btree (lastindex)
    "websites_spamminess_key" btree (spamminess)

searchengine=# SELECT pg_size_pretty (pg_relation_size ('websites'));
 pg_size_pretty
----------------
 293 MB
(1 row)

searchengine=# SELECT COUNT(*) FROM websites;
 count
--------
 828778
(1 row)

searchengine=# \d sitewords
    Table "public.sitewords"
   Column   |  Type  | Modifiers
------------+--------+-----------
 id_site    | bigint | not null
 id_keyword | bigint | not null
 count      | bigint | not null
 ratio      | bigint |
Indexes:
    "sitewords_key" UNIQUE, btree (id_site, id_keyword)
    "sitewords_id_keyword_key" btree (id_keyword)
    "sitewords_id_site_key" btree (id_site)

searchengine=# SELECT COUNT(*) FROM sitewords;
  count
----------
 46696260
(1 row)

searchengine=# SELECT pg_size_pretty (pg_relation_size ('sitewords'));
 pg_size_pretty
----------------
 3041 MB
(1 row)

searchengine=#

My effective_cache_size is set to 768MB.

I am trying to run a query:

searchengine=# EXPLAIN ANALYZE SELECT COUNT(me.id_keyword) * SUM(ratio) AS relevance, id_site FROM sitewords me JOIN
websitesws ON me.id_site = ws.id WHERE me.id_keyword IN (4241,28303) GROUP BY id_site ORDER BY relevance DESC LIMIT 10; 
                                                                            QUERY PLAN
                                          

------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=150319.90..150319.91 rows=3 width=24) (actual time=9211.903..9212.013 rows=10 loops=1)
   ->  Sort  (cost=150319.90..150319.91 rows=3 width=24) (actual time=9211.895..9211.931 rows=10 loops=1)
         Sort Key: ((count(me.id_keyword))::numeric * sum(me.ratio))
         ->  HashAggregate  (cost=150319.81..150319.87 rows=3 width=24) (actual time=9180.744..9197.080 rows=3320
loops=1)
               ->  Hash Join  (cost=60470.11..150116.47 rows=27112 width=24) (actual time=7816.860..9151.601 rows=5747
loops=1)
                     Hash Cond: (me.id_site = ws.id)
                     ->  Bitmap Heap Scan on sitewords me  (cost=609.31..85356.43 rows=27112 width=24) (actual
time=3.245..137.026rows=5748 loops=1) 
                           Recheck Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
                           ->  Bitmap Index Scan on sitewords_id_keyword_key  (cost=0.00..602.53 rows=27112 width=0)
(actualtime=2.169..2.169 rows=5748 loops=1) 
                                 Index Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
                     ->  Hash  (cost=45674.80..45674.80 rows=816080 width=8) (actual time=7786.940..7786.940
rows=829078loops=1) 
                           ->  Seq Scan on websites ws  (cost=0.00..45674.80 rows=816080 width=8) (actual
time=0.026..4126.069rows=829078 loops=1) 
 Total runtime: 9212.679 ms
(13 rows)

searchengine=#

The only way to get PostgreSQL to use the index appears to be to set the
random_page_cost to 1.0 (1.1 also doesn't work). In that case, I get:

searchengine=# EXPLAIN ANALYZE SELECT COUNT(me.id_keyword) * SUM(ratio) AS relevance, id_site FROM sitewords me JOIN
websitesws ON me.id_site = ws.id WHERE me.id_keyword IN (4241,28303) GROUP BY id_site ORDER BY relevance DESC LIMIT 10; 
                                                                            QUERY PLAN
                                          

------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78647.65..78647.66 rows=3 width=24) (actual time=418.368..418.478 rows=10 loops=1)
   ->  Sort  (cost=78647.65..78647.66 rows=3 width=24) (actual time=418.359..418.395 rows=10 loops=1)
         Sort Key: ((count(me.id_keyword))::numeric * sum(me.ratio))
         ->  HashAggregate  (cost=78647.57..78647.63 rows=3 width=24) (actual time=386.734..403.241 rows=3320 loops=1)
               ->  Nested Loop  (cost=310.29..78444.23 rows=27112 width=24) (actual time=2.878..354.418 rows=5747
loops=1)
                     ->  Bitmap Heap Scan on sitewords me  (cost=310.29..26849.19 rows=27112 width=24) (actual
time=2.789..135.915rows=5748 loops=1) 
                           Recheck Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
                           ->  Bitmap Index Scan on sitewords_id_keyword_key  (cost=0.00..303.51 rows=27112 width=0)
(actualtime=1.762..1.762 rows=5748 loops=1) 
                                 Index Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
                     ->  Index Scan using websites_pkey on websites ws  (cost=0.00..1.89 rows=1 width=8) (actual
time=0.017..0.022rows=1 loops=5748) 
                           Index Cond: (me.id_site = ws.id)
 Total runtime: 418.791 ms
(12 rows)

searchengine=#

which is fine. However, if I leave random_page_cost at 4 and do the following
query:

searchengine=# EXPLAIN ANALYZE select x.ratio * x.count, x.id_site from (SELECT me.id_site, count(me.id_site),
sum(me.ratio)as ratio FROM sitewords me WHERE me.id_keyword IN (4241, 28303) group by id_site) x join websites ws on
(ws.id= x.id_site); 
                                                                      QUERY PLAN
                              

------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=85559.77..85585.13 rows=3 width=48) (actual time=169.707..337.717 rows=3320 loops=1)
   ->  HashAggregate  (cost=85559.77..85559.82 rows=3 width=16) (actual time=169.567..184.975 rows=3321 loops=1)
         ->  Bitmap Heap Scan on sitewords me  (cost=609.31..85356.43 rows=27112 width=16) (actual time=2.897..137.590
rows=5748loops=1) 
               Recheck Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
               ->  Bitmap Index Scan on sitewords_id_keyword_key  (cost=0.00..602.53 rows=27112 width=0) (actual
time=1.820..1.820rows=5748 loops=1) 
                     Index Cond: (id_keyword = ANY ('{4241,28303}'::bigint[]))
   ->  Index Scan using websites_pkey on websites ws  (cost=0.00..8.41 rows=1 width=8) (actual time=0.025..0.029 rows=1
loops=3321)
         Index Cond: (ws.id = x.id_site)
 Total runtime: 351.053 ms
(9 rows)

searchengine=#

the result is even more acceptable than with the above JOIN statement.

The tables are analyzed, and I have vacuumed regularly.

Any idea why PostgreSQL attempts a full sequential scan here?


                Tonnerre

Attachment

pgsql-general by date:

Previous
From: "Danny E. Armstrong"
Date:
Subject: Re: YTA Time Zone Question
Next
From: "filippo"
Date:
Subject: Re: perl DBI: problems searching text strings with ' symbol (es d'ambrose)