Re: Index ignored on column containing mostly 0 values - Mailing list pgsql-performance
From | Andreas Kostyrka |
---|---|
Subject | Re: Index ignored on column containing mostly 0 values |
Date | |
Msg-id | 1162307035.18283.3.camel@andi-lap Whole thread Raw |
In response to | Index ignored on column containing mostly 0 values (Leif Mortenson <leiflists@tanukisoftware.com>) |
List | pgsql-performance |
Am Dienstag, den 31.10.2006, 13:04 +0900 schrieb Leif Mortenson: > Hello, > I have been having a problem with the following query ignoring an index > on the foos.bar column. > > SELECT c.id > FROM foos c, bars r > WHERE r.id != 0 > AND r.modified_time > '2006-10-20 10:00:00.000' > AND r.modified_time <= '2006-10-30 15:20:00.000' > AND c.bar = r.id > > The bars table contains 597 rows, while the foos table contains 5031203 > rows. > > After much research I figured out that the problem is being caused by the > PG planner deciding that my foos.bar index is not useful. The data in the > foos.bar column contains 5028698 0 values and 2505 that are ids in the bars > table. > > Both tables have just been analyzed. > > When I EXPLAIN ANALYZE the above query, I get the following: > > "Hash Join (cost=3.06..201642.49 rows=25288 width=8) (actual > time=0.234..40025.514 rows=11 loops=1)" > " Hash Cond: ("outer".bar = "inner".id)" > " -> Seq Scan on foos c (cost=0.00..176225.03 rows=5032303 width=16) > (actual time=0.007..30838.623 rows=5031203 loops=1)" > " -> Hash (cost=3.06..3.06 rows=3 width=8) (actual time=0.117..0.117 > rows=20 loops=1)" > " -> Index Scan using bars_index_modified_time on bars r > (cost=0.00..3.06 rows=3 width=8) (actual time=0.016..0.066 rows=20 loops=1)" > " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without > time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp > without time zone))" > " Filter: (id <> 0)" > "Total runtime: 40025.629 ms" > > The solution I found was to change the statistics on my foos.bar column from > the default -1 to 1000. When I do this, reanalyze the table, and rerun > the above > query, I get the following expected result. > > "Nested Loop (cost=0.00..25194.66 rows=25282 width=8) (actual > time=13.035..23.338 rows=11 loops=1)" > " -> Index Scan using bars_index_modified_time on bars r > (cost=0.00..3.06 rows=3 width=8) (actual time=0.063..0.115 rows=20 loops=1)" > " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without > time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp > without time zone))" > " Filter: (id <> 0)" > " -> Index Scan using foos_index_bar on foos c (cost=0.00..6824.95 > rows=125780 width=16) (actual time=1.141..1.152 rows=1 loops=20)" > " Index Cond: (c.bar = "outer".id)" > "Total runtime: 23.446 ms" > > Having to do this concerns me as I am not sure what a good statistics value > should be. Also we expect this table to grow much larger and I am concerned > that it may not continue to function correctly. I tried a value of 100 > and that > works when the number of bars records is small, but as soon as I increase > them, the query starts ignoring the index again. > > Is increasing the statistics value the best way to resolve this problem? How > can I best decide on a good statistics value? > > Having a column containing large numbers of null or 0 values seems fairly > common. Is there way to tell Postgres to create an index of all values with > meaning. Ie all non-0 values? None that I could find. Have you tried CREATE INDEX partial ON foos (bar) WHERE bar IS NOT NULL; Andreas
Attachment
pgsql-performance by date: