Thread: Index ignored on column containing mostly 0 values

Index ignored on column containing mostly 0 values

From
Leif Mortenson
Date:
Hello,
I have been having a problem with the following query ignoring an index
on the foos.bar column.

SELECT c.id
FROM foos c, bars r
WHERE r.id != 0
AND r.modified_time > '2006-10-20 10:00:00.000'
AND r.modified_time <= '2006-10-30 15:20:00.000'
AND c.bar = r.id

The bars table contains 597 rows, while the foos table contains 5031203
rows.

After much research I figured out that the problem is being caused by the
PG planner deciding that my foos.bar index is not useful. The data in the
foos.bar column contains 5028698 0 values and 2505 that are ids in the bars
table.

Both tables have just been analyzed.

When I EXPLAIN ANALYZE the above query, I get the following:

"Hash Join (cost=3.06..201642.49 rows=25288 width=8) (actual
time=0.234..40025.514 rows=11 loops=1)"
" Hash Cond: ("outer".bar = "inner".id)"
" -> Seq Scan on foos c (cost=0.00..176225.03 rows=5032303 width=16)
(actual time=0.007..30838.623 rows=5031203 loops=1)"
" -> Hash (cost=3.06..3.06 rows=3 width=8) (actual time=0.117..0.117
rows=20 loops=1)"
" -> Index Scan using bars_index_modified_time on bars r
(cost=0.00..3.06 rows=3 width=8) (actual time=0.016..0.066 rows=20 loops=1)"
" Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
without time zone))"
" Filter: (id <> 0)"
"Total runtime: 40025.629 ms"

The solution I found was to change the statistics on my foos.bar column from
the default -1 to 1000. When I do this, reanalyze the table, and rerun
the above
query, I get the following expected result.

"Nested Loop (cost=0.00..25194.66 rows=25282 width=8) (actual
time=13.035..23.338 rows=11 loops=1)"
" -> Index Scan using bars_index_modified_time on bars r
(cost=0.00..3.06 rows=3 width=8) (actual time=0.063..0.115 rows=20 loops=1)"
" Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
without time zone))"
" Filter: (id <> 0)"
" -> Index Scan using foos_index_bar on foos c (cost=0.00..6824.95
rows=125780 width=16) (actual time=1.141..1.152 rows=1 loops=20)"
" Index Cond: (c.bar = "outer".id)"
"Total runtime: 23.446 ms"

Having to do this concerns me as I am not sure what a good statistics value
should be. Also we expect this table to grow much larger and I am concerned
that it may not continue to function correctly. I tried a value of 100
and that
works when the number of bars records is small, but as soon as I increase
them, the query starts ignoring the index again.

Is increasing the statistics value the best way to resolve this problem? How
can I best decide on a good statistics value?

Having a column containing large numbers of null or 0 values seems fairly
common. Is there way to tell Postgres to create an index of all values with
meaning. Ie all non-0 values? None that I could find.

Thanks in advance,
Leif



Re: Index ignored on column containing mostly 0 values

From
Michael Glaesemann
Date:
On Oct 31, 2006, at 13:04 , Leif Mortenson wrote:

> Hello,
> I have been having a problem with the following query ignoring an
> index
> on the foos.bar column.
>
> SELECT c.id
> FROM foos c, bars r
> WHERE r.id != 0
> AND r.modified_time > '2006-10-20 10:00:00.000'
> AND r.modified_time <= '2006-10-30 15:20:00.000'
> AND c.bar = r.id

<snip />

> Having a column containing large numbers of null or 0 values seems
> fairly
> common. Is there way to tell Postgres to create an index of all
> values with
> meaning. Ie all non-0 values? None that I could find.

Try

create index foo_non_zero_bar_index on foos(bar) where bar <> 0;

Take a look on the docs on partial indexes for more information.

http://www.postgresql.org/docs/current/interactive/indexes-partial.html

Hope this helps.

Michael Glaesemann
grzm seespotcode net



Re: Index ignored on column containing mostly 0 values

From
Tom Lane
Date:
Leif Mortenson <leiflists@tanukisoftware.com> writes:
> Having a column containing large numbers of null or 0 values seems fairly
> common.

You would likely be better off to use NULL as a no-value placeholder,
instead of an arbitrarily chosen regular value (which the planner cannot
be certain does not match any entries in the other table...)

> Is there way to tell Postgres to create an index of all values with
> meaning. Ie all non-0 values? None that I could find.

Partial index.  Though I'm not sure that would help here.  The problem
is that the nestloop join you want would be spectacularly awful if there
happened to be any zeroes in bars.id, and the planner's statistical
estimates allow some probability of that happening.

            regards, tom lane

Re: Index ignored on column containing mostly 0 values

From
Andreas Kostyrka
Date:
Am Dienstag, den 31.10.2006, 13:04 +0900 schrieb Leif Mortenson:
> Hello,
> I have been having a problem with the following query ignoring an index
> on the foos.bar column.
>
> SELECT c.id
> FROM foos c, bars r
> WHERE r.id != 0
> AND r.modified_time > '2006-10-20 10:00:00.000'
> AND r.modified_time <= '2006-10-30 15:20:00.000'
> AND c.bar = r.id
>
> The bars table contains 597 rows, while the foos table contains 5031203
> rows.
>
> After much research I figured out that the problem is being caused by the
> PG planner deciding that my foos.bar index is not useful. The data in the
> foos.bar column contains 5028698 0 values and 2505 that are ids in the bars
> table.
>
> Both tables have just been analyzed.
>
> When I EXPLAIN ANALYZE the above query, I get the following:
>
> "Hash Join (cost=3.06..201642.49 rows=25288 width=8) (actual
> time=0.234..40025.514 rows=11 loops=1)"
> " Hash Cond: ("outer".bar = "inner".id)"
> " -> Seq Scan on foos c (cost=0.00..176225.03 rows=5032303 width=16)
> (actual time=0.007..30838.623 rows=5031203 loops=1)"
> " -> Hash (cost=3.06..3.06 rows=3 width=8) (actual time=0.117..0.117
> rows=20 loops=1)"
> " -> Index Scan using bars_index_modified_time on bars r
> (cost=0.00..3.06 rows=3 width=8) (actual time=0.016..0.066 rows=20 loops=1)"
> " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
> time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
> without time zone))"
> " Filter: (id <> 0)"
> "Total runtime: 40025.629 ms"
>
> The solution I found was to change the statistics on my foos.bar column from
> the default -1 to 1000. When I do this, reanalyze the table, and rerun
> the above
> query, I get the following expected result.
>
> "Nested Loop (cost=0.00..25194.66 rows=25282 width=8) (actual
> time=13.035..23.338 rows=11 loops=1)"
> " -> Index Scan using bars_index_modified_time on bars r
> (cost=0.00..3.06 rows=3 width=8) (actual time=0.063..0.115 rows=20 loops=1)"
> " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
> time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
> without time zone))"
> " Filter: (id <> 0)"
> " -> Index Scan using foos_index_bar on foos c (cost=0.00..6824.95
> rows=125780 width=16) (actual time=1.141..1.152 rows=1 loops=20)"
> " Index Cond: (c.bar = "outer".id)"
> "Total runtime: 23.446 ms"
>
> Having to do this concerns me as I am not sure what a good statistics value
> should be. Also we expect this table to grow much larger and I am concerned
> that it may not continue to function correctly. I tried a value of 100
> and that
> works when the number of bars records is small, but as soon as I increase
> them, the query starts ignoring the index again.
>
> Is increasing the statistics value the best way to resolve this problem? How
> can I best decide on a good statistics value?
>
> Having a column containing large numbers of null or 0 values seems fairly
> common. Is there way to tell Postgres to create an index of all values with
> meaning. Ie all non-0 values? None that I could find.
Have you tried

CREATE INDEX partial ON foos (bar) WHERE bar IS NOT NULL;

Andreas


Attachment