Re: Question about difference in performance of 2 queries - Mailing list pgsql-performance

From Tom Lane
Subject Re: Question about difference in performance of 2 queries
Date
Msg-id 16345.1072726751@sss.pgh.pa.us
Whole thread Raw
In response to Re: Question about difference in performance of 2 queries  (Sean Shanny <shannyconsulting@earthlink.net>)
Responses Re: Question about difference in performance of 2 queries
Re: Question about difference in performance of 2 queries
List pgsql-performance
Sean Shanny <shannyconsulting@earthlink.net> writes:
> Here is the pg_stats data.  The explain analyze queries are still running.

> select * from pg_stats where tablename = 'f_pageviews' and attname =
> 'content_key';
>  schemaname |  tablename  |   attname   | null_frac | avg_width |
> n_distinct | most_common_vals |   most_common_freqs
> |
> histogram_bounds                                   | correlation
>
------------+-------------+-------------+-----------+-----------+------------+------------------+-----------------------+-------------------------------------------------------------------------------------+-------------
>  public     | f_pageviews | content_key |         0 |         4 |
> 983 | {-1,1528483}     | {0.749333,0.00166667} |

Oh-ho, I see the problem: about 75% of your table has content_key = -1.

Why is that a problem, you ask?  Well, the planner realizes that
"content_key > -1" is a pretty good restriction condition (better than
the date condition, apparently) and so it tries to use that as the index
scan condition.  The problem is that in 7.4 and before, the btree index
code implements a "> -1" scan starting boundary by finding the first -1
and then advancing to the first key that's not -1.  So you end up
scanning through 75% of the index before anything useful happens :-(

I just fixed this poor behavior in CVS tip a couple weeks ago:
http://archives.postgresql.org/pgsql-committers/2003-12/msg00220.php
but the patch seems too large and unproven to risk back-patching into
7.4.*.

If you expect that a pretty large fraction of your data will always have
dummy content_key, it'd probably be worth changing the index to not
index -1's at all --- that is, make it a partial index with the
condition "WHERE content_key > -1".  Another workaround is to leave the
index as-is but phrase the query WHERE condition as "content_key >= 0"
instead of "> -1".

            regards, tom lane

pgsql-performance by date:

Previous
From: Sean Shanny
Date:
Subject: Re: Question about difference in performance of 2 queries
Next
From: Sean Shanny
Date:
Subject: Re: Question about difference in performance of 2 queries