Thread: Postgres optimizer choosing wrong index

Postgres optimizer choosing wrong index

From: Jack Orenstein
I'm using postgres 7.4 and having a problem with the query optimizer. Our table,
T, looks like this:

            dh int
            fh int
            nm int
            ... -- other columns

A typical row is 400-500 bytes.

T has two indexes, idx_df on (dh, fh) and idx_dn on (dh, nm).

My query is

    select * from T
    where dh = ? and fh = ?

If I run EXPLAIN on this query (plugging in values 1 and 2 for the variables)
before VACUUM ANALYZE, I get the desired execution plan:

  Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
    Index Cond: ((dh = 1) AND (fh = 2))

But after VACUUM ANALYZE:

  Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
    Index Cond: (dh = 1)
    Filter: (fh = 2)

Notice that postgres is now using the other index. This behavior is somewhat
dependent on the values plugged in. I ran a query to count dh values:

     select dir_hash, count(*) from external_file group by dir_hash;

       dh        | count
     ------------+--------
       916645488 |  20000
      1057692240 | 200000

And if I use 1057692240 in the EXPLAIN, I get the desired plan:

  Index Scan using idx_df on external_file  (cost=0.00..5.27 rows=1 width=561)
    Index Cond: ((dir_hash = 1057692240) AND (fn_hash = 2))

I've tried playing with various cost settings (e.g. random_page_cost), but have
been unable to influence the optimizer's behavior. Rewriting the query as ...
where (dh, fh) = (?, ?) doesn't help.
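[Editor's note: the practical difference between the two plans above is easy to see in miniature. This is a rough Python sketch of the work each plan does, not Postgres internals; the row counts are made up to mirror the table described in this thread.]

```python
from bisect import bisect_left

# Made-up data shaped like the table above: one common dh value,
# fh nearly unique, nm uncorrelated.
rows = [(1, fh, fh % 50) for fh in range(200_000)]  # (dh, fh, nm)

# Plan 1 (idx_df): both predicates are index conditions, so the
# "index" is probed directly for (dh, fh) and touches only matches.
idx_df = sorted((dh, fh) for dh, fh, _ in rows)
i = bisect_left(idx_df, (1, 2))
touched_df = 0
while i < len(idx_df) and idx_df[i] == (1, 2):
    touched_df += 1
    i += 1

# Plan 2 (idx_dn): only dh is an index condition; every row with
# dh = 1 is visited, and fh = 2 is checked as a per-row Filter.
idx_dn = sorted((dh, nm, fh) for dh, fh, nm in rows)
i = bisect_left(idx_dn, (1,))
touched_dn = 0
matches = 0
while i < len(idx_dn) and idx_dn[i][0] == 1:
    touched_dn += 1
    if idx_dn[i][2] == 2:   # the Filter step
        matches += 1
    i += 1

print(touched_df, touched_dn, matches)  # 1 200000 1
```

Same one-row result either way, but the filtering plan touches every row sharing the dh value.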

So a few questions:

- Why would the optimizer ever choose idx_dn over idx_df given that idx_df has
to be more selective?

- Is there any way to force the use of idx_df?

Jack Orenstein


P.S. Yes, I know, 7.4. We're upgrading to 8.3 but we have this problem right now.

Re: Postgres optimizer choosing wrong index

From: Tom Lane
Jack Orenstein <jack.orenstein@hds.com> writes:
> If I run EXPLAIN on this query, (plugging in values 1 and 2 for the variables),
> before VACUUM ANALYZE, I get the desired execution plan:

>   Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
>     Index Cond: ((dh = 1) AND (fh = 2))

> But after VACUUM ANALYZE:

>   Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
>     Index Cond: (dh = 1)
>     Filter: (fh = 2)

> Notice that postgres is now using the other index. This behavior is somewhat
> dependent on the values plugged in. I ran a query to count dh values:

>      select dir_hash, count(*) from external_file group by dir_hash;

>        dh        | count
>      ------------+--------
>        916645488 |  20000
>       1057692240 | 200000

So you're plugging in a value that doesn't appear in the index, and
Postgres knows it?

I think that in this example, the estimated costs are going to be
exactly the same either way, so it's kind of a tossup which index
gets chosen --- and it's not actually going to matter any at runtime
either.  Descending the btree to find that there's no matching entry
is going to take just about the same amount of time in either index.
If you plug in a value that *does* occur in the table it should probably
choose the more-relevant index consistently.

            regards, tom lane
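[Editor's note: Tom's point about absent keys can be sketched in Python. Binary search stands in for a btree descent here, which it only loosely resembles; the data is invented to match the counts in the thread.]

```python
def probe_steps(index, key):
    """Comparisons a binary search makes before concluding the key is absent."""
    lo, hi, steps = 0, len(index), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if index[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps

# Two indexes over the same 200k rows, differing only in their second column.
idx_df = sorted((916645488 + i, i) for i in range(200_000))       # (dh, fh)
idx_dn = sorted((916645488 + i, i % 50) for i in range(200_000))  # (dh, nm)

missing_dh = 1  # a dh value present in neither index
print(probe_steps(idx_df, missing_dh), probe_steps(idx_dn, missing_dh))  # 18 18
```

With a key that matches nothing, either descent bottoms out in the same number of steps, so neither index is cheaper.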

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> If I run EXPLAIN on this query, (plugging in values 1 and 2 for the variables),
>> before VACUUM ANALYZE, I get the desired execution plan:
>
>>   Index Scan using idx_df on T  (cost=0.00..4.83 rows=1 width=454)
>>     Index Cond: ((dh = 1) AND (fh = 2))
>
>> But after VACUUM ANALYZE:
>
>>   Index Scan using idx_dn on T  (cost=0.00..5.27 rows=1 width=561)
>>     Index Cond: (dh = 1)
>>     Filter: (fh = 2)
>
>> Notice that postgres is now using the other index. This behavior is somewhat
>> dependent on the values plugged in. I ran a query to count dh values:
>
>>      select dir_hash, count(*) from external_file group by dir_hash;
>
>>        dh        | count
>>      ------------+--------
>>        916645488 |  20000
>>       1057692240 | 200000
>
> So you're plugging in a value that doesn't appear in the index, and
> Postgres knows it?
>
> I think that in this example, the estimated costs are going to be
> exactly the same either way, so it's kind of a tossup which index
> gets chosen --- and it's not actually going to matter any at runtime
> either.  Descending the btree to find that there's no matching entry
> is going to take just about the same amount of time in either index.
> If you plug in a value that *does* occur in the table it should probably
> choose the more-relevant index consistently.

Unfortunately, it matters a lot at runtime. The dh value is not very selective,
as shown by the statistics above. But the fh value is highly selective. The
idx_df index makes use of this selectivity; the idx_dn index does not. In fact,
the reason I noticed this problem was horrendous runtime performance, sometimes.
I started poking around and noticed the VACUUM ANALYZE dependence.

Given that the estimated costs are about the same, I'm wondering why postgres
doesn't go with the index that can do more work in the "Index Cond" part of the
plan.

Do you know how I can force use of the idx_df index?

Jack

Re: Postgres optimizer choosing wrong index

From: Tom Lane
Jack Orenstein <jack.orenstein@hds.com> writes:
> Tom Lane wrote:
>> If you plug in a value that *does* occur in the table it should probably
>> choose the more-relevant index consistently.

> Unfortunately, it matters a lot at runtime. The dh value is not very selective,
> as shown by the statistics above.

A dh value that does not occur in the index is *perfectly* selective.
I'm not sure what your problem is but this example isn't illustrating
anything wrong that I can see.

            regards, tom lane

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Researching this some more, it appears to be the case that VACUUM (by itself, no
ANALYZE) is changing the optimizer's behavior. Here is a self-contained test:


select '*** drop t';
drop table t cascade;

select '*** create t(dh, fh, nm, filler)';
create table t (dh int, fh int, nm int, filler char(500));

select '*** create index on (dh, fh)';
create index idx_df on t(dh, fh);

select '*** create index on (dh, nm)';
create index idx_dn on t(dh, nm);

select '*** explain select * from t where dh = 1 and fh = 2';
explain select * from t where dh = 1 and fh = 2;

select '*** vacuum t (no analyze)';
vacuum t;

select '*** explain select * from t where dh = 1 and fh = 2';
explain select * from t where dh = 1 and fh = 2;


This output was produced by 7.4.8. Version 8.3.4 uses the "wrong" execution plan
both before and after the vacuum.

Jack

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> Tom Lane wrote:
>>> If you plug in a value that *does* occur in the table it should probably
>>> choose the more-relevant index consistently.
>
>> Unfortunately, it matters a lot at runtime. The dh value is not very selective,
>> as shown by the statistics above.
>
> A dh value that does not occur in the index is *perfectly* selective.
> I'm not sure what your problem is but this example isn't illustrating
> anything wrong that I can see.

I see your point.

I may have simplified too far. Our application runs a number of different
queries. All our WHERE clauses restrict dh and fh. For a given pair of (dh, fh)
values, the initial query should come up empty and then insert this pair, and
then there is further processing (SELECT, UPDATE). Something is causing a huge
number of index row reads (according to pg_stat_user_indexes) but only in tables
that have been vacuumed.

Jack

Re: Postgres optimizer choosing wrong index

From: Tom Lane
Jack Orenstein <jack.orenstein@hds.com> writes:
> I may have simplified too far. Our application runs a number of
> different queries. All our WHERE clauses restrict dh and fh. For a
> given pair of (dh, fh) values, the initial query should come up empty
> and then insert this pair, and then there is further processing
> (SELECT, UPDATE). Something is causing a huge number of index row
> reads (according to pg_stat_user_indexes) but only in tables that have
> been vacuumed.

Well, that's a bit more concrete but it's still difficult to tell where
the problem is.  Are you by any chance creating new tables and then
vacuuming them while they're still empty?  That would cause
pg_class.relpages to get set to zero, and 7.4.x is not bright enough to
change plans when you later fill the table (until you do another vacuum
or analyze).  However, I think 7.4 would always choose a seqscan on a
table it thinks is zero-size, so I'm not sure that that's what's going
on here.

            regards, tom lane

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> I may have simplified too far. Our application runs a number of
>> different queries. All our WHERE clauses restrict dh and fh. For a
>> given pair of (dh, fh) values, the initial query should come up empty
>> and then insert this pair, and then there is further processing
>> (SELECT, UPDATE). Something is causing a huge number of index row
>> reads (according to pg_stat_user_indexes) but only in tables that have
>> been vacuumed.
>
> Well, that's a bit more concrete but it's still difficult to tell where
> the problem is.  Are you by any chance creating new tables and then
> vacuuming them while they're still empty?  That would cause
> pg_class.relpages to get set to zero, and 7.4.x is not bright enough to
> change plans when you later fill the table (until you do another vacuum
> or analyze).  However, I think 7.4 would always choose a seqscan on a
> table it thinks is zero-size, so I'm not sure that that's what's going
> on here.

I've collected much more detailed information on this problem.

(Recap: in 7.4, table T(dh int, fh int, nm int, ...) has indexes idx_df on (dh,
fh) and idx_dn on (dh, nm). Queries of the form "... where dh = ? and fh = ?"
appear to be using the (dh, nm) index, leading to horrendous performance in
cases where there are very few unique dh values and T has been vacuumed.)

The optimizer is definitely picking the wrong index even when the query
restricts dh to a value actually present in the table. (Yesterday, I was misled
by running EXPLAIN with a value not in the table.) Tom's speculation above is
mostly correct except that, as shown below, the optimizer is not choosing a seqscan.

EXPERIMENT:

- I created two schemas, NOVAC and VAC, each with a table T as described above.

- Before loading data, I ran VACUUM ANALYZE on VAC.T.

- I then started loading data. The workload is a mixture of INSERT, SELECT, and
UPDATE. For SELECT and UPDATE the WHERE clause always includes "dh = ? and fh = ?".

PG_STAT RESULTS:

Stats collection was enabled. Here is sample output during the run (yes, those
are real numbers; I just happened to run the query when the second row's
counters hit those nice round numbers).

     ris=# select schemaname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
           from pg_stat_user_indexes
           where schemaname in ('novac', 'vac') and relname = 'external_file'
           order by 1, 2;
      schemaname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
     ------------+--------------+----------+--------------+---------------
      novac      | idx_dn       |        0 |            0 |             0
      novac      | idx_dh       |    25000 |        20000 |         20000
      vac        | idx_dn       |    16093 |     25991728 |      25991728
      vac        | idx_dh       |        0 |            0 |             0

This shows the different index usage in the two schemas. The schema containing
the vacuumed table used idx_dn exclusively; in the other schema, idx_dh was used
exclusively.
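[Editor's note: a quick back-of-the-envelope from those counters makes the damage concrete. Python, using the numbers in the pg_stat table above.]

```python
# vac schema: every lookup went through idx_dn and filtered fh per row.
vac_scans, vac_tuples = 16093, 25991728
# novac schema: the (dh, fh) index satisfied both predicates as index conditions.
novac_scans, novac_tuples = 25000, 20000

print(round(vac_tuples / vac_scans))   # ~1615 index tuples read per scan
print(novac_tuples / novac_scans)      # 0.8 tuples read per scan
```

Roughly three orders of magnitude more index tuples touched per lookup in the vacuumed schema.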

EXPLAIN PLAN RESULTS:

This query shows the distribution of T.dh values in NOVAC:

     ris=# select dh, count(*) from novac.t group by dh;
      dh        | count
     -----------+-------
      280433162 |  5000
      601454890 |  5000

Using one of these dh values, the optimizer selects idx_df as I expected (the fh
values are nearly unique; 0 is the median of the possible values):

     ris=# explain select * from novac.t where dh = 280433162 and fh = 0;
                                          QUERY PLAN
      ------------------------------------------------------------------
       Index Scan using idx_df on t  (cost=0.00..4.83 rows=1 width=454)
         Index Cond: ((dh = 280433162) AND (fh = 0))
      (2 rows)

But in the VAC schema:

     ris=# select dh, count(*) from vac.t group by dh;
       dh        | count
      -----------+-------
       758082190 |  5000
       980351022 |  5000
      (2 rows)

     ris=# explain select * from vac.t where dh = 758082190 and fh = 0;
                                          QUERY PLAN
      ------------------------------------------------------------------
       Index Scan using idx_dn on t  (cost=0.00..3.68 rows=1 width=454)
         Index Cond: (dh = 758082190)
         Filter: (fh = 0)
      (3 rows)

Here are page and tuple counts:

     ris=# select n.nspname, relname, relpages, reltuples
           from pg_class, pg_namespace n
           where relname = 't' and relnamespace = n.oid;
      nspname  | relname | relpages | reltuples
     ----------+---------+----------+-----------
      novac    | t       |       10 |      1000
      vac      | t       |        0 |         0

From Tom Lane's earlier email, the vacuum prior to the load explains the zeros
in the second row.

pg_stats has no data for table T in either schema. In the case of the VAC
schema, I'm guessing this is because the table was actually empty when analyzed.


CONCLUSION:

The optimizer is objectively making the wrong choice. The pg_stat results
included above (as well as the block-level counts) show that the idx_df index
should have been used. While the dh part of the index is not highly selective,
the fh part is. And with very high ratios of fh to dh values (as in my
experiment), it matters a lot. The transaction rates measured by the test
program make this clear even without looking at the pg_stat numbers.

Admittedly, the optimizer doesn't have much to go on. But I would think that
with a WHERE clause of the form "dh = ? and fh = ?", choosing the (dh, fh) index
over any other index would be a reasonable heuristic.
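[Editor's note: the heuristic Jack proposes is easy to state: prefer the index whose leading columns cover the most equality predicates. A hedged Python sketch of just that rule follows; it is not what any Postgres version actually implements.]

```python
def covered_prefix(index_cols, eq_cols):
    """Count leading index columns that appear as equality predicates."""
    n = 0
    for col in index_cols:
        if col not in eq_cols:
            break
        n += 1
    return n

indexes = {"idx_df": ("dh", "fh"), "idx_dn": ("dh", "nm")}
eq_cols = {"dh", "fh"}  # from: where dh = ? and fh = ?

best = max(indexes, key=lambda name: covered_prefix(indexes[name], eq_cols))
print(best)  # idx_df -- it turns both predicates into index conditions
```

Under this rule idx_df scores 2 and idx_dn scores 1, so the tie never arises regardless of statistics.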


Jack


Re: Postgres optimizer choosing wrong index

From: Tom Lane
Jack Orenstein <jack.orenstein@hds.com> writes:
> - I created two schemas, NOVAC and VAC, each with a table T as described above.

> - Before loading data, I ran VACUUM ANALYZE on VAC.T.

> - I then started loading data. The workload is a mixture of INSERT,   SELECT and
> UPDATE. For SELECT and UPDATE the WHERE clause always includes "dh = ? and fh = ?".

Basically your problem here is that vacuum records the size of the table
as zero (in pg_class.relpages/reltuples) and that causes the computed
costs of the two indexscans to be exactly the same, so it's a tossup
which one gets used.  (In recent versions I think the index with higher
OID would typically get chosen in a tie, but I forget if 7.4 worked that
way.)

8.0 and up are smart enough not to believe pg_class.relpages anymore
after you've loaded a lot of data, but 7.4 isn't.  In testing similar
cases here, I get reasonable cost estimates and a sane plan choice
from 7.4 so long as the stats are up to date.

Bottom line: you need to vacuum (or preferably analyze) *after*
initially populating a table, not before.

            regards, tom lane
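[Editor's note: the tie Tom describes can be caricatured in a few lines. This toy cost function is not the real 7.4 cost model; it just shows that once relpages and reltuples are both zero, selectivity drops out and every index scan gets the same estimate.]

```python
def index_scan_cost(relpages, reltuples, selectivity,
                    random_page_cost=4.0, cpu_tuple_cost=0.01):
    # Toy model: pages fetched plus per-tuple CPU, both scaled by selectivity.
    return (random_page_cost * relpages * selectivity
            + cpu_tuple_cost * reltuples * selectivity)

# Stats as recorded by a vacuum of the still-empty table:
cost_df = index_scan_cost(0, 0, selectivity=0.00001)  # (dh, fh): very selective
cost_dn = index_scan_cost(0, 0, selectivity=0.5)      # (dh, nm): dh only
print(cost_df == cost_dn)  # True -- a tossup, whichever index wins the tie

# With honest (invented but nonzero) stats the tie breaks the right way:
cost_df = index_scan_cost(relpages=1429, reltuples=10000, selectivity=0.0001)
cost_dn = index_scan_cost(relpages=1429, reltuples=10000, selectivity=0.5)
print(cost_df < cost_dn)  # True
```

Zero stats flatten every estimate to the same number; any nonzero stats let selectivity separate the two plans.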

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> - I created two schemas, NOVAC and VAC, each with a table T as described above.
>
>> - Before loading data, I ran VACUUM ANALYZE on VAC.T.
>
>> - I then started loading data. The workload is a mixture of INSERT,   SELECT and
>> UPDATE. For SELECT and UPDATE the WHERE clause always includes "dh = ? and fh = ?".
>
> Basically your problem here is that vacuum records the size of the table
> as zero (in pg_class.relpages/reltuples) and that causes the computed
> costs of the two indexscans to be exactly the same, so it's a tossup
> which one gets used.  (In recent versions I think the index with higher
> OID would typically get chosen in a tie, but I forget if 7.4 worked that
> way.)
>
> 8.0 and up are smart enough not to believe pg_class.relpages anymore
> after you've loaded a lot of data, but 7.4 isn't.  In testing similar
> cases here, I get reasonable cost estimates and a sane plan choice
> from 7.4 so long as the stats are up to date.
>
> Bottom line: you need to vacuum (or preferably analyze) *after*
> initially populating a table, not before.

OK, I've added this behavior to my application. As the table is being loaded, I
run VACUUM ANALYZE every 500 inserts, until we get to size 10,000. I know this
is working because of application-level logging, and because I see relpages and
reltuples go up.

EXPLAIN says that the correct index is being used -- it didn't use to. However,
pg_stat* says otherwise. In my test, I have exactly one dh value. Running
EXPLAIN with this value produces a plan using idx_dh (the correct index), but
pg_stat_user_indexes says that idx_dn is being used (see psql session below).

This works itself out eventually: the pg_stat counters for idx_dh start going
up, showing that that index is finally being used. But the discrepancy between
EXPLAIN and actual query execution is making life very difficult.

Is the discrepancy between EXPLAIN and pg_stats due to some sort of caching per
connection? E.g., a connection that uses one plan for a query is stuck with that
plan for that query?

What would really be nice is a logging option that reported the execution plan
actually used for a query.

Jack



ris=# \d t;
                Table "vac.t"
     Column    |          Type          | Modifiers
--------------+------------------------+-----------
  dh           | integer                | not null
  fh           | integer                | not null
  nm           | bigint                 |
...
Indexes:
     "idx_dn" btree (dh, nm)
     "idx_dh" btree (dh, fh)

ris=# select dh, count(*) from t group by dh;
  dh        | count
-----------+-------
  589849733 | 19890
(1 row)

ris=# explain select * from t where dh = 589849733 and fh = 0;
                                     QUERY PLAN
------------------------------------------------------------------
  Index Scan using idx_dh on t  (cost=0.00..5.26 rows=2 width=570)
    Index Cond: ((dh = 589849733) AND (fh = 0))
(2 rows)


ris=# select schemaname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
      from pg_stat_user_indexes
      where schemaname = 'vac' and relname = 't'
      order by 1, 2;
  schemaname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
------------+--------------+----------+--------------+---------------
  vac        | idx_dn       |    31315 |    122773990 |     122773990
  vac        | idx_dh       |        0 |            0 |             0



Re: Postgres optimizer choosing wrong index

From: Tom Lane
Jack Orenstein <jack.orenstein@hds.com> writes:
> EXPLAIN says that the correct index is being used -- it didn't used
> to. However, pg_stat* says otherwise. In my test, I have exactly one
> dh value. Running EXPLAIN with this value produces a plan using idx_dh
> (the correct index), but pg_stats says that idx_dn is being used (see
> psql session below).

Yeah, if you are using cached plans (via PREPARE or plpgsql functions)
then the plan stays the same for the life of the session ... pre 8.3
that is.

            regards, tom lane
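[Editor's note: the behavior Tom describes can be modeled in a few lines of Python: a session that caches the first plan it builds keeps using it even after statistics improve, while a fresh session replans. The names here are illustrative, not a real client API.]

```python
class Session:
    """Toy model of pre-8.3 per-session plan caching."""
    def __init__(self, planner):
        self.planner = planner
        self.plan_cache = {}

    def execute(self, sql):
        if sql not in self.plan_cache:          # plan once per session
            self.plan_cache[sql] = self.planner(sql)
        return self.plan_cache[sql]

stats_loaded = False  # flips to True once ANALYZE has run on a populated table

def planner(sql):
    # Picks the right index only when useful statistics exist.
    return "idx_dh" if stats_loaded else "idx_dn"

q = "select * from t where dh = $1 and fh = $2"
s = Session(planner)
first = s.execute(q)                  # planned before ANALYZE
stats_loaded = True                   # ANALYZE runs mid-session
stale = s.execute(q)                  # cached plan survives
fresh = Session(planner).execute(q)   # a new session replans
print(first, stale, fresh)            # idx_dn idx_dn idx_dh
```

This matches the observation in the thread: EXPLAIN (run fresh) shows the good plan while long-lived sessions keep executing the stale one.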

Re: Postgres optimizer choosing wrong index

From: Jack Orenstein
Tom Lane wrote:
> Jack Orenstein <jack.orenstein@hds.com> writes:
>> EXPLAIN says that the correct index is being used -- it didn't used
>> to. However, pg_stat* says otherwise. In my test, I have exactly one
>> dh value. Running EXPLAIN with this value produces a plan using idx_dh
>> (the correct index), but pg_stats says that idx_dn is being used (see
>> psql session below).
>
> Yeah, if you are using cached plans (via PREPARE or plpgsql functions)
> then the plan stays the same for the life of the session ... pre 8.3
> that is.

Thank you.

One of the great things about postgres is being able to understand things all
the way down with the patient help of the actual developers.

Jack