Re: simple query with radically different plan after 9.0 -> 9.2 upgrade - Mailing list pgsql-general

From Tom Lane
Subject Re: simple query with radically different plan after 9.0 -> 9.2 upgrade
Date
Msg-id 13183.1384289455@sss.pgh.pa.us
Whole thread Raw
In response to simple query with radically different plan after 9.0 -> 9.2 upgrade  (Kevin Goess <kgoess@bepress.com>)
Responses Re: simple query with radically different plan after 9.0 -> 9.2 upgrade
List pgsql-general
Kevin Goess <kgoess@bepress.com> writes:
> We noticed a big change after upgrading from 9.0 to 9.2. For
> *yesterday's*date, the query plan is fine, like you'd expect

> articles_1=> explain (analyze, buffers) select 1
>         from hits_user_daily_count
>         where userid = 1590185 and date = '2013-07-30';

> QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------
>  Index Only Scan using hits_user_daily_count_pkey on
> hits_user_daily_count  (cost=0.00..8.02 rows=1 width=0) (actual
> time=0.02
>    Index Cond: ((userid = 1590185) AND (date = '2013-07-30'::date))
>    Heap Fetches: 1
>    Buffers: shared hit=5
>  Total runtime: 0.044 ms

> but for *today's* date it looks like it's reading all the rows for that
> date, 15,277 buffers:

> articles_1=> explain (analyze, buffers) select 1
>         from hits_user_daily_count
>         where userid = 1590185 and date = '2013-08-01';

> QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using hits_user_daily_count_date on hits_user_daily_count
> (cost=0.00..7.92 rows=1 width=0) (actual time=11.957..1
>    Index Cond: (date = '2013-08-01'::date)
>    Filter: (userid = 1590185)
>    Rows Removed by Filter: 20149
>    Buffers: shared hit=15277
>  Total runtime: 17.924 ms

Hm.  I can reproduce this fairly easily, per attached script --- but for
me, every PG release back to 8.4 does the same thing, so I'm a bit
mystified as to why it changed for you between 9.0 and 9.2.

The issue as I'm seeing it is that if ANALYZE didn't find any rows with
today's date, the planner will estimate the condition date = 'today'::date
as having zero selectivity, which makes an indexscan using just that
condition look as cheap as an indexscan using both columns.  In fact,
cheaper, because the index on just date is smaller than the pkey index.
So it goes for what looks like the cheaper plan (notice the cost estimates
in your examples above).

Now, the only way to get to a zero selectivity estimate for var = const
is if the planner believes that the pg_stats most-common-values list
for the column is complete, and the constant is nowhere in the list.
So one plausible explanation for the change in behavior is that you
jacked up the statistics target for the date column enough so that
it includes all of the date values you keep in that column.  Am I right
in guessing that you drop old data from this table?  How far back?

> We've addressed the problem by running 'analyze' on the table every day ate
> about 1:30am. Buffer hits on that table go from about 1,000/sec to
> 70,000/sec between midnight and that analyze job, and then go back down to
> 1,000/sec and stay flat until midnight rolls around again.

Yeah, as soon as ANALYZE sees a few rows with the newer date, the
selectivity estimate will move up enough to discourage use of the
single-column index.

            regards, tom lane

drop table hits_user_daily_count;

create table hits_user_daily_count (
 userid                     integer  not null,
 date                       date     not null,
 num_hits                   integer  default 0,
 num_cover_page_hits        integer  default 0,
 num_additional_files_hits  integer  default 0,
 primary key (userid, date));
create index hits_user_daily_count_date on hits_user_daily_count(date);

create or replace function fill_for_date(d date, n int) returns void
language plpgsql as $$
begin
  insert into hits_user_daily_count
    select uid, d, random()*10, random()*10 from
      generate_series(1,n) uid order by random();
end
$$;

select fill_for_date('today'::date - 5, 100000);
select fill_for_date('today'::date - 4, 100000);
select fill_for_date('today'::date - 3, 100000);
select fill_for_date('today'::date - 2, 100000);
select fill_for_date('today'::date - 1, 100000);

analyze hits_user_daily_count;

-- If you include this step, the query for today actually takes a long time;
-- but you risk auto-analyze changing the stats and making the problem go away.
-- select fill_for_date('today'::date - 0, 20000);

explain analyze select 1
        from hits_user_daily_count
        where userid = 15901 and date = 'yesterday'::date;

explain analyze select 1
        from hits_user_daily_count
        where userid = 15901 and date = 'today'::date;

pgsql-general by date:

Previous
From: "Francisco Figueiredo Jr."
Date:
Subject: Re: GSSAPI server side on Linux, SSPI client side on Windows
Next
From: Peter Eisentraut
Date:
Subject: Re: Clang 3.3 Analyzer Results