Thread: Setting Statistics on Functional Indexes

Setting Statistics on Functional Indexes

From: Shaun Thomas

Hey everyone,

So recently we upgraded to 9.1 and noticed that a ton of our queries
got much worse. It turns out that 9.1 is *way* more optimistic about
our functional indexes, even when they're entirely the wrong path. So
after going through the docs, I see that the normal way to increase
stats is to alter columns directly on a table, or to change
default_statistics_target itself.
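
For a regular table column that's easy enough (hypothetical names):

ALTER TABLE my_table ALTER COLUMN my_col SET STATISTICS 1000;
ANALYZE my_table;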

But there doesn't seem to be any way to increase stats for a functional
index unless you modify default_statistics_target. I did some testing,
and for a particularly bad plan, we don't get a good result until the
stats are at 5000 or higher. As you can imagine, this drastically
increases our analyze time, and there's no way we're setting that
system-wide.

I tested this by:

SET default_statistics_target = 5000;

ANALYZE my_table;

EXPLAIN SELECT [ugly query];

I tested 1000, 2000, 3000, 4000, and 5000; the plan didn't switch
until 5000. The production table has 30M rows, and the "good" plan is
100x faster than the bad one. You can see the same behavior with this
test case:

CREATE TABLE date_test (
   id SERIAL,
   col1 varchar,
   col2 numeric,
   action_date TIMESTAMP WITHOUT TIME ZONE
);

insert into date_test (col1, col2, action_date)
select 'S:' || ((random()*a.num)::int % 10000),
        (random()*a.num)::int % 15000,
        current_date - (random()*a.num)::int % 1000
   from generate_series(1,10000000) a(num);

create index idx_date_test_action_date_trunc
     on date_test (date_trunc('day', action_date));

create index idx_date_test_col1_col2
     on date_test (col1, col2);

explain analyze
select *
   from date_test
  where col1 IN ('S:96')
    and col2 = 657
    and date_trunc('day', action_date) >= '2012-10-24'
  order by id desc, action_date;


The problem shows up most consistently when using a value where col1
and col2 have no matches. In this particular example, I didn't get the
good plan until using 1000 as the default stats target. It can't be a
coincidence that there are 1000 distinct values in the table for that
column; we get a terrible plan until a statistic is recorded for each
and every one in the functional index, so it can exclude itself. This
seems counter-intuitive given what pg_stats reports with the default
stats target at 500:

SELECT attname,n_distinct FROM pg_stats WHERE tablename='date_test';

    attname   | n_distinct
-------------+------------
  id          |         -1
  action_date |       1000
  col2        |      14999
  col1        |      10000

SELECT stadistinct FROM pg_statistic
  WHERE starelid='idx_date_test_action_date_trunc'::regclass;

  stadistinct
-------------
         1000

Just on pure selectivity, it should prefer the index on col1 and col2.
Anyway, we're getting all the devs to search out that particular
functional index and eradicate it, but that will take a while to get
through testing and deployment. The overriding problem seems to be two-fold:

1. Is there any way to specifically set stats on a functional index?
2. Why is the planner so ridiculously optimistic with functional
indexes, even in the case of much higher selectivity as reported by
pg_stats on the named columns?

Thanks!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Setting Statistics on Functional Indexes

From: Tom Lane

Shaun Thomas <sthomas@optionshouse.com> writes:
> 1. Is there any way to specifically set stats on a functional index?

Sure, the same way you would for a table.

regression=# create table foo (f1 int, f2 int);
CREATE TABLE
regression=# create index fooi on foo ((f1 + f2));
CREATE INDEX
regression=# \d fooi
      Index "public.fooi"
 Column |  Type   | Definition
--------+---------+------------
 expr   | integer | (f1 + f2)
btree, for table "public.foo"

regression=# alter index fooi alter column expr set statistics 5000;
ALTER INDEX

The weak spot in this, and the reason this isn't "officially" supported,
is that the column name for an index expression isn't set in stone.
But as long as you check what it's called you can set its target.
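
If you don't want to eyeball \d output, a catalog query like this
should show the generated column name(s) as well:

select attname from pg_attribute where attrelid = 'fooi'::regclass;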

> 2. Why is the planner so ridiculously optimistic with functional
> indexes, even in the case of much higher selectivity as reported by
> pg_stats on the named columns?

It's not particularly (not that you've even defined what you think
"optimistic" is, much less mentioned what baseline you're comparing to).
I tried your example on HEAD and I got what seemed pretty decent
rowcount estimates ...

            regards, tom lane


Re: Setting Statistics on Functional Indexes

From: Shaun Thomas

On 10/24/2012 02:11 PM, Tom Lane wrote:

> It's not particularly (not that you've even defined what you think
> "optimistic" is, much less mentioned what baseline you're comparing
> to).

The main flaw with my example is that it's random. But I swear I'm not
making it up! :)

There seems to be a particularly nasty edge case we're triggering, then.
Like I said, it's worse when col1+col2 don't match anything. In that
case, it's using the trunc index on the date column, which has
demonstrably worse performance. Here are the two EXPLAIN ANALYZE
outputs I got, before and after front-loading statistics.

Before stats increase:

  Sort  (cost=9.38..9.39 rows=1 width=23) (actual time=78.282..78.282 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_action_date_trunc on date_test  (cost=0.00..9.37 rows=1 width=23) (actual time=78.274..78.274 rows=0 loops=1)
          Index Cond: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
          Filter: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
  Total runtime: 78.317 ms


And then after. I used your unofficial trick to set it to 1000:

alter index idx_date_test_action_date_trunc
       alter column date_trunc set statistics 1000;
analyze date_test;


  Sort  (cost=9.83..9.83 rows=1 width=23) (actual time=0.038..0.038 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_col1_col2 on date_test  (cost=0.00..9.82 rows=1 width=23) (actual time=0.030..0.030 rows=0 loops=1)
          Index Cond: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
          Filter: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
  Total runtime: 0.066 ms


This is on a bone-stock PG 9.1.6 from Ubuntu 12.04 LTS, with
default_statistics_target increased to 500. The only thing I changed
between those two query plans was the stats target on the functional
index.

But then I noticed something else. I reverted back to the old 500
default for everything, and added an index:

create index idx_date_test_action_date_trunc_col1
     on date_test (date_trunc('day', action_date), col1);

I think we can agree that this index would be more selective than the
one on date_trunc by itself. Yet:

  Sort  (cost=9.38..9.39 rows=1 width=23) (actual time=77.055..77.055 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_action_date_trunc on date_test  (cost=0.00..9.37 rows=1 width=23) (actual time=77.046..77.046 rows=0 loops=1)
          Index Cond: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
          Filter: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
  Total runtime: 77.091 ms


All I have to say about that is: wat.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Setting Statistics on Functional Indexes

From: Shaun Thomas

On 10/24/2012 02:31 PM, Shaun Thomas wrote:

> The main flaw with my example is that it's random. But I swear I'm not
> making it up! :)

And then I find a way to make it non-random. Hooray:

CREATE TABLE date_test (
   id SERIAL,
   col1 varchar,
   col2 numeric,
   action_date TIMESTAMP WITHOUT TIME ZONE
);

insert into date_test (col1, col2, action_date)
select 'S:' || (a.num % 10000), a.num % 15000,
        current_date - a.num % 1000
   from generate_series(1,10000000) a(num);

create index idx_date_test_action_date_trunc
     on date_test (date_trunc('day', action_date));

create index idx_date_test_col1_col2
     on date_test (col1, col2);

set default_statistics_target = 500;
vacuum analyze date_test;

explain analyze
select *
   from date_test
  where col1 IN ('S:96')
    and col2 = 657
    and date_trunc('day', action_date) >= '2012-10-24'
  order by id desc, action_date;


  Sort  (cost=9.38..9.39 rows=1 width=23) (actual time=83.418..83.418 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_action_date_trunc on date_test  (cost=0.00..9.37 rows=1 width=23) (actual time=83.409..83.409 rows=0 loops=1)
          Index Cond: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
          Filter: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
  Total runtime: 83.451 ms


alter index idx_date_test_action_date_trunc
       alter column date_trunc set statistics 1000;
analyze date_test;


  Sort  (cost=9.83..9.83 rows=1 width=23) (actual time=0.077..0.077 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_col1_col2 on date_test  (cost=0.00..9.82 rows=1 width=23) (actual time=0.069..0.069 rows=0 loops=1)
          Index Cond: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
          Filter: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
  Total runtime: 0.105 ms


Then for fun:


create index idx_date_test_action_date_trunc_col1
     on date_test (date_trunc('day', action_date), col1);
alter index idx_date_test_action_date_trunc
       alter column date_trunc set statistics -1;
analyze date_test;


  Sort  (cost=9.38..9.39 rows=1 width=23) (actual time=84.375..84.375 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_action_date_trunc on date_test  (cost=0.00..9.37 rows=1 width=23) (actual time=84.366..84.366 rows=0 loops=1)
          Index Cond: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
          Filter: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
  Total runtime: 84.410 ms


o_O


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Setting Statistics on Functional Indexes

From: Tom Lane

Shaun Thomas <sthomas@optionshouse.com> writes:
> On 10/24/2012 02:31 PM, Shaun Thomas wrote:
>> The main flaw with my example is that it's random. But I swear I'm not
>> making it up! :)

> And then I find a way to make it non-random. Hooray:

I can't reproduce this.  In 9.1 for instance, I get

 Sort  (cost=9.83..9.83 rows=1 width=23) (actual time=0.029..0.029 rows=0 loops=1)
   Sort Key: id, action_date
   Sort Method: quicksort  Memory: 25kB
   ->  Index Scan using idx_date_test_col1_col2 on date_test  (cost=0.00..9.82 rows=1 width=23) (actual time=0.021..0.021 rows=0 loops=1)
         Index Cond: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
         Filter: (date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone)
 Total runtime: 0.086 ms

and those estimates don't change materially with the stats adjustments.
If I drop that index altogether, it goes over to this:

 Sort  (cost=521.83..521.83 rows=1 width=23) (actual time=2.544..2.544 rows=0 loops=1)
   Sort Key: id, action_date
   Sort Method: quicksort  Memory: 25kB
   ->  Index Scan using idx_date_test_action_date_trunc_col1 on date_test  (cost=0.00..521.82 rows=1 width=23) (actual time=2.536..2.536 rows=0 loops=1)
         Index Cond: ((date_trunc('day'::text, action_date) >= '2012-10-24 00:00:00'::timestamp without time zone) AND ((col1)::text = 'S:96'::text))
         Filter: (col2 = 657::numeric)
 Total runtime: 2.600 ms

So the planner's conclusions look fairly sane from here.  I get about
the same results from HEAD, 9.2 branch tip, or 9.1 branch tip.

So I'm wondering exactly what "9.1" version you're using, and also
whether you've got any nondefault planner cost parameters.

            regards, tom lane


Re: Setting Statistics on Functional Indexes

From: Shaun Thomas

On 10/26/2012 02:35 PM, Tom Lane wrote:

> So I'm wondering exactly what "9.1" version you're using, and also
> whether you've got any nondefault planner cost parameters.

Just a plain old 9.1.6 from Ubuntu 12.04. The only thing I personally
changed was default_statistics_target. Later, I bumped up
shared_buffers and work_mem, but that just reduced the run time; it
still uses the bad index.

But I just noticed the lag in your response. :) It turns out, even
though I was substituting 2012-10-24 or 2012-10-25, what I really meant
was current_date. That does make all the difference, actually. If the
date in the where clause isn't the current date, it comes up with the
right plan. Even a single day in the past makes it work right. It only
seems to break on the very edge. This should work:


DROP TABLE IF EXISTS date_test;

CREATE TABLE date_test (
   id SERIAL,
   col1 varchar,
   col2 numeric,
   action_date TIMESTAMP WITHOUT TIME ZONE
);

insert into date_test (col1, col2, action_date)
select 'S:' || (a.num % 10000), a.num % 15000,
        current_date - a.num % 1000
   from generate_series(1,10000000) a(num);

create index idx_date_test_action_date_trunc
     on date_test (date_trunc('day', action_date));

create index idx_date_test_col1_col2
     on date_test (col1, col2);

set default_statistics_target = 500;
vacuum analyze date_test;

explain analyze
select *
   from date_test
  where col1 IN ('S:96')
    and col2 = 657
    and date_trunc('day', action_date) >= current_date
  order by id desc, action_date;


  Sort  (cost=9.39..9.39 rows=1 width=23) (actual time=10.679..10.679 rows=0 loops=1)
    Sort Key: id, action_date
    Sort Method: quicksort  Memory: 25kB
    ->  Index Scan using idx_date_test_action_date_trunc on date_test  (cost=0.01..9.38 rows=1 width=23) (actual time=10.670..10.670 rows=0 loops=1)
          Index Cond: (date_trunc('day'::text, action_date) >= ('now'::text)::date)
          Filter: (((col1)::text = 'S:96'::text) AND (col2 = 657::numeric))
  Total runtime: 10.713 ms


And if this helps:


foo=# select name,setting from pg_settings where setting != boot_val;
             name            |       setting
----------------------------+---------------------
  application_name           | psql
  archive_command            | (disabled)
  client_encoding            | UTF8
  default_statistics_target  | 500
  default_text_search_config | pg_catalog.english
  lc_collate                 | en_US.UTF-8
  lc_ctype                   | en_US.UTF-8
  lc_messages                | en_US.UTF-8
  lc_monetary                | en_US.UTF-8
  lc_numeric                 | en_US.UTF-8
  lc_time                    | en_US.UTF-8
  log_file_mode              | 0600
  log_line_prefix            | %t
  max_stack_depth            | 2048
  server_encoding            | UTF8
  shared_buffers             | 3072
  ssl                        | on
  transaction_isolation      | read committed
  unix_socket_directory      | /var/run/postgresql
  unix_socket_permissions    | 0777
  wal_buffers                | 96

That's every single setting that differs from the compiled-in
defaults. Some of these were obviously modified by Ubuntu, but I
didn't touch anything else. I was trying to produce a clean-room setup
to showcase this, but I'm seeing the same thing everywhere I test,
even with sane settings.

Our EDB server is doing the same thing on much beefier hardware and
correspondingly increased settings, which is what prompted me to test it
in plain PG.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Setting Statistics on Functional Indexes

From: Tom Lane

Shaun Thomas <sthomas@optionshouse.com> writes:
> But I just noticed the lag in your response. :) It turns out, even
> though I was substituting 2012-10-24 or 2012-10-25, what I really meant
> was current_date. That does make all the difference, actually.

Ah.  [ pokes at that for awhile... ]  OK, this has nothing to do with
functional indexes, and everything to do with the edge-case behavior of
scalarltsel.  What you've got is a histogram whose last entry
(corresponding to the highest observed value of the date) is
current_date, and the question is what we should assume when estimating
how many rows have a value >= that.  The answer of course is "one, plus
any duplicates" ... but we don't know how many duplicates there are,
and what we do know is it's not a particularly large number because the
value isn't present in the most-common-values stats.  So the code there
assumes there aren't any dups.

Once you have enough histogram resolution for current_date to show up
as the next-to-last as well as the last histogram entry, then of course
the estimate gets a lot better, since we can now tell that there's at
least one histogram bin's worth of duplicates.
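
You can watch that happen directly: the stats for the index expression
show up in pg_stats under the index's name, so something like this
shows whether current_date appears once or twice at the tail of the
histogram:

SELECT histogram_bounds FROM pg_stats
 WHERE tablename = 'idx_date_test_action_date_trunc';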

Interestingly, this is a case where the get_actual_variable_range patch
(commit 40608e7f, which appeared in 9.0) makes the results worse.
Before that, there was a (very arbitrary) lower bound on what we'd
believe as the selectivity of a >= condition, but now, when we know the
actual upper limit of the variable, we don't clamp the result that way.
I think the clamp must have been saving you in your previous version,
because it more-or-less-accidentally accounted for the end value not
being unique.

So the bottom line is that this is a case where you need a lot of
resolution in the histogram.  I'm not sure there's anything good
we can do to avoid that.   I spent a bit of time thinking about whether
we could use n_distinct to get some idea of how many duplicates there
might be for the endpoint value, but n_distinct is unreliable enough
that I can't develop a lot of faith in such a thing.  Or we could just
arbitrarily assume some fraction-of-a-histogram-bin's worth of
duplicates, but that would make the results worse for some people.

            regards, tom lane


Re: Setting Statistics on Functional Indexes

From: Claudio Freire

On Fri, Oct 26, 2012 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Interestingly, this is a case where the get_actual_variable_range patch
> (commit 40608e7f, which appeared in 9.0) makes the results worse.
> Before that, there was a (very arbitrary) lower bound on what we'd
> believe as the selectivity of a >= condition, but now, when we know the
> actual upper limit of the variable, we don't clamp the result that way.
> I think the clamp must have been saving you in your previous version,
> because it more-or-less-accidentally accounted for the end value not
> being unique.

IIRC, that patch was performing an index query (index_last) to get the
real largest value, right?

How many duplicates would you think the planner would require to
choose another (better) plan?

Because once you've accessed that last index page, it would be rather
trivial finding out how many duplicate tids are in that page and, with
a small CPU cost (no disk access if you don't query other index pages)
you could verify the assumption of near-uniqueness.


Re: Setting Statistics on Functional Indexes

From: Tom Lane

Claudio Freire <klaussfreire@gmail.com> writes:
> Because once you've accessed that last index page, it would be rather
> trivial finding out how many duplicate tids are in that page and, with
> a small CPU cost (no disk access if you don't query other index pages)
> you could verify the assumption of near-uniqueness.

I thought about that too, but I'm not sure how promising the idea is.
In the first place, it's not clear when to stop counting duplicates, and
in the second, I'm not sure we could get away with not visiting the heap
to check for tuple liveness.  There might be a lot of apparent
duplicates in the index that just represent unreaped old versions of a
frequently-updated endpoint tuple.  (The existing code is capable of
returning a "wrong" answer if the endpoint tuple is dead, but I don't
think it matters much in most cases.  I'm less sure such an argument
could be made for dup-counting.)

            regards, tom lane


Re: Setting Statistics on Functional Indexes

From: Claudio Freire

On Fri, Oct 26, 2012 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> Because once you've accessed that last index page, it would be rather
>> trivial finding out how many duplicate tids are in that page and, with
>> a small CPU cost (no disk access if you don't query other index pages)
>> you could verify the assumption of near-uniqueness.
>
> I thought about that too, but I'm not sure how promising the idea is.
> In the first place, it's not clear when to stop counting duplicates, and
> in the second, I'm not sure we could get away with not visiting the heap
> to check for tuple liveness.  There might be a lot of apparent
> duplicates in the index that just represent unreaped old versions of a
> frequently-updated endpoint tuple.  (The existing code is capable of
> returning a "wrong" answer if the endpoint tuple is dead, but I don't
> think it matters much in most cases.  I'm less sure such an argument
> could be made for dup-counting.)

Would checking the visibility map be too bad? An index page worth of
tuples should also fit within a page in the visibility map.


Re: Setting Statistics on Functional Indexes

From: Claudio Freire

On Fri, Oct 26, 2012 at 7:04 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Oct 26, 2012 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfreire@gmail.com> writes:
>>> Because once you've accessed that last index page, it would be rather
>>> trivial finding out how many duplicate tids are in that page and, with
>>> a small CPU cost (no disk access if you don't query other index pages)
>>> you could verify the assumption of near-uniqueness.
>>
>> I thought about that too, but I'm not sure how promising the idea is.
>> In the first place, it's not clear when to stop counting duplicates, and
>> in the second, I'm not sure we could get away with not visiting the heap
>> to check for tuple liveness.  There might be a lot of apparent
>> duplicates in the index that just represent unreaped old versions of a
>> frequently-updated endpoint tuple.  (The existing code is capable of
>> returning a "wrong" answer if the endpoint tuple is dead, but I don't
>> think it matters much in most cases.  I'm less sure such an argument
>> could be made for dup-counting.)
>
> Would checking the visibility map be too bad? An index page worth of
> tuples should also fit within a page in the visibility map.

Scratch that, they're sorted by tid. So it could be lots of pages in
random order.


Re: Setting Statistics on Functional Indexes

From: Shaun Thomas

On 10/26/2012 04:08 PM, Tom Lane wrote:

> So the bottom line is that this is a case where you need a lot of
> resolution in the histogram.  I'm not sure there's anything good
> we can do to avoid that.

I kinda hoped it wouldn't be something like that. For the particularly
painful instance, it was easy to replace the index with a better (if
larger) dual index and drop the bad old one. But in some cases, I'm
having to maintain two indexes that make me sad:

CREATE TABLE activity (
   activity_id  SERIAL NOT NULL PRIMARY KEY,
   account_id   BIGINT NOT NULL,
   action_date  TIMESTAMP WITHOUT TIME ZONE
);

CREATE INDEX idx_activity_action_date_account_id
     ON activity (action_date, account_id);

CREATE INDEX idx_activity_account_id_action_date
     ON activity (account_id, action_date);

Because in the first case, we needed the action_date to be first for
analytics that *don't* supply account_id. But in the second case, we
need the account_id first, so we can get the most recent action(s) for
that account without a very expensive backwards index scan on the first
index.
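
The tip-tracking query itself is the usual suspect, something like
this (made-up account id):

-- most recent action for one account
SELECT *
  FROM activity
 WHERE account_id = 12345
 ORDER BY action_date DESC
 LIMIT 1;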

I know that current_date seems like an edge case, but I can't see how
getting the most recent activity for something is an uncommon need.
Tip tracking is actually the most frequent pattern in the systems I've
seen. Admittedly, those are almost always high TPS trading systems.

At this point, I'm almost willing to start putting in optimization
fences to force it along the right path. Which is gross, because that's
effectively no better than Oracle hints. But I also don't like setting
my statistics to 5000+ on problematic column/index combos to get the
right heuristics, or having semi-duplicate multi-column indexes to
exploit sorting performance.
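
By "fences" I mean something like a CTE, which 9.1 always
materializes, so the planner can't flatten the selective filter back
into the outer query:

WITH hits AS (
    -- this part should pick idx_date_test_col1_col2
    SELECT * FROM date_test
     WHERE col1 = 'S:96' AND col2 = 657
)
SELECT * FROM hits
 WHERE date_trunc('day', action_date) >= current_date
 ORDER BY id DESC, action_date;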

I mean, I get it. I just wonder if this particular tweak isn't more of a
regression than initially thought.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Setting Statistics on Functional Indexes

From: "Kevin Grittner"

Shaun Thomas wrote:

> I know that current_date seems like an edge case, but I can't see
> how getting the most recent activity for something is an uncommon
> activity. Tip tracking is actually the most frequent pattern in the
> systems I've seen.

Yeah, this has been a recurring problem with database statistics
with various products for at least 20 years. For a while I was using
a product whose optimizer engineers referred to it as "data skew" and
recommended adding a "dummy" entry to get a single value out past the
maximum end of the range. If you couldn't stomach the dummy data,
they had detailed instructions for dumping your statistics, tinkering
with the end of it to allow for the issue, and writing it back over
the actual statistics gathered. We need a better answer than that.
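
To be concrete, the "dummy entry" trick amounts to planting a sentinel
row past any real value, e.g. against Shaun's activity table:

-- sentinel row; the fake account gets filtered out of real queries
INSERT INTO activity (account_id, action_date)
VALUES (-1, '2099-01-01');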

> I just wonder if this particular tweak isn't more of a regression
> than initially thought.

It does seem like we have a serious regression in terms of this
particular issue.

-Kevin


Re: Setting Statistics on Functional Indexes

From: Robert Haas

On Fri, Oct 26, 2012 at 5:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So the bottom line is that this is a case where you need a lot of
> resolution in the histogram.  I'm not sure there's anything good
> we can do to avoid that.   I spent a bit of time thinking about whether
> we could use n_distinct to get some idea of how many duplicates there
> might be for the endpoint value, but n_distinct is unreliable enough
> that I can't develop a lot of faith in such a thing.  Or we could just
> arbitarily assume some fraction-of-a-histogram-bin's worth of
> duplicates, but that would make the results worse for some people.

I looked at this a bit.  It seems to me that the root of this issue is
that we aren't distinguishing (at least, not as far as I can see)
between > and >=.  ISTM that if the operator is >, we're doing exactly
the right thing, but if it's >=, we're giving exactly the same
estimate that we would give for >.  That doesn't seem right.
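
Easy enough to see with Shaun's test table: if I'm right, these two
produce identical row estimates, even though the >= form also matches
everything dated today:

explain select * from date_test
  where date_trunc('day', action_date) > current_date;

explain select * from date_test
  where date_trunc('day', action_date) >= current_date;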

Worse, I suspect that in this case we're actually giving a smaller
estimate for >= than we would for =, because = would estimate as if we
were searching for an arbitrary non-MCV, while >= acts like > and
says, hey, there's nothing beyond the end.

Shouldn't there be a separate estimator for scalarlesel?  Or should
the existing estimator be adjusted to handle the two cases
differently?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Setting Statistics on Functional Indexes

From: Claudio Freire

On Wed, Nov 14, 2012 at 5:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Shouldn't there be a separate estimator for scalarlesel?  Or should
> the existing estimator be adjusted to handle the two cases
> differently?

Wouldn't adding eqsel to scalar(lt|gt)sel work? (saving duplication
with mcv_selectivity)


Re: Setting Statistics on Functional Indexes

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> Shouldn't there be a separate estimator for scalarlesel?  Or should
> the existing estimator be adjusted to handle the two cases
> differently?

Well, it does handle it differently to some extent, in that the operator
itself is invoked when checking the MCV values, so we get the right
answer for those.

The fact that there's not separate estimators for < and <= is something
we inherited from Berkeley, so I can't give the original rationale for
certain, but I think the notion was that the difference is imperceptible
when dealing with a continuous distribution.  The question is whether
you think that the "=" case contributes any significant amount to the
probability given that the bound is not one of the MCV values.  (If it
is, the MCV check will have accounted for it, so adding anything would
be wrong.)  I guess we could add 1/ndistinct or something like that,
but I'm not convinced that will really make the estimates better, mainly
because ndistinct is none too reliable itself.
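
(For what it's worth, in Shaun's test case 1/ndistinct would happen to
come out exactly right: the expression has 1000 distinct values over
10M rows, so adding 1/1000 selectivity means another 10,000 rows, and
each date really does have 10,000 duplicates. But that's the luck of a
perfectly uniform distribution.)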

            regards, tom lane