optimizer and type question - Mailing list pgsql-hackers

From Erik Riedel
Subject optimizer and type question
Date
Msg-id sqxh7H_00gNtAmTJ5Q@andrew.cmu.edu
Whole thread Raw
Responses Re: [HACKERS] optimizer and type question
List pgsql-hackers
[last week aggregation, this week, the optimizer]

I have a somewhat general optimizer question/problem that I would like
to get some input on - i.e. I'd like to know what is "supposed" to
work here and what I should be expecting.  Sadly, I think the patch
for this is more involved than my last message.

Using my favorite table these days:

Table    = lineitem
+------------------------+----------------------------------+-------+
|              Field     |              Type                | Length|
+------------------------+----------------------------------+-------+
| l_orderkey             | int4 not null                    |     4 |
| l_partkey              | int4 not null                    |     4 |
| l_suppkey              | int4 not null                    |     4 |
| l_linenumber           | int4 not null                    |     4 |
| l_quantity             | float4 not null                  |     4 |
| l_extendedprice        | float4 not null                  |     4 |
| l_discount             | float4 not null                  |     4 |
| l_tax                  | float4 not null                  |     4 |
| l_returnflag           | char() not null                  |     1 |
| l_linestatus           | char() not null                  |     1 |
| l_shipdate             | date                             |     4 |
| l_commitdate           | date                             |     4 |
| l_receiptdate          | date                             |     4 |
| l_shipinstruct         | char() not null                  |    25 |
| l_shipmode             | char() not null                  |    10 |
| l_comment              | char() not null                  |    44 |
+------------------------+----------------------------------+-------+
Index:    lineitem_index_

and the query:

--
-- Query 1
--
explain select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, 
sum(l_extendedprice) as sum_base_price, 
sum(l_extendedprice*(1-l_discount)) as sum_disc_price, 
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, 
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, 
avg(l_discount) as avg_disc, count(*) as count_order 
from lineitem 
where l_shipdate <= '1998-09-02'::date 
group by l_returnflag, l_linestatus 
order by l_returnflag, l_linestatus;


note that I have eliminated the date calculation in my query of last
week and manually replaced it with a constant (since this wasn't
happening automatically - but let's not worry about that for now).
And this is only an explain, we care about the optimizer.  So we get:

Sort  (cost=34467.88 size=0 width=0)->  Aggregate  (cost=34467.88 size=0 width=0)  ->  Group  (cost=34467.88 size=0
width=0)   ->  Sort  (cost=34467.88 size=0 width=0)      ->  Seq Scan on lineitem  (cost=34467.88 size=200191
width=44)

so let's think about the selectivity that is being chosen for the
seq scan (the where l_shipdate <= '1998-09-02').

Turns out the optimizer is choosing "33%", even though the real answer
is somewhere in 90+% (that's how the query is designed).  So, why does
it do that?

Turns out that selectivity in this case is determined via
plancat::restriction_selectivity() which calls into functionOID = 103
(intltsel) for operatorOID = 1096 (date "<=") on relation OID = 18663
(my lineitem).

This all follows because of the description of 1096 (date "<=") in
pg_operator.  Looking at local1_template1.bki.source near line 1754
shows:

insert OID = 1096 ( "<=" PGUID 0 <...> date_le intltsel intltjoinsel )

where we see that indeed, it thinks "intltsel" is the right function
to use for "oprrest" in the case of dates.

Question 1 - is intltsel the right thing for selectivity on dates?

Hope someone is still with me.

So now we're running selfuncs::intltsel() where we make a further call
to selfuncs::gethilokey().  The job of gethilokey is to determine the
min and max values of a particular attribute in the table, which will
then be used with the constant in my where clause to estimate the
selectivity.  It is going to search the pg_statistic relation with
three key values:

Anum_pg_statistic_starelid     18663  (lineitem)
Anum_pg_statistic_staattnum       11  (l_shipdate)
Anum_pg_statistic_staop         1096  (date "<=")

this finds no tuples in pg_statistic.  Why is that?  The only nearby
tuple in pg_statistic is:

starelid|staattnum|staop|stalokey        |stahikey       
--------+---------+-----+----------------+----------------  18663|       11|    0|01-02-1992      |12-01-1998

and the reason the query doesn't match anything?  Because 1096 != 0.
But why is it 0 in pg_statistic?  Statistics are determined near line
1844 in vacuum.c (assuming a 'vacuum analyze' run at some point)
            i = 0;            values[i++] = (Datum) relid;            /* 1 */            values[i++] = (Datum)
attp->attnum;/* 2 */
 
====>        values[i++] = (Datum) InvalidOid;       /* 3 */            fmgr_info(stats->outfunc, &out_function);
    out_string = <...min...>            values[i++] = (Datum) fmgr(F_TEXTIN, out_string);            pfree(out_string);
          out_string = <...max...>            values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
pfree(out_string);           stup = heap_formtuple(sd->rd_att, values, nulls);
 

the "offending" line is setting the staop to InvalidOid (i.e. 0).

Question 2 - is this right?  Is the intent for 0 to serve as a
"wildcard", or should it be inserting an entry for each operation
individually?

In the case of "wildcard" then gethilokey() should allow a match for 

Anum_pg_statistic_staop         0

instead of requiring the more restrictive 1096.  In the current code,
what happens next is gethilokey() returns "not found" and intltsel()
returns the default 1/3 which I see in the resultant query plan (size
= 200191 is 1/3 of the number of lineitem tuples).

Question 3 - is there any inherent reason it couldn't get this right?
The statistic is in the table 1992 to 1998, so the '1998-09-02' date
should be 90-some% selectivity, a much better guess than 33%.

Doesn't make a difference for this particular query, of course,
because the seq scan must proceed anyhow, but it could easily affect
other queries where selectivities matter (and it affects the
modifications I am trying to test in the optimizer to be "smarter"
about selectivities - my overall context is to understand/improve the
behavior that the underlying storage system sees from queries like this).

OK, so let's say we treat 0 as a "wildcard" and stop checking for
1096.  Not we let gethilokey() return the two dates from the statistic
table.  The immediate next thing that intltsel() does, near lines 122
in selfuncs.c is call atol() on the strings from gethilokey().  And
guess what it comes up with?

low = 1
high = 12

because it calls atol() on '01-02-1992' and '12-01-1998'.  This
clearly isn't right, it should get some large integer that includes
the year and day in the result.  Then it should compare reasonably
with my constant from the where clause and give a decent selectivity
value.  This leads to a re-visit of Question 1.

Question 4 - should date "<=" use a dateltsel() function instead of
intltsel() as oprrest?

If anyone is still with me, could you tell me if this makes sense, or
if there is some other location where the appropriate type conversion
could take place so that intltsel() gets something reasonable when it
does the atol() calls?

Could someone also give me a sense for how far out-of-whack the whole
current selectivity-handling structure is?  It seems that most of the
operators in pg_operator actually use intltsel() and would have
type-specific problems like that described.  Or is the problem in the
way attribute values are stored in pg_statistic by vacuum analyze?  Or
is there another layer where type conversion belongs?

Phew.  Enough typing, hope someone can follow this and address at
least some of the questions.

Thanks.

Erik Riedel
Carnegie Mellon University
www.cs.cmu.edu/~riedel



pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: [HACKERS] Re: Developers Globe (FINAL)
Next
From: Vince Vielhaber
Date:
Subject: to text or not to text