Thread: Re: order by and index path

Re: order by and index path

From
jwieck@debis.com (Jan Wieck)
Date:
Andread wrote:
> > (who really selects nearly all rows from a 5M row
> >    table?).
>
> Data Warehouse apps
>
> >    This  will hurt if someone really selects most of the rows and the index
> >    scan jumps over the disc.
>
> I think this is a non issue, since if a qual is not restrictive enough,
> the optimizer should choose a seq scan anyway. Doesn' t it do this already ?
> [...]

    Right and wrong.

    Right - it is the optimizers job to decide if an index should
    be used or not. And the decision has to be made based on  the
    cost.

    Wrong  -  PostgreSQL's query optimizer doesn't do it already.
    It assumes that a qualification is always restrictive  enough
    and  chooses  an index scan any time if the qualification can
    be thrown into the indexqual.

    In  the  following  I  only  discuss  the   situation   where
    qualifications can be used in the indexqual.

    Calculating  the  cost of a query is easy. Have N tuples in P
    data-pages  where  the  given  qualification  will  match  M.
    Assuming  that  the  tuples are not in ascending order in the
    data pages, the cost fetching one tuple  by  its  TID  raises
    with  P  (more  seeking necessary). Now you can calculate the
    cost of an index scan by C=M/N*P*F where F is  some  constant
    factor  to  make C comparable to a current seqscan cost value
    (I know, it must be  smarter,  but  for  this  description  a
    simple calculation is enough).

    The  only  problem  is  that  the optimizer has absolutely no
    chance to estimate M (the mystic value as Bruce  called  it).
    In a given qualification

        WHERE key > 0 AND key <= 100

    it cannot know if this would result in 0 or 100% of the rows.
    To estimate that, it needs statistical information about  the
    key  ranges  that  are present. Assume there would be 11 keys
    remembered by the last vacuum run, that break  up  the  whole
    present  key  range  of  10000 tuples into 10 chunks and they
    read

        5 40 70 90 500 600 1000 1100 1400 1500 2000

    where 5 is the lowest key at all, 40 is the key of tuple 1000
    (in  key  order),  70 is the key of tuple 2000 and so on. Now
    looking at the qualification and this key  range  information
    would  tell,  that  the absolute limit of rows returned by an
    index scan would be 3999 (which still could have a key  value
    of 100). But the qualification

        WHERE key >= 50

    could return at max 8999 tuples and

        WHERE key > 50 AND key < 70

    has  a  maximum  of  998  result  tuples.  This  would be the
    information required to make the right decision for the  case
    where all rows selected are wanted.

    We  do  not  have  this statistical information. So the whole
    thing is at this time academic.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] Re: order by and index path

From
"Thomas G. Lockhart"
Date:
>     We  do  not  have  this statistical information. So the whole
>     thing is at this time academic.

I recall that Commercial Ingres made the assumption that one row (or 1%
of rows? My memory of Ingres is fading :) would be returned from a
qualified query if no statistics were available to suggest otherwise.

It did collect statistics on data distribution to try to help make those
optimizer choices.

It may be reasonable to assume that if there is an index, then using it
with any qualified query would be a win. Since the alternative is to
decide to _not_ use an index, a decision for which we have no support
with existing statistics.

                        - Tom

Statistics on key distribution (was: Re: order by and index path)

From
jwieck@debis.com (Jan Wieck)
Date:
>
> >     We  do  not  have  this statistical information. So the whole
> >     thing is at this time academic.
>
> I recall that Commercial Ingres made the assumption that one row (or 1%
> of rows? My memory of Ingres is fading :) would be returned from a
> qualified query if no statistics were available to suggest otherwise.
>
> It did collect statistics on data distribution to try to help make those
> optimizer choices.
>
> It may be reasonable to assume that if there is an index, then using it
> with any qualified query would be a win. Since the alternative is to
> decide to _not_ use an index, a decision for which we have no support
> with existing statistics.

    It  may  be  also reasonable to collect statistic information
    and use that to quantify the cost of an index scan.

    The vacuum cleaner scans all indices on a  relation  vacuum'd
    completely.  And  at that time it already knows the number of
    pages and tuples in  the  heap  relation  (has  that  in  the
    vcrelstats).

    Based  on this it could decide to take every n'th index tuple
    while scanning and drop them somewhere where  other  backends
    can  find  them.   This  would be the statistical information
    needed by the optimizer to estimate the real cost of an index
    scan.  It  is  only of interest for big tables, where hopping
    from block to block will make an index scan a looser  against
    a  seqscan  in  a  many row matching scan.  So it's up to the
    optimizer do decide based on the # of  pages  if  statistical
    information is really required for cost calculation.

    Having   the  final  indexqual  along  with  the  statistical
    information it will be a little tricky to figure out how many
    rows it might return, but not impossible.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: [HACKERS] Re: order by and index path

From
Bruce Momjian
Date:
>     could return at max 8999 tuples and
>
>         WHERE key > 50 AND key < 70
>
>     has  a  maximum  of  998  result  tuples.  This  would be the
>     information required to make the right decision for the  case
>     where all rows selected are wanted.
>
>     We  do  not  have  this statistical information. So the whole
>     thing is at this time academic.

But we do have statistical information in pg_statistic if you run vacuum
analyze.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Re: order by and index path

From
Bruce Momjian
Date:
> >     We  do  not  have  this statistical information. So the whole
> >     thing is at this time academic.
>
> I recall that Commercial Ingres made the assumption that one row (or 1%
> of rows? My memory of Ingres is fading :) would be returned from a
> qualified query if no statistics were available to suggest otherwise.
>
> It did collect statistics on data distribution to try to help make those
> optimizer choices.
>
> It may be reasonable to assume that if there is an index, then using it
> with any qualified query would be a win. Since the alternative is to
> decide to _not_ use an index, a decision for which we have no support
> with existing statistics.

For =, the assumion is 1 row, for > the assumption is 1/3 of the table.
With pg_statistic, it uses that.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Re: order by and index path

From
jwieck@debis.com (Jan Wieck)
Date:
>
> >     could return at max 8999 tuples and
> >
> >         WHERE key > 50 AND key < 70
> >
> >     has  a  maximum  of  998  result  tuples.  This  would be the
> >     information required to make the right decision for the  case
> >     where all rows selected are wanted.
> >
> >     We  do  not  have  this statistical information. So the whole
> >     thing is at this time academic.
>
> But we do have statistical information in pg_statistic if you run vacuum
> analyze.

    Nice  (forgot  that  - pardon), anyway only having lowest and
    highest key values isn't enough to make a  useful  estimation
    about  how  many  rows an indexqual will return. If we change
    pg_statistic in a way that  more  keys  can  get  stored  per
    relation/attribute,  then  the  optimizer  would  have a real
    chance on it.

    I have

        starelid
        staattnum
        staitupno
        staop
        stakey

    in mind, where staitupno tells the position of the key  in  a
    complete index scan. Then it becomes the place to fill in the
    key range information as described in my posting.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #