Re: [HACKERS] Optimizer fails? - Mailing list pgsql-hackers

From dg@illustra.com (David Gould)
Subject Re: [HACKERS] Optimizer fails?
Date
Msg-id 9803271943.AA27390@hawk.illustra.com
List pgsql-hackers
Michal Mosiewicz writes:
>
> Say you have a table with 1024 indexed values from 0 to 1023. Now, you
> want to select the values less than, say, 64. If you use sequential
> scan, you have to scan 1024 records, right? But if you use index scan,
> you first go to the first node of the tree, you compare 64 with the
> value stored in this node, which (for a balanced tree) is 512. Now you
> have to search through next lower node. You compare 64 to 128 which is
> the value of this node. Then you go to next lower node. You compare 64
> to 64, and yes, we hit the target. After 3 searches, we know that
> everything under this node is the set of records we are looking for
> right? So, what we have to do is to browse through those values, and
> collect the tuple identifiers.
>
> Note that once we have done these 3 searches, we have to browse all the
> structure of the tree below. We are looking for 64 values. So, it will
> cost us looking through 128 nodes of the subtree.
>
> OK, so by using an index, we had to check 128 + 3 nodes of the tree.

Our btree indexes are quite a bit better than the balanced tree you suppose.

> Now, let's note that there have been only a few IO transfers by now. No
> more than a few pages. And we have tuple identifiers pointing us to 64
> records. Now we may sort these tids in ascending order to optimise IO.

But we do not do this tid sort. It really isn't easy, as you might have
millions of tids, not just a few, which would mean doing an external sort.
This might be a nice thing to do, but it isn't there now as far as I know.
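
For what it is worth, the kind of tid sort being suggested would amount to
something like the sketch below. This is purely illustrative (the struct and
names are made up, not the actual Postgres tid code), but it shows the idea:
order the tids by block number so each heap page is visited once, in file
order.

/* Illustrative sketch only: sort tids by block before fetching heap pages. */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int   block;    /* heap page number  */
    unsigned short offset;   /* slot on that page */
} Tid;

static int
tid_cmp(const void *a, const void *b)
{
    const Tid  *ta = (const Tid *) a;
    const Tid  *tb = (const Tid *) b;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    return (int) ta->offset - (int) tb->offset;
}

int
main(void)
{
    Tid     tids[] = {{17, 3}, {2, 9}, {17, 1}, {5, 4}};
    int     ntids = sizeof(tids) / sizeof(tids[0]);
    int     i;

    /* Order the fetches so each heap page is read once, in ascending order. */
    qsort(tids, ntids, sizeof(Tid), tid_cmp);

    for (i = 0; i < ntids; i++)
        printf("block %u, offset %u\n", tids[i].block, (unsigned int) tids[i].offset);
    return 0;
}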

> Everything took us 3 + 128 nodes from index + 64 records from table.
> This is definitely better than reading all 1024 records.

I am not trying to flame you, but it seems to me that you have some
misconceptions about how the indexes work, and I am only trying to explain
them a little better.

Using your example of 1024 rows with values from 0 to 1023, and further assuming:

 - 8K pages
 - 200 byte data rows (including overheads)
 - 4 byte keys, so perhaps 32 byte index rows with overheads
 - btree index

Then:

Table has 8192 / 200 = 40 rows per page giving 1024 / 40 = 26 pages

Index has 8192 / 32 = 256 keys per page giving 1024 / 256 = 4 leaf pages
          and one root page with 4 keys.

Something like:
                             ____________________
                 index root: | 0, 256, 512, 768 |
                             --------------------
                             /     \       \
                           /        \        \
                         /           \        \--------------\
                       /              \                       \
            _______________            _________________    ____________
     leaf0: | 0,1,2...255 |     leaf1: | 256,257...511 |    leaf2:  ....
            ---------------            -----------------    ------------
                   |   \  \--------------------------------\
                   |    \------------------\                \
                   |                        \                \
                   |                         \                \
data: | 234 |,  | 11 |,  | 763 |, | 401 |, | 29 |, | 970 |, | 55 |, ....
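
The page arithmetic above can be checked with a few lines of C. A minimal
sketch, using only the assumed sizes listed above (nothing measured):

/* Sketch: page counts from the assumed sizes above. */
#include <stdio.h>

int
main(void)
{
    int     page_size = 8192;   /* 8K pages                      */
    int     row_size = 200;     /* data row including overheads  */
    int     key_size = 32;      /* index row including overheads */
    int     nrows = 1024;

    int     rows_per_page = page_size / row_size;                       /* 40  */
    int     table_pages = (nrows + rows_per_page - 1) / rows_per_page;  /* 26  */
    int     keys_per_page = page_size / key_size;                       /* 256 */
    int     leaf_pages = (nrows + keys_per_page - 1) / keys_per_page;   /* 4   */

    printf("%d rows/page, %d table pages, %d keys/page, %d leaf pages\n",
           rows_per_page, table_pages, keys_per_page, leaf_pages);
    return 0;
}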


Scanning the index to get the tids for keys 0...63 will take two page
reads: the root page and leaf0.

But, to access the data rows you still need 64 random IOs.

So the total times to read the data rows for keys 0..63 look like:

 Using index:

    IOs        time       why

     1         20msec     read root
     1         20msec     read leaf0
    64       1280msec     read 64 rows from data pages
   ---      ---------
    66       1320msec     total


 Using table scan:

    IOs        time       why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total

Note that this ignores cache effects.
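
As a rough sketch, the two uncached estimates reduce to the arithmetic below.
The 20 msec random read and 5 msec sequential read figures are the same
assumptions used in the tables above, not measurements:

/* Sketch: uncached IO cost comparison, same assumptions as the tables above. */
#include <stdio.h>

int
main(void)
{
    double  random_io = 20.0;   /* msec per random page read     */
    double  seq_io = 5.0;       /* msec per sequential page read */
    int     table_pages = 26;   /* from the page arithmetic      */
    int     nmatches = 64;      /* rows with keys 0..63          */

    /* Index scan: root + leaf0, then one random read per matching row. */
    double  index_cost = 2 * random_io + nmatches * random_io;

    /* Table scan: seek and read the first page, read the rest sequentially. */
    double  seq_cost = random_io + (table_pages - 1) * seq_io;

    printf("index scan %.0f msec, table scan %.0f msec\n",
           index_cost, seq_cost);    /* 1320 vs 145 */
    return 0;
}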

If you assume the cache is big enough for the whole table (and in this case
it is), and assume none of the table is resident initially (cold start), then
the IO analysis is:

 Using index (with cache):

    IOs        time       why

     1         20msec     read root
     1         20msec     read leaf0
    10        200msec     read 10 unique data pages (assumes desired keys are
                          not uniformly distributed, which favors the index case)
   ---      ---------
    12        240msec     total


 Using table scan (with cache):

    IOs        time       why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total (no difference from uncached case)

Even with the very favorable assumption that only part of the data pages
needs to be read, the sequential scan is still faster here.
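
For what it is worth, a standard approximation (not from this discussion) for
the expected number of distinct pages touched when fetching k rows at random
from a P-page table is P * (1 - (1 - 1/P)^k). Under that uniform assumption
the 64 fetches would hit about 24 of the 26 pages, which is why the 10-page
figure above favors the index case. A quick sketch:

/* Sketch: expected distinct pages touched by k random row fetches from a
 * P-page table, using the approximation P * (1 - (1 - 1/P)^k). */
#include <stdio.h>
#include <math.h>

int
main(void)
{
    double  P = 26.0;   /* table pages in the example */
    double  k = 64.0;   /* matching rows              */

    double  pages = P * (1.0 - pow(1.0 - 1.0 / P, k));

    printf("expected distinct pages: %.1f of %.0f\n", pages, P);   /* ~23.9 */
    return 0;
}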

> And note, that this is a very simple example. In my case I had a 250MB
> table, and about 1% of records to select from it.

My initial calculation a few posts ago was that the break-even point was .5% for
the table you described, so I still think it is reasonable to do a table scan.
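
Applying the same arithmetic to the toy example above gives a similar figure:
the index scan wins only while 40 + 20 * k msec stays under the 145 msec table
scan, i.e. about 5 rows out of 1024, or roughly 0.5%. A sketch of that
calculation (for the toy example, not the 250MB table):

/* Sketch: break-even row count for the toy example, same assumptions. */
#include <stdio.h>

int
main(void)
{
    double  random_io = 20.0;   /* msec */
    double  seq_io = 5.0;       /* msec */
    int     table_pages = 26;
    int     nrows = 1024;

    double  seq_cost = random_io + (table_pages - 1) * seq_io;  /* 145 msec     */
    double  index_fixed = 2 * random_io;                        /* root + leaf0 */

    double  breakeven_rows = (seq_cost - index_fixed) / random_io;

    printf("break even at about %.1f rows, %.2f%% of the table\n",
           breakeven_rows, 100.0 * breakeven_rows / nrows);     /* ~5.3, ~0.5% */
    return 0;
}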

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - Linux. Not because it is free. Because it is better.

