Thread: Re: [HACKERS] Optimizer fails?

Re: [HACKERS] Optimizer fails?

From
dg@illustra.com (David Gould)
Date:
Michal Mosiewicz writes:
>
> Say you have a table with 1024 indexed values from 0 to 1023. Now, you
> want to select the values less than, say, 64. If you use sequential
> scan, you have to scan 1024 records, right? But if you use index scan,
> you first go to the first node of the tree, and you compare 64 with the
> value stored in this node, which (for a balanced tree) is 512. Now you
> have to search through the next lower node. You compare 64 to 128, which
> is the value of this node. Then you go to the next lower node. You compare
> 64 to 64, and yes, we hit the target. After 3 searches, we know that
> everything under this node is the set of records we are looking for
> right? So, what we have to do is to browse through those values, and
> collect the tuple identifiers.
>
> Note that once we have done these 3 searches, we have to browse all the
> structure of the tree below. We are looking for 64 values. So, it will
> cost us looking through 128 nodes of the subtree.
>
> OK, so by using an index, we had to check 128 + 3 nodes of the tree.

Our btree indexes are quite a bit better than the balanced tree you suppose.

> Now, let's note that there have been only a few IO transfers by now. No
> more than a few pages. And we have tuple identifiers pointing us to 64
> records. Now we may sort these tids in ascending order to optimise IO.

But, we do not do this tid sort. It really isn't easy as you might have
millions of tids, not just a few. Which would mean doing an external sort.
This might be a nice thing to do, but it isn't there now as far as I know.
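
For illustration, the in-memory part of such a sort is straightforward; the
pain only starts once the tid list no longer fits in memory. A minimal
sketch of the idea (not PostgreSQL code; the Tid struct and the sample
values are invented):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned int   block;   /* heap page number  */
    unsigned short offset;  /* slot on that page */
} Tid;

/* order tids by heap page, then slot, so each data page is visited once */
static int
tid_cmp(const void *a, const void *b)
{
    const Tid *x = (const Tid *) a;
    const Tid *y = (const Tid *) b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    return (int) x->offset - (int) y->offset;
}

int
main(void)
{
    /* pretend these came back from an index scan in key order */
    Tid    tids[] = { {17, 3}, {2, 9}, {17, 1}, {5, 4}, {2, 1} };
    size_t i, n = sizeof(tids) / sizeof(tids[0]);

    qsort(tids, n, sizeof(Tid), tid_cmp);

    for (i = 0; i < n; i++)
        printf("fetch page %u, slot %u\n", tids[i].block,
               (unsigned int) tids[i].offset);
    return 0;
}

After sorting, duplicate page numbers sit next to each other, so each data
page is fetched at most once per batch and the fetches move in one direction
across the disk.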

> Everything took us 3 + 128 nodes from index + 64 records from table.
> This is definitely better than reading all 1024 records.

I am not trying to flame you, but it seems to me that you have some
misconceptions about how the indexes work and I am trying only to explain
them a little better.

Using your example of 1024 rows with values from 0 to 1023, and further assuming:

 - 8K pages
 - 200 byte data rows (including overheads)
 - 4 byte keys, so perhaps 32 byte index rows with overheads
 - btree index

Then:

Table has 8192 / 200 = 40 rows per page giving 1024 / 40 = 26 pages

Index has 8192 / 32 = 256 keys per page giving 1024 / 256 = 4 leaf pages
          and one root page with 4 keys.

Something like:
                         ____________________
                 index root: | 0, 256, 512, 768 |
                             --------------------
                             /     \       \
                           /        \        \
                         /           \        \--------------\
                       /              \                       \
            _______________            _________________    ____________
     leaf0: | 0,1,2...255 |     leaf1: | 256,257...511 |    leaf2:  ....
            ---------------            -----------------    ------------
                   |   \  \--------------------------------\
                   |    \------------------\                \
                   |                        \                \
                   |                         \                \
data: | 234 |,  | 11 |,  | 763 |, | 401 |, | 29 |, | 970 |, | 55 |, ....


To scan the index to get the tids for keys 0...63 will take two page
reads: root page, leaf0.

But, to access the data rows you still need 64 random IOs.

So the total times to read the data rows for keys 0..63 look like:

 Using index:

    IOs        time       why

     1         20msec     read root
     1         20msec     read leaf0
    64       1280msec     read 64 rows from data pages
   ---      ---------
    66       1320msec     total


 Using table scan:

    IOs        time       why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total

Note that this ignores cache effects.

If you assume the cache is big enough for the whole table (and in this case
it is), and assume none of the table is resident initially (cold start), then
the IO analysis is:

 Using index (with cache):

    IOs        time       why

     1         20msec     read root
     1         20msec     read leaf0
    10        200msec     read 10 unique data pages (assumes desired keys are
                          not uniformly distributed; this favors the index case)
   ---      ---------
    12        240msec     total


 Using table scan (with cache):

    IOs        time       why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total (no difference from uncached case)

Even with the very favorable assumption that only part of the data pages
need to be read, the index scan is still slower here.

> And note, that this is a very simple example. In my case I had a 250MB
> table, and about 1% of records to select from it.

My initial calculation a few posts ago was that the break-even point was .5%
for the table you described, so I still think it is reasonable to do a table
scan.
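
For anyone who wants to play with the arithmetic, here is a throwaway
calculator using the same assumptions as the tables above (8K pages,
200 byte rows, 20msec per random IO, 5msec per sequential IO, and one
random IO per matching row, i.e. the uncached worst case). These are my
own assumptions, not anything the optimizer actually computes:

#include <stdio.h>
#include <math.h>

#define PAGE_SIZE   8192.0
#define ROW_SIZE     200.0   /* data row including overhead */
#define RANDOM_MS     20.0   /* seek + read of one page     */
#define SEQ_MS         5.0   /* read of the next page       */

int
main(void)
{
    double nrows = 1024.0;
    double rows_per_page = floor(PAGE_SIZE / ROW_SIZE);
    double table_pages = ceil(nrows / rows_per_page);

    /* seq scan: one seek plus sequential reads of the rest of the table */
    double seq_ms = RANDOM_MS + (table_pages - 1.0) * SEQ_MS;
    double sel;

    for (sel = 0.001; sel <= 0.2; sel += 0.001)
    {
        double matches = nrows * sel;
        /* index scan: 2 index pages plus one random IO per matching row */
        double idx_ms = 2.0 * RANDOM_MS + matches * RANDOM_MS;

        if (idx_ms >= seq_ms)
        {
            printf("break-even near %.1f%% selectivity "
                   "(index %.0fms vs seq scan %.0fms)\n",
                   sel * 100.0, idx_ms, seq_ms);
            break;
        }
    }
    return 0;
}

For the 1024 row example it prints a break-even around 0.6%, which is in
the same ballpark as the .5% figure above.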

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - Linux. Not because it is free. Because it is better.


Re: [HACKERS] Optimizer fails?

From
Michal Mosiewicz
Date:
David Gould wrote:

> > Now, let's note that there have been only a few IO transfers by now. No
> > more than a few pages. And we have tuple identifiers pointing us to 64
> > records. Now we may sort these tids in ascending order to optimise IO.
>
> But, we do not do this tid sort. It really isn't easy as you might have
> millions of tids, not just a few. Which would mean doing an external sort.
> This might be a nice thing to do, but it isn't there now as far as I know.

No, you don't need the full set. You may sort it in portions of a
predefined size. You might even alternate ascending/descending order to
optimise the disk head movement, though that may not be a very good idea,
since it works against the read-ahead feature of most disk IO and may
negatively affect ORDER BY performance. Anyhow, you can achieve a sawtooth
reading pattern, which certainly decreases the access time.
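
A sketch of that portion-by-portion idea (again only an illustration:
next_tid() stands in for the real index scan, and a real portion would be
thousands of tids rather than four):

#include <stdio.h>
#include <stdlib.h>

#define PORTION 4            /* portion size; real code would use far more */

/* fake index scan: hands out heap block numbers, returns 0 when done */
static int
next_tid(unsigned int *block)
{
    static const unsigned int scan[] = { 17, 2, 17, 5, 2, 9, 1, 9, 3 };
    static size_t pos = 0;

    if (pos >= sizeof(scan) / sizeof(scan[0]))
        return 0;
    *block = scan[pos++];
    return 1;
}

static int
block_cmp(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *) a;
    unsigned int y = *(const unsigned int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    unsigned int portion[PORTION];
    unsigned int block;
    size_t n = 0;
    int more = 1;

    while (more)
    {
        more = next_tid(&block);
        if (more)
            portion[n++] = block;

        if (n > 0 && (n == PORTION || !more))
        {
            size_t i;

            qsort(portion, n, sizeof(portion[0]), block_cmp);
            for (i = 0; i < n; i++)
                printf("fetch block %u\n", portion[i]);
            printf("end of portion (head sweeps back)\n");
            n = 0;
        }
    }
    return 0;
}

Memory stays bounded at one portion; each portion is read in ascending block
order, and then the head sweeps back for the next one, which gives the
sawtooth pattern.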

> > Everything took us 3 + 128 nodes from index + 64 records from table.
> > This is definitely better than reading all 1024 records.
>
> I am not trying to flame you, but it seems to me that you have some
> misconceptions about how the indexes work and I am trying only to explain
> them a little better.
>[cut]
>  Using index (with cache):
>
>     IOs        time       why
>
>      1         20msec     read root
>      1         20msec     read leaf0
>     10        200msec     read 10 unique data pages (assumes desired keys are
>                           not uniformly distributed; this favors the index case)
>    ---      ---------

OK, this example may be simple. In fact I would agree that in this case
the figures suggest a seq scan is sufficient. But this all started from
my example of a 250MB file with about 2M records, of which I wanted to
select only about 25k. (I omit the fact that those records were naturally
clustered by the time they came into the database, so those 20,000
records were pretty much contiguous.)

As I observed, postgres scanned this database at a rate of 100kB/s
(sequentially), much less than the actual I/O throughput of this machine.
Even when I prepared a condition that returns no records, it still
scanned sequentially, while an index probe would cost only 20msec.

Anyhow... I have to admit that a similar question asked of mysql takes...
mysql> select count(*) from log where dt < 19980209000000 and
dt>19980208000000;
+----------+
| count(*) |
+----------+
|    26707 |
+----------+
1 row in set (7.61 sec)

Of course, if I ask it without the index it takes ~3 minutes. That's why I
expected that postgres would make some use of the index. (The table is the
same in both cases.)

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND

Re: [HACKERS] Optimizer fails?

From
dg@illustra.com (David Gould)
Date:
Michal Mosiewicz:
> As I observed, postgres scanned this database at a rate of 100kB/s
> (sequentially), much less than the actual I/O throughput of this machine.
> Even when I prepared a condition that returns no records, it still
> scanned sequentially, while an index probe would cost only 20msec.

Well, now it looks like there is a bug or two:

 - 100kB/s (sequentially) is way too slow. If you have time, try profiling
   this scan (with gprof). We should be able to do much better than this.
   If you can't do it, we might want to put "Improve sequential scan rate"
   on the todo list.

 - a "select count(*) from x where <some_index_col> <some_qual>"
   should use the index.

> Anyhow... I have to admit that a similar question asked of mysql takes...
> mysql> select count(*) from log where dt < 19980209000000 and
> dt>19980208000000;
> +----------+
> | count(*) |
> +----------+
> |    26707 |
> +----------+
> 1 row in set (7.61 sec)
>
> Of course, if I ask it without the index it takes ~3 minutes. That's why I
> expected that postgres would make some use of the index. (The table is the
> same in both cases.)

Just out of curiosity, how long do these queries take in MySQL vs postgreSQL?

Thanks
-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
 - Linux. Not because it is free. Because it is better.



Re: [HACKERS] Optimizer fails?

From
Michal Mosiewicz
Date:
David Gould wrote:

>  - 100kB/s (sequentially) is way too slow. If you have time, try profiling
>    this scan (with gprof). We should be able to do much better than this.
>    If you can't do it, we might want to put "Improve sequential scan rate"
>    on the todo list.

OK. I'll profile it as soon as I find some spare moment.

> > Of course, if I ask it without the index it takes ~3 minutes. That's why I
> > expected that postgres would make some use of the index. (The table is the
> > same in both cases.)
>
> Just out of curiosity, how long do these queries take in MySQL vs postgreSQL?

MySQL seems to be approximately 20 times better. Using a sequential scan
I can observe on the IO monitor that it's able to read 2-5MB/s. (This is
a two-disk RAID0 configuration that has a maximum throughput of 10MB/s.)

Of course, you have to remember that MySQL has much simpler storage due
to its lack of some features like transactions. I've read some of Monty's
(the author's) thoughts on transactions. He says that introducing
transactions would lower performance by at least 4-5 times (even read
performance).

Now, I've downloaded the personal version of Solid, which I'm also going
to compare. What I personally find very interesting in Solid is its
optimistic locking. Actually it means no locking at all. Solid seems to
have a non-overwriting feature much like postgres, but this
non-overwriting feature is limited to not-yet-checkpointed data in the
buffers (logs). It also maintains an internal structure called a 'Bonsai
Tree' that holds all transactions with their timestamps. If there is a
collision between data, it is deduced from this bonsai tree structure,
and then the latest transaction is rolled back.

By using systematic checkpointing it makes sure that those structures
are relatively small and conflicts are resolved quickly.
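
Just to make that concrete, here is a cartoon of optimistic validation
("first committer wins"): remember recently committed writes with their
commit times, and before a transaction commits, check whether somebody
else committed a conflicting write after it started. This is only meant
to illustrate the idea described above; it has nothing to do with Solid's
actual Bonsai Tree, and every name and number in it is invented:

#include <stdio.h>

typedef struct
{
    int  row;           /* which row was written     */
    long commit_time;   /* when that write committed */
} CommittedWrite;

/* recently committed writes, kept as a flat list instead of a tree */
static const CommittedWrite recent[] = { {101, 40}, {205, 55}, {101, 60} };

/* return 1 if any row we wrote was committed by someone else after we began */
static int
must_roll_back(long start_time, const int *writes, int nwrites)
{
    int    i;
    size_t j;

    for (i = 0; i < nwrites; i++)
        for (j = 0; j < sizeof(recent) / sizeof(recent[0]); j++)
            if (recent[j].row == writes[i] && recent[j].commit_time > start_time)
                return 1;
    return 0;
}

int
main(void)
{
    int  my_writes[] = { 101, 300 };
    long my_start = 50;

    if (must_roll_back(my_start, my_writes, 2))
        printf("conflict detected, roll back the later transaction\n");
    else
        printf("no conflict, commit\n");
    return 0;
}

Systematic checkpointing then amounts to pruning old entries from the
"recent" list so the validation stays cheap.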

Of course, don't treat this as a comparison between postgres and a
commercial product. I just want to share it as a kind of 'food for
thought'.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND

Re: [HACKERS] Optimizer fails?

From
"Vadim B. Mikheev"
Date:
David Gould wrote:
>
> > Now, let's note that there have been only a few IO transfers by now. No
> > more than a few pages. And we have tuple identifiers pointing us to 64
> > records. Now we may sort these tids in ascending order to optimise IO.
>
> But, we do not do this tid sort. It really isn't easy as you might have
> millions of tids, not just a few. Which would mean doing an external sort.
> This might be a nice thing to do, but it isn't there now as far as I know.

Using the TID as the (last) part of the index key is on my TODO list.
This will speed up vacuuming, get rid of all duplicate-key problems,
and give us the feature above.
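
A toy illustration of what that comparator might look like once the tid is
appended as the low-order part of the key (the types and sample values are
made up, not the real btree code):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int            key;     /* user key          */
    unsigned int   block;   /* tid: heap page    */
    unsigned short offset;  /* tid: slot on page */
} IndexEntry;

/* compare on (key, block, offset): equal keys sort by heap location */
static int
entry_cmp(const void *a, const void *b)
{
    const IndexEntry *x = (const IndexEntry *) a;
    const IndexEntry *y = (const IndexEntry *) b;

    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    if (x->block != y->block)
        return (x->block > y->block) - (x->block < y->block);
    return (int) x->offset - (int) y->offset;
}

int
main(void)
{
    IndexEntry e[] = { {64, 9, 2}, {64, 3, 7}, {63, 12, 1}, {64, 3, 1} };
    size_t     i, n = sizeof(e) / sizeof(e[0]);

    qsort(e, n, sizeof(e[0]), entry_cmp);
    for (i = 0; i < n; i++)
        printf("key %d -> (%u,%u)\n", e[i].key, e[i].block,
               (unsigned int) e[i].offset);
    return 0;
}

Every entry is then unique, and an index scan hands back the tids for a
given key already sorted by heap page, which is the ordering the tid-sort
discussion above wants.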

> To scan the index to get the tids for keys 0...63 will take two page
> reads: root page, leaf0.

+ meta page read first - to get root page block number.

Vadim