Re: [HACKERS] Optimizer fails? - Mailing list pgsql-hackers
From:    dg@illustra.com (David Gould)
Subject: Re: [HACKERS] Optimizer fails?
Msg-id:  9803271943.AA27390@hawk.illustra.com
List:    pgsql-hackers
Michal Mosiewicz writes:
>
> Say you have a table with 1024 indexed values from 0 to 1023. Now, you
> want to select the values less than, say, 64. If you use sequential
> scan, you have to scan 1024 records, right? But if you use index scan,
> you first go to the first node of the tree, you compare 64 with the
> value stored in this node, which (for a balanced tree) is 512. Now you
> have to search through the next lower node. You compare 64 to 128, which
> is the value of this node. Then you go to the next lower node. You
> compare 64 to 64, and yes, we hit the target. After 3 searches, we know
> that everything under this node is the set of records we are looking
> for, right? So, what we have to do is to browse through those values,
> and collect the tuple identifiers.
>
> Note that once we have done these 3 searches, we have to browse all the
> structure of the tree below. We are looking for 64 values. So, it will
> cost us looking through 128 nodes of the subtree.
>
> OK, so by using an index, we had to check 128 + 3 nodes of the tree.

Our btree indexes are quite a bit better than the balanced tree you
suppose.

> Now, let's note that there have been only a few IO transfers by now. No
> more than a few pages. And we have tuple identifiers pointing us to 64
> records. Now we may sort these tids in ascending order to optimise IO.

But we do not do this tid sort. It really isn't easy, as you might have
millions of tids, not just a few, which would mean doing an external sort.
This might be a nice thing to do, but it isn't there now as far as I know.

> Everything took us 3 + 128 nodes from index + 64 records from table.
> This is definitely better than reading all 1024 records.

I am not trying to flame you, but it seems to me that you have some
misconceptions about how the indexes work, and I am only trying to explain
them a little better.

Using your example of 1024 rows with values from 0 to 1023, and further
assuming:

 - 8K pages
 - 200 byte data rows (including overheads)
 - 4 byte keys, so perhaps 32 byte index rows with overheads
 - a btree index

Then:

The table has 8192 / 200 = 40 rows per page, giving 1024 / 40 = 26 pages.

The index has 8192 / 32 = 256 keys per page, giving 1024 / 256 = 4 leaf
pages plus one root page with 4 keys.

Something like:

                          ____________________
            index root:  | 0, 256, 512, 768   |
                          --------------------
                          /        \          \
                         /          \          \
                        /            \          \--------------\
                       /              \                         \
            _______________      _________________       ____________
    leaf0: | 0,1,2...255   | leaf1: | 256,257...511 | leaf2: ....
            ---------------      -----------------       ------------
             |    \      \--------------------------------\
             |     \------------------\                     \
             |                         \                     \
             |                          \                     \
    data: | 234 |, | 11 |, | 763 |, | 401 |, | 29 |, | 970 |, | 55 |, ....

Scanning the index to get the tids for keys 0...63 takes two page reads:
the root page and leaf0.

But, to access the data rows you still need 64 random IOs. So the total
times to read the data rows for keys 0..63 look like:

Using index:

  IOs      time       why
    1     20msec      read root
    1     20msec      read leaf0
   64   1280msec      read 64 rows from data pages
  ---   ---------
   66   1320msec      total

Using table scan:

  IOs      time       why
    1     20msec      seek and read 1st page
   25    125msec      sequential read (5 msec each) of rest of table
  ---   --------
   26    145msec      total

Note that this ignores cache effects.
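To make the arithmetic easy to play with, here is a small Python sketch of
the same back-of-envelope model (the page size, row and key sizes, and the
20msec random / 5msec sequential IO figures are just the assumptions above,
not measurements, and the function names are made up for illustration):

# Back-of-envelope IO cost model for the example above.
# All constants are the assumptions stated in this mail, not measurements.

PAGE_SIZE = 8192   # bytes per page
ROW_SIZE  = 200    # bytes per data row, including overheads
KEY_SIZE  = 32     # bytes per index entry, including overheads
RANDOM_IO = 20.0   # msec per random page read
SEQ_IO    = 5.0    # msec per sequential page read

def table_pages(nrows):
    rows_per_page = PAGE_SIZE // ROW_SIZE      # 40 rows per page
    return -(-nrows // rows_per_page)          # ceiling: 26 pages for 1024 rows

def index_scan_cost(nmatch, data_page_reads=None):
    # One read for the root, one for the leaf holding the range, then one
    # random IO per matching row (or per distinct data page, if the cache
    # coalesces repeat visits, as in the cached case discussed below).
    if data_page_reads is None:
        data_page_reads = nmatch
    ios = 2 + data_page_reads
    return ios, ios * RANDOM_IO

def seq_scan_cost(nrows):
    # First page costs a seek + read; the rest stream sequentially.
    pages = table_pages(nrows)
    return pages, RANDOM_IO + (pages - 1) * SEQ_IO

print("index scan        :", index_scan_cost(64))        # (66, 1320.0)
print("index scan, cached:", index_scan_cost(64, 10))     # (12, 240.0)
print("seq scan          :", seq_scan_cost(1024))         # (26, 145.0)

Plugging in the numbers from the example reproduces the 1320msec vs 145msec
figures above, and the cached figures that follow.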
If you assume the cache is big enough for the whole table (and in this
case it is), and assume none of the table is resident initially (cold
start), then the IO analysis is:

Using index (with cache):

  IOs      time       why
    1     20msec      read root
    1     20msec      read leaf0
   10    200msec      read 10 unique data pages (assumes the desired keys
                      are not uniformly distributed; this favors the
                      index case)
  ---   ---------
   12    240msec      total

Using table scan (with cache):

  IOs      time       why
    1     20msec      seek and read 1st page
   25    125msec      sequential read (5 msec each) of rest of table
  ---   --------
   26    145msec      total   (no difference from uncached case)

Even with the very favorable assumption that only part of the data pages
needs to be read, the index scan is still slower here.

> And note, that this is a very simple example. In my case I had a 250MB
> table, and about 1% of records to select from it.

My initial calculation a few posts ago was that the break-even was .5% for
the table you described, so I still think it is reasonable to do a table
scan.

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
 - Linux. Not because it is free. Because it is better.