Thread: pgsql is 75 times faster with my new index scan
Hello,
I recently spoke about extending the index scan to take data directly
from index pages, and I wanted to know whether I should spend my time
implementing it. So I hacked the latest pgsql a bit to use the proposed
scan mode and did some measurements (see below). Measurements were done
on an (id int, txt varchar(20)) table with 1,000,000 rows, with a btree
index on both attributes. The query involved was:

select id,count(txt) from big group by id;

The duplicate distribution on the id column was 1:1000. I ran the query
twice after a Linux restart to ensure proper cache utilization (on disk,
heap & index were 90MB in total). So I think that by implementing this
scan mode we can expect a huge speedup in all queries which use indices
and can find all their data in the index pages.

Problems:
my changes implement only the index scan and a new cost function.
It doesn't work when index pages contain tuples which don't belong to
our transaction; the test was done after a vacuum and with only one
transaction running.

TODO:
- add HeapTupleHeaderData into each IndexTupleData
- change code to reflect the above
- when deleting/updating the heap, also update the tuples'
  HeapTupleHeaderData in the indices

The last step could be done in two ways. First, by limiting the number
of indices for one table, we can store the corresponding indices' TIDs
in each heap tuple. The update is then simple, taking one disk write.
Or do it the standard way - look up the appropriate index tuple by
traversing the index. That will cost us more disk accesses.

Is someone interested in this ??
regards devik

With current indexscan:
! system usage stats:
!       1812.534505 elapsed 93.060547 user 149.447266 system sec
!       [93.118164 user 149.474609 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       130978/32 [131603/297] page faults/reclaims, 132 [132] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks: 555587 read, 551155 written, buffer hit rate = 44.68%
!       Local blocks:       0 read,      0 written, buffer hit rate = 0.00%
!       Direct blocks:      0 read,      0 written

With improved indexscan:
! system usage stats:
!       23.686788 elapsed 22.157227 user 0.372071 system sec
!       [22.193359 user 0.385742 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       1186/42 [1467/266] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:   4385 read,      0 written, buffer hit rate = 4.32%
!       Local blocks:       0 read,      0 written, buffer hit rate = 0.00%
!       Direct blocks:      0 read,      0 written
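The first TODO item proposes copying the heap tuple's visibility header into each index tuple, so a scan can decide whether an entry is visible without touching the heap page. A minimal illustrative sketch of that idea follows; the struct and field names are simplified stand-ins, not the actual PostgreSQL definitions, and the visibility test below ignores commit status entirely:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the visibility fields of HeapTupleHeaderData. */
typedef struct {
    uint32_t xmin, xmax;   /* inserting / deleting transaction ids */
    uint32_t cmin, cmax;   /* command ids within those transactions */
} VisibilityHeaderSketch;

/* Stand-in for an IndexTupleData extended per the TODO above. */
typedef struct {
    uint32_t heap_block;          /* heap TID: block number */
    uint16_t heap_offset;         /* heap TID: line pointer */
    VisibilityHeaderSketch vis;   /* NEW: copied from the heap tuple */
    /* ... index key bytes would follow ... */
} IndexTupleSketch;

/* An index scan could then test visibility from the index page alone.
 * Grossly simplified: visible if inserted by an earlier transaction
 * and not deleted by one we can see. */
static int tuple_visible(const VisibilityHeaderSketch *v, uint32_t my_xid)
{
    return v->xmin <= my_xid && (v->xmax == 0 || v->xmax > my_xid);
}
```

The catch, as the message's "Problems" paragraph notes, is the third TODO item: every heap update must now rewrite these copies in every index.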
* devik@cdi.cz <devik@cdi.cz> [000926 02:33] wrote:
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]
> Is someone interested in this ??

Considering the speedup, I sure as hell am interested. :)
When are we going to have this?

-Alfred
devik@cdi.cz wrote:
>
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]
> TODO:
> - add HeapTupleHeaderData into each IndexTupleData
> - change code to reflect above
> - when deleting-updating heap then also update tuples'
>   HeapTupleHeaderData in indices

I doubt everyone would like trading query speed for insert/update
speed plus index size.

Perhaps this could be implemented as a special type of index which
you must explicitly specify at create time ?

> The last step could be done in two ways. First by limiting
> number of indices for one table we can store coresponding
> indices' TIDs in each heap tuple. The update is then simple
> taking one disk write.

Why limit it ? One could just save a tid array in each tuple.

-----------------
Hannu
At 12:28 26/09/00 +0300, Hannu Krosing wrote:
>> TODO:
>> - add HeapTupleHeaderData into each IndexTupleData
>> - change code to reflect above
>> - when deleting-updating heap then also update tuples'
>>   HeapTupleHeaderData in indices
>
>I doubt everyone would like trading query speed for insert/update
>speed plus index size

Anybody who has a high enquiry:update ratio would, I think. And that
will be most people, especially web people.

>Perhaps this could be implemented as a special type of index which
>you must explicitly specify at create time ?

That would be great, or alternatively, an attribute on the index?

Dec RDB offers this kind of feature, and it suffers from deadlocking
problems because of index node vs. row locking, as well as high lock
contention when updating indexed data, so it's not just the cost of
doing the updates. I definitely think it's a good feature to
implement, if possible, but making it optional would be wise.

----------------------------------------------------------------
Philip Warner                  | Albatross Consulting Pty. Ltd.
                               | (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81       | Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au        |
PGP key available upon request, and from pgp5.ai.mit.edu:11371
Hannu Krosing wrote:
>
> > TODO:
> > - add HeapTupleHeaderData into each IndexTupleData
> > - change code to reflect above
> > - when deleting-updating heap then also update tuples'
> >   HeapTupleHeaderData in indices
>
> I doubt everyone would like trading query speed for insert/update
> speed plus index size
>
> Perhaps this could be implemented as a special type of index which
> you must explicitly specify at create time ?
>
> > The last step could be done in two ways. First by limiting
> > number of indices for one table we can store coresponding
> > indices' TIDs in each heap tuple. The update is then simple
> > taking one disk write.
>
> Why limit it ? One could just save an tid array in each tuple .

Indices' TIDs are transient.
Isn't it useless to store indices' TIDs ?

Regards.
Hiroshi Inoue
> > > The last step could be done in two ways. First by limiting
> > > number of indices for one table we can store coresponding
> > > indices' TIDs in each heap tuple. The update is then simple
> > > taking one disk write.
> >
> > Why limit it ? One could just save an tid array in each tuple .

because when you add a new index you would have to rescan the whole
heap and grow the tid array ..

> Indice's TIDs are transient.
> Isn't it useless to store indice's TIDs ?

but yes, Hiroshi is right. The index TID is transient. I first looked
into the pg sources two weeks ago, so I still have holes in my
knowledge. So the only solution is to traverse the index ..
> That would be great, or alternatively, an attribute on the index?

yes, an attribute would be nice.

> Dec RDB offers this kind of feature, and it suffers from deadlocking
> problems because of index node vs. row locking as well as high lock
> contention when updating indexed data, so it's not just the cost of
> doing the updates. I definitely thinks it's a good feature to
> implement, if

what is the reason for the contention/deadlock ? We (or you?) only
need to update the index tuple's header data (like TX min/max); the
key is untouched. So all index scans should be able to proceed
undisturbed. The index page should be locked in memory only for a few
ticks during the actual memcpy to the page.

BTW: IMHO when WAL becomes functional, will we still need
multi-versioned tuples in the heap ? Why not just keep tuple versions
in the WAL log and add them during scans ?

devik
> > The last step could be done in two ways. First by limiting
> > number of indices for one table we can store coresponding
> > indices' TIDs in each heap tuple. The update is then simple
> > taking one disk write.
>
> Why limit it ? One could just save an tid array in each tuple .

And update the *entire* heap after adding a new index?!

Vadim
Why not implement *true* CLUSTER?
With cluster, all heap tuples will be in the cluster index.

Vadim

> -----Original Message-----
> From: devik@cdi.cz [mailto:devik@cdi.cz]
> Sent: Tuesday, September 26, 2000 2:15 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] pgsql is 75 times faster with my new index scan
>
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]
> > Indice's TIDs are transient.
> > Isn't it useless to store indice's TIDs ?
>
> but yes Hiroshi is right. Index TID is transient. I first looked
> into pg sources two weeks ago so I have still holes in my knowledge.
> So that only solution is to traverse it ..

It was discussed several times for btree - add the heap tid to the
index key and you'll scan the index for a particular tuple much
faster. Not sure what could be done for hash indices... order hash
items with the same hash key?

Vadim
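Vadim's suggestion - treat the heap TID as a trailing key column so the tid itself breaks ties between equal keys - can be sketched as a comparison routine. The types and names below are illustrative stand-ins, not the real btree code:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for ItemPointerData: a heap tuple's physical address. */
typedef struct {
    uint32_t block;   /* heap block number */
    uint16_t offset;  /* line pointer within the block */
} ItemPointerSketch;

/* Stand-in for a btitem with a single int4 key. */
typedef struct {
    int32_t key;              /* the user-visible index key */
    ItemPointerSketch htid;   /* heap TID, used as the tiebreaker */
} BtItemSketch;

/* Compare two index items: key first, then heap TID.  Because no two
 * heap tuples share a TID, the combined ordering has no duplicates,
 * and a known (key, tid) pair locates one unique index entry. */
static int btitem_cmp(const BtItemSketch *a, const BtItemSketch *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->htid.block != b->htid.block)
        return (a->htid.block < b->htid.block) ? -1 : 1;
    if (a->htid.offset != b->htid.offset)
        return (a->htid.offset < b->htid.offset) ? -1 : 1;
    return 0;
}
```

This is what makes the heap-tuple-to-index-tuple lookup fast: instead of walking every entry with an equal key, a descent guided by btitem_cmp lands directly on the one entry for the tuple being updated.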
"Mikheev, Vadim" wrote:
>
> Why not implement *true* CLUSTER?
> With cluster, all heap tuples will be in cluster index.

What is *true* CLUSTER ?

'grep CLUSTER' over the latest SQL standards gives back nothing.

> > > The last step could be done in two ways. First by limiting
> > > number of indices for one table we can store coresponding
> > > indices' TIDs in each heap tuple. The update is then simple
> > > taking one disk write.
> >
> > Why limit it ? One could just save an tid array in each tuple .
>
> And update *entire* heap after addition new index?!

I guess that this would have to be done even for a limited number of
indices' TIDs in a heap tuple ?

--------------
Hannu
> Why not implement *true* CLUSTER?
> With cluster, all heap tuples will be in cluster index.

It would be nice. It's a pity that the pg AMs are not general; there
is no simple way to use a btree instead of a heap. But it would help.
Using values from the index is a good idea too, though, because you
can have a table with many columns and an aggregate query which needs
only two of them. Then it will be MUCH faster to create a secondary
index, which is much smaller than the heap, and use the values from
it.

Vadim, where can I find some code from the upcoming WAL ?
I'm thinking about implementing a special ranked b-tree which will
store precomputed aggregate values (like cnt, min, max, sum) in the
btree node keys. It can then be used for extremely fast evaluation of
aggregates. But in the case of MVCC it is more complicated, and I'd
like to see how it would be affected by WAL.

devik
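The ranked-tree idea above - caching aggregates of each subtree in its parent node so a range aggregate visits O(log n) nodes instead of every leaf - can be sketched with an ordinary binary search tree standing in for a btree (all names here are illustrative, and MVCC visibility is ignored entirely):

```c
#include <stdlib.h>

/* Each node caches the count and sum of all keys in its subtree,
 * maintained incrementally on insert. */
typedef struct Node {
    int key;
    long subtree_count;   /* precomputed: number of keys below here */
    long subtree_sum;     /* precomputed: sum of keys below here */
    struct Node *left, *right;
} Node;

static Node *ranked_insert(Node *n, int key)
{
    if (n == NULL) {
        n = calloc(1, sizeof(Node));
        n->key = key;
    } else if (key < n->key) {
        n->left = ranked_insert(n->left, key);
    } else {
        n->right = ranked_insert(n->right, key);
    }
    /* Every node on the insertion path gains one key. */
    n->subtree_count++;
    n->subtree_sum += key;
    return n;
}

/* count(*) WHERE key <= bound, using the cached counts: whenever a
 * whole left subtree falls inside the range, its count is taken in
 * O(1) instead of being walked. */
static long count_le(const Node *n, int bound)
{
    if (n == NULL)
        return 0;
    if (n->key <= bound)
        return 1 + (n->left ? n->left->subtree_count : 0)
                 + count_le(n->right, bound);
    return count_le(n->left, bound);
}
```

As devik notes, the hard part under MVCC is that the cached values are only correct for a single snapshot; a concurrent transaction's uncommitted insert would have to be excluded from, or compensated out of, the node totals.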
> What is *true* CLUSTER ?
>
> 'grep CLUSTER' over the latest SQL standards gives back nothing.

storing data in a b-tree instead of a heap, for example.

> > And update *entire* heap after addition new index?!
>
> I guess that this should be done even for limited number of
> indices' TIDs in a heap tuple ?

yep - that idea was thrown away already.
> > Why not implement *true* CLUSTER?
> > With cluster, all heap tuples will be in cluster index.
>
> It would be nice. It's pity that pg AMs are not general.
> There is no simple way to use btree instead of heap. But
> it would help.
> But using values from index is good idea too because you
> can have table with many columns and aggregate query which
> needs only two columns.
> The it will be MUCH faster to create secondary index which
> is much smaller than heap and use values from it.

Agreed. But this will add 16 bytes (2 xid + 2 cid) to the size of
btitems. Currently, the total size of a btitem for a 2-int4-key index
is 16 bytes => the new feature will double the size of the index and
increase the # of levels (obviously bad for mostly static tables).

So, this feature should be *optional*...

> Vadim where can I found some code from upcoming WAL ?
> I'm thinking about implementing special ranked b-tree
> which will store precomputed aggregate values (like
> cnt,min,max,sum) in btree node keys. It can be then
> used for extremely fast evaluation of aggregates. But
> in case of MVCC it is more complicated and I'd like
> to see how it would be affected by WAL.

MVCC will not be affected by WAL currently. It's an issue for the
storage manager, not WAL.

Vadim
> > It was discussed several times for btree - add heap tid to
> > index key and you'll scan index for particulare tuple much faster.
>
> good idea :) Why not just use the tid ALWAYS as the last part of the
> key ? When the btree code sees equal keys it will compare tids ?
> Would it not be better to use oids ? They don't change during vacuum,
> and with the tuple header in the index we would know them.

In some future I would like to make OIDs optional - they are not
always used, so why waste space?

+ using the TID would make keys unique, and this would simplify the
handling of duplicates.

+ the heap TID is already part of btitems in leaf nodes - OIDs would
just increase the btitem size.

> > Not sure what could be done for hash indices... order hash
> > items with the same hash key?
>
> question is whether we need it for hash indices. it is definitely
> good for btree as they support range retrieval. hash indices don't,
> so I wouldn't implement it for them.

We need fast heap tuple --> index tuple lookup for the overwriting
storage manager anyway...

Vadim
> > question is whether we need it for hash indices. it is definitely
> > good for btree as they support range retrieval. hash indices don't,
> > so I wouldn't implement it for them.
>
> We need fast heap tuple --> index tuple lookup for overwriting
> storage manager anyway...

oh .. there will be such a thing ?
> > The it will be MUCH faster to create secondary index which
> > is much smaller than heap and use values from it.
>
> Agreed. But this will add 16 bytes (2 xid + 2 cid) to size of btitems.
> Currently, total size of btitem for 2-int4 key index is 16 bytes =>
> new feature will double size of index and increase # of levels
> (obviously bad for mostly static tables).
>
> So, this feature should be *optional*...

yes, it definitely should.

> MVCC will not be affected by WAL currently. It's issue
> of storage manager, not WAL.

and where will the WAL sit ? can you explain it a bit ?

thanks, devik
> > TODO:
> > - add HeapTupleHeaderData into each IndexTupleData
> > - change code to reflect above
> > - when deleting-updating heap then also update tuples'
> >   HeapTupleHeaderData in indices
>
> I doubt everyone would like trading query speed for insert/update
> speed plus index size

If he is scanning through the entire index, he could do a sequential
scan of the table, grab all the tid transaction status values, and use
those when viewing the index. No need to store/update the transaction
status in the index that way.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
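Bruce's alternative - one sequential pass over the heap that records each tuple's transaction status keyed by TID, which the index scan then consults instead of visiting heap pages per-tuple - might be sketched like this (toy sizes and names; as the follow-ups in the thread point out, for millions of tuples this table is itself large):

```c
#include <stdint.h>

#define MAX_BLOCKS  1024   /* toy limit; a real heap has far more */
#define MAX_OFFSETS 64     /* toy tuples-per-page limit */

typedef enum { STATUS_UNKNOWN = 0, STATUS_VISIBLE, STATUS_DEAD } TupleStatus;

/* One slot per possible TID; zero-initialized to STATUS_UNKNOWN. */
static TupleStatus status_map[MAX_BLOCKS * MAX_OFFSETS];

/* Called once per heap tuple during the sequential pass. */
static void record_status(uint32_t block, uint16_t offset, TupleStatus s)
{
    status_map[block * MAX_OFFSETS + offset] = s;
}

/* Called by the index scan for each index entry's heap TID. */
static TupleStatus lookup_status(uint32_t block, uint16_t offset)
{
    return status_map[block * MAX_OFFSETS + offset];
}
```

The map trades one full heap scan plus memory proportional to the table's tuple count for the per-tuple heap visits, which is why devik objects below that it defeats the purpose for exactly the big-table case the index-only scan targets.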
Hi guys,

Having some trouble. One of my databases appeared to be empty suddenly
after having a large amount of data in it. I contacted our server
company and they gave me the postgres dir. I have put the folder of
the news database from the base dir back into the base dir of Postgres
and recreated the database. That's fine. However, now the database is
empty. When I do a cat on the file of the same name as one of the
tables, it has loads of data in it. However, when I go into Postgres
and try to list the table, it comes back with 0 rows.

Any ideas? I am desperate. I am using Red Hat Linux and Postgres.

Thanks
Abe
> > > I doubt everyone would like trading query speed for insert/update
> > > speed plus index size
> >
> > If he is scanning through the entire index, he could do a sequential
> > scan of the table, grab all the tid transaction status values, and use
> > those when viewing the index. No need to store/update the transaction
> > status in the index that way.
>
> Huh ? How ? It is how you do it now. Do you expect
> load several milion transaction statuses into memory,
> then scan index and lookup these values ?
> Missed I something ?
> devik

Not sure. I figured they were pretty small values.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
>
> > > > I doubt everyone would like trading query speed for insert/update
> > > > speed plus index size
> > >
> > > If he is scanning through the entire index, he could do a sequential
> > > scan of the table, grab all the tid transaction status values, and use
> > > those when viewing the index. No need to store/update the transaction
> > > status in the index that way.
> >
> > Huh ? How ? It is how you do it now. Do you expect
> > load several milion transaction statuses into memory,
> > then scan index and lookup these values ?
> > Missed I something ?
> > devik
>
> Not sure. I figured they were pretty small values.

IIRC the whole point was to avoid scanning the table ?

-------------
Hannu
> > I doubt everyone would like trading query speed for insert/update
> > speed plus index size
>
> If he is scanning through the entire index, he could do a sequential
> scan of the table, grab all the tid transaction status values, and use
> those when viewing the index. No need to store/update the transaction
> status in the index that way.

Huh ? How ? That is how you do it now. Do you expect to load several
million transaction statuses into memory, then scan the index and look
up these values ? Am I missing something ?

devik
> > > > those when viewing the index. No need to store/update the transaction
> > > > status in the index that way.
> > >
> > > Huh ? How ? It is how you do it now. Do you expect
> > > load several milion transaction statuses into memory,
> > > then scan index and lookup these values ?
> > > Missed I something ?
> > > devik
> >
> > Not sure. I figured they were pretty small values.
>
> IIRC the whole point was to avoid scanning the table ?

Yes, that was the main point ! For a small number of records the
current method is fast enough. The direct index scan is useful for big
tables and for scans over large parts of them (like in aggregates).

devik
> Bruce Momjian wrote:
[snip]
> > Not sure. I figured they were pretty small values.
>
> IIRC the whole point was to avoid scanning the table ?

Yes, sorry.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026