Thread: pgsql is 75 times faster with my new index scan

pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
Hello,
I recently spoke about extending the index scan to be able
to take data directly from index pages. I wanted to know
whether I should spend my time implementing it.
So I hacked the latest pgsql a bit to use the proposed scan
mode and did some measurements (see below). Measurements
were done on an (id int, txt varchar(20)) table with 1,000,000 rows
with a btree index on both attributes. The query involved was:
select id,count(txt) from big group by id;
The duplicate distribution on the id column was 1:1000. I ran the
query twice after a linux restart to ensure proper cache
utilization (on disk, heap & index were 90MB in total).
So I think that by implementing this scan mode we can expect
to gain a huge speedup in all queries which use indices and
can find all their data in the index pages.

Problems:
my changes implemented only the indexscan and a new cost function.
it doesn't work when index pages contain tuples which don't
belong to our transaction. the test was done after a vacuum with
only one tx running.

TODO:
- add HeapTupleHeaderData into each IndexTupleData
- change code to reflect above
- when deleting-updating heap then also update tuples' HeapTupleHeaderData in indices

The last step could be done in two ways. First, by limiting the
number of indices for one table we can store the corresponding
indices' TIDs in each heap tuple. The update is then simple,
taking one disk write.
Or do it in the standard way - look up the appropriate index tuple
by traversing the index. It will cost us more disk accesses.

Is anyone interested in this?
regards devik

With current indexscan:
! system usage stats:
!       1812.534505 elapsed 93.060547 user 149.447266 system sec
!       [93.118164 user 149.474609 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       130978/32 [131603/297] page faults/reclaims, 132 [132] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:     555587 read,     551155 written, buffer hit rate = 44.68%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

With improved indexscan:
! system usage stats:
!       23.686788 elapsed 22.157227 user 0.372071 system sec
!       [22.193359 user 0.385742 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       1186/42 [1467/266] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:       4385 read,          0 written, buffer hit rate = 4.32%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written




Re: pgsql is 75 times faster with my new index scan

From
Alfred Perlstein
Date:
* devik@cdi.cz <devik@cdi.cz> [000926 02:33] wrote:
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]
> 
> Is someone interested in this ??

Considering the speedup, I sure as hell am interested. :)

When are we going to have this?

-Alfred


Re: pgsql is 75 times faster with my new index scan

From
Hannu Krosing
Date:
devik@cdi.cz wrote:
> 
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]
> 
> TODO:
> - add HeapTupleHeaderData into each IndexTupleData
> - change code to reflect above
> - when deleting-updating heap then also update tuples'
>   HeapTupleHeaderData in indices

I doubt everyone would like trading query speed for insert/update
speed plus index size.

Perhaps this could be implemented as a special type of index which 
you must explicitly specify at create time ?

> The last step could be done in two ways. First by limiting
> number of indices for one table we can store coresponding
> indices' TIDs in each heap tuple. The update is then simple
> taking one disk write.

Why limit it? One could just save a tid array in each tuple.

-----------------
Hannu


Re: pgsql is 75 times faster with my new index scan

From
Philip Warner
Date:
At 12:28 26/09/00 +0300, Hannu Krosing wrote:
>> TODO:
>> - add HeapTupleHeaderData into each IndexTupleData
>> - change code to reflect above
>> - when deleting-updating heap then also update tuples'
>>   HeapTupleHeaderData in indices
>
>I doubt everyone would like trading query speed for insert/update 
>speed plus index size

Anybody who has a high enquiry:update ratio would, I think. And that will
be most people, especially web people.


>Perhaps this could be implemented as a special type of index which 
>you must explicitly specify at create time ?

That would be great, or alternatively, an attribute on the index?


Dec RDB offers this kind of feature, and it suffers from deadlocking
problems because of index node vs. row locking as well as high lock
contention when updating indexed data, so it's not just the cost of doing
the updates. I definitely think it's a good feature to implement, if
possible, but making it optional would be wise.


----------------------------------------------------------------
Philip Warner                    |  Albatross Consulting Pty. Ltd.
(A.B.N. 75 008 659 498)          |  Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82         |  Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371


Re: pgsql is 75 times faster with my new index scan

From
Hiroshi Inoue
Date:

Hannu Krosing wrote:

>
> >
> > TODO:
> > - add HeapTupleHeaderData into each IndexTupleData
> > - change code to reflect above
> > - when deleting-updating heap then also update tuples'
> >   HeapTupleHeaderData in indices
>
> I doubt everyone would like trading query speed for insert/update
> speed plus index size
>
> Perhaps this could be implemented as a special type of index which
> you must explicitly specify at create time ?
>
> > The last step could be done in two ways. First by limiting
> > number of indices for one table we can store coresponding
> > indices' TIDs in each heap tuple. The update is then simple
> > taking one disk write.
>
> Why limit it ? One could just save an tid array in each tuple .
>

Indices' TIDs are transient.
Isn't it useless to store indices' TIDs?

Regards.

Hiroshi Inoue



Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> > > The last step could be done in two ways. First by limiting
> > > number of indices for one table we can store coresponding
> > > indices' TIDs in each heap tuple. The update is then simple
> > > taking one disk write.
> >
> > Why limit it ? One could just save an tid array in each tuple .

because when you add a new index you would have to rescan the whole
heap and grow the tid array ..

> Indices' TIDs are transient.
> Isn't it useless to store indices' TIDs?

but yes, Hiroshi is right. Index TIDs are transient. I first looked
into the pg sources two weeks ago so I still have holes in my knowledge.
So the only solution is to traverse it ..




Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> That would be great, or alternatively, an attribute on the index?

yes, an attribute would be nice.

> 
> Dec RDB offers this kind of feature, and it suffers from deadlocking
> problems because of index node vs. row locking as well as high lock
> contention when updating indexed data, so it's not just the cost of doing
> the updates. I definitely thinks it's a good feature to implement, if

what is the reason for the contention/deadlock? We (or you?) only need to
update the index tuple's header data (like TX min/max) and the key is
untouched. So all index scans should be able to proceed without
disturbance. The index page should be locked in memory only for a few
ticks during the actual memcpy to the page.

BTW: IMHO when WAL becomes functional, do we still need multi-versioned
tuples in the heap? Why not just version tuples in the WAL log and add
them during scans?

devik



RE: pgsql is 75 times faster with my new index scan

From
"Mikheev, Vadim"
Date:
> > The last step could be done in two ways. First by limiting
> > number of indices for one table we can store coresponding
> > indices' TIDs in each heap tuple. The update is then simple
> > taking one disk write.
> 
> Why limit it ? One could just save an tid array in each tuple .

And update the *entire* heap after adding a new index?!

Vadim


RE: pgsql is 75 times faster with my new index scan

From
"Mikheev, Vadim"
Date:
Why not implement *true* CLUSTER?
With cluster, all heap tuples will be in cluster index.

Vadim

> -----Original Message-----
> From: devik@cdi.cz [mailto:devik@cdi.cz]
> Sent: Tuesday, September 26, 2000 2:15 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] pgsql is 75 times faster with my new index scan
> 
> Hello,
> I recently spoke about extending index scan to be able
> to take data directly from index pages.
[snip]


RE: pgsql is 75 times faster with my new index scan

From
"Mikheev, Vadim"
Date:
> > Indices' TIDs are transient.
> > Isn't it useless to store indices' TIDs?
> 
> but yes, Hiroshi is right. Index TIDs are transient. I first looked
> into the pg sources two weeks ago so I still have holes in my knowledge.
> So the only solution is to traverse it ..

It was discussed several times for btree - add the heap tid to the index
key and you'll scan the index for a particular tuple much faster.
Not sure what could be done for hash indices... order hash items with the
same hash key?

Vadim


Re: pgsql is 75 times faster with my new index scan

From
Hannu Krosing
Date:
"Mikheev, Vadim" wrote:
> 
> Why not implement *true* CLUSTER?
> With cluster, all heap tuples will be in cluster index.
> 

What is *true* CLUSTER ?

'grep CLUSTER' over the latest SQL standards gives back nothing.


> > > The last step could be done in two ways. First by limiting
> > > number of indices for one table we can store coresponding
> > > indices' TIDs in each heap tuple. The update is then simple
> > > taking one disk write.
> > 
> > Why limit it? One could just save a tid array in each tuple.
> 
> And update the *entire* heap after adding a new index?!

I guess this would have to be done even for a limited number of
indices' TIDs in a heap tuple?

--------------
Hannu


Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> Why not implement *true* CLUSTER?
> With cluster, all heap tuples will be in cluster index.

It would be nice. It's a pity that the pg AMs are not general.
There is no simple way to use a btree instead of the heap. But
it would help.
But using values from the index is a good idea too, because you
can have a table with many columns and an aggregate query which
needs only two columns.
Then it will be MUCH faster to create a secondary index which
is much smaller than the heap and use the values from it.

Vadim, where can I find some code from the upcoming WAL?
I'm thinking about implementing a special ranked b-tree
which will store precomputed aggregate values (like
cnt,min,max,sum) in btree node keys. It can then be
used for extremely fast evaluation of aggregates. But
in the case of MVCC it is more complicated and I'd like
to see how it would be affected by WAL.

devik




Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> What is *true* CLUSTER ?
> 
> 'grep CLUSTER' over the latest SQL standards gives back nothing.

storing data in a b-tree instead of a heap, for example.

> > And update *entire* heap after addition new index?!
> 
> I guess that this should be done even for limited number of
> indices' TIDs in a heap tuple ?

yep - the idea was thrown away already.


RE: pgsql is 75 times faster with my new index scan

From
"Mikheev, Vadim"
Date:
> > Why not implement *true* CLUSTER?
> > With cluster, all heap tuples will be in cluster index.
> 
> It would be nice. It's a pity that the pg AMs are not general.
> There is no simple way to use a btree instead of the heap. But
> it would help.
> But using values from the index is a good idea too, because you
> can have a table with many columns and an aggregate query which
> needs only two columns.
> Then it will be MUCH faster to create a secondary index which
> is much smaller than the heap and use the values from it.

Agreed. But this will add 16 bytes (2 xid + 2 cid) to size of btitems.
Currently, total size of btitem for 2-int4 key index is 16 bytes =>
new feature will double size of index and increase # of levels
(obviously bad for mostly static tables).

So, this feature should be *optional*...

> Vadim, where can I find some code from the upcoming WAL?
> I'm thinking about implementing a special ranked b-tree
> which will store precomputed aggregate values (like
> cnt,min,max,sum) in btree node keys. It can then be
> used for extremely fast evaluation of aggregates. But
> in the case of MVCC it is more complicated and I'd like
> to see how it would be affected by WAL.

MVCC will not be affected by WAL currently. It's issue
of storage manager, not WAL.

Vadim


RE: pgsql is 75 times faster with my new index scan

From
"Mikheev, Vadim"
Date:
> > It was discussed several times for btree - add the heap tid to 
> > the index key and you'll scan the index for a particular tuple much faster.
> 
> good idea :) Why not just use the tid ALWAYS as the last part of the key?
> When the btree code sees equal keys then it will compare tids?
> Wouldn't it be better to use oids? They don't change during vacuum
> and with the tupleheader in the index we will know them.

In some future I would like to make OIDs optional - they are not
always used, so why waste space?

+ using TID would make keys unique and this would simplify
handling of duplicates.

+ the heap TID is already part of btitems in leaf nodes - OIDs would just
increase btitem size.

> > Not sure what could be done for hash indices... order hash 
> > items with the same hash key?
> 
> the question is whether we need it for hash indices. it is definitely
> good for btree as they support range retrieval. hash indices don't,
> so I wouldn't implement it for them.

We need fast heap tuple --> index tuple lookup for the overwriting
storage manager anyway...

Vadim


Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> > the question is whether we need it for hash indices. it is definitely
> > good for btree as they support range retrieval. hash indices don't,
> > so I wouldn't implement it for them.
> 
> We need fast heap tuple --> index tuple lookup for the overwriting
> storage manager anyway...
oh .. there will be such a thing?




Re: pgsql is 75 times faster with my new index scan

From
devik@cdi.cz
Date:
> > Then it will be MUCH faster to create a secondary index which
> > is much smaller than the heap and use the values from it.
> 
> Agreed. But this will add 16 bytes (2 xid + 2 cid) to size of btitems.
> Currently, total size of btitem for 2-int4 key index is 16 bytes =>
> new feature will double size of index and increase # of levels
> (obviously bad for mostly static tables).
> 
> So, this feature should be *optional*...

yes. it definitely should.

> MVCC will not be affected by WAL currently. It's issue
> of storage manager, not WAL.

and where will the WAL sit? can you explain it a bit?

thanks devik



Re: pgsql is 75 times faster with my new index scan

From
Bruce Momjian
Date:
> > TODO:
> > - add HeapTupleHeaderData into each IndexTupleData
> > - change code to reflect above
> > - when deleting-updating heap then also update tuples'
> >   HeapTupleHeaderData in indices
> 
> I doubt everyone would like trading query speed for insert/update 
> speed plus index size

If he is scanning through the entire index, he could do a sequential
scan of the table, grab all the tid transaction status values, and use
those when viewing the index.  No need to store/update the transaction
status in the index that way.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Deep Trouble

From
"Abe Asghar"
Date:
    Hi guys,

Having some trouble.

One of my databases appeared to be empty suddenly after having a large
amount of data in it.  I contacted our server company and they gave me the
postgres dir.

I have put the folder of the news database from the base dir back into the
base dir of Postgres and recreated the database.

That's fine.

However, now the database is empty.  When I do a cat on the file with the
same name as one of the tables - it has loads of data in it.  However, when
I go into Postgres and try to list the table it comes back with 0 rows.

Any ideas? I am desperate.

I am using linux redhat and Postgres.

Thanks
Abe



Re: pgsql is 75 times faster with my new index scan

From
Bruce Momjian
Date:
> > > I doubt everyone would like trading query speed for insert/update 
> > > speed plus index size
> > 
> > If he is scanning through the entire index, he could do a sequential
> > scan of the table, grab all the tid transaction status values, and use
> > those when viewing the index.  No need to store/update the transaction
> > status in the index that way.
> 
> Huh? How? That is how you do it now. Do you expect to
> load several million transaction statuses into memory,
> then scan the index and look up these values?
> Did I miss something?
> devik
> 
> 

Not sure.  I figured they were pretty small values.



Re: pgsql is 75 times faster with my new index scan

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > > > I doubt everyone would like trading query speed for insert/update
> > > > speed plus index size
> > >
> > > If he is scanning through the entire index, he could do a sequential
> > > scan of the table, grab all the tid transaction status values, and use
> > > those when viewing the index.  No need to store/update the transaction
> > > status in the index that way.
> >
> > Huh? How? That is how you do it now. Do you expect to
> > load several million transaction statuses into memory,
> > then scan the index and look up these values?
> > Did I miss something?
> > devik
> >
> >
> 
> Not sure.  I figured they were pretty small values.

IIRC the whole point was to avoid scanning the table ?

-------------
Hannu


Re: pgsql is 75 times faster with my new index scan

From
Devik
Date:
> > I doubt everyone would like trading query speed for insert/update 
> > speed plus index size
> 
> If he is scanning through the entire index, he could do a sequential
> scan of the table, grab all the tid transaction status values, and use
> those when viewing the index.  No need to store/update the transaction
> status in the index that way.

Huh? How? That is how you do it now. Do you expect to
load several million transaction statuses into memory,
then scan the index and look up these values?
Did I miss something?
devik



Re: pgsql is 75 times faster with my new index scan

From
Devik
Date:
> > > > those when viewing the index.  No need to store/update the transaction
> > > > status in the index that way.
> > >
> > > Huh? How? That is how you do it now. Do you expect to
> > > load several million transaction statuses into memory,
> > > then scan the index and look up these values?
> > > Did I miss something?
> > > devik
> > Not sure.  I figured they were pretty small values.
> IIRC the whole point was to avoid scanning the table ?

Yes. This was the main point! For a small number of records the
current method is fast enough. The direct index scan is useful
for big tables and scans over large parts of them (like
in aggregates).
devik



Re: pgsql is 75 times faster with my new index scan

From
Bruce Momjian
Date:
> Bruce Momjian wrote:
> > 
> > > > > I doubt everyone would like trading query speed for insert/update
> > > > > speed plus index size
> > > >
> > > > If he is scanning through the entire index, he could do a sequential
> > > > scan of the table, grab all the tid transaction status values, and use
> > > > those when viewing the index.  No need to store/update the transaction
> > > > status in the index that way.
> > >
> > > Huh? How? That is how you do it now. Do you expect to
> > > load several million transaction statuses into memory,
> > > then scan the index and look up these values?
> > > Did I miss something?
> > > devik
> > >
> > >
> > 
> > Not sure.  I figured they were pretty small values.
> 
> IIRC the whole point was to avoid scanning the table ?

Yes, sorry.
