Thread: todo: Hash index creation
Is anyone currently working on this TODO item?

    "During index creation, pre-sort the tuples to improve build speed"
    http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

A few of us would like to tackle it and see if we can add some value here.

Tom, Shreya
twraney@comcast.net wrote:
> Is anyone currently working on this TODO item?
>
> "During index creation, pre-sort the tuples to improve build speed"
> http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php
>
> A few of us would like to tackle it and see if we can add some value here.

That would be nice!

If you want to work on hash indexes, though, this TODO item seems more
important to me at least:

> Add WAL logging for crash recovery

Until that's done, you can't really use hash indexes for anything other
than read-only tables.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> twraney@comcast.net wrote:
>> Is anyone currently working on this TODO item?
>> "During index creation, pre-sort the tuples to improve build speed"

> If you want to work on hash indexes, though, this TODO item seems more
> important to me at least:
>> Add WAL logging for crash recovery

Actually I think the *most* important thing to work on is to get hash to
the point where its search speed actually beats btree consistently, so
that it has an excuse to live. If that is insoluble we might well end up
ripping it out entirely. (The first three TODO items for hash indexes
are ideas for trying to improve the speed.)

Fixing the WAL support would come after that, and bring it to the point
where someone could actually recommend it for production use.

After that it would be sensible to work on inessentials like improving
the build speed.

			regards, tom lane
On Wed, Jun 27, 2007 at 08:36:54PM -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > twraney@comcast.net wrote:
> >> Is anyone currently working on this TODO item?
> >> "During index creation, pre-sort the tuples to improve build speed"
>
> > If you want to work on hash indexes, though, this TODO item seems more
> > important to me at least:
> >> Add WAL logging for crash recovery
>
> Actually I think the *most* important thing to work on is to get hash to
> the point where its search speed actually beats btree consistently, so
> that it has an excuse to live. If that is insoluble we might well end up
> ripping it out entirely. (The first three TODO items for hash indexes
> are ideas for trying to improve the speed.)
>
> Fixing the WAL support would come after that, and bring it to the point
> where someone could actually recommend it for production use.
>
> After that it would be sensible to work on inessentials like improving
> the build speed.
>
> 			regards, tom lane

I definitely agree with Tom's assessment. If we cannot make the hash
index as performant as it is in theory, none of the other refinements
are worth it. You would need to use a B-tree if you were concerned about
speed (and who isn't?).

Ken
> Actually I think the *most* important thing to work on is to get hash to
> the point where its search speed actually beats btree consistently, so
> that it has an excuse to live. If that is insoluble we might well end up
> ripping it out entirely. (The first three TODO items for hash indexes
> are ideas for trying to improve the speed.)
>
> Fixing the WAL support would come after that, and bring it to the point
> where someone could actually recommend it for production use.
>
> After that it would be sensible to work on inessentials like improving
> the build speed.

I've been warned away from hash indexes before, but I had no idea that
their performance was so abysmal that BTREE beats them, and I was
definitely not aware that they are not covered by WAL logging. I was told
it wasn't as good as it could be, but I wasn't told it was pretty much an
alpha piece of code.

As a result, when creating tables containing large blocks of text I wish
to index, I've been using HASH as an index method. Can we please state in
the manual that HASH index types are at a beta stage of development, or
something similar, or perhaps remove the manual entry altogether until
HASH is at a point where it is usable in production?

Regards,
A very surprised n00b.
Naz Gassiep <naz@mira.net> writes:
> As a result, when creating tables containing large blocks of text I wish
> to index, I've been using HASH as an index method. Can we please state in
> the manual that HASH index types are at a beta stage of development, or
> something similar, or perhaps remove the manual entry altogether until
> HASH is at a point where it is usable in production?

Uh, the manual already does say

    Note: Testing has shown PostgreSQL's hash indexes to perform no
    better than B-tree indexes, and the index size and build time for
    hash indexes is much worse. Furthermore, hash index operations are
    not presently WAL-logged, so hash indexes might need to be rebuilt
    with REINDEX after a database crash. For these reasons, hash index
    use is presently discouraged.

under 11.2 Index Types, as well as various derogatory remarks elsewhere.

			regards, tom lane
On Monday, 2007-07-02 at 04:27, Naz Gassiep wrote:
> I've been warned away from hash indexes before, but I had no idea that
> their performance was so abysmal that BTREE beats them, and I was
> definitely not aware that they are not covered by WAL logging. I was
> told it wasn't as good as it could be, but I wasn't told it was pretty
> much an alpha piece of code.
>
> As a result, when creating tables containing large blocks of text I wish
> to index, I've been using HASH as an index method.

If you just wish to have smaller indexes, then you can use a functional
btree index over a hash of the text, like this:

CREATE INDEX largetextindex ON mytable (hashtext(largetext));

and use

SELECT *
  FROM mytable
 WHERE hashtext(largetext) = hashtext('searchvalue')
   AND largetext = 'searchvalue';

(Note the second condition: hashtext() can collide, so the real value
still has to be rechecked.)

btw, if the real hash indexes don't get fixed soon, maybe we could
redefine hash index to actually mean usage like this and do the rewrites
in the parser?

> Can we please state in the manual that HASH index types are at a beta
> stage of development, or something similar, or perhaps remove the manual
> entry altogether until HASH is at a point where it is usable in
> production?
>
> Regards,
> A very surprised n00b.
Wow... not sure how I missed that. I *did* create this schema ages ago;
perhaps it wasn't there, or at the time I had no idea what the
implications were. *shrug*

Regards,
- Naz.

Tom Lane wrote:
> Naz Gassiep <naz@mira.net> writes:
>> As a result, when creating tables containing large blocks of text I wish
>> to index, I've been using HASH as an index method. Can we please state in
>> the manual that HASH index types are at a beta stage of development, or
>> something similar, or perhaps remove the manual entry altogether until
>> HASH is at a point where it is usable in production?
>
> Uh, the manual already does say
>
>     Note: Testing has shown PostgreSQL's hash indexes to perform no
>     better than B-tree indexes, and the index size and build time for
>     hash indexes is much worse. Furthermore, hash index operations are
>     not presently WAL-logged, so hash indexes might need to be rebuilt
>     with REINDEX after a database crash. For these reasons, hash index
>     use is presently discouraged.
>
> under 11.2 Index Types, as well as various derogatory remarks elsewhere.
>
> 			regards, tom lane
Kenneth Marshall wrote:
> I definitely agree with Tom's assessment. If we cannot make the hash
> index as performant as it is in theory, none of the other refinements
> are worth it. You would need to use a B-tree if you were concerned
> about speed (and who isn't?).

I just got an idea.

Hash indexes would take much less space, and be more efficient to
search, if we only stored the hash of the key in the index. Such index
tuples would be fixed size, so we could get rid of the overhead of the
length field in IndexTuple, as well as the line pointers.

Of course, the key value would need to be rechecked after fetching the
heap tuple, but that shouldn't be a problem assuming there are few
collisions.

Another idea: when searching, we currently scan the whole bucket page
looking for matching tuples. If we stored the tuples ordered by hash
value within the page, we could do a binary search instead.

These changes might give the hash indexam the edge over b-tree that it
needs.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
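[Editor's note: to make those two ideas concrete, here is a minimal
sketch in C, with entirely hypothetical structure and function names;
this is not the actual PostgreSQL hash access method code. It shows
fixed-size entries that hold only the key's hash and the heap tuple's
location, kept sorted by hash value within a bucket page so a lookup can
binary-search the page instead of scanning it.]

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical fixed-size hash index entry: only the key's hash and the
 * heap tuple's location are stored, never the key itself, so there is
 * no IndexTuple length field and no per-tuple line pointer overhead.
 */
typedef struct HashEntry
{
    uint32_t hashval;   /* hash of the indexed key */
    uint32_t heapblk;   /* heap block number of the tuple */
    uint16_t heapoff;   /* offset of the tuple within that block */
} HashEntry;

/*
 * Binary search over a bucket page's entries, which are kept sorted by
 * hashval. Returns the index of the first entry with hashval >= target;
 * the caller walks forward over all equal hash values and rechecks the
 * real key against each heap tuple, since distinct keys can hash alike.
 */
static int
page_lower_bound(const HashEntry *entries, int nentries, uint32_t target)
{
    int lo = 0, hi = nentries;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (entries[mid].hashval < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    /* A toy "bucket page", already sorted by hashval. */
    HashEntry page[] = { {17, 1, 3}, {42, 2, 1}, {42, 7, 5}, {99, 4, 2} };
    int       n = sizeof(page) / sizeof(page[0]);

    /* Find every candidate entry for hash value 42. */
    for (int i = page_lower_bound(page, n, 42);
         i < n && page[i].hashval == 42;
         i++)
        printf("candidate at heap (%u,%u)\n",
               (unsigned) page[i].heapblk, (unsigned) page[i].heapoff);

    return 0;
}

[Such an entry is 12 bytes as laid out above, versus a full IndexTuple
plus line pointer, so many more entries would fit per bucket page; that
is where the space and search savings would come from.]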
On Thu, Jul 05, 2007 at 12:26:45PM +0100, Heikki Linnakangas wrote:
> I just got an idea.
>
> Hash indexes would take much less space, and be more efficient to
> search, if we only stored the hash of the key in the index. Such index
> tuples would be fixed size, so we could get rid of the overhead of the
> length field in IndexTuple, as well as the line pointers.
>
> Of course, the key value would need to be rechecked after fetching the
> heap tuple, but that shouldn't be a problem assuming there are few
> collisions.
>
> Another idea: when searching, we currently scan the whole bucket page
> looking for matching tuples. If we stored the tuples ordered by hash
> value within the page, we could do a binary search instead.
>
> These changes might give the hash indexam the edge over b-tree that it
> needs.

I was thinking about your idea, and the new HOT functionality, on my
commute yesterday. The first step would be as you suggested: store only
the hash value and the heap tuple's location in the index. The check for
collisions within a hash bucket is essentially the same thing we do when
searching for the heap tuple visible to a specific transaction. This
would mean that, with a HOT-type process, the index would not need to be
updated when a non-indexed field is changed; we would treat the new
version as another tuple in the HOT chain. Typical lookups would then
take two disk reads per item: the first to get the heap location from
the index, and the second to get the heap page itself.

The next step would be to have the hash function generate the heap
location directly, eliminating the index read, so that retrieving an
item would take a single read. This would also provide a simple
clustering based on the hash index. I am certain that there are holes in
this idea, but it would be a logical next step.

Regards,
Ken
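[Editor's note: to make the read counts concrete, here is a hedged
sketch of the two lookup paths Ken describes, again with hypothetical
names rather than actual PostgreSQL code.]

#include <stdint.h>
#include <stdio.h>

/*
 * Two-read path (hash index stores hashval -> heap location):
 *   read 1: index bucket page, translating the key's hash into a heap
 *           (block, offset) pair
 *   read 2: heap page, rechecking the key and visibility, walking a
 *           HOT-style chain for updated versions of the row
 *
 * One-read path (the hash yields the heap block directly):
 *   read 1: heap page at hash_to_heap_block(...), then scan and chain
 *           within that page; the index read disappears and the table
 *           is effectively clustered by hash
 */
static uint32_t
hash_to_heap_block(uint32_t hashval, uint32_t nblocks)
{
    /*
     * Real code would have to cope with table growth and with overflow
     * when a target block fills up; this is the idealized mapping only.
     */
    return hashval % nblocks;
}

int
main(void)
{
    uint32_t hashval = 0xDEADBEEFu;
    uint32_t nblocks = 1024;    /* size of the hypothetical heap */

    printf("probe heap block %u directly (one page read)\n",
           (unsigned) hash_to_heap_block(hashval, nblocks));
    return 0;
}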