Thread: "Hash index" vs. "b-tree index" (PostgreSQL 8.0)
Greetings, We are working on speeding up our queries by creating indexes. We have queries with search criteria such as "select ... where *col1='...'*" -- a simple query with only the "=" operation -- so I set up a hash index on column "col1". However, the PostgreSQL 8.0 documentation says:

*Note: * Testing has shown PostgreSQL's hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. For these reasons, hash index use is presently discouraged.

May I know, for a simple "=" operation query, which of "hash index" vs. "B-tree index" provides better performance? Thanks, Emi
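For concreteness, the two index definitions being compared look like this (a sketch; the table and index names are invented, and b-tree is what a plain CREATE INDEX gives you by default):

    -- illustration only: invented table and index names
    CREATE TABLE t (col1 varchar);
    CREATE INDEX t_col1_btree ON t (col1);            -- b-tree (the default)
    CREATE INDEX t_col1_hash  ON t USING hash (col1); -- hash must be requested
    SELECT * FROM t WHERE col1 = 'some value';        -- the "=" query in question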
Ying Lu wrote:
> May I know for simple "=" operation query, for "Hash index" vs. "B-tree" index, which can provide better performance please?

I don't think we've found a case in which the hash index code outperforms B+-tree indexes, even for "=". The hash index code also has a number of additional issues: for example, it isn't WAL safe, it has relatively poor concurrency, and creating a hash index is significantly slower than creating a b+-tree index.

-Neil
On 5/9/05, Neil Conway <neilc@samurai.com> wrote:
> I don't think we've found a case in which the hash index code outperforms B+-tree indexes, even for "=". The hash index code also has a number of additional issues: for example, it isn't WAL safe, it has relatively poor concurrency, and creating a hash index is significantly slower than creating a b+-tree index.

This being the case, is there ever ANY reason for someone to use it? If not, then shouldn't we consider deprecating it and eventually removing it? This would reduce complexity, I think.

Chris -- | Christopher Petrilli | petrilli@gmail.com
Christopher Petrilli wrote:
> This being the case, is there ever ANY reason for someone to use it?

Well, someone might fix it up at some point in the future. I don't think there's anything fundamentally wrong with hash indexes, it is just that the current implementation is a bit lacking.

> If not, then shouldn't we consider deprecating it and eventually removing it?

I would personally consider the code to be deprecated already. I don't think there is much to be gained by removing it: the code is pretty isolated from the rest of the tree, and (IMHO) not a significant maintenance burden.

-Neil
On Tue, May 10, 2005 at 01:34:57AM +1000, Neil Conway wrote:
> Christopher Petrilli wrote:
> > This being the case, is there ever ANY reason for someone to use it?
>
> Well, someone might fix it up at some point in the future. I don't think there's anything fundamentally wrong with hash indexes, it is just that the current implementation is a bit lacking.
>
> > If not, then shouldn't we consider deprecating it and eventually removing it?
>
> I would personally consider the code to be deprecated already. I don't think there is much to be gained by removing it: the code is pretty isolated from the rest of the tree, and (IMHO) not a significant maintenance burden.

That may be true, but it's also a somewhat 'developer-centric' view. ;) Having indexes that people shouldn't be using does add confusion for users, and presents the opportunity for foot-shooting. I don't know what purpose they once served, but if there's no advantage to them they should be officially deprecated and eventually removed. Even if there is some kind of advantage (would they possibly speed up hash joins?), if there are no plans to fix them they should still be removed. If someone ever really wanted to do something with it, the code would still be in CVS.

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
> Having indexes that people shouldn't be using does add confusion for users, and presents the opportunity for foot-shooting.

Emitting a warning/notice on hash-index creation is something I've suggested in the past -- that would be fine with me.

> Even if there is some kind of advantage (would they possibly speed up hash joins?)

No, hash joins and hash indexes are unrelated.

-Neil
On Tue, May 10, 2005 at 02:38:41AM +1000, Neil Conway wrote:
> Jim C. Nasby wrote:
> > Having indexes that people shouldn't be using does add confusion for users, and presents the opportunity for foot-shooting.
>
> Emitting a warning/notice on hash-index creation is something I've suggested in the past -- that would be fine with me.

Probably not a bad idea.

> > Even if there is some kind of advantage (would they possibly speed up hash joins?)
>
> No, hash joins and hash indexes are unrelated.

I know they are now, but does that have to be the case? Like I said, I don't know the history, so I don't know why we even have them to begin with.

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
> *Note: * Testing has shown PostgreSQL's hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. For these reasons, hash index use is presently discouraged.
>
> May I know for simple "=" operation query, for "Hash index" vs. "B-tree" index, which can provide better performance please?

If you trust the documentation, use a b-tree. If you don't trust the documentation, do your own tests.

Please don't cross-post.
Jim C. Nasby wrote:
>> No, hash joins and hash indexes are unrelated.
> I know they are now, but does that have to be the case?

I mean, the algorithms are fundamentally unrelated. They share a bit of code such as the hash functions themselves, but they are really solving two different problems (disk-based indexing with (hopefully) good concurrency and WAL logging vs. in-memory joins via hashing with spill to disk if needed).

> Like I said, I don't know the history, so I don't know why we even have them to begin with.

As I said, the idea of using hash indexes for better performance on equality scans is perfectly valid, it is just the implementation that needs work.

-Neil
Jim C. Nasby wrote:
> On Tue, May 10, 2005 at 02:38:41AM +1000, Neil Conway wrote:
> > Jim C. Nasby wrote:
> > > Having indexes that people shouldn't be using does add confusion for users, and presents the opportunity for foot-shooting.
> >
> > Emitting a warning/notice on hash-index creation is something I've suggested in the past -- that would be fine with me.
>
> Probably not a bad idea.

Agreed.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Neil Conway <neilc@samurai.com> writes:
> Jim C. Nasby wrote:
> >> No, hash joins and hash indexes are unrelated.
> > I know they are now, but does that have to be the case?
> I mean, the algorithms are fundamentally unrelated. They share a bit of code such as the hash functions themselves, but they are really solving two different problems

In fact, up till fairly recently they didn't even share the hash functions. Which was a bug not a feature, but the fact remains --- there's not a lot of commonality.

> > Like I said, I don't know the history, so I don't know why we even have them to begin with.

I think it's largely because some Berkeley grad student had a need to implement hash indexes for academic reasons ;-)

> As I said, the idea of using hash indexes for better performance on equality scans is perfectly valid, it is just the implementation that needs work.

I was thinking about that earlier today. It seems to me there is a window within which hash indexes are theoretically superior, but it might be pretty narrow.

The basic allure of a hash index is that you look at the search key, do some allegedly-trivial computations, and go directly to the relevant index page(s); whereas a btree index requires descending through several upper levels of index pages to reach the target leaf page. On the other hand, once you reach the target index page, a hash index has no better method than linear scan through all the page's index entries to find the actually wanted key(s); in fact you have to search all the pages in that index bucket. A btree index can use binary search to find the relevant items within the page.

So it strikes me that important parameters include the index entry size and the number of entries matching any particular key value. btree will probably win for smaller keys, on two grounds: it will have fewer tree levels to descend through, because of higher fan-out, and it will be much more efficient at finding the target entries within the target page when there are many entries per page. (As against this, it will have to work harder at each upper tree page to decide where to descend to --- but I think that's a second-order consideration.)

hash will tend to win as the number of duplicate keys increases, because its relative inefficiency at finding the matches within a particular bucket will become less significant. (The ideal situation for a hash index is to have only one actual key value per bucket. You can't really afford to store only one index entry per bucket, because of the sheer I/O volume that would result, so you need multiple entries that will all be responsive to your search.) (This also brings up the thought that it might be interesting to support hash buckets smaller than a page ... but I don't know how to make that work in an adaptive fashion.)

I suspect that people haven't looked hard for a combination of these parameters within which hash can win. Of course the real question for us is whether the window is wide enough to justify the maintenance effort for hash.

regards, tom lane
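One way to probe that window empirically is a sketch along these lines (not from the thread: table and index names are invented, and pg_relation_size() is a later builtin -- on 8.0 the equivalent lives in contrib/dbsize):

    -- hypothetical benchmark harness: same data, both index types
    CREATE TABLE probe AS
        SELECT md5(g::text) AS k
        FROM generate_series(1, 1000000) AS g;

    \timing
    CREATE INDEX probe_btree ON probe (k);             -- b-tree build time
    CREATE INDEX probe_hash  ON probe USING hash (k);  -- hash build time

    -- pg_relation_size() is a later builtin; contrib/dbsize is the 8.0-era equivalent
    SELECT pg_relation_size('probe_btree') AS btree_bytes,
           pg_relation_size('probe_hash')  AS hash_bytes;

    -- time an equality probe; drop one index (or toggle the enable_*
    -- planner settings) to measure each access method in turn
    EXPLAIN ANALYZE SELECT * FROM probe WHERE k = md5('424242');

Varying the key width and the number of duplicates per key is what would map out the window described above.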
Tom Lane wrote:
> On the other hand, once you reach the target index page, a hash index has no better method than linear scan through all the page's index entries to find the actually wanted key(s)

I wonder if it would be possible to store the keys in a hash bucket in sorted order, provided that the necessary ordering is defined for the index keys -- considering the ubiquity of b+-trees in Postgres, the chances of an ordering being defined are pretty good.

Handling overflow pages would be tricky: perhaps we could store the entries in a given page in sorted order, but not try to maintain that order for the hash bucket as a whole. This would mean we'd need to do a binary search for each page of the bucket, but that would still be a win.

-Neil
Neil Conway <neilc@samurai.com> writes:
> Tom Lane wrote:
> > On the other hand, once you reach the target index page, a hash index has no better method than linear scan through all the page's index entries to find the actually wanted key(s)
> I wonder if it would be possible to store the keys in a hash bucket in sorted order, provided that the necessary ordering is defined for the index keys -- considering the ubiquity of b+-trees in Postgres, the chances of an ordering being defined are pretty good.

I have a gut reaction against that: it makes hash indexes fundamentally subservient to btrees. We shouldn't bring in concepts that are outside the basic opclass abstraction.

However: what about storing the things in hashcode order? Ordering uint32s doesn't seem like any big conceptual problem. I think that efficient implementation of this would require explicitly storing the hash code for each index entry, which we don't do now, but it seems justifiable on multiple grounds --- besides this point, the search could avoid doing the data-type-specific comparison if the hash code isn't equal.

There is evidence in the code that indexes used to store more info than what we now think of as a "standard" index tuple. I am not sure when that went away or what it'd cost to bring it back, but it seems worth looking into.

regards, tom lane
Tom Lane wrote:
> I have a gut reaction against that: it makes hash indexes fundamentally subservient to btrees.

I wouldn't say "subservient" -- if there is no ordering defined for the index key, we just do a linear scan.

> However: what about storing the things in hashcode order? Ordering uint32s doesn't seem like any big conceptual problem.

Hmm, my memory of the hash code is a bit fuzzy. Do I understand correctly?

- we only use some of the bits in the hash to map from the hash of a key to its bucket
- therefore within a bucket, we can still distinguish most of the non-equal tuples from one another by comparing their full hash values
- if we keep the entries in a bucket (or page, I guess -- per earlier mail) sorted by full hash value, we can use that to perform a binary search

Sounds like a good idea to me. How likely is it that the hash index will be sufficiently large that we're using most of the bits in the hash just to map hash values to buckets, so that the binary search won't be very effective? (At this point many of the distinct keys in each bucket will be full-on hash collisions, although sorting by the key values themselves would still be effective.)

> I think that efficient implementation of this would require explicitly storing the hash code for each index entry, which we don't do now, but it seems justifiable on multiple grounds --- besides this point, the search could avoid doing the data-type-specific comparison if the hash code isn't equal.

Another benefit is that it would speed up page splits -- there would be no need to rehash all the keys in a bucket when doing the split.

-Neil
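That bucket mapping can be illustrated at the SQL level, since hashtext(), the internal text hash function, is callable directly (a sketch; the real hash AM masks the low-order bits of the 32-bit hash against the current bucket count, which the mask below imitates for an assumed power-of-two count of 256):

    -- with 256 buckets, only the low 8 bits of the 32-bit hash pick the
    -- bucket; the remaining bits are what a within-bucket sort could exploit
    SELECT hashtext('some key')             AS full_hash,
           hashtext('some key') & (256 - 1) AS bucket;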
Tom Lane <tgl@sss.pgh.pa.us> writes:
> However: what about storing the things in hashcode order? Ordering uint32s doesn't seem like any big conceptual problem.
>
> I think that efficient implementation of this would require explicitly storing the hash code for each index entry, which we don't do now, but it seems justifiable on multiple grounds --- besides this point, the search could avoid doing the data-type-specific comparison if the hash code isn't equal.

It seems that means doubling the size of the hash index. That's a pretty big I/O-to-CPU tradeoff.

What if the hash index stored *only* the hash code? That could be useful for indexing large datatypes that would otherwise create large indexes. A good hash function should have a pretty low collision rate anyway, so the occasional extra I/O should be more than outweighed by the decrease in I/O needed to use the index.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > I think that efficient implementation of this would require explicitly storing the hash code for each index entry,
> It seems that means doubling the size of the hash index. That's a pretty big I/O-to-CPU tradeoff.

Hardly. The minimum possible size of a hash entry today is 8 bytes header plus 4 bytes datum, plus there's a 4-byte line pointer to factor in. So under the most pessimistic assumptions, storing the hash code would add 25% to the size. (On MAXALIGN=8 hardware, it might cost you nothing at all.)

> What if the hash index stored *only* the hash code? That could be useful for indexing large datatypes that would otherwise create large indexes.

Hmm, that could be a thought.

regards, tom lane
On Tue, May 10, 2005 at 10:14:11AM +1000, Neil Conway wrote:
> Jim C. Nasby wrote:
> >> No, hash joins and hash indexes are unrelated.
> > I know they are now, but does that have to be the case?
>
> I mean, the algorithms are fundamentally unrelated. They share a bit of code such as the hash functions themselves, but they are really solving two different problems (disk-based indexing with (hopefully) good concurrency and WAL logging vs. in-memory joins via hashing with spill to disk if needed).

Well, in a hash-join right now you normally end up feeding at least one side of the join with a seqscan. Wouldn't it speed things up considerably if you could look up hashes in the hash index instead? That way you can eliminate going to the heap for any hashes that match. Of course, if limited tuple visibility info was added to hash indexes (similar to what I think is currently happening to B-trees), many of the heap scans could be eliminated as well.

A similar method could also be used for hash aggregates, assuming they use the same hash.

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Tue, May 10, 2005 at 12:10:57AM -0400, Tom Lane wrote:
> be responsive to your search.) (This also brings up the thought that it might be interesting to support hash buckets smaller than a page ... but I don't know how to make that work in an adaptive fashion.)

IIRC, other databases that support hash indexes also allow you to define the bucket size, so it might be a good start to allow for that. DBAs usually have a pretty good idea of what a table will look like in production, so if there's clear documentation on the effect of bucket size, a good DBA should be able to make a good decision.

What's the challenge to making it adaptive: coming up with an algorithm that gives you the optimal bucket size (which I would think there's research on...), or allowing the index to accommodate different bucket sizes existing in the index at once? (Presumably you don't want to re-write the entire index every time it looks like a different bucket size would help.)

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <decibel@decibel.org> writes: > What's the challange to making it adaptive, comming up with an algorithm > that gives you the optimal bucket size (which I would think there's > research on...) or allowing the index to accommodate different bucket > sizes existing in the index at once? (Presumably you don't want to > re-write the entire index every time it looks like a different bucket > size would help.) Exactly. That's (a) expensive and (b) really hard to fit into the WAL paradigm --- I think we could only handle it as a REINDEX. So if it were adaptive at all I think we'd have to support multiple bucket sizes existing simultaneously in the index, and I do not see a good way to do that. Allowing a bucket size to be specified at CREATE INDEX doesn't seem out of line though. We'd have to think up a scheme for index-AM-specific index parameters ... regards, tom lane
On Tue, May 10, 2005 at 11:49:50AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > What's the challenge to making it adaptive: coming up with an algorithm that gives you the optimal bucket size (which I would think there's research on...), or allowing the index to accommodate different bucket sizes existing in the index at once? (Presumably you don't want to re-write the entire index every time it looks like a different bucket size would help.)
>
> Exactly. That's (a) expensive and (b) really hard to fit into the WAL paradigm --- I think we could only handle it as a REINDEX. So if it were adaptive at all I think we'd have to support multiple bucket sizes existing simultaneously in the index, and I do not see a good way to do that.

I'm not really familiar enough with hash indexes to know if this would work, but if the maximum bucket size was known you could use that to determine a maximum range of buckets to look at. In some cases, that range would include only one bucket, otherwise it would be a set of buckets. If you found a set of buckets, I think you could then just go to the specific one you need.

If we assume that the maximum bucket size is one page, it becomes more realistic to take an existing large bucket and split it into several smaller ones. This could be done on an update to the index page, or a background process could handle it.

In any case, should this go on the TODO list?

> Allowing a bucket size to be specified at CREATE INDEX doesn't seem out of line though. We'd have to think up a scheme for index-AM-specific index parameters ...

-- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
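A sketch of what such an index-AM-specific parameter might look like (purely hypothetical: 8.0 has no per-AM index options, and both the WITH (...) syntax and the parameter name are invented here):

    -- hypothetical only: a WITH (...) style scheme for AM-specific options
    CREATE INDEX t_col1_hash ON t USING hash (col1) WITH (bucket_size = 4096);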
Tom Lane <tgl@sss.pgh.pa.us> writes:
> > What if the hash index stored *only* the hash code? That could be useful for indexing large datatypes that would otherwise create large indexes.
>
> Hmm, that could be a thought.

Hm, if you go this route of having hash indexes store tuples ordered by hash code and storing the hash code in the index, then it seems hash indexes become just a macro for a btree index of HASH(index columns).

I'm not saying that to criticize this plan. In fact I think that captures most (though not all) of what a hash index should be. It would be pretty useful. In fact, if it isn't how hash indexes are implemented, then it might be useful to provide a user-visible hash(ROW) function that allows creating such indexes as functional indexes. Though hiding it would make the SQL simpler.

-- greg
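Greg's functional-index framing can already be approximated by hand (a sketch; hashtext() is the internal text hash function, which happens to be callable from SQL; table t and column k are invented, and the recheck against the real key is essential to weed out collisions):

    -- emulate a hash-codes-only index with an expression b-tree
    CREATE INDEX t_k_hashcode ON t ((hashtext(k)));

    SELECT * FROM t
    WHERE hashtext(k) = hashtext('some long value')  -- narrows via the index
      AND k = 'some long value';                     -- rechecks the real key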
"Jim C. Nasby" <decibel@decibel.org> writes: > Well, in a hash-join right now you normally end up feeding at least one > side of the join with a seqscan. Wouldn't it speed things up > considerably if you could look up hashes in the hash index instead? That's called a "nestloop with inner index scan", not a hash join. regards, tom lane
Quoting "Jim C. Nasby" <decibel@decibel.org>: > I'm not really familiar enough with hash indexes to know if this > would > work, but if the maximum bucket size was known you could use that to > determine a maximum range of buckets to look at. In some cases, that > range would include only one bucket, otherwise it would be a set of > buckets. If you found a set of buckets, I think you could then just > go > to the specific one you need. > > If we assume that the maximum bucket size is one page it becomes > more > realistic to take an existing large bucket and split it into several > smaller ones. This could be done on an update to the index page, or > a > background process could handle it. > > In any case, should this go on the TODO list? > > > Allowing a bucket size to be specified at CREATE INDEX doesn't seem > out > > of line though. We'd have to think up a scheme for > index-AM-specific > > index parameters ... > -- > Jim C. Nasby, Database Consultant decibel@decibel.org > Give your computer some brain candy! www.distributed.net Team #1828 Google "dynamic hash" or "linear hash". It takes care of not needing to have varying bucket sizes. Hash indexes are useful if you ALWAYS require disk access; they behave like worst-case random cache-thrash tests. That's probably why dbs have gravitated toward tree indexes instead. On the other hand, there's more (good) to hashing than initially meets the eye. Dynamic multiway hashing has come a long way from just splicing the bits together from multiple columns' hash values. If you can lay your hands on Tim Merrett's old text "Relational Information Systems", it's an eye-opener. Picture an efficient terabyte spreadsheet. For one thing, unlike a btree, a multicolumn hash is symmetric: it doesn't matter which column(s) you do not specify in a partial match. For another, a multiway hash is useful for much lower selectivity than a btree. I built such indexes for OLAP cubes, and some dimensions were only 10 elements wide. At the point where btree indexing becomes worse than seqscan, a multiway hash tells you which 10% of the disk to scan.
Greg Stark <gsstark@mit.edu> writes:
>>> What if the hash index stored *only* the hash code? That could be useful for indexing large datatypes that would otherwise create large indexes.
>>
>> Hmm, that could be a thought.
> Hm, if you go this route of having hash indexes store tuples ordered by hash code and storing the hash code in the index, then it seems hash indexes become just a macro for a btree index of HASH(index columns).

No, not at all, because searching such an index will require a tree descent, thus negating the one true advantage of hash indexes. I see the potential value of sorting by hashcode within an individual page, but that doesn't mean we should do the same across the whole index.

regards, tom lane
Quoting "Jim C. Nasby" <decibel@decibel.org>: > Well, in a hash-join right now you normally end up feeding at least > one > side of the join with a seqscan. Wouldn't it speed things up > considerably if you could look up hashes in the hash index instead? You might want to google on "grace hash" and "hybrid hash". The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it. *Hybrid hash* is where you spill the hash to disk in a well-designed way. Instead of thinking of it as building a hash table in memory, think of it as partitioning one input; if some or all of it fits in memory, all the better. The boundary condition is the same. The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now joined the MS Borg. He demonstrated that for entire-table joins, hybrid hash completely dominates sort-merge. MSSQL now uses what he developed as an academic, but I don't know what the patent state is. "Grace hash" is the original implementation of hybrid hash: Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984). Architecture and Performance of Relational Algebra Machine Grace.
If the original paper was published in 1984, then it's been more than 20 years. Any potential patents would already have expired, no?

-- Mark Lewis

On Tue, 2005-05-10 at 14:35, Mischa Sandberg wrote:
> Quoting "Jim C. Nasby" <decibel@decibel.org>:
> > Well, in a hash-join right now you normally end up feeding at least one side of the join with a seqscan. Wouldn't it speed things up considerably if you could look up hashes in the hash index instead?
>
> You might want to google on "grace hash" and "hybrid hash".
>
> The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it.
>
> *Hybrid hash* is where you spill the hash to disk in a well-designed way. Instead of thinking of it as building a hash table in memory, think of it as partitioning one input; if some or all of it fits in memory, all the better. The boundary condition is the same.
>
> The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now joined the MS Borg. He demonstrated that for entire-table joins, hybrid hash completely dominates sort-merge. MSSQL now uses what he developed as an academic, but I don't know what the patent state is.
>
> "Grace hash" is the original implementation of hybrid hash: Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984). Architecture and Performance of Relational Algebra Machine Grace.
Mischa Sandberg <mischa.sandberg@telus.net> writes:
> The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it.
> *Hybrid hash* is where you spill the hash to disk in a well-designed way. Instead of thinking of it as building a hash table in memory, think of it as partitioning one input; if some or all of it fits in memory, all the better. The boundary condition is the same.

[ raised eyebrow... ] Apparently you've not read the code. It's been hybrid hashjoin since we got it from Berkeley. Probably not the best possible implementation of the concept, but we do understand about spill to disk.

regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>:
> Mischa Sandberg <mischa.sandberg@telus.net> writes:
> > The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it.
>
> [ raised eyebrow... ] Apparently you've not read the code. It's been hybrid hashjoin since we got it from Berkeley. Probably not the best possible implementation of the concept, but we do understand about spill to disk.

Apologies. I stopped reading around line 750 (PG 8.0.1) in src/backend/executor/nodeHashjoin.c

    if (!node->hj_hashdone)
    {
        ....
        /*
         * execute the Hash node, to build the hash table
         */
        hashNode->hashtable = hashtable;
        (void) ExecProcNode((PlanState *) hashNode);
        ...

and missed the comment:

    /*
     * Open temp files for outer batches,
     */

Will quietly go and read twice, talk once.
Tom Lane <tgl@sss.pgh.pa.us> writes:
> No, not at all, because searching such an index will require a tree descent, thus negating the one true advantage of hash indexes.

The hash index still has to do a tree descent, it just has a larger branching factor than the btree index.

btree indexes could have a special case hack to optionally use a large branching factor for the root node, effectively turning them into hash indexes. That would be useful for cases where you know the values will be very evenly distributed and won't need to scan ranges, ie, when you're indexing a hash function.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > No, not at all, because searching such an index will require a tree descent, thus negating the one true advantage of hash indexes.
> The hash index still has to do a tree descent, it just has a larger branching factor than the btree index.

There is *no* tree descent in a hash index: you go directly to the bucket you want. If the bucket spans more than one page, you pay something, but this strikes me as being equivalent to the case of multiple equal keys spanning multiple pages in a btree. It works, but it's not the design center.

> btree indexes could have a special case hack to optionally use a large branching factor for the root node, effectively turning them into hash indexes.

No, because you'd still have to fetch and search the root node.

regards, tom lane
Neil Conway <neilc@samurai.com> writes:
> Greg Stark wrote:
> > What if the hash index stored *only* the hash code?
> Attached is a WIP patch that implements this.

Performance?

> I'm posting mainly because I wasn't sure what to do to avoid false positives in the case of hash collisions. In the hash AM code it is somewhat awkward to fetch the pointed-to heap tuple and recheck the scankey.[1] I just did the first thing that came to mind -- I marked all the hash AM opclasses as "lossy", so the index qual is rechecked. This works, but suggestions for a better way to do things would be welcome.

AFAICS that's the *only* way to do it.

I disagree completely with the idea of forcing this behavior for all datatypes. It could only be sensible for fairly wide values; you don't save enough to justify the lossiness otherwise.

It would be interesting to look into whether it could be driven on a per-opclass basis. Then you could have, eg, "text_lossy_hash_ops" as a non-default opclass the DBA could select if he wanted this behavior. (The code could perhaps use the amopreqcheck flag to tell it which way to behave.) If that seems unworkable, I'd prefer to see us set this up as a new index AM type, which would share a lot of code with the old.

[ BTW, posting patches to pgsql-general seems pretty off-topic. ]

regards, tom lane
Neil Conway <neilc@samurai.com> writes:
> I'm posting mainly because I wasn't sure what to do to avoid false positives in the case of hash collisions. In the hash AM code it is somewhat awkward to fetch the pointed-to heap tuple and recheck the scankey.[1] I just did the first thing that came to mind -- I marked all the hash AM opclasses as "lossy", so the index qual is rechecked. This works, but suggestions for a better way to do things would be welcome.

I would have thought that would be the only way worth considering. Consider for example a query involving two or more hash indexes and the new bitmap indexscan plan. You don't want to fetch the tuples if you can eliminate them using one of the other indexes.

-- greg
Tom Lane wrote:
> Performance?

I'll run some benchmarks tomorrow, as it's rather late in my time zone. If anyone wants to post some benchmark results, they are welcome to.

> I disagree completely with the idea of forcing this behavior for all datatypes. It could only be sensible for fairly wide values; you don't save enough to justify the lossiness otherwise.

I think it would be premature to decide about this before we see some performance numbers. I'm not fundamentally opposed, though.

> [ BTW, posting patches to pgsql-general seems pretty off-topic. ]

Not any more than discussing implementation details is :) But your point is well taken; I'll send future patches to -patches.

-Neil
Federated PG servers -- Was: Re: [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL 8.0)

From: Mischa Sandberg
Was curious why you pointed out SQL-MED as a SQL-standard approach to federated servers. I always thought of it as covering access to non-SQL data, the way the lo_* interface works, as opposed to meshing compatible (to say nothing of identical) SQL servers. Just checked Jim Melton's last word on that, to make sure, too. Is there something beyond that that I'm missing?

The approach that made best sense to me (perhaps from having gone there before) is to leave the SQL syntactically unchanged, and to manage federated relations via pg_ tables and probably procedures. MSSQL and Sybase went that route. It won't preclude moving to a system embedded in the SQL language. The hurdles for federated SQL service are:

- basic syntax (how to refer to a remote object)
- connection management and delegated security
- timeouts and temporary connection failures
- efficient distributed queries with >1 remote table
- distributed transactions
- interserver integrity constraints

Sometimes the lines get weird because of opportunistic implementations. For example, for the longest time, MSSQL supported server.db.user.object references WITHIN STORED PROCEDURES, since the proc engine could hide some primitive connection management.

PG struck me as such a natural for cross-server queries, because it keeps everything out in the open, including statistics. PG is also well set up to handle heterogeneous table types, and has functions that return rowsets. Nothing needs to be bent out of shape syntactically, or in the cross-server interface, to get over the hurdles above. The fact that queries (hence transactions) can't span multiple databases tells me PG has a way to go before it can handle dependency on a distributed transaction monitor.

My 2c.
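The "functions that return rowsets" route already has a working example in contrib/dblink (a sketch; the connection string, remote query, and column definitions are invented):

    -- cross-server query via contrib/dblink (must be installed):
    -- a rowset-returning function, exactly the approach described above
    SELECT *
    FROM dblink('host=remotehost dbname=sales',
                'SELECT order_id, total FROM orders')
         AS t(order_id int, total numeric);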