Thread: "Hash index" vs. "b-tree index" (PostgreSQL 8.0)

From:
Ying Lu
Date:

Greetings,

We are working on speeding up our queries by creating indexes. We have
queries with search criteria such as "select ... where *col1='...'*".
This is a simple query with only an "=" operation, so I set up a hash
index on column "col1". However, the PostgreSQL 8 documentation says:

*Note: * Testing has shown PostgreSQL's hash indexes to perform no
better than B-tree indexes, and the index size and build time for hash
indexes is much worse. For these reasons, hash index use is presently
discouraged.

May I know, for a simple "=" query, which provides better performance: a
hash index or a B-tree index?

Thanks,
Emi

From:
Neil Conway
Date:

Ying Lu wrote:
> May I know, for a simple "=" query, which provides better performance: a
> hash index or a B-tree index?

I don't think we've found a case in which the hash index code
outperforms B+-tree indexes, even for "=". The hash index code also has
a number of additional issues: for example, it isn't WAL safe, it has
relatively poor concurrency, and creating a hash index is significantly
slower than creating a b+-tree index.

-Neil

From:
Christopher Petrilli
Date:

On 5/9/05, Neil Conway <> wrote:
> I don't think we've found a case in which the hash index code
> outperforms B+-tree indexes, even for "=". The hash index code also has
> a number of additional issues: for example, it isn't WAL safe, it has
> relatively poor concurrency, and creating a hash index is significantly
> slower than creating a b+-tree index.

This being the case, is there ever ANY reason for someone to use it?
If not, then shouldn't we consider deprecating it and eventually
removing it?  This would reduce complexity, I think.

Chris
--
| Christopher Petrilli
| 

From:
Neil Conway
Date:

Christopher Petrilli wrote:
> This being the case, is there ever ANY reason for someone to use it?

Well, someone might fix it up at some point in the future. I don't think
there's anything fundamentally wrong with hash indexes; it's just that
the current implementation is a bit lacking.

> If not, then shouldn't we consider deprecating it and eventually
> removing it?

I would personally consider the code to be deprecated already.

I don't think there is much to be gained by removing it: the code is
pretty isolated from the rest of the tree, and (IMHO) not a significant
maintenance burden.

-Neil

From:
"Jim C. Nasby"
Date:

On Tue, May 10, 2005 at 01:34:57AM +1000, Neil Conway wrote:
> Christopher Petrilli wrote:
> >This being the case, is there ever ANY reason for someone to use it?
>
> Well, someone might fix it up at some point in the future. I don't think
> there's anything fundamentally wrong with hash indexes; it's just that
> the current implementation is a bit lacking.
>
> >If not, then shouldn't we consider deprecating it and eventually
> >removing it?
>
> I would personally consider the code to be deprecated already.
>
> I don't think there is much to be gained by removing it: the code is
> pretty isolated from the rest of the tree, and (IMHO) not a significant
> maintenance burden.

That may be true, but it's also a somewhat 'developer-centric' view. ;)

Having indexes that people shouldn't be using does add confusion for
users, and presents the opportunity for foot-shooting. I don't know what
purpose they once served, but if there's no advantage to them they
should be officially deprecated and eventually removed. Even if there is
some kind of advantage (would they possibly speed up hash joins?), if
there are no plans to fix them they should still be removed. If someone
ever really wanted to do something with it, the code would still be in CVS.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
Neil Conway
Date:

Jim C. Nasby wrote:
> Having indexes that people shouldn't be using does add confusion for
> users, and presents the opportunity for foot-shooting.

Emitting a warning/notice on hash-index creation is something I've
suggested in the past -- that would be fine with me.

> Even if there is some kind of advantage (would they possibly speed up
> hash joins?)

No, hash joins and hash indexes are unrelated.

-Neil

From:
"Jim C. Nasby"
Date:

On Tue, May 10, 2005 at 02:38:41AM +1000, Neil Conway wrote:
> Jim C. Nasby wrote:
> >Having indexes that people shouldn't be using does add confusion for
> >users, and presents the opportunity for foot-shooting.
>
> Emitting a warning/notice on hash-index creation is something I've
> suggested in the past -- that would be fine with me.

Probably not a bad idea.

> >Even if there is some kind of advantage (would they possibly speed up
> >hash joins?)
>
> No, hash joins and hash indexes are unrelated.

I know they are now, but does that have to be the case? Like I said, I
don't know the history, so I don't know why we even have them to begin
with.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
David Roussel
Date:

> *Note: * Testing has shown PostgreSQL's hash indexes to perform no
> better than B-tree indexes, and the index size and build time for hash
> indexes is much worse. For these reasons, hash index use is presently
> discouraged.
>
> May I know, for a simple "=" query, which provides better performance:
> a hash index or a B-tree index?

If you trust the documentation, use a B-tree.  If you don't trust the
documentation, do your own tests.

Please don't cross-post.


From:
Neil Conway
Date:

Jim C. Nasby wrote:
>> No, hash joins and hash indexes are unrelated.
> I know they are now, but does that have to be the case?

I mean, the algorithms are fundamentally unrelated. They share a bit of
code such as the hash functions themselves, but they are really solving
two different problems (disk based indexing with (hopefully) good
concurrency and WAL logging vs. in-memory joins via hashing with spill
to disk if needed).

> Like I said, I don't know the history, so I don't know why we even
> have them to begin with.

As I said, the idea of using hash indexes for better performance on
equality scans is perfectly valid, it is just the implementation that
needs work.

-Neil

From:
Bruce Momjian
Date:

Jim C. Nasby wrote:
> On Tue, May 10, 2005 at 02:38:41AM +1000, Neil Conway wrote:
> > Jim C. Nasby wrote:
> > >Having indexes that people shouldn't be using does add confusion for
> > >users, and presents the opportunity for foot-shooting.
> >
> > Emitting a warning/notice on hash-index creation is something I've
> > suggested in the past -- that would be fine with me.
>
> Probably not a bad idea.

Agreed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
                 |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

From:
Tom Lane
Date:

Neil Conway <> writes:
> Jim C. Nasby wrote:
>>> No, hash joins and hash indexes are unrelated.
>> I know they are now, but does that have to be the case?

> I mean, the algorithms are fundamentally unrelated. They share a bit of
> code such as the hash functions themselves, but they are really solving
> two different problems

In fact, up till fairly recently they didn't even share the hash
functions.  Which was a bug not a feature, but the fact remains ---
there's not a lot of commonality.

>> Like I said, I don't know the history, so I don't know why we even
>> have them to begin with.

I think it's largely because some Berkeley grad student had a need to
implement hash indexes for academic reasons ;-)

> As I said, the idea of using hash indexes for better performance on
> equality scans is perfectly valid, it is just the implementation that
> needs work.

I was thinking about that earlier today.  It seems to me there is a
window within which hash indexes are theoretically superior, but it
might be pretty narrow.  The basic allure of a hash index is that you
look at the search key, do some allegedly-trivial computations, and go
directly to the relevant index page(s); whereas a btree index requires
descending through several upper levels of index pages to reach the
target leaf page.  On the other hand, once you reach the target index
page, a hash index has no better method than linear scan through all
the page's index entries to find the actually wanted key(s); in fact you
have to search all the pages in that index bucket.  A btree index can
use binary search to find the relevant items within the page.

So it strikes me that important parameters include the index entry size
and the number of entries matching any particular key value.

btree will probably win for smaller keys, on two grounds: it will have
fewer tree levels to descend through, because of higher fan-out, and it
will be much more efficient at finding the target entries within the
target page when there are many entries per page.  (As against this,
it will have to work harder at each upper tree page to decide where to
descend to --- but I think that's a second-order consideration.)

hash will tend to win as the number of duplicate keys increases, because
its relative inefficiency at finding the matches within a particular
bucket will become less significant.  (The ideal situation for a hash
index is to have only one actual key value per bucket.  You can't really
afford to store only one index entry per bucket, because of the sheer
I/O volume that would result, so you need multiple entries that will all
be responsive to your search.)  (This also brings up the thought that
it might be interesting to support hash buckets smaller than a page ...
but I don't know how to make that work in an adaptive fashion.)

I suspect that people haven't looked hard for a combination of these
parameters within which hash can win.  Of course the real question for
us is whether the window is wide enough to justify the maintenance
effort for hash.

            regards, tom lane
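
To make the two access patterns Tom describes concrete, here is a toy
sketch in C. It is not PostgreSQL code; the page layouts, sizes, and names
are invented for illustration. The hash side hashes the key and jumps
straight to a bucket, then scans it linearly; the btree side fetches one
page per level and binary-searches within each page.

#include <stdint.h>

#define ENTRIES_PER_PAGE 256

typedef struct { uint32_t key; uint32_t tid; } Entry;
typedef struct { int nentries; Entry e[ENTRIES_PER_PAGE]; } Page;

/*
 * Hash index lookup: hash the key, go directly to the bucket, then scan
 * the bucket's entries linearly (a real bucket may span several pages,
 * all of which get scanned the same way).
 */
int
hash_lookup(Page *buckets, int nbuckets, uint32_t key,
            uint32_t (*hash)(uint32_t), uint32_t *tid_out)
{
    Page *bucket = &buckets[hash(key) % nbuckets];  /* no descent at all */
    int i;

    for (i = 0; i < bucket->nentries; i++)          /* linear scan */
    {
        if (bucket->e[i].key == key)
        {
            *tid_out = bucket->e[i].tid;
            return 1;
        }
    }
    return 0;
}

/*
 * Btree lookup: descend one page per level, binary-searching within each
 * page.  Convention here: an internal page's child[i] covers keys k with
 * e[i-1].key <= k < e[i].key.
 */
typedef struct BtPage
{
    int    nentries;
    int    is_leaf;
    Entry  e[ENTRIES_PER_PAGE];
    struct BtPage *child[ENTRIES_PER_PAGE + 1];
} BtPage;

int
btree_lookup(BtPage *root, uint32_t key, uint32_t *tid_out)
{
    BtPage *p = root;

    while (p != NULL)
    {
        int lo = 0, hi = p->nentries;

        while (lo < hi)                             /* binary search in page */
        {
            int mid = (lo + hi) / 2;
            int go_right = p->is_leaf ? (p->e[mid].key < key)
                                      : (p->e[mid].key <= key);

            if (go_right)
                lo = mid + 1;
            else
                hi = mid;
        }
        if (p->is_leaf)
        {
            if (lo < p->nentries && p->e[lo].key == key)
            {
                *tid_out = p->e[lo].tid;
                return 1;
            }
            return 0;
        }
        p = p->child[lo];                           /* another page fetched */
    }
    return 0;
}

The per-lookup cost difference is roughly what Tom describes: several page
fetches plus a cheap in-page binary search for the btree, versus direct
bucket addressing followed by a linear scan of one or more bucket pages
for the hash index.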

From:
Neil Conway
Date:

Tom Lane wrote:
> On the other hand, once you reach the target index page, a hash index
> has no better method than linear scan through all the page's index
> entries to find the actually wanted key(s)

I wonder if it would be possible to store the keys in a hash bucket in
sorted order, provided that the necessary ordering is defined for the
index keys -- considering the ubiquity of b+-trees in Postgres, the
chances of an ordering being defined are pretty good. Handling overflow
pages would be tricky: perhaps we could store the entries in a given
page in sorted order, but not try to maintain that order for the hash
bucket as a whole. This would mean we'd need to do a binary search for
each page of the bucket, but that would still be a win.

-Neil

From:
Tom Lane
Date:

Neil Conway <> writes:
> Tom Lane wrote:
>> On the other hand, once you reach the target index page, a hash index
>> has no better method than linear scan through all the page's index
>> entries to find the actually wanted key(s)

> I wonder if it would be possible to store the keys in a hash bucket in
> sorted order, provided that the necessary ordering is defined for the
> index keys -- considering the ubiquity of b+-trees in Postgres, the
> chances of an ordering being defined are pretty good.

I have a gut reaction against that: it makes hash indexes fundamentally
subservient to btrees.  We shouldn't bring in concepts that are outside
the basic opclass abstraction.

However: what about storing the things in hashcode order?  Ordering uint32s
doesn't seem like any big conceptual problem.

I think that efficient implementation of this would require explicitly
storing the hash code for each index entry, which we don't do now, but
it seems justifiable on multiple grounds --- besides this point, the
search could avoid doing the data-type-specific comparison if the hash
code isn't equal.

There is evidence in the code that indexes used to store more info than
what we now think of as a "standard" index tuple.  I am not sure when
that went away or what it'd cost to bring it back, but it seems worth
looking into.

            regards, tom lane

From:
Neil Conway
Date:

Tom Lane wrote:
> I have a gut reaction against that: it makes hash indexes fundamentally
> subservient to btrees.

I wouldn't say "subservient" -- if there is no ordering defined for the
index key, we just do a linear scan.

> However: what about storing the things in hashcode order?  Ordering uint32s
> doesn't seem like any big conceptual problem.

Hmm, my memory of the hash code is a bit fuzzy. Do I understand correctly?

- we only use some of the bits in the hash to map from the hash of a key
to its bucket

- therefore within a bucket, we can still distinguish most of the
non-equal tuples from one another by comparing their full hash values

- if we keep the entries in a bucket (or page, I guess -- per earlier
mail) sorted by full hash value, we can use that to perform a binary search

Sounds like a good idea to me. How likely is it that the hash index will
be sufficiently large that we're using most of the bits in the hash just
to map hash values to buckets, so that the binary search won't be very
effective? (At this point many of the distinct keys in each bucket will
be full-on hash collisions, although sorting by the key values
themselves would still be effective.)

> I think that efficient implementation of this would require explicitly
> storing the hash code for each index entry, which we don't do now, but
> it seems justifiable on multiple grounds --- besides this point, the
> search could avoid doing the data-type-specific comparison if the hash
> code isn't equal.

Another benefit is that it would speed up page splits -- there would be
no need to rehash all the keys in a bucket when doing the split.

-Neil
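
A minimal sketch of what the sorted bucket page buys (again an invented
layout, not the real hash AM): the per-page lookup becomes a binary search
over stored hash codes, and any entry whose stored code differs from the
probe's can be skipped without ever calling the datatype's comparison
function.

#include <stdint.h>

#define ENTRIES_PER_PAGE 256

typedef struct { uint32_t hashcode; uint32_t tid; } HashEntry;
typedef struct { int nentries; HashEntry e[ENTRIES_PER_PAGE]; } BucketPage;

/*
 * Find the first entry on a bucket page whose stored hash code equals
 * 'hash', assuming the page keeps its entries sorted by hash code.
 * Returns the entry's index, or -1.  Different keys can share a hash
 * code, so the caller still rechecks the heap tuple (or key) for each
 * candidate, walking forward while the stored code keeps matching.
 */
int
bucket_page_search(const BucketPage *page, uint32_t hash)
{
    int lo = 0, hi = page->nentries;

    while (lo < hi)
    {
        int mid = (lo + hi) / 2;

        if (page->e[mid].hashcode < hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < page->nentries && page->e[lo].hashcode == hash)
        return lo;
    return -1;      /* no entry with this hash code; skip the comparison */
}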

From:
Greg Stark
Date:

Tom Lane <> writes:

> However: what about storing the things in hashcode order?  Ordering uint32s
> doesn't seem like any big conceptual problem.
>
> I think that efficient implementation of this would require explicitly
> storing the hash code for each index entry, which we don't do now, but
> it seems justifiable on multiple grounds --- besides this point, the
> search could avoid doing the data-type-specific comparison if the hash
> code isn't equal.

It seems that means doubling the size of the hash index. That's a pretty big
i/o to cpu tradeoff.

What if the hash index stored *only* the hash code? That could be useful for
indexing large datatypes that would otherwise create large indexes. A good
hash function should have a pretty low collision rate anyway, so the
occasional extra i/o should be more than outweighed by the decrease in i/o
needed to use the index.

--
greg

From:
Tom Lane
Date:

Greg Stark <> writes:
> Tom Lane <> writes:
>> I think that efficient implementation of this would require explicitly
>> storing the hash code for each index entry,

> It seems that means doubling the size of the hash index. That's a pretty big
> i/o to cpu tradeoff.

Hardly.  The minimum possible size of a hash entry today is 8 bytes
header plus 4 bytes datum, plus there's a 4-byte line pointer to factor
in.  So under the most pessimistic assumptions, storing the hash code
would add 25% to the size.  (On MAXALIGN=8 hardware, it might cost you
nothing at all.)

> What if the hash index stored *only* the hash code? That could be useful for
> indexing large datatypes that would otherwise create large indexes.

Hmm, that could be a thought.

            regards, tom lane
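
One way to see the MAXALIGN point, using toy stand-in structs rather than
the real index tuple layout: on typical 64-bit hardware the extra 4-byte
hash code fits into space that alignment padding was already wasting, so
both structs below usually come out at 16 bytes.

#include <stdio.h>
#include <stdint.h>

/* Stand-ins for an index entry with an 8-byte aligned header plus a
 * 4-byte datum, with and without an extra 4-byte hash code field. */
typedef struct { uint64_t header; uint32_t datum; } EntryToday;
typedef struct { uint64_t header; uint32_t datum;
                 uint32_t hashcode; } EntryWithHash;

int
main(void)
{
    /* On MAXALIGN=8 hardware both typically print 16. */
    printf("without hash code: %zu bytes\n", sizeof(EntryToday));
    printf("with hash code:    %zu bytes\n", sizeof(EntryWithHash));
    return 0;
}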

From:
"Jim C. Nasby"
Date:

On Tue, May 10, 2005 at 10:14:11AM +1000, Neil Conway wrote:
> Jim C. Nasby wrote:
> >> No, hash joins and hash indexes are unrelated.
> >I know they are now, but does that have to be the case?
>
> I mean, the algorithms are fundamentally unrelated. They share a bit of
> code such as the hash functions themselves, but they are really solving
> two different problems (disk based indexing with (hopefully) good
> concurrency and WAL logging vs. in-memory joins via hashing with spill
> to disk if needed).

Well, in a hash-join right now you normally end up feeding at least one
side of the join with a seqscan. Wouldn't it speed things up
considerably if you could look up hashes in the hash index instead? That
way you can eliminate going to the heap for any hashes that match. Of
course, if limited tuple visibility info was added to hash indexes
(similar to what I think is currently happening to B-trees), many of
the heap scans could be eliminated as well. A similar method could also
be used for hash aggregates, assuming they use the same hash.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
"Jim C. Nasby"
Date:

On Tue, May 10, 2005 at 12:10:57AM -0400, Tom Lane wrote:
> be responsive to your search.)  (This also brings up the thought that
> it might be interesting to support hash buckets smaller than a page ...
> but I don't know how to make that work in an adaptive fashion.)

IIRC, other databases that support hash indexes also allow you to define
the bucket size, so it might be a good start to allow for that. DBAs
usually have a pretty good idea of what a table will look like in
production, so if there's clear documentation on the effect of bucket
size a good DBA should be able to make a good decision.

What's the challenge to making it adaptive, coming up with an algorithm
that gives you the optimal bucket size (which I would think there's
research on...) or allowing the index to accommodate different bucket
sizes existing in the index at once? (Presumably you don't want to
re-write the entire index every time it looks like a different bucket
size would help.)
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
Tom Lane
Date:

"Jim C. Nasby" <> writes:
> What's the challenge to making it adaptive, coming up with an algorithm
> that gives you the optimal bucket size (which I would think there's
> research on...) or allowing the index to accommodate different bucket
> sizes existing in the index at once? (Presumably you don't want to
> re-write the entire index every time it looks like a different bucket
> size would help.)

Exactly.  That's (a) expensive and (b) really hard to fit into the WAL
paradigm --- I think we could only handle it as a REINDEX.  So if it
were adaptive at all I think we'd have to support multiple bucket sizes
existing simultaneously in the index, and I do not see a good way to do
that.

Allowing a bucket size to be specified at CREATE INDEX doesn't seem out
of line though.  We'd have to think up a scheme for index-AM-specific
index parameters ...

            regards, tom lane

From:
"Jim C. Nasby"
Date:

On Tue, May 10, 2005 at 11:49:50AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <> writes:
> > What's the challenge to making it adaptive, coming up with an algorithm
> > that gives you the optimal bucket size (which I would think there's
> > research on...) or allowing the index to accommodate different bucket
> > sizes existing in the index at once? (Presumably you don't want to
> > re-write the entire index every time it looks like a different bucket
> > size would help.)
>
> Exactly.  That's (a) expensive and (b) really hard to fit into the WAL
> paradigm --- I think we could only handle it as a REINDEX.  So if it
> were adaptive at all I think we'd have to support multiple bucket sizes
> existing simultaneously in the index, and I do not see a good way to do
> that.

I'm not really familiar enough with hash indexes to know if this would
work, but if the maximum bucket size was known you could use that to
determine a maximum range of buckets to look at. In some cases, that
range would include only one bucket, otherwise it would be a set of
buckets. If you found a set of buckets, I think you could then just go
to the specific one you need.

If we assume that the maximum bucket size is one page it becomes more
realistic to take an existing large bucket and split it into several
smaller ones. This could be done on an update to the index page, or a
background process could handle it.

In any case, should this go on the TODO list?

> Allowing a bucket size to be specified at CREATE INDEX doesn't seem out
> of line though.  We'd have to think up a scheme for index-AM-specific
> index parameters ...
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

From:
Greg Stark
Date:

Tom Lane <> writes:

> > What if the hash index stored *only* the hash code? That could be useful for
> > indexing large datatypes that would otherwise create large indexes.
>
> Hmm, that could be a thought.

Hm, if you go this route of having hash indexes store tuples ordered by hash
code and storing the hash code in the index, then it seems hash indexes become
just a macro for a btree index of HASH(index columns).

I'm not saying that to criticize this plan. In fact I think that captures most
(though not all) of what a hash index should be.

It would be pretty useful. In fact, if that isn't how hash indexes are
implemented, it might be useful to provide a user-visible hash(ROW)
function that allows creating such indexes as functional indexes, though
hiding it would make the SQL simpler.

--
greg

From:
Tom Lane
Date:

"Jim C. Nasby" <> writes:
> Well, in a hash-join right now you normally end up feeding at least one
> side of the join with a seqscan. Wouldn't it speed things up
> considerably if you could look up hashes in the hash index instead?

That's called a "nestloop with inner index scan", not a hash join.

            regards, tom lane

From:
Mischa Sandberg
Date:

Quoting "Jim C. Nasby" <>:

> I'm not really familiar enough with hash indexes to know if this would
> work, but if the maximum bucket size was known you could use that to
> determine a maximum range of buckets to look at. In some cases, that
> range would include only one bucket, otherwise it would be a set of
> buckets. If you found a set of buckets, I think you could then just go
> to the specific one you need.
>
> If we assume that the maximum bucket size is one page it becomes more
> realistic to take an existing large bucket and split it into several
> smaller ones. This could be done on an update to the index page, or a
> background process could handle it.
>
> In any case, should this go on the TODO list?
>
> > Allowing a bucket size to be specified at CREATE INDEX doesn't seem out
> > of line though.  We'd have to think up a scheme for index-AM-specific
> > index parameters ...
> --
> Jim C. Nasby, Database Consultant
> Give your computer some brain candy! www.distributed.net Team #1828

Google "dynamic hash" or "linear hash". It takes care of not needing to
have varying bucket sizes.

Hash indexes are useful if you ALWAYS require disk access; they behave
like worst-case random cache-thrash tests. That's probably why dbs have
gravitated toward tree indexes instead. On the other hand, there's more
(good) to hashing than initially meets the eye.

Dynamic multiway hashing has come a long way from just splicing the bits
together from multiple columns' hash values. If you can lay your hands
on Tim Merrett's old text "Relational Information Systems", it's an
eye-opener. Picture an efficient terabyte spreadsheet.

For one thing, unlike a btree, a multicolumn hash is symmetric: it
doesn't matter which column(s) you do not specify in a partial match.

For another, a multiway hash is useful for much lower selectivity than a
btree. I built such indexes for OLAP cubes, and some dimensions were
only 10 elements wide. At the point where btree indexing becomes worse
than seqscan, a multiway hash tells you which 10% of the disk to scan.
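
For anyone googling along, the heart of linear hashing is just the address
calculation sketched below (textbook-style C, not PostgreSQL's hash AM
code): buckets are split one at a time as the index grows, so bucket sizes
stay uniform and the index is never rewritten wholesale.

#include <stdint.h>

typedef struct
{
    uint32_t level;   /* current doubling level; assumed < 31 here */
    uint32_t split;   /* next bucket to split, 0 <= split < 2^level */
} LinearHashState;

/*
 * Map a 32-bit hash code to a bucket number.  Buckets below the split
 * pointer have already been split, so their entries are addressed with
 * one more bit of the hash code.  Growing the index means splitting one
 * bucket and advancing 'split'; no wholesale rehash is ever needed.
 */
uint32_t
linear_hash_bucket(const LinearHashState *st, uint32_t hash)
{
    uint32_t bucket = hash & ((1u << st->level) - 1);      /* mod 2^level */

    if (bucket < st->split)
        bucket = hash & ((1u << (st->level + 1)) - 1);     /* mod 2^(level+1) */

    return bucket;
}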


From:
Tom Lane
Date:

Greg Stark <> writes:
>>> What if the hash index stored *only* the hash code? That could be useful for
>>> indexing large datatypes that would otherwise create large indexes.
>>
>> Hmm, that could be a thought.

> Hm, if you go this route of having hash indexes store tuples ordered by hash
> code and storing the hash code in the index, then it seems hash indexes become
> just a macro for a btree index of HASH(index columns).

No, not at all, because searching such an index will require a tree
descent, thus negating the one true advantage of hash indexes.  I see
the potential value of sorting by hashcode within an individual page,
but that doesn't mean we should do the same across the whole index.

            regards, tom lane

From:
Mischa Sandberg
Date:

Quoting "Jim C. Nasby" <>:

> Well, in a hash-join right now you normally end up feeding at least one
> side of the join with a seqscan. Wouldn't it speed things up
> considerably if you could look up hashes in the hash index instead?

You might want to google on "grace hash" and "hybrid hash".

The PG hash join is the simplest possible: build a hash table in memory,
and match an input stream against it.

*Hybrid hash* is where you spill the hash to disk in a well-designed
way. Instead of thinking of it as building a hash table in memory, think
of it as partitioning one input; if some or all of it fits in memory,
all the better. The boundary condition is the same.

The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now
joined the MS Borg. He demonstrated that for entire-table joins, hybrid
hash completely dominates sort-merge. MSSQL now uses what he developed
as an academic, but I don't know what the patent state is.

"Grace hash" is the original implementation of hybrid hash:
  Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984).
  Architecture and Performance of Relational Algebra Machine Grace.



From:
Mark Lewis
Date:

If the original paper was published in 1984, then it's been more than 20
years.  Any potential patents would already have expired, no?

-- Mark Lewis

On Tue, 2005-05-10 at 14:35, Mischa Sandberg wrote:
> Quoting "Jim C. Nasby" <>:
>
> > Well, in a hash-join right now you normally end up feeding at least one
> > side of the join with a seqscan. Wouldn't it speed things up
> > considerably if you could look up hashes in the hash index instead?
>
> You might want to google on "grace hash" and "hybrid hash".
>
> The PG hash join is the simplest possible: build a hash table in memory,
> and match an input stream against it.
>
> *Hybrid hash* is where you spill the hash to disk in a well-designed
> way. Instead of thinking of it as building a hash table in memory, think
> of it as partitioning one input; if some or all of it fits in memory,
> all the better. The boundary condition is the same.
>
> The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now
> joined the MS Borg. He demonstrated that for entire-table joins, hybrid
> hash completely dominates sort-merge. MSSQL now uses what he developed
> as an academic, but I don't know what the patent state is.
>
> "Grace hash" is the original implementation of hybrid hash:
>   Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984).
>   Architecture and Performance of Relational Algebra Machine Grace.


From:
Tom Lane
Date:

Mischa Sandberg <> writes:
> The PG hash join is the simplest possible: build a hash table in memory,
> and match an input stream against it.

> *Hybrid hash* is where you spill the hash to disk in a well-designed
> way. Instead of thinking of it as building a hash table in memory, think
> of it as partitioning one input; if some or all of it fits in memory,
> all the better. The boundary condition is the same.

[ raised eyebrow... ]  Apparently you've not read the code.  It's been
hybrid hashjoin since we got it from Berkeley.  Probably not the best
possible implementation of the concept, but we do understand about spill
to disk.

            regards, tom lane

From:
Mischa Sandberg
Date:

Quoting Tom Lane <>:

> Mischa Sandberg <> writes:
> > The PG hash join is the simplest possible: build a hash table in
> > memory, and match an input stream against it.
>
> [ raised eyebrow... ]  Apparently you've not read the code.  It's
> been hybrid hashjoin since we got it from Berkeley.  Probably not the
> best possible implementation of the concept, but we do
> understand about spill to disk.

Apologies. I stopped reading around line 750 (PG 8.0.1) in
src/backend/executor/nodeHashjoin.c

if (!node->hj_hashdone)
{
    ....
    /*
     * execute the Hash node, to build the hash table
     */
    hashNode->hashtable = hashtable;
    (void) ExecProcNode((PlanState *) hashNode);
    ...

and missed the comment:
    /*
     * Open temp files for outer batches,
     */

Will quietly go and read twice, talk once.


From:
Greg Stark
Date:

Tom Lane <> writes:

> No, not at all, because searching such an index will require a tree
> descent, thus negating the one true advantage of hash indexes.

The hash index still has to do a tree descent; it just has a larger branching
factor than the btree index.

btree indexes could have a special case hack to optionally use a large
branching factor for the root node, effectively turning them into hash
indexes. That would be useful for cases where you know the values will be very
evenly distributed and won't need to scan ranges, i.e., when you're indexing a
hash function.

--
greg

From:
Tom Lane
Date:

Greg Stark <> writes:
> Tom Lane <> writes:
>> No, not at all, because searching such an index will require a tree
>> descent, thus negating the one true advantage of hash indexes.

> The hash index still has to do a tree descent; it just has a larger branching
> factor than the btree index.

There is *no* tree descent in a hash index: you go directly to the
bucket you want.

If the bucket spans more than one page, you pay something, but this
strikes me as being equivalent to the case of multiple equal keys
spanning multiple pages in a btree.  It works, but it's not the design
center.

> btree indexes could have a special case hack to optionally use a large
> branching factor for the root node, effectively turning them into hash
> indexes.

No, because you'd still have to fetch and search the root node.

            regards, tom lane

From:
Tom Lane
Date:

Neil Conway <> writes:
> Greg Stark wrote:
>> What if the hash index stored *only* the hash code?

> Attached is a WIP patch that implements this.

Performance?

> I'm posting mainly because I wasn't sure what to do to avoid false
> positives in the case of hash collisions. In the hash AM code it is
> somewhat awkward to fetch the pointed-to heap tuple and recheck the
> scankey.[1] I just did the first thing that came to mind -- I marked all
> the hash AM opclasses as "lossy", so the index qual is rechecked. This
> works, but suggestions for a better way to do things would be welcome.

AFAICS that's the *only* way to do it.

I disagree completely with the idea of forcing this behavior for all
datatypes.  It could only be sensible for fairly wide values; you don't
save enough to justify the lossiness otherwise.

It would be interesting to look into whether it could be driven on a
per-opclass basis.  Then you could have, eg, "text_lossy_hash_ops"
as a non-default opclass the DBA could select if he wanted this
behavior.  (The code could perhaps use the amopreqcheck flag to tell
it which way to behave.)  If that seems unworkable, I'd prefer to see us
set this up as a new index AM type, which would share a lot of code with
the old.

[ BTW, posting patches to pgsql-general seems pretty off-topic. ]

            regards, tom lane

From:
Greg Stark
Date:

Neil Conway <> writes:

> I'm posting mainly because I wasn't sure what to do to avoid false positives in
> the case of hash collisions. In the hash AM code it is somewhat awkward to
> fetch the pointed-to heap tuple and recheck the scankey.[1] I just did the
> first thing that came to mind -- I marked all the hash AM opclasses as "lossy",
> so the index qual is rechecked. This works, but suggestions for a better way to
> do things would be welcome.

I would have thought that would be the only way worth considering.

Consider for example a query involving two or more hash indexes and the new
bitmap indexscan plan. You don't want to fetch the tuples if you can eliminate
them using one of the other indexes.

--
greg

From:
Neil Conway
Date:

Tom Lane wrote:
> Performance?

I'll run some benchmarks tomorrow, as it's rather late in my time zone.
If anyone wants to post some benchmark results, they are welcome to.

> I disagree completely with the idea of forcing this behavior for all
> datatypes.  It could only be sensible for fairly wide values; you don't
> save enough to justify the lossiness otherwise.

I think it would be premature to decide about this before we see some
performance numbers. I'm not fundamentally opposed, though.

> [ BTW, posting patches to pgsql-general seems pretty off-topic. ]

Not any more than discussing implementation details is :) But your point
is well taken; I'll send future patches to -patches.

-Neil

From:
Mischa Sandberg
Date:

I was curious why you pointed out SQL-MED as a SQL-standard approach to
federated servers. I've always thought of it as covering access to non-SQL
data, the way the lo_* interface works, as opposed to meshing compatible
(to say nothing of identical) SQL servers. I just checked Jim Melton's
last word on that to make sure, too. Is there something beyond that that
I'm missing?

The approach that made the most sense to me at first (perhaps from having
gone there before) is to leave the SQL syntactically unchanged, and to
manage federated relations via pg_ tables and probably procedures. MSSQL
and Sybase went that route. It won't preclude moving to a system embedded
in the SQL language.

The hurdles for federated SQL service are:
- basic syntax (how to refer to a remote object)
- connection management and delegated security
- timeouts and temporary connection failures
- efficient distributed queries with >1 remote table
- distributed transactions
- interserver integrity constraints

Sometimes the lines get weird because of opportunistic implementations.
For example, for the longest time, MSSQL supported server.db.user.object
references WITHIN STORED PROCEDURES, since the proc engine could hide
some primitive connection management.

PG struck me as such a natural for cross-server queries, because
it keeps everything out in the open, including statistics.
PG is also well set up to handle heterogeneous table types,
and has functions that return rowsets. Nothing needs to be bent out of
shape syntactically, or in the cross-server interface, to get over the
hurdles above.

The fact that queries (and hence transactions) can't span multiple
databases tells me PG has a way to go before it can handle dependence on a
distributed transaction monitor.

My 2c.