Thread: [HACKERS] Multicolumn hash indexes

From: Tomasz Ostrowski
Hi.

I've noticed that hash indexes can't currently (in PG10) be multicolumn. 
Are they technically hard to implement, or has just nobody taken on such 
a feature?

I think multicolumn hash indexes should help pretty significantly with 
queries like:
- where username=? and user_post_id=?
- where client_id=? and period=? and invoice_number=?
etc.

I imagine that calculating a multicolumn hash should be pretty 
straightforward to implement: after hashing the bytes of the first 
column, just keep going and update the hash state with the bytes of the 
second and subsequent columns. It should allow for faster (O(1), less 
I/O) and much smaller (better cached, less I/O again) multicolumn 
indexes. Also, in PG10 hash indexes are WAL-logged and therefore much 
easier to work with. 
What do you think?
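For illustration, the idea can be roughly emulated today with a btree 
over combined per-column hashes (a sketch only: the invoices table and 
columns come from the examples above, XOR combining is an ad-hoc stand-in 
for real hash-state updating, and btree is used because hash indexes 
can't be built on expressions):

```sql
-- illustrative sketch; hashtext() returns integer, # is integer XOR
create index invoices_hash_idx on invoices (
    (hashtext(client_id::text) # hashtext(period::text) # hashtext(invoice_number::text))
);

-- the query must repeat the expression verbatim, and still recheck the
-- raw columns, since distinct rows can hash to the same value:
select * from invoices
 where (hashtext(client_id::text) # hashtext(period::text) # hashtext(invoice_number::text))
     = (hashtext($1::text) # hashtext($2::text) # hashtext($3::text))
   and client_id = $1 and period = $2 and invoice_number = $3;
```

(XOR combining is crude: swapped column values collide, which is why the 
recheck on the raw columns is essential.)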

-- 
Tomasz "Tometzky" Ostrowski


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Multicolumn hash indexes

From: Tom Lane
Tomasz Ostrowski <tometzky+pg@ato.waw.pl> writes:
> I've noticed that hash indexes can't currently (in PG10) be multicolumn. 
> Are they technically hard to implement or just nobody took such a feature?

It's not simple, particularly not if you wish that the index would support
queries specifying conditions for just a subset of the indexed columns
(an assumption that's buried pretty deeply in the planner, for one thing).
Then you couldn't compute the hash.
        regards, tom lane



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Tue, Sep 26, 2017 at 5:41 PM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:
> I've noticed that hash indexes can't currently (in PG10) be multicolumn. Are
> they technically hard to implement or just nobody took such a feature?
>
> I think multicolumn hash indexes should help pretty significantly with
> queries like:
> - where username=? and user_post_id=?
> - where client_id=? and period=? and invoice_number=?
> etc.
>
> I imagine that calculating a multicolumn hash should be pretty
> straightforward to implement - after hashing bytes of first column just keep
> going and update the hash state with bytes of a second and subsequent
> columns. And it should allow for faster (O(1), less IO) and much smaller
> (better cached, less IO again) multicolumn indexes. Also in PG10 hash
> indexes are WAL-logged and therefore much easier to work with. What do you
> think?

I was just thinking the same thing today.  I think it is a great idea,
but I don't have time to do the work myself at the moment.  I'd be
happy to help review, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomasz Ostrowski <tometzky+pg@ato.waw.pl> writes:
>> I've noticed that hash indexes can't currently (in PG10) be multicolumn.
>> Are they technically hard to implement or just nobody took such a feature?
>
> It's not simple, particularly not if you wish that the index would support
> queries specifying conditions for just a subset of the indexed columns
> (an assumption that's buried pretty deeply in the planner, for one thing).
> Then you couldn't compute the hash.

Whoa, that seems like moving the goalposts.  Somebody could design a
hash index that was intended to answer such queries, but it's not the
one we've got.  I think we should just aim to support equality queries
on all columns.  That seems like a fairly straightforward
generalization of what we've already got.

You'd need to adjust a bunch of things, like _hash_datum2hashkey and
_hash_datum2hashkey_type but that doesn't necessarily seem like it
would be super-difficult.  One problem is that the metapage currently
stores the OID of the hashproc used for the index, and there's no
place there to store more than one.  But since that's only for
forensics anyway, we could store InvalidOid for a multi-column index,
or the value for the first column, or whatever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's not simple, particularly not if you wish that the index would support
>> queries specifying conditions for just a subset of the indexed columns
>> (an assumption that's buried pretty deeply in the planner, for one thing).
>> Then you couldn't compute the hash.

> Whoa, that seems like moving the goalposts.

No, that's merely stating where the goalposts stand, and where they have
stood for the past twenty-odd years.  You might imagine they are somewhere
else, but you'd be mistaken.

There is a facility in the planner to require a condition for the first
column of an index before considering an indexscan plan.  We could perhaps
extend that to require a condition for each column of the index, though
I'm not sure how much work is involved directly in that.  The bigger
picture here though is that it puts a premium on *not* throwing away
"unnecessary" qual conditions, which is directly antithetical to a bunch
of other planner goals.
	User: Why won't the planner use my multicolumn hash index?
	      I have query conditions constraining all the columns!
	Us: Well, one of your conditions was discarded because it was
	    constant-true after constant simplification, or redundant with
	    a partition qual or CHECK constraint, or implied by an index
	    predicate, or treated as a join condition instead of a
	    restriction condition, or absorbed into an equivalence class
	    and then the planner chose to emit some other equivalence
	    condition instead, or possibly two or three other things.
	User: WAAAAH!

There are also some related problems concerning how to make index entries
for tuples that have some but not all of the indexed columns being NULL.
Maybe that goes away if you're willing to demand strict equality
constraints for all columns, but I'm not completely convinced offhand.
(See the amoptionalkey discussion in the index AM API spec for some
context.)

I agree that doing a half-baked job of this is probably within reach.
I'm uncertain about what it would take to bake it fully.
        regards, tom lane



Re: [HACKERS] Multicolumn hash indexes

From: David Fetter
On Tue, Sep 26, 2017 at 11:32:52PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> It's not simple, particularly not if you wish that the index would support
> >> queries specifying conditions for just a subset of the indexed columns
> >> (an assumption that's buried pretty deeply in the planner, for one thing).
> >> Then you couldn't compute the hash.
> 
> > Whoa, that seems like moving the goalposts.
> 
> No, that's merely stating where the goalposts stand, and where they have
> stood for the past twenty-odd years.  You might imagine they are somewhere
> else, but you'd be mistaken.
> 
> There is a facility in the planner to require a condition for the first
> column of an index before considering an indexscan plan.  We could perhaps
> extend that to require a condition for each column of the index, though
> I'm not sure how much work is involved directly in that.  The bigger
> picture here though is that it puts a premium on *not* throwing away
> "unnecessary" qual conditions, which is directly antithetical to a bunch
> of other planner goals.
> 
>     User: Why won't the planner use my multicolumn hash index?
>           I have query conditions constraining all the columns!
>     Us: Well, one of your conditions was discarded because it was
>             constant-true after constant simplification, or redundant with
>         a partition qual or CHECK constraint, or implied by an index
>         predicate, or treated as a join condition instead of a
>         restriction condition, or absorbed into an equivalence class
>         and then the planner chose to emit some other equivalence
>         condition instead, or possibly two or three other things.
>     User: WAAAAH!
> 
> There are also some related problems concerning how to make index
> entries for tuples that have some but not all of the indexed columns
> being NULL.  Maybe that goes away if you're willing to demand strict
> equality constraints for all columns, but I'm not completely
> convinced offhand.  (See the amoptionalkey discussion in the index
> AM API spec for some context.)
> 
> I agree that doing a half-baked job of this is probably within
> reach.  I'm uncertain about what it would take to bake it fully.

To stretch this analogy too far, what other things could be built out
of the bread this bakes?  I'm guessing that at least non-hash
multicolumn indexes would benefit.  Expressional indexes, maybe?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Multicolumn hash indexes

From: Jesper Pedersen
On 09/26/2017 08:11 PM, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Tomasz Ostrowski <tometzky+pg@ato.waw.pl> writes:
>>> I've noticed that hash indexes can't currently (in PG10) be multicolumn.
>>> Are they technically hard to implement or just nobody took such a feature?
>>
>> It's not simple, particularly not if you wish that the index would support
>> queries specifying conditions for just a subset of the indexed columns
>> (an assumption that's buried pretty deeply in the planner, for one thing).
>> Then you couldn't compute the hash.
> 
> Whoa, that seems like moving the goalposts.  Somebody could design a
> hash index that was intended to answer such queries, but it's not the
> one we've got.  I think we should just aim to support equality queries
> on all columns.  That seems like a fairly straightforward
> generalization of what we've already got.
> 

This would require that the applications that are using the database 
know about the index structure in order to pass down the right columns.

I would say that in most cases applications don't know about index 
structures. So multiple indexes would have to be created: (col1), 
(col1, col2), (col1, col3), ..., which isn't ideal.

Maybe an initial proof-of-concept could store the hash of the first 
column (col1) plus the hash of all columns (col1, col2, col3) in the 
index, and see what requirements / design decisions would appear from that.

Best regards, Jesper



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Tue, Sep 26, 2017 at 11:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is a facility in the planner to require a condition for the first
> column of an index before considering an indexscan plan.  We could perhaps
> extend that to require a condition for each column of the index, though
> I'm not sure how much work is involved directly in that.  The bigger
> picture here though is that it puts a premium on *not* throwing away
> "unnecessary" qual conditions, which is directly antithetical to a bunch
> of other planner goals.
>
>         User: Why won't the planner use my multicolumn hash index?
>               I have query conditions constraining all the columns!
>         Us: Well, one of your conditions was discarded because it was
>             constant-true after constant simplification, or redundant with
>             a partition qual or CHECK constraint, or implied by an index
>             predicate, or treated as a join condition instead of a
>             restriction condition, or absorbed into an equivalence class
>             and then the planner chose to emit some other equivalence
>             condition instead, or possibly two or three other things.
>         User: WAAAAH!

Ah.  Yeah, that's a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Nico Williams
On Wed, Sep 27, 2017 at 09:56:26AM -0400, Jesper Pedersen wrote:
> On 09/26/2017 08:11 PM, Robert Haas wrote:
> >On Tue, Sep 26, 2017 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>Tomasz Ostrowski <tometzky+pg@ato.waw.pl> writes:
> >>>I've noticed that hash indexes can't currently (in PG10) be multicolumn.
> >>>Are they technically hard to implement or just nobody took such a feature?
> >>
> >>It's not simple, particularly not if you wish that the index would support
> >>queries specifying conditions for just a subset of the indexed columns
> >>(an assumption that's buried pretty deeply in the planner, for one thing).
> >>Then you couldn't compute the hash.
> >
> >Whoa, that seems like moving the goalposts.  Somebody could design a
> >hash index that was intended to answer such queries, but it's not the
> >one we've got.  I think we should just aim to support equality queries
> >on all columns.  That seems like a fairly straightforward
> >generalization of what we've already got.
> >
> 
> This would require that the applications that are using the database knows
> about the index structure in order to pass down the right columns.

That's hardly a problem.  We (app devs) already do this all the time.
Sometimes we screw up and get a poor query plan, we figure it out, and
add an index or rewrite the query, or make some index more covering, ...

The real problems, according to Tom Lane, are that the query planner
makes deep assumptions about index keys having prefixes, which b-tree
indexes do but hash indexes do not, and also that optimizations that are
applied before selecting indexes can make it impossible to select a hash
index intended by the author of a query.  The latter problem is not
fatal, but can lead to surprises if not fixed, while the former problem
is fatal until fixed.

> I would say that in most cases that applications doesn't know about index
> structures. So, multiple indexes would have to be created (col1), (col1,
> col2), (col1, col3), ... which isn't ideal.

Nonsense, and also irrelevant to the question of whether multi-column
hash indexes should be supported.  It's not about what Joe User knows to
do, but about what <whoever wants this feature> knows to do.

BTW, one could probably use an expression hash index to get a multi-
column hash index anyways, by using row values, JSON encodings of row
values, or some ad-hoc text representation of row values.

Nico
-- 



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Wed, Sep 27, 2017 at 9:56 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Maybe an initial proof-of-concept could store the hash of the first column
> (col1) plus the hash of all columns (col1, col2, col3) in the index, and see
> what requirements / design decisions would appear from that.

I thought about that sort of thing yesterday but it's not that simple.
The problem is that the hash code isn't just stored; it's used to
assign tuples to buckets.  If you have two hash codes, you have to
pick one of the other to use for assigning the tuple to a bucket.  And
then if you want to search using the other hash code, you have to
search all of the buckets, which will stink.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 27, 2017 at 9:56 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Maybe an initial proof-of-concept could store the hash of the first column
>> (col1) plus the hash of all columns (col1, col2, col3) in the index, and see
>> what requirements / design decisions would appear from that.

> I thought about that sort of thing yesterday but it's not that simple.
> The problem is that the hash code isn't just stored; it's used to
> assign tuples to buckets.  If you have two hash codes, you have to
> pick one of the other to use for assigning the tuple to a bucket.  And
> then if you want to search using the other hash code, you have to
> search all of the buckets, which will stink.

If we follow GIST's lead that the leading column is "most important",
the idea could be to require a search constraint on the first column,
which produces the hash that determines the bucket assignment.  Hashes
for additional columns would just be payload data in the index entry.
If you have search constraint(s) on low-order column(s), you can check
for hash matches before visiting the heap, but they don't reduce how
much of the index you have to search.  Even btree works that way for
many combinations of incomplete index constraints.
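For illustration only (a sketch with today's btree machinery, not the
proposal itself; t, a, b, c are placeholder names), an index over
per-column hashes behaves analogously: the leading hash governs
placement, and the trailing hashes only narrow matches before the heap
is visited:

```sql
-- placeholder table/columns; hashtext() stands in for per-type hash support
create index t_hashes_idx on t (
    hashtext(a::text),  -- leading hash: governs position, like the bucket assignment
    hashtext(b::text),  -- trailing hashes: in-entry payload that filters matches
    hashtext(c::text)
);
```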
        regards, tom lane



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Wed, Sep 27, 2017 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 27, 2017 at 9:56 AM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>> Maybe an initial proof-of-concept could store the hash of the first column
>>> (col1) plus the hash of all columns (col1, col2, col3) in the index, and see
>>> what requirements / design decisions would appear from that.
>
>> I thought about that sort of thing yesterday but it's not that simple.
>> The problem is that the hash code isn't just stored; it's used to
>> assign tuples to buckets.  If you have two hash codes, you have to
>> pick one of the other to use for assigning the tuple to a bucket.  And
>> then if you want to search using the other hash code, you have to
>> search all of the buckets, which will stink.
>
> If we follow GIST's lead that the leading column is "most important",
> the idea could be to require a search constraint on the first column,
> which produces the hash that determines the bucket assignment.  Hashes
> for additional columns would just be payload data in the index entry.
> If you have search constraint(s) on low-order column(s), you can check
> for hash matches before visiting the heap, but they don't reduce how
> much of the index you have to search.  Even btree works that way for
> many combinations of incomplete index constraints.

I see.  That makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Tomasz Ostrowski
On 09/27/2017 05:57 PM, Tom Lane wrote:
> If we follow GIST's lead that the leading column is "most important",
> the idea could be to require a search constraint on the first column,
> which produces the hash that determines the bucket assignment.  Hashes
> for additional columns would just be payload data in the index entry.
> If you have search constraint(s) on low-order column(s), you can check
> for hash matches before visiting the heap, but they don't reduce how
> much of the index you have to search.  Even btree works that way for
> many combinations of incomplete index constraints.

I feel that this would eliminate a large amount of potential gains from 
such an index. This would be usable only when a sufficiently variable 
column exists, in which case a simple hash index on the column wouldn't 
be much worse.

But I have an idea. What if there were a requirement for the search 
criteria to use a tuple equality comparison:

    where (a,b,c) = (?,?,?)

or

    where (a,b,c) in ((?,?,?),(?,?,?),(?,?,?),(?,?,?))

Wouldn't it eliminate problems with disappearing conditions?

Actually maybe this could be implemented just like a functional index. 
So it would implement reasonably something that otherwise would be a 
terribly hackish and slow solution like:
    create or replace function hashhack(a bytea, b bytea)
    returns bigint
    language sql
    immutable
    as $$
        -- uses '\x1e' (record separator)
        -- to ensure hashhack('a','') != hashhack('','a')
        select (
            'x'
            ||
            substr(
                md5($1 || '\x1e'::bytea || $2),
                1,
                16
            )
        )::bit(64)::bigint;
    $$;

    create index t_hashhack_a_b_idx
        on t( hashhack(a::bytea, b::bytea) );

    select * from t
     where a='a' and b='b'
       and hashhack(a::bytea, b::bytea)
         = hashhack('a'::bytea, 'b'::bytea);

If it were automatic, one could avoid the overhead of converting data to 
bytea/string, concatenating, truncating, converting back to bigint, 
rechecking the condition, etc., that makes this kind of hack not very sane.

Even providing a specially crafted function or operator for queries 
specifically targeted for the index would be quite sufficient:

    where pg_equal( (a,b,c), (?,?,?) );

-- 
Tomasz "Tometzky" Ostrowski



Re: [HACKERS] Multicolumn hash indexes

From: Nico Williams
On Wed, Sep 27, 2017 at 11:57:23AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Sep 27, 2017 at 9:56 AM, Jesper Pedersen
> > <jesper.pedersen@redhat.com> wrote:
> >> Maybe an initial proof-of-concept could store the hash of the first column
> >> (col1) plus the hash of all columns (col1, col2, col3) in the index, and see
> >> what requirements / design decisions would appear from that.
> 
> > I thought about that sort of thing yesterday but it's not that simple.
> > The problem is that the hash code isn't just stored; it's used to
> > assign tuples to buckets.  If you have two hash codes, you have to
> > pick one of the other to use for assigning the tuple to a bucket.  And
> > then if you want to search using the other hash code, you have to
> > search all of the buckets, which will stink.
> 
> If we follow GIST's lead that the leading column is "most important",
> the idea could be to require a search constraint on the first column,
> which produces the hash that determines the bucket assignment.  Hashes
> for additional columns would just be payload data in the index entry.
> If you have search constraint(s) on low-order column(s), you can check
> for hash matches before visiting the heap, but they don't reduce how
> much of the index you have to search.  Even btree works that way for
> many combinations of incomplete index constraints.

But it might be difficult to obtain sufficient selectivity with any
choice of first column for an otherwise reasonable schema design.

Presumably that is why the OP is interested in hash indexing multiple
columns, as otherwise a plain old b-tree would have sufficed.

It might be better to address the planner assumptions about key prefixes
instead, though I recognize that that might be difficult (or perhaps
even undesirable).  The potential for slowing down the planner might be
too high.

I note that it's not possible today to create a hash index on an
expression -- if it were then the obvious workaround would be to create
a hash on a `row(<column_list>)` expression, then use a
`row(<column_list>) = row(<literal_or_parameter_list>)` WHERE clause in
queries, which presumably the planner could not decompose and optimize
or pessimize.  If need be the planner could learn to treat
row(<literal_or_parameter_list>) as a barrier to such optimizations.

A present workaround would be to denormalize the column list one cares
about into a single record-type column that one could then hash.  To use
this one would have to use a `record_column = row(<parameter_list>)`
WHERE clause.  Or, if record types cannot be hashed, then use
row_to_json() and hash the JSON text.
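Concretely, the record-column variant might look like this (a sketch
only: the composite type, table, and column names are hypothetical, and
it assumes the composite value is kept up to date by the application or
a trigger and is btree-comparable):

```sql
-- denormalize the key columns into a single composite-typed column:
create type t_key as (a int, b int, c int);
alter table t add column k t_key;
update t set k = row(a, b, c)::t_key;
create index t_k_idx on t (k);

-- queries then use a single equality on the whole record, which the
-- planner cannot decompose into per-column quals:
select * from t where k = row($1, $2, $3)::t_key;
```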

Nico
-- 



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Wed, Sep 27, 2017 at 5:52 PM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:
> I feel that this would eliminate a large amount of potential gains from such
> an index. This would be usable only when a sufficiently variable column
> exists, in which case a simple hash index on the column wouldn't be much
> worse.

If the index isn't very selective and any query will match many rows
anyway, I think it will often be superior to use a BRIN index (but I
might be wrong).

Maybe you're worrying about something like a billion-row table where
there are 3 columns that form a composite key: (1,1,1), (1,1,2), ...,
(1,1,1000),(1,2,1),...,(1,1000,1000),(2,1,1),...,(1000,1000,1000).  In
that case, treating the leading column as most important will indeed
be terrible, since we'll put all billion rows into 1000 buckets no
matter how many bucket splits we do.

That seems a little unusual, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> Maybe you're worrying about something like a billion-row table where
> there are 3 columns that form a composite key: (1,1,1), (1,1,2), ...,
> (1,1,1000),(1,2,1),...,(1,1000,1000),(2,1,1),...,(1000,1000,1000).  In
> that case, treating the leading column as most important will indeed
> be terrible, since we'll put all billion rows into 1000 buckets no
> matter how many bucket splits we do.

> That seems a little unusual, though.

There are few if any indexing techniques where the first column isn't
significantly more important than the rest --- certainly that's true
for btree, for example.  I do not think it's a showstopper if that's
true for hash as well.
        regards, tom lane



Re: [HACKERS] Multicolumn hash indexes

From: Robert Haas
On Fri, Sep 29, 2017 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are few if any indexing techniques where the first column isn't
> significantly more important than the rest --- certainly that's true
> for btree, for example.

Well, BRIN doesn't care.

> I do not think it's a showstopper if that's
> true for hash as well.

Certainly true, but on the other hand, if the design for which you are
advocating doesn't address a problem Tomasz has, he probably won't
want to bother implementing it.

I think it's pretty interesting to think about what other index
designs we might have that using hashing but are not the same as our
existing hash indexes.  Reviewing Amit's patches for hash indexes has
helped me to understand how the hash AM works a lot better than I had
before, and it's really a pretty limiting design.  The buckets aren't
sorted by hashcode overall, just within a page, which sucks, and the
way that storage allocations are managed within the index is painful.
It might be worth thinking about what we might create for a hash index
today if we were starting over from scratch.  Andres proposed a
hash-over-btree thing where you build a hash value for each tuple and
then sort them by hash code and arrange the results into a btree; I
think that would have the same multi-column index problems you're
talking about here, but it would be more space-efficient.

Now you could also imagine something where we keep a separate set of
hash buckets for each column and multi-column searches are handled by
searching each hash table separately and taking the intersection of
the sets of TIDs found for each column.  Not sure that's a good idea,
but it's sort of like what GIN does.
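The nearest analogue available today is arguably independent
single-column hash indexes, which the planner can already combine with a
BitmapAnd, intersecting TID bitmaps much as described above (an
illustrative sketch; t, a, b are placeholder names):

```sql
create index t_a_hash on t using hash (a);
create index t_b_hash on t using hash (b);

-- for "where a = $1 and b = $2", EXPLAIN may show a BitmapAnd over
-- bitmap scans of both hash indexes.
```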

There are probably other ideas, too.  We should investigate what
others have done.  I see that
https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/hash-indexes-for-memory-optimized-tables
section E.1 claims that SQL Server requires a hash index to include an
equality condition on every column in the index.  I still think that's
an appealing option from a theoretical point of view even if the
implementation difficulties are formidable.

Come to think of it, I think that the planner behavior you describe
whereby we put a premium on throwing away unnecessary conditions is
probably not so great in other cases, too.  In a btree index, we have
to step through all the tuples in the range covered by the index scan
one by one; if the join sitting above the index scan will reject those
tuples anyway when it tests the join qual, I can see that doing the
same test redundantly at the scan level could well be a loser.  But
suppose it's a BRIN index.  Now doing the test at the scan level is a
lot more likely to win -- I think -- because BRIN can perform that
test once per page rather than once per tuple, whereas the join will
have to reject individual tuples.

This sounds very similar to the question of whether we ought to infer
implied inequalities, which as you know has been a regular complaint.
It's certainly the case that enforcing an inequality in multiple
places can be inefficient, but it can also be a tremendous win if the
inequality is highly selective, which is not uncommon (e.g. imagine a
sort-and-merge-join between two tables and an inequality on the join
column that knocks out 90% - or 99% - of the rows in each).  I have
been wondering whether we need a concept of an optional qual.  Instead
of dropping a qual that doesn't need to be enforced, move it to a
separate bucket of quals that can be enforced if it looks like a win
from a cost perspective and otherwise ignored without breaking
anything.  Implied inequalities (and maybe implied equalities) could
be thrown into such a bucket as well, allowing the decision about
whether to enforce them to be based on some algorithm or heuristic
rather than a summary judgement.

Or maybe that's not the right concept at all, but I think we need some
idea about how cases like this can be handled.  People want the
database to handle the cases about which they care, not judge which
cases those are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multicolumn hash indexes

From: Alexander Korotkov
On Fri, Sep 29, 2017 at 6:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Now you could also imagine something where we keep a separate set of
hash buckets for each column and multi-column searches are handled by
searching each hash table separately and taking the intersection of
the sets of TIDs found for each column.  Not sure that's a good idea,
but it's sort of like what GIN does.

I'd like to note that what GIN does for multicolumn indexes seems like quite an artificial thing to me.  Logically, a multicolumn GIN index is the same thing as multiple single-column GIN indexes.  The reason multicolumn GIN exists is that in one particular case GIN overlaps scan results over multiple attributes more efficiently than our executor does.  That particular case is when scans over multiple columns return ordered sets of TIDs: GIN implements a kind of merge join over TIDs, while our executor can only do a BitmapAnd.

However, if we imagine that, at the moment we developed GIN, our executor could do a merge join of TIDs produced by multiple index scans, then we wouldn't have invented multicolumn GIN at all.  Or we would probably have invented a very different kind of multicolumn GIN.

I mean that multicolumn GIN was a good solution to a particular problem from a practical point of view.  It allowed us to implement very efficient algorithms with minimal changes in planner/executor infrastructure.  However, multicolumn GIN doesn't look like a design pattern we would like to repeat.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Multicolumn hash indexes

From: Nico Williams
On Fri, Sep 29, 2017 at 10:54:55AM -0400, Tom Lane wrote:
> There are few if any indexing techniques where the first column isn't
> significantly more important than the rest --- certainly that's true
> for btree, for example.  I do not think it's a showstopper if that's
> true for hash as well.

You have N>1 columns to index, which you'll be querying with a
conjunction of equality predicates over all of them.  No one of those
columns is sufficiently selective on its own, but all the columns
together are plenty selective enough.

Obviously a [multi-column] hash index should do.  The main question is
whether the planner can be made to not consider subsets of the columns
to the exclusion of the hash index -- maybe not, or not easily enough.

This is easy enough to implement with a B-Tree index on an expression
consisting of a decent hash function application to the row_to_json() of
a row composed of the columns in question.  But it requires using the
same expression in queries, naturally, which then acts as a barrier to
the planner's propensity to drop columns from consideration for index
selection.
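Spelled out, that workaround might look like the following (a sketch;
table and column names are placeholders, and hashtext() stands in for "a
decent hash function"):

```sql
create index t_rowhash_idx on t (
    (hashtext(row_to_json(row(a, b, c))::text))
);

-- the same expression must appear verbatim in queries, with the raw
-- columns rechecked to screen out hash collisions:
select * from t
 where hashtext(row_to_json(row(a, b, c))::text)
     = hashtext(row_to_json(row($1, $2, $3))::text)
   and a = $1 and b = $2 and c = $3;
```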

A multi-column hash index facility shouldn't have to require anything
more than where clauses that are a conjunction of all the columns with
equality predicates.

Nico
-- 

