Thread: HASH index method not correctly handling NULL text values?

HASH index method not correctly handling NULL text values?

From
David Madore
Date:
Hello.

I don't know if this counts as a bug or not.  I'm using PostgreSQL
version 7.1.3, and I tried creating an index using HASH (rather than
BTREE as I usually do) on a text column of one of my tables.  The
table has 345442 rows, of which 344339 have a non NULL value in the
column in question.  The index creation proceeded without trouble, but
a subsequent "VACUUM VERBOSE ANALYZE" command gave the following
warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP'
(345442)" with the advice to recreate the index.  Essentially, my
question is: is the warning incorrect or is the index so?  In the end
I recreated my index as a BTREE and this gave no problem.

Incidentally, I might ask, which is the best choice of indexing method
considering that the only comparison I will ever make on this column
is equality (this is what led me to prefer HASH initially) and
considering that the column takes of the order of magnitude of a few
hundred distinct values (as opposed to the third-of-a-million rows of
the table)?  The BTREE index creation was considerably faster than the
HASH index creation.

In case this is relevant or of any help, here is the creation command
for the table:

CREATE TABLE articles (
  id serial PRIMARY KEY ,
  msgid text UNIQUE NOT NULL ,
  from_header text NOT NULL ,
  subject_header text NOT NULL ,
  date timestamp NOT NULL ,
  posting_date timestamp ,
  last_reference text ,
  lines int NOT NULL ,
  bytes int NOT NULL ,
  sender_name text ,
  sender_host text
) ;

here is the command I tried to create the HASH index:

CREATE INDEX articles_sender_name_key ON articles USING HASH ( sender_name ) ;

and here is the corresponding output from "VACUUM VERBOSE ANALYZE"
(the databased had already been vacuumed just prior to index
creation):

NOTICE:  --Relation articles--
NOTICE:  Pages 10326: Changed 0, reaped 3, Empty 0, New 0; Tup 345442: Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 9, MinLen
151,MaxLen 833; Re-using: Free/Avail. Space 3356/0; EndEmpty/Avail. Pages 0/0. CPU 0.58s/0.36u sec. 
NOTICE:  Index articles_pkey: Pages 1809; Tuples 345442: Deleted 0. CPU 0.25s/1.54u sec.
NOTICE:  Index articles_msgid_key: Pages 2524; Tuples 345442: Deleted 0. CPU 0.22s/1.63u sec.
NOTICE:  Index articles_posting_date_key: Pages 957; Tuples 345442: Deleted 0. CPU 0.12s/1.59u sec.
NOTICE:  Index articles_last_reference_key: Pages 2324; Tuples 345442: Deleted 0. CPU 0.18s/1.68u sec.
NOTICE:  Index articles_sender_name_key: Pages 3089; Tuples 344339: Deleted 0. CPU 0.17s/10.79u sec.
NOTICE:  Index articles_sender_name_key: NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' (345442).
        Recreate the index.

PS: Please send copy of replies to me personally as I do not receive
mail from the list.  Thanks again.

--
     David A. Madore
    (david.madore@ens.fr,
     http://www.eleves.ens.fr:8080/home/madore/ )

Re: HASH index method not correctly handling NULL text

From
Neil Conway
Date:
On Sat, 2002-02-02 at 18:49, David Madore wrote:
> Hello.
>
> I don't know if this counts as a bug or not.  I'm using PostgreSQL
> version 7.1.3, and I tried creating an index using HASH (rather than
> BTREE as I usually do) on a text column of one of my tables.  The
> table has 345442 rows, of which 344339 have a non NULL value in the
> column in question.  The index creation proceeded without trouble, but
> a subsequent "VACUUM VERBOSE ANALYZE" command gave the following
> warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP'
> (345442)" with the advice to recreate the index.  Essentially, my
> question is: is the warning incorrect or is the index so?  In the end
> I recreated my index as a BTREE and this gave no problem.

The hash index type has some known problems. For one thing, it has a
tendancy to deadlock under heavy load -- on my machine, 'pgbench -c 10
-t 100' will reproducibly produce deadlocks and failed queries. I would
not advise the use of hash indexes for production machines.

However, I've been starting to do some Pg hacking. Coincidentally, my
first project is to attempt to improve hash indexes. I'll see if I can
reproduce your problem on my local machine, and I'll let you know
if/when I have a fix.

> Incidentally, I might ask, which is the best choice of indexing method
> considering that the only comparison I will ever make on this column
> is equality (this is what led me to prefer HASH initially) and
> considering that the column takes of the order of magnitude of a few
> hundred distinct values (as opposed to the third-of-a-million rows of
> the table)?  The BTREE index creation was considerably faster than the
> HASH index creation.

Currently, btree is better than hash for almost any conceivable use. In
theory, I believe hash could provide slightly faster lookups than btree
(can someone confirm this?), but given the issues with hash at the
moment, it's not worth it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: HASH index method not correctly handling NULL text values?

From
Tom Lane
Date:
David Madore <david.madore@ens.fr> writes:
> table has 345442 rows, of which 344339 have a non NULL value in the
> column in question.  The index creation proceeded without trouble, but
> a subsequent "VACUUM VERBOSE ANALYZE" command gave the following
> warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP'
> (345442)" with the advice to recreate the index.  Essentially, my
> question is: is the warning incorrect or is the index so?

The warning should be suppressed for hash indexes, since they don't
include nulls.  I believe this is fixed in 7.2.

> Incidentally, I might ask, which is the best choice of indexing method
> considering that the only comparison I will ever make on this column
> is equality (this is what led me to prefer HASH initially) and
> considering that the column takes of the order of magnitude of a few
> hundred distinct values (as opposed to the third-of-a-million rows of
> the table)?  The BTREE index creation was considerably faster than the
> HASH index creation.

Our BTREE implementation is a lot better than our HASH implementation;
I have a difficult time recommending the latter for anything.

            regards, tom lane

Re: HASH index method not correctly handling NULL text

From
Neil Conway
Date:
On Sat, 2002-02-02 at 19:58, Tom Lane wrote:
> David Madore <david.madore@ens.fr> writes:
> > table has 345442 rows, of which 344339 have a non NULL value in the
> > column in question.  The index creation proceeded without trouble, but
> > a subsequent "VACUUM VERBOSE ANALYZE" command gave the following
> > warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP'
> > (345442)" with the advice to recreate the index.  Essentially, my
> > question is: is the warning incorrect or is the index so?
>
> The warning should be suppressed for hash indexes, since they don't
> include nulls.  I believe this is fixed in 7.2.

Is there a reason why hash indexes don't include NULLs?

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: HASH index method not correctly handling NULL text

From
Tom Lane
Date:
Neil Conway <nconway@klamath.dyndns.org> writes:
>> The warning should be suppressed for hash indexes, since they don't
>> include nulls.  I believe this is fixed in 7.2.

> Is there a reason why hash indexes don't include NULLs?

Nobody's got around to fixing them to do so.  AFAICS it should be easy
enough to do; just assign a fixed hash code (zero, likely) for NULLs,
and adjust the comparison routines to be NULL-conscious.

If you want to work on the hash index code, feel free.  My own vision
of things says that we should put our effort into the btree and GIST
index types, which really cover the scalar and multidimensional cases
pretty effectively.  If we had unlimited manpower then it'd be worth
working on hash and rtree too, but I'd be just as happy leaving them
to rot quietly.

But, as always, Postgres is a volunteer project, and the work that
gets done is whatever someone is interested/motivated to work on.
So if improving hash indexes is what floats your boat at the moment,
then by all means go for it.

            regards, tom lane

Re: HASH index method not correctly handling NULL text

From
Neil Conway
Date:
On Sat, 2002-02-02 at 22:09, Tom Lane wrote:
> Neil Conway <nconway@klamath.dyndns.org> writes:
> >> The warning should be suppressed for hash indexes, since they don't
> >> include nulls.  I believe this is fixed in 7.2.
>
> > Is there a reason why hash indexes don't include NULLs?
>
> Nobody's got around to fixing them to do so.  AFAICS it should be easy
> enough to do; just assign a fixed hash code (zero, likely) for NULLs,
> and adjust the comparison routines to be NULL-conscious.

Okay, I'll take a look.

> But, as always, Postgres is a volunteer project, and the work that
> gets done is whatever someone is interested/motivated to work on.
> So if improving hash indexes is what floats your boat at the moment,
> then by all means go for it.

I was looking for a reasonably simple part of the backend to start
working on. If you know of an area that could use improvement, isn't too
complex and is more useful than hash indexes, please suggest it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: HASH index method not correctly handling NULL text

From
Bruce Momjian
Date:
Tom Lane wrote:
> Neil Conway <nconway@klamath.dyndns.org> writes:
> >> The warning should be suppressed for hash indexes, since they don't
> >> include nulls.  I believe this is fixed in 7.2.
>
> > Is there a reason why hash indexes don't include NULLs?
>
> Nobody's got around to fixing them to do so.  AFAICS it should be easy
> enough to do; just assign a fixed hash code (zero, likely) for NULLs,
> and adjust the comparison routines to be NULL-conscious.
>
> If you want to work on the hash index code, feel free.  My own vision
> of things says that we should put our effort into the btree and GIST
> index types, which really cover the scalar and multidimensional cases
> pretty effectively.  If we had unlimited manpower then it'd be worth
> working on hash and rtree too, but I'd be just as happy leaving them
> to rot quietly.

The big question is how should we document the fact that hash isn't
recommended?  We get periodic questions about it and I don't think the
FAQ is the place for it because it is something pretty fundamental we
should document.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: HASH index method not correctly handling NULL text

From
"Rob Arnold"
Date:
Send a "NOTICE: Use of Hash index is deprecated, please use btree instead"
any time someone makes a hash index?

--rob

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Neil Conway" <nconway@klamath.dyndns.org>;
<pgsql-general@postgresql.org>; <david.madore@ens.fr>
Sent: Monday, February 04, 2002 12:00 AM
Subject: Re: HASH index method not correctly handling NULL text


> Tom Lane wrote:
> > Neil Conway <nconway@klamath.dyndns.org> writes:
> > >> The warning should be suppressed for hash indexes, since they don't
> > >> include nulls.  I believe this is fixed in 7.2.
> >
> > > Is there a reason why hash indexes don't include NULLs?
> >
> > Nobody's got around to fixing them to do so.  AFAICS it should be easy
> > enough to do; just assign a fixed hash code (zero, likely) for NULLs,
> > and adjust the comparison routines to be NULL-conscious.
> >
> > If you want to work on the hash index code, feel free.  My own vision
> > of things says that we should put our effort into the btree and GIST
> > index types, which really cover the scalar and multidimensional cases
> > pretty effectively.  If we had unlimited manpower then it'd be worth
> > working on hash and rtree too, but I'd be just as happy leaving them
> > to rot quietly.
>
> The big question is how should we document the fact that hash isn't
> recommended?  We get periodic questions about it and I don't think the
> FAQ is the place for it because it is something pretty fundamental we
> should document.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>