Thread: HASH index method not correctly handling NULL text values?
Hello. I don't know if this counts as a bug or not. I'm using PostgreSQL version 7.1.3, and I tried creating an index using HASH (rather than BTREE as I usually do) on a text column of one of my tables. The table has 345442 rows, of which 344339 have a non NULL value in the column in question. The index creation proceeded without trouble, but a subsequent "VACUUM VERBOSE ANALYZE" command gave the following warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' (345442)" with the advice to recreate the index. Essentially, my question is: is the warning incorrect or is the index so? In the end I recreated my index as a BTREE and this gave no problem. Incidentally, I might ask, which is the best choice of indexing method considering that the only comparison I will ever make on this column is equality (this is what led me to prefer HASH initially) and considering that the column takes of the order of magnitude of a few hundred distinct values (as opposed to the third-of-a-million rows of the table)? The BTREE index creation was considerably faster than the HASH index creation. In case this is relevant or of any help, here is the creation command for the table: CREATE TABLE articles ( id serial PRIMARY KEY , msgid text UNIQUE NOT NULL , from_header text NOT NULL , subject_header text NOT NULL , date timestamp NOT NULL , posting_date timestamp , last_reference text , lines int NOT NULL , bytes int NOT NULL , sender_name text , sender_host text ) ; here is the command I tried to create the HASH index: CREATE INDEX articles_sender_name_key ON articles USING HASH ( sender_name ) ; and here is the corresponding output from "VACUUM VERBOSE ANALYZE" (the databased had already been vacuumed just prior to index creation): NOTICE: --Relation articles-- NOTICE: Pages 10326: Changed 0, reaped 3, Empty 0, New 0; Tup 345442: Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 9, MinLen 151,MaxLen 833; Re-using: Free/Avail. Space 3356/0; EndEmpty/Avail. Pages 0/0. CPU 0.58s/0.36u sec. NOTICE: Index articles_pkey: Pages 1809; Tuples 345442: Deleted 0. CPU 0.25s/1.54u sec. NOTICE: Index articles_msgid_key: Pages 2524; Tuples 345442: Deleted 0. CPU 0.22s/1.63u sec. NOTICE: Index articles_posting_date_key: Pages 957; Tuples 345442: Deleted 0. CPU 0.12s/1.59u sec. NOTICE: Index articles_last_reference_key: Pages 2324; Tuples 345442: Deleted 0. CPU 0.18s/1.68u sec. NOTICE: Index articles_sender_name_key: Pages 3089; Tuples 344339: Deleted 0. CPU 0.17s/10.79u sec. NOTICE: Index articles_sender_name_key: NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' (345442). Recreate the index. PS: Please send copy of replies to me personally as I do not receive mail from the list. Thanks again. -- David A. Madore (david.madore@ens.fr, http://www.eleves.ens.fr:8080/home/madore/ )
On Sat, 2002-02-02 at 18:49, David Madore wrote: > Hello. > > I don't know if this counts as a bug or not. I'm using PostgreSQL > version 7.1.3, and I tried creating an index using HASH (rather than > BTREE as I usually do) on a text column of one of my tables. The > table has 345442 rows, of which 344339 have a non NULL value in the > column in question. The index creation proceeded without trouble, but > a subsequent "VACUUM VERBOSE ANALYZE" command gave the following > warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' > (345442)" with the advice to recreate the index. Essentially, my > question is: is the warning incorrect or is the index so? In the end > I recreated my index as a BTREE and this gave no problem. The hash index type has some known problems. For one thing, it has a tendancy to deadlock under heavy load -- on my machine, 'pgbench -c 10 -t 100' will reproducibly produce deadlocks and failed queries. I would not advise the use of hash indexes for production machines. However, I've been starting to do some Pg hacking. Coincidentally, my first project is to attempt to improve hash indexes. I'll see if I can reproduce your problem on my local machine, and I'll let you know if/when I have a fix. > Incidentally, I might ask, which is the best choice of indexing method > considering that the only comparison I will ever make on this column > is equality (this is what led me to prefer HASH initially) and > considering that the column takes of the order of magnitude of a few > hundred distinct values (as opposed to the third-of-a-million rows of > the table)? The BTREE index creation was considerably faster than the > HASH index creation. Currently, btree is better than hash for almost any conceivable use. In theory, I believe hash could provide slightly faster lookups than btree (can someone confirm this?), but given the issues with hash at the moment, it's not worth it. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
David Madore <david.madore@ens.fr> writes: > table has 345442 rows, of which 344339 have a non NULL value in the > column in question. The index creation proceeded without trouble, but > a subsequent "VACUUM VERBOSE ANALYZE" command gave the following > warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' > (345442)" with the advice to recreate the index. Essentially, my > question is: is the warning incorrect or is the index so? The warning should be suppressed for hash indexes, since they don't include nulls. I believe this is fixed in 7.2. > Incidentally, I might ask, which is the best choice of indexing method > considering that the only comparison I will ever make on this column > is equality (this is what led me to prefer HASH initially) and > considering that the column takes of the order of magnitude of a few > hundred distinct values (as opposed to the third-of-a-million rows of > the table)? The BTREE index creation was considerably faster than the > HASH index creation. Our BTREE implementation is a lot better than our HASH implementation; I have a difficult time recommending the latter for anything. regards, tom lane
On Sat, 2002-02-02 at 19:58, Tom Lane wrote: > David Madore <david.madore@ens.fr> writes: > > table has 345442 rows, of which 344339 have a non NULL value in the > > column in question. The index creation proceeded without trouble, but > > a subsequent "VACUUM VERBOSE ANALYZE" command gave the following > > warning: "NUMBER OF INDEX' TUPLES (344339) IS NOT THE SAME AS HEAP' > > (345442)" with the advice to recreate the index. Essentially, my > > question is: is the warning incorrect or is the index so? > > The warning should be suppressed for hash indexes, since they don't > include nulls. I believe this is fixed in 7.2. Is there a reason why hash indexes don't include NULLs? Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Neil Conway <nconway@klamath.dyndns.org> writes: >> The warning should be suppressed for hash indexes, since they don't >> include nulls. I believe this is fixed in 7.2. > Is there a reason why hash indexes don't include NULLs? Nobody's got around to fixing them to do so. AFAICS it should be easy enough to do; just assign a fixed hash code (zero, likely) for NULLs, and adjust the comparison routines to be NULL-conscious. If you want to work on the hash index code, feel free. My own vision of things says that we should put our effort into the btree and GIST index types, which really cover the scalar and multidimensional cases pretty effectively. If we had unlimited manpower then it'd be worth working on hash and rtree too, but I'd be just as happy leaving them to rot quietly. But, as always, Postgres is a volunteer project, and the work that gets done is whatever someone is interested/motivated to work on. So if improving hash indexes is what floats your boat at the moment, then by all means go for it. regards, tom lane
On Sat, 2002-02-02 at 22:09, Tom Lane wrote: > Neil Conway <nconway@klamath.dyndns.org> writes: > >> The warning should be suppressed for hash indexes, since they don't > >> include nulls. I believe this is fixed in 7.2. > > > Is there a reason why hash indexes don't include NULLs? > > Nobody's got around to fixing them to do so. AFAICS it should be easy > enough to do; just assign a fixed hash code (zero, likely) for NULLs, > and adjust the comparison routines to be NULL-conscious. Okay, I'll take a look. > But, as always, Postgres is a volunteer project, and the work that > gets done is whatever someone is interested/motivated to work on. > So if improving hash indexes is what floats your boat at the moment, > then by all means go for it. I was looking for a reasonably simple part of the backend to start working on. If you know of an area that could use improvement, isn't too complex and is more useful than hash indexes, please suggest it. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Tom Lane wrote: > Neil Conway <nconway@klamath.dyndns.org> writes: > >> The warning should be suppressed for hash indexes, since they don't > >> include nulls. I believe this is fixed in 7.2. > > > Is there a reason why hash indexes don't include NULLs? > > Nobody's got around to fixing them to do so. AFAICS it should be easy > enough to do; just assign a fixed hash code (zero, likely) for NULLs, > and adjust the comparison routines to be NULL-conscious. > > If you want to work on the hash index code, feel free. My own vision > of things says that we should put our effort into the btree and GIST > index types, which really cover the scalar and multidimensional cases > pretty effectively. If we had unlimited manpower then it'd be worth > working on hash and rtree too, but I'd be just as happy leaving them > to rot quietly. The big question is how should we document the fact that hash isn't recommended? We get periodic questions about it and I don't think the FAQ is the place for it because it is something pretty fundamental we should document. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Send a "NOTICE: Use of Hash index is deprecated, please use btree instead" any time someone makes a hash index? --rob ----- Original Message ----- From: "Bruce Momjian" <pgman@candle.pha.pa.us> To: "Tom Lane" <tgl@sss.pgh.pa.us> Cc: "Neil Conway" <nconway@klamath.dyndns.org>; <pgsql-general@postgresql.org>; <david.madore@ens.fr> Sent: Monday, February 04, 2002 12:00 AM Subject: Re: HASH index method not correctly handling NULL text > Tom Lane wrote: > > Neil Conway <nconway@klamath.dyndns.org> writes: > > >> The warning should be suppressed for hash indexes, since they don't > > >> include nulls. I believe this is fixed in 7.2. > > > > > Is there a reason why hash indexes don't include NULLs? > > > > Nobody's got around to fixing them to do so. AFAICS it should be easy > > enough to do; just assign a fixed hash code (zero, likely) for NULLs, > > and adjust the comparison routines to be NULL-conscious. > > > > If you want to work on the hash index code, feel free. My own vision > > of things says that we should put our effort into the btree and GIST > > index types, which really cover the scalar and multidimensional cases > > pretty effectively. If we had unlimited manpower then it'd be worth > > working on hash and rtree too, but I'd be just as happy leaving them > > to rot quietly. > > The big question is how should we document the fact that hash isn't > recommended? We get periodic questions about it and I don't think the > FAQ is the place for it because it is something pretty fundamental we > should document. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 >