Re: BUG #2737: hash indexing large table fails, while btree of same index works - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #2737: hash indexing large table fails, while btree of same index works
Msg-id 7530.1163202951@sss.pgh.pa.us
In response to BUG #2737: hash indexing large table fails, while btree of same index works  ("Balazs Nagy" <bnagy@thenewpush.com>)
Responses Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works
List pgsql-bugs
[ cc'ing to pgsql-performance because of performance issue for hash indexes ]

"Balazs Nagy" <bnagy@thenewpush.com> writes:
> Database table size: ~60 million rows
> Field to index: varchar 127

> CREATE INDEX ... USING hash ...

> fails with a file not found error (psql in verbose mode):

> ERROR:  58P01: could not open segment 3 of relation 1663/16439/16509 (target
> block 528283): No such file or directory
> LOCATION:  _mdfd_getseg, md.c:954

Wow, you're trying to build an 8GB hash index?  Considering that hash
indexes still don't have WAL support, it hardly seems like a good idea
to be using one that large.

The immediate problem here seems to be that the hash code is trying to
touch a page in segment 4 when it hasn't yet touched anything in segment
3.  The low-level md.c code assumes (not unreasonably) that this
probably represents a bug in the calling code, and errors out instead of
allowing the segment to be created.

We ought to think about rejiggering the smgr.c interface to support
hash's behavior more reasonably.  There's already one really bad kluge
in mdread() for hash support :-(

One thought that comes to mind is to require hash to do an smgrextend()
addressing the last block it intends to use whenever it allocates a new
batch of blocks, whereupon md.c could adopt a saner API: allow
smgrextend but not other calls to address blocks beyond the current EOF.
I had once wanted to require hash to explicitly fill all the blocks in
sequence, but that's probably too radical compared to what it does now
--- especially seeing that it seems the extension has to be done while
holding the page-zero lock (see _hash_expandtable).  Writing just the
logically last block in a batch would have the effect that hash indexes
could contain holes (unallocated extents) on filesystems that support
that.  Normally I would think that probably a Bad Thing, but since hash
indexes are never scanned sequentially, it might not matter whether they
end up badly fragmented because of after-the-fact filling in of a hole.
Thoughts?

            regards, tom lane
