Thread: index type for indexing long texts

index type for indexing long texts

From
Aleksander Kmetec
Date:
(I'm reposting this because the original message didn't make it through in the last ~20 hours)

Hi,

I'm looking for a solution for indexing long TEXT columns. We're currently using a HASH index, which can handle most
situations, but every now and then we need support for even longer texts.

One solution would be to create a functional index which would only use the first N chars of mycol, but then we'd have
to change several hundred occurences of "mycol = someval" with "(mycol = someval AND firstN(mycol) = firstN(someval))",

as well as update some SQL generators...

That's why I'd be interested to know if there are any index types available which store only the first N chars or use
some highly compressed form for storing index data, and then recheck any potential hits against the main table. And if
something like that does not exist yet, how difficult would it be to construct such a solution out of many "spare
parts" 
that come with PG?

All we really need is index support for the "=" operator; everything else is optional.
We're currently using PG 8.1.

I'd be grateful for any suggestions.

Regards,
Aleksander

Re: index type for indexing long texts

From
Tom Lane
Date:
Aleksander Kmetec <aleksander.kmetec@intera.si> writes:
> I'm looking for a solution for indexing long TEXT columns. We're currently using a HASH index, which can handle most
> situations, but every now and then we need support for even longer texts.

> One solution would be to create a functional index which would only use the first N chars of mycol, but then we'd
have 
> to change several hundred occurences of "mycol = someval" with "(mycol = someval AND firstN(mycol) =
firstN(someval))", 
> as well as update some SQL generators...

> That's why I'd be interested to know if there are any index types available which store only the first N chars or use

> some highly compressed form for storing index data, and then recheck any potential hits against the main table. And
if 
> something like that does not exist yet, how difficult would it be to construct such a solution out of many "spare
parts" 
> that come with PG?

I think you could do it with GiST.  Look at contrib/btree_gist, and
make a variant version of its text support in which only the first N
characters are stored/compared, then declare all the operators as
RECHECK.  (Actually, I'm not sure that "<" etc would work correctly in
such a scenario, especially in non-C locales; it might be best to
declare the operator class as containing only "=".)

I don't think you can do it in btree or hash because they assume that
the operators are strictly consistent with the support functions, and
in fact apply the operators directly in some code paths.  So you
couldn't use true text equality as the "=" operator, which ruins your
chances of not changing your queries.  But GiST never touches the
operators at all, just the support procs, and you can make the support
procs do anything.  So it should be possible to make the index work with
just the first N characters, and then the RECHECK with true equality
would filter out the wrong matches.

BTW, if you get something that works well, stick it up on pgfoundry;
you're not the first to have asked this ...

            regards, tom lane

Re: index type for indexing long texts

From
Richard Troy
Date:

> Aleksander Kmetec <aleksander.kmetec@intera.si> writes:
> > I'm looking for a solution for indexing long TEXT columns. We're currently using a HASH index, which can handle
most
> > situations, but every now and then we need support for even longer texts.
>
> > One solution would be to create a functional index which would only use the first N chars of mycol, but then we'd
have
> > to change several hundred occurences of "mycol = someval" with "(mycol = someval AND firstN(mycol) =
firstN(someval))",
> > as well as update some SQL generators...
>
> > That's why I'd be interested to know if there are any index types available which store only the first N chars or
use
> > some highly compressed form for storing index data, and then recheck any potential hits against the main table. And
if
> > something like that does not exist yet, how difficult would it be to construct such a solution out of many "spare
parts"
> > that come with PG?
>

Try moving where the hash takes place - ie, use your own hash function to
create the key.

RT



--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy@ScienceTools.com, http://ScienceTools.com/


Re: index type for indexing long texts

From
Aleksander Kmetec
Date:
Thank you both for your suggestions.

I think I'll try the GiST approach first since using an existing contrib extension as a starting point seems like a
simpler task for someone like me. :)

Regards,
Aleksander

Richard Troy wrote:
>
>> Aleksander Kmetec <aleksander.kmetec@intera.si> writes:
>>> I'm looking for a solution for indexing long TEXT columns. We're currently using a HASH index, which can handle
most
>>> situations, but every now and then we need support for even longer texts.
>>> One solution would be to create a functional index which would only use the first N chars of mycol, but then we'd
have
>>> to change several hundred occurences of "mycol = someval" with "(mycol = someval AND firstN(mycol) =
firstN(someval))",
>>> as well as update some SQL generators...
>>> That's why I'd be interested to know if there are any index types available which store only the first N chars or
use
>>> some highly compressed form for storing index data, and then recheck any potential hits against the main table. And
if
>>> something like that does not exist yet, how difficult would it be to construct such a solution out of many "spare
parts"
>>> that come with PG?
>
> Try moving where the hash takes place - ie, use your own hash function to
> create the key.
>
> RT
>
>
>