Thread: Full-Text-Indexing

Full-Text-Indexing

From
Alexander Schneider
Date:
Hello,

I am also in need of FTI and looked iin the script from Maarten 
Boeklhold. I have several questions regarding this:

1) Why should I use "varchar(n)" instead of "text" though the space for a 
varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is 
it more efficient for queries?

2) I would like a table where I can insert/update/delete the stop-words 
instead of an array inside the code. This would also help in lower the 
words to search for in the query. Are there any grave performance 
penaltys for this solution? It is much more flexible and I would trade a 
small performance loss for this flexibility. [I cannot program in C so 
maybe there is a volunteer who could write this little snippet for me? 
:-)]

Thanx for any helpful comments!

Alex Schneider


Re: [SQL] Full-Text-Indexing

From
Tom Lane
Date:
Alexander Schneider <alex@alexander-schneider.de> writes:
> 1) Why should I use "varchar(n)" instead of "text" though the space for a 
> varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is 
> it more efficient for queries?

Actually, both text and varchar have 4-byte overhead.  There's no
difference at all in the representation, and none to speak of in
efficiency.

The reason you'd use varchar() is if you have an application-defined
restriction that the contents of the field may not exceed some
particular length.  In that case you say varchar(N), which both makes
the intent visible and allows Postgres to enforce the requirement for
you.

If you don't have any particular upper bound in mind for the string
length, use text type.

Just for completeness: char(N) also has a 4-byte overhead ... rather
unnecessarily, but that's how it's done at the moment.  But there is
a marginal (very marginal) efficiency gain in access, since the system
knows that the field is always the same length --- that makes it
easier to determine the locations of following fields in the same
tuple.  I think you'd pretty much only use char(N) when you have
an application semantics restriction that the string length must be
exactly N characters.  For example, in US postal addresses the state
abbreviation would be char(2).
        regards, tom lane


Re: [SQL] Full-Text-Indexing

From
Maarten Boekhold
Date:
> 1) Why should I use "varchar(n)" instead of "text" though the space for a
> varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is
> it more efficient for queries?

No reason, I think it the code would even work without changes if you
use 'text' fields...
> 2) I would like a table where I can insert/update/delete the stop-words
> instead of an array inside the code. This would also help in lower the
> words to search for in the query. Are there any grave performance
> penaltys for this solution? It is much more flexible and I would trade a
> small performance loss for this flexibility. [I cannot program in C so
> maybe there is a volunteer who could write this little snippet for me?

The problem here is that without resorting to some advanced stuff using
shared memory or whatever that table would have to be read by every backend
starting up, which might be some significant overhead.

I'm still hoping somebody can rewrite this stuff to use some specific indexing
method so we don't have to go the way of introducing this intermediate table
(i.e. store the substring/oid pairs directly in a btree item and use some
new index access method to take something like field ~= 'rolling NOT stone'
to search this index).

Maarten

-- 

Maarten Boekhold, boekhold@tibco.com
TIBCO Finance Technology Inc.
The Atrium
Strawinskylaan 3051
1077 ZX Amsterdam, The Netherlands
tel: +31 20 3012158, fax: +31 20 3012358
http://www.tibco.com