Thread: Full-Text-Indexing
Hello, I am also in need of FTI and looked iin the script from Maarten Boeklhold. I have several questions regarding this: 1) Why should I use "varchar(n)" instead of "text" though the space for a varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is it more efficient for queries? 2) I would like a table where I can insert/update/delete the stop-words instead of an array inside the code. This would also help in lower the words to search for in the query. Are there any grave performance penaltys for this solution? It is much more flexible and I would trade a small performance loss for this flexibility. [I cannot program in C so maybe there is a volunteer who could write this little snippet for me? :-)] Thanx for any helpful comments! Alex Schneider
Alexander Schneider <alex@alexander-schneider.de> writes: > 1) Why should I use "varchar(n)" instead of "text" though the space for a > varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is > it more efficient for queries? Actually, both text and varchar have 4-byte overhead. There's no difference at all in the representation, and none to speak of in efficiency. The reason you'd use varchar() is if you have an application-defined restriction that the contents of the field may not exceed some particular length. In that case you say varchar(N), which both makes the intent visible and allows Postgres to enforce the requirement for you. If you don't have any particular upper bound in mind for the string length, use text type. Just for completeness: char(N) also has a 4-byte overhead ... rather unnecessarily, but that's how it's done at the moment. But there is a marginal (very marginal) efficiency gain in access, since the system knows that the field is always the same length --- that makes it easier to determine the locations of following fields in the same tuple. I think you'd pretty much only use char(N) when you have an application semantics restriction that the string length must be exactly N characters. For example, in US postal addresses the state abbreviation would be char(2). regards, tom lane
> 1) Why should I use "varchar(n)" instead of "text" though the space for a > varchar-field is larger than text (varchar: 4byte+n, text: 2byte+n)? Is > it more efficient for queries? No reason, I think it the code would even work without changes if you use 'text' fields... > 2) I would like a table where I can insert/update/delete the stop-words > instead of an array inside the code. This would also help in lower the > words to search for in the query. Are there any grave performance > penaltys for this solution? It is much more flexible and I would trade a > small performance loss for this flexibility. [I cannot program in C so > maybe there is a volunteer who could write this little snippet for me? The problem here is that without resorting to some advanced stuff using shared memory or whatever that table would have to be read by every backend starting up, which might be some significant overhead. I'm still hoping somebody can rewrite this stuff to use some specific indexing method so we don't have to go the way of introducing this intermediate table (i.e. store the substring/oid pairs directly in a btree item and use some new index access method to take something like field ~= 'rolling NOT stone' to search this index). Maarten -- Maarten Boekhold, boekhold@tibco.com TIBCO Finance Technology Inc. The Atrium Strawinskylaan 3051 1077 ZX Amsterdam, The Netherlands tel: +31 20 3012158, fax: +31 20 3012358 http://www.tibco.com