Speaking of Indexing... (Text indexing) - Mailing list pgsql-hackers

From Poet/Joshua Drake
Subject Speaking of Indexing... (Text indexing)
Date
Msg-id Pine.LNX.4.30.0104101541140.10612-100000@crazypenguins.commandprompt.com
Whole thread Raw
In response to Re: Indexes not used in 7.1RC4: Bug?  (Thomas Lockhart <lockhart@alumni.caltech.edu>)
List pgsql-hackers
Good day,

I've been experimenting a bit with Full Text Indexing in PostgreSQL. I
have found several conflicting sites various places on the net pertaining
to whether or not PostgreSQL supports FTI, and I was hoping I could find
an authoritative answer here - I tried searching the website's archives,
but the search seems to be having some problems.

At any rate, I am running a CVS snapshot of 7.1, and I have been trying to
create a full text index on a series of resumes. Some of these exceed 8k
in size, which is no longer a storage problem of course with 7.1, but I
seem to have run into the wicked 8k once again. Specifically:

ERROR:  index_formtuple: data takes 9344 bytes, max is 8191

Furthermore, after trying to just index on a 8191-character long substring
of the resume, I run into the following:

ERROR:  btree: index item size 3948 exceeds maximum 2713

The only way I could actually get the index created was to substring the
body of the resumes down to 2k. I also later tried using HASH rather than
BTREE, which worked, but none of these solutions really appreciably
increased performance in the way we were hoping.

Are these known and accepted limitations of the current 7.1
implementation, or am I doing something terribly wrong? ;)
On Tue, 10 Apr 2001, Thomas Lockhart wrote:

>> I have the following table, containing about 570000 Rows, but some
>> indexes are not used, on 7.1RC4, freshly vacuumed (analyse). It was the
>> same at least in 7.1RC1
>>       CREATE TABLE access_log(
>>          access_time timestamp   NOT NULL DEFAULT NOW(),
>>          method_num  int2        NOT NULL,
>>          url_id      int4        NOT NULL REFERENCES urls(id),
>>          );
>>       CREATE INDEX method_idx       ON access_log(method_num);
>>       CREATE INDEX url_idx          ON access_log(url_id);
>> url_idx seems OK:
>> But the others not:
>>   logger=# EXPLAIN SELECT * FROM access_log WHERE method_num = 0;
>>   Seq Scan on access_log  (cost=0.00..16443.71 rows=559371 width=89)
>
>The parser does not know that your int4 constant "0" can be represented
>as an int2. Try
>
>  SELECT * FROM access_log WHERE method_num = int2 '0';
>
>(note the type coersion on the constant; there are other ways of
>specifying the same thing).
>
>For the other cases, PostgreSQL is estimating the query cost to be lower
>with a sequential scan. For the "SELECT 1" subselect case, it may be
>that the optimizer does not cheat and determine that there will be only
>one row returned, or that the query can be reformulated to use a simple
>constant.
>
>HTH
>
>                     - Thomas
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>

--
--
<COMPANY>CommandPrompt    - http://www.commandprompt.com    </COMPANY>
<PROJECT>OpenDocs, LLC.    - http://www.opendocs.org    </PROJECT>
<PROJECT>LinuxPorts     - http://www.linuxports.com     </PROJECT>
<WEBMASTER>LDP        - http://www.linuxdoc.org    </WEBMASTER>
--
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
--


pgsql-hackers by date:

Previous
From: Thomas Lockhart
Date:
Subject: Re: Speaking of Indexing... (Text indexing)
Next
From: Thomas Lockhart
Date:
Subject: Re: Large Object problems (was Re: JDBC int8 hack)