Thread: NOTICE: word is too long INSERT 0 3014

NOTICE: word is too long INSERT 0 3014

From
"Joshua D. Drake"
Date:
Hello,

I am running into this limitation ALOT with Tsearch2. What are my
options to get around it. Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?

Sincerely,

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: NOTICE: word is too long INSERT 0 3014

From
"Joshua D. Drake"
Date:
Any thoughts on the below?

Joshua D. Drake wrote:
> Hello,
>
> I am running into this limitation ALOT with Tsearch2. What are my
> options to get around it. Do I have to compile PostgreSQL with a
> different block size?
>
> If yes, what are the downsides to doing so (outside of not being able to
> do straight upgrades)?
>
> Sincerely,
>
> Joshua D. Drake
>


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: NOTICE: word is too long INSERT 0 3014

From
Teodor Sigaev
Date:
> I am running into this limitation ALOT with Tsearch2. What are my
> options to get around it. Do I have to compile PostgreSQL with a
> different block size?
>
> If yes, what are the downsides to doing so (outside of not being able to
> do straight upgrades)?

If you really need that, your should play around WordEntry definition
(tsvector.h). Sorry, right now I haven't possibility to look closer,
just tomorrow.
Limit for word is equal 2KB - I supposed that is long enough to store
any meaningful words.


Re: NOTICE: word is too long INSERT 0 3014

From
Teodor Sigaev
Date:
For example, redefine by follow way:

typedef struct
{
     uint32
                 haspos:1,
                 len:31;
     uint32      pos;
}   WordEntry;

/* <= 1Gb */
#define MAXSTRLEN ( 1<<30 )
#define MAXSTRPOS ( 1<<30 )

Teodor Sigaev wrote:
>> I am running into this limitation ALOT with Tsearch2. What are my
>> options to get around it. Do I have to compile PostgreSQL with a
>> different block size?
>>
>> If yes, what are the downsides to doing so (outside of not being able to
>> do straight upgrades)?
>
> If you really need that, your should play around WordEntry definition
> (tsvector.h). Sorry, right now I haven't possibility to look closer,
> just tomorrow.
> Limit for word is equal 2KB - I supposed that is long enough to store
> any meaningful words.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org/

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Index greater than 8k

From
"Joshua D. Drake"
Date:
Hello,

I recently posted about a word being too long with Tsearch2. That isn't
actually the problem I am trying to solve (thanks for the feedback
though, now I understand it).

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Is recompiling the block size the option there?
What are the downsides, except for the custom build?

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: Index greater than 8k

From
Teodor Sigaev
Date:
> The problem I am after is the 8k index size issue. It is very easy to
> get a GIST index (especially when using tsearch2) that is larger than that.
Hmm, tsearch2 GIST index  is specially designed for support huge index entry:
first, every lexemes in tsvectore are transformed to hash value (with a help of
crc32), second, it's stripped all position infos, third, if size of array is
greater than TOAST_INDEX_TARGET then tsearch2 will make bit signature of
tsvector. Signature's length is fixed and equals to 252 bytes by default (+ 8
bytes for header of datum). All values on internal pages are represented as
signatures below.

So, tsearch2 guarantees that index entry will be small enough. If it's not true,
then there is a bug - pls, make test suite demonstrating the problem.

> Is recompiling the block size the option there?
> What are the downsides, except for the custom build?

Can you send exact error message?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
Teodor Sigaev wrote:
>> The problem I am after is the 8k index size issue. It is very easy to
>> get a GIST index (especially when using tsearch2) that is larger than
>> that.
> Hmm, tsearch2 GIST index  is specially designed for support huge index
> entry:
> first, every lexemes in tsvectore are transformed to hash value (with a
> help of crc32), second, it's stripped all position infos, third, if size
> of array is greater than TOAST_INDEX_TARGET then tsearch2 will make bit
> signature of tsvector. Signature's length is fixed and equals to 252
> bytes by default (+ 8 bytes for header of datum). All values on internal
> pages are represented as signatures below.
>
> So, tsearch2 guarantees that index entry will be small enough. If it's
> not true, then there is a bug - pls, make test suite demonstrating the
> problem.
>
>> Is recompiling the block size the option there?
>> What are the downsides, except for the custom build?
>
> Can you send exact error message?

I am training this week, but Darcy can do it. Can you give them a test
case on what we were working on with that customer?

Joshua D. Drake

>


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
Darcy Buskermolen
Date:
On October 31, 2006 06:42 am, Joshua D. Drake wrote:
> Teodor Sigaev wrote:
> >> The problem I am after is the 8k index size issue. It is very easy to
> >> get a GIST index (especially when using tsearch2) that is larger than
> >> that.


The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent a
self contained test case directly to  Teodor  which shows the error.

'ERROR:  index row requires 8792 bytes, maximum size is 8191'


> >
> > Hmm, tsearch2 GIST index  is specially designed for support huge index
> > entry:
> > first, every lexemes in tsvectore are transformed to hash value (with a
> > help of crc32), second, it's stripped all position infos, third, if size
> > of array is greater than TOAST_INDEX_TARGET then tsearch2 will make bit
> > signature of tsvector. Signature's length is fixed and equals to 252
> > bytes by default (+ 8 bytes for header of datum). All values on internal
> > pages are represented as signatures below.
> >
> > So, tsearch2 guarantees that index entry will be small enough. If it's
> > not true, then there is a bug - pls, make test suite demonstrating the
> > problem.
> >
> >> Is recompiling the block size the option there?
> >> What are the downsides, except for the custom build?
> >
> > Can you send exact error message?
>
> I am training this week, but Darcy can do it. Can you give them a test
> case on what we were working on with that customer?
>
> Joshua D. Drake

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

Re: [HACKERS] Index greater than 8k

From
Teodor Sigaev
Date:
> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent a
> self contained test case directly to  Teodor  which shows the error.
>
> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised why do you use pg_trgm on big text? pg_trgm
is designed to find similar words and use technique known as trigrams. This will
  work good on small pieces of text such as words or set expression. But all big
texts (on the same language) will be similar :(. So, I didn't take care about
guarantee that index tuple's size limitation. In principle, it's possible to
modify pg_trgm to have such guarantee, but index becomes lossy - all tuples
gotten  from index should be checked by table's tuple evaluation.

If you want to search similar documents I can recommend to have a look to
fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
close to trigrams and metrics of similarity is the same, but uses another
signature calculations. And, there are some tips and trics: removing HTML
marking,removing punctuation, lowercasing text and so on - it's interesting and
complex task.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: [HACKERS] Index greater than 8k

From
Darcy Buskermolen
Date:
On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > a self contained test case directly to  Teodor  which shows the error.
> >
> > 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
>
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or set
> expression. But all big texts (on the same language) will be similar :(.
> So, I didn't take care about guarantee that index tuple's size limitation.
> In principle, it's possible to modify pg_trgm to have such guarantee, but
> index becomes lossy - all tuples gotten  from index should be checked by
> table's tuple evaluation.

The problem is some of the data we are working with is not strictly "text" but
bytea that we've run through encode(bytea, 'escape'), and we've had to resort
to trigrams in an attempt to mimic LIKE for searches.  From our findings
tsearch2 does not match partial words,  in the same way that a LIKE would. ie
col LIKE 'go%' would match good, gopher.  pg_tgrm will return those with the
limit set appropriately, but tsearch2 does not.


>
> If you want to search similar documents I can recommend to have a look to
> fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
> close to trigrams and metrics of similarity is the same, but uses another
> signature calculations. And, there are some tips and trics: removing HTML
> marking,removing punctuation, lowercasing text and so on - it's interesting
> and complex task.

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

Re: [HACKERS] Index greater than 8k

From
Alvaro Herrera
Date:
Darcy Buskermolen wrote:
> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > > a self contained test case directly to  Teodor  which shows the error.
> > >
> > > 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
> >
> > Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> > pg_trgm is designed to find similar words and use technique known as
> > trigrams. This will work good on small pieces of text such as words or set
> > expression. But all big texts (on the same language) will be similar :(.
> > So, I didn't take care about guarantee that index tuple's size limitation.
> > In principle, it's possible to modify pg_trgm to have such guarantee, but
> > index becomes lossy - all tuples gotten  from index should be checked by
> > table's tuple evaluation.
>
> The problem is some of the data we are working with is not strictly "text" but
> bytea that we've run through encode(bytea, 'escape'),

I think one good question is why are you storing bytea and then
searching like it were text.  Why not store the text as text, and put
the extraneous bytes somewhere else?  Certainly you wouldn't expect to
be able to find text among the bytes, would you?

I remember suggesting you to store the Content-type next to each object,
and then creating partial trigram indexes where Content-type: text/*.
Did that plan not work for some reason?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
Teodor Sigaev wrote:
>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>> sent a self contained test case directly to  Teodor  which shows the
>> error.
>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will  work good on small pieces of text such as words or
> set expression. But all big texts (on the same language) will be similar
> :(. So, I didn't take care about guarantee that index tuple's size
> limitation. In principle, it's possible to modify pg_trgm to have such
> guarantee, but index becomes lossy - all tuples gotten  from index
> should be checked by table's tuple evaluation.

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake



>
> If you want to search similar documents I can recommend to have a look
> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
> pretty close to trigrams and metrics of similarity is the same, but uses
> another signature calculations. And, there are some tips and trics:
> removing HTML marking,removing punctuation, lowercasing text and so on -
> it's interesting and complex task.


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
Alvaro Herrera wrote:
> Darcy Buskermolen wrote:
>> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
>>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
>>>> a self contained test case directly to  Teodor  which shows the error.
>>>>
>>>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
>>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>>> pg_trgm is designed to find similar words and use technique known as
>>> trigrams. This will work good on small pieces of text such as words or set
>>> expression. But all big texts (on the same language) will be similar :(.
>>> So, I didn't take care about guarantee that index tuple's size limitation.
>>> In principle, it's possible to modify pg_trgm to have such guarantee, but
>>> index becomes lossy - all tuples gotten  from index should be checked by
>>> table's tuple evaluation.
>> The problem is some of the data we are working with is not strictly "text" but
>> bytea that we've run through encode(bytea, 'escape'),
>
> I think one good question is why are you storing bytea and then
> searching like it were text.

We are not storing bytea, a customer is. We are trying to work around
customer requirements. The data that is being stored is not always text,
sometimes it is binary (a flash file or jpeg). We are using escaped text
to be able to search the string contents of that file .

> Why not store the text as text, and put
> the extraneous bytes somewhere else?  Certainly you wouldn't expect to
> be able to find text among the bytes, would you?

Yes we do (and can) expect to find text among the bytes. We have
searches running, we are just running into the maximum size issues for
certain rows.

Sincerely,

Joshua D. Drake




--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
"Gregory S. Williamson"
Date:
I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not
scalewell for what you need etc., etc. 

But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go
withtesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't
followstemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit
ofa pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions
ofrows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this
dataset,partitioned into unequal grouings (about 90 total). 

HTH (but doubt it for reasons that undoubtedly be made clear ;-)

Greg Williamson
DBA
GlobeXplorer LLC


-----Original Message-----
From:    pgsql-general-owner@postgresql.org on behalf of Joshua D. Drake
Sent:    Tue 10/31/2006 7:46 PM
To:    Teodor Sigaev
Cc:    Darcy Buskermolen; PgSQL General; PostgreSQL-development
Subject:    Re: [HACKERS] [GENERAL] Index greater than 8k

Teodor Sigaev wrote:
>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>> sent a self contained test case directly to  Teodor  which shows the
>> error.
>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will  work good on small pieces of text such as words or
> set expression. But all big texts (on the same language) will be similar
> :(. So, I didn't take care about guarantee that index tuple's size
> limitation. In principle, it's possible to modify pg_trgm to have such
> guarantee, but index becomes lossy - all tuples gotten  from index
> should be checked by table's tuple evaluation.

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake



>
> If you want to search similar documents I can recommend to have a look
> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
> pretty close to trigrams and metrics of similarity is the same, but uses
> another signature calculations. And, there are some tips and trics:
> removing HTML marking,removing punctuation, lowercasing text and so on -
> it's interesting and complex task.


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq


-------------------------------------------------------
Click link below if it is SPAM gsw@globexplorer.com

"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=454815f5242304846743324&user=gsw@globexplorer.com&retrain=spam&template=history&history_page=1"
!DSPAM:454815f5242304846743324!
-------------------------------------------------------







Re: [HACKERS] Index greater than 8k

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> > Darcy Buskermolen wrote:
> >> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> >>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> >>>> a self contained test case directly to  Teodor  which shows the error.
> >>>>
> >>>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
> >>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> >>> pg_trgm is designed to find similar words and use technique known as
> >>> trigrams. This will work good on small pieces of text such as words or set
> >>> expression. But all big texts (on the same language) will be similar :(.
> >>> So, I didn't take care about guarantee that index tuple's size limitation.
> >>> In principle, it's possible to modify pg_trgm to have such guarantee, but
> >>> index becomes lossy - all tuples gotten  from index should be checked by
> >>> table's tuple evaluation.
> >> The problem is some of the data we are working with is not strictly "text" but
> >> bytea that we've run through encode(bytea, 'escape'),
> >
> > I think one good question is why are you storing bytea and then
> > searching like it were text.
>
> We are not storing bytea, a customer is. We are trying to work around
> customer requirements. The data that is being stored is not always text,
> sometimes it is binary (a flash file or jpeg). We are using escaped text
> to be able to search the string contents of that file .

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

I imagine strings(bytea) would be a function that returns the
concatenation of all pure (7 bit) ASCII strings in the byte sequence.

On the other hand, based on Teodor's comment on pg_trgm, maybe this
won't be possible at all.

> > Why not store the text as text, and put
> > the extraneous bytes somewhere else?  Certainly you wouldn't expect to
> > be able to find text among the bytes, would you?
>
> Yes we do (and can) expect to find text among the bytes. We have
> searches running, we are just running into the maximum size issues for
> certain rows.

Do you mean you actually find stuff based on text attributes in JPEG
images and the like?  I thought those were compressed ...

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
>> We are not storing bytea, a customer is. We are trying to work around
>> customer requirements. The data that is being stored is not always text,
>> sometimes it is binary (a flash file or jpeg). We are using escaped text
>> to be able to search the string contents of that file .
>
> Hmm, have you tried to create a functional trigram index on the
> equivalent of "strings(bytea_column)" or something like that?

I did consider that. I wonder what size we are going to deal with
though. Part of the problem is that some of the data we are dealing with
 is quite large.

>
> I imagine strings(bytea) would be a function that returns the
> concatenation of all pure (7 bit) ASCII strings in the byte sequence.
>
> On the other hand, based on Teodor's comment on pg_trgm, maybe this
> won't be possible at all.
>> Yes we do (and can) expect to find text among the bytes. We have
>> searches running, we are just running into the maximum size issues for
>> certain rows.
>
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like?  I thought those were compressed ...

Well a jpeg is probably a bad example, but yes they do search jpeg, I am
guessing mostly for header information. A better example would be
postscript files, flash files and of course large amounts of text + Html.

Sincerely,

Joshua D. Drake




--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like?  I thought those were compressed ...

Typically not --- the design assumption is that the text size wouldn't
amount to anything anyway compared to the image data, and it's better to
be able to pull it out with minimal processing.

I do suggest though that an image containing auxiliary data like text
comments is a multi-part structure, and that dumping it into a single
uninterpreted database field is spectacularly bad schema design.
You should pull the text out into a separate column once when you store
the data, instead of trying to fix things up when you search.

            regards, tom lane

Re: [HACKERS] Index greater than 8k

From
Teodor Sigaev
Date:
> We are trying to get something faster than ~ '%foo%';
> Which Tsearch2 does not give us :)

Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix search,
it's possible to use it.

Brain storm method:

Develop a dictionary which returns all substring for lexeme, for example for
word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob
bar ba ar'. And make GIN functional index over your column (to save disk space).
So, your query will be looked as
select ... where to_tsvector(text_column) @@ 'foo';
Notices:
Time of search in GIN weak depend on number of words (opposite to
tsearch2/GiST), but insertion of row may be slow enough....

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
Teodor Sigaev wrote:
>> We are trying to get something faster than ~ '%foo%';
>> Which Tsearch2 does not give us :)
>
> Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix
> search, it's possible to use it.

Well they run 8.1 :)

Joshua D. Drake

>
> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for example
> for word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo
> obar oba ob bar ba ar'. And make GIN functional index over your column
> (to save disk space).
> So, your query will be looked as
> select ... where to_tsvector(text_column) @@ 'foo';
> Notices:
> Time of search in GIN weak depend on number of words (opposite to
> tsearch2/GiST), but insertion of row may be slow enough....
>


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:
>
> >> We are not storing bytea [...]
[...]
> > Hmm, have you tried to create a functional trigram index on the
> > equivalent of "strings(bytea_column)" or something like that?

Hrm. Sorry for my impolite interuption, but... is there such a thing as
a "functional trigram index"? (this would be very cool).

Thanks
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFSG33Bcgs9XrR2kYRAnB7AJ4l6UPK/4vhtgr7Ux2/L7VtYq6d7ACeLBZP
IMPCEj5zqhYR7b2eYPgjRRE=
=6uiR
-----END PGP SIGNATURE-----


Re: [HACKERS] Index greater than 8k

From
"Gregory Maxwell"
Date:
On 11/1/06, Teodor Sigaev <teodor@sigaev.ru> wrote:
[snip]
> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for example for
> word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob
> bar ba ar'. And make GIN functional index over your column (to save disk space).
[snip]
> Time of search in GIN weak depend on number of words (opposite to
> tsearch2/GiST), but insertion of row may be slow enough....

With the right folding the number of possible trigrams for ascii text
is fairly small.. much smaller than the number of words in used in a
large corpus of text so the GIN performance for searches should be
pretty good.

Real magic would be to teach the regex operator to transparently make
use of such an index. ;)

Re: [HACKERS] Index greater than 8k

From
Alvaro Herrera
Date:
tomas@tuxteam.de wrote:
> On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:
> >
> > >> We are not storing bytea [...]
> [...]
> > > Hmm, have you tried to create a functional trigram index on the
> > > equivalent of "strings(bytea_column)" or something like that?
>
> Hrm. Sorry for my impolite interuption, but... is there such a thing as
> a "functional trigram index"? (this would be very cool).

Heh :-)  I meant an index, using the pg_trgm opclass (which indexes
trigrams; hence the "trigram" part), on a function that would extract
the text from a bytea column; instead of indexing the trigrams of the
bytea column directly.  Hence the "functional" part.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [HACKERS] Index greater than 8k

From
Oleg Bartunov
Date:
On Tue, 31 Oct 2006, Joshua D. Drake wrote:

>
> Yes we do (and can) expect to find text among the bytes. We have
> searches running, we are just running into the maximum size issues for
> certain rows.

you can use substr() to be safe, if schema change doesn't available


>
> Sincerely,
>
> Joshua D. Drake
>
>
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: [HACKERS] Index greater than 8k

From
tomas@tuxteam.de
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Nov 01, 2006 at 07:16:37PM -0300, Alvaro Herrera wrote:
> tomas@tuxteam.de wrote:
[...]
> > a "functional trigram index"? (this would be very cool).
>
> Heh :-)  I meant an index, using the pg_trgm opclass (which indexes
> trigrams; hence the "trigram" part), on a function that would extract
> the text from a bytea column [...]

[goes back to cave, tests...]

Wow, that works:

  CREATE INDEX i2 ON words USING gist(lower(word) gist_trgm_ops);

so I can interpose a (of course immutable) function before gist/trigram
does its thing. Why didn't I dare to assume that this will work?

Thanks for the hint.

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFScGvBcgs9XrR2kYRAl9tAJ9JvWvVo0nrexs409IIKPustuJkXwCbBW5n
W5/wwTogiSdg3rhTXq5pRio=
=t90X
-----END PGP SIGNATURE-----


Re: [HACKERS] Index greater than 8k

From
Oleg Bartunov
Date:
Gregory,

All you described you do with fti is possible with tsearch2. Just need
some think, of course. If you don't need stemming, just don't use it,
if you need something like %txt%, just write simple dictionary, which
produce any substrings from input word.

Oleg
On Tue, 31 Oct 2006, Gregory S. Williamson wrote:

> I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not
scalewell for what you need etc., etc. 
>
> But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't
gowith tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't
followstemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit
ofa pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions
ofrows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this
dataset,partitioned into unequal grouings (about 90 total). 
>
> HTH (but doubt it for reasons that undoubtedly be made clear ;-)
>
> Greg Williamson
> DBA
> GlobeXplorer LLC
>
>
> -----Original Message-----
> From:    pgsql-general-owner@postgresql.org on behalf of Joshua D. Drake
> Sent:    Tue 10/31/2006 7:46 PM
> To:    Teodor Sigaev
> Cc:    Darcy Buskermolen; PgSQL General; PostgreSQL-development
> Subject:    Re: [HACKERS] [GENERAL] Index greater than 8k
>
> Teodor Sigaev wrote:
>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>>> sent a self contained test case directly to  Teodor  which shows the
>>> error.
>>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>> pg_trgm is designed to find similar words and use technique known as
>> trigrams. This will  work good on small pieces of text such as words or
>> set expression. But all big texts (on the same language) will be similar
>> :(. So, I didn't take care about guarantee that index tuple's size
>> limitation. In principle, it's possible to modify pg_trgm to have such
>> guarantee, but index becomes lossy - all tuples gotten  from index
>> should be checked by table's tuple evaluation.
>
> We are trying to get something faster than ~ '%foo%';
>
> Which Tsearch2 does not give us :)
>
> Joshua D. Drake
>
>
>
>>
>> If you want to search similar documents I can recommend to have a look
>> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
>> pretty close to trigrams and metrics of similarity is the same, but uses
>> another signature calculations. And, there are some tips and trics:
>> removing HTML marking,removing punctuation, lowercasing text and so on -
>> it's interesting and complex task.
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: [HACKERS] Index greater than 8k

From
"Joshua D. Drake"
Date:
Oleg Bartunov wrote:
> Gregory,
>
> All you described you do with fti is possible with tsearch2. Just need
> some think, of course. If you don't need stemming, just don't use it,
> if you need something like %txt%, just write simple dictionary, which
> produce any substrings from input word.

Is there any information on writing these dictionaries?

Sincerely,

Joshua D. Drake




--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


Re: [HACKERS] Index greater than 8k

From
Oleg Bartunov
Date:
On Fri, 3 Nov 2006, Joshua D. Drake wrote:

> Oleg Bartunov wrote:
>> Gregory,
>>
>> All you described you do with fti is possible with tsearch2. Just need
>> some think, of course. If you don't need stemming, just don't use it,
>> if you need something like %txt%, just write simple dictionary, which
>> produce any substrings from input word.
>
> Is there any information on writing these dictionaries?

Туториал
http://www.sai.msu.su/~megera/wiki/Gendict

Some examples are on http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

For example,
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/custom-dict.html


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83