Thread: Re: [GENERAL] NOTICE: word is too long INSERT 0 3014
Any thoughts on the below?

Joshua D. Drake wrote:
> Hello,
>
> I am running into this limitation a lot with Tsearch2. What are my
> options to get around it? Do I have to compile PostgreSQL with a
> different block size?
>
> If yes, what are the downsides to doing so (apart from not being able
> to do straight upgrades)?
>
> Sincerely,
>
> Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Hello,

I recently posted about a word being too long with Tsearch2. That isn't
actually the problem I am trying to solve (thanks for the feedback though;
now I understand it).

The problem I am after is the 8k index size issue. It is very easy to get a
GiST index entry (especially when using tsearch2) that is larger than that.

Is recompiling with a larger block size the option there? What are the
downsides, apart from the custom build?

Joshua D. Drake
> The problem I am after is the 8k index size issue. It is very easy to
> get a GiST index entry (especially when using tsearch2) that is larger
> than that.

Hmm, the tsearch2 GiST index is specifically designed to support huge index
entries: first, every lexeme in the tsvector is transformed into a hash
value (with the help of crc32); second, all position information is
stripped; third, if the size of the array is still greater than
TOAST_INDEX_TARGET, tsearch2 makes a bit signature of the tsvector. The
signature's length is fixed and equals 252 bytes by default (+ 8 bytes for
the datum header). All values on internal pages are represented as such
signatures as well.

So, tsearch2 guarantees that an index entry will be small enough. If that's
not true, then there is a bug - please make a test suite demonstrating the
problem.

> Is recompiling with a larger block size the option there?
> What are the downsides, apart from the custom build?

Can you send the exact error message?

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
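A toy SQL illustration of the bit-signature step (conceptual only: the real
code is C inside tsearch2 and uses crc32; hashtext() is an undocumented
built-in hash borrowed just for the demonstration, and the signature is
shrunk to 32 bits instead of 252 bytes):

    -- OR together one bit per lexeme: a fixed-size, lossy summary of
    -- arbitrarily many lexemes
    SELECT bit_or(1::bit(32) << (abs(hashtext(w)) % 32)) AS signature
    FROM (SELECT 'foo' AS w UNION ALL SELECT 'bar' UNION ALL SELECT 'baz') t;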
Teodor Sigaev wrote:
> So, tsearch2 guarantees that an index entry will be small enough. If
> that's not true, then there is a bug - please make a test suite
> demonstrating the problem.
[...]
> Can you send the exact error message?

I am in training this week, but Darcy can do it. Darcy, can you give them
a test case based on what we were working on with that customer?

Joshua D. Drake
On October 31, 2006 06:42 am, Joshua D. Drake wrote:
> Teodor Sigaev wrote:
> >> The problem I am after is the 8k index size issue. It is very easy
> >> to get a GiST index entry (especially when using tsearch2) that is
> >> larger than that.

The problem as I remember it is pg_trgm, not tsearch2 directly. I've sent a
self-contained test case directly to Teodor which shows the error:

'ERROR:  index row requires 8792 bytes, maximum size is 8191'

[...]
> I am in training this week, but Darcy can do it. Darcy, can you give
> them a test case based on what we were working on with that customer?

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/
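A minimal sketch of the kind of self-contained test case that produces this
error (the real one went off-list, so the table name and data here are
invented; exact byte counts vary with the input and the pg_trgm version):

    CREATE TABLE docs (body text);
    CREATE INDEX docs_body_trgm ON docs USING gist (body gist_trgm_ops);

    -- a single large value with many distinct trigrams can overflow a
    -- GiST index row, since pg_trgm leaf keys are not size-limited
    INSERT INTO docs
        SELECT array_to_string(ARRAY(
            SELECT md5(g::text) FROM generate_series(1, 3000) AS g), ' ');
    -- ERROR:  index row requires ... bytes, maximum size is 8191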
> The problem as I remember it is pg_trgm, not tsearch2 directly. I've
> sent a self-contained test case directly to Teodor which shows the
> error:
>
> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'

Uh, I see. But I'm really surprised: why do you use pg_trgm on big text?
pg_trgm is designed to find similar words and uses a technique known as
trigrams. It works well on small pieces of text such as words or set
expressions, but all big texts (in the same language) will be similar :(.
So, I didn't take care to guarantee the index tuple size limit. In
principle, it's possible to modify pg_trgm to give such a guarantee, but
the index becomes lossy: every tuple fetched from the index has to be
rechecked against the table's tuple.

If you want to search for similar documents, I can recommend having a look
at the fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
pretty close to trigrams and the similarity metric is the same, but it uses
different signature calculations. And there are some tips and tricks:
removing HTML markup, removing punctuation, lowercasing text and so on -
it's an interesting and complex task.

--
Teodor Sigaev
On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> Uh, I see. But I'm really surprised: why do you use pg_trgm on big
> text? pg_trgm is designed to find similar words and uses a technique
> known as trigrams. It works well on small pieces of text such as words
> or set expressions, but all big texts (in the same language) will be
> similar :(.
[...]

The problem is that some of the data we are working with is not strictly
"text" but bytea that we've run through encode(bytea, 'escape'), and we've
had to resort to trigrams in an attempt to mimic LIKE for searches. From
our findings, tsearch2 does not match partial words the way LIKE does.
I.e., col LIKE 'go%' would match 'good' and 'gopher'; pg_trgm will return
those with the similarity limit set appropriately, but tsearch2 does not.

--
Darcy Buskermolen
Command Prompt, Inc.
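For example (a sketch; the table, column, and threshold are invented, and
on 8.1 LIKE itself gets no help from the trigram index):

    SELECT set_limit(0.4);                  -- pg_trgm similarity threshold
    SELECT col FROM t WHERE col % 'go';     -- returns 'good', 'gopher', ...
    SELECT col FROM t WHERE col LIKE 'go%'; -- same intent, but cannot use
                                            -- the trigram index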
Darcy Buskermolen wrote:
> The problem is that some of the data we are working with is not
> strictly "text" but bytea that we've run through encode(bytea,
> 'escape'),

I think one good question is why you are storing bytea and then searching
it as if it were text. Why not store the text as text, and put the
extraneous bytes somewhere else? Certainly you wouldn't expect to be able
to find text among the bytes, would you?

I remember suggesting that you store the Content-Type next to each object,
and then create partial trigram indexes where Content-Type is text/*. Did
that plan not work for some reason?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
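That suggestion would look roughly like this (table and column names are
assumptions):

    CREATE INDEX objects_body_trgm
        ON objects USING gist (encode(body, 'escape') gist_trgm_ops)
        WHERE content_type LIKE 'text/%';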
Teodor Sigaev wrote:
> Uh, I see. But I'm really surprised: why do you use pg_trgm on big
> text?
[...]

We are trying to get something faster than ~ '%foo%', which Tsearch2 does
not give us :)

Joshua D. Drake
Alvaro Herrera wrote:
> Darcy Buskermolen wrote:
>> The problem is that some of the data we are working with is not
>> strictly "text" but bytea that we've run through encode(bytea,
>> 'escape'),
>
> I think one good question is why you are storing bytea and then
> searching it as if it were text.

We are not storing bytea; a customer is. We are trying to work around
customer requirements. The data that is being stored is not always text -
sometimes it is binary (a Flash file or a JPEG). We are using escaped text
to be able to search the string contents of those files.

> Why not store the text as text, and put the extraneous bytes somewhere
> else? Certainly you wouldn't expect to be able to find text among the
> bytes, would you?

Yes, we do (and can) expect to find text among the bytes. We have searches
running; we are just running into the maximum size issue for certain rows.

Sincerely,

Joshua D. Drake
Joshua D. Drake wrote:
> We are not storing bytea; a customer is. We are trying to work around
> customer requirements. The data that is being stored is not always
> text - sometimes it is binary (a Flash file or a JPEG). We are using
> escaped text to be able to search the string contents of those files.

Hmm, have you tried to create a functional trigram index on the equivalent
of "strings(bytea_column)" or something like that?

I imagine strings(bytea) would be a function that returns the concatenation
of all pure (7-bit) ASCII strings in the byte sequence.

On the other hand, based on Teodor's comment on pg_trgm, maybe this won't
be possible at all.

> Yes, we do (and can) expect to find text among the bytes. We have
> searches running; we are just running into the maximum size issue for
> certain rows.

Do you mean you actually find stuff based on text attributes in JPEG images
and the like? I thought those were compressed ...

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
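A rough sketch of such a strings(bytea) function (the name, the regex, and
the cleanup rules are all assumptions; no such function ships with
PostgreSQL):

    CREATE FUNCTION strings(bytea) RETURNS text AS $func$
        -- encode(..., 'escape') renders non-printable bytes as \ooo
        -- octal escapes; replace those, and anything else outside
        -- printable ASCII, with spaces
        SELECT regexp_replace(encode($1, 'escape'),
                              $re$(\\[0-7]{3})|[^ -~]$re$, ' ', 'g')
    $func$ LANGUAGE sql IMMUTABLE STRICT;

    CREATE INDEX objects_strings_trgm
        ON objects USING gist (strings(body) gist_trgm_ops);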
>> We are not storing bytea; a customer is. [...]
>
> Hmm, have you tried to create a functional trigram index on the
> equivalent of "strings(bytea_column)" or something like that?

I did consider that. I wonder what size we would be dealing with, though.
Part of the problem is that some of the data we are dealing with is quite
large.

>> Yes, we do (and can) expect to find text among the bytes. [...]
>
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like? I thought those were compressed ...

Well, a JPEG is probably a bad example, but yes, they do search JPEGs - I
am guessing mostly for header information. A better example would be
PostScript files, Flash files, and of course large amounts of text plus
HTML.

Sincerely,

Joshua D. Drake
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like? I thought those were compressed ...

Typically not --- the design assumption is that the text size wouldn't
amount to anything anyway compared to the image data, and it's better to
be able to pull it out with minimal processing.

I do suggest though that an image containing auxiliary data like text
comments is a multi-part structure, and that dumping it into a single
uninterpreted database field is spectacularly bad schema design. You
should pull the text out into a separate column once, when you store the
data, instead of trying to fix things up when you search.

			regards, tom lane
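One way to follow that advice (a sketch; the table name and the strings()
helper floated earlier in the thread are assumptions):

    ALTER TABLE objects ADD COLUMN body_text text;

    CREATE FUNCTION objects_extract_text() RETURNS trigger AS $$
    BEGIN
        -- extract the searchable text once, at write time
        NEW.body_text := strings(NEW.body);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER objects_extract_text
        BEFORE INSERT OR UPDATE ON objects
        FOR EACH ROW EXECUTE PROCEDURE objects_extract_text();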
> We are trying to get something faster than ~ '%foo%', which Tsearch2
> does not give us :)

Hmm, why not with GIN? :) Although right now GIN doesn't support prefix
search, it's possible to use it.

Brainstorm method: develop a dictionary which returns all substrings of a
lexeme; for example, for the word foobar it would return 'foobar fooba foob
foo fo oobar ooba oob oo obar oba ob bar ba ar'. Then make a functional GIN
index over your column (to save disk space), so your query would look like:

    select ... where to_tsvector(text_column) @@ 'foo';

Note: search time in GIN depends only weakly on the number of words
(opposite to tsearch2/GiST), but insertion of rows may be slow enough ...

--
Teodor Sigaev
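The index side of that idea might look like this on 8.2, where GIN is
available (the 'substr_dict' configuration is hypothetical - the substring
dictionary itself would still have to be written):

    CREATE INDEX docs_body_gin
        ON docs USING gin (to_tsvector('substr_dict', body));

    SELECT * FROM docs
    WHERE to_tsvector('substr_dict', body)
          @@ to_tsquery('substr_dict', 'foo');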
Teodor Sigaev wrote:
>> We are trying to get something faster than ~ '%foo%', which Tsearch2
>> does not give us :)
>
> Hmm, why not with GIN? :) Although right now GIN doesn't support prefix
> search, it's possible to use it.

Well, they run 8.1 :) (GIN is new in 8.2.)

Joshua D. Drake
On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:
> >> We are not storing bytea [...]
[...]
> > Hmm, have you tried to create a functional trigram index on the
> > equivalent of "strings(bytea_column)" or something like that?

Hrm. Sorry for my impolite interruption, but ... is there such a thing as
a "functional trigram index"? (This would be very cool.)

Thanks
-- tomás
On 11/1/06, Teodor Sigaev <teodor@sigaev.ru> wrote:
[snip]
> Brainstorm method: develop a dictionary which returns all substrings of
> a lexeme; for example, for the word foobar it would return 'foobar
> fooba foob foo fo oobar ooba oob oo obar oba ob bar ba ar'. Then make a
> functional GIN index over your column (to save disk space).
[snip]
> Note: search time in GIN depends only weakly on the number of words
> (opposite to tsearch2/GiST), but insertion of rows may be slow enough ...

With the right folding, the number of possible trigrams for ASCII text is
fairly small - much smaller than the number of distinct words used in a
large corpus of text - so the GIN performance for searches should be
pretty good.

Real magic would be to teach the regex operator to transparently make use
of such an index. ;)
tomas@tuxteam.de wrote:
> Hrm. Sorry for my impolite interruption, but ... is there such a thing
> as a "functional trigram index"? (This would be very cool.)

Heh :-) I meant an index using the pg_trgm opclass (which indexes
trigrams; hence the "trigram" part) on a function that would extract the
text from a bytea column, instead of indexing the trigrams of the bytea
column directly. Hence the "functional" part.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
On Tue, 31 Oct 2006, Joshua D. Drake wrote:
> Yes, we do (and can) expect to find text among the bytes. We have
> searches running; we are just running into the maximum size issue for
> certain rows.

You can use substr() to be safe, if a schema change isn't available.

Regards,
	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
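That is, index only a bounded prefix so no single index entry can outgrow
the page limit (a sketch; the 2000-character cutoff and the names are
arbitrary assumptions):

    CREATE INDEX docs_body_prefix_trgm
        ON docs USING gist (substr(body, 1, 2000) gist_trgm_ops);

    -- queries must apply the same expression for the index to be used
    SELECT * FROM docs WHERE substr(body, 1, 2000) % 'foo';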
On Wed, Nov 01, 2006 at 07:16:37PM -0300, Alvaro Herrera wrote:
> Heh :-) I meant an index using the pg_trgm opclass (which indexes
> trigrams; hence the "trigram" part) on a function that would extract
> the text from a bytea column [...]

[goes back to cave, tests ...]

Wow, that works:

    CREATE INDEX i2 ON words USING gist (lower(word) gist_trgm_ops);

so I can interpose an (of course immutable) function before gist/trigram
does its thing. Why didn't I dare to assume that this would work?

Thanks for the hint.

-- tomás
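Queries that apply the same expression can then use that index; e.g.
(assuming the words table from the test above):

    SELECT word FROM words WHERE lower(word) % 'gopher';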