Thread: BUG #12126: Empty tsvector as output of to_tsvector function

BUG #12126: Empty tsvector as output of to_tsvector function

From
marcin.ogorzalek@jmkcomputerate.pl
Date:
The following bug has been logged on the website:

Bug reference:      12126
Logged by:          Marcin Ogorzałek
Email address:      marcin.ogorzalek@jmkcomputerate.pl
PostgreSQL version: 9.2.9
Operating system:   Windows 7 Home Premium
Description:

Hello,

I was trying to use the PostgreSQL full text search feature, but every text I
used gave me an empty tsvector as the returned value, making the feature
unusable.

I even tested part of your own documentation, as shown below, and it gave me
an empty result as well. I'm pretty sure that the returned tsvector was not
over 1MB, and there were no warnings or errors. Postgres states that it
successfully returned 1 row.

Here is my statement:
select to_tsvector('
12.2. Tables and Indexes

The examples in the previous section illustrated full text matching using
simple constant strings. This section shows how to search table data,
optionally using indexes.
12.2.1. Searching a Table

It is possible to do a full text search without an index. A simple query to
print the title of each row that contains the word friend in its body field
is:

SELECT title
FROM pgweb
WHERE to_tsvector(''english'', body) @@ to_tsquery(''english'',
''friend'');

This will also find related words such as friends and friendly, since all
these are reduced to the same normalized lexeme.

The query above specifies that the english configuration is to be used to
parse and normalize the strings. Alternatively we could omit the
configuration parameters:

SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery(''friend'');

This query will use the configuration set by default_text_search_config.

A more complex example is to select the ten most recent documents that
contain create and table in the title or body:

SELECT title
FROM pgweb
WHERE to_tsvector(title || '' '' || body) @@ to_tsquery(''create & table'')
ORDER BY last_mod_date DESC
LIMIT 10;

For clarity we omitted the coalesce function calls which would be needed to
find rows that contain NULL in one of the two fields.

Although these queries will work without an index, most applications will
find this approach too slow, except perhaps for occasional ad-hoc searches.
Practical use of text searching usually requires creating an index.
12.2.2. Creating Indexes

We can create a GIN index (Section 12.9) to speed up text searches:

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(''english'', body));

Notice that the 2-argument version of to_tsvector is used. Only text search
functions that specify a configuration name can be used in expression
indexes (Section 11.7). This is because the index contents must be
unaffected by default_text_search_config. If they were affected, the index
contents might be inconsistent because different entries could contain
tsvectors that were created with different text search configurations, and
there would be no way to guess which was which. It would be impossible to
dump and restore such an index correctly.

Because the two-argument version of to_tsvector was used in the index above,
only a query reference that uses the 2-argument version of to_tsvector with
the same configuration name will use that index. That is, WHERE
to_tsvector(''english'', body) @@ ''a & b'' can use the index, but WHERE
to_tsvector(body) @@ ''a & b'' cannot. This ensures that an index will be
used only with the same configuration used to create the index entries.

It is possible to set up more complex expression indexes wherein the
configuration name is specified by another column, e.g.:

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));

where config_name is a column in the pgweb table. This allows mixed
configurations in the same index while recording which configuration was
used for each index entry. This would be useful, for example, if the
document collection contained documents in different languages. Again,
queries that are meant to use the index must be phrased to match, e.g.,
WHERE to_tsvector(config_name, body) @@ ''a & b''.

Indexes can even concatenate columns:

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(''english'', title ||
'' '' || body));

Another approach is to create a separate tsvector column to hold the output
of to_tsvector. This example is a concatenation of title and body, using
coalesce to ensure that one field will still be indexed when the other is
NULL:

ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
     to_tsvector(''english'', coalesce(title,'''') || '' '' ||
coalesce(body,''''));

Then we create a GIN index to speed up the search:

CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);

Now we are ready to perform a fast full text search:

SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery(''create & table'')
ORDER BY last_mod_date DESC
LIMIT 10;

When using a separate column to store the tsvector representation, it is
necessary to create a trigger to keep the tsvector column current anytime
title or body changes. Section 12.4.3 explains how to do that.

')
---------------------------
What is intriguing is that when you delete the last sentence it works great,
which means that some limit is reached. Unfortunately, I can't identify it.

Apart from the version mentioned above, I tested it on:
 9.2.8 on Windows 8,
 8.4.13 on Debian 6.0

Re: BUG #12126: Empty tsvector as output of to_tsvector function

From
Tom Lane
Date:
marcin.ogorzalek@jmkcomputerate.pl writes:
> I was trying to use postgresql full text search feature but all texts used
> were giving me empty tsvectors as a returned value therefore being
> unusable.

> I even tested part of your own documentation as stated below and it gave me
> empty result as well. I'm pretty sure that returned ts_vector was not over
> 1MB and there were no warnings or errors. Postgres states that it
> successfully returned 1 row.

The example works for me; I get

 '10':190,675 '11.7':304 '12.2':1 '12.2.1':30 '12.2.2':249 '12.4.3':705 '12.9':259 '2':280,388 'ad':236 'ad-hoc':235
'add':604 'affect':322 'allow':487 'also':81 'alter':601 'altern':118 'although':215 'anoth':460,560 'anytim':699
'applic':224 'approach':228,561 'argument':281,370,389 'b':411,422,541
'bodi':63,74,132,169,178,276,409,420,475,539,559,584,622,702 'call':198 'cannot':423 'chang':703 'clariti':192
'coalesc':196,586,619,621 'col':608,615,644,663 'collect':513
'column':461,481,546,568,605,680,697 'complex':150,450 'concaten':545,580 'config':147,318,473,477,537
'configur':107,123,141,295,342,397,436,455,489,497 'constant':17 'contain':57,161,207,333,514 'content':310,325
'correct':365 'could':120,332
'creat':162,181,246,250,254,265,337,439,463,547,564,625,634,666,690 'current':698 'data':26 'date':187,672
'default':144,315 'desc':188,673 'differ':330,339,517 'document':159,512,515 'dump':359 'e.g':462,533
'english':73,77,106,275,408,557,618 'ensur':425,588 'entri':331,442,503 'even':544 'exampl':6,151,509,577
'except':231 'explain':706 'express':301,451 'fast':652 'field':64,214,591 'find':82,204,226
'friend':60,78,87,89,135 'full':12,40,653 'function':197,291
'gin':256,272,470,554,627,641 'guess':350 'hoc':237 'hold':570 'idx':268,466,550,637 'illustr':11 'imposs':357
'inconsist':328
'index':4,29,45,222,248,251,257,266,302,309,324,364,379,402,415,428,441,452,464,493,502,527,542,548,595,607,614,628,635,643,662
'keep':694 'languag':518 'last':185,670 'lexem':99 'limit':189,674 'match':14,532
'meant':523 'might':326 'mix':488 'mod':186,671 'must':311,528 'name':296,398,456,474,478,538 'necessari':688
'need':202 'normal':98,115 'notic':277 'null':208,600 'occasion':234 'omit':121,194 'one':210,590 'option':27
'order':183,668 'output':572 'paramet':124 'pars':113 'perform':650 'perhap':232
'pgweb':69,128,173,267,270,465,468,484,549,552,603,611,639,659
'phrase':530 'possibl':36,445 'practic':239 'previous':9 'print':50 'queri':48,101,137,217,383,520 'readi':648
'recent':158 'record':495 'reduc':94 'refer':384 'relat':83 'represent':685 'requir':245 'restor':361 'row':55,205
'search':24,31,42,146,238,243,264,290,317,341,633,655 'section':10,20,258,303,704 'select':66,125,154,170,656
'separ':566,679 'set':142,447,612 'show':21 'simpl':16,47 'sinc':90 'slow':230 'specifi':103,293,458
'speed':261,630 'still':593 'store':682 'string':18,117 'tabl':2,25,33,164,182,485,602,667 'ten':156
'text':13,41,145,242,263,289,316,340,654 'textsearch':606,613,636,642,661
'titl':52,67,126,167,171,177,558,582,620,657,700 'trigger':692 'tsqueri':76,134,180,665
'tsvector':72,131,176,274,285,334,374,393,407,419,472,536,556,567,575,609,617,684,696 'two':213,369 'two-argu':368
'unaffect':313 'updat':610
'use':15,28,111,139,240,271,287,299,376,386,400,413,431,437,469,499,507,525,553,585,640,677
'usual':244 'version':282,371,390 'way':348 'wherein':453 'without':43,220 'word':59,84 'work':219
'would':200,345,355,505

What have you got default_text_search_config set to?  Have you changed
any of the files within the installation directory's share/tsearch_data/
subdirectory?

            regards, tom lane
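
The first of the two questions above can be checked from any SQL session. The following is only a minimal sketch and not part of the original message: the sample sentence is an arbitrary assumption, and everything else (SHOW, to_tsvector, pg_ts_config, pg_namespace) is standard PostgreSQL.

```sql
-- Which configuration does the one-argument to_tsvector() pick up?
SHOW default_text_search_config;

-- Compare two explicit configurations on a short, arbitrary sample sentence.
-- On a healthy installation both calls should return a nonempty tsvector.
SELECT to_tsvector('pg_catalog.simple',  'The friends are friendly');
SELECT to_tsvector('pg_catalog.english', 'The friends are friendly');

-- List the text search configurations known to this database.
SELECT n.nspname, c.cfgname
FROM pg_ts_config c
JOIN pg_namespace n ON n.oid = c.cfgnamespace
ORDER BY n.nspname, c.cfgname;
```

If the explicit 'simple' call returns an empty result while 'english' does not, the problem is probably in that configuration rather than in the input text.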

Re: BUG #12126: Empty tsvector as output of to_tsvector function

From
Marcin Ogorzałek
Date:
On 02.12.2014 at 18:18, Tom Lane wrote:
> What have you got default_text_search_config set to?  Have you changed
> any of the files within the installation directory's
> share/tsearch_data/
> subdirectory?
>
>             regards, tom lane


Thank you for the quick response.

After your message I checked several things, and the problem was that
default_text_search_config was set to pg_catalog.simple, which was unable to
recognize the text.
Only setting the regconfig parameter explicitly to 'english' gave
results.

Regards Marcin Ogorzałek
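
The workaround described above can be sketched as follows. This is an illustration under assumptions, not the exact statements that were run: the sample sentence is taken from the documentation excerpt in the report, and the session-level SET is just one of several places the default can be changed.

```sql
-- Pass the configuration explicitly (the two-argument form of to_tsvector).
SELECT to_tsvector('english'::regconfig,
                   'It is possible to do a full text search without an index.');

-- Alternatively, change the default for the current session only,
-- so that the one-argument form uses the 'english' configuration.
SET default_text_search_config = 'pg_catalog.english';
SELECT to_tsvector('It is possible to do a full text search without an index.');
```

The explicit form is also the only one that can be used in expression indexes, as the quoted documentation explains.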



Re: BUG #12126: Empty tsvector as output of to_tsvector function

From
Tom Lane
Date:
Marcin Ogorzałek <marcin.ogorzalek@jmkcomputerate.pl> writes:
> On 02.12.2014 at 18:18, Tom Lane wrote:
>> What have you got default_text_search_config set to?  Have you changed
>> any of the files within the installation directory's
>> share/tsearch_data/
>> subdirectory?

> After this message I've checked several things and problem was that
> default_text_search_config set to pg_catalog.simple was unable to
> recognize the text.
> Only setting the regconfig parameter explicitly to 'english' gave the
> results.

I had tried both 'english' and 'simple' on your example, and they gave
slightly different but certainly nonempty results.  I continue to suspect
there's something corrupt about either your text search configuration
files, or the description of the 'simple' configuration in the catalogs.

            regards, tom lane
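
One way to examine the 'simple' configuration as it is recorded in the catalogs is sketched below. This is a hedged illustration, not something posted in the thread: the sample string is an arbitrary assumption, while ts_debug and the pg_ts_config, pg_ts_config_map and pg_ts_dict catalogs are standard PostgreSQL.

```sql
-- Show how the 'simple' configuration handles each token of a sample string:
-- which token type the parser assigned, which dictionaries were consulted,
-- and which lexemes (if any) came out.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('pg_catalog.simple', 'full text search');

-- Show the token-type-to-dictionary mappings stored for 'simple',
-- together with each dictionary's initialization options.
SELECT m.maptokentype, m.mapseqno, d.dictname, d.dictinitoption
FROM pg_ts_config c
JOIN pg_ts_config_map m ON m.mapcfg = c.oid
JOIN pg_ts_dict d ON d.oid = m.mapdict
WHERE c.cfgname = 'simple'
ORDER BY m.maptokentype, m.mapseqno;
```

An empty lexemes column for ordinary words in the first query, or missing mappings or unexpected dictionary options in the second, would be consistent with the suspected corruption.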