Thread: Tsearch vector not stored by update/set

Tsearch vector not stored by update/set

From

"Justin L. Kennedy"

Date:

21 March 2005, 03:37:52

The short question is why does this:

select to_tsvector('default', coalesce(name, '') ||' '||
coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items;

give different results than this:

update link_items set linksfti=to_tsvector('default', coalesce(name, '')
||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));
select linksfti from link_items;


Here are more details:

I am working with Tsearch2 on a server with version string:
PostgreSQL 7.3.2 on i686-pc-linux-gnu, compiled by GCC 2.96

I have a table with the following schema:
CREATE TABLE link_items
(
  link_id int4,
  name varchar(255),
  url varchar(255),
  description text,
  spanish int4,
  spanishurl varchar(255),
  lastmod date,
  visible int4,
  state varchar(25),
  promisepractice int4,
  keywords text,
  linksfti tsvector
)
WITH OIDS;
ALTER TABLE link_items OWNER TO gate;

I want linksfti to hold the search engine's indexing data (indexed on
'name', 'description', and 'keywords'), so I run the following command:

update link_items set linksfti=to_tsvector('default', coalesce(name, '')
||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));

The results are pretty empty.  Most have empty strings for data, other
only index one or two items in the 3 input columns.

For example, after running, my table looks like:
<name>;<description>;<keywords>;<linksfti>
"American Occupational Therapy Association (AOTA) ";"Nationally recognized
professional association for over 60,000 occupational therapists and
occupational therapy assistants. ";"Rehabilitation Professional
Associations and Councils";"'60':1 '000':2"
"American Physical Therapy Association (APTA)";"Represents more than
70,000 physical therapists, physical therapist assistants, and students of
physical therapy. ";"Rehabilitation Professional Associations and
Councils";"'70':1 '000':2"
"U.S. Deaf Ski & Snowboard Association";"Winter sports for people who are
deaf & relevant links.";"Recreation Winter Sports";"'u.s':1"
"Texas Adaptive Aquatics";"Adaptive water skiing program for people with
physical and/or mental disabilities. ";"Recreation Water
Sports";"'and/or':1"
"World T.E.A.M. Sports";"Inclusive sports activities.";"Recreation Team
Sports";"'t.e.a.m':1"
"Tennessee";"Official State Web Site";"Legal State Agencies";""
"Project Vote Smart";"By entering zip code, users get list of all their
elected officials. Links to elected officials' and candidates' web sites,
etc. ";"Government / Public Policy General";""
"TRIPOD Captioned Films";"Captioned Films for people who are deaf or hard
of hearing.";"Recreation Captioned Movies";""


When don't do it as an UPDATE and just print the contents to the screen, I
get the full expected output:

select name, description, keywords, to_tsvector('default', coalesce(name,
'') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')) from
link_items;

"United States of America Deaf Track and Field";"Promotes training of
track athletes who are deaf and coaches who are deaf and hearing.
";"Recreation Track";"'of':3,11 'and':7,17,22 'are':15,20 'who':14,19
'deaf':5,16,21 'hear':23 'unit':1 'coach':18 'field':8 'state':2
'track':6,12,25 'train':10 'athlet':13 'promot':9 'america':4
'recreat':24"
"Adventure Pursuit, Inc.";"Adventure Pursuit is a group of volunteers who
like spending time with all kinds of people and focus on adventure sports
like kayaking.";"Recreation Water Sports";"'a':7 'is':6 'of':9,18 'on':22
'all':16 'and':20 'inc':3 'who':11 'kind':17 'like':12,25 'time':14
'with':15 'focus':21 'group':8 'kayak':26 'peopl':19 'spend':13
'sport':24,29 'water':28 'volunt':10 'pursuit':2,5 'recreat':27
'adventur':1,4,23"


Using pgAdminIII, I copied (default backup/restore) the database from our
production server and put in on my personal desktop (Windows 2000, PgSQL
8.0.0) and re-ran the update query and it gave proper results.

Is it a known issue with 7.3.2, and is there a workaround without
upgrading the server to 8.0.0?  We will upgrade in a few months, but we
can't take the server offline now because we have too many websites that
depend on it.



--
Justin Kennedy

"HEADLINE: GAY GUY NOT AROUSED BY UGLY, BITCHY GIRLS"
  -ram

Re: Tsearch vector not stored by update/set

From

"Justin L. Kennedy"

Date:

21 March 2005, 22:23:15

On Sun, 20 Mar 2005, Andrew J. Kopciuch wrote:

> On Thursday 17 March 2005 17:55, you wrote:
> > The short question is why does this:
> >
> > select to_tsvector('default', coalesce(name, '') ||' '||
> > coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items;
> >
> > give different results than this:
> >
> > update link_items set linksfti=to_tsvector('default', coalesce(name, '')
> >
> > ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));
> >
> > select linksfti from link_items;
> >
>
>
> It shouldn't.  I just tested with some of your data on my machine, and the
> results were fine.  (PG 7.4.6).  I can not see why they would be any
> different for 7.3.2.
>
> Your data looks rather strange. (re "'60':1 '000':2").  Is that really all
> that was inserted? Or have you just left some out for your email?

About 95% of the 1900 insertions ended up with empty strings (not NULLs),
the other 5% looked like that above.  Either just numbers, or occasionally
words (defined by consecutive non-whitespace characters separated by
whitespace) that had numbers or symbols in them.  Like: "U.S. Senate" was
transformed in such a way that "Senate" was dropped completely and "U.S."
became lowercased "u.s.".  Another example was a URL that happened to be
in the description column of one was captured, but the rest of the text
was not.  Another had a name of "World T.E.A.M. Sports" and all that was
stored in the vector was "t.e.a.m."

It seems to be selective of only numbers, words with numbers in them,
words with '.' or '/' characters.  It completely ignores any other words
or text in any of the 3 fields.

> I could see this being a configuration issue possibly.  What do your pg_ts
> tables look like?  Have you made modifications there?

This morning, I decided to remove the following trigger from the
link_items table:
CREATE TRIGGER updateprodtbl
  BEFORE INSERT OR UPDATE
  ON link_items
  FOR EACH ROW
  EXECUTE PROCEDURE tsearch2('linksfti', 'name', 'description',
'keywords');

Now the UPDATE command I listed above works, so apparently there is
something about this trigger that is blocking the search vector from being
stored.  This trigger was copied and pasted (with only changes to the
column names) from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
in the INDEXING FIELDS IN A TABLE section.

This does fix the immediate problem of getting the search results for the
live website we are running, but now we have the problem of keeping it up
to date.  I could run a nightly cronjob to update all the rows, but it
seems inefficient, and I would really like to know why the trigger would
keep it from working on 7.3, but not 8.0.

You requested the pg_ts_* tables:
On the Linux-redhat, pg7.3.2

pg_ts_cfg: (4 rows)
oid    ts_name            prs_name    locale
106407    "default"        "default"    "C"
988004    "default_english"    "default"    "en_US"
106408    "default_russian"    "default"    "ru_RU.KOI8-R"
106409    "simple"        "default"    NULL

pg_ts_dict: (5 rows)
oid    dict_name    dict_initoption    dict_init    dict_lexize
106356    en_stem        ""        snb_en_init    snb_lexize
106361    ispell_template    NULL        spell_init    spell_lexize
106358    ru_stem        "/usr/local/pgsql/share/contrib/russion.stop"
snb_ru_init    snb_lexize
106353    simple        NULL        dex_init    dex_lexize
106364    synonym        NULL        syn_init    syn_lexize

pg_ts_parser: (1 row)
oid    prs_name    prs_start    prs_nexttoken    prs_end
prs_headline    prs_lextype
106389    "default"    prsd_start    prsd_getlexeme    prsd_end
prsd_headline    prsd_lextype

pg_ts_cfgmap(73 rows)
ts_name        tok_alias    dict_name
"default"    "lword"    "{en_stem}"
"default"    "nlword"    "{simple}"
"default"    "word"    "{simple}"
"default"    "email"    "{simple}"
"default"    "url"    "{simple}"
"default"    "host"    "{simple}"
"default"    "sfloat"    "{simple}"
"default"    "version"    "{simple}"
"default"    "part_hword"    "{simple}"
"default"    "nlpart_hword"    "{simple}"
"default"    "lpart_hword"    "{en_stem}"
"default"    "hword"    "{simple}"
"default"    "lhword"    "{en_stem}"
"default"    "nlhword"    "{simple}"
"default"    "uri"    "{simple}"
"default"    "file"    "{simple}"
"default"    "float"    "{simple}"
"default"    "int"    "{simple}"
"default"    "uint"    "{simple}"
"default_russian"    "lword"        "{en_stem}"
"default_russian"    "nlword"    "{ru_stem}"
"default_russian"    "word"    "{ru_stem}"
"default_russian"    "email"    "{simple}"
"default_russian"    "url"    "{simple}"
"default_russian"    "host"    "{simple}"
"default_russian"    "sfloat"    "{simple}"
"default_russian"    "version"    "{simple}"
"default_russian"    "part_hword"    "{simple}"
"default_russian"    "nlpart_hword"    "{ru_stem}"
"default_russian"    "lpart_hword"    "{en_stem}"
"default_russian"    "hword"    "{ru_stem}"
"default_russian"    "lhword"    "{en_stem}"
"default_russian"    "nlhword"    "{ru_stem}"
"default_russian"    "uri"    "{simple}"
"default_russian"    "file"    "{simple}"
"default_russian"    "float"    "{simple}"
"default_russian"    "int"    "{simple}"
"default_russian"    "uint"    "{simple}"
"simple"    "lword"    "{simple}"
"simple"    "nlword"    "{simple}"
"simple"    "word"    "{simple}"
"simple"    "email"    "{simple}"
"simple"    "url"    "{simple}"
"simple"    "host"    "{simple}"
"simple"    "sfloat"    "{simple}"
"simple"    "version"    "{simple}"
"simple"    "part_hword"    "{simple}"
"simple"    "nlpart_hword"    "{simple}"
"simple"    "lpart_hword"    "{simple}"
"simple"    "hword"    "{simple}"
"simple"    "lhword"    "{simple}"
"simple"    "nlhword"    "{simple}"
"simple"    "uri"    "{simple}"
"simple"    "file"    "{simple}"
"simple"    "float"    "{simple}"
"simple"    "int"    "{simple}"
"simple"    "uint"    "{simple}"
"default_english"    "url"    "{simple}"
"default_english"    "host"    "{simple}"
"default_english"    "sfloat"    "{simple}"
"default_english"    "uri"    "{simple}"
"default_english"    "int"    "{simple}"
"default_english"    "float"    "{simple}"
"default_english"    "email"    "{simple}"
"default_english"    "word"    "{simple}"
"default_english"    "hword"    "{simple}"
"default_english"    "nlword"    "{simple}"
"default_english"    "nlpart_hword"    "{simple}"
"default_english"    "part_hword"    "{simple}"
"default_english"    "nlhword"    "{simple}"
"default_english"    "file"    "{simple}"
"default_english"    "uint"    "{simple}"
"default_english"    "version"    "{simple}"


On the 8.0.0-beta5 Windows 2000 machine:
pg_ts_cfg: (4 rows, identical to 7.3.2)
oid    ts_name            prs_name    locale
370162    "default"        "default"    "C"
370165    "default_english"    "default"    "en_US"
370163    "default_russian"    "default"    "ru_RU.KOI8-R"
370164    "simple"        "default"    NULL

pg_ts_cfgmap (73 rows, identical to 7.3.2)
Not listed again, all identical except for OIDs

pg_ts_dict (5 rows, identical to 7.3.2 except for OIDS)

pg_ts_parser (1 row, identical to 7.3.2 except for OIDs)

I have made a single change to it from its default installation.  When I
was working with the rank_cd() function on the 8.0.0 machine, it had
errors due to a non-existant english stop file, so I changed
pg_ts_dict.dict_initoption = '' where dict_name = 'en_stem'.  The indexing
system was working fine both before and after the change to the pg_ts_dict
table.  I also propagated the change to the 7.3.2 machine even though it
didn't have the error message (the stop file didn't exist on that computer
either, but it never gave an error message about it).

Re: Tsearch vector not stored by update/set

From

"Andrew J. Kopciuch"

Date:

21 March 2005, 23:49:03

> It seems to be selective of only numbers, words with numbers in them,
> words with '.' or '/' characters.  It completely ignores any other words
> or text in any of the 3 fields.
>

This is a very big hint to your problem.

> You requested the pg_ts_* tables:
> On the Linux-redhat, pg7.3.2
>
> pg_ts_cfgmap(73 rows)
> ts_name  tok_alias dict_name
> "default" "lword" "{en_stem}"
> "default" "nlword" "{simple}"
> "default" "word" "{simple}"
> "default" "email" "{simple}"
> "default" "url" "{simple}"
> "default" "host" "{simple}"
> "default" "sfloat" "{simple}"
> "default" "version" "{simple}"
> "default" "part_hword" "{simple}"
> "default" "nlpart_hword" "{simple}"
> "default" "lpart_hword" "{en_stem}"
> "default" "hword" "{simple}"
> "default" "lhword" "{en_stem}"
> "default" "nlhword" "{simple}"
> "default" "uri" "{simple}"
> "default" "file" "{simple}"
> "default" "float" "{simple}"
> "default" "int" "{simple}"
> "default" "uint" "{simple}"
> "default_russian" "lword"  "{en_stem}"
> "default_russian" "nlword" "{ru_stem}"
> "default_russian" "word" "{ru_stem}"
> "default_russian" "email" "{simple}"
> "default_russian" "url" "{simple}"
> "default_russian" "host" "{simple}"
> "default_russian" "sfloat" "{simple}"
> "default_russian" "version" "{simple}"
> "default_russian" "part_hword" "{simple}"
> "default_russian" "nlpart_hword" "{ru_stem}"
> "default_russian" "lpart_hword" "{en_stem}"
> "default_russian" "hword" "{ru_stem}"
> "default_russian" "lhword" "{en_stem}"
> "default_russian" "nlhword" "{ru_stem}"
> "default_russian" "uri" "{simple}"
> "default_russian" "file" "{simple}"
> "default_russian" "float" "{simple}"
> "default_russian" "int" "{simple}"
> "default_russian" "uint" "{simple}"
> "simple" "lword" "{simple}"
> "simple" "nlword" "{simple}"
> "simple" "word" "{simple}"
> "simple" "email" "{simple}"
> "simple" "url" "{simple}"
> "simple" "host" "{simple}"
> "simple" "sfloat" "{simple}"
> "simple" "version" "{simple}"
> "simple" "part_hword" "{simple}"
> "simple" "nlpart_hword" "{simple}"
> "simple" "lpart_hword" "{simple}"
> "simple" "hword" "{simple}"
> "simple" "lhword" "{simple}"
> "simple" "nlhword" "{simple}"
> "simple" "uri" "{simple}"
> "simple" "file" "{simple}"
> "simple" "float" "{simple}"
> "simple" "int" "{simple}"
> "simple" "uint" "{simple}"
> "default_english" "url" "{simple}"
> "default_english" "host" "{simple}"
> "default_english" "sfloat" "{simple}"
> "default_english" "uri" "{simple}"
> "default_english" "int" "{simple}"
> "default_english" "float" "{simple}"
> "default_english" "email" "{simple}"
> "default_english" "word" "{simple}"
> "default_english" "hword" "{simple}"
> "default_english" "nlword" "{simple}"
> "default_english" "nlpart_hword" "{simple}"
> "default_english" "part_hword" "{simple}"
> "default_english" "nlhword" "{simple}"
> "default_english" "file" "{simple}"
> "default_english" "uint" "{simple}"
> "default_english" "version" "{simple}"
>

I am assuming that your cluster is running created with en_US for the locale,
and that you have set the matching tsearch2 configuration to be your default
(Or curcfg for each process running).

If you look at your config mappings for the "default_english" you will notice
that you have 16 records, as opposed to 19 records like every other
configuration mapping.  From some more in depth observations, I noticed you
are missing entries for the 'lword', 'lhword' and ''lpart_hword'.  That means
that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and
'Latin Part Hyphenated Words' are just dropped because you do not have a
configuration mapping set up for them.

This is why only numbers (or other lexem types) would show (They are returned
as lexem_types : int, uint, float, url, etc. for which you have mappings).
Most regular words are simply discarded due to missing entries.  If you fix
your configurations the triggers should work properly.

Your examples worked before, simply because you specified the 'default'
configuration on the insert statement.  Which is not the same as the
'default_english' configuration which is used by the trigger based on your
server encoding (en_US).

> I have made a single change to it from its default installation.  When I
> was working with the rank_cd() function on the 8.0.0 machine, it had
> errors due to a non-existant english stop file, so I changed
> pg_ts_dict.dict_initoption = '' where dict_name = 'en_stem'.  The indexing
> system was working fine both before and after the change to the pg_ts_dict
> table.  I also propagated the change to the 7.3.2 machine even though it
> didn't have the error message (the stop file didn't exist on that computer
> either, but it never gave an error message about it).

I would not recommend this.  The stop file should is most likely on the system
somewhere.  It will change depending on your installation.  Look for
english.stop on the computer(s).  If it is not there, you can grab the one
out of the source distribution and put it wherever you want.  Then just
update the settings to the location you used.


good luck,


Andy

Re: Tsearch vector not stored by update/set

From

"Justin L. Kennedy"

Date:

22 March 2005, 15:03:41

On Mon, 21 Mar 2005, Andrew J. Kopciuch wrote:
> If you look at your config mappings for the "default_english" you will notice
> that you have 16 records, as opposed to 19 records like every other
> configuration mapping.  From some more in depth observations, I noticed you
> are missing entries for the 'lword', 'lhword' and ''lpart_hword'.  That means
> that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and
> 'Latin Part Hyphenated Words' are just dropped because you do not have a
> configuration mapping set up for them.

You were correct.  It was these missing three entries that made the
difference.  Thanks for identifying them for me.

The triggers work properly in both server now.

--
Justin Kennedy
Systems Analyst I