Thread: Tsearch vector not stored by update/set
The short question is why does this: select to_tsvector('default', coalesce(name, '') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items; give different results than this: update link_items set linksfti=to_tsvector('default', coalesce(name, '') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')); select linksfti from link_items; Here are more details: I am working with Tsearch2 on a server with version string: PostgreSQL 7.3.2 on i686-pc-linux-gnu, compiled by GCC 2.96 I have a table with the following schema: CREATE TABLE link_items ( link_id int4, name varchar(255), url varchar(255), description text, spanish int4, spanishurl varchar(255), lastmod date, visible int4, state varchar(25), promisepractice int4, keywords text, linksfti tsvector ) WITH OIDS; ALTER TABLE link_items OWNER TO gate; I want linksfti to hold the search engine's indexing data (indexed on 'name', 'description', and 'keywords'), so I run the following command: update link_items set linksfti=to_tsvector('default', coalesce(name, '') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')); The results are pretty empty. Most have empty strings for data, other only index one or two items in the 3 input columns. For example, after running, my table looks like: <name>;<description>;<keywords>;<linksfti> "American Occupational Therapy Association (AOTA) ";"Nationally recognized professional association for over 60,000 occupational therapists and occupational therapy assistants. ";"Rehabilitation Professional Associations and Councils";"'60':1 '000':2" "American Physical Therapy Association (APTA)";"Represents more than 70,000 physical therapists, physical therapist assistants, and students of physical therapy. ";"Rehabilitation Professional Associations and Councils";"'70':1 '000':2" "U.S. Deaf Ski & Snowboard Association";"Winter sports for people who are deaf & relevant links.";"Recreation Winter Sports";"'u.s':1" "Texas Adaptive Aquatics";"Adaptive water skiing program for people with physical and/or mental disabilities. ";"Recreation Water Sports";"'and/or':1" "World T.E.A.M. Sports";"Inclusive sports activities.";"Recreation Team Sports";"'t.e.a.m':1" "Tennessee";"Official State Web Site";"Legal State Agencies";"" "Project Vote Smart";"By entering zip code, users get list of all their elected officials. Links to elected officials' and candidates' web sites, etc. ";"Government / Public Policy General";"" "TRIPOD Captioned Films";"Captioned Films for people who are deaf or hard of hearing.";"Recreation Captioned Movies";"" When don't do it as an UPDATE and just print the contents to the screen, I get the full expected output: select name, description, keywords, to_tsvector('default', coalesce(name, '') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items; "United States of America Deaf Track and Field";"Promotes training of track athletes who are deaf and coaches who are deaf and hearing. ";"Recreation Track";"'of':3,11 'and':7,17,22 'are':15,20 'who':14,19 'deaf':5,16,21 'hear':23 'unit':1 'coach':18 'field':8 'state':2 'track':6,12,25 'train':10 'athlet':13 'promot':9 'america':4 'recreat':24" "Adventure Pursuit, Inc.";"Adventure Pursuit is a group of volunteers who like spending time with all kinds of people and focus on adventure sports like kayaking.";"Recreation Water Sports";"'a':7 'is':6 'of':9,18 'on':22 'all':16 'and':20 'inc':3 'who':11 'kind':17 'like':12,25 'time':14 'with':15 'focus':21 'group':8 'kayak':26 'peopl':19 'spend':13 'sport':24,29 'water':28 'volunt':10 'pursuit':2,5 'recreat':27 'adventur':1,4,23" Using pgAdminIII, I copied (default backup/restore) the database from our production server and put in on my personal desktop (Windows 2000, PgSQL 8.0.0) and re-ran the update query and it gave proper results. Is it a known issue with 7.3.2, and is there a workaround without upgrading the server to 8.0.0? We will upgrade in a few months, but we can't take the server offline now because we have too many websites that depend on it. -- Justin Kennedy "HEADLINE: GAY GUY NOT AROUSED BY UGLY, BITCHY GIRLS" -ram
On Sun, 20 Mar 2005, Andrew J. Kopciuch wrote: > On Thursday 17 March 2005 17:55, you wrote: > > The short question is why does this: > > > > select to_tsvector('default', coalesce(name, '') ||' '|| > > coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items; > > > > give different results than this: > > > > update link_items set linksfti=to_tsvector('default', coalesce(name, '') > > > > ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')); > > > > select linksfti from link_items; > > > > > It shouldn't. I just tested with some of your data on my machine, and the > results were fine. (PG 7.4.6). I can not see why they would be any > different for 7.3.2. > > Your data looks rather strange. (re "'60':1 '000':2"). Is that really all > that was inserted? Or have you just left some out for your email? About 95% of the 1900 insertions ended up with empty strings (not NULLs), the other 5% looked like that above. Either just numbers, or occasionally words (defined by consecutive non-whitespace characters separated by whitespace) that had numbers or symbols in them. Like: "U.S. Senate" was transformed in such a way that "Senate" was dropped completely and "U.S." became lowercased "u.s.". Another example was a URL that happened to be in the description column of one was captured, but the rest of the text was not. Another had a name of "World T.E.A.M. Sports" and all that was stored in the vector was "t.e.a.m." It seems to be selective of only numbers, words with numbers in them, words with '.' or '/' characters. It completely ignores any other words or text in any of the 3 fields. > I could see this being a configuration issue possibly. What do your pg_ts > tables look like? Have you made modifications there? This morning, I decided to remove the following trigger from the link_items table: CREATE TRIGGER updateprodtbl BEFORE INSERT OR UPDATE ON link_items FOR EACH ROW EXECUTE PROCEDURE tsearch2('linksfti', 'name', 'description', 'keywords'); Now the UPDATE command I listed above works, so apparently there is something about this trigger that is blocking the search vector from being stored. This trigger was copied and pasted (with only changes to the column names) from http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html in the INDEXING FIELDS IN A TABLE section. This does fix the immediate problem of getting the search results for the live website we are running, but now we have the problem of keeping it up to date. I could run a nightly cronjob to update all the rows, but it seems inefficient, and I would really like to know why the trigger would keep it from working on 7.3, but not 8.0. You requested the pg_ts_* tables: On the Linux-redhat, pg7.3.2 pg_ts_cfg: (4 rows) oid ts_name prs_name locale 106407 "default" "default" "C" 988004 "default_english" "default" "en_US" 106408 "default_russian" "default" "ru_RU.KOI8-R" 106409 "simple" "default" NULL pg_ts_dict: (5 rows) oid dict_name dict_initoption dict_init dict_lexize 106356 en_stem "" snb_en_init snb_lexize 106361 ispell_template NULL spell_init spell_lexize 106358 ru_stem "/usr/local/pgsql/share/contrib/russion.stop" snb_ru_init snb_lexize 106353 simple NULL dex_init dex_lexize 106364 synonym NULL syn_init syn_lexize pg_ts_parser: (1 row) oid prs_name prs_start prs_nexttoken prs_end prs_headline prs_lextype 106389 "default" prsd_start prsd_getlexeme prsd_end prsd_headline prsd_lextype pg_ts_cfgmap(73 rows) ts_name tok_alias dict_name "default" "lword" "{en_stem}" "default" "nlword" "{simple}" "default" "word" "{simple}" "default" "email" "{simple}" "default" "url" "{simple}" "default" "host" "{simple}" "default" "sfloat" "{simple}" "default" "version" "{simple}" "default" "part_hword" "{simple}" "default" "nlpart_hword" "{simple}" "default" "lpart_hword" "{en_stem}" "default" "hword" "{simple}" "default" "lhword" "{en_stem}" "default" "nlhword" "{simple}" "default" "uri" "{simple}" "default" "file" "{simple}" "default" "float" "{simple}" "default" "int" "{simple}" "default" "uint" "{simple}" "default_russian" "lword" "{en_stem}" "default_russian" "nlword" "{ru_stem}" "default_russian" "word" "{ru_stem}" "default_russian" "email" "{simple}" "default_russian" "url" "{simple}" "default_russian" "host" "{simple}" "default_russian" "sfloat" "{simple}" "default_russian" "version" "{simple}" "default_russian" "part_hword" "{simple}" "default_russian" "nlpart_hword" "{ru_stem}" "default_russian" "lpart_hword" "{en_stem}" "default_russian" "hword" "{ru_stem}" "default_russian" "lhword" "{en_stem}" "default_russian" "nlhword" "{ru_stem}" "default_russian" "uri" "{simple}" "default_russian" "file" "{simple}" "default_russian" "float" "{simple}" "default_russian" "int" "{simple}" "default_russian" "uint" "{simple}" "simple" "lword" "{simple}" "simple" "nlword" "{simple}" "simple" "word" "{simple}" "simple" "email" "{simple}" "simple" "url" "{simple}" "simple" "host" "{simple}" "simple" "sfloat" "{simple}" "simple" "version" "{simple}" "simple" "part_hword" "{simple}" "simple" "nlpart_hword" "{simple}" "simple" "lpart_hword" "{simple}" "simple" "hword" "{simple}" "simple" "lhword" "{simple}" "simple" "nlhword" "{simple}" "simple" "uri" "{simple}" "simple" "file" "{simple}" "simple" "float" "{simple}" "simple" "int" "{simple}" "simple" "uint" "{simple}" "default_english" "url" "{simple}" "default_english" "host" "{simple}" "default_english" "sfloat" "{simple}" "default_english" "uri" "{simple}" "default_english" "int" "{simple}" "default_english" "float" "{simple}" "default_english" "email" "{simple}" "default_english" "word" "{simple}" "default_english" "hword" "{simple}" "default_english" "nlword" "{simple}" "default_english" "nlpart_hword" "{simple}" "default_english" "part_hword" "{simple}" "default_english" "nlhword" "{simple}" "default_english" "file" "{simple}" "default_english" "uint" "{simple}" "default_english" "version" "{simple}" On the 8.0.0-beta5 Windows 2000 machine: pg_ts_cfg: (4 rows, identical to 7.3.2) oid ts_name prs_name locale 370162 "default" "default" "C" 370165 "default_english" "default" "en_US" 370163 "default_russian" "default" "ru_RU.KOI8-R" 370164 "simple" "default" NULL pg_ts_cfgmap (73 rows, identical to 7.3.2) Not listed again, all identical except for OIDs pg_ts_dict (5 rows, identical to 7.3.2 except for OIDS) pg_ts_parser (1 row, identical to 7.3.2 except for OIDs) I have made a single change to it from its default installation. When I was working with the rank_cd() function on the 8.0.0 machine, it had errors due to a non-existant english stop file, so I changed pg_ts_dict.dict_initoption = '' where dict_name = 'en_stem'. The indexing system was working fine both before and after the change to the pg_ts_dict table. I also propagated the change to the 7.3.2 machine even though it didn't have the error message (the stop file didn't exist on that computer either, but it never gave an error message about it).
> It seems to be selective of only numbers, words with numbers in them, > words with '.' or '/' characters. It completely ignores any other words > or text in any of the 3 fields. > This is a very big hint to your problem. > You requested the pg_ts_* tables: > On the Linux-redhat, pg7.3.2 > > pg_ts_cfgmap(73 rows) > ts_name tok_alias dict_name > "default" "lword" "{en_stem}" > "default" "nlword" "{simple}" > "default" "word" "{simple}" > "default" "email" "{simple}" > "default" "url" "{simple}" > "default" "host" "{simple}" > "default" "sfloat" "{simple}" > "default" "version" "{simple}" > "default" "part_hword" "{simple}" > "default" "nlpart_hword" "{simple}" > "default" "lpart_hword" "{en_stem}" > "default" "hword" "{simple}" > "default" "lhword" "{en_stem}" > "default" "nlhword" "{simple}" > "default" "uri" "{simple}" > "default" "file" "{simple}" > "default" "float" "{simple}" > "default" "int" "{simple}" > "default" "uint" "{simple}" > "default_russian" "lword" "{en_stem}" > "default_russian" "nlword" "{ru_stem}" > "default_russian" "word" "{ru_stem}" > "default_russian" "email" "{simple}" > "default_russian" "url" "{simple}" > "default_russian" "host" "{simple}" > "default_russian" "sfloat" "{simple}" > "default_russian" "version" "{simple}" > "default_russian" "part_hword" "{simple}" > "default_russian" "nlpart_hword" "{ru_stem}" > "default_russian" "lpart_hword" "{en_stem}" > "default_russian" "hword" "{ru_stem}" > "default_russian" "lhword" "{en_stem}" > "default_russian" "nlhword" "{ru_stem}" > "default_russian" "uri" "{simple}" > "default_russian" "file" "{simple}" > "default_russian" "float" "{simple}" > "default_russian" "int" "{simple}" > "default_russian" "uint" "{simple}" > "simple" "lword" "{simple}" > "simple" "nlword" "{simple}" > "simple" "word" "{simple}" > "simple" "email" "{simple}" > "simple" "url" "{simple}" > "simple" "host" "{simple}" > "simple" "sfloat" "{simple}" > "simple" "version" "{simple}" > "simple" "part_hword" "{simple}" > "simple" "nlpart_hword" "{simple}" > "simple" "lpart_hword" "{simple}" > "simple" "hword" "{simple}" > "simple" "lhword" "{simple}" > "simple" "nlhword" "{simple}" > "simple" "uri" "{simple}" > "simple" "file" "{simple}" > "simple" "float" "{simple}" > "simple" "int" "{simple}" > "simple" "uint" "{simple}" > "default_english" "url" "{simple}" > "default_english" "host" "{simple}" > "default_english" "sfloat" "{simple}" > "default_english" "uri" "{simple}" > "default_english" "int" "{simple}" > "default_english" "float" "{simple}" > "default_english" "email" "{simple}" > "default_english" "word" "{simple}" > "default_english" "hword" "{simple}" > "default_english" "nlword" "{simple}" > "default_english" "nlpart_hword" "{simple}" > "default_english" "part_hword" "{simple}" > "default_english" "nlhword" "{simple}" > "default_english" "file" "{simple}" > "default_english" "uint" "{simple}" > "default_english" "version" "{simple}" > I am assuming that your cluster is running created with en_US for the locale, and that you have set the matching tsearch2 configuration to be your default (Or curcfg for each process running). If you look at your config mappings for the "default_english" you will notice that you have 16 records, as opposed to 19 records like every other configuration mapping. From some more in depth observations, I noticed you are missing entries for the 'lword', 'lhword' and ''lpart_hword'. That means that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and 'Latin Part Hyphenated Words' are just dropped because you do not have a configuration mapping set up for them. This is why only numbers (or other lexem types) would show (They are returned as lexem_types : int, uint, float, url, etc. for which you have mappings). Most regular words are simply discarded due to missing entries. If you fix your configurations the triggers should work properly. Your examples worked before, simply because you specified the 'default' configuration on the insert statement. Which is not the same as the 'default_english' configuration which is used by the trigger based on your server encoding (en_US). > I have made a single change to it from its default installation. When I > was working with the rank_cd() function on the 8.0.0 machine, it had > errors due to a non-existant english stop file, so I changed > pg_ts_dict.dict_initoption = '' where dict_name = 'en_stem'. The indexing > system was working fine both before and after the change to the pg_ts_dict > table. I also propagated the change to the 7.3.2 machine even though it > didn't have the error message (the stop file didn't exist on that computer > either, but it never gave an error message about it). I would not recommend this. The stop file should is most likely on the system somewhere. It will change depending on your installation. Look for english.stop on the computer(s). If it is not there, you can grab the one out of the source distribution and put it wherever you want. Then just update the settings to the location you used. good luck, Andy
On Mon, 21 Mar 2005, Andrew J. Kopciuch wrote: > If you look at your config mappings for the "default_english" you will notice > that you have 16 records, as opposed to 19 records like every other > configuration mapping. From some more in depth observations, I noticed you > are missing entries for the 'lword', 'lhword' and ''lpart_hword'. That means > that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and > 'Latin Part Hyphenated Words' are just dropped because you do not have a > configuration mapping set up for them. You were correct. It was these missing three entries that made the difference. Thanks for identifying them for me. The triggers work properly in both server now. -- Justin Kennedy Systems Analyst I