Thread: Tsearch2 question: getting histogram of the vector elements
Greetings! My original problem is to de duplicate a list of around 0.3 million company names. Since a company name can be potentially (mis)spelt in numerous ways exactmatch obviously wont work. To make the searches faster i am using tsearch. For each company name i want to search other companies whose name is similar to the company in question. Since inclusion of all the vector elements of a given company reduces the chance of matching i am thinking of excluding the high frequency words from the query. Hence i need to find the high frequency elements like say 'consulting' , 'limited' , 'Private' 'Industries' that occur commonly in company names. In my table i have populated the co_name_vec feild as strip(to_tsvector(co_name)) can anyone help me analyzing the co_name_vec for the high frequency words? Also i would like to know alternate / better solution to this problem. Regds Mallah. SAMPLE DATA. +-----------------------------------------------------+----------------------------------------------------------+ | co_name | co_name_vec | +-----------------------------------------------------+----------------------------------------------------------+ | European Trade Partner & Consulting | 'trade' 'consult' 'partner' 'european' | | Gulbrandsen Chemicals Pvt. Ltd. | 'ltd' 'pvt' 'chemic' 'gulbrandsen' | | Govt. of Karnataka, Vision Group on Biotechnology | 'govt' 'group' 'vision' 'karnataka' 'biotechnolog' | | Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 'compani' 'hewlett' 'packard' 'globalsoft' | | Shanon Construction Material Industries | 'materi' 'shanon' 'industri' 'construct' | | singpore india trade rsources company | 'india' 'trade' 'rsourc' 'compani' 'singpor' | | RGV TELECOM CONSULTANTS PVT. LTD. | 'ltd' 'pvt' 'rgv' 'consult' 'telecom' | | avid information search and documents (p) ltd. | 'p' 'ltd' 'avid' 'inform' 'search' 'document' | | Tavant Technologies India (P) Ltd. | 'p' 'ltd' 'india' 'tavant' 'technolog' | | Maschinen Fabrik (India) Pvt. Ltd | 'ltd' 'pvt' 'india' 'fabrik' 'maschinen' | | Manishri Refractories and Ceramics Pvt. Ltd. | 'ltd' 'pvt' 'ceram' 'manishri' 'refractori' | | xavier export import management institute | 'manag' 'export' 'import' 'xavier' 'institut' | | Best InformationTechnology ltd. | 'ltd' 'best' 'informationtechnolog' | | FutureCalls Technology Private Limited | 'limit' 'privat' 'futurecal' 'technolog' | | mak controls and systems pvt ltd | 'ltd' 'mak' 'pvt' 'system' 'control' | | NATIONAL RESEARCH CENTRE FOR CASHEW | 'centr' 'cashew' 'nation' 'research' | | The Madras Aluminium Company Ltd. | 'ltd' 'madra' 'compani' 'aluminium' | | Shriram Institute for Industrial Research | 'shriram' 'industri' 'institut' 'research' | | All India Carpet Trade Fair Committee | 'fair' 'india' 'trade' 'carpet' 'committe' | | Tuff Security & Allied Services | 'alli' 'tuff' 'secur' 'servic' | +-----------------------------------------------------+----------------------------------------------------------+ (20 rows)
Read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes ("Check words statistics") tsearch2 comes with stat function. for example - top 10 words from apod (Astronomy picture of the day): select * from stat('select ti from apod') order by ndoc desc, nentry desc,word limit 10; word | ndoc | nentry --------+------+--------year | 1283 | 2064star | 1273 | 3908imag | 1267 | 1967light | 1232 | 2206pictur |1177 | 1442earth | 1059 | 1798visibl | 992 | 1259bright | 936 | 1335котор | 903 | 1441эт | 879 | 1397 Oleg On Thu, 11 Mar 2004, Rajesh Kumar Mallah wrote: > > Greetings! > > My original problem is to de duplicate a list of around 0.3 million > company names. > > Since a company name can be potentially (mis)spelt in numerous ways > exactmatch > obviously wont work. > > To make the searches faster i am using tsearch. For each company name i > want to > search other companies whose name is similar to the company in question. > > Since inclusion of all the vector elements of a given company reduces the > chance of matching i am thinking of excluding the high frequency words > from the query. > > Hence i need to find the high frequency elements like say 'consulting' , > 'limited' , 'Private' > 'Industries' that occur commonly in company names. > > In my table i have populated the co_name_vec feild as > strip(to_tsvector(co_name)) > can anyone help me analyzing the co_name_vec for the high frequency words? > > Also i would like to know alternate / better solution to this problem. > > > Regds > Mallah. > > > > SAMPLE DATA. > > +-----------------------------------------------------+----------------------------------------------------------+ > | co_name > | co_name_vec | > +-----------------------------------------------------+----------------------------------------------------------+ > | European Trade Partner & Consulting | 'trade' > 'consult' 'partner' 'european' | > | Gulbrandsen Chemicals Pvt. Ltd. | 'ltd' 'pvt' > 'chemic' 'gulbrandsen' | > | Govt. of Karnataka, Vision Group on Biotechnology | 'govt' 'group' > 'vision' 'karnataka' 'biotechnolog' | > | Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' > 'compani' 'hewlett' 'packard' 'globalsoft' | > | Shanon Construction Material Industries | 'materi' > 'shanon' 'industri' 'construct' | > | singpore india trade rsources company | 'india' 'trade' > 'rsourc' 'compani' 'singpor' | > | RGV TELECOM CONSULTANTS PVT. LTD. | 'ltd' 'pvt' > 'rgv' 'consult' 'telecom' | > | avid information search and documents (p) ltd. | 'p' 'ltd' 'avid' > 'inform' 'search' 'document' | > | Tavant Technologies India (P) Ltd. | 'p' 'ltd' > 'india' 'tavant' 'technolog' | > | Maschinen Fabrik (India) Pvt. Ltd | 'ltd' 'pvt' > 'india' 'fabrik' 'maschinen' | > | Manishri Refractories and Ceramics Pvt. Ltd. | 'ltd' 'pvt' > 'ceram' 'manishri' 'refractori' | > | xavier export import management institute | 'manag' 'export' > 'import' 'xavier' 'institut' | > | Best InformationTechnology ltd. | 'ltd' 'best' > 'informationtechnolog' | > | FutureCalls Technology Private Limited | 'limit' 'privat' > 'futurecal' 'technolog' | > | mak controls and systems pvt ltd | 'ltd' 'mak' > 'pvt' 'system' 'control' | > | NATIONAL RESEARCH CENTRE FOR CASHEW | 'centr' 'cashew' > 'nation' 'research' | > | The Madras Aluminium Company Ltd. | 'ltd' 'madra' > 'compani' 'aluminium' | > | Shriram Institute for Industrial Research | 'shriram' > 'industri' 'institut' 'research' | > | All India Carpet Trade Fair Committee | 'fair' 'india' > 'trade' 'carpet' 'committe' | > | Tuff Security & Allied Services | 'alli' 'tuff' > 'secur' 'servic' | > +-----------------------------------------------------+----------------------------------------------------------+ > (20 rows) > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Thanks very much , that was exactly what i needed. regds mallah. > Rajesh Kumar Mallah wrote: >> can anyone help me analyzing the co_name_vec for the high frequency words? > > Use tsearch2's stat() function. Look at the example at the bottom of the following page: > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes > > There's a lot of other documentation on the tsearch2 home page at: > http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ > > George Essig ----------------------------------------- Over 1,00,000 exporters are waiting for your order! Click below to get in touch with leading Indian exporters listed in the premier trade directory Exporters Yellow Pages. http://www.trade-india.com/dyn/gdh/eyp/
Rajesh Kumar Mallah wrote: > can anyone help me analyzing the co_name_vec for the high frequency words? Use tsearch2's stat() function. Look at the example at the bottom of the following page: http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes There's a lot of other documentation on the tsearch2 home page at: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ George Essig