Re: Tsearch2 question: getting histogram of the vector elements - Mailing list pgsql-sql
From | Oleg Bartunov |
---|---|
Subject | Re: Tsearch2 question: getting histogram of the vector elements |
Date | |
Msg-id | Pine.GSO.4.58.0403111159240.25727@ra.sai.msu.su Whole thread Raw |
In response to | Tsearch2 question: getting histogram of the vector elements (Rajesh Kumar Mallah <mallah@trade-india.com>) |
List | pgsql-sql |
Read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes ("Check words statistics") tsearch2 comes with stat function. for example - top 10 words from apod (Astronomy picture of the day): select * from stat('select ti from apod') order by ndoc desc, nentry desc,word limit 10; word | ndoc | nentry --------+------+--------year | 1283 | 2064star | 1273 | 3908imag | 1267 | 1967light | 1232 | 2206pictur |1177 | 1442earth | 1059 | 1798visibl | 992 | 1259bright | 936 | 1335котор | 903 | 1441эт | 879 | 1397 Oleg On Thu, 11 Mar 2004, Rajesh Kumar Mallah wrote: > > Greetings! > > My original problem is to de duplicate a list of around 0.3 million > company names. > > Since a company name can be potentially (mis)spelt in numerous ways > exactmatch > obviously wont work. > > To make the searches faster i am using tsearch. For each company name i > want to > search other companies whose name is similar to the company in question. > > Since inclusion of all the vector elements of a given company reduces the > chance of matching i am thinking of excluding the high frequency words > from the query. > > Hence i need to find the high frequency elements like say 'consulting' , > 'limited' , 'Private' > 'Industries' that occur commonly in company names. > > In my table i have populated the co_name_vec feild as > strip(to_tsvector(co_name)) > can anyone help me analyzing the co_name_vec for the high frequency words? > > Also i would like to know alternate / better solution to this problem. > > > Regds > Mallah. > > > > SAMPLE DATA. > > +-----------------------------------------------------+----------------------------------------------------------+ > | co_name > | co_name_vec | > +-----------------------------------------------------+----------------------------------------------------------+ > | European Trade Partner & Consulting | 'trade' > 'consult' 'partner' 'european' | > | Gulbrandsen Chemicals Pvt. Ltd. | 'ltd' 'pvt' > 'chemic' 'gulbrandsen' | > | Govt. of Karnataka, Vision Group on Biotechnology | 'govt' 'group' > 'vision' 'karnataka' 'biotechnolog' | > | Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' > 'compani' 'hewlett' 'packard' 'globalsoft' | > | Shanon Construction Material Industries | 'materi' > 'shanon' 'industri' 'construct' | > | singpore india trade rsources company | 'india' 'trade' > 'rsourc' 'compani' 'singpor' | > | RGV TELECOM CONSULTANTS PVT. LTD. | 'ltd' 'pvt' > 'rgv' 'consult' 'telecom' | > | avid information search and documents (p) ltd. | 'p' 'ltd' 'avid' > 'inform' 'search' 'document' | > | Tavant Technologies India (P) Ltd. | 'p' 'ltd' > 'india' 'tavant' 'technolog' | > | Maschinen Fabrik (India) Pvt. Ltd | 'ltd' 'pvt' > 'india' 'fabrik' 'maschinen' | > | Manishri Refractories and Ceramics Pvt. Ltd. | 'ltd' 'pvt' > 'ceram' 'manishri' 'refractori' | > | xavier export import management institute | 'manag' 'export' > 'import' 'xavier' 'institut' | > | Best InformationTechnology ltd. | 'ltd' 'best' > 'informationtechnolog' | > | FutureCalls Technology Private Limited | 'limit' 'privat' > 'futurecal' 'technolog' | > | mak controls and systems pvt ltd | 'ltd' 'mak' > 'pvt' 'system' 'control' | > | NATIONAL RESEARCH CENTRE FOR CASHEW | 'centr' 'cashew' > 'nation' 'research' | > | The Madras Aluminium Company Ltd. | 'ltd' 'madra' > 'compani' 'aluminium' | > | Shriram Institute for Industrial Research | 'shriram' > 'industri' 'institut' 'research' | > | All India Carpet Trade Fair Committee | 'fair' 'india' > 'trade' 'carpet' 'committe' | > | Tuff Security & Allied Services | 'alli' 'tuff' > 'secur' 'servic' | > +-----------------------------------------------------+----------------------------------------------------------+ > (20 rows) > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83