Tsearch2 question: getting histogram of the vector elements - Mailing list pgsql-sql
From | Rajesh Kumar Mallah |
---|---|
Subject | Tsearch2 question: getting histogram of the vector elements |
Date | |
Msg-id | 404F7269.9010301@trade-india.com |
Responses | Re: Tsearch2 question: getting histogram of the vector elements |
List | pgsql-sql |
Greetings!

My original problem is to deduplicate a list of around 0.3 million company names. Since a company name can potentially be (mis)spelt in numerous ways, exact matching obviously won't work. To make the searches faster I am using tsearch.

For each company name I want to search for other companies whose names are similar to the company in question. Since including all the vector elements of a given company in the query reduces the chance of a match, I am thinking of excluding the high-frequency words from the query.

Hence I need to find the high-frequency elements, such as 'consulting', 'limited', 'Private', 'Industries', that occur commonly in company names.

In my table I have populated the co_name_vec field as strip(to_tsvector(co_name)).

Can anyone help me analyze co_name_vec for the high-frequency words? I would also like to know of any alternative / better solution to this problem.

Regds
Mallah.

SAMPLE DATA

+-----------------------------------------------------+----------------------------------------------------------+
|                       co_name                       |                       co_name_vec                        |
+-----------------------------------------------------+----------------------------------------------------------+
| European Trade Partner & Consulting                 | 'trade' 'consult' 'partner' 'european'                   |
| Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt' 'chemic' 'gulbrandsen'                       |
| Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group' 'vision' 'karnataka' 'biotechnolog'       |
| Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 'compani' 'hewlett' 'packard' 'globalsoft' |
| Shanon Construction Material Industries             | 'materi' 'shanon' 'industri' 'construct'                 |
| singpore india trade rsources company               | 'india' 'trade' 'rsourc' 'compani' 'singpor'             |
| RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt' 'rgv' 'consult' 'telecom'                    |
| avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid' 'inform' 'search' 'document'            |
| Tavant Technologies India (P) Ltd.                  | 'p' 'ltd' 'india' 'tavant' 'technolog'                   |
| Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt' 'india' 'fabrik' 'maschinen'                 |
| Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt' 'ceram' 'manishri' 'refractori'              |
| xavier export import management institute           | 'manag' 'export' 'import' 'xavier' 'institut'            |
| Best InformationTechnology ltd.                     | 'ltd' 'best' 'informationtechnolog'                      |
| FutureCalls Technology Private Limited              | 'limit' 'privat' 'futurecal' 'technolog'                 |
| mak controls and systems pvt ltd                    | 'ltd' 'mak' 'pvt' 'system' 'control'                     |
| NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew' 'nation' 'research'                     |
| The Madras Aluminium Company Ltd.                   | 'ltd' 'madra' 'compani' 'aluminium'                      |
| Shriram Institute for Industrial Research           | 'shriram' 'industri' 'institut' 'research'               |
| All India Carpet Trade Fair Committee               | 'fair' 'india' 'trade' 'carpet' 'committe'               |
| Tuff Security & Allied Services                     | 'alli' 'tuff' 'secur' 'servic'                           |
+-----------------------------------------------------+----------------------------------------------------------+
(20 rows)
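
To make the requested histogram concrete, here is a minimal client-side sketch in Python. It is only an illustration, not something tsearch provides: it assumes the co_name_vec values have been fetched as text in the stripped form shown in the sample data, and the names SAMPLE_VECTORS, lexeme_histogram and frequent_lexemes are hypothetical. It counts, for each lexeme, the number of company names it appears in, so that lexemes above some chosen threshold can be left out of the search query.

```python
import re
from collections import Counter

# Text form of a few stripped tsvectors, copied from the sample data.
# (Hypothetical variable name; in practice these would come from the table.)
SAMPLE_VECTORS = [
    "'trade' 'consult' 'partner' 'european'",
    "'ltd' 'pvt' 'chemic' 'gulbrandsen'",
    "'ltd' 'pvt' 'rgv' 'consult' 'telecom'",
    "'p' 'ltd' 'india' 'tavant' 'technolog'",
    "'limit' 'privat' 'futurecal' 'technolog'",
]

# Assumption: lexemes are single-quoted and contain no embedded quotes,
# which holds for the sample rows but not for every possible tsvector.
LEXEME_RE = re.compile(r"'([^']*)'")


def lexeme_histogram(vectors):
    """Count, for each lexeme, how many vectors (company names) contain it."""
    counts = Counter()
    for vec in vectors:
        # set() so a lexeme is counted at most once per company name
        counts.update(set(LEXEME_RE.findall(vec)))
    return counts


def frequent_lexemes(vectors, min_docs):
    """Return, sorted, the lexemes that occur in at least min_docs vectors."""
    hist = lexeme_histogram(vectors)
    return sorted(word for word, ndoc in hist.items() if ndoc >= min_docs)


if __name__ == "__main__":
    # Print the most common lexemes with their document counts.
    for word, ndoc in lexeme_histogram(SAMPLE_VECTORS).most_common(5):
        print(word, ndoc)
```

The threshold passed to frequent_lexemes is a tuning choice; for 0.3 million rows it would probably be more practical to do the counting inside the database than to pull every vector to the client.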