Thread: Tsearch2 question: getting histogram of the vector elements

Tsearch2 question: getting histogram of the vector elements

From
Rajesh Kumar Mallah
Date:
Greetings!

My original problem is to de duplicate a list of  around 0.3 million 
company names.

Since a company name can be potentially (mis)spelt in numerous ways 
exactmatch
obviously wont work.

To make the searches faster i am using tsearch. For each company name i 
want to
search other companies whose name is similar to the company in question.

Since inclusion of all the vector elements of a given company reduces the
chance of matching i am thinking of excluding the high frequency words
from the query.

Hence i need to find the high frequency elements like say 'consulting' , 
'limited' , 'Private'
'Industries' that occur commonly in company names.

In my table i have populated the co_name_vec feild as 
strip(to_tsvector(co_name))
can anyone help me analyzing the co_name_vec for the high frequency words?

Also i would like to know alternate / better solution to this problem.


Regds
Mallah.



SAMPLE DATA.

+-----------------------------------------------------+----------------------------------------------------------+
|                       co_name                       
|                       co_name_vec                        |
+-----------------------------------------------------+----------------------------------------------------------+
| European Trade Partner & Consulting                 | 'trade' 
'consult' 'partner' 'european'                   |
| Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt' 
'chemic' 'gulbrandsen'                       |
| Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group' 
'vision' 'karnataka' 'biotechnolog'       |
| Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 
'compani' 'hewlett' 'packard' 'globalsoft' |
| Shanon Construction Material Industries             | 'materi' 
'shanon' 'industri' 'construct'                 |
| singpore india trade rsources company               | 'india' 'trade' 
'rsourc' 'compani' 'singpor'             |
| RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt' 
'rgv' 'consult' 'telecom'                    |
| avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid' 
'inform' 'search' 'document'            |
| Tavant Technologies India (P) Ltd.                  | 'p' 'ltd' 
'india' 'tavant' 'technolog'                   |
| Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt' 
'india' 'fabrik' 'maschinen'                 |
| Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt' 
'ceram' 'manishri' 'refractori'              |
| xavier export  import  management  institute        | 'manag' 'export' 
'import' 'xavier' 'institut'            |
| Best InformationTechnology ltd.                     | 'ltd' 'best' 
'informationtechnolog'                      |
| FutureCalls Technology Private Limited              | 'limit' 'privat' 
'futurecal' 'technolog'                 |
| mak controls and systems pvt ltd                    | 'ltd' 'mak' 
'pvt' 'system' 'control'                     |
| NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew' 
'nation' 'research'                     |
| The Madras Aluminium Company Ltd.                   | 'ltd' 'madra' 
'compani' 'aluminium'                      |
| Shriram Institute for Industrial Research           | 'shriram' 
'industri' 'institut' 'research'               |
| All India Carpet Trade Fair Committee               | 'fair' 'india' 
'trade' 'carpet' 'committe'               |
| Tuff Security & Allied Services                     | 'alli' 'tuff' 
'secur' 'servic'                           |
+-----------------------------------------------------+----------------------------------------------------------+
(20 rows)



Re: Tsearch2 question: getting histogram of the vector elements

From
Oleg Bartunov
Date:
Read
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
("Check words statistics")

tsearch2 comes with stat function.

for example - top 10 words from apod (Astronomy picture of the day):

select * from stat('select ti from apod') order by ndoc desc, nentry desc,word limit 10;
 word  | ndoc | nentry
--------+------+--------year   | 1283 |   2064star   | 1273 |   3908imag   | 1267 |   1967light  | 1232 |   2206pictur
|1177 |   1442earth  | 1059 |   1798visibl |  992 |   1259bright |  936 |   1335котор  |  903 |   1441эт     |  879 |
1397


Oleg

On Thu, 11 Mar 2004, Rajesh Kumar Mallah wrote:

>
> Greetings!
>
> My original problem is to de duplicate a list of  around 0.3 million
> company names.
>
> Since a company name can be potentially (mis)spelt in numerous ways
> exactmatch
> obviously wont work.
>
> To make the searches faster i am using tsearch. For each company name i
> want to
> search other companies whose name is similar to the company in question.
>
> Since inclusion of all the vector elements of a given company reduces the
> chance of matching i am thinking of excluding the high frequency words
> from the query.
>
> Hence i need to find the high frequency elements like say 'consulting' ,
> 'limited' , 'Private'
> 'Industries' that occur commonly in company names.
>
> In my table i have populated the co_name_vec feild as
> strip(to_tsvector(co_name))
> can anyone help me analyzing the co_name_vec for the high frequency words?
>
> Also i would like to know alternate / better solution to this problem.
>
>
> Regds
> Mallah.
>
>
>
> SAMPLE DATA.
>
> +-----------------------------------------------------+----------------------------------------------------------+
> |                       co_name
> |                       co_name_vec                        |
> +-----------------------------------------------------+----------------------------------------------------------+
> | European Trade Partner & Consulting                 | 'trade'
> 'consult' 'partner' 'european'                   |
> | Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt'
> 'chemic' 'gulbrandsen'                       |
> | Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group'
> 'vision' 'karnataka' 'biotechnolog'       |
> | Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit'
> 'compani' 'hewlett' 'packard' 'globalsoft' |
> | Shanon Construction Material Industries             | 'materi'
> 'shanon' 'industri' 'construct'                 |
> | singpore india trade rsources company               | 'india' 'trade'
> 'rsourc' 'compani' 'singpor'             |
> | RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt'
> 'rgv' 'consult' 'telecom'                    |
> | avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid'
> 'inform' 'search' 'document'            |
> | Tavant Technologies India (P) Ltd.                  | 'p' 'ltd'
> 'india' 'tavant' 'technolog'                   |
> | Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt'
> 'india' 'fabrik' 'maschinen'                 |
> | Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt'
> 'ceram' 'manishri' 'refractori'              |
> | xavier export  import  management  institute        | 'manag' 'export'
> 'import' 'xavier' 'institut'            |
> | Best InformationTechnology ltd.                     | 'ltd' 'best'
> 'informationtechnolog'                      |
> | FutureCalls Technology Private Limited              | 'limit' 'privat'
> 'futurecal' 'technolog'                 |
> | mak controls and systems pvt ltd                    | 'ltd' 'mak'
> 'pvt' 'system' 'control'                     |
> | NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew'
> 'nation' 'research'                     |
> | The Madras Aluminium Company Ltd.                   | 'ltd' 'madra'
> 'compani' 'aluminium'                      |
> | Shriram Institute for Industrial Research           | 'shriram'
> 'industri' 'institut' 'research'               |
> | All India Carpet Trade Fair Committee               | 'fair' 'india'
> 'trade' 'carpet' 'committe'               |
> | Tuff Security & Allied Services                     | 'alli' 'tuff'
> 'secur' 'servic'                           |
> +-----------------------------------------------------+----------------------------------------------------------+
> (20 rows)
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Thanks very much , that was exactly what i needed.


regds
mallah.

> Rajesh Kumar Mallah wrote:
>> can anyone help me analyzing the co_name_vec for the high frequency words?
>
> Use tsearch2's stat() function.  Look at the example at the bottom of the following page:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> There's a lot of other documentation on the tsearch2 home page at:
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
>
> George Essig



-----------------------------------------
Over 1,00,000 exporters are waiting for your order! Click below to get
in touch with leading Indian exporters listed in the premier
trade directory Exporters Yellow Pages.
http://www.trade-india.com/dyn/gdh/eyp/




Re: Tsearch2 question: getting histogram of the vector elements

From
George Essig
Date:
Rajesh Kumar Mallah wrote:
> can anyone help me analyzing the co_name_vec for the high frequency words?

Use tsearch2's stat() function.  Look at the example at the bottom of the following page:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

There's a lot of other documentation on the tsearch2 home page at:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

George Essig