Thread: access to lexems or access to parsed elements

access to lexems or access to parsed elements

From
"Massa, Harald Armin"
Date:
I want to access the single words in a text. Better yet: the relevant words (i.e. without stop words) in a text.

to_tsvector or casting gets me the lexems as a tsvector:

select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'

And I would like to access "brown", "fox", "jump", "lazi" and "quick" as single values that I insert into another table. 

But: no luck with any tries to convert to records, arrays or similiar.

Next step, the lesser-known-fts-functions:

select ts_parse('default','the quick brown fox jumped over the lazy fox')

(1,the)
(12," ")
(1,quick)
[...]
(1,fox)

is a set-returning-function, giving me 17 records of type pseudo-record. Stopwords still in there, so what. But: No chance of accessing the second field in that record.

Of course, there is allways:

select  substr(what::text,position(',' in what::text)+1,char_length(what::text)-position(',' in what::text)-1) from 
(
select ts_parse('default','the quick brown fox jumped over the lazy fox') as what
)x

but, comeon: having a two-field-record, casting it to one field of text, searching for the "," that separates the two fields and then split the one-field into two fields by substring?

So, is there a better way to access 

a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function

?
Harald

--
Harald Armin Massa     www.2ndQuadrant.de
PostgreSQL  Training, Services  and Support

2ndQuadrant Deutschland GmbH     
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

Re: access to lexems or access to parsed elements

From
Sushant Sinha
Date:
Can this fit?

select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
                   plainto_tsquery
-----------------------------------------------------
 'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'

-Sushant.

On Thu, 2011-08-25 at 18:21 +0200, Massa, Harald Armin wrote:
> I want to access the single words in a text. Better yet: the relevant
> words (i.e. without stop words) in a text.
>
>
> to_tsvector or casting gets me the lexems as a tsvector:
>
>
> select to_tsvector('the quick brown fox jumped over the lazy fox')
> ''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'
>
>
> And I would like to access "brown", "fox", "jump", "lazi" and "quick"
> as single values that I insert into another table.
>
>
> But: no luck with any tries to convert to records, arrays or similiar.
>
>
> Next step, the lesser-known-fts-functions:
>
>
> select ts_parse('default','the quick brown fox jumped over the lazy
> fox')
>
>
> (1,the)
> (12," ")
> (1,quick)
> [...]
> (1,fox)
>
>
> is a set-returning-function, giving me 17 records of type
> pseudo-record. Stopwords still in there, so what. But: No chance of
> accessing the second field in that record.
>
>
> Of course, there is allways:
>
>
> select  substr(what::text,position(',' in
> what::text)+1,char_length(what::text)-position(',' in what::text)-1)
> from
> (
> select ts_parse('default','the quick brown fox jumped over the lazy
> fox') as what
> )x
>
>
> but, comeon: having a two-field-record, casting it to one field of
> text, searching for the "," that separates the two fields and then
> split the one-field into two fields by substring?
>
>
> So, is there a better way to access
>
>
> a) the lexems of a tsvector
> b) the (unnamed) fields of a set-of-record-returning function
>
>
> ?
> Harald
>
>
> --
> Harald Armin Massa     www.2ndQuadrant.de
> PostgreSQL  Training, Services  and Support
>
> 2ndQuadrant Deutschland GmbH
> GF: Harald Armin Massa
> Amtsgericht Stuttgart, HRB 736399



Re: access to lexems or access to parsed elements

From
Ivan Sergio Borgonovo
Date:
On Thu, 25 Aug 2011 18:21:21 +0200
"Massa, Harald Armin" <harald@2ndQuadrant.de> wrote:

> I want to access the single words in a text. Better yet: the
> relevant words (i.e. without stop words) in a text.
>
> to_tsvector or casting gets me the lexems as a tsvector:

I wrote this piece of C code more than a year ago. [1]
It has been working in a production environment for quite a lot.

It just works with versions < 8.4 since it doesn't support *.
I'd be willing to maintain the module or even expand its
features and release it on any license that will please postgresql
community if my effort would actually make it more easily available
to other people but throwing it on the internet won't be enough.

If someone think it is worth the effort to help me understand how
that could happen I'd be glad to learn.


[1]
http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: access to lexems or access to parsed elements

From
"Massa, Harald Armin"
Date:
Sushant,

Can this fit?

select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
                  plainto_tsquery
-----------------------------------------------------
 'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'

no, this cannot fit. This just adds a third variation, this time a tsquery object. There is no way to access the single words in that ts_query  programmatically (besides the ugly  "cast to text and split at &')

Harald



--
Harald Armin Massa     www.2ndQuadrant.de
PostgreSQL  Training, Services  and Support

2ndQuadrant Deutschland GmbH     
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

Re: access to lexems or access to parsed elements

From
Marcin Mańk
Date:
> a) the lexems of a tsvector
ts_debug (it`s a plain sql function, may give You some inspiration for
Your own queries)

> b) the (unnamed) fields of a set-of-record-returning function
select * from ts_parse('default','the quick brown fox jumped over the lazy fox')
or
select (ts_parse('default','the quick brown fox jumped over the lazy
fox')).token


Greetings
Marcin Mańk

Re: access to lexems or access to parsed elements

From
"Daniel Verite"
Date:
    Massa, Harald Armin wrote:

> select ts_parse('default','the quick brown fox jumped over the lazy fox')
>
> (1,the)
> (12," ")
> (1,quick)
> [...]
> (1,fox)
>
> is a set-returning-function, giving me 17 records of type pseudo-record.
> Stopwords still in there, so what. But: No chance of accessing the second
> field in that record.

What about:
select w from ts_parse('default','the quick brown fox jumped over the lazy
fox')
  as lexems(i,w);


Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org