Thread: access to lexems or access to parsed elements
I want to access the single words in a text. Better yet: the relevant words (i.e. without stop words) in a text.
to_tsvector or casting gets me the lexems as a tsvector:
select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'
And I would like to access "brown", "fox", "jump", "lazi" and "quick" as single values that I insert into another table.
But: no luck with any tries to convert to records, arrays or similiar.
Next step, the lesser-known-fts-functions:
select ts_parse('default','the quick brown fox jumped over the lazy fox')
(1,the)
(12," ")
(1,quick)
[...]
(1,fox)
is a set-returning-function, giving me 17 records of type pseudo-record. Stopwords still in there, so what. But: No chance of accessing the second field in that record.
Of course, there is allways:
select substr(what::text,position(',' in what::text)+1,char_length(what::text)-position(',' in what::text)-1) from
(
select ts_parse('default','the quick brown fox jumped over the lazy fox') as what
)x
but, comeon: having a two-field-record, casting it to one field of text, searching for the "," that separates the two fields and then split the one-field into two fields by substring?
So, is there a better way to access
a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function
?
Harald
Harald Armin Massa www.2ndQuadrant.de
PostgreSQL Training, Services and Support
2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
Can this fit? select plainto_tsquery('english', 'the quick brown fox jumped over the lazy fox'); plainto_tsquery ----------------------------------------------------- 'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox' -Sushant. On Thu, 2011-08-25 at 18:21 +0200, Massa, Harald Armin wrote: > I want to access the single words in a text. Better yet: the relevant > words (i.e. without stop words) in a text. > > > to_tsvector or casting gets me the lexems as a tsvector: > > > select to_tsvector('the quick brown fox jumped over the lazy fox') > ''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2' > > > And I would like to access "brown", "fox", "jump", "lazi" and "quick" > as single values that I insert into another table. > > > But: no luck with any tries to convert to records, arrays or similiar. > > > Next step, the lesser-known-fts-functions: > > > select ts_parse('default','the quick brown fox jumped over the lazy > fox') > > > (1,the) > (12," ") > (1,quick) > [...] > (1,fox) > > > is a set-returning-function, giving me 17 records of type > pseudo-record. Stopwords still in there, so what. But: No chance of > accessing the second field in that record. > > > Of course, there is allways: > > > select substr(what::text,position(',' in > what::text)+1,char_length(what::text)-position(',' in what::text)-1) > from > ( > select ts_parse('default','the quick brown fox jumped over the lazy > fox') as what > )x > > > but, comeon: having a two-field-record, casting it to one field of > text, searching for the "," that separates the two fields and then > split the one-field into two fields by substring? > > > So, is there a better way to access > > > a) the lexems of a tsvector > b) the (unnamed) fields of a set-of-record-returning function > > > ? > Harald > > > -- > Harald Armin Massa www.2ndQuadrant.de > PostgreSQL Training, Services and Support > > 2ndQuadrant Deutschland GmbH > GF: Harald Armin Massa > Amtsgericht Stuttgart, HRB 736399
On Thu, 25 Aug 2011 18:21:21 +0200 "Massa, Harald Armin" <harald@2ndQuadrant.de> wrote: > I want to access the single words in a text. Better yet: the > relevant words (i.e. without stop words) in a text. > > to_tsvector or casting gets me the lexems as a tsvector: I wrote this piece of C code more than a year ago. [1] It has been working in a production environment for quite a lot. It just works with versions < 8.4 since it doesn't support *. I'd be willing to maintain the module or even expand its features and release it on any license that will please postgresql community if my effort would actually make it more easily available to other people but throwing it on the internet won't be enough. If someone think it is worth the effort to help me understand how that could happen I'd be glad to learn. [1] http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables -- Ivan Sergio Borgonovo http://www.webthatworks.it
Sushant,
--
Harald Armin Massa www.2ndQuadrant.de
PostgreSQL Training, Services and Support
2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
Can this fit?
select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
plainto_tsquery
-----------------------------------------------------
'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'
no, this cannot fit. This just adds a third variation, this time a tsquery object. There is no way to access the single words in that ts_query programmatically (besides the ugly "cast to text and split at &')
Harald
Harald Armin Massa www.2ndQuadrant.de
PostgreSQL Training, Services and Support
2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
> a) the lexems of a tsvector ts_debug (it`s a plain sql function, may give You some inspiration for Your own queries) > b) the (unnamed) fields of a set-of-record-returning function select * from ts_parse('default','the quick brown fox jumped over the lazy fox') or select (ts_parse('default','the quick brown fox jumped over the lazy fox')).token Greetings Marcin Mańk
Massa, Harald Armin wrote: > select ts_parse('default','the quick brown fox jumped over the lazy fox') > > (1,the) > (12," ") > (1,quick) > [...] > (1,fox) > > is a set-returning-function, giving me 17 records of type pseudo-record. > Stopwords still in there, so what. But: No chance of accessing the second > field in that record. What about: select w from ts_parse('default','the quick brown fox jumped over the lazy fox') as lexems(i,w); Best regards, -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org