Thread: Find similar records (compare tsvectors)
Hello,
I had a database with articles or attachment stored in bytea format.
I also had a trigger: it insert/update the tsv column when a record is added/updated.
The tsv column had a GIN index.
With this setting, I can do very fast keyword search on the tsv.
Suppose I had a specific record (id=100000).
How to list similar records based on ranking?
In that case, I had to compare a tsvector with another tsvector.
I had this SQL which make the original tsv as a text and then to tsquery, Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery) as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i, (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id != original.id ORDER BY similarity;
items table:
id bigint
company varchar
industry varchar
description varchar
post_timestamp timestamp
attachment bytea
tsv tsvector
The problem is that this is very slow.
Any comment?
Thank and regards,
Patrick
Resend.
How to quickly compare the similarity of two tsvector?
On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote:
Hello,
I had a database with articles or attachment stored in bytea format.
I also had a trigger: it insert/update the tsv column when a record is added/updated.
The tsv column had a GIN index.
With this setting, I can do very fast keyword search on the tsv.
Suppose I had a specific record (id=100000).
How to list similar records based on ranking?
In that case, I had to compare a tsvector with another tsvector.
I had this SQL which make the original tsv as a text and then to tsquery, Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery) as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i, (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id != original.id ORDER BY similarity;
items table:
id bigint
company varchar
industry varchar
description varchar
post_timestamp timestamp
attachment bytea
tsv tsvector
The problem is that this is very slow.
Any comment?
Thank and regards,
Patrick
On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote: > Resend. > > How to quickly compare the similarity of two tsvector? > check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf > > On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> > wrote: > > > Hello, > > I had a database with articles or attachment stored in bytea format. > I also had a trigger: it insert/update the tsv column when a record is > added/updated. > The tsv column had a GIN index. > With this setting, I can do very fast keyword search on the tsv. > > Suppose I had a specific record (id=100000). > How to list similar records based on ranking? > In that case, I had to compare a tsvector with another tsvector. > > I had this SQL which make the original tsv as a text and then to tsquery, > Then I can compare a tsv and a tsquery. > SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery) > as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i, > (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id != > original.id ORDER BY similarity; > > items table: > id bigint > company varchar > industry varchar > description varchar > post_timestamp timestamp > attachment bytea > tsv tsvector > > The problem is that this is very slow. > Any comment? > > Thank and regards, > Patrick > >
Thanks.
smlar is fast and quite good.
I need find tuning on the search result.
On Saturday, March 7, 2015 12:07 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote:
> Resend.
>
> How to quickly compare the similarity of two tsvector?
>
check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf
>
> On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk>
> wrote:
>
>
> Hello,
>
> I had a database with articles or attachment stored in bytea format.
> I also had a trigger: it insert/update the tsv column when a record is
> added/updated.
> The tsv column had a GIN index.
> With this setting, I can do very fast keyword search on the tsv.
>
> Suppose I had a specific record (id=100000).
> How to list similar records based on ranking?
> In that case, I had to compare a tsvector with another tsvector.
>
> I had this SQL which make the original tsv as a text and then to tsquery,
> Then I can compare a tsv and a tsquery.
> SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
> as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
> (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id !=
> original.id ORDER BY similarity;
>
> items table:
> id bigint
> company varchar
> industry varchar
> description varchar
> post_timestamp timestamp
> attachment bytea
> tsv tsvector
>
> The problem is that this is very slow.
> Any comment?
>
> Thank and regards,
> Patrick
>
>
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
> Resend.
>
> How to quickly compare the similarity of two tsvector?
>
check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf
>
> On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk>
> wrote:
>
>
> Hello,
>
> I had a database with articles or attachment stored in bytea format.
> I also had a trigger: it insert/update the tsv column when a record is
> added/updated.
> The tsv column had a GIN index.
> With this setting, I can do very fast keyword search on the tsv.
>
> Suppose I had a specific record (id=100000).
> How to list similar records based on ranking?
> In that case, I had to compare a tsvector with another tsvector.
>
> I had this SQL which make the original tsv as a text and then to tsquery,
> Then I can compare a tsv and a tsquery.
> SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
> as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
> (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id !=
> original.id ORDER BY similarity;
>
> items table:
> id bigint
> company varchar
> industry varchar
> description varchar
> post_timestamp timestamp
> attachment bytea
> tsv tsvector
>
> The problem is that this is very slow.
> Any comment?
>
> Thank and regards,
> Patrick
>
>
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general