Thread: Combine Top-k with similarity search extensions

Combine Top-k with similarity search extensions

From

Shmagi Kavtaradze

Date:

20 November 2015, 13:21:40

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

From

tim.child@comcast.net

Date:

20 November 2015, 19:00:40

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

Re: Combine Top-k with similarity search extensions

From

Shmagi Kavtaradze

Date:

20 November 2015, 19:13:27

It will add complexity and also no idea how to do it. Is there any alternative?

On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

From

tim.child@comcast.net

Date:

20 November 2015, 19:42:51

OK its does add complexity,

Here is a functional md5 index on the whole string

drop table if exists text_table;
create table text_table
(
mystring text

);

create index text_md5 on text_table(md5(mystring));

insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');

select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions

It will add complexity and also no idea how to do it. Is there any alternative?

On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:

Shmagi,

Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.

From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions

I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.

Re: Combine Top-k with similarity search extensions

From

"马修/Матвей/Mateo/Matt Buse"

Date:

21 November 2015, 05:00:25

<span style="font-family:Verdana; color:#000000; font-size:12pt;"><div>Dump to file, run shell script or C program to
sort(sort -u).  Searches and</div><div>comparisons work much better on sorted sets.</div><div><br /></div><div>Matt<br
/></div><blockquoteid="replyBlockquote" style="border-left: 2px solid blue; margin-left: 8px; padding-left: 8px;
font-size:10pt;color:black; font-family:verdana;" webmail="1"><div id="wmQuoteWrapper"> -------- Original Message
--------<br/> Subject: Re: [NOVICE] Combine Top-k with similarity search extensions<br /> From: Shmagi Kavtaradze
<<ahref="mailto:kavtaradze.s@gmail.com">kavtaradze.s@gmail.com</a>><br /> Date: Fri, November 20, 2015 8:13 am<br
/>To: <a href="mailto:tim.child@comcast.net">tim.child@comcast.net</a><br /> Cc: <a
href="mailto:pgsql-novice@postgresql.org">pgsql-novice@postgresql.org</a><br/><br /><div dir="ltr">It will add
complexityand also no idea how to do it. Is there any alternative?</div><div class="gmail_extra"><br /><div
class="gmail_quote">OnFri, Nov 20, 2015 at 5:00 PM, <span dir="ltr"><<a href="mailto:tim.child@comcast.net"
target="_blank">tim.child@comcast.net</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px#ccc solid;padding-left:1ex"><div><div
style="font-family:Arial;font-size:12pt;color:#000000"><div>Shmagi,<br/></div><div><br /></div><div>Take the first 20 
textcharacters and compute and store the CRC32 or MD5  of that value.  That value acts as a signature. You can then
findall distinct signatures,  or all rows with duplicate signatures for further analysis  You could event try building
asignature on the full text string.<br /></div><div><br /></div><div><br /></div><div><br /></div><hr /><div
style="color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt"><b>From:
</b>"ShmagiKavtaradze" <<a href="mailto:kavtaradze.s@gmail.com" target="_blank">kavtaradze.s@gmail.com</a>><br
/><b>To:</b><a href="mailto:pgsql-novice@postgresql.org" target="_blank">pgsql-novice@postgresql.org</a><br /><b>Sent:
</b>Friday,November 20, 2015 2:21:36 AM<br /><b>Subject: </b>[NOVICE] Combine Top-k with similarity search
extensions<spanclass=""><br /><div><br /></div><div dir="ltr">I am performing similarity check over a column in a table
withabout 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500
rowsand it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to
decreaseamount and time needed? What I mean is that, for example when checking two sentences, if first several words
doesnot match, to stop checking that sentences and move on. </div></span></div><div><br
/></div></div></div></blockquote></div><br/></div></div></blockquote></span>