Thread: Combine Top-k with similarity search extensions
I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
Shmagi,
Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.
From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions
I am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
It will add complexity and also no idea how to do it. Is there any alternative?
On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:
Shmagi,Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensionsI am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
OK its does add complexity,
Here is a functional md5 index on the whole string
drop table if exists text_table;
create table text_table
(
mystring text
);
create index text_md5 on text_table(md5(mystring));
insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');
select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;
create table text_table
(
mystring text
);
create index text_md5 on text_table(md5(mystring));
insert into text_table (mystring) values
('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Jim Smith'), ('J Smith');
select md5, count from
( select md5(mystring) md5 , count(*) count from text_table group by md5(mystring) ) subq
where count > 1 ;
From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
To: "tim child" <tim.child@comcast.net>
Cc: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
It will add complexity and also no idea how to do it. Is there any alternative?
On Fri, Nov 20, 2015 at 5:00 PM, <tim.child@comcast.net> wrote:
Shmagi,Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures for further analysis You could event try building a signature on the full text string.From: "Shmagi Kavtaradze" <kavtaradze.s@gmail.com>
To: pgsql-novice@postgresql.org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensionsI am performing similarity check over a column in a table with about 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rows and it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decrease amount and time needed? What I mean is that, for example when checking two sentences, if first several words does not match, to stop checking that sentences and move on.
<span style="font-family:Verdana; color:#000000; font-size:12pt;"><div>Dump to file, run shell script or C program to sort(sort -u). Searches and</div><div>comparisons work much better on sorted sets.</div><div><br /></div><div>Matt<br /></div><blockquoteid="replyBlockquote" style="border-left: 2px solid blue; margin-left: 8px; padding-left: 8px; font-size:10pt;color:black; font-family:verdana;" webmail="1"><div id="wmQuoteWrapper"> -------- Original Message --------<br/> Subject: Re: [NOVICE] Combine Top-k with similarity search extensions<br /> From: Shmagi Kavtaradze <<ahref="mailto:kavtaradze.s@gmail.com">kavtaradze.s@gmail.com</a>><br /> Date: Fri, November 20, 2015 8:13 am<br />To: <a href="mailto:tim.child@comcast.net">tim.child@comcast.net</a><br /> Cc: <a href="mailto:pgsql-novice@postgresql.org">pgsql-novice@postgresql.org</a><br/><br /><div dir="ltr">It will add complexityand also no idea how to do it. Is there any alternative?</div><div class="gmail_extra"><br /><div class="gmail_quote">OnFri, Nov 20, 2015 at 5:00 PM, <span dir="ltr"><<a href="mailto:tim.child@comcast.net" target="_blank">tim.child@comcast.net</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex"><div><div style="font-family:Arial;font-size:12pt;color:#000000"><div>Shmagi,<br/></div><div><br /></div><div>Take the first 20 textcharacters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then findall distinct signatures, or all rows with duplicate signatures for further analysis You could event try building asignature on the full text string.<br /></div><div><br /></div><div><br /></div><div><br /></div><hr /><div style="color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt"><b>From: </b>"ShmagiKavtaradze" <<a href="mailto:kavtaradze.s@gmail.com" target="_blank">kavtaradze.s@gmail.com</a>><br /><b>To:</b><a href="mailto:pgsql-novice@postgresql.org" target="_blank">pgsql-novice@postgresql.org</a><br /><b>Sent: </b>Friday,November 20, 2015 2:21:36 AM<br /><b>Subject: </b>[NOVICE] Combine Top-k with similarity search extensions<spanclass=""><br /><div><br /></div><div dir="ltr">I am performing similarity check over a column in a table withabout 3500 entries. Column is populated with text data from text file. Performing a check results in 3500 * 3500 rowsand it takes forever to calculate for my virtual machine. Is there any way to calculate for top-k results, to decreaseamount and time needed? What I mean is that, for example when checking two sentences, if first several words doesnot match, to stop checking that sentences and move on. </div></span></div><div><br /></div></div></div></blockquote></div><br/></div></div></blockquote></span>