Using Full Text Search to create TurnItIn light - Mailing list pgsql-general

From Jason Foster
Subject Using Full Text Search to create TurnItIn light
Date
Msg-id F89192EE-28B4-4E0A-9830-AC4B6EA4A8CB@gmail.com
Whole thread Raw
List pgsql-general
I've been having my students post journal entries to a Drupal site, and we just noticed that some of them have been
copyingfrom each other (surprise, surprise).  If this was done with reports or other submissions, we'd just submit the
deliverablesto TurnItIn.com and go from there. 

In this case I'd rather run the similarity checking on our local server, and my guess is that Full Text Search is the
wayto go.  I've found a couple of online posts about such things, but they're not quite as clear as I was hoping.  I'm
hopingthat the list can help me out (and keep this from becoming an O(n^4) challenge). 

What I have is essentially a table containing an ID and some TEXT.

My first attempt used the similarity() function from pg_trgm as follows:

select
  aid
 ,bid
 ,similarity(source,dest) sim
from
(
  select
     a.id aid
    ,a.body source
    ,b.id bid
    ,b.body dest
  from
    posts a
  cross join
    posts b
  where
    and a.id != b.id
) as inside
order by sim desc;

I haven't been thinking in SQL terms for a while, so I didn't come up with a way to eliminate half of my checks by
exploitingsymmetry (a simple way to accomplish this would be helpful).  Unfortunately this solution will take way too
longto execute. 

I'm hoping that the full text features can be shoehorned into this problem, but from what little I've seen it will be
trickybecause comparing a tsvector to another tsvector isn't supported.  Plus there's weird (to the novice) indexing
thatwill (I hope) help to keep the complexity down. 

Thoughts from the list are most welcome, and I'll write up a HOWTO once I get this working.

Thanks!

Jason

pgsql-general by date:

Previous
From: Bill Todd
Date:
Subject: Re: Excessive (and slow) fsync() within single transaction
Next
From: Greg Smith
Date:
Subject: Re: Excessive (and slow) fsync() within single transaction