Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Tom Lane wrote:
>>> Richard Huxton <dev@archonet.com> writes:
>>>> Anyone got anything more elegant?
>>> Seems to me that no document should have an empty dup_set. If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
>
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
>
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database?
Yep - checking for duplicates can be a slow process - it's O(n^2) over
the number of documents and document-comparisons are probably O(n^2)
over length (or number of similarly-sized word-runs anyway). I'm
failing comparisons as early as I can, but there's a trade-off between
speed and false negatives.
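To give an idea of the early-failing: something like this (an untested sketch, not my actual code - the thresholds and the "documents are lists of stemmed words" representation are just assumptions for illustration):

```python
from difflib import SequenceMatcher

def similar(doc_a, doc_b, min_len_ratio=0.5, threshold=0.7):
    """Cheap checks first, expensive comparison last; bail out early.

    doc_a, doc_b: documents as lists of stemmed words.
    """
    # Cheapest check: wildly different lengths can't be near-duplicates.
    shorter, longer = sorted((len(doc_a), len(doc_b)))
    if longer == 0 or shorter / longer < min_len_ratio:
        return False
    # Next cheapest: coarse token-set overlap (Jaccard) as a filter.
    set_a, set_b = set(doc_a), set(doc_b)
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    if jaccard < threshold / 2:
        return False
    # Only now pay for the expensive positional comparison.
    return SequenceMatcher(None, doc_a, doc_b).ratio() >= threshold
```

Raising the cut-offs in the cheap stages speeds things up but is exactly where the false negatives creep in.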
> Seems like you have
> issues with that anyway. If you already know A,B are dups and
> separately that C,D are dups, and you later decide B and C are dups,
> what do you do?
Not necessarily a problem. I'm using "duplicate" very loosely here -
it's more like "very similar to" so it's entirely possible to have sets
(a,b) (b,c) (c,d) and everything be valid just by adding sentences to
the end of each document. Similarity scoring should allow for
insertion/deletion of single words or whole (quite extensive) blocks of
text.
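The non-transitivity is easy to demonstrate with a toy scorer (again a sketch, with made-up token lists - the real scoring is more involved than a plain sequence ratio):

```python
from difflib import SequenceMatcher

def score(a, b):
    """Similarity of two token lists, tolerant of insertions/deletions."""
    return SequenceMatcher(None, a, b).ratio()

base = ["the", "quick", "brown", "fox"] * 5   # document a
b = base + ["extra", "sentence", "one"] * 3   # a plus appended text
c = b + ["extra", "sentence", "two"] * 3      # b plus more appended text
```

With a 0.7 cut-off, score(base, b) and score(b, c) both pass but score(base, c) doesn't - so (a,b) and (b,c) are "duplicate" pairs while (a,c) isn't, and all three sets are valid.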
Of course, at the moment, whenever I tweak what I mean by "duplicate" I
have to re-run the check over at least a sizeable chunk of the
documents to see whether I prefer the new definition.
Oh - the comparison is outside the DB at the moment, but it's based on
the stemmed tsvector of each document anyway, so it's crying out to be
pushed into the DB once I'm happy it works.
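Since the input per document is already a set of stemmed lexemes, the coarse overlap stage maps onto it directly - e.g. (hypothetical sketch, parsing the text form of a tsvector; real lexemes with embedded quotes would need more care):

```python
import re

def lexemes(tsvector_text):
    """Extract the lexeme set from a tsvector's text representation,
    e.g. "'brown':3 'fox':4" -> {'brown', 'fox'}."""
    return set(re.findall(r"'([^']*)'", tsvector_text))

def jaccard(set_a, set_b):
    """Overlap score between two lexeme sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

The positional comparison would need the lexeme positions as well, but those are in the tsvector too.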
--
  Richard Huxton
  Archonet Ltd