Re: A DISTINCT problem removing duplicates - Mailing list pgsql-sql

From	Richard Huxton
Subject	Re: A DISTINCT problem removing duplicates
Date	December 9, 2008 12:13:01
Msg-id	493E98FF.6090201@archonet.com
In response to	Re: A DISTINCT problem removing duplicates (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-sql

Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Tom Lane wrote:
>>> Richard Huxton <dev@archonet.com> writes:
>>>> Anyone got anything more elegant?
>>> Seems to me that no document should have an empty dup_set.  If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
> 
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
> 
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database? 

Yep - checking for duplicates can be a slow process - it's O(n^2) over
the number of documents and document-comparisons are probably O(n^2)
over length (or number of similarly-sized word-runs anyway). I'm failing
comparisons as early as I can, but there's a trade-off between speed
and false negatives.
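
To make the cost concrete, here is a minimal sketch of an all-pairs check with an early exit. It is not the code actually used in this thread: Jaccard similarity over word sets, the 0.8 threshold, and the names `similar` and `find_pairs` are all assumptions for illustration. The size-ratio test used here is an exact bound, so it cannot cause false negatives by itself; cheaper heuristic cut-offs are where the speed versus false-negative trade-off would come in.

```python
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Return True if the Jaccard similarity of two word sets meets the threshold.

    Hypothetical scoring function; the original post does not say which
    similarity measure is used.
    """
    if not a or not b:
        return False
    small, large = sorted((len(a), len(b)))
    # Jaccard can never exceed small/large, so fail early on very
    # different sizes without computing the intersection.
    if small / large < threshold:
        return False
    overlap = len(a & b)
    return overlap / (len(a) + len(b) - overlap) >= threshold


def find_pairs(docs, threshold=0.8):
    """Compare every pair of documents: O(n^2) in the number of documents."""
    return [
        (i, j)
        for i, j in combinations(sorted(docs), 2)
        if similar(docs[i], docs[j], threshold)
    ]
```

Here `docs` is assumed to be a dict mapping a document id to its set of words, e.g. `find_pairs({1: {"a", "b"}, 2: {"a", "b", "c"}}, threshold=0.5)`.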

> Seems like you have
> issues with that anyway.  If you already know A,B are dups and
> separately that C,D are dups, and you later decide B and C are dups,
> what do you do?

Not necessarily a problem. I'm using "duplicate" very loosely here -
it's more like "very similar to" so it's entirely possible to have sets
(a,b) (b,c) (c,d) and everything be valid just by adding sentences to
the end of each document. Similarity scoring should allow for
insertion/deletion of single words or whole (quite extensive) blocks of
text.
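
The difference between the two views can be sketched as follows. This is an illustration only, with hypothetical helper names (`neighbours`, `merged_sets`): keeping "very similar to" as plain pairs leaves (a,b) (b,c) (c,d) as three separate facts, whereas treating them as true duplicates and merging (here with a small union-find) collapses everything into one set, which is the renumbering question raised above.

```python
def neighbours(pairs, doc):
    """Documents directly recorded as similar to doc (no transitivity)."""
    out = set()
    for x, y in pairs:
        if x == doc:
            out.add(y)
        elif y == doc:
            out.add(x)
    return out


def merged_sets(pairs):
    """Merge pairs transitively into dup_set-style groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in pairs:
        parent[find(x)] = find(y)

    groups = {}
    for doc in parent:
        groups.setdefault(find(doc), set()).add(doc)
    return sorted(groups.values(), key=min)


pairs = [("a", "b"), ("b", "c"), ("c", "d")]
print(neighbours(pairs, "a"))   # only b is directly similar to a
print(merged_sets(pairs))       # one merged group containing a, b, c and d
```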

Of course at the moment, as I tweak what I mean by "duplicate" I have to
re-run the check over at least a sizeable chunk of the documents to see
if I prefer it.

Oh - the comparison is outside the DB at the moment, but it's based on
the stemmed tsvector of each document anyway, so it's crying out to be
pushed into the DB once I'm happy it works.
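
For comparison outside the DB, one way to get at the stems is to read the text form of a tsvector (e.g. `'cat':3 'fat':2,4`) and keep only the lexemes. The sketch below is an assumption about how that might be done, not the code used here; `lexemes` is a hypothetical helper, and it handles only the simple case of quoted lexemes with doubled single quotes, ignoring positions and weights.

```python
import re

# A quoted lexeme: anything between single quotes, where a literal
# quote inside the lexeme is written as two single quotes.
_LEXEME = re.compile(r"'((?:[^']|'')*)'")


def lexemes(tsvector_text):
    """Extract the set of lexemes from a tsvector's text representation."""
    return {m.replace("''", "'") for m in _LEXEME.findall(tsvector_text)}


print(lexemes("'cat':3 'fat':2,4 'sat':1"))
```

The resulting sets can then be fed to whatever set-based similarity score is being tried.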

--
  Richard Huxton
  Archonet Ltd
