Thread: A DISTINCT problem removing duplicates

A DISTINCT problem removing duplicates

From: Richard Huxton
The scenario: a list of documents, some of which may be (near)
duplicates of others, with one document possibly belonging to many
duplicate sets and a duplicate set containing many documents.

We want to see a list with only one document (any one) from each
duplicate set. There's an example script attached.

So:

documents  (docid SERIAL, title text, PRIMARY KEY (docid));
duplicates (docid int REFERENCES documents, dup_set SERIAL,
            PRIMARY KEY (docid, dup_set));

This allows one document to be part of multiple duplicate sets, but
that's fine - this is a "fuzzy" match. If two documents match and one of
them is already in the duplicates table then I add the second with the
same dup_set value. If neither is present, generate a new set number.
Match documents in a well-defined order and it's all nice and simple.
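
That assignment logic can be sketched in Python (a hypothetical in-memory illustration only, not the actual loader; the `record_duplicate` helper and the counter standing in for the dup_set SERIAL are assumptions):

```python
import itertools

# Sketch of the dup_set assignment described above: if either document
# already belongs to a set, reuse that set number; if neither does,
# allocate a fresh one.
next_set = itertools.count(1)   # stands in for the dup_set SERIAL
doc_sets = {}                   # docid -> dup_set (first set only, for brevity)

def record_duplicate(doc_a, doc_b):
    """Record that doc_a and doc_b matched, mirroring the INSERT logic."""
    existing = doc_sets.get(doc_a) or doc_sets.get(doc_b)
    dup_set = existing if existing is not None else next(next_set)
    doc_sets.setdefault(doc_a, dup_set)
    doc_sets.setdefault(doc_b, dup_set)
    return dup_set

# Matching in a well-defined docid order keeps the numbering stable:
record_duplicate(1, 3)   # new set 1
record_duplicate(2, 4)   # new set 2
```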

A self-join on the duplicates table gives me a count of how many
duplicates each document has. A left-join from the documents table can
list documents and if/how many duplicates they have. The problem comes
when I don't want to see duplicates:

SELECT DISTINCT ON (dup_set)
    ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
ORDER BY dup_set, docid
;

Documents without duplicates have a NULL dup_set. The DISTINCT ON
considers two nulls to be equal, which means we only ever see one
unduplicated document.
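
This NULL-grouping behaviour is standard SQL DISTINCT semantics rather than a Postgres quirk; it can be demonstrated with nothing more than sqlite3 from the Python standard library (DISTINCT ON itself is Postgres-only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Two rows whose only value is NULL: DISTINCT treats them as
# "not distinct" from each other and collapses them into one row.
rows = con.execute(
    "SELECT DISTINCT x FROM (SELECT NULL AS x UNION ALL SELECT NULL AS x)"
).fetchall()
# rows == [(None,)]
```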

I've got two work-arounds. The first is to create a separate sequence
that doesn't overlap with dup_set's values and use that:

CREATE SEQUENCE not_duplicate_seq MINVALUE -999999 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set)
    COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
    d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
...

That works, but is arguably a bit too "clever" if you know what I mean.

The other alternative is to separate duplicated and non-duplicated
documents and UNION them. That's simple enough to see what's happening
but does seem ugly.

Anyone got anything more elegant? I'm happy to alter the duplicates
table so long as it doesn't make it complicated to update.

--
  Richard Huxton
  Archonet Ltd
BEGIN;

CREATE SCHEMA duptest;

SET search_path = duptest;

CREATE TABLE documents (
    docid SERIAL NOT NULL,
    title text,
    PRIMARY KEY (docid)
);

CREATE TABLE duplicates (
    docid   int NOT NULL REFERENCES documents ON DELETE CASCADE,
    dup_set SERIAL NOT NULL,
    PRIMARY KEY (docid, dup_set)
);

-- Six documents
INSERT INTO documents (docid, title)
SELECT i, 'document number ' || i FROM generate_series(1, 6) i;

-- duplicates are (1,3) and (2,4) - 5,6 are not
INSERT INTO duplicates (docid, dup_set)
SELECT i, (i % 2)+1 FROM generate_series(1, 4) i;

SELECT setval('documents_docid_seq', (SELECT max(docid)+1 FROM documents));
SELECT setval('duplicates_dup_set_seq', (SELECT max(dup_set)+1 FROM duplicates));

-- This is a list of all documents with how many duplicates they have
SELECT dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
ORDER BY dup_set, docid
;

-- This DOESN'T work because nulls are considered equal by DISTINCT ON
SELECT DISTINCT ON (dup_set)
    dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
ORDER BY dup_set, docid
;

-- Work around is to fake a unique id for non-duplicates
CREATE SEQUENCE not_duplicate_seq MINVALUE -999999 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set)
    COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
    d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
ORDER BY dup_set, docid
;

-- Second alternative - UNION duplicated and non-duplicated documents
SELECT DISTINCT ON (dup_set)
    ds.dup_set, d.docid, d.title, num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
WHERE num_dups > 0
UNION ALL
SELECT
    ds.dup_set, d.docid, d.title, COALESCE(num_dups,0) AS num_dups
FROM
    documents d
LEFT JOIN
    (
        SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
        FROM duplicates dup1
        JOIN duplicates dup2 USING (dup_set)
        GROUP BY dup1.docid, dup1.dup_set
    ) ds
USING (docid)
WHERE num_dups IS NULL
ORDER BY dup_set, docid
;

ROLLBACK;

Re: A DISTINCT problem removing duplicates

From: Tom Lane
Richard Huxton <dev@archonet.com> writes:
> Anyone got anything more elegant?

Seems to me that no document should have an empty dup_set.  If it's not
a match to any existing document, then immediately assign a new dup_set
number to it.
        regards, tom lane


Re: A DISTINCT problem removing duplicates

From: Richard Huxton
Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Anyone got anything more elegant?
> 
> Seems to me that no document should have an empty dup_set.  If it's not
> a match to any existing document, then immediately assign a new dup_set
> number to it.

That was my initial thought too, but it means when I actually find a
duplicate I have to decide which "direction" to renumber them in. It
also means probably keeping a summary table with counts to show which
are duplicates, since the duplicates table is now the same size as the
documents table.


--
  Richard Huxton
  Archonet Ltd


Re: A DISTINCT problem removing duplicates

From: Tom Lane
Richard Huxton <dev@archonet.com> writes:
> Tom Lane wrote:
>> Richard Huxton <dev@archonet.com> writes:
>>> Anyone got anything more elegant?
>> 
>> Seems to me that no document should have an empty dup_set.  If it's not
>> a match to any existing document, then immediately assign a new dup_set
>> number to it.

> That was my initial thought too, but it means when I actually find a
> duplicate I have to decide which "direction" to renumber them in.

Hmm, so you mean you might decide that two docs are duplicates sometime
after initially putting them both in the database?  Seems like you have
issues with that anyway.  If you already know A,B are dups and
separately that C,D are dups, and you later decide B and C are dups,
what do you do?
        regards, tom lane
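
One standard answer to the merging question Tom raises (not something either poster proposes in the thread) is a union-find / disjoint-set structure, which folds overlapping sets together automatically when a later match bridges two existing sets:

```python
# Union-find (disjoint-set) sketch: shown only to illustrate how the
# A,B + C,D + later B,C situation could be merged. Not from the thread.
parent = {}

def find(doc):
    """Return the representative of doc's set, with path halving."""
    parent.setdefault(doc, doc)
    while parent[doc] != doc:
        parent[doc] = parent[parent[doc]]
        doc = parent[doc]
    return doc

def union(a, b):
    """Merge the sets containing a and b."""
    parent[find(a)] = find(b)

union("A", "B")   # A,B are dups
union("C", "D")   # C,D are dups
union("B", "C")   # later: B,C are dups -> all four end up in one set
```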


Re: A DISTINCT problem removing duplicates

From: Richard Huxton
Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Tom Lane wrote:
>>> Richard Huxton <dev@archonet.com> writes:
>>>> Anyone got anything more elegant?
>>> Seems to me that no document should have an empty dup_set.  If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
> 
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
> 
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database? 

Yep - checking for duplicates can be a slow process - it's O(n^2) over
the number of documents and document-comparisons are probably O(n^2)
over length (or number of similarly-sized word-runs anyway). I'm
failing comparisons as early as I can, but there's a trade-off between
speed and false negatives.

> Seems like you have
> issues with that anyway.  If you already know A,B are dups and
> separately that C,D are dups, and you later decide B and C are dups,
> what do you do?

Not necessarily a problem. I'm using "duplicate" very loosely here -
it's more like "very similar to" so it's entirely possible to have sets
(a,b) (b,c) (c,d) and everything be valid just by adding sentences to
the end of each document. Similarity scoring should allow for
insertion/deletion of single words or whole (quite extensive) blocks of
text.
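
A similarity score with roughly those properties can be sketched with word shingles and a Jaccard coefficient (an illustration of the general idea only; the thread never specifies the actual scoring, and the `k=3` shingle size is an assumption):

```python
def shingles(text, k=3):
    """Set of k-word runs; inserting or deleting a word only disturbs
    the few shingles that overlap the edit point."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard coefficient of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Appending a sentence leaves most shingles shared, so two documents
# that differ only at the end still score close to 1.
```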

Of course at the moment, as I tweak what I mean by "duplicate" I have to
re-run the check over at least a sizeable chunk of the documents to see
if I prefer it.

Oh - the comparison is outside the DB at the moment, but it's based on
the stemmed tsvector of each document anyway, so it's crying out to be
pushed into the DB once I'm happy it works.

--
  Richard Huxton
  Archonet Ltd