A DISTINCT problem removing duplicates - Mailing list pgsql-sql
From | Richard Huxton |
---|---|
Subject | A DISTINCT problem removing duplicates |
Date | |
Msg-id | 493E3E5F.6090304@archonet.com Whole thread Raw |
Responses |
Re: A DISTINCT problem removing duplicates
|
List | pgsql-sql |
The scenario is - a list of documents, some of which may be (near) duplicates of others, one document being in many duplicate-sets and a duplicate-set containing many documents. We want to see a list with only one document (any one) from each duplicate set. There's an example script attached. So: documents (docid SERIAL, title text, PRIMARY KEY (docid)); duplicates (docid int REFERENCES documents, dup_set SERIAL, PRIMARY KEY (docid, dup_set)); This allows one document to be part of multiple duplicate sets, but that's fine - this is a "fuzzy" match. If two documents match and one of them is already in the duplicates table then I add the second with the same dup_set value. If neither are present, generate a new set number. Match documents in a well-defined order and it's all nice and simple. A self-join on the duplicates table gives me a count of how many duplicates each document has. A left-join from the documents table can list documents and if/how many duplicates they have. The problem comes when I don't want to see duplicates: SELECT DISTINCT ON (dup_set) ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) ORDER BY dup_set, docid ; Documents without duplicates have a NULL dup_set. The DISTINCT ON considers two nulls to be equal, which means we only ever see one unduplicated document. I've got two work-arounds. The first is to create a separate sequence that doesn't overlap with dup_set's values and use that: CREATE SEQUENCE not_duplicate_seq MINVALUE -999999 MAXVALUE -1 CYCLE; SELECT DISTINCT ON (dup_set) COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups ... That works, but is arguably a bit too "clever" if you know what I mean. The other alternative is to separate duplicated and non-duplicated documents and UNION them. That's simple enough to see what's happening but does seem ugly. Anyone got anything more elegant? I'm happy to alter the duplicates table so long as it doesn't make it complicated to update. -- Richard Huxton Archonet Ltd BEGIN; CREATE SCHEMA duptest; SET search_path = duptest; CREATE TABLE documents ( docid SERIAL NOT NULL, title text, PRIMARY KEY (docid) ); CREATE TABLE duplicates ( docid int NOT NULL REFERENCES documents ON DELETE CASCADE, dup_set SERIAL NOT NULL, PRIMARY KEY (docid, dup_set) ); -- Five documents INSERT INTO documents (docid, title) SELECT i, 'document number ' || i FROM generate_series(1, 6) i; -- duplicates are (1,3) and (2,4) - 5,6 are not INSERT INTO duplicates (docid, dup_set) SELECT i, (i % 2)+1 FROM generate_series(1, 4) i; SELECT setval('documents_docid_seq', (SELECT max(docid)+1 FROM documents)); SELECT setval('duplicates_dup_set_seq', (SELECT max(dup_set)+1 FROM duplicates)); -- This is a list of all documents with how many duplicates they have SELECT dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) ORDER BY dup_set, docid ; -- This DOESN'T work because nulls are considered equal by DISTINCT ON SELECT DISTINCT ON (dup_set) dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) ORDER BY dup_set, docid ; -- Work around is to fake a unique id for non-duplicates CREATE SEQUENCE not_duplicate_seq MINVALUE -999999 MAXVALUE -1 CYCLE; SELECT DISTINCT ON (dup_set) COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) ORDER BY dup_set, docid ; -- Second alternative - UNION duplicated and non-duplicated documents SELECT DISTINCT ON (dup_set) ds.dup_set, d.docid, d.title, num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) WHERE num_dups > 0 UNION ALL SELECT ds.dup_set, d.docid, d.title, COALESCE(num_dups,0) AS num_dups FROM documents d LEFT JOIN ( SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups FROM duplicates dup1 JOIN duplicates dup2 USING (dup_set) GROUP BY dup1.docid, dup1.dup_set ) ds USING (docid) WHERE num_dups IS NULL ORDER BY dup_set, docid ; ROLLBACK;