Thread: Duplicate data within tables

Duplicate data within tables

From
Machiel Richards
Date:
Good day all

            I am hoping someone can perhaps assist me some more once again.

            I am trying to find out how you can identify duplicate data within a postgresql database.

            The database I am working on at present seems to have a lot of data being duplicated and despite needing to find the problem within the code, the databases need to be sorted out as well.

            There is about 974 tables in the main database, however I have never done such an excercise before so I hope that some more experienced guys can give me some pointers and guidelines.

            I would really appreciate all the help I can get.


Regards
Machiel

Re: Duplicate data within tables

From
mike@if-then-else.pl
Date:
Quoting Machiel Richards <machielr@rdc.co.za>:

> I am trying to find out how you can identify duplicate data
> within a postgresql database.

Machiel,

The problem with duplicates can cause dozens of problems with the integrity
and sanity of the data.  There is no simple trick and you need to be very
careful.  The more careful the more complex data you work with.  Let me
show you a simple example, so you get the basic idea.  Let's say we have a
table "users" with columns: "id", "name" and "email" and some duplicated
emails in that table, which we want to get rid of:

SELECT * FROM users;

id |  name   |        email
----+---------+---------------------
  1 | john    | john@example.com
  2 | joseph  | joseph@example.com
  3 | johnny  | john@example.com
  4 | mike    | michael@example.com
  5 | admin   | admin@example.com
  6 | michael | michael@example.com
  7 | foo     | john@example.com

To identify duplicates (email with count > 1):

SELECT email, count(1) FROM users GROUP BY email ORDER BY count DESC;

        email        | count
---------------------+-------
john@example.com    |     3
michael@example.com |     2
joseph@example.com  |     1
admin@example.com   |     1

To create a table without duplicates:

CREATE TABLE distinct_users (
    id serial PRIMARY KEY,
    name text,
    email text UNIQUE
);

INSERT INTO distinct_users (name, email) SELECT min(name), email FROM users
GROUP BY email;

SELECT * FROM distinct_users;

id |  name   |        email
----+---------+---------------------
  1 | joseph  | joseph@example.com
  2 | michael | michael@example.com
  3 | foo     | john@example.com
  4 | admin   | admin@example.com

To select only the duplicates:

SELECT * FROM users WHERE name NOT IN (SELECT name FROM distinct_users);

id |  name  |        email
----+--------+---------------------
  1 | john   | john@example.com
  3 | johnny | john@example.com
  4 | mike   | michael@example.com

As I said before, there is no simple trick that would resolve the duplicates
problem in 974-table database.  It is not only a design problem.  Since the
database is full of date, it became a content problem as well (if not
mainly).  You do not want to loose part of your data, do you?  So you need
to recreate the logic in more strict way and then put all your existing
data into it.  Esentially you have to take the duplicated data, think of it
and pick the most important part of it and then define as many one-to-many
relations, as you need. Let's get back to our "users" table and let's say
the most significant information is the email. We already created the
"distinct_users" table with UNIQUE emails. We picked one "name", now we
need to work on the rest of the information a little bit.  That would be:
former id's and additional names of the user used in the past.

ALTER TABLE users ADD COLUMN new_id integer;
UPDATE users SET new_id = (SELECT distinct_users.id FROM distinct_users
WHERE distinct_users.email = users.email);

SELECT * FROM users ORDER BY new_id;

id |  name   |        email        | new_id
----+---------+---------------------+--------
  2 | joseph  | joseph@example.com  |      1
  4 | mike    | michael@example.com |      2
  6 | michael | michael@example.com |      2
  1 | john    | john@example.com    |      3
  7 | foo     | john@example.com    |      3
  3 | johnny  | john@example.com    |      3
  5 | admin   | admin@example.com   |      4


ALTER TABLE users ADD COLUMN default_name_now boolean DEFAULT FALSE;
UPDATE users SET default_name_now = (SELECT users.name = distinct_users.name
FROM distinct_users WHERE users.new_id = distinct_users.id);

SELECT * FROM users ORDER BY new_id;

id |  name   |        email        | new_id | default_name_now
----+---------+---------------------+--------+------------------
  2 | joseph  | joseph@example.com  |      1 | t
  4 | mike    | michael@example.com |      2 | f
  6 | michael | michael@example.com |      2 | t
  1 | john    | john@example.com    |      3 | f
  7 | foo     | john@example.com    |      3 | t
  3 | johnny  | john@example.com    |      3 | f
  5 | admin   | admin@example.com   |      4 | t

Here is how you can regain some control over the mess and improve your
ability to manage the data.  I hope, these examples will help you solve the
problem.

Good luck!

    -Mike

--
Michał Roszka
mike@if-then-else.pl