Thread: Duplicate data within tables
Good day all
I am hoping someone can perhaps assist me some more once again.
I am trying to find out how you can identify duplicate data within a postgresql database.
The database I am working on at present seems to have a lot of data being duplicated and despite needing to find the problem within the code, the databases need to be sorted out as well.
There is about 974 tables in the main database, however I have never done such an excercise before so I hope that some more experienced guys can give me some pointers and guidelines.
I would really appreciate all the help I can get.
Regards
Machiel
I am hoping someone can perhaps assist me some more once again.
I am trying to find out how you can identify duplicate data within a postgresql database.
The database I am working on at present seems to have a lot of data being duplicated and despite needing to find the problem within the code, the databases need to be sorted out as well.
There is about 974 tables in the main database, however I have never done such an excercise before so I hope that some more experienced guys can give me some pointers and guidelines.
I would really appreciate all the help I can get.
Regards
Machiel
Quoting Machiel Richards <machielr@rdc.co.za>: > I am trying to find out how you can identify duplicate data > within a postgresql database. Machiel, The problem with duplicates can cause dozens of problems with the integrity and sanity of the data. There is no simple trick and you need to be very careful. The more careful the more complex data you work with. Let me show you a simple example, so you get the basic idea. Let's say we have a table "users" with columns: "id", "name" and "email" and some duplicated emails in that table, which we want to get rid of: SELECT * FROM users; id | name | email ----+---------+--------------------- 1 | john | john@example.com 2 | joseph | joseph@example.com 3 | johnny | john@example.com 4 | mike | michael@example.com 5 | admin | admin@example.com 6 | michael | michael@example.com 7 | foo | john@example.com To identify duplicates (email with count > 1): SELECT email, count(1) FROM users GROUP BY email ORDER BY count DESC; email | count ---------------------+------- john@example.com | 3 michael@example.com | 2 joseph@example.com | 1 admin@example.com | 1 To create a table without duplicates: CREATE TABLE distinct_users ( id serial PRIMARY KEY, name text, email text UNIQUE ); INSERT INTO distinct_users (name, email) SELECT min(name), email FROM users GROUP BY email; SELECT * FROM distinct_users; id | name | email ----+---------+--------------------- 1 | joseph | joseph@example.com 2 | michael | michael@example.com 3 | foo | john@example.com 4 | admin | admin@example.com To select only the duplicates: SELECT * FROM users WHERE name NOT IN (SELECT name FROM distinct_users); id | name | email ----+--------+--------------------- 1 | john | john@example.com 3 | johnny | john@example.com 4 | mike | michael@example.com As I said before, there is no simple trick that would resolve the duplicates problem in 974-table database. It is not only a design problem. Since the database is full of date, it became a content problem as well (if not mainly). You do not want to loose part of your data, do you? So you need to recreate the logic in more strict way and then put all your existing data into it. Esentially you have to take the duplicated data, think of it and pick the most important part of it and then define as many one-to-many relations, as you need. Let's get back to our "users" table and let's say the most significant information is the email. We already created the "distinct_users" table with UNIQUE emails. We picked one "name", now we need to work on the rest of the information a little bit. That would be: former id's and additional names of the user used in the past. ALTER TABLE users ADD COLUMN new_id integer; UPDATE users SET new_id = (SELECT distinct_users.id FROM distinct_users WHERE distinct_users.email = users.email); SELECT * FROM users ORDER BY new_id; id | name | email | new_id ----+---------+---------------------+-------- 2 | joseph | joseph@example.com | 1 4 | mike | michael@example.com | 2 6 | michael | michael@example.com | 2 1 | john | john@example.com | 3 7 | foo | john@example.com | 3 3 | johnny | john@example.com | 3 5 | admin | admin@example.com | 4 ALTER TABLE users ADD COLUMN default_name_now boolean DEFAULT FALSE; UPDATE users SET default_name_now = (SELECT users.name = distinct_users.name FROM distinct_users WHERE users.new_id = distinct_users.id); SELECT * FROM users ORDER BY new_id; id | name | email | new_id | default_name_now ----+---------+---------------------+--------+------------------ 2 | joseph | joseph@example.com | 1 | t 4 | mike | michael@example.com | 2 | f 6 | michael | michael@example.com | 2 | t 1 | john | john@example.com | 3 | f 7 | foo | john@example.com | 3 | t 3 | johnny | john@example.com | 3 | f 5 | admin | admin@example.com | 4 | t Here is how you can regain some control over the mess and improve your ability to manage the data. I hope, these examples will help you solve the problem. Good luck! -Mike -- Michał Roszka mike@if-then-else.pl