Thread: identifying duplicates in table with redundancies
I loaded data from a spread into a interim table so I can analyze the quality of the data. The table contains an entry for every student (250K records) and his trainer. Eventually, I want to extract a unique list of trainers from it. But first of all I want to check for duplicates: 1) multiples trainer names for same trainer id 2) multiple trainer ids for same trainer name I cobbled together the SQL and it does the job but it seems rather convoluted. I would like to know how I can improve it. CREATE TABLE student ( id INTEGER NOT NULL, name VARCHAR(256) NOT NULL, trainer_id INTEGER, trainer_name VARCHAR(256), ); ==== EXAMPLE DATA 22 John 1 Macdonald 23 Jane 1 MacDonald 24 Paul 1 MacDonald 25 Dick 2 Smith 26 Bill 3 Smith 27 Kate 3 Smith ==== -- outputs trainer ids which appear under different names select trainer_id, trainer_name from ( -- different id/name combinations select distinct on (trainer_name) trainer_id, trainer_name from student wheretrainer_id in ( -- trainer ids with appearing with different names select distinct on (id) id from ( -- distinct trainer id-name select distinct on (trainer_id,trainer_name) trainer_idas id, trainer_name as name from student ) as trainer group by trainer.id having count(trainer.name) > 1 ) ) as y order by trainer_id -- Best Regards, Tarlika Elisabeth Schmitz
Howdy, Tarlika. First, did you past correctly your query into your mail? I am asking this because your query doesn't seem work for me, it returns an empty list :-| Your most nested query, this one, > -- distinct trainer id-name > select distinct on (trainer_id,trainer_name) > trainer_id as id, > trainer_name as name from student returns this 1|"MacDonald" 2|"Smith" 3|"Smith" Then, your next most nested query, which includes the previous one, -- trainer ids with appearing with different names > select distinct on (id) id > from > ( > -- distinct trainer id-name > select distinct on (trainer_id,trainer_name) > trainer_id as id, > trainer_name as name from student > ) as trainer > group by trainer.id > having count (trainer.name) > 1 will group by trainer_id and look for trainer_ids which appear more than once, and, as you can see from the above listing there is none... So, I'm asking if you typed it correctly on mail, because it doesn't seem to work, at least not for me.... > 1) multiples trainer names for same trainer id > 2) multiple trainer ids for same trainer name To achieve 2) I would use this query SELECT DISTINCT trainer_id,trainer_name FROM ( SELECT trainer_name -- The field you want to test for duplicates FROM ( SELECT DISTINCT "trainer_id","trainer_name" FROM student ) x GROUP BY "trainer_name" -- the field you want to test for duplicates HAVING (COUNT(*) > 1) ) z NATURAL JOIN student y It will give you a list of the trainer names who have more than one trainer ID and the respective trainer IDS. For your particular example data result will be 2|"Smith" 3|"Smith" As Smith is the only trainer with records with diferent trainer IDs. Question : Can this be what you want? You can add ORDER BY clauses in the end as you wish to order the list in any way you'd like. If you want to achieve 1) just substitute the trainer_name by trainer_id on the commented places. But on the example data you provided you don't have the 1) situation, am I right? So it will output an empty list. Best, Oliver ----- Original Message ----- From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de> To: <pgsql-sql@postgresql.org> Sent: Thursday, September 23, 2010 10:39 PM Subject: [SQL] identifying duplicates in table with redundancies >I loaded data from a spread into a interim table so I can analyze the > quality of the data. > > The table contains an entry for every student (250K records) and his > trainer. Eventually, I want to extract a unique list of trainers from > it. But first of all I want to check for duplicates: > > 1) multiples trainer names for same trainer id > 2) multiple trainer ids for same trainer name > > I cobbled together the SQL and it does the job but it seems rather > convoluted. I would like to know how I can improve it. > > CREATE TABLE student ( > id INTEGER NOT NULL, > name VARCHAR(256) NOT NULL, > trainer_id INTEGER, > trainer_name VARCHAR(256), > ); > > ==== > EXAMPLE DATA > > 22 John 1 Macdonald > 23 Jane 1 MacDonald > 24 Paul 1 MacDonald > 25 Dick 2 Smith > 26 Bill 3 Smith > 27 Kate 3 Smith > ==== > > -- outputs trainer ids which appear under different names > select trainer_id, trainer_name from > ( > -- different id/name combinations > select distinct on (trainer_name) trainer_id, trainer_name > from student > where trainer_id in > ( > -- trainer ids with appearing with different names > select distinct on (id) id > from > ( > -- distinct trainer id-name > select distinct on (trainer_id,trainer_name) > trainer_id as id, > trainer_name as name from student > ) as trainer > group by trainer.id > having count (trainer.name) > 1 > ) > ) as y > order by trainer_id > > -- > > Best Regards, > Tarlika Elisabeth Schmitz > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql
Dear Oliveiros, Thank you for taking the time to help. On Fri, 24 Sep 2010 11:22:21 +0100 "Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote: >----- Original Message ----- >From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de> >To: <pgsql-sql@postgresql.org> >Sent: Thursday, September 23, 2010 10:39 PM >Subject: [SQL] identifying duplicates in table with redundancies > > >>[...] I want to check for duplicates: >> >> 1) multiples trainer names for same trainer id >> 2) multiple trainer ids for same trainer name >> >> I cobbled together the SQL and it does the job but it seems rather >> convoluted. I would like to know how I can improve it. >> >> CREATE TABLE student ( >> id INTEGER NOT NULL, >> name VARCHAR(256) NOT NULL, >> trainer_id INTEGER, >> trainer_name VARCHAR(256), >> ); >> >> ==== >> EXAMPLE DATA >> >> 22 John 1 Macdonald >> 23 Jane 1 MacDonald >> 24 Paul 1 MacDonald >> 25 Dick 2 Smith >> 26 Bill 3 Smith >> 27 Kate 3 Smith >> ==== >> >> select trainer_id, trainer_name from >> ( >> select distinct on (trainer_name) trainer_id, trainer_name >> from student >> where trainer_id in >> ( >> select distinct on (id) id >> from >> ( >> select distinct on (trainer_id,trainer_name) >> trainer_id as id, >> trainer_name as name from student >> ) as trainer >> group by trainer.id >> having count (trainer.name) > 1 >> ) >> ) as y >> order by trainer_id >Howdy, Tarlika. > >First, did you past correctly your query into your mail? >I am asking this because your query doesn't seem work for me, it >returns an empty list :-| > >Your most nested query, [...] >returns this > >1|"MacDonald" >2|"Smith" >3|"Smith" For me, the innermost query returns: 1|"Macdonald" 1|"MacDonald" 2|"Smith" 3|"Smith" (note the lower/uppercase "d" in MacDonald) The whole query returns: 1|"Macdonald" 1|"MacDonald" >> 1) multiples trainer names for same trainer id >> 2) multiple trainer ids for same trainer name > >To achieve 2) I would use this query > >SELECT DISTINCT trainer_id,trainer_name >FROM ( >SELECT trainer_name -- The field you want to test for duplicates >FROM ( >SELECT DISTINCT "trainer_id","trainer_name" >FROM student >) x >GROUP BY "trainer_name" -- the field you want to test for duplicates >HAVING (COUNT(*) > 1) >) z >NATURAL JOIN student y I see my 2 innermost queries are the same as yours, just a bit more wordy. I messed up at the third query, which threw up an error when I tried to add an ORDER BY. >It will give you a list of the trainer names who have more than one >trainer ID and the respective trainer IDS. >For your particular example data result will be >2|"Smith" >3|"Smith" Splendid! Just what I wanted. >As Smith is the only trainer with records with diferent trainer IDs. >Question : Can this be what you want? The real table has 250000 entries and quite a few dups. >If you want to achieve 1) just substitute the trainer_name by >trainer_id on the commented places. 1) works as well now - just had to transpose id/name. -- Best Regards, Tarlika Elisabeth Schmitz
Hey,Tarlika. I tried to reproduce your test case through a series of inserts and It seems that the lower case "d" went unnoticed. That explains the empty list i got as result. My fault. Sorry :-( Great to hear it helped you Best, Oliveiros Enviado de meu iPhone Em 24/09/2010, às 05:12 PM, "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.d e> escreveu: > Dear Oliveiros, > Thank you for taking the time to help. > > On Fri, 24 Sep 2010 11:22:21 +0100 > "Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote: > >> ----- Original Message ----- >> From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de> >> To: <pgsql-sql@postgresql.org> >> Sent: Thursday, September 23, 2010 10:39 PM >> Subject: [SQL] identifying duplicates in table with redundancies >> >> >>> [...] I want to check for duplicates: >>> >>> 1) multiples trainer names for same trainer id >>> 2) multiple trainer ids for same trainer name >>> >>> I cobbled together the SQL and it does the job but it seems rather >>> convoluted. I would like to know how I can improve it. >>> >>> CREATE TABLE student ( >>> id INTEGER NOT NULL, >>> name VARCHAR(256) NOT NULL, >>> trainer_id INTEGER, >>> trainer_name VARCHAR(256), >>> ); >>> >>> ==== >>> EXAMPLE DATA >>> >>> 22 John 1 Macdonald >>> 23 Jane 1 MacDonald >>> 24 Paul 1 MacDonald >>> 25 Dick 2 Smith >>> 26 Bill 3 Smith >>> 27 Kate 3 Smith >>> ==== >>> >>> select trainer_id, trainer_name from >>> ( >>> select distinct on (trainer_name) trainer_id, trainer_name >>> from student >>> where trainer_id in >>> ( >>> select distinct on (id) id >>> from >>> ( >>> select distinct on (trainer_id,trainer_name) >>> trainer_id as id, >>> trainer_name as name from student >>> ) as trainer >>> group by trainer.id >>> having count (trainer.name) > 1 >>> ) >>> ) as y >>> order by trainer_id > > >> Howdy, Tarlika. >> >> First, did you past correctly your query into your mail? >> I am asking this because your query doesn't seem work for me, it >> returns an empty list :-| >> >> Your most nested query, [...] >> returns this >> >> 1|"MacDonald" >> 2|"Smith" >> 3|"Smith" > > > For me, the innermost query returns: > 1|"Macdonald" > 1|"MacDonald" > 2|"Smith" > 3|"Smith" > > (note the lower/uppercase "d" in MacDonald) > > The whole query returns: > 1|"Macdonald" > 1|"MacDonald" > > >>> 1) multiples trainer names for same trainer id >>> 2) multiple trainer ids for same trainer name >> >> To achieve 2) I would use this query >> >> SELECT DISTINCT trainer_id,trainer_name >> FROM ( >> SELECT trainer_name -- The field you want to test for duplicates >> FROM ( >> SELECT DISTINCT "trainer_id","trainer_name" >> FROM student >> ) x >> GROUP BY "trainer_name" -- the field you want to test for >> duplicates >> HAVING (COUNT(*) > 1) >> ) z >> NATURAL JOIN student y > > I see my 2 innermost queries are the same as yours, just a bit more > wordy. I messed up at the third query, which threw up an error when I > tried to add an ORDER BY. > >> It will give you a list of the trainer names who have more than one >> trainer ID and the respective trainer IDS. >> For your particular example data result will be >> 2|"Smith" >> 3|"Smith" > > Splendid! Just what I wanted. > >> As Smith is the only trainer with records with diferent trainer IDs. >> Question : Can this be what you want? > > The real table has 250000 entries and quite a few dups. > >> If you want to achieve 1) just substitute the trainer_name by >> trainer_id on the commented places. > > 1) works as well now - just had to transpose id/name. > > > > > > -- > > Best Regards, > Tarlika Elisabeth Schmitz > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql
Hello Oliveiros, On Fri, 24 Sep 2010 18:12:18 +0100 Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote: >Hey,Tarlika. >I tried to reproduce your test case through a series of inserts and >It seems that the lower case "d" went unnoticed. I can assure you people up here (in Scotland) get quite upset about it! It took me a while to get my head round your outer query with the NATURAL JOIN between the student table and the nested query results. I have done table joins before but this solution would not have sprung to mind. -- Many thanks, Tarlika Elisabeth Schmitz
On Fri, 24 Sep 2010 18:12:18 +0100 Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote: >>> SELECT DISTINCT trainer_id,trainer_name >>> FROM ( >>> SELECT trainer_name -- The field you want to test for duplicates >>> FROM ( >>> SELECT DISTINCT "trainer_id","trainer_name" >>> FROM student >>> ) x >>> GROUP BY "trainer_name" -- the field you want to test for >>> duplicates >>> HAVING (COUNT(*) > 1) >>> ) z >>> NATURAL JOIN student y What indices would you recommend for this operation? -- Best Regards, Tarlika Elisabeth Schmitz
Hello again, Tarlika. In what concerns to indices, I 'm affraid I may not be the best person to advise you, my knowledge of them hardly goes beyond the most trivial cases. I'm sure there are plenty of other people on the list who are able to give you better advise than me. But, on this query in particular I would recomend an indice on trainer_name, as this field will be used on the join and on the group by. For the other query, the one you get by substituting trainer_name by trainer_id, place an index on trainer_id. Also, these indexes may help speed up the order by clause, if you use one. If you have a table with lots of data you can try them around and see how performance varies (and don't forget there's also EXPLAIN ANALYZE) Bear in mind, though, that this is just my 2 cents on a matter that I don't really master. Do not take this as an expert answer. There are many people on this list that can help you better Best, Oliveiros ----- Original Message ----- From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de> To: <pgsql-sql@postgresql.org> Sent: Monday, September 27, 2010 5:54 PM Subject: Re: [SQL] identifying duplicates in table with redundancies > On Fri, 24 Sep 2010 18:12:18 +0100 > Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote: > >>>> SELECT DISTINCT trainer_id,trainer_name >>>> FROM ( >>>> SELECT trainer_name -- The field you want to test for duplicates >>>> FROM ( >>>> SELECT DISTINCT "trainer_id","trainer_name" >>>> FROM student >>>> ) x >>>> GROUP BY "trainer_name" -- the field you want to test for >>>> duplicates >>>> HAVING (COUNT(*) > 1) >>>> ) z >>>> NATURAL JOIN student y > > > > What indices would you recommend for this operation? > > -- > > Best Regards, > Tarlika Elisabeth Schmitz > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql
On Tue, 28 Sep 2010 11:34:31 +0100 "Oliveiros d'Azevedo Cristina" <oliveiros.cristina@marktest.pt> wrote: >----- Original Message ----- >From: "Tarlika Elisabeth Schmitz" <postgresql@numerixtechnology.de> >To: <pgsql-sql@postgresql.org> >Sent: Monday, September 27, 2010 5:54 PM >Subject: Re: [SQL] identifying duplicates in table with redundancies > > >> On Fri, 24 Sep 2010 18:12:18 +0100 >> Oliver d'Azevedo Christina <oliveiros.cristina@gmail.com> wrote: >> >>>>> SELECT DISTINCT trainer_id,trainer_name >>>>> FROM ( >>>>> SELECT trainer_name -- The field you want to test for duplicates >>>>> FROM ( >>>>> SELECT DISTINCT "trainer_id","trainer_name" >>>>> FROM student >>>>> ) x >>>>> GROUP BY "trainer_name" -- the field you want to test for >>>>> duplicates >>>>> HAVING (COUNT(*) > 1) >>>>> ) z >>>>> NATURAL JOIN student y >> >> >> >> What indices would you recommend for this operation? > >But, on this query in particular I would recomend an indice on >trainer_name, as this field will be used on the join and on the group >by. For the other query, the one you get by substituting trainer_name >by trainer_id, place an index on trainer_id. >Also, these indexes may help speed up the order by clause, if you use >one. > >If you have a table with lots of data you can try them around and see >how performance varies (and don't forget there's also EXPLAIN ANALYZE) Strangely, these indices did not do anything. Without, the query took about 8500ms. Same with index. The table has 250000 records. 11000 have trainer_name = null. Only 13000 unique trainer_names. It is not hugely important as these queries are not time-critical. This is only a helper table, which I use to analyze the date prior to populating the destination tables with the data. Regards, Tarlika
On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote: > On Tue, 28 Sep 2010 11:34:31 +0100 > "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt> wrote: > >> ----- Original Message ----- >> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de> >> To:<pgsql-sql@postgresql.org> >> Sent: Monday, September 27, 2010 5:54 PM >> Subject: Re: [SQL] identifying duplicates in table with redundancies >> >> >>> On Fri, 24 Sep 2010 18:12:18 +0100 >>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote: >>> >>>>>> SELECT DISTINCT trainer_id,trainer_name >>>>>> FROM ( >>>>>> SELECT trainer_name -- The field you want to test for duplicates >>>>>> FROM ( >>>>>> SELECT DISTINCT "trainer_id","trainer_name" >>>>>> FROM student >>>>>> ) x >>>>>> GROUP BY "trainer_name" -- the field you want to test for >>>>>> duplicates >>>>>> HAVING (COUNT(*)> 1) >>>>>> ) z >>>>>> NATURAL JOIN student y >>> >>> >>> What indices would you recommend for this operation? >> But, on this query in particular I would recomend an indice on >> trainer_name, as this field will be used on the join and on the group >> by. For the other query, the one you get by substituting trainer_name >> by trainer_id, place an index on trainer_id. >> Also, these indexes may help speed up the order by clause, if you use >> one. >> >> If you have a table with lots of data you can try them around and see >> how performance varies (and don't forget there's also EXPLAIN ANALYZE) > > Strangely, these indices did not do anything. > Without, the query took about 8500ms. Same with index. > > The table has 250000 records. 11000 have trainer_name = null. Only > 13000 unique trainer_names. > > It is not hugely important as these queries are not time-critical. > This is only a helper table, which I use to analyze the date prior to > populating the destination tables with the data. > > Regards, > Tarlika > I guess explain analyze shows up a seq scan. try avoiding to use distinct. use group by instead. regards Andreas
On Wed, 29 Sep 2010 10:40:03 +0200 Andreas Schmitz <mailinglist@longimanus.net> wrote: > On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote: >> On Tue, 28 Sep 2010 11:34:31 +0100 >> "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt> >> wrote: >> >>> ----- Original Message ----- >>> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de> >>> To:<pgsql-sql@postgresql.org> >>> Sent: Monday, September 27, 2010 5:54 PM >>> Subject: Re: [SQL] identifying duplicates in table with redundancies >>> >>> >>>> On Fri, 24 Sep 2010 18:12:18 +0100 >>>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote: >>>> >>>>>>> SELECT DISTINCT trainer_id,trainer_name >>>>>>> FROM ( >>>>>>> SELECT trainer_name -- The field you want to test for >>>>>>> duplicates FROM ( >>>>>>> SELECT DISTINCT "trainer_id","trainer_name" >>>>>>> FROM student >>>>>>> ) x >>>>>>> GROUP BY "trainer_name" -- the field you want to test for >>>>>>> duplicates >>>>>>> HAVING (COUNT(*)> 1) >>>>>>> ) z >>>>>>> NATURAL JOIN student y >>>> >>>> >>>> What indices would you recommend for this operation? >>> But, on this query in particular I would recomend an indice on >>> trainer_name,[...] >> Strangely, these indices did not do anything. >> Without, the query took about 8500ms. Same with index. >> >> The table has 250000 records. 11000 have trainer_name = null. Only >> 13000 unique trainer_names. > >I guess explain analyze shows up a seq scan. try avoiding to use >distinct. use group by instead. > >regards, Andreas Hallo Andreas, I reduced the problem to the innermost query: 1) SELECT DISTINCT trainer_id, trainer_name FROM student This results in a sequential table scan. Execution time 7500ms. 2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index scan instead, which still cost 7000ms. 3) Next, I changed from DISTINCT to GROUP BY: SELECT trainer_id, trainer_name FROM student GROUP BY trainer_id, trainer_name This resulted in an index scan @ 6750ms 4) I filtered out NULL trainer_ids WHERE trainer_id IS NOT NULL Amazingly, this resulted in a sequential table scan, which only took 1300ms!! Please, explain (pun not intended)! How can this be. Only 11000/250000 rows have a null trainer_id. Thanks for the GROUP BY tip! -- Best Regards, Tarlika Elisabeth Schmitz
> > Hallo Andreas, > I reduced the problem to the innermost query: > > 1) SELECT DISTINCT trainer_id, trainer_name FROM student > This results in a sequential table scan. Execution time 7500ms. > > 2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index > scan instead, which still cost 7000ms. > > 3) Next, I changed from DISTINCT to GROUP BY: > SELECT trainer_id, trainer_name FROM student > GROUP BY trainer_id, trainer_name > This resulted in an index scan @ 6750ms > > 4) I filtered out NULL trainer_ids > WHERE trainer_id IS NOT NULL > Amazingly, this resulted in a sequential table scan, which only took > 1300ms!! > > Please, explain (pun not intended)! How can this be. Only 11000/250000 > rows have a null trainer_id. > That's an impressive improvement... Personally I have no idea what caused it, specially when you say it was sequential :-| Warmed caches ? Best, Oliver