Thread: Finding Duplicate Rows during INSERTs

Finding Duplicate Rows during INSERTs

From: Rich Shepard
Date:
   Source data has duplicates. I have a file that creates the table and then
INSERTs all the rows into it. When I see errors flash by during the
'psql -d <database> -f <file.sql>' run I try to scroll back in the terminal to
see where the duplicate rows are located. Too often they are too far back for
me to scroll to them.

   There must be a better way of doing this. Can I run psql with the tee
command to capture errors in a file I can examine? What is the proper/most
efficient way to identify the duplicates so they can be removed?

TIA,

Rich


Re: Finding Duplicate Rows during INSERTs

From: Rob Sargent
Date:
On 07/09/2012 04:48 PM, Rich Shepard wrote:
>    Source data has duplicates. I have a file that creates the table and then
> INSERTs all the rows into it. When I see errors flash by during the
> 'psql -d <database> -f <file.sql>' run I try to scroll back in the terminal to
> see where the duplicate rows are located. Too often they are too far back for
> me to scroll to them.
>
>    There must be a better way of doing this. Can I run psql with the tee
> command to capture errors in a file I can examine? What is the proper/most
> efficient way to identify the duplicates so they can be removed?
>
> TIA,
>
> Rich
>
>


psql -d <database> -f file.sql > file.log 2>&1 would give you a logfile
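
Or, since you asked about tee: assuming a Bourne-style shell,

psql -d <database> -f file.sql 2>&1 | tee file.log

captures the same errors in file.log while still showing them on the terminal.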


sort -u file.raw > file.uniq might give you clean data?
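
And to see which raw lines are duplicated, rather than silently dropping them,
something like

sort file.raw | uniq -d > file.dups

should list each duplicated line once.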


Re: Finding Duplicate Rows during INSERTs

From: Rich Shepard
Date:
On Mon, 9 Jul 2012, Rob Sargent wrote:

> psql -d <database> -f file.sql > file.log 2>&1 would give you a logfile

Rob,

   Ah, yes. I forgot about redirecting stdout and stderr to a file.

> sort -u file.raw > file.uniq might give you clean data?

   Not when it has the SQL on each line. I thought that I had eliminated
duplicates in the raw file, but with 128K rows I obviously missed many.
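
   Perhaps pulling out only the INSERT statements and looking for repeats
there would work, assuming each data row is a single-line INSERT:

grep '^INSERT' file.sql | sort | uniq -d

That should list each duplicated statement once.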

Thanks,

Rich


Re: Finding Duplicate Rows during INSERTs

From
Darren Duncan
Date:
Rich Shepard wrote:
>   Source data has duplicates. I have a file that creates the table and then
> INSERTs all the rows into it. When I see errors flash by during the
> 'psql -d <database> -f <file.sql>' run I try to scroll back in the terminal to
> see where the duplicate rows are located. Too often they are too far back for
> me to scroll to them.
>
>   There must be a better way of doing this. Can I run psql with the tee
> command to capture errors in a file I can examine? What is the proper/most
> efficient way to identify the duplicates so they can be removed?
>
> TIA,
>
> Rich

What I recommend instead is inserting your data into staging tables that lack
key constraints, and then using SQL either to locate the duplicates or simply
to copy the unique rows into the normal tables.  Ostensibly, SQL is usually a
better tool than anything else for cleaning data, or for reporting on it.
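
For example, with a hypothetical table named samples(site_id, sample_date,
value), a rough sketch:

-- staging table with the same columns but no key constraints
CREATE TABLE samples_staging (LIKE samples INCLUDING DEFAULTS);

-- after running the generated INSERTs against samples_staging,
-- list each duplicated row and how many copies of it there are
SELECT site_id, sample_date, value, count(*) AS copies
  FROM samples_staging
 GROUP BY site_id, sample_date, value
HAVING count(*) > 1;

-- copy only the distinct rows into the real table
INSERT INTO samples
SELECT DISTINCT site_id, sample_date, value
  FROM samples_staging;
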
-- Darren Duncan