Thread: removing specific duplicates

removing specific duplicates

From

Al Arduengo

Date:

15 July 2002, 08:40:09

I have a table of characterization data from a test of a certain chip where
I work. THe table consists of 5 columns:

sample int
spec text
temp int
voltage float
measurement float

Each sample chip (1-120) is tested at three temps and two voltages. The
column of interest is measurement. The problem is that some of the samples
were tested multiple times for one temp/voltage combination so I have
entries such as:

sample  spec  temp  voltage  measurement
1           Tdsu  25      4.5       1.12e-9
1           Tdsu  25      4.5       1.12e-9
1           Tdsu  25      4.5       1.3e-9

My requirement is to delete any duplicate rows (such as #2 in this case)
and then take the average of #1 and #3, replace #1 with that average
measurement and delete #3. I simply cannot figure out how to first get rid
of *just* #2 and then somehow find the case of #1 and #3 existing and then
do the average followed by removing #3. Using DISTINCT with SELECT will
obviously spit out #1 and #3 but that doesn't get me anywhere. I am
somewhat of a SQL novice so please have pity on me in your possible
explanations. My ultimate goal is to do these modifications without having
to go in and manually find these situations and then manually make the
changes. I think it is possible with SQL but I can't figure out how.

Thanks very much in advance.
-Al Arduengo

Re: removing specific duplicates

From

Josh Jore

Date:

15 July 2002, 10:23:00

Al,
First thing you should do is add a serial not null unique column to that
table. If you are going to need to do this sort of operation then you need
something by which to get a handle to a specific row.

So this command will delete exactly one duplicate from your table. You
need to keep runnin it until it deletes zero rows. Is this idea? No. It's
one step above worst case. Unless you *like* this sort of pain, fix your
app to prevent this sort of  duplication. You could have a trigger or rule
stop duplicate inserts. You could use a unique index. Whatever works for
you. Just do *something*.

DELETE FROM temp WHERE oid IN
    (
    SELECT    temp.oid
    FROM    temp
    JOIN
    ( SELECT count(*), sample, spec, temp, measurement
      FROM temp
      GROUP BY sample, spec, temp, measurement ) dup
    USING (sample, spec, temp, measurement)
    WHERE dup.count > 1
    LIMIT 1;


So your next question. How to replace the existing unique data with a
single averaage. Requesting the average is easy:

> sample  spec  temp  voltage  measurement
> 1           Tdsu  25      4.5       1.12e-9
> 1           Tdsu  25      4.5       1.3e-9

SELECT sample, spec, temp, voltage, AVG(measurement)
FROM temp GROUP BY sample, spec, temp, voltage;

So if you wanted to keep the average and throw away the detail data then
just delete the old row and insert the new row (which you saved prior to
deleting the data)


Joshua b. Jore ; http://www.greentechnologist.org

On Fri, 12 Jul 2002, Al Arduengo wrote:

> I have a table of characterization data from a test of a certain chip where
> I work. THe table consists of 5 columns:
>
> sample int
> spec text
> temp int
> voltage float
> measurement float
>
> Each sample chip (1-120) is tested at three temps and two voltages. The
> column of interest is measurement. The problem is that some of the samples
> were tested multiple times for one temp/voltage combination so I have
> entries such as:
>
>
> My requirement is to delete any duplicate rows (such as #2 in this case)
> and then take the average of #1 and #3, replace #1 with that average
> measurement and delete #3. I simply cannot figure out how to first get rid
> of *just* #2 and then somehow find the case of #1 and #3 existing and then
> do the average followed by removing #3. Using DISTINCT with SELECT will
> obviously spit out #1 and #3 but that doesn't get me anywhere. I am
> somewhat of a SQL novice so please have pity on me in your possible
> explanations. My ultimate goal is to do these modifications without having
> to go in and manually find these situations and then manually make the
> changes. I think it is possible with SQL but I can't figure out how.
>
> Thanks very much in advance.
> -Al Arduengo
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

Re: removing specific duplicates

From

Ron Johnson

Date:

15 July 2002, 13:35:57

Instead of One Big Delete Statement, the other way to do it
is in a loop in pg/sql.  This is the way I like to do it
because then I can put progress indicators to show how fast
or slow the statements are running.

On Mon, 2002-07-15 at 09:46, Josh Jore wrote:
> Al,
> First thing you should do is add a serial not null unique column to that
> table. If you are going to need to do this sort of operation then you need
> something by which to get a handle to a specific row.
>
> So this command will delete exactly one duplicate from your table. You
> need to keep runnin it until it deletes zero rows. Is this idea? No. It's
> one step above worst case. Unless you *like* this sort of pain, fix your
> app to prevent this sort of  duplication. You could have a trigger or rule
> stop duplicate inserts. You could use a unique index. Whatever works for
> you. Just do *something*.
>
> DELETE FROM temp WHERE oid IN
>     (
>     SELECT    temp.oid
>     FROM    temp
>     JOIN
>     ( SELECT count(*), sample, spec, temp, measurement
>       FROM temp
>       GROUP BY sample, spec, temp, measurement ) dup
>     USING (sample, spec, temp, measurement)
>     WHERE dup.count > 1
>     LIMIT 1;
>
>
> So your next question. How to replace the existing unique data with a
> single averaage. Requesting the average is easy:
>
> > sample  spec  temp  voltage  measurement
> > 1           Tdsu  25      4.5       1.12e-9
> > 1           Tdsu  25      4.5       1.3e-9
>
> SELECT sample, spec, temp, voltage, AVG(measurement)
> FROM temp GROUP BY sample, spec, temp, voltage;
>
> So if you wanted to keep the average and throw away the detail data then
> just delete the old row and insert the new row (which you saved prior to
> deleting the data)
>
>
> Joshua b. Jore ; http://www.greentechnologist.org
>
> On Fri, 12 Jul 2002, Al Arduengo wrote:
>
> > I have a table of characterization data from a test of a certain chip where
> > I work. THe table consists of 5 columns:
> >
> > sample int
> > spec text
> > temp int
> > voltage float
> > measurement float
> >
> > Each sample chip (1-120) is tested at three temps and two voltages. The
> > column of interest is measurement. The problem is that some of the samples
> > were tested multiple times for one temp/voltage combination so I have
> > entries such as:
> >
> >
> > My requirement is to delete any duplicate rows (such as #2 in this case)
> > and then take the average of #1 and #3, replace #1 with that average
> > measurement and delete #3. I simply cannot figure out how to first get rid
> > of *just* #2 and then somehow find the case of #1 and #3 existing and then
> > do the average followed by removing #3. Using DISTINCT with SELECT will
> > obviously spit out #1 and #3 but that doesn't get me anywhere. I am
> > somewhat of a SQL novice so please have pity on me in your possible
> > explanations. My ultimate goal is to do these modifications without having
> > to go in and manually find these situations and then manually make the
> > changes. I think it is possible with SQL but I can't figure out how.
> >
> > Thanks very much in advance.
> > -Al Arduengo

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81         |
|                                                                 |
| "Experience should teach us to be most on our guard to protect  |
|  liberty when the government's purposes are beneficent. Men     |
|  born to freedom are naturally alert to repel invasion of their |
|  liberty by evil minded rulers. The greatest dangers to liberty |
|  lurk in insidious encroachment by men of zeal, well-meaning    |
|  but without understanding."                                    |
|   Justice Louis Brandeis, dissenting, Olmstead v US (1928)      |
+-----------------------------------------------------------------+