Thread: Filtering duplicated row with a trigger

Filtering duplicated row with a trigger

From
papapep
Date:

I've got plenty of data files (prepared to be inserted with the \copy
statement), but I have to filter them to be sure that no duplicated rows
are inserted.
I know I should do it with a trigger that executes a function before
inserting the row and, if it's duplicated, does something with it (inserts
it into another table, simply forgets it, etc.). The theory is clear :-)
But the practice is not so clear (for me, of course).
Can anyone give me some guidance on how the function should check for
duplicated rows?

Thanks.

Josep Sànchez
[papapep]




Re: Filtering duplicated row with a trigger

From
Bruno Wolff III
Date:
On Mon, Oct 06, 2003 at 18:30:29 +0200,
  papapep <papapep@gmx.net> wrote:
> I've got plenty of data files (prepared to be inserted with the \copy
> statement), but I have to filter them to be sure that no duplicated rows
> are inserted.
> I know I should do it with a trigger that executes a function before
> inserting the row and, if it's duplicated, does something with it (inserts
> it into another table, simply forgets it, etc.). The theory is clear :-)
> But the practice is not so clear (for me, of course).
> Can anyone give me some guidance on how the function should check for
> duplicated rows?

You might want to consider reading the data into a temp table and then using
a query to do something with duplicates.

Re: Filtering duplicated row with a trigger

From
Nabil Sayegh
Date:
papapep wrote:
> Can anyone give me some guidance on how the function should check for
> duplicated rows?

First tell us what the function should do.
If you just want to ignore the 'duplicate' rows, that's easy:

define one column as unique and just insert the data row by row (each
INSERT outside any explicit transaction).
Every 'duplicate' row will throw an error, but since we don't use a
transaction here, only that row is 'ignored', leaving just the rows that
are distinct for the unique column.
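
A minimal sketch of this row-by-row approach (the table and column names
here are made up for illustration):

```sql
-- Hypothetical target table; the UNIQUE constraint defines what
-- counts as a duplicate.
CREATE TABLE target (
    id  integer UNIQUE,
    val text
);

-- Each INSERT runs as its own implicit transaction, so a
-- unique-violation error throws away only the offending row.
INSERT INTO target VALUES (1, 'foo');  -- inserted
INSERT INTO target VALUES (1, 'bar');  -- ERROR: duplicate key, row skipped
INSERT INTO target VALUES (2, 'baz');  -- inserted
```

Note the trade-off: one INSERT per row is much slower than a single \copy.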

I write 'duplicate' in quotes, because you didn't define what you mean
by duplicate.

HTH
--
  e-Trolley Sayegh & John, Nabil Sayegh
  Tel.: 0700 etrolley /// 0700 38765539
  Fax.: +49 69 8299381-8
  PGP : http://www.e-trolley.de


Re: [personal] Re: Filtering duplicated row with a trigger

From
papapep
Date:
Bruno Wolff III wrote:
| On Mon, Oct 06, 2003 at 18:30:29 +0200,
|   papapep <papapep@gmx.net> wrote:
|
|> Can anyone give me some guidance on how the function should check for
|> duplicated rows?
|
| You might want to consider reading the data into a temp table and then
| using a query to do something with duplicates.

I'm very sorry, but I think I don't completely understand what you mean.
Are you suggesting that I insert all the data into an initial temporary
table and check for duplicates there before transferring the "good" rows
to the real table? If so, how should I check for duplicates in the temp
table? (For me that is as difficult as my first question :-( )
Consider that the primary key we use to see whether a row is duplicated
is a five-field key (it has to be that way; the data to filter is
complex).

Josep Sànchez
[papapep]







Re: [personal] Re: Filtering duplicated row with a trigger

From
Josh Berkus
Date:
Josep,

Are you a Spanish speaker?  Or Brazilian?  I'd like to get someone to help
you, or at least translate, in a language you are more comfortable with.  So
far, nobody on this list is clear on what you want help doing.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: [personal] Re: Filtering duplicated row with a trigger

From
Nabil Sayegh
Date:
papapep wrote:
> If so, how should I check for duplicates in the temp table? (For me
> that is as difficult as my first question :-( )
> Consider that the primary key we use to see whether a row is duplicated
> is a five-field key (it has to be that way; the data to filter is
> complex).

CREATE TEMP TABLE tempo (a int, b int, c text);
INSERT INTO tempo (a,b,c) values (1,1,'foo');
INSERT INTO tempo (a,b,c) values (1,2,'bar');
INSERT INTO tempo (a,b,c) values (1,1,'foo');
INSERT INTO tempo (a,b,c) values (1,1,'foo');
INSERT INTO tempo (a,b,c) values (1,1,'foo-bar');

SELECT distinct on (a,b) a, b, c from tempo;
  a | b |  c
---+---+-----
  1 | 1 | foo
  1 | 2 | bar
(2 rows)

This DISTINCT ON query only requires the listed columns (a, b) to be
distinct. Which c is returned is undefined (effectively arbitrary).
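
If it matters which c survives, the choice can be made deterministic by
adding an ORDER BY (a sketch using the tempo table above):

```sql
-- ORDER BY must start with the DISTINCT ON expressions; the next
-- sort key decides which row is kept within each (a, b) group.
SELECT DISTINCT ON (a, b) a, b, c
  FROM tempo
 ORDER BY a, b, c;  -- keeps the smallest c for each (a, b)
```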

HTH
--
  e-Trolley Sayegh & John, Nabil Sayegh
  Tel.: 0700 etrolley /// 0700 38765539
  Fax.: +49 69 8299381-8
  PGP : http://www.e-trolley.de


Re: [personal] Re: Filtering duplicated row with a trigger

From
Bruno Wolff III
Date:
On Mon, Oct 06, 2003 at 18:56:28 +0200,
  papapep <papapep@gmx.net> wrote:
>
> I'm very sorry, but I think I don't completely understand what you mean.
> Are you suggesting that I insert all the data into an initial temporary
> table and check for duplicates there before transferring the "good" rows
> to the real table?

Yes. That allows you to use SQL to handle the duplicates, which is probably
going to be simpler than writing a trigger. You can also use COPY to
load the temp table, which will be faster than using inserts.

> If so, how should I check for duplicates in the temp table? (For me
> that is as difficult as my first question :-( )
> Consider that the primary key we use to see whether a row is duplicated
> is a five-field key (it has to be that way; the data to filter is
> complex).

You haven't given us a rule to use when deciding which tuples to remove
when a duplicate is detected. Without such a rule we can't give you
detailed instructions on how to remove the duplicates. Having a 5-column
primary key doesn't make the problem significantly more difficult to
solve; it mostly just adds a small amount of typing.
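
As a sketch, the load step of that approach might look like this (the
column layout and file name are only illustrative):

```sql
-- Stage the raw file in a temp table with no constraints, then
-- filter duplicates with SQL before touching the real table.
CREATE TEMP TABLE temp_table (
    pk1 integer, pk2 integer, pk3 integer,
    pk4 integer, pk5 integer,
    other_data text
);
\copy temp_table from 'datafile.txt'
```

(\copy is a psql client command, so it is run in the psql session, not
inside a function.)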

Re: [personal] Re: Filtering duplicated row with a trigger

From
papapep
Date:
Josh Berkus wrote:
| Josep,
|
| Are you a Spanish speaker?  Or Brazilian?  I'd like to get someone to
| help you, or at least translate, in a language you are more comfortable
| with.  So far, nobody on this list is clear on what you want help doing.

I'm very sorry for not being able to explain my problem (BTW, I'm a
Spanish speaker...).
I'll try to do it better.

I've got a table that has the following fields:

    F1 F2 F3 F4 F5 .........F16

and we have defined that no two rows may repeat the fields
F1, F2, F5, F14 and F15 (our complex primary key).

I've got, on the other hand, text files prepared to be inserted into this
table with the \copy command, but we are not sure that they contain no
repeated rows (we've found duplicated rows several times).

I'm trying to create a function that controls these duplicated rows to
keep the table "clean" of them. In fact, I don't mind whether the
duplicated rows are inserted into a "duplicated rows" table (which might
actually be a good way to detect where they are generated) or whether
they simply go "missing in action".

I hope (I really, really do...) that this time I've been able to explain
it better.

Thanks to all for your patience.

Josep Sanchez
[papapep]





Re: [personal] Re: Filtering duplicated row with a trigger

From
Bruno Wolff III
Date:
On Mon, Oct 06, 2003 at 19:28:29 +0200,
  papapep <papapep@gmx.net> wrote:
>
> I've got, on the other hand, text files prepared to be inserted into this
> table with the \copy command, but we are not sure that they contain no
> repeated rows (we've found duplicated rows several times).
>
> I'm trying to create a function that controls these duplicated rows to
> keep the table "clean" of them. In fact, I don't mind whether the
> duplicated rows are inserted into a "duplicated rows" table (which might
> actually be a good way to detect where they are generated) or whether
> they simply go "missing in action".

And what do you want to happen when you run across a duplicate row?
Do you just want to discard tuples with a duplicate primary key?
If you are discarding duplicates, do you care which of the duplicates
is discarded?
If you want to combine data from the duplicates, do you have a precise
description of what you want to happen?

Re: [personal] Re: Filtering duplicated row with a trigger

From
papapep
Date:
Bruno Wolff III wrote:
| And what do you want to happen when you run across a duplicate row?
| Do you just want to discard tuples with a duplicate primary key?

Initially I wanted to do that, but then I thought of "registering" them
in a dups table.

| If you are discarding duplicates, do you care which of the duplicates
| is discarded?

That's why I said that perhaps it would be a good thing to keep them in
a duplicated-rows table, for reviewing them.

| If you want to combine data from the duplicates, do you have a precise
| description of what you want to happen?

No, I do not need to combine data from the duplicates. Each row is
accepted or rejected as a whole.


Josep Sànchez
[papapep]




Re: [personal] Re: Filtering duplicated row with a trigger

From
Josh Berkus
Date:
Josep,

> I'm very sorry for not being able to explain my problem (BTW, I'm a
> Spanish speaker...).
> I'll try to do it better.

That was very clear.

> I've got a table that has the following fields:
>
>       F1 F2 F3 F4 F5 .........F16
>
> and we have defined that no two rows may repeat the fields
> F1, F2, F5, F14 and F15 (our complex primary key).
>
> I've got, on the other hand, text files prepared to be inserted into this
> table with the \copy command, but we are not sure that they contain no
> repeated rows (we've found duplicated rows several times).

I'd suggest using the temporary table (or "holding table") approach
suggested by other posters on this list.  While you could use a trigger,
that would mean using INSERT instead of COPY, which would slow down your
data loading a lot.

What you want to do after loading the table really depends on how you want
to handle duplicates.  If you just want to ignore them, use the SELECT
DISTINCT ON suggestion from another list member ... although this has the
defect of grabbing one arbitrary row per primary key value and ignoring the
others, which might have different information in the other columns.

If you want to raise an alarm and halt the import on finding a duplicate,
then do:

SELECT F1,F2,F5,F14, F15, count(*) as duplicates
FROM holding_table
GROUP BY F1,F2,F5,F14, F15
HAVING count(*) > 1;

The rows returned by that query will show you the primary keys of the
duplicate rows.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: [personal] Re: Filtering duplicated row with a trigger

From
Bruno Wolff III
Date:
Please keep messages copied to the list.

On Mon, Oct 06, 2003 at 19:38:46 +0200,
  papapep <papapep@gmx.net> wrote:
> Bruno Wolff III wrote:
> | And what do you want to happen when you run across a duplicate row?
> | Do you just want to discard tuples with a duplicate primary key?
>
> Initially I wanted to do that, but then I thought of "registering" them
> in a dups table.

OK.

>
> | If you are discarding duplicates, do you care which of the duplicates
> | is discarded?
>
> That's why I said that perhaps it would be a good thing to keep them in
> a duplicated-rows table, for reviewing them.
>
> | If you want to combine data from the duplicates, do you have a precise
> | description of what you want to happen?
>
> No, I do not need to combine data from the duplicates. Each row is
> accepted or rejected as a whole.

If you initially don't care which dup gets inserted, then the SELECT
DISTINCT ON suggestion will work for loading the real table.
Something like:

insert into real_table
  select distinct on (pk1, pk2, pk3, pk4, pk5) *
    from temp_table;

To see the sets of duplicates you can do something like:

select *
  from temp_table,
       (select pk1, pk2, pk3, pk4, pk5
          from temp_table
         group by pk1, pk2, pk3, pk4, pk5
        having count(*) > 1) as dups
 where temp_table.pk1 = dups.pk1
   and temp_table.pk2 = dups.pk2
   and temp_table.pk3 = dups.pk3
   and temp_table.pk4 = dups.pk4
   and temp_table.pk5 = dups.pk5;

If there are a large number of records being imported, you probably want
to create a multicolumn index on (pk1, pk2, pk3, pk4, pk5) on the temp
table after loading it and before running the duplicates query.
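
For instance (the index name is arbitrary):

```sql
-- A multicolumn index on the five key columns helps the duplicate
-- queries; ANALYZE refreshes planner statistics after the bulk load.
CREATE INDEX temp_table_pk_idx
    ON temp_table (pk1, pk2, pk3, pk4, pk5);
ANALYZE temp_table;
```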

Re: Filtering duplicated row with a trigger

From
"Partha Sur"
Date:
Hello,

If you wish to "clean" a table of duplicates, the following SQL will work.
It is widely used in the Oracle world, and I just tried it successfully on
my Red Hat Linux 9 PostgreSQL 7.3.2 database.

If you have a table T with columns C1, C2, ... Cn, where C1, C2, ... Cr
(r <= n) form the candidate key for a concatenated primary key, and T
currently contains duplicates (obviously with no primary key constraint
defined on it; otherwise such a table with duplicate rows could not
exist), then to remove the duplicates so that only tuples with unique
candidate-key values remain (i.e. cleaned), run the following DELETE,
which uses a correlated subquery:

delete from T t1
 where oid < (select max(oid)
                from T t2
               where t1.C1 = t2.C1
                 and t1.C2 = t2.C2
                 ...
                 and t1.Cr = t2.Cr);

This is based on the fact that even though rows in a table may be
duplicates, they always have unique oid numbers.

With the above delete statement, only the row with the maximum oid within
a particular set of duplicate rows (for a particular value of
C1||C2...||Cr) will remain.
If in the above SQL you instead use:

 where oid > (select min(oid) ...

then only the row with the minimum oid will remain.

So \copy to a temp table (with no primary key constraints defined) and
then apply SQL similar to the above to clean the loaded temp table, which
can then be used to load the target production table with no duplicates.
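
A self-contained sketch of the idea on a made-up two-column key (this
assumes the table has oids, which all tables do in 7.3, and that oids
increase with insertion order, which holds barring counter wraparound):

```sql
CREATE TEMP TABLE staging (c1 int, c2 int, payload text);

INSERT INTO staging VALUES (1, 1, 'first');
INSERT INTO staging VALUES (1, 1, 'second');  -- duplicate key (1,1)
INSERT INTO staging VALUES (2, 1, 'only');

-- Keep only the highest-oid (i.e. latest-loaded) row per (c1, c2).
DELETE FROM staging t1
 WHERE oid < (SELECT max(oid) FROM staging t2
               WHERE t1.c1 = t2.c1
                 AND t1.c2 = t2.c2);
-- staging should now hold (1,1,'second') and (2,1,'only')
```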

 Regards,

 Partha Sur

> ----- Original Message -----
> From: "papapep" <papapep@gmx.net>
> To: "pgsql-novice" <pgsql-novice@postgresql.org>
> Sent: Monday, October 06, 2003 12:30 PM
> Subject: [NOVICE] Filtering duplicated row with a trigger
>
>
> > I've got plenty of data files (prepared to be inserted with the \copy
> > statement), but I have to filter them to be sure that no duplicated rows
> > are inserted.
> > I know I should do it with a trigger that executes a function before
> > inserting the row and, if it's duplicated, does something with it (inserts
> > it into another table, simply forgets it, etc.). The theory is clear :-)
> > But the practice is not so clear (for me, of course).
> > Can anyone give me some guidance on how the function should check for
> > duplicated rows?
> >
> > Thanks.
> >
> > Josep Sànchez
> > [papapep]



Re: [personal] Re: Filtering duplicated row with a trigger

From
papapep
Date:
Many, many, many thanks to all the people who have tried (and managed)
to help me. I now have several possibilities to try, to see which way is
best for us.

Josep Sànchez
[papapep]