Thread: Filtering duplicated rows with a trigger
I've got plenty of data files (prepared to be inserted with the \copy statement), but I have to filter them to be sure that no duplicated rows get inserted.
I know I should do it with a trigger that executes a function before inserting each row and, if it's a duplicate, does something with it (inserts it into another table, simply forgets it, etc.). The theory is clear :-) But the practice is not so clear (for me, of course).
Can anyone give me some guidance on how the function should check for duplicated rows?

Thanks.

Josep Sànchez
[papapep]
On Mon, Oct 06, 2003 at 18:30:29 +0200, papapep <papapep@gmx.net> wrote:
> I've got plenty of data files (prepared to be inserted with the \copy
> statement), but I have to filter them to be sure that no duplicated
> rows get inserted.
> [...]
> Can anyone give me some guidance on how the function should check for
> duplicated rows?

You might want to consider reading the data into a temp table and then using a query to do something with the duplicates.
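A minimal sketch of that approach, assuming a hypothetical target table mydata and a data file data.txt (the names are illustrative, not from the thread):

  -- stage the raw file in a temp table with the same layout as the target
  CREATE TEMP TABLE staging AS SELECT * FROM mydata LIMIT 0;
  \copy staging from 'data.txt'
  -- then move only one copy of each fully identical row across
  INSERT INTO mydata SELECT DISTINCT * FROM staging;

This handles rows that are duplicated in their entirety; deduplicating on a subset of key columns is discussed further down the thread.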
papapep wrote:
> Can anyone give me some guidance on how the function should check for
> duplicated rows?

First tell us what the function should do. If you just want to ignore the 'duplicate' rows, that's easy: define one column as unique and insert the data row by row, without wrapping the inserts in a single transaction. Every 'duplicate' row will throw an error, but since each insert commits on its own, only the offending row is 'ignored', leaving rows that are distinct on that unique column. I write 'duplicate' in quotes because you didn't define what you mean by duplicate.

HTH
--
 e-Trolley Sayegh & John, Nabil Sayegh
 Tel.: 0700 etrolley /// 0700 38765539
 Fax.: +49 69 8299381-8

 PGP : http://www.e-trolley.de
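A minimal sketch of that idea (table name and values are made up). Run without an enclosing transaction, the second insert fails while the other two commit:

  CREATE TABLE sample (
      id  integer UNIQUE,  -- 'duplicates' on this column are rejected
      val text
  );
  INSERT INTO sample VALUES (1, 'first');   -- inserted
  INSERT INTO sample VALUES (1, 'again');   -- fails with a unique-violation error
  INSERT INTO sample VALUES (2, 'second');  -- still inserted; autocommit keeps going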
Bruno Wolff III wrote:
| You might want to consider reading the data into a temp table and then
| using a query to do something with the duplicates.

I'm very sorry, but I think I don't completely understand what you mean. Perhaps you're suggesting inserting all the data into an initial temporary table and checking for duplicates there, before transferring the "good" rows to the real table? If so, how should I do the duplicate control in the temp table? (For me that is as difficult as my first question :-( )
Consider that the primary key we use to see whether a row is duplicated or not is a 5-field key (it has to be so; it's complex data to filter).

Josep Sànchez
[papapep]
Josep,

Are you a Spanish speaker? Or Brazilian? I'd like to get someone to help you, or at least translate, in a language you are more comfortable with. So far, nobody on this list is clear on what you want help doing.

--
Josh Berkus
Aglio Database Solutions
San Francisco
papapep wrote:
> If so, how should I do the duplicate control in the temp table? (For me
> that is as difficult as my first question :-( )
> Consider that the primary key we use to see whether a row is duplicated
> or not is a 5-field key (it has to be so; it's complex data to filter).

CREATE TEMP TABLE tempo (a int, b int, c text);

INSERT INTO tempo (a,b,c) VALUES (1,1,'foo');
INSERT INTO tempo (a,b,c) VALUES (1,2,'bar');
INSERT INTO tempo (a,b,c) VALUES (1,1,'foo');
INSERT INTO tempo (a,b,c) VALUES (1,1,'foo');
INSERT INTO tempo (a,b,c) VALUES (1,1,'foo-bar');

SELECT DISTINCT ON (a,b) a, b, c FROM tempo;

 a | b |  c
---+---+-----
 1 | 1 | foo
 1 | 2 | bar
(2 rows)

This DISTINCT ON select only requires the given columns (a,b) to be distinct. Which c is returned is undefined (effectively random).

HTH
--
 e-Trolley Sayegh & John, Nabil Sayegh
 Tel.: 0700 etrolley /// 0700 38765539
 Fax.: +49 69 8299381-8

 PGP : http://www.e-trolley.de
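If it matters which row survives, DISTINCT ON can be combined with ORDER BY: for each (a,b) group, the first row in the sort order is the one kept. For example, to keep the row with the greatest c for each key:

  SELECT DISTINCT ON (a, b) a, b, c
  FROM tempo
  ORDER BY a, b, c DESC;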
On Mon, Oct 06, 2003 at 18:56:28 +0200, papapep <papapep@gmx.net> wrote:
> Perhaps you're suggesting inserting all the data into an initial
> temporary table and checking for duplicates there, before transferring
> the "good" rows to the real table?

Yes. That allows you to use SQL to handle the duplicates, which is probably going to be simpler than writing a trigger. You can also use copy to load the temp table, which will be faster than using inserts.

> If so, how should I do the duplicate control in the temp table? (For me
> that is as difficult as my first question :-( )
> Consider that the primary key we use to see whether a row is duplicated
> or not is a 5-field key (it has to be so; it's complex data to filter).

You haven't given us a rule to use when deciding which tuples to remove when a duplicate is detected. Without such a rule we can't give you detailed instructions on how to remove the duplicates. Having a 5-column primary key doesn't make the problem significantly more difficult to solve; it mostly just adds a small amount of typing.
Josh Berkus wrote:
| Are you a Spanish speaker? Or Brazilian? I'd like to get someone to
| help you, or at least translate, in a language you are more comfortable
| with. So far, nobody on this list is clear on what you want help doing.

I'm very sorry for not being able to explain my problem (BTW, I'm a Spanish speaker...). I'll try to do it better.

I've got a table that has the following fields:

F1 F2 F3 F4 F5 .........F16

and we have defined that no two rows may repeat the fields F1, F2, F5, F14 and F15 (our complex primary key).

I've got, on the other hand, text files prepared to be inserted into this table with the \copy command, but we are not sure that they contain no repeated rows (we've found duplicated rows several times).

I'm trying to create a function that controls these duplicated rows to keep the table "clean" of them. In fact, I don't mind whether the duplicated rows are inserted into a "duplicated rows" table (that might actually be a good way to detect where they are generated) or whether they just go "missing in action".

I hope (I really, really do...) that this time I've been able to explain it better. Thanks to all for your patience.

Josep Sanchez
[papapep]
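For concreteness, the table Josep describes might be declared something like this (the column types are made up; only the five key columns are specified in the thread):

  CREATE TABLE mydata (
      f1 int, f2 int, f3 text, f4 text, f5 int,
      -- ... f6 through f13 ...
      f14 int, f15 int, f16 text,
      PRIMARY KEY (f1, f2, f5, f14, f15)
  );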
On Mon, Oct 06, 2003 at 19:28:29 +0200, papapep <papapep@gmx.net> wrote:
> I'm trying to create a function that controls these duplicated rows to
> keep the table "clean" of them. In fact, I don't mind whether the
> duplicated rows are inserted into a "duplicated rows" table (that might
> actually be a good way to detect where they are generated) or whether
> they just go "missing in action".

And what do you want to happen when you run across a duplicate row? Do you just want to discard tuples with a duplicate primary key? If you are discarding duplicates, do you care which of the duplicates is discarded? If you want to combine data from the duplicates, do you have a precise description of what you want to happen?
Bruno Wolff III wrote:
| And what do you want to happen when you run across a duplicate row?
| Do you just want to discard tuples with a duplicate primary key?

Initially I wanted to do so, but then I thought of "registering" them in a dups table.

| If you are discarding duplicates, do you care which of the duplicates
| is discarded?

That's why I said it might be a good thing to keep them in a duplicated-rows table, for reviewing them.

| If you want to combine data from the duplicates, do you have a precise
| description of what you want to happen?

No, I do not need to combine data from the duplicates. The entire row is accepted or not.

Josep Sànchez
[papapep]
Josep,

> I'm very sorry for not being able to explain my problem (BTW, I'm a
> Spanish speaker...). I'll try to do it better.

That was very clear.

> I've got a table that has the following fields:
>
> F1 F2 F3 F4 F5 .........F16
>
> and we have defined that no two rows may repeat the fields F1, F2, F5,
> F14 and F15 (our complex primary key).
>
> I've got, on the other hand, text files prepared to be inserted into
> this table with the \copy command, but we are not sure that they
> contain no repeated rows (we've found duplicated rows several times).

I'd suggest using the temporary table (or "holding table") approach suggested by other posters on this list. While you could use a trigger, that would mean using INSERT instead of COPY, which would slow down your data loading a lot.

What you want to do after loading the table really depends on how you want to handle duplicates. If you just want to ignore them, then use the SELECT DISTINCT ON suggestion from another list member ... although this has the defect of grabbing the first row with each primary key and ignoring the others, which might have different information in the other columns.

If you want to raise an alarm and halt the import on finding a duplicate, then do:

SELECT F1, F2, F5, F14, F15, count(*) AS duplicates
FROM holding_table
GROUP BY F1, F2, F5, F14, F15
HAVING count(*) > 1;

The rows returned by that query will show you the primary keys of the duplicate rows.

--
Josh Berkus
Aglio Database Solutions
San Francisco
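Since Josep also wants to keep the offenders for review, that same GROUP BY query can feed a "duplicated rows" table. A sketch under the thread's naming (dup_rows is a made-up name):

  -- save every row whose 5-field key occurs more than once
  CREATE TABLE dup_rows AS
  SELECT h.*
  FROM holding_table h
  WHERE (h.F1, h.F2, h.F5, h.F14, h.F15) IN (
      SELECT F1, F2, F5, F14, F15
      FROM holding_table
      GROUP BY F1, F2, F5, F14, F15
      HAVING count(*) > 1
  );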
Please keep messages copied to the list.

On Mon, Oct 06, 2003 at 19:38:46 +0200, papapep <papapep@gmx.net> wrote:
> Initially I wanted to do so, but then I thought of "registering" them
> in a dups table.

OK.

> | If you are discarding duplicates, do you care which of the duplicates
> | is discarded?
>
> That's why I said it might be a good thing to keep them in a
> duplicated-rows table, for reviewing them.
>
> | If you want to combine data from the duplicates, do you have a precise
> | description of what you want to happen?
>
> No, I do not need to combine data from the duplicates. The entire row
> is accepted or not.

If you initially don't care which dup gets inserted, then the SELECT DISTINCT ON suggestion will work for loading the real table. Something like:

insert into real_table
  select distinct on (pk1, pk2, pk3, pk4, pk5) * from temp_table;

To see the sets of duplicates you can do something like:

select * from temp_table,
  (select pk1, pk2, pk3, pk4, pk5 from temp_table
    group by pk1, pk2, pk3, pk4, pk5
    having count(*) > 1) as dups
  where temp_table.pk1 = dups.pk1 and
    temp_table.pk2 = dups.pk2 and
    temp_table.pk3 = dups.pk3 and
    temp_table.pk4 = dups.pk4 and
    temp_table.pk5 = dups.pk5;

If there are a large number of records being imported, you probably want to create a multicolumn index on pk1, pk2, pk3, pk4, and pk5 on the temp table after loading it and before doing the query for duplicates.
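That index, following Bruno's naming (the index name itself is made up), would be something like:

  CREATE INDEX temp_table_pk_idx
      ON temp_table (pk1, pk2, pk3, pk4, pk5);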
Hello,

If you wish to "clean" a table of duplicates, the following SQL will work. It is used widely in the Oracle world, and I just tried it successfully on my Red Hat Linux 9 PostgreSQL 7.3.2 database.

If you have a table T with columns C1, C2, ... Cn, where C1, C2, ... Cr are the candidate key columns for a concatenated primary key (r <= n), and T currently contains duplicates (obviously with no primary key constraint defined on it, otherwise such a table with duplicate rows could not exist), then to remove the duplicates so that only tuples with unique candidate key values remain (i.e. cleaned), run the following delete SQL, which uses a correlated subquery:

delete from T t1
where oid < (select max(oid) from T t2
             where t1.C1 = t2.C1
               and t1.C2 = t2.C2
               ...
               and t1.Cr = t2.Cr);

This is based on the fact that though rows in a table may be duplicates, they must always have unique oid numbers. With the above delete statement, only the row with the maximum value of oid within each set of duplicate rows (for a particular value of C1||C2...||Cr) will remain. If in the above SQL you instead use:

where oid > (select min(oid) ...

then only the row with the minimum value of oid will remain.

So \copy to a temp table (with no primary key constraints defined) and then apply SQL similar to the above to clean the loaded temp table, which can then be used to load the target production table with no duplicates.

Regards,
Partha Sur
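Applied to Josep's 5-field key on a hypothetical staging table (the table name is made up, and this assumes the table has oids, the default in the PostgreSQL versions discussed here), the same pattern might read:

  -- keep only the highest-oid copy of each key
  DELETE FROM staging s1
  WHERE oid < (SELECT max(oid) FROM staging s2
               WHERE s1.F1  = s2.F1
                 AND s1.F2  = s2.F2
                 AND s1.F5  = s2.F5
                 AND s1.F14 = s2.F14
                 AND s1.F15 = s2.F15);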
Many, many, many thanks to all the people that have tried (and managed) to help me. I've now got various possibilities to try out and see which way is best for us.

Josep Sànchez
[papapep]