Thread: Avoiding duplicates (or at least marking them as such) in a "cumulative" transaction table.

Hi,
I am looking for an efficient and effective solution to eliminate
duplicates in a continuously updated "cumulative" transaction table
(no deletions are envisioned as all non-redundant records are
important). Below is my situation.

I have a “central” transaction table (A) that is expected to be
populated with many records at a time by several users simultaneously.
I would like to ensure uniqueness of the tuples in this “central”
transaction table.
On average, approximately 40% to 60% of the records in each batch
presented for persistence to the “central” transaction table are
expected to already exist in the “central” table and hence to be
duplicates.
These batches will arrive via an insert query from another “central”
transaction table (B) that is also populated by other data-generating
processes. Table B is expected to contain duplicates. Each client’s
records in table B will share a common value per batch.

I intend to prune out duplicates within each of these batches of
records, but this will not guarantee uniqueness of the records across
all the batches.

I have thought of the following ideas, but I would welcome further
advice and suggestions, including the possibility of using triggers
and so on.

1) For each batch of records arriving from a client, perform a right
join of the entire table A to table B on … WHERE A.id IS NULL.
Cons: I think it is likely that more than one insert transaction will
be running at the same time, which may result in the transactions not
seeing the updates the other(s) are performing and hence not being
aware of potential duplicates. It may also be too expensive to perform
a join for each batch insert query.
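
To illustrate, the insert I have in mind for option 1 would look
roughly like this (table and column names are made up; batch_id stands
for the common value shared by each client's batch):

INSERT INTO table_a (batch_id, potentially_redundant_field)
SELECT b.batch_id, b.potentially_redundant_field
FROM table_a a
RIGHT JOIN table_b b
    ON a.potentially_redundant_field = b.potentially_redundant_field
WHERE b.batch_id = 1234          -- only the client's current batch
  AND a.id IS NULL;              -- keep rows not already present in table A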

2) Simply write the records of each batch into this table, then detect
duplicates later (maybe at some set time intervals, or maybe at the
end of the clients' run) by using count() like this:


UPDATE table_a b
SET is_redundant = TRUE
FROM (
    SELECT
        min(id) AS id___b,
        potentially_redundant_field
    FROM table_a
    GROUP BY potentially_redundant_field
    HAVING count(*) > 1
) a
WHERE b.potentially_redundant_field = a.potentially_redundant_field
  AND b.id > a.id___b;


3) A combination of option 1 (if it is not too expensive to perform a
join for each batch insert query) and option 2. Assuming many of the
records of table_a will already be in memory, and so will the incoming
batch of records meant for inclusion into table_a, performing a right
join may not be expensive and will result in fewer records being
written to table_a (as the potential duplicates of the previously
persisted records will have been hived off). This will mean fewer
records for option 2 to process.

Allan.

On Sun, Mar 7, 2010 at 1:45 AM, Allan Kamau <kamauallan@gmail.com> wrote:
> Hi,
> I am looking for an efficient and effective solution to eliminate
> duplicates in a continuously updated "cumulative" transaction table
> (no deletions are envisioned as all non-redundant records are
> important). Below is my situation.

Is there a reason you can't use a unique index and detect failed
inserts and reject them?
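
I.e. something like (using the field name from your example; the index
name is arbitrary):

CREATE UNIQUE INDEX table_a_potentially_redundant_field_uq
    ON table_a (potentially_redundant_field);

A duplicate insert then fails with a unique_violation error that the
client can trap and ignore.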

On Mon, Mar 8, 2010 at 5:49 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sun, Mar 7, 2010 at 1:45 AM, Allan Kamau <kamauallan@gmail.com> wrote:
>> Hi,
>> I am looking for an efficient and effective solution to eliminate
>> duplicates in a continuously updated "cumulative" transaction table
>> (no deletions are envisioned as all non-redundant records are
>> important). Below is my situation.
>
> Is there a reason you can't use a unique index and detect failed
> inserts and reject them?
>

I think it would have been possible to make use of a unique index as
you have suggested, and silently trap the uniqueness violation.

But in my case (as pointed out in my previous lengthy mail) I am
inserting multiple records at once, which implicitly means a single
transaction. I think in this scenario a violation of uniqueness by
even a single record will lead to all the other records (in this
batch) being rejected as well.

Is there perhaps a way to single out only the unique-constraint-violating
record(s) without having to perform individual record inserts? I am
following the example found here:
"http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING".

Allan.

On Sun, Mar 7, 2010 at 11:31 PM, Allan Kamau <kamauallan@gmail.com> wrote:
> On Mon, Mar 8, 2010 at 5:49 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Sun, Mar 7, 2010 at 1:45 AM, Allan Kamau <kamauallan@gmail.com> wrote:
>>> Hi,
>>> I am looking for an efficient and effective solution to eliminate
>>> duplicates in a continuously updated "cumulative" transaction table
>>> (no deletions are envisioned as all non-redundant records are
>>> important). Below is my situation.
>>
>> Is there a reason you can't use a unique index and detect failed
>> inserts and reject them?
>>
>
> I think it would have been possible to make use of a unique index as
> you have suggested, and silently trap the uniqueness violation.
>
> But in my case (as pointed out in my previous lengthy mail) I am
> inserting multiple records at once, which implicitly means a single
> transaction. I think in this scenario a violation of uniqueness by
> even a single record will lead to all the other records (in this
> batch) being rejected as well.

There are a whole bunch of approaches to this.  You're basically bulk
loading data into a table that already has data in it.  I'd put
everything into a temp table, then insert into main those rows that
don't already exist in the main table.  And I'd still have that unique
index on my pk in the main table.  First maintain uniqueness, then
figure out how to load data into it.  Insert ... where not exists is
quite fast, and it's all one big transaction that either goes or
doesn't, so there's no cleaning up broken / extra records later.
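
Roughly, with made-up table / column names (adjust to your real schema):

BEGIN;
CREATE TEMP TABLE tmp_batch (potentially_redundant_field text) ON COMMIT DROP;
-- load the incoming batch into tmp_batch here, e.g. with COPY or INSERT
INSERT INTO main_table (potentially_redundant_field)
SELECT DISTINCT t.potentially_redundant_field  -- DISTINCT also prunes dups within the batch
FROM tmp_batch t
WHERE NOT EXISTS (
    SELECT 1
    FROM main_table m
    WHERE m.potentially_redundant_field = t.potentially_redundant_field
);
COMMIT;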

Are you looking at upserting these inputs?  Or is the original row good enough?

>
> Is there perhaps a way to only single out the unique constraint
> violating record(s) without having to perform individual record
> inserts, I am following the example found here
> "http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING".
>
> Allan.
>



--
When fascism comes to America, it will be intolerance sold as diversity.

On Mon, Mar 8, 2010 at 10:16 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sun, Mar 7, 2010 at 11:31 PM, Allan Kamau <kamauallan@gmail.com> wrote:
>> On Mon, Mar 8, 2010 at 5:49 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> On Sun, Mar 7, 2010 at 1:45 AM, Allan Kamau <kamauallan@gmail.com> wrote:
>>>> Hi,
>>>> I am looking for an efficient and effective solution to eliminate
>>>> duplicates in a continuously updated "cumulative" transaction table
>>>> (no deletions are envisioned as all non-redundant records are
>>>> important). Below is my situation.
>>>
>>> Is there a reason you can't use a unique index and detect failed
>>> inserts and reject them?
>>>
>>
>> I think it would have been possible to make use of a unique index as
>> you have suggested, and silently trap the uniqueness violation.
>>
>> But in my case (as pointed out in my previous lengthy mail) I am
>> inserting multiple records at once, which implicitly means a single
>> transaction. I think in this scenario a violation of uniqueness by
>> even a single record will lead to all the other records (in this
>> batch) being rejected as well.
>
> There are a whole bunch of approaches to this.  You're basically bulk
> loading data into a table that already has data in it.  I'd put
> everything into a temp table, then insert into main those rows that
> don't already exist in the main table.  And I'd still have that unique
> index on my pk in the main table.  First maintain uniqueness, then
> figure out how to load data into it.  Insert ... where not exists is
> quite fast, and it's all one big transaction that either goes or
> doesn't, so there's no cleaning up broken / extra records later.
>
> Are you looking at upserting these inputs?  Or is the original row good enough?
>
>>
>> Is there perhaps a way to single out only the unique-constraint-violating
>> record(s) without having to perform individual record inserts? I am
>> following the example found here:
>> "http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING".
>>
>> Allan.
>>
>
>
>
> --
> When fascism comes to America, it will be intolerance sold as diversity.
>

The original row is good enough; a new row should not replace a
similar but already existing record. I will now look at evaluating the
following two queries (for performance):

1) INSERT INTO main_table(...) SELECT a... FROM mytmp_table a WHERE
a.supposedly_unique_field NOT IN (SELECT m.supposedly_unique_field
FROM main_table m);

2) INSERT INTO main_table(...) SELECT a... FROM mytmp_table a LEFT JOIN
main_table b ON b.supposedly_unique_field=a.supposedly_unique_field
WHERE b.id IS NULL;

But since more than one client may still be running such an insert
query at the same time, a transaction that completes after data has
been inserted into the main table by other transaction(s) may still
attempt to insert duplicate records (which would trigger an error if a
unique constraint were in effect on the "main" table). I am therefore
opting not to have a unique constraint on this field, but rather
simply an index. Then I make use of a boolean (initially NULL) field
in the main table which I will update appropriately at the end of the
run of all the data-updating clients (a cleanup step).
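
Concretely, something like (index name is arbitrary):

ALTER TABLE main_table ADD COLUMN is_redundant boolean;  -- left NULL initially
CREATE INDEX main_table_supposedly_unique_field_idx
    ON main_table (supposedly_unique_field);             -- plain, non-unique index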

The update to this "is_redundant" field could be via:

UPDATE main_table b
SET is_redundant = TRUE
FROM (
    SELECT
        min(id) AS id___min,
        a.supposedly_unique_field
    FROM main_table a
    GROUP BY a.supposedly_unique_field
    HAVING count(*) > 1
) a
WHERE b.supposedly_unique_field = a.supposedly_unique_field
  AND b.id > a.id___min;


Allan.