Thread: how to make duplicate finding query faster?

how to make duplicate finding query faster?

From: Sachin Kumar
Date:
Hi All,

I am uploading data into PostgreSQL from a CSV file and checking whether any duplicate value already exists in the DB; if so, it should return a duplicate error. I am using the query below.

if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
    flag = 2
else:
    flag = 1

It is taking too much time; the CSV contains 600k cards.

Kindly help me in making the query faster.

I am using Python, Django & PostgreSQL.
--

Best Regards,
Sachin Kumar

Re: how to make duplicate finding query faster?

From: Holger Jakobs
Date:
On 30.12.20 at 08:36, Sachin Kumar wrote:
Hi All,

I am uploading data into PostgreSQL from a CSV file and checking whether any duplicate value already exists in the DB; if so, it should return a duplicate error. I am using the query below.

if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
    flag = 2
else:
    flag = 1

It is taking too much time; the CSV contains 600k cards.

Kindly help me in making the query faster.

I am using Python, Django & PostgreSQL.
--

Best Regards,
Sachin Kumar

I think it would be easier not to check for duplicates beforehand, but to let the DB complain about duplicates.

That would roughly cut the round trips to the DB in half: instead of a check plus an insert there would be only an insert, which might fail every now and then.
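In Django terms that could look roughly like the sketch below. It assumes the Card_Bank model from the original post declares ACCOUNT_NUMBER with unique=True (so PostgreSQL itself enforces the constraint) and keeps the flag convention from the post; the import path is made up, adjust to the real project.

from django.db import IntegrityError, transaction
from myapp.models import Card_Bank  # hypothetical app path; use the real one

def insert_card(card_number):
    # One INSERT per card; the unique constraint does the duplicate check.
    try:
        # The savepoint from transaction.atomic() keeps the surrounding
        # transaction usable after PostgreSQL rejects a duplicate row.
        with transaction.atomic():
            Card_Bank.objects.create(ACCOUNT_NUMBER=card_number)
        return 1  # inserted
    except IntegrityError:
        return 2  # duplicate rejected by the database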

Regards,

Holger

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

Re: how to make duplicate finding query faster?

From: Sachin Kumar
Date:
Hi Mr. Holger,

This will not suffice for our requirement. We have to validate that there are no duplicate values in the DB. We tried that earlier by leaving it to the DB to check for duplicates, and still found duplicate values in the DB.

We have a table "card" with a single column "number", which we update from a CSV file with 600k numbers, and we require that no duplicate number be present in the table.

Please suggest a faster query that can help us achieve this requirement.



On Wed, Dec 30, 2020 at 1:54 PM Holger Jakobs <holger@jakobs.com> wrote:
On 30.12.20 at 08:36, Sachin Kumar wrote:
Hi All,

I am uploading data into PostgreSQL from a CSV file and checking whether any duplicate value already exists in the DB; if so, it should return a duplicate error. I am using the query below.

if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
    flag = 2
else:
    flag = 1

It is taking too much time; the CSV contains 600k cards.

Kindly help me in making the query faster.

I am using Python, Django & PostgreSQL.
--

Best Regards,
Sachin Kumar

I think it would be easier not to check for duplicates beforehand, but to let the DB complain about duplicates.

That would roughly cut the round trips to the DB in half: instead of a check plus an insert there would be only an insert, which might fail every now and then.

Regards,

Holger

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012


--

Best Regards,
Sachin Kumar

Re: how to make duplicate finding query faster?

From: Holger Jakobs
Date:
On 30.12.20 at 09:59, Sachin Kumar wrote:
Hi Mr. Holger,

This will not suffice for our requirement. We have to validate that there are no duplicate values in the DB. We tried that earlier by leaving it to the DB to check for duplicates, and still found duplicate values in the DB.

We have a table "card" with a single column "number", which we update from a CSV file with 600k numbers, and we require that no duplicate number be present in the table.

Please suggest a faster query that can help us achieve this requirement.


Quote: leaving it to the DB to check for duplicates, and still found duplicate values in the DB

No, that's not possible with PostgreSQL. If you have a Primary Key or a Unique Constraint, no duplicates will be allowed by the DB.

There must be a different reason.
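For reference, a minimal (hypothetical) Django model matching the "card" table described earlier could declare the constraint like this; unique=True makes Django create a UNIQUE index in PostgreSQL, so any duplicate insert fails with an IntegrityError instead of silently succeeding:

from django.db import models

class Card(models.Model):
    # One card number per row; the database itself rejects duplicates.
    number = models.BigIntegerField(unique=True)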


-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

Re: how to make duplicate finding query faster?

From: "Gavan Schneider"
Date:

On 30 Dec 2020, at 19:59, Sachin Kumar wrote:

This will not suffice for our requirement. We have to validate that there are no duplicate values in the DB. We tried that earlier by leaving it to the DB to check for duplicates, and still found duplicate values in the DB.

If the constraint was in place before any data was inserted into the database then there will be no duplicates. Adding a constraint to a database will not remove existing data which violates the constraint.

We have a table "card" with a single column "number", which we update from a CSV file with 600k numbers, and we require that no duplicate number be present in the table.

One trap is that the data may not be the same even if it's a "duplicate". The "number" may contain spaces or other human-readable characters: a duplicate from the human perspective but not from the machine's. Such "duplicates" will also be very hard to find with a simple query. If the OP is not already doing this, I suggest the "number" be stored in the database as a BIGINT, e.g.

cc_number BIGINT NOT NULL UNIQUE,  -- this will prevent any addition of a duplicate cc_number
PRIMARY KEY (cc_number)            -- or, more simply, in one line: cc_number BIGINT PRIMARY KEY

Obviously the CSV will have to be checked to ensure each number is a legal integer value.
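That check is cheap to do while reading the file. A rough sketch in Python (the column position and the accepted separators are assumptions, not something stated in the thread):

import csv

def read_card_numbers(path):
    # Normalise each value so "4111 1111 ..." and "4111-1111-..." compare
    # equal, then require a pure digit string before converting to int.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            raw = row[0].replace(" ", "").replace("-", "")
            if not raw.isdigit():
                raise ValueError(f"not a valid card number: {row[0]!r}")
            yield int(raw)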

There is nothing in the original post to say what is contained in the card field, and it has not been stated whether this field is also unique. If it holds text, this will be more interesting to work with, as a human may see the duplication but the machine may not; if it is only the number that has to be unique, then it is straightforward.

Please suggest a faster query that can help us achieve this requirement.

As noted by Holger Jakobs up-thread, it is not possible for a query to be faster than checks done at the time the data is inserted. The data has to be inserted and the indexes updated in any case. If the duplicate key constraint is violated, then the duplicate is found; otherwise the data has to be inserted and a query run afterwards.

Assuming there is a reasonably recent PostgreSQL (9.5 or later) hidden behind the Django and Python, then the INSERT can be attempted with

INSERT INTO  …  ON CONFLICT DO NOTHING — https://www.postgresql.org/about/news/postgresql-131-125-1110-1015-9620-and-9524-released-2111/

Once this is in place, the data will be added to the table as fast as possible, and with no duplicates. It will be very much slower if every insertion has to be checked in the Django+Python stack. Basically, if speed is needed, go to the database and talk in SQL.
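From the Django side, one way to do that (a sketch only; the table and column names "card" and "number" are taken from the thread, and the batching is kept minimal) is to send the whole batch through a raw cursor:

from django.db import connection

def load_cards(numbers):
    # Insert every number and let ON CONFLICT DO NOTHING drop duplicates,
    # instead of running an exists() check before each insert.
    with connection.cursor() as cur:
        cur.executemany(
            "INSERT INTO card (number) VALUES (%s) ON CONFLICT DO NOTHING",
            [(n,) for n in numbers],
        )

If the driver is psycopg2, its execute_values helper (or COPY, as suggested later in the thread) would batch this even more aggressively.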

Regards
Gavan Schneider (who can offer no help with Django/Python)

Re: how to make duplicate finding query faster?

From: Scott Ribe
Date:
> On Dec 30, 2020, at 12:36 AM, Sachin Kumar <sachinkumaras@gmail.com> wrote:
>
> Hi All,
>
> I am uploading data into PostgreSQL from a CSV file and checking whether any duplicate value already exists in the DB; if so, it should return a duplicate error. I am using the query below.
>
> if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
>     flag = 2
> else:
>     flag = 1
>
> It is taking too much time; the CSV contains 600k cards.
>
> Kindly help me in making the query faster.
>
> I am using Python, Django & PostgreSQL.
> --
>
> Best Regards,
> Sachin Kumar

Are you checking one-by-one because your goal is not to fail the whole upload that contains the duplicates, but rather to skip only the duplicates?

If that's the case, I think you'd be better off copying the CSV straight into a temp table, using a join to delete duplicates from it, then insert the remainder into the target table, and finally drop the temp table.
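A rough sketch of that, assuming the "card"/"number" names used earlier in the thread, a staging table name invented for the example, and that Django's cursor wraps psycopg2 (so copy_expert is available):

from django.db import connection, transaction

def load_csv_skipping_duplicates(csv_path):
    with transaction.atomic(), connection.cursor() as cur, open(csv_path) as f:
        cur.execute("CREATE TEMP TABLE card_stage (number BIGINT)")
        # Bulk-load the file; COPY is far faster than row-by-row INSERTs.
        cur.copy_expert("COPY card_stage (number) FROM STDIN WITH (FORMAT csv)", f)
        # Join against the target table and drop anything already present.
        cur.execute("""
            DELETE FROM card_stage s
             USING card c
             WHERE c.number = s.number
        """)
        # DISTINCT also collapses duplicates occurring within the file itself.
        cur.execute("INSERT INTO card (number) SELECT DISTINCT number FROM card_stage")
        cur.execute("DROP TABLE card_stage")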


Re: how to make duplicate finding query faster?

From: Sachin Kumar
Date:
Hi Scott,

Yes, I am checking one by one because my goal is to fail the whole upload if there is any duplicate entry and to inform the user that they have a duplicate entry in the file.

Regards
Sachin


On Wed, Dec 30, 2020 at 6:43 PM Scott Ribe <scott_ribe@elevated-dev.com> wrote:
> On Dec 30, 2020, at 12:36 AM, Sachin Kumar <sachinkumaras@gmail.com> wrote:
>
> Hi All,
>
> I am uploading data into PostgreSQL from a CSV file and checking whether any duplicate value already exists in the DB; if so, it should return a duplicate error. I am using the query below.
>
> if Card_Bank.objects.filter(Q(ACCOUNT_NUMBER=card_number)).exists():
>     flag = 2
> else:
>     flag = 1
>
> It is taking too much time; the CSV contains 600k cards.
>
> Kindly help me in making the query faster.
>
> I am using Python, Django & PostgreSQL.
> --
>
> Best Regards,
> Sachin Kumar

Are you checking one-by-one because your goal is not to fail the whole upload that contains the duplicates, but rather to skip only the duplicates?

If that's the case, I think you'd be better off copying the CSV straight into a temp table, using a join to delete duplicates from it, then insert the remainder into the target table, and finally drop the temp table.


--

Best Regards,
Sachin Kumar

Re: how to make duplicate finding query faster?

From: Scott Ribe
Date:
> On Dec 30, 2020, at 6:24 AM, Sachin Kumar <sachinkumaras@gmail.com> wrote:
>
> Yes, I am checking one by one because my goal is to fail the whole upload if there is any duplicate entry and to inform the user that they have a duplicate entry in the file.

That's not what I said, though. If you want to fail the whole copy, then you don't have to check one by one, just try the copy -- assuming you have the correct constraints in place.

Unless you want to tell the user *which* rows are duplicates, in which case you can try a variant on my prior suggestion: copy into a temp table, use a join to find duplicates...
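For that reporting variant, a sketch along the same lines as the earlier temp-table example (same assumed "card"/"number" names and "card_stage" staging table): after COPYing the CSV into the staging table, one query can list every value that either collides with the existing table or is repeated within the file, so the whole upload can be rejected with a concrete list for the user.

from django.db import connection

def find_duplicate_numbers():
    # Assumes the CSV has already been copied into card_stage as above.
    with connection.cursor() as cur:
        cur.execute("""
            SELECT s.number
              FROM card_stage s
              JOIN card c ON c.number = s.number   -- already in the table
            UNION
            SELECT number
              FROM card_stage                      -- repeated inside the file
             GROUP BY number
            HAVING count(*) > 1
        """)
        return [row[0] for row in cur.fetchall()]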