Re: how to make duplicate finding query faster? - Mailing list pgsql-admin

From Gavan Schneider
Subject Re: how to make duplicate finding query faster?
Date
Msg-id 43C74F51-3099-4787-8C73-CE928CEAB01A@pendari.org
In response to Re: how to make duplicate finding query faster?  (Sachin Kumar <sachinkumaras@gmail.com>)
List pgsql-admin

On 30 Dec 2020, at 19:59, Sachin Kumar wrote:

> This will not suffice our requirement. we have to validate that there
> would not be any duplicate value in DB. we have done that earlier by
> leaving DB to check if there is any duplicate and found duplicate value in
> DB.

If the constraint had been in place before any data was inserted into the database then there would be no duplicates now. Note, too, that adding a constraint to an existing table will not remove data which violates the constraint; PostgreSQL simply refuses to create a UNIQUE constraint while duplicate values are present.
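As a minimal sketch, assuming the table “card” and column “number” described below (the constraint name is my invention), adding the rule after the fact looks like this:

ALTER TABLE card
    ADD CONSTRAINT card_number_unique UNIQUE (number);
    -- fails with "could not create unique index" if duplicates already exist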

> we have a table "card" and a single column "number " which we are updated
> with a csv file with 600k numbers and require no duplicate number should be
> there in the table.

One trap is that the data may not be byte-for-byte the same even if it’s a “duplicate”. The “number” may have spaces or other human-readable characters, so it is a duplicate from the human perspective but not the machine’s. These “duplicates” will also be very hard to find with a simple query. If the OP is not already doing this I suggest the “number” be stored in the database as BIGINT, e.g.

cc_number BIGINT NOT NULL UNIQUE,  -- this will prevent any addition of a duplicate cc_number
PRIMARY KEY (cc_number)            -- or, more simply, in one line: cc_number BIGINT PRIMARY KEY

Obviously the CSV will have to be checked to ensure each number is a legal integer value.
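As a sketch of one way to do that check (the staging table is my invention, not the OP's schema): load the raw CSV into a text column, strip everything that is not a digit, and cast. The cast raises an error for any value that is not a legal integer:

CREATE TEMP TABLE card_staging (raw_number text);
-- in psql: \copy card_staging FROM 'cards.csv' CSV

SELECT regexp_replace(raw_number, '\D', '', 'g')::bigint AS cc_number
FROM card_staging;

This also deals with the space/hyphen “duplicates” mentioned above, since the stray characters are removed before any comparison happens.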

There is nothing in the original post to say what is contained in the card field, and it has not been stated whether this field must also be unique. If it holds text then it will be more interesting to work with, as a human may see the duplication where the machine does not; but if it’s only the number that has to be unique then it is straightforward.

> Please if can have a faster query can help us in achieving this requirement.

As noted by Holger Jakobs up thread, it is not possible for a query to be faster than checks done at the time the data is inserted: the data has to be inserted and the indexes updated in any case. If the duplicate key constraint is violated then the duplicate has been found; otherwise the data has to be inserted and a query run afterwards.
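For reference, the after-the-fact check amounts to grouping on the candidate key and keeping the groups with more than one row (a sketch, assuming the cc_number column suggested above):

SELECT cc_number, count(*) AS occurrences
FROM card
GROUP BY cc_number
HAVING count(*) > 1;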

Assuming there is a reasonably recent PostgreSQL hidden behind the Django and Python, i.e., 9.5 or later, then the INSERT can be attempted with

INSERT INTO  …  ON CONFLICT DO NOTHING
See: https://www.postgresql.org/about/news/postgresql-131-125-1110-1015-9620-and-9524-released-2111/

Once this is in place the data will be added to the table as fast as is possible, and with no duplicates. It will be very much slower if every insertion has to be checked in the Django+Python stack. Basically, if speed is needed, go to the database and talk SQL.
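Putting the pieces together, a sketch of the whole load (table and staging names are assumptions, as above); the database silently skips any number it has already seen, including duplicates within the same file:

INSERT INTO card (cc_number)
SELECT regexp_replace(raw_number, '\D', '', 'g')::bigint
FROM card_staging
ON CONFLICT (cc_number) DO NOTHING;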

Regards
Gavan Schneider (who can offer no help with Django/Python)
