> On 3 Feb 2024, at 13:20, Lok P <loknath.73@gmail.com> wrote:
>
> Hello All,
> A non-partitioned table having ~4.8 billion rows in it, with a data size of ~1.4TB, a row size of ~313 bytes, and
> ~127 columns. Approximately 1 billion+ duplicate rows have been inserted into it, and we want to get the duplicate
> data removed from this table and create a PK/unique constraint back, so as to not have duplicate values in future.
> We are struggling to do the same.
~4.8 billion rows of which ~1 billion are duplicates… Wait a minute…
Did you verify that your ID column is wider than 32 bits?
Because if it’s a 32-bit integer, the range of values it can hold is about 4.3 billion, after which it wraps around.
With ~4.8 billion rows that would result in about ~0.5 billion repeated ID values; since each repeated value sits on
two rows, that gives you the reported ~1 billion duplicate IDs.
If that’s the case, your duplicates obviously aren’t really duplicates and you require a different type of solution.
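A quick way to verify, assuming the table and column are literally named TAB1 and ID as in your mail (adjust to your
actual names; I haven't run this against your schema):

select data_type
from information_schema.columns
where table_name = 'tab1' and column_name = 'id';
-- unquoted identifiers fold to lowercase in information_schema, hence 'tab1' and 'id'

select min(ID), max(ID), count(*) from TAB1;

If the first query says integer rather than bigint, and count(*) exceeds max(ID) - min(ID) + 1, then some ID values
must repeat (pigeonhole), and no amount of row deduplication will let you put a unique constraint on ID.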
> Teammates suggested doing this using the CTAS method, i.e. create a new table with the unique record set and then
> drop the main table. Something as below
>
> create table TAB1_New
> as
> SELECT * from TAB1 A
> where ID in
> (select min(ID) from TAB1
> group by ID having count(ID)>=1 );
>
> But for the above to work faster, they suggested creating an index on the column used for the duplicate check, i.e.
> the ID column. So, creating the index itself took ~2hrs+ and the index size now shows as ~116GB.
>
> Create index idx1 on TAB1(ID)
Are your duplicates exact duplicates? Or is there an order of preference among them?
And if so, what really makes those rows unique?
That matters for choosing how to deduplicate these rows.
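(Incidentally, the CTAS as quoted won't deduplicate anything: with group by ID, min(ID) is just the ID itself, and
having count(ID)>=1 holds for every group, so the IN-list ends up containing every ID in the table.)

If the duplicates are exact copies and any one of them may be kept, a DISTINCT ON variant of that CTAS avoids the
IN-subquery altogether. A sketch only, again assuming the names from your mail, so test it on a copy first:

create table TAB1_New as
select distinct on (ID) *
from TAB1
order by ID;

If one copy is preferred over the others, add the deciding column to the order by, e.g. "order by ID, modified_at
desc" to keep the most recent row per ID (modified_at being a hypothetical column here). And if the IDs merely
wrapped around, none of this applies; the fix would then be altering the column to bigint rather than deleting rows.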
Regards,
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.