Re: Using database to find file doublettes in my computer - Mailing list pgsql-general
From | Lothar Behrens |
---|---|
Subject | Re: Using database to find file doublettes in my computer |
Date | |
Msg-id | dc61ab60-2816-478c-9d53-ea9a2a3dcb30@b31g2000prb.googlegroups.com Whole thread Raw |
In response to | Using database to find file doublettes in my computer (Lothar Behrens <lothar.behrens@lollisoft.de>) |
Responses |
Re: Using database to find file doublettes in my computer
|
List | pgsql-general |
On 18 Nov., 07:40, cr...@postnewspapers.com.au (Craig Ringer) wrote: > -- Once paths is populated, extract duplicates: > > SELECT get_filename(path) AS fn, count(path) AS n > FROM paths HAVING count(path) > 1 > INTO TEMPORARY TABLE dup_files; > > -- Creates UNIQUE index on PATH as well > ALTER TABLE dup_files ADD CONSTRAINT PRIMARY KEY (path); > > -- Now build your side-by-side table of duplicates: > > SELECT p1.path, p2.path > FROM paths p1 INNER JOIN paths p2 > ON (get_filename(p1.path) = get_filename(p2.path)) > WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path)) > AND p1.path > p2.path; > > You can replace get_filename(fn) with appropriate code, but I'd write a > quick SQL function marked IMMUTABLE to neatly wrap up the pathname > extraction instead. > Hi Craig, I have done the steps as you described. I have about 14000 files with an md5sum. Based on the full filename I have updated the md5sum in my base table 'ECADFiles'. With the following query I see about 2900 files that are available multible times: select "Name", count("Pfad") As n from "ECADFiles" Group by "Name" having count("Pfad") > 1 Using this query I see 13000 double files as a sum: select sum(n) from ( select "Name", count("Pfad") As n from "ECADFiles" Group by "Name" having count("Pfad") > 1) as temp Using the following query I get ~ 129000 records: select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID" from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name") where EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1) and p1."Datei" > p2."Datei" I have expected a smaller amount of records due to the fact that for 4 files each available 2 times (sum = 8) I have 8 records in ECADFiles, but must have 4 in the above result. So for an average of 2 doubles I expected half the files from ECADFiles, because one is exactly right and the other is on the left. In general this results in about the same or less records than 13000 multible files. Why did I get these 129000 records ? I assume a rotating from files on the left to the right, thus about n - 1 times too much records. Thus I have tested this: select sum(n), sum(r) from ( select "Name", count("Pfad") As n, count("Pfad") * (count("Pfad") - 1) As r from "ECADFiles" Group by "Name" having count("Pfad") > 1 ) as temp But I got 259240. This is probably not correct. Testing that with the first left file to search on right I get the n occurences at all as of n double files. Assuming there are n - 1 too much, I have got my expected result by changing the last AND rule to the opposite: select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID" from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name") where p1."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\ \MW211.SCH' AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1) and p1."Datei" > p2."Datei" Gives 7 records with 7 different right files and the 8th on the left. select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID" from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name") where p2."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\ \MW211.SCH' AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1) and p1."Datei" < p2."Datei" Gives 7 records with 7 different left files and the 8th on the right. Any ideas how to remove these unwanted records ? This seems not to be easy, because I see problems araising when changing the data a cursor runs over. Thanks Lothar
pgsql-general by date: