Thread: Using a database to find duplicate files on my computer
Hi,

I have a problem: finding, as fast as possible, files that are duplicates
(in other words, identical), and also identifying those files that are not
identical.

My approach was to use dir /s and an awk script to convert its output into
an SQL script to be imported into a table. That done, I could start issuing
queries.

But how do I query for files to display a 'left / right view' for each file
that exists in multiple places? I mean this:

This File;Also here
C:\some.txt;C:\backup\some.txt
C:\some.txt;C:\backup1\some.txt
C:\some.txt;C:\backup2\some.txt

but I only have this list:

C:\some.txt
C:\backup\some.txt
C:\backup1\some.txt
C:\backup2\some.txt

The reason is that I am faced with ECAD projects that have been copied
around many times, and I have to identify which files are missing here and
which files are there.

So a manual approach is as follows:

1) Identify one file (schematic1.sch) and see where its copies are.
2) Compare the files of both directories and make a decision about which
   files to use further.
3) Determine conflicts, that is, files that can't be copied together in a
   cleanup.

Are there any approaches or tools that would help? This is a very time
consuming job and I am searching for any solution that saves me time :-)

I know that these problems would not arise if the projects were well
structured and kept in a version management system. But that isn't the
case here :-)

Thanks

Lothar
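The pipeline Lothar describes (dir /s piped through awk into SQL) implies a
table of full paths, roughly like the minimal sketch below. The table and
column names (paths, path) and the file name filelist.txt are illustrative
assumptions, not taken from his post:

CREATE TABLE paths (
    path TEXT PRIMARY KEY   -- full path as emitted by dir /s
);

-- The awk script would then emit INSERT statements such as:
INSERT INTO paths (path) VALUES (E'C:\\some.txt');
INSERT INTO paths (path) VALUES (E'C:\\backup\\some.txt');

-- or, alternatively, plain lines to be bulk-loaded from psql:
-- \copy paths (path) FROM 'filelist.txt'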
Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar.behrens@lollisoft.de> wrote:
> Hi,
>
> I have a problem: finding, as fast as possible, files that are
> duplicates (in other words, identical), and also identifying those
> files that are not identical.
>
> My approach was to use dir /s and an awk script to convert its output
> into an SQL script to be imported into a table. That done, I could
> start issuing queries.
>
> But how do I query for files to display a 'left / right view' for each
> file that exists in multiple places? I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but I only have this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
> The reason is that I am faced with ECAD projects that have been copied
> around many times, and I have to identify which files are missing here
> and which files are there.
>
> So a manual approach is as follows:
>
> 1) Identify one file (schematic1.sch) and see where its copies are.
> 2) Compare the files of both directories and make a decision about
>    which files to use further.
> 3) Determine conflicts, that is, files that can't be copied together
>    in a cleanup.
>
> Are there any approaches or tools that would help?

I also have been in this kind of circumstance before, but I work under
GNU/Linux as always.

1. At that time, I used `md5sum' to generate the fingerprint of all files
   in a given directory to be cleaned up.
2. Later, I created a simple Java program to group the names of all files
   that had the same fingerprint (i.e., MD5 hash). (The same grouping can
   be done in SQL; see the sketch after this message.)
3. I simply deleted all files sharing the same MD5 hash except the one
   with a good filename (in my case, the filename couldn't be relied on
   for comparison, since it differed by small additions like the date,
   the author's name, and the like).
4. After that, I used my brain to find related files based on the
   filenames (e.g., `[2006-05-23] Jeff - x.txt' should be the same as
   `Jenny - x.txt'). Of course, the Java program also helped me in
   grouping the files that I thought to be related.
5. Next, I perused the related files to see whether most of the contents
   were the same. If yes, I took the latest one based on the modified
   time.

> This is a very time consuming job and I am searching for any solution
> that saves me time :-)

Well, I think I saved a lot of time back then: I was able to eliminate
about 7,000 files out of 15,000 in about two weeks.

> I know that these problems would not arise if the projects were well
> structured and kept in a version management system. But that isn't the
> case here :-)

I hope you employ such a system ASAP :-)

> Thanks
>
> Lothar

Best regards,
Eus (FSF member #4445)
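Eus's grouping step, done in the database instead of a Java program, might
look like the sketch below; the table name files and its columns (path,
md5) are assumptions for illustration:

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    md5  TEXT NOT NULL   -- fingerprint, e.g. loaded from md5sum output
);

-- Contents are identical exactly when the fingerprints match:
SELECT md5, count(*) AS copies
FROM files
GROUP BY md5
HAVING count(*) > 1;

-- And to list every location of each duplicated fingerprint:
SELECT f.md5, f.path
FROM files f
WHERE f.md5 IN (SELECT md5 FROM files GROUP BY md5 HAVING count(*) > 1)
ORDER BY f.md5, f.path;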
Lothar Behrens wrote:
> But how do I query for files to display a 'left / right view' for each
> file that exists in multiple places?

One approach is to use a query to extract the names of all files with
duplicates, and store the results in a TEMPORARY table with a UNIQUE index
(or PRIMARY KEY) on the filename.

You then inner-join your paths table to itself on the filename part of the
path, keeping only rows whose filename EXISTS in the table of duplicated
filenames. You should also filter out rows where the first path is <= the
second path, to avoid outputting both (a,b) and (b,a) for each duplicate
and to avoid outputting (a,a) and (b,b) rows.

You can do much the same thing in a single query without the temp table,
but I think you'll need a correlated subquery to check for duplicates (or
a join on a subquery, which will really expand the results to be
processed), so the temp table approach is probably going to be lots
faster.

Quick example code (untested, but it should give you the idea); it assumes
you have a function get_filename(...) that extracts just the filename part
of the path:

CREATE TABLE paths (
    path TEXT PRIMARY KEY
    -- ... other fields ...
);

-- Populate paths.

-- Once paths is populated, extract duplicates:

SELECT get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

-- This also creates a UNIQUE index on fn:
ALTER TABLE dup_files ADD CONSTRAINT dup_files_pkey PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1
INNER JOIN paths p2
   ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path;

You can replace get_filename(...) with appropriate inline code, but I'd
write a quick SQL function marked IMMUTABLE to neatly wrap up the pathname
extraction instead (a sketch follows this message).

--
Craig Ringer
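Such an IMMUTABLE wrapper might look like the following sketch, which
treats both '/' and '\' as path separators; it is essentially the
regexp-based version Craig attaches later in this thread:

-- Return everything after the last '/' or '\' in a path:
CREATE OR REPLACE FUNCTION get_filename(text) RETURNS text AS $$
    SELECT (regexp_split_to_array($1, E'[/\\\\]'))
           [array_upper(regexp_split_to_array($1, E'[/\\\\]'), 1)];
$$ LANGUAGE SQL IMMUTABLE;

-- Usage:
SELECT get_filename(E'C:\\backup\\some.txt');  -- 'some.txt'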
On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:
> I have a problem: finding, as fast as possible, files that are
> duplicates (in other words, identical), and also identifying those
> files that are not identical.

I'd probably just take a simple Unix command line approach, something
like:

  find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

This will give you a list of files whose contents are identical
(according to an MD5 hash). An alternative would be to put the hashes
into a database and run the matching up there (a sketch follows this
message).

  Sam
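A sketch of that database variant: load the md5sum output into a table and
let SQL do the matching. The table name hashes and the file name
hashes.tsv are illustrative assumptions:

-- md5sum prints one '<32-hex-digit hash>  <path>' line per file.
CREATE TABLE hashes (
    md5  char(32) NOT NULL,
    path text     NOT NULL
);

-- Load the output, e.g. after converting the field separator to a tab:
-- \copy hashes (md5, path) FROM 'hashes.tsv'

-- Matching identical contents is then a self-join on the hash:
SELECT h1.path, h2.path
FROM hashes h1
JOIN hashes h2 ON h1.md5 = h2.md5 AND h1.path < h2.path;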
On Tue, Nov 18, 2008 at 12:36:42PM +0000, Sam Mason wrote:
> On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:
> > I have a problem: finding, as fast as possible, files that are
> > duplicates (in other words, identical), and also identifying those
> > files that are not identical.
>
> I'd probably just take a simple Unix command line approach, something
> like:
>
>   find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

You can save a little bit of time by using

  find /base/dir -type f -print0 | xargs -0 md5sum | sort | uniq -Dw 32

since xargs batches many files into each md5sum invocation instead of
forking one md5sum process per file.

> This will give you a list of files whose contents are identical
> (according to an MD5 hash). An alternative would be to put the hashes
> into a database and run the matching up there.
>
> Sam

Gerhard
On 18 Nov., 07:40, cr...@postnewspapers.com.au (Craig Ringer) wrote:
> -- Once paths is populated, extract duplicates:
>
> SELECT get_filename(path) AS fn, count(path) AS n
> INTO TEMPORARY TABLE dup_files
> FROM paths
> GROUP BY get_filename(path)
> HAVING count(path) > 1;
>
> -- This also creates a UNIQUE index on fn:
> ALTER TABLE dup_files ADD CONSTRAINT dup_files_pkey PRIMARY KEY (fn);
>
> -- Now build your side-by-side table of duplicates:
>
> SELECT p1.path, p2.path
> FROM paths p1
> INNER JOIN paths p2
>    ON (get_filename(p1.path) = get_filename(p2.path))
> WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
>   AND p1.path > p2.path;
>
> You can replace get_filename(...) with appropriate inline code, but I'd
> write a quick SQL function marked IMMUTABLE to neatly wrap up the
> pathname extraction instead.

Hi Craig,

I have done the steps as you described. I have about 14,000 files, each
with an md5sum. Based on the full filename I have updated the md5sum in my
base table 'ECADFiles'.

With the following query I see about 2,900 file names that occur multiple
times:

select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1

Using this query I see that these amount to 13,000 duplicate files in
total:

select sum(n) from (
  select "Name", count("Pfad") As n
  from "ECADFiles"
  Group by "Name"
  having count("Pfad") > 1) as temp

Using the following query I get ~129,000 records:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" > p2."Datei"

I had expected a smaller number of records, because for 4 files that each
exist 2 times (sum = 8) I have 8 records in ECADFiles, but should get 4 in
the above result. So for an average of 2 copies I expected half the files
from ECADFiles, because one is exactly on the right and the other is on
the left. In general this should give about the same number of records as
the 13,000 duplicated files, or fewer.

Why did I get these 129,000 records? I assume a rotation of files from the
left to the right, giving about n - 1 times too many records. So I tested
this:

select sum(n), sum(r) from (
  select "Name", count("Pfad") As n,
         count("Pfad") * (count("Pfad") - 1) As r
  from "ECADFiles"
  Group by "Name"
  having count("Pfad") > 1
) as temp

But I got 259,240. This is probably not correct either.

Testing with a fixed file on the left, I get all n occurrences of its n
copies. Assuming there are n - 1 too many, I got my expected result by
changing the last AND rule to its opposite:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where p1."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" > p2."Datei"

gives 7 records with 7 different right files and the 8th on the left.

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where p2."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" < p2."Datei"

gives 7 records with 7 different left files and the 8th on the right.

Any ideas how to remove these unwanted records? This seems not to be easy,
because I see problems arising when changing the data a cursor runs over.
Thanks

Lothar
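The numbers Lothar reports are exactly what the p1 > p2 filter predicts: a
filename with n copies yields n * (n - 1) ordered pairs, and keeping only
the pairs with p1."Datei" > p2."Datei" halves that to n * (n - 1) / 2 per
filename. Summing over all duplicated names should therefore reproduce the
observed row count (259,240 / 2 = 129,620, i.e. the ~129,000 records
above), which can be checked against his table with a query like this
sketch:

-- Each name with n copies contributes n*(n-1)/2 rows to the self-join
-- restricted by p1."Datei" > p2."Datei":
select sum(n * (n - 1) / 2) as expected_pairs from (
  select count("Pfad") as n
  from "ECADFiles"
  group by "Name"
  having count("Pfad") > 1
) as temp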
Lothar Behrens wrote:
> I had expected a smaller number of records, because for 4 files that
> each exist 2 times (sum = 8) I have 8 records in ECADFiles, but should
> get 4 in the above result.
>
> So for an average of 2 copies I expected half the files from ECADFiles,
> because one is exactly on the right and the other is on the left.

It's a combinatorial problem. If you have 3 duplicate files, eg:

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');

then the query process I described above will output the matches:

 C:\path\file1.txt | C:\path2\file1.txt
 /path/file1.txt   | C:\path2\file1.txt
 /path/file1.txt   | C:\path\file1.txt

because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:

(A,B)
(A,C)
(B,C)

I've attached test SQL that does the above.

Presumably you want to only show, say:

(A,B)
(A,C)

or maybe:

(filename, A)
(filename, B)
(filename, C)

If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates, then listing all the locations
each appears in. Again, you can do it without the temp table, it'll
probably just be slower.

None of what I've written is particularly fast anyway - it evaluates
those regular expressions many more times than should be required, for
example.

--
Craig Ringer

BEGIN;

CREATE TABLE paths (
    path TEXT PRIMARY KEY
);

CREATE OR REPLACE FUNCTION get_filename(text, text) RETURNS VARCHAR AS $$
SELECT (regexp_split_to_array($1, $2))[array_upper(regexp_split_to_array($1, $2),1)];
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text, text) IS
'Extract filename part from path $1 using path separator $2';

CREATE OR REPLACE FUNCTION get_filename(text) RETURNS VARCHAR AS $$
SELECT get_filename($1, E'[/\\\\]');
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text) IS
E'Extract filename part from path $1 using path separator / or \\';

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt'),
(E'C:\\somewhere\\file2.txt'),
(E'/random/place/file2.txt'),
(E'/orphans/file3.blah');

COMMIT;

BEGIN;

SELECT get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dup_files;

-- This also creates a UNIQUE index on fn:
ALTER TABLE dup_files ADD CONSTRAINT dup_files_path_pkey PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
  ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path
ORDER BY get_filename(p1.path), p1.path, p2.path;

ROLLBACK;

-- dups_test2.sql: list each duplicated filename once, then every
-- location it appears in:

SELECT get_filename(path) AS fn, min(path) AS path, count(path) AS dupcount
INTO dups
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dups;

SELECT dups.fn, paths.path
FROM dups INNER JOIN paths ON (dups.fn = get_filename(paths.path));
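For reference, run against the sample rows inserted above, the final query
of dups_test2.sql produces one (filename, location) row per copy of each
duplicated name, along these lines (row order is not guaranteed, since the
query has no ORDER BY):

     fn      |          path
 ------------+-------------------------
  file1.txt  | C:\path\file1.txt
  file1.txt  | C:\path2\file1.txt
  file1.txt  | /path/file1.txt
  file2.txt  | C:\somewhere\file2.txt
  file2.txt  | /random/place/file2.txt

file3.blah appears only once, so the HAVING clause filters it out.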