Thread: Using a database to find duplicate files on my computer
Hi,

I have a problem: finding, as fast as possible, files that are duplicates
(in other words, identical), and also identifying those files that are not
identical.

My approach was to use dir /s and an awk script to convert its output into
an SQL script to be imported into a table. That done, I could start issuing
queries.

But how do I query for files to display a 'left / right view' for each file
that exists in multiple places? I mean this:

This File;Also here
C:\some.txt;C:\backup\some.txt
C:\some.txt;C:\backup1\some.txt
C:\some.txt;C:\backup2\some.txt

but I only have this list:

C:\some.txt
C:\backup\some.txt
C:\backup1\some.txt
C:\backup2\some.txt

The reason is that I am faced with ECAD projects that have been copied
around many times, and I have to identify which files are missing here and
which files are there.

So a manual approach is as follows:

1) Identify one file (schematic1.sch) and see where its copies are.
2) Compare the files of both directories and make a decision about which
   files to use further.
3) Determine conflicts, that is, files that can't be copied together in a
   cleanup.

Are there any approaches or tools that would help? This is a very time
consuming job and I am searching for any solution that saves me time :-)

I know that these problems would not arise if the projects were well
structured and kept in a version management system. But that isn't the
case here :-)

Thanks

Lothar
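The pipeline Lothar describes (dir /s piped through awk into SQL) implies a
table of full paths, roughly like the minimal sketch below. The table and
column names (paths, path) and the file name filelist.txt are illustrative
assumptions, not taken from his post:

CREATE TABLE paths (
    path TEXT PRIMARY KEY   -- full path as emitted by dir /s
);

-- The awk script would then emit INSERT statements such as:
INSERT INTO paths (path) VALUES (E'C:\\some.txt');
INSERT INTO paths (path) VALUES (E'C:\\backup\\some.txt');

-- or, alternatively, plain lines to be bulk-loaded from psql:
-- \copy paths (path) FROM 'filelist.txt'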
Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar.behrens@lollisoft.de> wrote:
> Hi,
>
> I have a problem: finding, as fast as possible, files that are
> duplicates (in other words, identical), and also identifying those
> files that are not identical.
>
> My approach was to use dir /s and an awk script to convert its output
> into an SQL script to be imported into a table. That done, I could
> start issuing queries.
>
> But how do I query for files to display a 'left / right view' for each
> file that exists in multiple places? I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but I only have this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
> The reason is that I am faced with ECAD projects that have been copied
> around many times, and I have to identify which files are missing here
> and which files are there.
>
> So a manual approach is as follows:
>
> 1) Identify one file (schematic1.sch) and see where its copies are.
> 2) Compare the files of both directories and make a decision about
>    which files to use further.
> 3) Determine conflicts, that is, files that can't be copied together
>    in a cleanup.
>
> Are there any approaches or tools that would help?

I also have been in this kind of circumstance before, but I work under
GNU/Linux as always.

1. At that time, I used `md5sum' to generate the fingerprint of all files
   in a given directory to be cleaned up.
2. Later, I created a simple Java program to group the names of all files
   that had the same fingerprint (i.e., MD5 hash). (The same grouping can
   be done in SQL; see the sketch after this message.)
3. I simply deleted all files sharing the same MD5 hash except the one
   with a good filename (in my case, the filename couldn't be relied on
   for comparison, since it differed by small additions like the date,
   the author's name, and the like).
4. After that, I used my brain to find related files based on the
   filenames (e.g., `[2006-05-23] Jeff - x.txt' should be the same as
   `Jenny - x.txt'). Of course, the Java program also helped me in
   grouping the files that I thought to be related.
5. Next, I perused the related files to see whether most of the contents
   were the same. If yes, I took the latest one based on the modified
   time.

> This is a very time consuming job and I am searching for any solution
> that saves me time :-)

Well, I think I saved a lot of time back then: I was able to eliminate
about 7,000 files out of 15,000 in about two weeks.

> I know that these problems would not arise if the projects were well
> structured and kept in a version management system. But that isn't the
> case here :-)

I hope you employ such a system ASAP :-)

> Thanks
>
> Lothar

Best regards,
Eus (FSF member #4445)
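Eus's grouping step, done in the database instead of a Java program, might
look like the sketch below; the table name files and its columns (path,
md5) are assumptions for illustration:

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    md5  TEXT NOT NULL   -- fingerprint, e.g. loaded from md5sum output
);

-- Contents are identical exactly when the fingerprints match:
SELECT md5, count(*) AS copies
FROM files
GROUP BY md5
HAVING count(*) > 1;

-- And to list every location of each duplicated fingerprint:
SELECT f.md5, f.path
FROM files f
WHERE f.md5 IN (SELECT md5 FROM files GROUP BY md5 HAVING count(*) > 1)
ORDER BY f.md5, f.path;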
Lothar Behrens wrote:
> But how do I query for files to display a 'left / right view' for each
> file that exists in multiple places?

One approach is to use a query to extract the names of all files with
duplicates, and store the results in a TEMPORARY table with a UNIQUE index
(or PRIMARY KEY) on the filename.

You then inner-join your paths table to itself on the filename part of the
path, keeping only rows whose filename EXISTS in the table of duplicated
filenames. You should also filter out rows where the first path is <= the
second path, to avoid outputting both (a,b) and (b,a) for each duplicate
and to avoid outputting (a,a) and (b,b) rows.

You can do much the same thing in a single query without the temp table,
but I think you'll need a correlated subquery to check for duplicates (or
a join on a subquery, which will really expand the results to be
processed), so the temp table approach is probably going to be lots
faster.

Quick example code (untested, but it should give you the idea); it assumes
you have a function get_filename(...) that extracts just the filename part
of the path:

CREATE TABLE paths (
    path TEXT PRIMARY KEY
    -- ... other fields ...
);

-- Populate paths.

-- Once paths is populated, extract duplicates:

SELECT get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

-- This also creates a UNIQUE index on fn:
ALTER TABLE dup_files ADD CONSTRAINT dup_files_pkey PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1
INNER JOIN paths p2
   ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path;

You can replace get_filename(...) with appropriate inline code, but I'd
write a quick SQL function marked IMMUTABLE to neatly wrap up the pathname
extraction instead (a sketch follows this message).

--
Craig Ringer
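Such an IMMUTABLE wrapper might look like the following sketch, which
treats both '/' and '\' as path separators; it is essentially the
regexp-based version Craig attaches later in this thread:

-- Return everything after the last '/' or '\' in a path:
CREATE OR REPLACE FUNCTION get_filename(text) RETURNS text AS $$
    SELECT (regexp_split_to_array($1, E'[/\\\\]'))
           [array_upper(regexp_split_to_array($1, E'[/\\\\]'), 1)];
$$ LANGUAGE SQL IMMUTABLE;

-- Usage:
SELECT get_filename(E'C:\\backup\\some.txt');  -- 'some.txt'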
On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:
> I have a problem: finding, as fast as possible, files that are
> duplicates (in other words, identical), and also identifying those
> files that are not identical.

I'd probably just take a simple Unix command line approach, something
like:

  find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

This will give you a list of files whose contents are identical
(according to an MD5 hash). An alternative would be to put the hashes
into a database and run the matching up there (a sketch follows this
message).

  Sam
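A sketch of that database variant: load the md5sum output into a table and
let SQL do the matching. The table name hashes and the file name
hashes.tsv are illustrative assumptions:

-- md5sum prints one '<32-hex-digit hash>  <path>' line per file.
CREATE TABLE hashes (
    md5  char(32) NOT NULL,
    path text     NOT NULL
);

-- Load the output, e.g. after converting the field separator to a tab:
-- \copy hashes (md5, path) FROM 'hashes.tsv'

-- Matching identical contents is then a self-join on the hash:
SELECT h1.path, h2.path
FROM hashes h1
JOIN hashes h2 ON h1.md5 = h2.md5 AND h1.path < h2.path;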
On Tue, Nov 18, 2008 at 12:36:42PM +0000, Sam Mason wrote:
> On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:
> > I have a problem: finding, as fast as possible, files that are
> > duplicates (in other words, identical), and also identifying those
> > files that are not identical.
>
> I'd probably just take a simple Unix command line approach, something
> like:
>
>   find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

You can save a little bit of time by using

  find /base/dir -type f -print0 | xargs -0 md5sum | sort | uniq -Dw 32

since xargs batches many files into each md5sum invocation instead of
forking one md5sum process per file.

> This will give you a list of files whose contents are identical
> (according to an MD5 hash). An alternative would be to put the hashes
> into a database and run the matching up there.
>
> Sam

Gerhard
On 18 Nov., 07:40, cr...@postnewspapers.com.au (Craig Ringer) wrote:
> -- Once paths is populated, extract duplicates:
>
> SELECT get_filename(path) AS fn, count(path) AS n
> INTO TEMPORARY TABLE dup_files
> FROM paths
> GROUP BY get_filename(path)
> HAVING count(path) > 1;
>
> -- This also creates a UNIQUE index on fn:
> ALTER TABLE dup_files ADD CONSTRAINT dup_files_pkey PRIMARY KEY (fn);
>
> -- Now build your side-by-side table of duplicates:
>
> SELECT p1.path, p2.path
> FROM paths p1
> INNER JOIN paths p2
>    ON (get_filename(p1.path) = get_filename(p2.path))
> WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
>   AND p1.path > p2.path;
>
> You can replace get_filename(...) with appropriate inline code, but I'd
> write a quick SQL function marked IMMUTABLE to neatly wrap up the
> pathname extraction instead.

Hi Craig,

I have done the steps as you described. I have about 14,000 files, each
with an md5sum. Based on the full filename I have updated the md5sum in my
base table 'ECADFiles'.

With the following query I see about 2,900 file names that occur multiple
times:

select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1

Using this query I see that these amount to 13,000 duplicate files in
total:

select sum(n) from (
  select "Name", count("Pfad") As n
  from "ECADFiles"
  Group by "Name"
  having count("Pfad") > 1) as temp

Using the following query I get ~129,000 records:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" > p2."Datei"

I had expected a smaller number of records, because for 4 files that each
exist 2 times (sum = 8) I have 8 records in ECADFiles, but should get 4 in
the above result. So for an average of 2 copies I expected half the files
from ECADFiles, because one is exactly on the right and the other is on
the left. In general this should give about the same number of records as
the 13,000 duplicated files, or fewer.

Why did I get these 129,000 records? I assume a rotation of files from the
left to the right, giving about n - 1 times too many records. So I tested
this:

select sum(n), sum(r) from (
  select "Name", count("Pfad") As n,
         count("Pfad") * (count("Pfad") - 1) As r
  from "ECADFiles"
  Group by "Name"
  having count("Pfad") > 1
) as temp

But I got 259,240. This is probably not correct either.

Testing with a fixed file on the left, I get all n occurrences of its n
copies. Assuming there are n - 1 too many, I got my expected result by
changing the last AND rule to its opposite:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where p1."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" > p2."Datei"

gives 7 records with 7 different right files and the 8th on the left.

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2 ON (p1."Name" = p2."Name")
where p2."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
and p1."Datei" < p2."Datei"

gives 7 records with 7 different left files and the 8th on the right.

Any ideas how to remove these unwanted records? This seems not to be easy,
because I see problems arising when changing the data a cursor runs over.
Thanks

Lothar
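The numbers Lothar reports are exactly what the p1 > p2 filter predicts: a
filename with n copies yields n * (n - 1) ordered pairs, and keeping only
the pairs with p1."Datei" > p2."Datei" halves that to n * (n - 1) / 2 per
filename. Summing over all duplicated names should therefore reproduce the
observed row count (259,240 / 2 = 129,620, i.e. the ~129,000 records
above), which can be checked against his table with a query like this
sketch:

-- Each name with n copies contributes n*(n-1)/2 rows to the self-join
-- restricted by p1."Datei" > p2."Datei":
select sum(n * (n - 1) / 2) as expected_pairs from (
  select count("Pfad") as n
  from "ECADFiles"
  group by "Name"
  having count("Pfad") > 1
) as temp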
Lothar Behrens wrote:
> I had expected a smaller number of records, because for 4 files that
> each exist 2 times (sum = 8) I have 8 records in ECADFiles, but should
> get 4 in the above result.
>
> So for an average of 2 copies I expected half the files from ECADFiles,
> because one is exactly on the right and the other is on the left.

It's a combinatorial problem. If you have 3 duplicate files, eg:

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');

then the query process I described above will output the matches:

 C:\path\file1.txt | C:\path2\file1.txt
 /path/file1.txt   | C:\path2\file1.txt
 /path/file1.txt   | C:\path\file1.txt

because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:

(A,B)
(A,C)
(B,C)

I've attached test SQL that does the above.

Presumably you want to only show, say:

(A,B)
(A,C)

or maybe:

(filename, A)
(filename, B)
(filename, C)

If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates, then listing all the locations
each appears in. Again, you can do it without the temp table, it'll
probably just be slower.

None of what I've written is particularly fast anyway - it evaluates
those regular expressions many more times than should be required, for
example.

--
Craig Ringer

BEGIN;

CREATE TABLE paths (
    path TEXT PRIMARY KEY
);

CREATE OR REPLACE FUNCTION get_filename(text, text) RETURNS VARCHAR AS $$
SELECT (regexp_split_to_array($1, $2))[array_upper(regexp_split_to_array($1, $2),1)];
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text, text) IS
'Extract filename part from path $1 using path separator $2';

CREATE OR REPLACE FUNCTION get_filename(text) RETURNS VARCHAR AS $$
SELECT get_filename($1, E'[/\\\\]');
$$ LANGUAGE SQL IMMUTABLE;

COMMENT ON FUNCTION get_filename(text) IS
E'Extract filename part from path $1 using path separator / or \\';

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt'),
(E'C:\\somewhere\\file2.txt'),
(E'/random/place/file2.txt'),
(E'/orphans/file3.blah');

COMMIT;

BEGIN;

SELECT get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dup_files;

-- This also creates a UNIQUE index on fn:
ALTER TABLE dup_files ADD CONSTRAINT dup_files_path_pkey PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
  ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS (SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
  AND p1.path > p2.path
ORDER BY get_filename(p1.path), p1.path, p2.path;

ROLLBACK;

-- dups_test2.sql: list each duplicated filename once, then every
-- location it appears in:

SELECT get_filename(path) AS fn, min(path) AS path, count(path) AS dupcount
INTO dups
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

SELECT * FROM dups;

SELECT dups.fn, paths.path
FROM dups INNER JOIN paths ON (dups.fn = get_filename(paths.path));
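For reference, run against the sample rows inserted above, the final query
of dups_test2.sql produces one (filename, location) row per copy of each
duplicated name, along these lines (row order is not guaranteed, since the
query has no ORDER BY):

     fn      |          path
 ------------+-------------------------
  file1.txt  | C:\path\file1.txt
  file1.txt  | C:\path2\file1.txt
  file1.txt  | /path/file1.txt
  file2.txt  | C:\somewhere\file2.txt
  file2.txt  | /random/place/file2.txt

file3.blah appears only once, so the HAVING clause filters it out.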