Re: Using database to find file doublettes in my computer - Mailing list pgsql-general

From Eus
Subject Re: Using database to find file doublettes in my computer
Date
Msg-id 849157.43436.qm@web37603.mail.mud.yahoo.com
In response to Using database to find file doublettes in my computer  (Lothar Behrens <lothar.behrens@lollisoft.de>)
List pgsql-general
Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar.behrens@lollisoft.de> wrote:

> Hi,
>
> I have a problem: finding, as fast as possible, files that are
> duplicates, or in other words identical, and also identifying
> those files that are not identical.
>
> My approach was to use dir /s and an awk script to convert its
> output into an SQL script to be imported into a table. That
> done, I could start issuing queries.
>
> But how do I query for files to display a 'left / right view'
> for each file that exists in multiple places?
>
> I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but I have only this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
> The reason is that I am faced with ECAD projects that have been
> copied around many times, and I have to identify which files
> are missing here and which files are there.
>
> So a manual approach is as follows:
>
> 1)   Identify one file (schematic1.sch) and see where copies
>      of it are.
> 2)   Compare the files of both directories and make a decision
>      about which files to use further.
> 3)   Determine conflicts, i.e. files that can't be copied
>      together for a cleanup.
>
> Are there any approaches that could help?

I have also been in this kind of situation before, though I always work under GNU/Linux.

1. At that time, I used `md5sum' to generate a fingerprint for every file in the directory to be cleaned up (a sketch for loading that output into a table follows this list).

2. Later, I created a simple Java program to group the names of all files that had the same fingerprint (i.e., MD5 hash); the second sketch below does the same grouping in SQL.

3. I simply deleted all files that shared an MD5 hash, keeping only the one file with a good filename (in my case, the filenames couldn't be relied on for a comparison, since they differed by small additions like a date, the author's name, and the like); the third sketch below pairs each kept file with its copies.

4. After that, I used my brain to find related files based on their filenames (e.g., `[2006-05-23] Jeff - x.txt' should be the same as `Jenny - x.txt'). Of course, the Java program also helped me group the files that I thought were related; the fourth sketch below shows one way to do this in SQL.

5. Next, I perused the related files to see whether most of the contents were the same. If yes, I took the latest one based on the modified time (see the last sketch below).
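
Since this is pgsql-general, here are the sketches mentioned above, using PostgreSQL rather than my Java program. They are minimal sketches under table and column names of my own choosing, so adjust to taste. First, step 1: getting the md5sum output into a table. The positional split assumes md5sum's standard line format (a 32-character hash, two separator characters, then the file name):

  -- Staging table: one raw line of md5sum output per row
  CREATE TABLE raw_sums (line text);

  -- In psql:  \copy raw_sums FROM 'sums.txt'

  -- Positional split: characters 1-32 are the hash,
  -- the file name starts at character 35
  CREATE TABLE files AS
  SELECT substr(line, 1, 32) AS md5,
         substr(line, 35)    AS path
  FROM raw_sums;

Note that \copy's text format treats backslashes as escape characters, so Windows-style paths such as C:\some.txt would need extra care; forward-slash paths load as-is.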
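
Step 2, grouping identical fingerprints, is then a single query:

  -- Fingerprints that occur more than once, i.e. duplicate contents
  SELECT md5, count(*) AS copies
  FROM files
  GROUP BY md5
  HAVING count(*) > 1
  ORDER BY copies DESC;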
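
The same table also yields the 'left / right view' from the original post. Pick one file per fingerprint as the keeper (here simply the first path in sort order, which is an arbitrary choice of mine) and pair it with every other copy:

  -- One row per (keeper, copy) pair, mirroring the wanted
  -- "This File;Also here" listing
  SELECT l.path AS this_file, r.path AS also_here
  FROM files r
  JOIN (SELECT md5, min(path) AS path
        FROM files
        GROUP BY md5) l ON l.md5 = r.md5
  WHERE r.path <> l.path;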
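
Step 4 could be partly mechanized with the pg_trgm contrib module, if it is installed; the 0.5 similarity threshold is only a guess, and the pairwise comparison is brute force, so this is practical only for modest file counts:

  -- Pairs of files with different contents but similar names
  -- (requires the pg_trgm contrib module)
  SELECT a.path, b.path, similarity(a.path, b.path) AS sim
  FROM files a
  JOIN files b ON a.path < b.path AND a.md5 <> b.md5
  WHERE similarity(a.path, b.path) > 0.5
  ORDER BY sim DESC;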
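
And if the modification time is loaded as well (an mtime column that I am assuming here; the sketches above do not create it), the newest copy in each group, as used in step 5, falls out of a correlated subquery:

  -- Newest copy per fingerprint group, assuming an mtime column
  SELECT f.md5, f.path, f.mtime
  FROM files f
  WHERE f.mtime = (SELECT max(g.mtime)
                   FROM files g
                   WHERE g.md5 = f.md5);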

> This is a very time-consuming job and I am searching for any
> solution that helps me save time :-)

Well, I think it saved a lot of time: I was able to eliminate about 7,000 out of 15,000 files in roughly two weeks.

> I know that those problems would not arise if the projects were
> well structured and kept in a version management system. But
> that isn't the case here :-)

I hope you employ such a system ASAP :-)

> Thanks
>
> Lothar

Best regards,

Eus (FSF member #4445)

In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.

Join the free software movement today!
It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445
