Find duplicate files

Update: As Pádraig Brady, the fslint maintainer, pointed out, fslint/findup *is* a shell script.

My 500-GB Seagate FreeAgent Desktop is almost filled to the brim (there's *only* ~70 GB of free space left), so I need to find all the duplicate files for a clean-up.

Fortunately, there are tools to do just that. I tried fslint, which is also available in the Fedora repository, and I also found several nifty scripts on the web.
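For reference, fslint is mainly a GUI, but it ships its command-line scripts too. Assuming the stock package layout (the /usr/share/fslint/fslint/ path below is where the package puts its scripts, as far as I can tell, since they're not on the default PATH), findup can be run directly; the directory argument here is just a placeholder:

$ sudo yum install fslint
$ /usr/share/fslint/fslint/findup /media/freeagent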

I settled on a Perl script found on PerlMonks, which I modified a bit (used digest() instead of hexdigest(), and removed the calculation of duplicate file size).

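Something along these lines -- a minimal sketch in the same spirit, not the exact PerlMonks listing: group files by size first, then MD5 only the size collisions.

#!/usr/bin/perl
# Sketch: group files by size, then MD5 only the groups where
# sizes collide. digest() returns the binary digest (the tweak
# mentioned above), which works fine as a hash key.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %by_size;
find(sub {
    return unless -f $_;
    push @{ $by_size{ -s _ } }, $File::Find::name;
}, @ARGV ? @ARGV : '.');

my %by_digest;
for my $group (values %by_size) {
    next if @$group < 2;    # a unique size can't have a duplicate
    for my $file (@$group) {
        open my $fh, '<', $file or next;
        binmode $fh;
        push @{ $by_digest{ Digest::MD5->new->addfile($fh)->digest } }, $file;
        close $fh;
    }
}

for my $dupes (grep { @$_ > 1 } values %by_digest) {
    print join("\n", @$dupes), "\n\n";
}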


I'm a shell-script junkie, so I whipped up something in Bash. It's not as fast as the Perl implementation or fslint, but it does the job.

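Mine boiled down to something like this (again a sketch, not the exact script I ran): md5sum everything, sort by digest, and let awk print the runs of files that share a digest.

#!/bin/bash
# Sketch: hash every file, sort by digest, then have awk print
# each run of two or more files that share the same digest.
find "${1:-.}" -type f -exec md5sum {} + | sort | awk '
{
    md5  = $1
    name = $0
    sub(/^[^ ]+[ *]+/, "", name)      # strip the digest column
    if (md5 == prev) {
        if (!dup) print "\n" prevname # first file of a new group
        print name
        dup = 1
    } else {
        dup = 0
    }
    prev = md5
    prevname = name
}'

Hashing every file up front, instead of only the files that collide on size the way the Perl script does, is likely a big part of why the shell version clocks in slower.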


(Awk is pretty cool, isn't it?)

Of course, I tested all three on a directory with about 300 duplicate files; here are the results.

fslint/findup:

real    0m3.093s
user    0m1.812s
sys     0m0.368s

Perl:

real    0m4.668s
user    0m0.644s
sys     0m0.188s

Shell:

real    0m30.475s
user    0m1.842s
sys     0m1.692s

Okay, so the shell script's performance was abysmal, but hey, it's always reassuring to know that there's more than one way to do it. (Err... that's a Perl motto.)

Comments

  1. fslint/findup is shell script :)
    http://code.google.com/p/fslint/source/browse/trunk/fslint/findup

  2. WOW! Didn't realize that. Looks like I don't have to reinvent the wheel, then.

    Thanks for pointing this out. :)


