Unix shell script for removing duplicate files

by Jarno Elonen, 2003-04-06...2013-01-17

The following shell script (one-liner) finds duplicate (2 or more identical) files and outputs a new shell script containing commented-out rm statements for deleting them (copy-paste from here):

OUTF=rem-duplicates.sh; echo "#! /bin/sh" > $OUTF; find "$@" -type f -printf "%s\n" | sort -n | uniq -d | xargs -I@@ -n1 find "$@" -type f -size @@c -exec md5sum {} \; | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF

You then have to edit the file to select which files to keep - the script can't safely do it automatically!
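
If, after reviewing the list, you decide that keeping the first file of every group is good enough, the editing can be roughed out with awk. This is only a sketch: the function name keep_first is made up for this example, and it assumes the generated file has exactly the layout shown in the example output (groups of "#rm" lines separated by blank lines). It prints the edited script to standard output and deletes nothing by itself, so you can still check the result before running it.

```shell
# Hypothetical helper (not part of the original script):
# uncomment every "#rm" line except the first one in each group.
# Groups are separated by blank lines; all other lines pass through unchanged.
keep_first() {
    awk '
        /^#rm / { n++; if (n > 1) sub(/^#/, ""); print; next }
        /^$/    { n = 0 }
        { print }
    ' "$1"
}
```

Something like `keep_first rem-duplicates.sh > rem-reviewed.sh` (the output name is arbitrary) leaves the original list untouched.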

If you prefer a C program, try e.g. fdupes (it may also be available in the repository of your favorite Linux distribution). For a GUI-based solution, fslint might do it for you.

The code was written for Debian GNU/Linux and has been tested with Bash, Zsh and Dash. Needless to say, you are welcome to do whatever you like with it as long as you don't blame me for disasters... (released into the public domain)

Thanks to Leendert Meyer, Uriel, Patrick-Emil Zörner and several others for testing and improving the script.

The same script in a more readable form:

OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -printf "%s\n" | sort -n | uniq -d |
    xargs -I@@ -n1 find "$@" -type f -size @@c -exec md5sum {} \; |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF

...and a (somewhat out-of-date) variation for old uniq and sort versions:

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9])/\\\\\1/g";
SUM="dummy"; echo "#! /bin/sh" > $OUTF;
find -type f | sed -r "$ESCSTR" | while read x; do md5sum "$x"; done |
    sort -k 1,32 | uniq -w 32 -d --all-repeated |
    while read y; do
        NEW=`echo "$y-dummy" | sed "s/ .*$//"`;
        if [ $NEW != $SUM ]; then echo "" >> $OUTF; fi;
        SUM="$NEW";
        echo "$y" | sed -r "s/^[0-9a-f]*(\\ )*//;$ESCSTR;s/(.+)/#rm \1/" >> $OUTF;
    done;
chmod a+x $OUTF; ls -l $OUTF

Example output

#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./reppulilta/gdc2001/113-1303_IMG.JPG

#rm ./lissabon/01-01-2001/108-0883_IMG.JPG
#rm ./kuvat\ reppulilta/lissabon/01-01-2001/108-0883_IMG.JPG

#rm ./gdc2001/113-1328_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG
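
The backslash before the space in the output above comes from the sed stage of the script. To see what that stage does in isolation, you can feed it a single md5sum-style line. In this sketch the function name to_rm_line and the checksum are made up; only the sed expression is taken from the script.

```shell
# The sed expression from the script, wrapped in a function for experimenting:
#   1. strip the leading checksum and the spaces after it
#   2. backslash-escape everything except letters, digits and . / _ -
#   3. prefix the line with "#rm "
to_rm_line() {
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/'
}

# A fake md5sum output line (the 32 hex digits are NOT a real checksum).
printf '%s\n' "0123456789abcdef0123456789abcdef  ./kuvat reppulilta/gdc2001/113-1328_IMG.JPG" | to_rm_line
# prints: #rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG
```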

Explanation

  1. write output script header
  2. list the size of all files recursively under the given directories (the current directory if none are given)
  3. filter out unique file sizes (leaving only sizes that occur at least twice)
  4. find the filenames for files whose size occurs on the list
  5. calculate their MD5 sums
  6. find duplicate sums
  7. strip off MD5 sums and leave only file names
  8. escape strange characters from the filenames
  9. write out commented-out delete commands
  10. make the output script executable and ls -l it
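
Steps 2 to 6 can be tried out on a scratch directory before letting the script loose on real data. The following is a sketch only: the function name list_duplicates and the demo file names are made up, it needs the same GNU tools as the script itself (find -printf, md5sum, uniq -w), and it leaves out the -n1 option because xargs -I already runs the command once per input line.

```shell
# Hypothetical helper: print the groups of identical files under directory $1.
list_duplicates() {
    # Steps 2-3: file sizes that occur at least twice.
    find "$1" -type f -printf "%s\n" | sort -n | uniq -d |
    # Steps 4-5: checksum only the files that have one of those sizes.
    xargs -I@@ find "$1" -type f -size @@c -exec md5sum {} \; |
    # Step 6: keep the lines whose first 32 characters (the MD5 sum) repeat.
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate
}

demo=$(mktemp -d)
printf 'hello\n' > "$demo/a.txt"
printf 'hello\n' > "$demo/b.txt"   # identical to a.txt
printf 'world\n' > "$demo/c.txt"   # same size as a.txt, different content
printf 'x\n'     > "$demo/d.txt"   # unique size, so it is never checksummed
list_duplicates "$demo"
rm -rf "$demo"
```

This should print two lines, one for a.txt and one for b.txt, both starting with the same checksum. c.txt gets checksummed because its size matches, but it is dropped in step 6.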