Problem finding duplicate files with find
Posted: Thu 14 Mar 2013, 02:01
by R-S-H
Hi.
I have a very special problem (to me) when using find.
What I want to do is search one directory and, for every file found, check whether it exists in another directory. The problem is that the directory to be searched contains some very long paths.
Example:
Directory to search in:
/tmp/tmp.6ueJftvZHZ/working_tree
Longest path:
/tmp/tmp.6ueJftvZHZ/working_tree/usr/share/applications-desktop-files/applications
To find the files I do use this code:
Code:
Path="/tmp/tmp.6ueJftvZHZ/working_tree"
plen="`echo ${#Path}`"
rm /root/files.txt
Files=`find $Path -maxdepth 8 -type f`
echo "$Files" | while read F
do
  echo "$F" >> /root/files.txt
  if [ "$F" != "" ]; then
    if [ "$F" != "$Path/" ]; then
      nlen="`echo ${#F}`"
      findname=${F:plen:nlen-plen}
      if [ -f $findname ]; then
        echo $F
        #rm $F
      fi
    fi
  fi
done
Using -maxdepth 8 returns an error message like "list of arguments too long" or "too many arguments" (or similar). Using -maxdepth 4 gives no error message but does not find all the files. Using 5 or 6 for -maxdepth gives a mix of both results.
The paths are too long, aren't they?
How can I search through all directories?
Thanks
RSH
Posted: Thu 14 Mar 2013, 11:52
by SFR
Hey Rainer
R-S-H wrote: list of arguments too long or too many arguments (or similar)
Which line does this error refer to?
I only got "too many arguments" when there were spaces in some filenames.
Adding double quotes fixed it.
R-S-H wrote: How can I get searching through all directories?
Simply - don't use 'maxdepth' at all.
'find' acts recursively by default.
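Putting both fixes together - no -maxdepth, and double quotes around every expansion - the loop from the first post could be sketched like this. check_tree and its two arguments are illustrative names, not from the thread:

```shell
# A sketch, assuming the goal from the first post: for every file
# under tree $1, print it if the same relative path exists under $2.
# Quoting every expansion keeps paths with spaces intact, and find
# recurses to any depth without -maxdepth.
check_tree() {
  src=$1 dst=$2
  find "$src" -type f | while IFS= read -r f; do
    rel=${f#"$src"}                     # strip the tree prefix
    [ -f "$dst$rel" ] && printf '%s\n' "$f"
  done
}
```

E.g. `check_tree /tmp/tmp.6ueJftvZHZ/working_tree /` would list every file in the working tree that also exists on the root filesystem.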
Greetings!
Posted: Thu 14 Mar 2013, 13:08
by amigo
First of all, why reinvent the wheel? There are already tools out there to find and deal with duplicate files:
https://duckduckgo.com/?q=linux+find+duplicate+file
Second, don't use the shell '${#VAR}' construct to determine file size, for two reasons: 1. It is not accurate - it counts characters, not bytes. 2. It will be slower than using a normal tool, since the whole file must be read in to count the chars. Use another tool to determine file size. Although people commonly use 'ls -l' or 'du', both have disadvantages: 'ls' because its output must be parsed, 'du' because it is not particularly accurate. The best way to get a file's size is 'stat -c %s file-name'.
Still, comparing file size will not really tell you whether two files are the same. It would be better to use md5sum to compare them - even if they are the same size, a hash of each file will show whether they are really identical.
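For instance, a hash-based duplicate listing can be sketched as one pipeline (assuming GNU md5sum and uniq; list_dupes is an illustrative name):

```shell
# Hash every file, sort so identical hashes become adjacent, and let
# uniq print only the repeated groups: -w32 compares just the 32-char
# md5 field, --all-repeated=separate blank-line-separates each group.
list_dupes() {
  find "${1:-.}" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
}
```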
That said, I put together a, hopefully, helpful example script:
Code:
#!/bin/bash
WD=${1:-/tmp/tmp.6ueJftvZHZ/working_tree}
OUT=/tmp/output-file
:> "$OUT"
find "$WD" -type f | while IFS= read -r FILE ; do
  # '${FILE##*/}' is equal to 'basename $FILE' but much faster in this context
  OLD_NAME=$(grep "/${FILE##*/}$" "$OUT")
  #echo $OLD_NAME
  if [[ $OLD_NAME ]] ; then
    # file is already in the output, check to see if it's the same
    # (find already emits full paths, so no $WD/ prefix is needed here)
    OLD_SIZE=$(stat -c %s "$OLD_NAME")
    NEW_SIZE=$(stat -c %s "$FILE")
    # actually, using an md5sum of each file would be more accurate
    if [[ $OLD_SIZE -ne $NEW_SIZE ]] ; then
      echo "files are unequal"
    else
      echo "files are equal"
      # rm -f "$FILE"
    fi
  else
    echo "$FILE" >> "$OUT"
  fi
done
Posted: Thu 14 Mar 2013, 22:51
by seaside
amigo,
Nice script.
I was wondering if something like this would be ok as well-
Regards,
s
Posted: Fri 15 Mar 2013, 11:09
by amigo
No, don't use cmp. It will only work on text files anyway as it compares line-by-line. What's wrong with doing it the Right Way? Using stat, ls -l or du will get the information from the file's header or from the filesystem metadata without having to read the whole file. I know where you got the idea of counting the chars ('${#VAR}'), but honestly, that is the worst idea of all. It's meant for getting the number of positional parameters or the length of a string, not for determining file size.
There are still other ways to do it incorrectly if you simply want to exercise your mind. The programs that 'do this for a living' usually check the file size first, as that is fastest. If the files *are* the same size, they then read an arbitrary number of bytes from the start of each file and compare those. They may then either assume that the files are identical or go on to compute a hash of each and compare those - that's what md5sum provides, so that's why I suggested it.
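That staged strategy could be sketched like this (same_file is an illustrative name; GNU stat's -c %s is assumed, as suggested above):

```shell
# Compare two files in stages: size first (cheapest), then the first
# 4 KiB, and a full md5 only when the earlier stages already match.
same_file() {
  [ "$(stat -c %s "$1")" = "$(stat -c %s "$2")" ] || return 1
  [ "$(head -c 4096 "$1" | md5sum)" = "$(head -c 4096 "$2" | md5sum)" ] || return 1
  [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}
```

Usage: `same_file a b && echo "files are equal"`.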
Posted: Fri 15 Mar 2013, 13:07
by R-S-H
Hi.
Thanks to all for the help. I was busy, so I could not test any of this yet, but I will do it later tonight.
RSH
Posted: Fri 15 Mar 2013, 16:38
by seaside
amigo wrote: No, don't use cmp. It will only work on text files anyway as it compares line-by-line.
Amigo,
There may be reasons not to use "cmp" in this context. However, according to cmp's help, it does work on binary files, as it compares byte by byte.
Code:
# cmp --help
Usage: cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]]
Compare two files byte by byte.
-b --print-bytes Print differing bytes.
-i SKIP --ignore-initial=SKIP Skip the first SKIP bytes of input.
-i SKIP1:SKIP2 --ignore-initial=SKIP1:SKIP2
Skip the first SKIP1 bytes of FILE1 and the first SKIP2 bytes of FILE2.
-l --verbose Output byte numbers and values of all differing bytes.
-n LIMIT --bytes=LIMIT Compare at most LIMIT bytes.
-s --quiet --silent Output nothing; yield exit status only.
-v --version Output version info.
--help Output this help.
A couple of tests I just ran with binary file comparisons appeared accurate.
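For what it's worth, a byte-wise check with cmp's silent mode could look like this (identical is an illustrative name; cmp -s yields only an exit status: 0 when the files match, 1 when they differ):

```shell
# Return success iff the two files have exactly the same bytes.
identical() { cmp -s -- "$1" "$2"; }
```

Usage: `identical file1 file2 && echo "same content"`.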
Regards,
s
Posted: Fri 15 Mar 2013, 18:45
by amigo
Maybe I was thinking of 'comm'. You *could* also use diff or others. I'm pretty sure hashing would be faster than comparing bit-by-bit, char-by-char or line-by-line. The file type(s) in question would influence the best way, I guess. If there are many, many files to be examined, then every effort to go faster counts. Just test the difference between using 'stat -c %s filename' and parsing the output of 'ls -l filename' by any method. Probably something like:
Code:
ls -l |grep... |sed...|cat...|rev|twist|rev|sort|uniq
LOL
Posted: Fri 15 Mar 2013, 22:27
by Karl Godt
I had similar ideas about letting the installer create an md5sum.lst, to make it easier for people to remember what they have installed over. This is just a thought for now.
But I hacked a snippet together for comparing /usr/* and /usr/local/*:
Code:
VERBOSE='' # set VERBOSE=Y if you want more output
PATH1=/usr/bin
PATH2=/usr/local/bin
rm -f program.lst dupes.lst
find $PATH1 -type f -exec md5sum {} \; >>program.lst
find $PATH2 -type f -exec md5sum {} \; >>program.lst
LINES_TOTAL=`wc -l program.lst |awk '{print $1}'`
C=0
echo -n " Processing of $LINES_TOTAL"
while read line; do
  [ "$line" ] || continue # in case of an empty line
  MD5="${line%% *}"
  FIL="${line##*/}"
  C=$((C+1))
  echo -en "\r$C"
  [ "$VERBOSE" ] && echo "$MD5 is $FIL"
  [ "`grep '/'"${FIL}$" program.lst | wc -l`" -gt 1 ] && {
    echo -e "\nMore than one of \"$FIL\" in \"$PATH1\" and \"$PATH2\":"
    grep '/'"${FIL}$" program.lst >>dupes.lst
  }
  [ "`grep -w "^${MD5}" program.lst | wc -l`" -gt 1 ] && {
    echo -e "\nMore than one of \"$MD5\" in \"$PATH1\" and \"$PATH2\":"
    grep -w "^${MD5}" program.lst >>dupes.lst
  }
  sleep 0.1
done <program.lst
echo

remove_dupes() {
  while read line; do
    [ "$line" ] || continue # in case of an empty line
    MD5="${line%% *}"
    FIL="${line##*/}"
    CHOICES=`grep -w "^${MD5}" dupes.lst |sort -u |tr ' ' ':'`
    CHOICES="SKIP__THIS
$CHOICES"
    select choice in $CHOICES; do
      echo "$choice"
      case $choice in
        SKIP__THIS) echo "Skipping";;
        *) echo "Removing ${choice##*:} ..."
           #rm -f "${choice##*:}"
           ;;
      esac
      break
    done
  done <dupes.lst
}
#remove_dupes ## not tested yet
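The md5sum.lst idea above could pair with md5sum -c, which re-hashes each listed file and reports mismatches via its exit status. A minimal sketch, assuming GNU md5sum; write_manifest and check_manifest are illustrative names:

```shell
# Write a hash manifest for a tree at install time, verify it later.
# --quiet suppresses the per-file "OK" lines; only failures are shown.
write_manifest() { find "$1" -type f -exec md5sum {} + > "$2"; }
check_manifest() { md5sum -c --quiet "$1"; }
```

E.g. `write_manifest /usr/bin /root/md5sum.lst` at install time, then `check_manifest /root/md5sum.lst` later to see which files have been overwritten.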