Problem finding duplicate files with find

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
R-S-H
Posts: 487
Joined: Mon 18 Feb 2013, 12:47

Problem finding duplicate files with find

#1 Post by R-S-H »

Hi.

I have a rather special problem (at least, special to me) when using find.

What I want to do is search one directory and, for every file found, check whether it also exists in another directory. The problem is that the directory to search in contains some very long paths.

Example:

Directory to search in: /tmp/tmp.6ueJftvZHZ/working_tree

Longest path: /tmp/tmp.6ueJftvZHZ/working_tree/usr/share/applications-desktop-files/applications

To find the files I do use this code:

Code: Select all

Path="/tmp/tmp.6ueJftvZHZ/working_tree"
plen="`echo ${#Path}`"

rm /root/files.txt
Files=`find $Path -maxdepth 8 -type f`
echo "$Files" |while read F
do
	echo "$F" >> /root/files.txt
	if [ "$F" != "" ]; then
		if [ "$F" != "$Path/" ]; then
			nlen="`echo ${#F}`"
			findname=${F:plen:nlen-plen}
			if [ -f $findname ]; then
				echo $F
				#rm $F
			fi
		fi
	fi
done
Using -maxdepth 8 returns an error message like 'list of arguments too long' or 'too many arguments' (or similar). Using -maxdepth 4 does not find all files, but gives no error message. Using 5 or 6 for -maxdepth gives a mix of both results.

The paths are too long, aren't they?

How can I get the search to go through all of the directories?

Thanks

RSH
[b][url=http://lazy-puppy.weebly.com]LazY Puppy Home
The new LazY Puppy Information Centre[/url][/b]

User avatar
SFR
Posts: 1800
Joined: Wed 26 Oct 2011, 21:52

#2 Post by SFR »

Hey Rainer
R-S-H wrote: 'list of arguments too long' or 'too many arguments' (or similar)
Which line does this error refer to?
I only got "too many arguments" when there were spaces in some filenames.
Double quotes in

Code: Select all

if [ -f "$findname" ]; then
have fixed it.
R-S-H wrote: How can I get the search to go through all of the directories?
Simply - don't use 'maxdepth' at all.
'find' acts recursively by default.
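
Putting both fixes together, a minimal (untested) sketch of the loop, assuming the same $Path as in your post, might look like this:

Code: Select all

Path="/tmp/tmp.6ueJftvZHZ/working_tree"

find "$Path" -type f | while IFS= read -r F
do
	# strip the working-tree prefix, leaving e.g. /usr/share/...
	findname="${F#$Path}"
	# quoted test, so spaces in filenames don't break it
	if [ -f "$findname" ]; then
		echo "$F"	# the file also exists outside the working tree
		#rm "$F"
	fi
done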

Greetings!
[color=red][size=75][O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource[/size][/color]
[b][color=green]Omnia mea mecum porto.[/color][/b]

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#3 Post by amigo »

First of all, why reinvent the wheel? There are already tools out there to find and deal with duplicate files:
https://duckduckgo.com/?q=linux+find+duplicate+file

Second, don't use the shell '${#VAR}' construct to determine file size, for two reasons: 1. It is not accurate - it counts characters, not bytes. 2. It will be slower than using a normal tool, as it must read in the whole file to count the chars. Use another tool to determine file size. Although people commonly use 'ls -l' or 'du', both have disadvantages: 'ls' because the output must be parsed, 'du' because it is not particularly accurate. The best way to compare file sizes is to use 'stat -c %s file-name'.
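
For instance (the file names here are only placeholders):

Code: Select all

# two hypothetical files, just to show the call
SIZE1=$(stat -c %s /path/to/file1)
SIZE2=$(stat -c %s /path/to/file2)
[ "$SIZE1" -eq "$SIZE2" ] && echo "same size - but not necessarily same content"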

Still, comparing file sizes will not really tell you if the files are the same. It would be better to use md5sum to compare them: even if they are the same size, a hash of each file will show whether they are really the same.

That said, I put together a (hopefully) helpful example script:

Code: Select all

#!/bin/bash

# working tree to scan; defaults to the example path, override with $1
WD=${1:-/tmp/tmp.6ueJftvZHZ/working_tree}
OUT=/tmp/output-file
:> "$OUT"
# pipe find into the loop so filenames with spaces survive
find "$WD" -type f | while IFS= read -r FILE ; do
	# '${FILE##*/}' is equal to 'basename $FILE' but much faster in this context
	# (the basename is used as a regex here, which is fine for ordinary names)
	OLD_NAME=$(grep "/${FILE##*/}$" "$OUT")
	#echo "$OLD_NAME"
	if [[ $OLD_NAME ]] ; then
		# the name is already in the output, check to see if it's the same file
		# (find prints full paths, so no extra $WD/ prefix is needed here)
		OLD_SIZE=$(stat -c %s "$OLD_NAME")
		NEW_SIZE=$(stat -c %s "$FILE")
		# actually, using an md5sum of each file would be more accurate
		if [[ $OLD_SIZE -ne $NEW_SIZE ]] ; then
			echo "files are unequal"
		else
			echo "files are equal"
			# rm -f "$FILE"
		fi
	else
		echo "$FILE" >> "$OUT"
	fi
done

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#4 Post by seaside »

amigo,

Nice script.

I was wondering if something like this would be OK as well:

Code: Select all

cmp -s "$OLD_NAME" "$FILE"
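
For instance, as a rough sketch, the exit status could replace the size check in the loop above:

Code: Select all

# 'cmp -s' prints nothing; it only sets the exit status
if cmp -s "$OLD_NAME" "$FILE" ; then
	echo "files are equal"
	# rm -f "$FILE"
else
	echo "files are unequal"
fi
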
Regards,
s

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#5 Post by amigo »

No, don't use cmp. It will only work on text files anyway, as it compares line-by-line. What's wrong with doing it the right way? Using stat, 'ls -l' or 'du' will get the information from the header of the file or from the file system metadata, without having to read the whole file. I know where you got the idea of counting the chars ('${#VAR}'), but honestly, that is the worst idea of all. It's meant for counting positional parameters or the length of a string, not for determining file size.

There are still other ways to do it incorrectly, if you simply want to exercise your mind. The programs that 'do this for a living' usually check the file size first, as that is the fastest test. If the files *are* the same size, they read an arbitrary number of bytes from the beginning of each file and compare those. They may then either assume that the files are identical, or go on to make a hash of each and compare that. That is what md5sum provides, which is why I suggested it.
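
As an illustration only (not a finished tool), a minimal sketch of that size-first, hash-second strategy for two files:

Code: Select all

# hypothetical helper: succeeds if the two files appear identical
same_file() {
	# cheapest test first: compare the sizes
	[ "$(stat -c %s "$1")" -eq "$(stat -c %s "$2")" ] || return 1
	# sizes match, so compare hashes of the full contents
	[ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

same_file /path/to/a /path/to/b && echo "duplicates"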

R-S-H
Posts: 487
Joined: Mon 18 Feb 2013, 12:47

#6 Post by R-S-H »

Hi.

Thanks to all for the help. I was busy, so I haven't been able to test any of this yet, but I will do it later tonight.

RSH
[b][url=http://lazy-puppy.weebly.com]LazY Puppy Home
The new LazY Puppy Information Centre[/url][/b]

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#7 Post by seaside »

amigo wrote: No, don't use cmp. It will only work on text files anyway, as it compares line-by-line.
Amigo,

There may be reasons not to use "cmp" in this context. However, according to cmp's help, it does work on binary files, as it compares byte by byte.

Code: Select all

# cmp --help
Usage: cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]]
Compare two files byte by byte.

  -b  --print-bytes  Print differing bytes.
  -i SKIP  --ignore-initial=SKIP  Skip the first SKIP bytes of input.
  -i SKIP1:SKIP2  --ignore-initial=SKIP1:SKIP2
    Skip the first SKIP1 bytes of FILE1 and the first SKIP2 bytes of FILE2.
  -l  --verbose  Output byte numbers and values of all differing bytes.
  -n LIMIT  --bytes=LIMIT  Compare at most LIMIT bytes.
  -s  --quiet  --silent  Output nothing; yield exit status only.
  -v  --version  Output version info.
  --help  Output this help.
A couple of tests I just ran with binary file comparisons appeared accurate.
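
For example, a quick test along these lines (the copy is just an arbitrary binary):

Code: Select all

# compare two binaries byte by byte; -s yields only an exit status
cp /bin/ls /tmp/ls.copy
cmp -s /bin/ls /tmp/ls.copy && echo "identical"
cmp -s /bin/ls /bin/cat || echo "different"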

Regards,
s

amigo
Posts: 2629
Joined: Mon 02 Apr 2007, 06:52

#8 Post by amigo »

Maybe I was thinking of 'comm'. You *could* also use diff or others. I'm pretty sure hashing would be faster than comparing bit-by-bit, char-by-char or line-by-line. The file type(s) in question would have an influence on the best way, I guess. If there are many, many files to be examined, then every effort to go faster will count. Just test the difference between using 'stat -c %s filename' and parsing the output of 'ls -l filename' by any method. Probably something like:

Code: Select all

ls -l |grep... |sed...|cat...|rev|twist|rev|sort|uniq
LOL
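
More seriously, a rough sketch of such a timing test (using /usr/bin as an arbitrary sample):

Code: Select all

# compare the two approaches over the same set of files
time ( for f in /usr/bin/* ; do stat -c %s "$f" ; done > /dev/null )
time ( for f in /usr/bin/* ; do ls -l "$f" | awk '{print $5}' ; done > /dev/null )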

User avatar
Karl Godt
Posts: 4199
Joined: Sun 20 Jun 2010, 13:52
Location: Kiel,Germany

#9 Post by Karl Godt »

I had similar ideas about letting the installer create an md5sum.lst, to make it easier for people to remember what they have over-installed. This is just a thought for now.
But I hacked a snippet together for comparing /usr/* and /usr/local/*:

Code: Select all

VERBOSE=''; #set VERBOSE=Y if want more output
PATH1=/usr/bin
PATH2=/usr/local/bin
rm -f program.lst dupes.lst
find $PATH1 -type f -exec md5sum {} \; >>program.lst
find $PATH2 -type f -exec md5sum {} \; >>program.lst
LINES_TOTAL=`wc -l program.lst |awk '{print $1}'`
C=0
# the leading spaces leave room for the counter printed with \r below
echo -n "     Processing of $LINES_TOTAL"
while read line; do
[ "$line" ] || continue  #in case an empty line
MD5="${line%% *}"
FIL="${line##*/}"
C=$((C+1))
echo -en "\r$C"
[ "$VERBOSE" ] && echo "$MD5 is $FIL"
[ "`grep '/'"${FIL}$" program.lst | wc -l`" -gt 1 ] && { echo -e "\nMore than one of \"$FIL\" in \"$PATH1\" and \"$PATH2\":";grep '/'"${FIL}$" program.lst >>dupes.lst; }
[ "`grep -w "^${MD5}" program.lst | wc -l`" -gt 1 ] && { echo -e "\nMore than one of \"$MD5\" in \"$PATH1\" and \"$PATH2\":";grep -w "^${MD5}" program.lst >>dupes.lst; }
sleep 0.1
done<program.lst
echo

remove_dupes(){
while read line; do
[ "$line" ] || continue  #in case an empty line
MD5="${line%% *}"
FIL="${line##*/}"
CHOICES=`grep -w "^${MD5}" dupes.lst |sort -u |tr ' ' ':'`
CHOICES="SKIP__THIS
$CHOICES"
select choice in $CHOICES
  do
    echo "$choice"
    case $choice in
    SKIP__THIS) echo "Skipping";;
    *) echo "Removing ${choice##*:} ..."
    #rm -f "${choice##*:}"
    ;;
    esac
    break
  done
done<dupes.lst
}
#remove_dupes ##not tested yet
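
If anyone wants to try it: assuming the snippet is saved as, say, md5-dupes.sh (the name is made up), and with the 'rm -f' left commented out, a safe first run could be:

Code: Select all

sh md5-dupes.sh      # writes program.lst and dupes.lst, removes nothing
sort -u dupes.lst    # review the duplicate candidates first
# only then uncomment 'remove_dupes' (and its 'rm -f') in the script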
