Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
Problem to find duplicate files with find
R-S-H

Joined: 18 Feb 2013
Posts: 490

Posted: Wed 13 Mar 2013, 22:01    Post subject: Problem to find duplicate files with find
Subject description: Any other solution?
 

Hi.

I have a rather unusual (to me) problem when using find.

What I want to do is search one directory and, for every file found, check whether it also exists in another directory. The problem is that the directory to be searched contains some very long paths.

Example:

Directory to search in: /tmp/tmp.6ueJftvZHZ/working_tree

Longest path: /tmp/tmp.6ueJftvZHZ/working_tree/usr/share/applications-desktop-files/applications

To find the files I do use this code:
Code:
Path="/tmp/tmp.6ueJftvZHZ/working_tree"
plen="`echo ${#Path}`"

rm /root/files.txt
Files=`find $Path -maxdepth 8 -type f`
echo "$Files" |while read F
do
   echo "$F" >> /root/files.txt
   if [ "$F" != "" ]; then
      if [ "$F" != "$Path/" ]; then
         nlen="`echo ${#F}`"
         findname=${F:plen:nlen-plen}
         if [ -f $findname ]; then
            echo $F
            #rm $F
         fi
      fi
   fi
done

Using -maxdepth 8 returns an error message like "argument list too long" or "too many arguments" (or similar). Using -maxdepth 4 gives no error message, but does not find all files. Using 5 or 6 for -maxdepth gives a mix of both results.

The paths are too long, aren't they?

How can I make it search through all directories?

Thanks

RSH

_________________
LazY Puppy Home
The new LazY Puppy Information Centre

SFR


Joined: 26 Oct 2011
Posts: 1037

Posted: Thu 14 Mar 2013, 07:52

Hey Rainer

Quote:
list of arguments too long or too many arguments (or similar)

Which line does this error refer to?
I only got "too many arguments" when there were spaces in some filenames.
Double quotes in
Code:
if [ -f "$findname" ]; then

have fixed it.

Quote:
How can I get searching through all directories?

Simply - don't use 'maxdepth' at all.
'find' acts recursively by default.
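Combining both fixes (quoting every expansion, and dropping -maxdepth entirely), a minimal sketch over a throwaway /tmp tree; the paths here are made up for illustration:

```shell
# Build a small test tree containing a space in the path (example location).
Path="/tmp/quote_demo/working_tree"
mkdir -p "$Path/sub dir"
echo hi > "$Path/sub dir/a file.txt"

# No -maxdepth: find recurses fully by default.
# Quoting "$F" keeps filenames with spaces in one piece.
find "$Path" -type f | while read -r F; do
    if [ -f "$F" ]; then
        echo "found: $F"
    fi
done
```

Without the quotes around "$F", the same loop would split "sub dir/a file.txt" into several words and the -f test would fail.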

Greetings!

_________________
[O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource
Omnia mea mecum porto.
amigo

Joined: 02 Apr 2007
Posts: 2217

Posted: Thu 14 Mar 2013, 09:08

First of all, why reinvent the wheel? There are already tools out there to find and deal with duplicate files:
https://duckduckgo.com/?q=linux+find+duplicate+file

Second, don't use the shell '${#VAR}' construct to determine file size, for two reasons: 1. It is not accurate - it counts characters, not bytes. 2. It is slower than using a normal tool, since the whole file must be read in just to count the characters. Use another tool to determine file size. Although people commonly use 'ls -l' or 'du', both have disadvantages: 'ls' because its output must be parsed, 'du' because it is not particularly accurate. The best way to get a file's size is 'stat -c %s file-name'.
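To illustrate the point: 'stat -c %s' reads the size from the file system metadata, so the file contents are never opened (the file name below is a throwaway example):

```shell
# Write exactly 5 bytes, then ask stat for the size in bytes.
printf '12345' > /tmp/size_demo
stat -c %s /tmp/size_demo    # prints 5
```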

Still, comparing file size will not really tell you if they are the same. It would be better to use md5sum to compare them -even if they are the same size, a 'hash' of each file will show if they are really the same.
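A minimal sketch of that size-plus-hash check; the two file names are throwaway examples:

```shell
# Two files with identical contents, for demonstration.
a=/tmp/dup_a; b=/tmp/dup_b
printf 'same bytes' > "$a"
printf 'same bytes' > "$b"

# Cheap size check first, then md5sum only if the sizes match.
if [ "$(stat -c %s "$a")" -eq "$(stat -c %s "$b")" ] &&
   [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ]; then
    echo "duplicate"
else
    echo "different"
fi
```

Feeding the file via stdin (md5sum < "$a") keeps the file name out of md5sum's output, so the two hash strings compare cleanly.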

That said, I put together a, hopefully, helpful example script:
Code:
#!/bin/bash

# use the first argument as the tree to scan, with a default
WD=${1:-/tmp/tmp.6ueJftvZHZ/working_tree}
OUT=/tmp/output-file
:> "$OUT"
find "$WD" -type f | while read -r FILE ; do
   # '${FILE##*/}' is equal to 'basename $FILE' but much faster in this context
   OLD_NAME=$(grep "/${FILE##*/}\$" "$OUT")
   if [[ $OLD_NAME ]] ; then
      # name is already in the output; it was stored as a full path,
      # so no $WD prefix is needed here
      OLD_SIZE=$(stat -c %s "$OLD_NAME")
      NEW_SIZE=$(stat -c %s "$FILE")
      # actually, using an md5sum of each file would be more accurate
      if [[ $OLD_SIZE -ne $NEW_SIZE ]] ; then
         echo "files are unequal"
      else
         echo "files are equal"
         # rm -f "$FILE"
      fi
   else
      echo "$FILE" >> "$OUT"
   fi
done
seaside

Joined: 11 Apr 2007
Posts: 886

Posted: Thu 14 Mar 2013, 18:51

amigo,

Nice script.

I was wondering if something like this would be ok as well-
Code:
 cmp -s "$OLD_NAME" "$FILE"

Regards,
s
amigo

Joined: 02 Apr 2007
Posts: 2217

Posted: Fri 15 Mar 2013, 07:09

No, don't use cmp. It will only work on text files anyway, as it compares line-by-line. What's wrong with doing it the Right Way? Using stat, 'ls -l' or 'du' will get the information from the file system metadata without having to read the whole file. I know where you got the idea of counting the chars ('${#VAR}'), but honestly, that is the worst idea of all. It's meant for counting positional parameters or the length of a string, not for determining file size.
There are still other ways to do it incorrectly if you are simply wanting to exercise your mind. The programs that 'do this for a living' usually check the file size first, as that is the fastest test. If the files *are* the same size, they then read an arbitrary number of bytes from the beginning of each file and compare those. They may then either assume the files are identical, or go on to make a hash of each and compare that - that's what md5sum does, so that's why I suggested it.
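That staged strategy (size first, then a short probe, then a full hash) can be sketched as a small function; same_file, the 512-byte probe size, and the file names are all made up for illustration:

```shell
# same_file returns 0 when the two files appear identical.
same_file() {
    # stage 1: byte sizes from metadata, no file reads yet
    [ "$(stat -c %s "$1")" -eq "$(stat -c %s "$2")" ] || return 1
    # stage 2: probe the first 512 bytes (arbitrary choice)
    [ "$(head -c 512 "$1" | md5sum)" = "$(head -c 512 "$2" | md5sum)" ] || return 1
    # stage 3: full hash as the final check
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

printf 'abc' > /tmp/stage_a
printf 'abc' > /tmp/stage_b
printf 'abd' > /tmp/stage_c
same_file /tmp/stage_a /tmp/stage_b && echo "a = b"
same_file /tmp/stage_a /tmp/stage_c || echo "a != c"
```

For large trees the early stages matter: most candidate pairs already differ in size, so the expensive full hash runs only for the rare near-matches.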
R-S-H

Joined: 18 Feb 2013
Posts: 490

Posted: Fri 15 Mar 2013, 09:07

Hi.

Thanks to all for the help. I have been busy, so I could not test any of this yet, but I will do so later tonight.

RSH

_________________
LazY Puppy Home
The new LazY Puppy Information Centre

seaside

Joined: 11 Apr 2007
Posts: 886

Posted: Fri 15 Mar 2013, 12:38

amigo wrote:
No, don't use cmp. It will only work on text files anyway, as it compares line-by-line.

Amigo,

There may be reasons not to use "cmp" in this context. However, according to cmp's own help text, it works on binary files too, since it compares byte by byte.
Code:
# cmp --help
Usage: cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]]
Compare two files byte by byte.

  -b  --print-bytes  Print differing bytes.
  -i SKIP  --ignore-initial=SKIP  Skip the first SKIP bytes of input.
  -i SKIP1:SKIP2  --ignore-initial=SKIP1:SKIP2
    Skip the first SKIP1 bytes of FILE1 and the first SKIP2 bytes of FILE2.
  -l  --verbose  Output byte numbers and values of all differing bytes.
  -n LIMIT  --bytes=LIMIT  Compare at most LIMIT bytes.
  -s  --quiet  --silent  Output nothing; yield exit status only.
  -v  --version  Output version info.
  --help  Output this help.


A couple of tests I just ran with binary file comparisons appeared accurate.
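For instance, cmp's silent mode (-s) reduces the whole check to an exit status, even on data containing NUL bytes; the file names below are throwaway examples:

```shell
# Three small files with a NUL byte each; a and b match, c differs.
printf 'a\000b' > /tmp/cmp_a
printf 'a\000b' > /tmp/cmp_b
printf 'a\000c' > /tmp/cmp_c

# Exit status 0 means byte-identical, 1 means they differ.
cmp -s /tmp/cmp_a /tmp/cmp_b && echo "a and b identical"
cmp -s /tmp/cmp_a /tmp/cmp_c || echo "a and c differ"
```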

Regards,
s
amigo

Joined: 02 Apr 2007
Posts: 2217

Posted: Fri 15 Mar 2013, 14:45

Maybe I was thinking of 'comm'. You *could* also use diff or others. I'm pretty sure hashing would be faster than comparing bit-by-bit, char-by-char or line-by-line. The file type(s) in question would have an influence on the best way, I guess. If there are many, many files to be examined, then every effort to go faster will count. Just test the difference between using 'stat -c %s filename' and parsing the output of 'ls -l filename' by any method. Probably something like:
Code:
ls -l |grep... |sed...|cat...|rev|twist|rev|sort|uniq

LOL
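Joking aside, a quick sanity check that stat and a parsed 'ls -l' agree on the size, plus a crude timing of the two, could look like this; /bin/sh is just a convenient existing file, and 200 iterations is an arbitrary count:

```shell
f=/bin/sh
# Field 5 of 'ls -l' is the size in bytes; stat asks for it directly.
size_stat=$(stat -c %s "$f")
size_ls=$(ls -l "$f" | awk '{print $5}')
echo "stat: $size_stat  ls: $size_ls"

# Crude timing: stat is one process per call, ls+awk is two.
time for i in $(seq 200); do stat -c %s "$f" >/dev/null; done
time for i in $(seq 200); do ls -l "$f" | awk '{print $5}' >/dev/null; done
```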
Karl Godt


Joined: 20 Jun 2010
Posts: 3953
Location: Kiel,Germany

Posted: Fri 15 Mar 2013, 18:27

I had similar ideas about letting the installer create an md5sum.lst, to make it easier for people to remember what they have installed over. This is just a thought for now.
But I hacked a snippet together for comparing /usr/* and /usr/local/*:
Code:
VERBOSE=''; #set VERBOSE=Y if want more output
PATH1=/usr/bin
PATH2=/usr/local/bin
rm -f program.lst dupes.lst
find $PATH1 -type f -exec md5sum {} \; >>program.lst
find $PATH2 -type f -exec md5sum {} \; >>program.lst
LINES_TOTAL=`wc -l program.lst |awk '{print $1}'`
C=0
echo -n "     Processing of $LINES_TOTAL"
while read line; do
[ "$line" ] || continue  #in case an empty line
MD5="${line%% *}"
FIL="${line##*/}"
C=$((C+1))
echo -en "\r$C"
[ "$VERBOSE" ] && echo "$MD5 is $FIL"
[ "`grep '/'"${FIL}$" program.lst | wc -l`" -gt 1 ] && { echo -e "\nMore than one of \"$FIL\" in \"$PATH1\" and \"$PATH2\":";grep '/'"${FIL}$" program.lst >>dupes.lst; }
[ "`grep -w "^${MD5}" program.lst | wc -l`" -gt 1 ] && { echo -e "\nMore than one of \"$MD5\" in \"$PATH1\" and \"$PATH2\":";grep -w "^${MD5}" program.lst >>dupes.lst; }
sleep 0.1
done<program.lst
echo

remove_dupes(){
while read line; do
[ "$line" ] || continue  #in case an empty line
MD5="${line%% *}"
FIL="${line##*/}"
CHOICES=`grep -w "^${MD5}" dupes.lst |sort -u |tr ' ' ':'`
CHOICES="SKIP__THIS
$CHOICES"
select choice in $CHOICES
  do
    echo "$choice"
    case $choice in
    SKIP__THIS) echo "Skipping";;
    *) echo "Removing ${choice##*:} ..."
    #rm -f "${choice##*:}"
    ;;
    esac
    break
  done
done<dupes.lst
}
#remove_dupes ##not tested yet