Some interesting time comparisons

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

Some interesting time comparisons

#1 Post by Nathan F »

Had a free hour and thought I'd try to get an idea of just how much time might be wasted by inefficient shell code. I devised a set of tests (not very rigorous, but interesting nonetheless) and thought I'd share the results. Since execution time can vary somewhat depending on system load, each test was run 10 times using the following test script.

Code: Select all

#!/bin/sh
# <script> stands in for the script under test; the loop itself is what gets
# timed, e.g. "time ./test.sh"
for num in 1 2 3 4 5 6 7 8 9 0
do ./<script> >/dev/null 2>&1
done
The baseline is a simple run of "find" to get a listing of images from a rather large collection (5011 images in multiple directories).

Code: Select all

#!/bin/sh

dirs="/storage/photos"
find_images() {
# note: without \( \) grouping, -type f applies only to the first -iname test
find $@ -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*"
}

find_images $dirs
And the result of the 10x loop is rather fast at 0.75 sec.

Code: Select all

./test.sh  0.53s user 0.14s system 89% cpu 0.754 total
Now to do something useful with the data. I've recently been getting back to using gtkdialog, and for the tree widget the columns are separated with a pipe. Let's format the output for a gtkdialog tree (making a few common mistakes for the sake of comparison).

Code: Select all

#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  | egrep ".jpg|.jpeg|.JPG|.png|.PNG|.tif|.tiff|.TIF|.svg" \
  | while read line ; do echo "$line|$(basename "$line")" ; done
  # note: the egrep pattern is unescaped and unanchored (so ".jpg" also matches
  # e.g. "ajpgb"), and basename is forked once per file inside the loop
}
find_images $dirs
Run 10x again, significantly slower, almost 2 minutes.

Code: Select all

./test.sh  13.12s user 17.36s system 25% cpu 1:57.86 total
Here's a simple tweak. The find command has some formatting and filtering options, including -name and -iname, the latter being case insensitive.

Code: Select all

#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  \( -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" \) \
  | while read line ; do echo "$line|$(basename "$line")" ; done
}
find_images $dirs
The result edges out the egrep version, but not by as much as I expected. Apparently pipelines really are quite efficient, and egrep is a rather fast utility.

Code: Select all

./test.sh  13.83s user 16.92s system 26% cpu 1:54.15 total
But can we do better by having find do all of the forking itself and bypassing the shell loop?

Code: Select all

#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  \( -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" \) \
  -exec echo -n '{}|' \; \
  -exec basename '{}' \;
}

find_images $dirs
The results are disappointing, coming in at about 2.5 minutes.

Code: Select all

./test.sh > /dev/null  4.64s user 19.18s system 15% cpu 2:34.87 total
Now for an apples-to-oranges comparison: I'm taking my first steps with Python at present, so I wanted to see how to do the same things there. First, just finding the image files without any special formatting.

Code: Select all

#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        print (fullpath)

for dir in dirs:
  find_images (dir)
First thing that pops into my head is how much harder it is to accomplish what can be done with a single invocation of find. To be fair, I'm a rank amateur and maybe someone else could do it a lot better. How does the speed compare, though? Fairly badly against the utility, showing that the compiled binary wins out over the interpreted script every time.

Code: Select all

./test.sh  3.82s user 0.39s system 87% cpu 4.823 total
Can we do the same formatting with Python? Yes, and here's where Python starts to shine. It's only a small tweak.

Code: Select all

#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        # print(..., end='') needs Python 3 (or a __future__ print_function import)
        print (fullpath, end='')
        print ('|', end='')
        print (filename)

for dir in dirs:
  find_images (dir)
And the speed result handily beats even the best of the shell functions.

Code: Select all

./test.sh  4.32s user 0.39s system 87% cpu 5.389 total
Can we get even faster? Well, we were running the function 10x via a shell script. How about we see whether bash or Python is faster at running the for loop itself?

Code: Select all

#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        print (fullpath, end='')
        print ('|', end='')
        print (filename)

nums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
for num in nums:
  for dir in dirs:
    find_images (dir)

Code: Select all

./find_images4.py &> /dev/null  3.44s user 0.33s system 87% cpu 4.304 total
That's half a second faster than bash running the for loop 10x, although it's possible that's because the Python version runs everything in one script instead of one script calling another ten times. In other words, without the overhead of loading the Python interpreter ten times, it's likely bash would have fared better in the for loop comparison.
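One way to separate the two effects would be to time the interpreter start-up on its own, something like:

Code: Select all

# ten launches of an empty Python program, i.e. roughly the start-up overhead
# the earlier 10x tests paid on every iteration
time ( for n in 1 2 3 4 5 6 7 8 9 0 ; do python -c '' ; done )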

I wouldn't go too far with conclusions, but a few things are pretty clear.

Even though the find utility handily beat my Python function, once you actually do anything with the data you invariably end up calling other shell utilities, each of which takes time to load and slows the script down. So even for a simple formatting exercise like this, Python was the clear speed winner, and I'm beginning to see why some people love it so much.

The shell is great for short scripts where execution time isn't a huge issue, but when dealing with large amounts of data (in this example, over 50k lines of output once it looped 10x) the shell becomes painfully slow. I imagine a purpose-written program in C would handily beat Python at the task, but that's not the point either. Do I even have a point... not sure.
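Incidentally, the load cost of those external utilities is easy to see in isolation; for instance (the iteration count is arbitrary and the path doesn't need to exist):

Code: Select all

# a thousand invocations of an external basename on the same string
p=/storage/photos/some/image.jpg
time ( for i in $(seq 1 1000); do basename "$p"; done >/dev/null )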
Bring on the locusts ...

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#2 Post by Nathan F »

As a really interesting addendum, I tried all the shell functions using busybox with the "exec prefers applets" option enabled. The shell test with the while loop ran in 26.762 seconds, and the one -exec'ing the extra commands from find ran in 22.889 seconds. That's still nowhere near as fast as the Python functions, but it handily beats bash + GNU coreutils. It also reverses the original results as far as -exec from find versus piping into a while loop goes.

Good food for thought there.

The "exec prefers applets" busybox config tells busybox not to fork a new process if there is an applet with the proper function.
Bring on the locusts ...

User avatar
Karl Godt
Posts: 4199
Joined: Sun 20 Jun 2010, 13:52
Location: Kiel,Germany

#3 Post by Karl Godt »

Using find a second time in the same directory usually takes less time, since the kernel can cache a lot when plenty of RAM is installed.

You should also test syntax like this in ash and bash:

Code: Select all

dir='/storage/photos'
list_function(){
# note: ls -1 lists only the top-level directory (no recursion) and prints bare
# filenames rather than full paths, unlike the find-based versions above
while read filename
do
  [ "$filename" ] || continue
  echo "$filename"
done<<EOI
$(ls -1 "$@" | grep -iE -e '\.jpg$|\.jpeg$|\.png$|\.svg$|\.ti[f]*$')
EOI
}
list_function "$dir"
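For example, save the function to a file (test_list.sh is just a made-up name here) and run it under each interpreter explicitly:

Code: Select all

time bash ./test_list.sh
time busybox ash ./test_list.sh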
«Give me GUI or Death» -- I give you [[Xx]term[inal]] [[Cc]on[s][ole]] .
Macpup user since 2010 on full installations.
People who want problems with Puppy boot frugal :P

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

#4 Post by technosaurus »

@nathan - it may be faster if you used pure shell without calling external binaries (or applets that need to be forked)
Ex. Instead of find, use echo $path/*.ext
Instead of grep, add a case to your while,
case $line in *.svg|*.png)...;;esac

Should drastically improve speed in most cases.
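A rough sketch of that idea (the recursive helper and the directory test are additions here, since a single glob only covers one directory level):

Code: Select all

#!/bin/sh
# glob expansion stands in for find and a case pattern stands in for grep,
# so nothing is forked per file
list_images() {
  for f in "$1"/*; do
    if [ -d "$f" ]; then
      list_images "$f"
    else
      case "$f" in
        *.jpg|*.jpeg|*.JPG|*.png|*.PNG|*.svg|*.tif*) echo "$f" ;;
      esac
    fi
  done
}
list_images /storage/photos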
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#5 Post by Nathan F »

technosaurus wrote:@nathan - it may be faster if you used pure shell without calling external binaries (or applets that need to be forked)
Ex. Instead of find, use echo $path/*.ext
Instead of grep, add a case to your while,
case $line in *.svg|*.png)...;;esac

Should drastically improve speed in most cases.
Very true. The problem with echo comes when you might have spaces in filenames, since everything goes onto a single line and there's no way to parse it apart again. But the idea of using case in the while loop with globbing like that, I like a lot. Although find is very, very fast on its own as long as you don't fork anything from it; most options passed to it, including both -name and -iname, don't seem to have much of a penalty.
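To show the word-splitting problem with echo, a quick illustration with made-up file names:

Code: Select all

# set the positional parameters to two sample paths, one containing a space
set -- '/tmp/pics/one.png' '/tmp/pics/two photos.png'
echo $*                               # everything lands on one line; the space is now ambiguous
for f in "$@"; do echo "$f"; done     # one path per line, spaces intact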

Something else I tried for a situation like this, where you want a basename and a path for each file and expect a large number of results, was to pipe find through a couple of other commands to batch-format the output before the loop.

Code: Select all

find <path> | rev | sed 's:/:|:' | rev |
while IFS='|' read path name ; do <commands> ; done
It doesn't save much for a small batch of files, but on a large batch it avoids calling basename and dirname repeatedly.

Anyway, there are always a bunch of ways to skin the same cat in the shell, and they're usually not at all created equal. I appreciate the tips.
Bring on the locusts ...

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#6 Post by Nathan F »

Or even better.

Code: Select all

# the greedy \(.*\) makes sed match up to the last slash, so only that slash becomes the pipe
find <path> | sed 's:\(.*\)/:\1|:' | while IFS='|' read path file ; do <commands> ; done
Bring on the locusts ...

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#7 Post by Nathan F »

For comparison.

Code: Select all

$ time ( for n in `seq 0 1 10`
do find /storage/photos -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif" \
  -o -iname "*.tiff" \
  | sed 's:\(.*\)/:\1|:' \
  | while IFS='|' read dir file
    do echo "$dir/$file|$file"
  done &>/dev/null
done)
$ 4.41s user 2.14s system 84% cpu 7.716 total
That's not quite as fast as the Python example, but it's getting awfully close. I'll have to try it with busybox.
Bring on the locusts ...

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#8 Post by Nathan F »

And busybox just did me proud.

Code: Select all

#!/tools/bin/ash
# note: three-argument seq counts by the middle number, so `seq 1 2 10` gives 1 3 5 7 9 (five iterations)
for n in `seq 1 2 10`
do find /storage/photos \
  -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" | \
  sed 's:\(.*\)/:\1|:' | \
  while IFS='|' read dir file
    do echo "$dir/$file|$file"
  done
done &>/dev/null

Code: Select all

$ time ./find_images.ash
./find_images.ash  1.27s user 1.23s system 95% cpu 2.603 total
That's a pretty huge speed difference over the other methods, and it even manages to put Python in its place.
Bring on the locusts ...

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

#9 Post by technosaurus »

There is a fast way to do basename and dirname in the shell using ${VAR##*/} and ${VAR%/*}, though I may have them backward (typing on my phone). It's called substring manipulation, and there's also a bash/ash extension for sed-like replacing: ${VAR//OLD/NEW} to replace all matches, or a single / to replace only the first.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#10 Post by Nathan F »

Damn, thanks for that. It's golden. Works in zsh too apparently.
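For the record, the order given above is right; here it is with a made-up path:

Code: Select all

f='/storage/photos/trip/beach scene.jpg'
echo "${f##*/}"          # strip the longest */ from the front: beach scene.jpg  (basename)
echo "${f%/*}"           # strip the shortest /* from the end:  /storage/photos/trip  (dirname)
echo "${f/photos/pics}"  # single slash replaces only the first match
echo "${f//o/0}"         # double slash replaces every match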
Bring on the locusts ...

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#11 Post by Nathan F »

This seems to be a great leveler, with zsh, bash, and busybox ash all clocking in at around 2.3-2.4 seconds for the equivalent script. Gotta love shell builtins.
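The script itself isn't pasted above, but it's essentially the pipeline from post #7 with the sed stage replaced by the parameter expansion; roughly:

Code: Select all

#!/bin/sh
for n in 1 2 3 4 5 6 7 8 9 0
do find /storage/photos -type f \
     \( -iname "*.jp*g" -o -iname "*.png" -o -iname "*.svg" -o -iname "*.tif*" \) \
   | while read file
     do echo "$file|${file##*/}"
     done
done >/dev/null 2>&1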
Bring on the locusts ...

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

#12 Post by technosaurus »

ls -R1 <dir> may be closer to the python code than find

Also awk is generally pretty fast/flexible for this stuff too... Anytime ash fails me, awk usually saves me.
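For instance, awk can do the path|basename formatting in a single extra process (just an illustration, not timed against the versions above):

Code: Select all

# -F/ splits each path on slashes; $NF is the last field (the basename), $0 the full path
find /storage/photos -type f \
  \( -iname "*.jp*g" -o -iname "*.png" -o -iname "*.svg" -o -iname "*.tif*" \) \
| awk -F/ '{ print $0 "|" $NF }'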
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

User avatar
Nathan F
Posts: 1764
Joined: Wed 08 Jun 2005, 14:45
Location: Wadsworth, OH (occasionally home)
Contact:

#13 Post by Nathan F »

technosaurus wrote:ls -R1 <dir> may be closer to the python code than find

Also awk is generally pretty fast/flexible for this stuff too... Anytime ash fails me, awk usually saves me.
The output of find is much easier to parse for a lot of things than ls -R1. And at this point it's handily beating Python, although since I'm just starting with Python there could very well be a faster way than the one I worked out.
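Roughly what parsing ls -R1 output involves, for comparison (a sketch only; it also can't tell files from subdirectories without an extra test):

Code: Select all

# ls -R1 prints a "directory:" header, then bare names, then a blank line,
# so the reader has to carry the current directory along itself
ls -R1 /storage/photos | while read line ; do
  case "$line" in
    *:) dir=${line%:} ;;
    '') ;;
    *)  echo "$dir/$line|$line" ;;
  esac
done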

As for awk, I am quite impressed with it but still a novice in its usage. I've picked up a bit lately, and one of these days I'll sit down and do a real study.

I have to say though, I am impressed with Python's capabilities too, mostly its extensibility and syntax. If you can get used to the indentation (and some people can't) it makes the code remarkably concise in some cases. It makes me wonder if anyone ever thought to create a command shell with similar syntax.
Bring on the locusts ...
