Some interesting time comparisons
Posted: Mon 19 Aug 2013, 05:07
Had a free hour and thought I'd try to get an idea just how much time might be wasted with inefficient shell code. I devised a set of tests (not very rigorous but interesting nonetheless) and thought I'd share the results. Since execution time can vary somewhat depending on system load, each test was run 10 times using the following test script.
The baseline is a simple run of "find" to get a listing of images from a rather large collection (5011 images in multiple directories).
And the result of the 10x loop is rather fast at 0.75 sec.
Now to do something useful with the data. I recently have been getting back to using gtkdialog, and for the tree widget the columns are separated with a pipe. Let's format the output for a gtkdialog tree (making a few common mistakes for sake of comparison).
Run 10x again, significantly slower, almost 2 minutes.
Here's a simple tweak. The find command has its own filtering (and even formatting) options, including -name and -iname, the latter being case-insensitive.
The result edges out the first formatting attempt, but not by as much as I expected. Apparently pipelines really are quite efficient, and egrep is a rather fast utility.
But can we do better, by having find do all of the forking itself and bypassing the shell?
The results are disappointing, coming in at nearly 2.5 minutes.
Comparing apples to oranges now: I'm taking my first steps with Python at present, so I wanted to see how to do the same things there. First, just finding the image files without any special formatting.
The first thing that pops into my head is how much harder it is to accomplish what can be done with a single invocation of find. To be fair, I'm a rank amateur, and maybe someone else could do it a lot better. How does the speed compare, though? Fairly badly against the find utility, showing that the compiled binary wins out over the interpreted script any time.
Can we do the same formatting with Python? Yes, and here's where Python starts to shine. It's only a small tweak.
And the speed result handily beats even the best shell function by a wide margin.
Can we get even faster? Well we were running the function 10x via a shell script. How about we see whether bash or Python is faster running a for loop.
That's half a second faster than bash was running the for loop 10x, although it's possible that's due to the Python version running all in one script instead of one script calling another ten times. In other words, the Python loop only pays the cost of loading the interpreter once; with that overhead paid ten times, it's likely bash would have fared better in the for loop comparison.
I wouldn't go too far with conclusions, but a few things are pretty clear.
Even though the find utility handily beat my Python function, once anything is done with the data you invariably end up calling other shell utilities, each of which takes time to load, slowing down the script. So even for a simple formatting exercise like this, Python was the clear speed winner. I'm beginning to see why some people love it so much. The shell is great for short scripts where execution time isn't a huge issue, but when dealing with large amounts of data (in this example, over 50k lines of output after looping 10x) the shell becomes painfully slow. I imagine a purpose-written program in C would handily beat Python at the task, but that's not the point either. Do I even have a point... not sure.
Code:
#!/bin/sh
# run the script under test ten times, discarding all output
# (note: "&>" is a bashism; ">/dev/null 2>&1" is the portable form for /bin/sh)
for num in 1 2 3 4 5 6 7 8 9 0
do ./<script> >/dev/null 2>&1
done
Code:
#!/bin/sh
dirs="/storage/photos"
find_images() {
    # parentheses group the -o alternatives, so -type f applies to all of them
    find "$@" -type f \
        \( -iname "*.jp*g" \
        -o -iname "*.png" \
        -o -iname "*.svg" \
        -o -iname "*.tif*" \)
}
find_images $dirs
Code:
./test.sh 0.53s user 0.14s system 89% cpu 0.754 total
Code:
#!/bin/sh
dirs="/storage/photos"
find_images() {
    # escape the dots and anchor at end-of-name so only real extensions match
    find "$@" -type f \
    | egrep '\.(jpg|jpeg|JPG|png|PNG|tif|tiff|TIF|svg)$' \
    | while IFS= read -r line ; do echo "$line|$(basename "$line")" ; done
}
find_images $dirs
Code:
./test.sh 13.12s user 17.36s system 25% cpu 1:57.86 total
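A side note on where all that time goes: most of it is the shell forking a basename process for every single file. POSIX parameter expansion, `${line##*/}`, strips the directory part inside the shell itself with no extra process. Here's a minimal runnable sketch of the idea; a temporary directory with a few dummy files stands in for /storage/photos so it can be tried anywhere:

```shell
#!/bin/sh
# Sketch: same pipeline, but "${line##*/}" replaces $(basename ...),
# so no process is forked per line. A temp dir stands in for /storage/photos.
demo=$(mktemp -d)
touch "$demo/cat.jpg" "$demo/dog.PNG" "$demo/notes.txt"

find_images() {
    find "$@" -type f \
    | egrep '\.(jpg|jpeg|JPG|png|PNG|tif|tiff|TIF|svg)$' \
    | while IFS= read -r line ; do echo "$line|${line##*/}" ; done
}

find_images "$demo"   # prints path|name for the two image files only
rm -rf "$demo"
```

The same substitution drops straight into the -iname variant below as well.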
Code:
#!/bin/sh
dirs="/storage/photos"
find_images() {
    find "$@" -type f \
        \( -iname "*.jp*g" \
        -o -iname "*.png" \
        -o -iname "*.svg" \
        -o -iname "*.tif*" \) \
    | while IFS= read -r line ; do echo "$line|$(basename "$line")" ; done
}
find_images $dirs
Code:
./test.sh 13.83s user 16.92s system 26% cpu 1:54.15 total
Code:
#!/bin/sh
dirs="/storage/photos"
find_images() {
    find "$@" -type f \
        \( -iname "*.jp*g" \
        -o -iname "*.png" \
        -o -iname "*.svg" \
        -o -iname "*.tif*" \) \
        -exec echo -n '{}|' \; \
        -exec basename '{}' \;
}
find_images $dirs
Code:
./test.sh > /dev/null 4.64s user 19.18s system 15% cpu 2:34.87 total
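For the record, GNU find can do the formatting entirely by itself: the -printf action (a GNU extension, so this assumes GNU find rather than busybox or BSD find) knows %p for the full path and %f for the basename, so no echo or basename processes are forked at all. A sketch of that idea, again using a temp dir in place of /storage/photos:

```shell
#!/bin/sh
# Sketch (GNU find only): -printf '%p|%f\n' emits "fullpath|basename"
# in a single action, forking nothing. A temp dir stands in for /storage/photos.
demo=$(mktemp -d)
touch "$demo/cat.jpg" "$demo/dog.tiff" "$demo/notes.txt"

find "$demo" -type f \
    \( -iname "*.jp*g" \
    -o -iname "*.png" \
    -o -iname "*.svg" \
    -o -iname "*.tif*" \) \
    -printf '%p|%f\n'    # prints the cat.jpg and dog.tiff lines only

rm -rf "$demo"
```

I haven't timed it against the scripts above, but since everything happens inside one process it should avoid the per-file fork cost that the -exec version pays.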
Code:
#!/usr/bin/env python
import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']
dirs = ['/storage/photos']
def find_images(dir):
    for root, dirnames, filenames in os.walk(dir):
        for ext in imagext:
            for filename in fnmatch.filter(filenames, ext):
                fullpath = os.path.join(root, filename)
                print(fullpath)

for dir in dirs:
    find_images(dir)
Code:
./test.sh 3.82s user 0.39s system 87% cpu 4.823 total
Code:
#!/usr/bin/env python
import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']
dirs = ['/storage/photos']
def find_images(dir):
    for root, dirnames, filenames in os.walk(dir):
        for ext in imagext:
            for filename in fnmatch.filter(filenames, ext):
                fullpath = os.path.join(root, filename)
                print(fullpath, end='')
                print('|', end='')
                print(filename)

for dir in dirs:
    find_images(dir)
Code:
./test.sh 4.32s user 0.39s system 87% cpu 5.389 total
Code:
#!/usr/bin/env python
import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']
dirs = ['/storage/photos']
def find_images(dir):
    for root, dirnames, filenames in os.walk(dir):
        for ext in imagext:
            for filename in fnmatch.filter(filenames, ext):
                fullpath = os.path.join(root, filename)
                print(fullpath, end='')
                print('|', end='')
                print(filename)

nums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
for num in nums:
    for dir in dirs:
        find_images(dir)
Code:
./find_images4.py &> /dev/null 3.44s user 0.33s system 87% cpu 4.304 total