Some interesting time comparisons
Nathan F
Posted: Mon 19 Aug 2013, 01:07    Post subject: Some interesting time comparisons

Had a free hour and thought I'd try to get an idea just how much time might be wasted with inefficient shell code. I devised a set of tests (not very rigorous but interesting nonetheless) and thought I'd share the results. Since execution time can vary somewhat depending on system load, each test was run 10 times using the following test script.
Code:
#!/bin/sh
# run the script under test ten times, discarding its output
# (&> is a bash-ism; under a strict POSIX /bin/sh use >/dev/null 2>&1 instead)
for num in 1 2 3 4 5 6 7 8 9 0
do ./<script> &>/dev/null
done

The baseline is a simple run of "find" to get a listing of images from a rather large collection (5011 images in multiple directories).
Code:
#!/bin/sh

dirs="/storage/photos"
find_images() {
# note: without \( \) grouping, -type f binds only to the first -iname test
find $@ -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*"
}

find_images $dirs

And the result of the 10x loop is rather fast at 0.75 sec.
Code:
./test.sh  0.53s user 0.14s system 89% cpu 0.754 total

Now to do something useful with the data. I've recently been getting back to using gtkdialog, and for the tree widget the columns are separated with a pipe. Let's format the output for a gtkdialog tree (making a few common mistakes for the sake of comparison).
Code:
#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  | egrep ".jpg|.jpeg|.JPG|.png|.PNG|.tif|.tiff|.TIF|.svg" \
  | while read line ; do echo "$line|$(basename "$line")" ; done
}
find_images $dirs

Run 10x again, significantly slower, almost 2 minutes.
Code:
./test.sh  13.12s user 17.36s system 25% cpu 1:57.86 total

Here's a simple tweak. The find command has some formatting and filtering options, including -name and -iname, the latter being case insensitive.
Code:
#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  \( -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" \) \
  | while read line ; do echo "$line|$(basename "$line")" ; done
}
find_images $dirs

The result edges out the previous attempt, but not by as much as I expected. Apparently pipelines really are quite efficient, and egrep is a rather fast utility.
Code:
./test.sh  13.83s user 16.92s system 26% cpu 1:54.15 total

But can we do better by having find do all of the forking itself, bypassing the shell loop?
Code:
#!/bin/sh

dirs="/storage/photos"
find_images() {
find $@ -type f \
  \( -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" \) \
  -exec echo -n '{}|' \; \
  -exec basename '{}' \;
}

find_images $dirs

The result is disappointing, coming in at 2.5 minutes.
Code:
./test.sh > /dev/null  4.64s user 19.18s system 15% cpu 2:34.87 total

Comparing apples to oranges now: I'm taking my first steps with Python at present, so I wanted to see how to do the same things there. First, just finding the image files without any special formatting.
Code:
#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        print (fullpath)

for dir in dirs:
  find_images (dir)

The first thing that pops into my head is how much more code it takes to accomplish what a single invocation of find can do. To be fair, I'm a rank amateur, and maybe someone else could do it a lot better. How does the speed compare, though? Fairly badly against the compiled utility, showing that the binary wins out over the script every time.
Code:
./test.sh  3.82s user 0.39s system 87% cpu 4.823 total

Can we do the same formatting with Python? Yes, and here's where Python starts to shine. It's only a small tweak.
Code:
#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        # print(..., end='') needs Python 3 (or "from __future__ import print_function" on 2.x)
        print (fullpath, end='')
        print ('|', end='')
        print (filename)

for dir in dirs:
  find_images (dir)

And the speed result handily beats even the best shell function by a wide margin.
Code:
./test.sh  4.32s user 0.39s system 87% cpu 5.389 total

Can we get even faster? Well, so far we've been running the function 10x via a shell script wrapper, so let's see whether bash or Python is faster at running the for loop itself.
Code:
#!/usr/bin/env python

import os
import fnmatch
imagext = ['*.jpg', '*.JPG', '*.jpeg', '*.png', '*.PNG', '*.tif', '*.tiff', '*.svg']

dirs = ['/storage/photos']

def find_images(dir):
  for root, dirnames, filenames in os.walk(dir):
    for ext in imagext:
      for filename in fnmatch.filter(filenames, ext):
        fullpath = os.path.join(root, filename)
        print (fullpath, end='')
        print ('|', end='')
        print (filename)

nums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
for num in nums:
  for dir in dirs:
    find_images (dir)

Code:
./find_images4.py &> /dev/null  3.44s user 0.33s system 87% cpu 4.304 total

That's half a second faster than bash running the for loop 10x, although it's possible that's down to the Python version running as one script instead of one script calling another ten times. In other words, the bash-driven test had to load the Python interpreter ten times; without that startup overhead, bash's for loop would likely have fared better in the comparison.

I wouldn't go too far with conclusions, but a few things are pretty clear.

Even though the find utility handily beat my Python function, once you do anything with the data you invariably end up calling other shell utilities, each of which takes time to load and slows the script down. So even for a simple formatting exercise like this, Python was the clear speed winner, and I'm beginning to see why some people love it so much. The shell is great for short scripts where execution time isn't a huge issue, but when dealing with large amounts of data (over 50k lines here after looping 10x), the shell becomes painfully slow. I imagine a purpose-written program in C would handily beat Python at the task, but that's not the point either. Do I even have a point... not sure.

Nathan F
Posted: Mon 19 Aug 2013, 02:22

As a really interesting addendum, I tried all the shell functions using busybox with the "exec prefers applets" option enabled. The shell test with the while loop ran in 26.762 seconds, and the version -exec'ing the extra commands from find ran in 22.889 seconds. That's still nowhere near as fast as the Python functions, but it handily beats bash + GNU coreutils. It also reverses the original ranking of -exec from find versus piping into a while loop.

Good food for thought there.

The "exec prefers applets" busybox config tells busybox not to fork a new process if there is an applet with the proper function.

Karl Godt
Posted: Mon 19 Aug 2013, 10:12

Running find a second time in the same directory usually takes less time, because with enough RAM installed the kernel can cache a lot of the directory data.
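If you want each run to start from a cold cache, one standard trick (as root) is to drop the kernel caches between timings:
Code:
sync                               # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes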

You should also test syntax like this in ash and bash:
Code:
dir='/storage/photos'
list_function(){
  while read filename
  do
    [ "$filename" ] || continue
    echo "$filename"
  done<<EOI
$(ls -1 "$@" | grep -iE -e '\.jpg$|\.jpeg$|\.png$|\.svg$|\.ti[f]*$')
EOI
}
list_function "$dir"   # double quotes here, so $dir actually expands

technosaurus
Posted: Fri 23 Aug 2013, 20:52

@nathan - it may be faster if you used pure shell, without calling external binaries (or applets that need to be forked).
For example, instead of find, use echo $path/*.ext; instead of grep, add a case to your while loop:
case $line in *.svg|*.png) ... ;; esac

That should drastically improve speed in most cases.
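A minimal sketch of that idea (my own function name and extension list, not technosaurus' code, so treat it as an illustration only):
Code:
#!/bin/sh
# recurse with shell globbing and filter with case, so nothing external is forked
list_images() {
  for entry in "$1"/* ; do
    [ -e "$entry" ] || continue          # an empty dir leaves the pattern unexpanded
    if [ -d "$entry" ] ; then
      list_images "$entry"               # recurse into subdirectories
      continue
    fi
    case "$entry" in
      *.jpg|*.JPG|*.jpeg|*.png|*.PNG|*.svg|*.tif|*.tiff|*.TIF) echo "$entry" ;;
    esac
  done
}

list_images /storage/photos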

Nathan F
Posted: Fri 23 Aug 2013, 22:39

technosaurus wrote:
@nathan - it may be faster if you used pure shell, without calling external binaries (or applets that need to be forked).
For example, instead of find, use echo $path/*.ext; instead of grep, add a case to your while loop:
case $line in *.svg|*.png) ... ;; esac

That should drastically improve speed in most cases.

Very true. The problem with echo comes when you might have spaces in filenames, since everything ends up on a single line and there's no way to parse it. But the idea of using case with globbing inside the while loop - that I like a lot. That said, find is very, very fast on its own as long as you don't fork anything from it. Most options passed to it, including both -name and -iname, don't seem to carry much of a penalty.
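To illustrate the difference (the paths are just made up): echo dumps every match onto one line, while iterating over the same glob keeps each name intact, spaces included.
Code:
# echo mashes all matches into a single space-separated line:
echo /storage/photos/*.jpg
# looping over the same glob keeps each filename whole, spaces and all:
for f in /storage/photos/*.jpg ; do echo "$f" ; done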

Something else I tried for a situation similar to this, where you want a basename and a path for each file and expect a large number of results, was to pipe find into a couple of other commands to batch-format the output before the loop.
Code:
# rev reverses each line, sed swaps the now-first "/" (the last one in the
# original path) for "|", and the second rev restores the order
find <path> | rev | sed 's:/:|:' | rev |
while IFS='|' read path name ; do <commands> ; done

Doesn't save much for a small batch of files but in a large batch it keeps from calling basename and dirname repeatedly.

Anyway, there are always a bunch of ways to skin the same cat in the shell, and they're usually not at all created equal. I appreciate the tips.

Nathan F
Posted: Fri 23 Aug 2013, 22:53

Or even better.
Code:
# the greedy \(.*\) means only the last "/" gets replaced with "|"
find <path> | sed 's:\(.*\)/:\1|:' | while IFS='|' read path file ; do <commands> ; done

Nathan F
Posted: Fri 23 Aug 2013, 23:07

For comparison.
Code:
$ time ( for n in `seq 0 1 10`
do find /storage/photos -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif" \
  -o -iname "*.tiff" \
  | sed 's:\(.*\)/:\1|:' \
  | while IFS='|' read dir file
    do echo "$dir/$file|$file"
  done &>/dev/null
done)
$ 4.41s user 2.14s system 84% cpu 7.716 total

That's not quite as fast as the Python example but it's getting awfully close. I'll have to try it with busybox.

Nathan F
Posted: Fri 23 Aug 2013, 23:22

And busybox just did me proud.
Code:
#!/tools/bin/ash
for n in `seq 1 2 10`   # note: seq 1 2 10 steps by two (1 3 5 7 9), i.e. five iterations
do find /storage/photos \
  -type f \
  -iname "*.jp*g" \
  -o -iname "*.png" \
  -o -iname "*.svg" \
  -o -iname "*.tif*" | \
  sed 's:\(.*\)/:\1|:' | \
  while IFS='|' read dir file
    do echo "$dir/$file|$file"
  done
done &>/dev/null

Code:
$ time ./find_images.ash
./find_images.ash  1.27s user 1.23s system 95% cpu 2.603 total

That's a pretty huge speed difference over the other methods and even manages to put Python in its place.

technosaurus
Posted: Sat 24 Aug 2013, 00:02

There is a fast way to do basename and dirname using pure shell: ${VAR##*/} and ${VAR%/*}, though I may have them backward (typing on my phone). It's called substring manipulation, and there's also a bash/ash extension for sed-like replacing: ${VAR//OLD/NEW} to replace all occurrences, or a single / to replace only the first.
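A quick illustration with a made-up path:
Code:
path=/storage/photos/2013/cat.jpg
echo "${path##*/}"     # cat.jpg                        (strip longest  */ prefix = basename)
echo "${path%/*}"      # /storage/photos/2013           (strip shortest /* suffix = dirname)
echo "${path//o/0}"    # /st0rage/ph0t0s/2013/cat.jpg   (replace every "o", bash/ash extension)
echo "${path/o/0}"     # /st0rage/photos/2013/cat.jpg   (replace only the first "o")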
Nathan F
Posted: Sat 24 Aug 2013, 00:48

Damn, thanks for that. It's golden. Works in zsh too apparently.
Nathan F
Posted: Sat 24 Aug 2013, 02:15

This seems to be a great leveler, with zsh, bash, and busybox ash all clocking in at 2.3-2.4 seconds for the equivalent script. Gotta love shell builtins.
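For reference, a minimal sketch of what that builtin-only script might look like (my reconstruction using the same find pattern as above, not the exact script that was timed):
Code:
#!/bin/sh
# find supplies the paths; ${line##*/} replaces the earlier basename/sed calls
find /storage/photos -type f \
  \( -iname "*.jp*g" -o -iname "*.png" -o -iname "*.svg" -o -iname "*.tif*" \) |
while IFS= read -r line ; do
  echo "$line|${line##*/}"
done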
technosaurus
Posted: Sat 24 Aug 2013, 03:45

ls -R1 <dir> may be closer to the Python code than find.

Also, awk is generally pretty fast and flexible for this stuff too... any time ash fails me, awk usually saves me.
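As one example of the awk route (producing the same path|basename pairs as earlier in the thread):
Code:
# -F/ splits each line on "/", so $NF is the last component, i.e. the basename
find /storage/photos -type f | awk -F/ '{ print $0 "|" $NF }'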

Nathan F
Posted: Sat 24 Aug 2013, 10:30

technosaurus wrote:
ls -R1 <dir> may be closer to the Python code than find.

Also, awk is generally pretty fast and flexible for this stuff too... any time ash fails me, awk usually saves me.

The output of find is much easier to parse for a lot of things than ls -R1. And at this point it's handily beating Python, although since I'm just starting with Python there could very well be a faster way than the one I worked out.

As for awk, I am quite impressed with it but still a novice in its usage. I've picked up a bit lately, and one of these days I'll sit down and do a real study.

I have to say, though, I'm impressed with Python's capabilities too, mostly its extensibility and syntax. If you can get used to the indentation (and some people can't), it makes the code remarkably concise in some cases. It makes me wonder if anyone ever thought to create a command shell with similar syntax.
