All times are UTC - 4 |
zigbert

Joined: 29 Mar 2006 Posts: 5295 Location: Valåmoen, Norway
|
Posted: Sat 29 Sep 2012, 09:12 Post subject:
Bash: sort Subject description: Is it possible to sort a file based on another file ? |
|
Let's say file1 looks like this:
| Code: |
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
|
...and file2 contains the correct order:
| Code: |
002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
|
Yes, the sort order is correct even though 002 is above 001.
How can I sort file1 based on file2? ... at the speed of light
Thank you
Sigmund
_________________ Stardust resources
SFR

Joined: 26 Oct 2011 Posts: 573
|
Posted: Sat 29 Sep 2012, 09:59 Post subject:
|
|
Interesting problem...
Ok, here's the first attempt.
This one will work only if the numbers in file2 (001, 002 ...) correspond exactly to the line numbers in file1, as shown in your examples.
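| Code: | #!/bin/bash
# For each index in file2, print the line of file1 with that line number.
# -v passes the index as data, so "010" is read as decimal 10 and not as an octal constant.
for i in $(awk '{print $1}' file2); do
awk -v n="$i" 'NR == n+0' file1
done |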
I don't know about "the speed of light"; it would have to be tested on something larger, I guess.
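Whether the numbers really line up with the line numbers can be checked before relying on the loop. This is a minimal sketch, assuming the three-field `|`-separated layout from the first post; the scratch directory and the `mismatches` variable are names made up for this illustration.

```shell
#!/bin/bash
# Scratch copy of the sample data from the first post (hypothetical location).
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
EOF

# Count the lines whose leading number in field 3 differs from the line number.
# split() puts that number into k[1]; adding 0 forces a numeric comparison.
mismatches=$(awk -F'|' '
{ split($3, k, " "); if (k[1] + 0 != NR) n++ }
END { print n + 0 }
' "$dir/file1")

if [ "$mismatches" -eq 0 ]; then
    echo "numbers match line numbers, the NR lookup is safe"
else
    echo "$mismatches line(s) out of place, use a key-based lookup instead"
fi
rm -rf "$dir"
```

If the count is not zero, looking lines up by line number would return the wrong records, and a lookup keyed on the field itself is the safer choice.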
Greetings!
_________________ [O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource
Omnia mea mecum porto.
L18L
Joined: 19 Jun 2010 Posts: 1806 Location: Burghaslach, Germany
|
Posted: Sat 29 Sep 2012, 11:25 Post subject:
sort |
|
# time sort -t '|' -k 2 file1
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
real 0m0.030s
user 0m0.007s
sys 0m0.007s
#
# time for i in `awk '{print $1}' file2`; do
> awk 'NR=='$i'' file1
> done
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
real 0m0.158s
user 0m0.053s
sys 0m0.013s
#
technosaurus

Joined: 18 May 2008 Posts: 3845
|
Posted: Sat 29 Sep 2012, 17:27 Post subject:
|
|
BTW, if you are using awk, you can right-justify text like this:
| Code: | echo -e "1 hello\n2 world\n256 last" | awk '{printf "%5s %s\n",$1, $2}' |
In awk you can use associative arrays: fill them while processing the first file, then use them while processing the second (the order matters; awk parses the first file first). You can also run actions before and/or after all files are read.
(By associative arrays, I mean that you can freely name the keys, like a[filename]=number, b[filename]=time ...)
Here is the template I use for when I forget all the random features:
| Code: | #!/bin/awk -f
#FILENAME (name of current file) $0 (current line of current file)
#NF number of fields, $NF last field
#NR line number in all files #FNR line number in current file
#ORS (default is "\n") #RS (default is "\n")
#OFS (default is " ") #FS (default is " ", which splits on runs of spaces, tabs and newlines)
#system(command) run a command #close(filename) close(command)
#ARGC, ARGV similar to C, but skips some stuff
#IGNORECASE (default is 0) set to non-0 or use toupper() or tolower()
#ENVIRON array of env vars ex. ENVIRON["SHELL"] (equivalent of $SHELL)
#getline var < file ... close file or command | getline var
#index(haystack, needle) find needle in haystack
#length(string)
#match(string, regexp) returns where the regex starts, or 0
#RLENGTH length of /match/ substring or -1
#RSTART position where the /match/ substring starts, or 0
#split(string, array, fieldsep) split string into an array separated by fieldsep
#printf(format, expression1,...) print format-ted replacing %* with expressions
#%{c,d/i,e,f,g,o,s,x,X,%} char, decimal int, exp notation, float, shortest of
# exp/float, octal, string, hex int, capitalized hex int, a '%' character
#sprintf(format, expression1,...) store printf in a variable
#sub(regexp, replacement, target) replace first regex with replacement in target
#gsub(regexp, replacement, target) like sub but for all regex matches in target
#substr(string, start, length)get substring of string from start to start+length
#print > /dev/stdin, /dev/stdout, /dev/stderr, /dev/fd/# or filename
#output can be piped like print $0 | command
#comparisons <,>,<=,>=,==,!=,~,!~,in use && for AND, || for OR, ! for NOT
# (~ is for regexp and "in" looks for subscript in array)
#/word/{...} like if match(...) {...} equivalent of grep
#(condition) ? if-true-exp : if-false-exp or use if (condition){}
#math +,-,*,/,%,**,log(x),exp(x),sqrt(x),cos(x),sin(x),atan2(y,x),
#rand(),srand(x); gawk also has systime() and strftime()
#
#function name (parameter-list) {
# body-of-function
#}
BEGIN {
#actions that happen before any files are read in
}
#
{
#actions to do on files
}
#
END {
#actions to do after all files are done
}
|
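The two-file idea with freely named arrays can be sketched as follows. This is only an illustration using the sample data from the first post; the array names `len` and `title`, the `shown` counter, and the scratch directory are assumptions made for this example.

```shell
#!/bin/bash
# Scratch copies of the sample files from the first post (hypothetical location).
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
EOF
cat > "$dir/file2" <<'EOF'
002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
EOF

result=$(awk -F'|' '
NR == FNR {            # true only while the first file is being read
    len[$3]   = $1     # two separate arrays, both keyed by field 3
    title[$3] = $2
    next
}
{                      # second file: the whole line is the key
    printf "%s (%s)\n", title[$0], len[$0]
    shown++
}
END { printf "%d entries listed\n", shown }   # runs after all files
' "$dir/file1" "$dir/file2")

echo "$result"
rm -rf "$dir"
```

`NR == FNR` holds only for the first file because NR keeps counting across files while FNR restarts at 1 for each one.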
_________________ Puppy Web Desktop Now with pet packages - Pet Packaging 100 & 101
Last edited by technosaurus on Sat 20 Oct 2012, 14:21; edited 1 time in total
zigbert

Joined: 29 Mar 2006 Posts: 5295 Location: Valåmoen, Norway
|
Posted: Sun 30 Sep 2012, 05:55 Post subject:
|
|
I am thankful for all tips and input.
There are many ways to solve this, but I am still searching for brilliance
Sigmund
_________________ Stardust resources
akash_rawal

Joined: 25 Aug 2010 Posts: 215 Location: Pune, Maharashtra, India
|
Posted: Sun 30 Sep 2012, 13:42 Post subject:
|
|
Here's my attempt:
| Code: |
#!/bin/bash
#Utility
function endl()
{
cat
echo
}
#Our own private directory
tmp="/tmp/sort2"
mkdir -p "$tmp"
#Index file2
i=0
ifsbak="$IFS"
IFS=""
while read -r line; do
echo "$i|$line"
i=$(( $i+1 ))
done < "./file2" > "$tmp/file2_indexed"
#Sort both files alphabetically
sort -t '|' -k 3 -o "$tmp/file1_sorted" "./file1"
sort -t '|' -k 2 -o "$tmp/file2_sorted" "$tmp/file2_indexed"
#Load 'sorted' indices into array
IFS='|'
cut -d '|' -f 1 "$tmp/file2_sorted" | tr '
' '|' | endl | while read -a indices; do
IFS='
'
#read -a runs in a pipeline subshell here, so the array is not visible outside the loop
#Attach indices to file1_sorted
IFS=""
i=0
while read -r line; do
echo "${indices[$i]}|$line"
i=$(( $i+1 ))
done < "$tmp/file1_sorted" > "$tmp/file1_indexed"
#Sort it by attached index
sort -t '|' -k 1 -n -o "$tmp/file1_sorted_final" "$tmp/file1_indexed"
#Final output
cut -d '|' -f 2- "$tmp/file1_sorted_final"
break
done
|
For 100000 lines it takes 15 s.
I believe translating the script, or parts of it, into awk could speed it up, but I'm too lazy to learn awk.
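The same index, sort, strip idea can be written as one pipeline. This is a rough sketch and not a translation of the script above: awk prefixes every line of file1 with its position in file2, `sort` orders by that prefix, and `cut` removes it again. The `pos` array, the `sorted` variable and the scratch directory are made-up names.

```shell
#!/bin/bash
# Scratch copies of the sample files from the first post (hypothetical location).
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
EOF
cat > "$dir/file2" <<'EOF'
002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
EOF

# Note the argument order: file2 is read first so the positions are known
# by the time file1 is processed.
sorted=$(awk -F'|' '
NR == FNR { pos[$0] = FNR; next }   # file2: remember the position of each key
{ print pos[$3] "|" $0 }            # file1: prefix each line with that position
' "$dir/file2" "$dir/file1" |
    sort -t '|' -k 1,1n |           # order numerically by the prefix
    cut -d '|' -f 2-)               # drop the prefix again

echo "$sorted"
rm -rf "$dir"
```

Lines of file1 that have no match in file2 get an empty prefix, which `sort -n` treats as 0, so they end up at the top of the output.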
_________________ If there's an open source project in your hand,
Don't aim for popularity, it's out of your hand.
Aim for perfection, as that's the best thing you can.
rcrsn51

Joined: 05 Sep 2006 Posts: 7834 Location: Stratford, Ontario
|
Posted: Sun 30 Sep 2012, 22:46 Post subject:
Re: Bash: sort Subject description: Is it possible to sort a file based on another file ? |
|
| zigbert wrote: | | How to sort file1 based on file2 ? ... in the speed of light |
That means coding it in C. See attached.
| Code: | # time ./zigsort file1 file2
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
real 0m0.001s
user 0m0.000s
sys 0m0.000s |
Attachment: zigsort-1.0.tar.gz (3.3 KB, downloaded 91 times)
technosaurus

Joined: 18 May 2008 Posts: 3845
|
Posted: Sun 30 Sep 2012, 23:33 Post subject:
|
|
I _was_ too lazy to learn awk; now I'm too lazy to write 100 lines of shell to do what 3 lines of awk can.
The first argument is the unsorted file, the second is the sorted file.
| Code: | #!/bin/awk -f
BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}} |
in a shell script it would be:
| Code: | awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file |
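Spelled out, the one-liner appears to work like this: a line that has a third `|`-separated field must come from the unsorted file and is stored under that field; a line without one comes from the sorted file and is used as the lookup key. The sketch below keeps that logic and adds one thing that is not in the original, an `in` guard that skips keys with no stored line. The file contents, including the `004` entry, are made up for the illustration.

```shell
#!/bin/bash
# Scratch copies of the sample data (hypothetical location and contents).
dir=$(mktemp -d)
cat > "$dir/unsorted_file" <<'EOF'
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
EOF
cat > "$dir/sorted_file" <<'EOF'
002 /path/Bartist_Title.mp3
004 /path/Missing.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
EOF

result=$(awk 'BEGIN { FS = "|" }
{
    if ($3) {               # three fields: a line of the unsorted file
        d[$3] = $0          # store the whole line under its third field
    } else if ($1 in d) {   # added guard: only print keys that were stored
        print d[$1]
    }
}' "$dir/unsorted_file" "$dir/sorted_file")

echo "$result"
rm -rf "$dir"
```

Without the guard, the `004` line would produce an empty output line, because looking up a missing key yields an empty string.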
_________________ Puppy Web Desktop Now with pet packages - Pet Packaging 100 & 101
amigo
Joined: 02 Apr 2007 Posts: 1776
|
Posted: Mon 01 Oct 2012, 03:27 Post subject:
|
|
The suggested 'sort' command seemed to me the best:
"sort -t '|' -k 2 file1"
if that produces the desired result. The OP doesn't state how the order is pre-determined. If the order is completely arbitrary, then one of the other suggestions would be best.
Is the order arbitrary, or is it based on the data in column 2 of file1? Otherwise, how do you *produce* file2?
technosaurus

Joined: 18 May 2008 Posts: 3845
|
Posted: Mon 01 Oct 2012, 06:32 Post subject:
|
|
To me it sounded as if the order is simply the order in which the lines appear in the sorted file and has nothing to do with the contents (the sorted file is just the last column of the unsorted file in a user-defined order?). AFAICT every suggestion faster than my awk one-liner (with the exception of the compiled C, which is 40 times larger) sorted by the numeric values instead of by the order they appear in the file.
I was just solving the problem, not the underlying cause (padded zeroes, the sort category at the end of the line vs. the beginning, arbitrary fields, order and names...).
the time was:
real 0m0.009s
user 0m0.004s
sys 0m0.004s
And the time shouldn't increase significantly with file length, since that is about the same time it takes awk to run BEGIN{print .}
_________________ Puppy Web Desktop Now with pet packages - Pet Packaging 100 & 101
rcrsn51

Joined: 05 Sep 2006 Posts: 7834 Location: Stratford, Ontario
|
Posted: Mon 01 Oct 2012, 10:09 Post subject:
|
|
| technosaurus wrote: | | Code: | | awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file |
|
Clever. And surprisingly fast. With a test set of 999 records, it was 1/3 the speed of zigsort.
I wonder if there is any memory penalty for building an associative array that big?
zigbert

Joined: 29 Mar 2006 Posts: 5295 Location: Valåmoen, Norway
|
Posted: Mon 01 Oct 2012, 12:09 Post subject:
|
|
| technosaurus wrote: | | Code: | | awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file |
| Now we're talking
Thanks a lot
Sigmund
_________________ Stardust resources
rcrsn51

Joined: 05 Sep 2006 Posts: 7834 Location: Stratford, Ontario
|
Posted: Mon 01 Oct 2012, 15:44 Post subject:
|
|
As another test, I generated a data set of 9999 records.
| Code: | # time ./zigsort file1 file2 > file3
real 0m0.031s
user 0m0.008s
sys 0m0.020s
# time ./technosort file1 file2 > file3
real 0m0.039s
user 0m0.036s
sys 0m0.000s
# |
Technosort has caught up. It's holding all its data in memory so it only needs one pass through the files. Zigsort's need to re-read file1 is slowing it down.
rcrsn51

Joined: 05 Sep 2006 Posts: 7834 Location: Stratford, Ontario
|
Posted: Mon 01 Oct 2012, 16:28 Post subject:
|
|
But if I modify Zigsort to hold all its data internally, it yields
| Code: | # time ./zigsort file1 file2 > file3
real 0m0.011s
user 0m0.004s
sys 0m0.004s |
jamesbond
Joined: 26 Feb 2007 Posts: 1573 Location: The Blue Marble
|
Posted: Tue 02 Oct 2012, 10:03 Post subject:
|
|
My entry. It doesn't assume file1 is already sorted; it matches "012" from file1 exactly with "012" from file2.
| Code: | #!/bin/ash
ENTRIES=10000
FILE1=/tmp/file1
FILE2=/tmp/file2
OUTFILE=/tmp/outfile
generate_file1() {
for a in $(seq 1 $ENTRIES); do
printf "03:50|Artist - Title|%.3d /path/Artist_Title.mp3\n" $a
done > $FILE1
}
generate_file2() {
for a in $(seq 1 $ENTRIES); do
printf "%.3d /path/Artist_Title.mp3\n" $a
done | sort -R > $FILE2
}
# generate fake data for testing
generate_file1
generate_file2
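time -p -- awk -F"|" '
NR > FNR {
# sort: file2 lines are still split on "|", so cut the key out of $0 directly
# (changing FS here would only take effect from the next record on)
key=$0
sub(/ .*/,"",key)
print file1[key]
next
}
{
# scan: store each file1 line under the number that starts field 3
line=$0
sub(/ .*/,"",$3)
file1[$3]=line
}
' $FILE1 $FILE2 > $OUTFILE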
|
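A quick way to check any of the entries in this thread is to compare the third field of the output with file2, line by line. This is a small sketch: the output file is a hand-written stand-in for a real run, and the `verdict` variable and scratch directory are hypothetical names.

```shell
#!/bin/sh
# Scratch files (hypothetical location); outfile stands in for a real result.
dir=$(mktemp -d)
cat > "$dir/file2" <<'EOF'
002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
EOF
cat > "$dir/outfile" <<'EOF'
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
EOF

# If the result is ordered like file2, its third field reproduces file2 exactly.
# cmp -s is silent and reports only through its exit status.
if cut -d '|' -f 3 "$dir/outfile" | cmp -s - "$dir/file2"; then
    verdict="order matches"
else
    verdict="order differs"
fi

echo "$verdict"
rm -rf "$dir"
```

An empty or missing line in the output changes the byte stream seen by `cmp`, so lookups that failed silently show up as "order differs".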
_________________ Fatdog64, Slacko and Puppeee user. Puppy user since 2.13