
Bash: sort

Posted: Sat 29 Sep 2012, 13:12
by zigbert
Let's say file1 looks like this:

Code: Select all

03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3
...and file2 contains the correct order:

Code: Select all

002 /path/Bartist_Title.mp3
001 /path/Artist_Title.mp3
003 /path/Cartist_Title.mp3
Yes, the sort order is correct even though 002 comes before 001.


How do I sort file1 based on file2? ... at the speed of light ;)


Thank you
Sigmund

Posted: Sat 29 Sep 2012, 13:59
by SFR
Interesting problem...
Ok, here's the first attempt.

This one will work only if the numbers in file2 (001, 002 ...) correspond exactly to the line numbers in file1, as shown in your examples.

Code: Select all

#!/bin/bash

# print the lines of file1 in the order given by the first field of file2
for i in $(awk '{print $1}' file2); do
  awk -v n="$i" 'NR==n' file1
done
I don't know about "the speed of light"; it would have to be tested on something larger, I guess. :wink:
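As a sketch, the whole reorder can also be done in a single awk pass (standard awk, no per-line processes), under the same assumption that file2's numbers are file1's line numbers; the sample files here are recreated from the first post:

Code: Select all

```shell
# recreate the sample data in a scratch directory
cd "$(mktemp -d)"
printf '%s\n' '03:50|Artist - Title|001 /path/Artist_Title.mp3' \
              '04:16|Bartist - Title|002 /path/Bartist_Title.mp3' \
              '03:32|Cartist - Title|003 /path/Cartist_Title.mp3' > file1
printf '%s\n' '002 /path/Bartist_Title.mp3' \
              '001 /path/Artist_Title.mp3' \
              '003 /path/Cartist_Title.mp3' > file2

# first pass (file2): remember the wanted line numbers in order;
# second pass (file1): buffer every line; END: print them in file2's order
awk 'NR==FNR { order[FNR]=$1+0; n=FNR; next }
     { line[FNR]=$0 }
     END { for (i=1; i<=n; i++) print line[order[i]] }' file2 file1
```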

Greetings!

sort

Posted: Sat 29 Sep 2012, 15:25
by L18L
# time sort -t '|' -k 2 file1
03:50|Artist - Title|001 /path/Artist_Title.mp3
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real 0m0.030s
user 0m0.007s
sys 0m0.007s
#

# time for i in `awk '{print $1}' file2`; do
> awk 'NR=='$i'' file1
> done
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real 0m0.158s
user 0m0.053s
sys 0m0.013s
#

Posted: Sat 29 Sep 2012, 21:27
by technosaurus
Btw, if you are using awk you can right-justify text like this:

Code: Select all

echo -e "1 hello\n2 world\n256 last" |awk '{printf "%5s %s\n",$1, $2}'
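The same printf mechanism works the other way around: a negative width left-justifies, and a 0 flag zero-pads numbers (a small sketch on made-up data):

Code: Select all

```shell
# negative width left-justifies the field instead of right-justifying it
echo -e "1 hello\n2 world\n256 last" | awk '{printf "%-5s %s\n",$1,$2}'

# the 0 flag pads numbers with leading zeroes
printf '%03d\n' 7   # prints 007
```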
In awk you can use associative arrays: set one up while processing the first file, then use it while processing a second file (the order matters; it parses the first file first). You can also run actions before and/or after all files.

(By associative arrays, I mean that you can name the subscripts freely, e.g. a[filename]=number, b[filename]=time ...)

Here is the template I use for when I forget all the random features:

Code: Select all

#!/bin/awk -f
#FILENAME name of the current input file (a built-in variable, not a field)
#NF number of fields, $NF last field
#NR line number in all files		#FNR line number in current file
#ORS (default is "\n")				#RS  (default is "\n")
#OFS (default is " ")				#FS (default is [ \t]*)
#system(command) run a command		#close(file_or_command) close a file/command
#ARGC, ARGV similar to C, but skips some stuff
#IGNORECASE (default is 0) set to non-0 or use toupper() or tolower()
#ENVIRON array of env vars ex. ENVIRON["SHELL"] (equivalent of $SHELL)
#getline var < file ... close file or command | getline var
#index(haystack, needle) find needle in haystack
#length(string)
#match(string, regexp) returns where the regex starts, or 0
#RLENGTH length of /match/ substring or -1
#RSTART position where the /match/ substring starts, or 0
#split(string, array, fieldsep) split string into an array separated by fieldsep
#printf(format, expression1,...) print format-ted replacing %* with expressions
#%{c,d/i,e,f,g,o,s,x,X,%} char, decimal int, exp notation, float, shortest of 
#	exp/float, octal, string, hex int, capitalized hex int, a '%' character
#sprintf(format, expression1,...) store printf in a variable
#sub(regexp, replacement, target) replace first regex with replacement in target
#gsub(regexp, replacement, target) like sub but replaces all matches in target
#substr(string, start, length) get the substring of string, length chars from start
#print > /dev/stdout, /dev/stderr, /dev/fd/# or a filename
#output can be piped like print $0 | command
#comparisons <,>,<=,>=,==,!=,~,!~,in use && for AND, || for OR, ! for NOT
#	(~ is for regexp and "in" looks for subscript in array)
#/word/{...} like if match(...) {...} equivalent of grep
#(condition) ? if-true-exp : if-false-exp or use if (condition){}
#math +,-,*,/,%,^ (** in gawk),log(x),exp(x),sqrt(x),cos(x),sin(x),atan2(y,x),
#rand(),srand(x); systime(),strftime() in gawk
#
#function name (parameter-list) {
#     body-of-function
#}

BEGIN {
#actions that happen before any files are read in
}
#
{
#actions to do on files
}
#
END {
#actions to do after all files are done
}
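As a minimal worked example of the BEGIN/main/END skeleton above (the data and variable name are made up for illustration):

Code: Select all

```shell
# BEGIN initializes, the main block runs once per record,
# END reports after all input is read: count lines and sum field 1
echo -e "1 a\n2 b\n3 c" | awk '
BEGIN { total = 0 }
{ total += $1 }
END { printf "lines=%d sum=%d\n", NR, total }'
```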


Posted: Sun 30 Sep 2012, 09:55
by zigbert
I am thankful for all tips and input.
There are many ways to solve this, but I am still searching for brilliance :wink:


Sigmund

Posted: Sun 30 Sep 2012, 17:42
by akash_rawal
Here's my attempt:

Code: Select all

#!/bin/bash

#Utility
function endl()
{
	cat
	echo
}

#Our own private directory
tmp="/tmp/sort2"
mkdir -p "$tmp"

#Index file2
i=0
ifsbak="$IFS"
IFS=""
while read line; do
	echo "$i|$line"
	i=$(( $i+1 ))
done < "./file2" > "$tmp/file2_indexed"

#Sort both files alphabetically
sort -t '|' -k 3 -o "$tmp/file1_sorted" "./file1"
sort -t '|' -k 2 -o "$tmp/file2_sorted" "$tmp/file2_indexed"

#Load 'sorted' indices into array
IFS='|'
cut -d '|' -f 1 "$tmp/file2_sorted" | tr '\n' '|' | endl | while read -a indices; do
	IFS=$'\n'
	#Don't know why read -a doesn't work outside the loop
	
	#Attach indices to file1_sorted
	IFS=""
	i=0
	while read line; do
		echo "${indices[$i]}|$line"
		i=$(( $i+1 ))
	done < "$tmp/file1_sorted" > "$tmp/file1_indexed"
	#Sort it by attached index
	sort -t '|' -k 1 -n -o "$tmp/file1_sorted_final" "$tmp/file1_indexed"
	#Final output
	cut -d '|' -f 2- "$tmp/file1_sorted_final"
	break
done
For 100000 lines it takes 15 s.

I believe translating the script (or parts of it) into awk could speed it up, but I'm too lazy to learn awk :oops:

Re: Bash: sort

Posted: Mon 01 Oct 2012, 02:46
by rcrsn51
zigbert wrote:How to sort file1 based on file2 ? ... in the speed of light
That means coding it in C. See attached.

Code: Select all

# time ./zigsort file1 file2
04:16|Bartist - Title|002 /path/Bartist_Title.mp3
03:50|Artist - Title|001 /path/Artist_Title.mp3
03:32|Cartist - Title|003 /path/Cartist_Title.mp3

real	0m0.001s
user	0m0.000s
sys	0m0.000s

Posted: Mon 01 Oct 2012, 03:33
by technosaurus
I _was_ too lazy to learn awk; now I'm too lazy to write 100 lines of shell to do 3 lines of awk.
The first arg is the unsorted file, the second arg is the sorted file.

Code: Select all

#!/bin/awk -f
BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}
in a shell script it would be:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
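A quick way to try it on the sample data from the first post (temporary files under /tmp; the names are made up):

Code: Select all

```shell
# file1 lines contain a "|", so $3 is non-empty and the line gets stored
# under its full third field; file2 lines contain no "|", so $3 is empty
# and $1 (the whole line) is used as the lookup key
printf '%s\n' '03:50|Artist - Title|001 /path/Artist_Title.mp3' \
              '04:16|Bartist - Title|002 /path/Bartist_Title.mp3' \
              '03:32|Cartist - Title|003 /path/Cartist_Title.mp3' > /tmp/unsorted_file
printf '%s\n' '002 /path/Bartist_Title.mp3' \
              '001 /path/Artist_Title.mp3' \
              '003 /path/Cartist_Title.mp3' > /tmp/sorted_file

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' /tmp/unsorted_file /tmp/sorted_file
```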

Posted: Mon 01 Oct 2012, 07:27
by amigo
The suggested 'sort' command seemed to me the best:
"sort -t '|' -k 2 file1"
if it produces the desired result. The OP doesn't state how the order is pre-determined. If the order is completely arbitrary, then one of the other suggestions would be best.
Is the order arbitrary, or is it based on the data in column 2 of file1? Otherwise, how do you *produce* file2?

Posted: Mon 01 Oct 2012, 10:32
by technosaurus
To me it sounded as if the order is based on the order the entries appear in the sorted file and has nothing to do with their contents (the sorted file is simply the last column of the unsorted file in a user-defined order?). AFAICT everything faster than my awk one-liner (with the exception of the compiled C, which is 40 times larger) sorted by the numeric values instead of the order they appear in the file.
I was just solving the problem, not the underlying cause (padded zeroes, the sort key at the end of the line vs. the beginning, arbitrary fields, order and names...).

the time was:

real 0m0.009s
user 0m0.004s
sys 0m0.004s

and the time shouldn't increase significantly with file length, since that is about the same time it takes awk just to run BEGIN{print}

Posted: Mon 01 Oct 2012, 14:09
by rcrsn51
technosaurus wrote:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
Clever. And surprisingly fast. With a test set of 999 records, it was 1/3 the speed of zigsort.

I wonder if there is any memory penalty for building an associative array that big?

Posted: Mon 01 Oct 2012, 16:09
by zigbert
technosaurus wrote:

Code: Select all

awk 'BEGIN{FS="|"}{if($3){d[$3]=$0;}else{print d[$1]}}' unsorted_file sorted_file
Now we're talking :D


Thanks a lot
Sigmund

Posted: Mon 01 Oct 2012, 19:44
by rcrsn51
As another test, I generated a data set of 9999 records.

Code: Select all

# time ./zigsort file1 file2 > file3

real	0m0.031s
user	0m0.008s
sys	0m0.020s

# time ./technosort file1 file2 > file3

real	0m0.039s
user	0m0.036s
sys	0m0.000s
# 
Technosort has caught up. It's holding all its data in memory so it only needs one pass through the files. Zigsort's need to re-read file1 is slowing it down.

Posted: Mon 01 Oct 2012, 20:28
by rcrsn51
But if I modify Zigsort to hold all its data internally, it yields

Code: Select all

# time ./zigsort file1 file2 > file3

real	0m0.011s
user	0m0.004s
sys	0m0.004s

Posted: Tue 02 Oct 2012, 14:03
by jamesbond
My entry. It doesn't assume file1 is already sorted; it matches "012" from file1 exactly with "012" from file2.

Code: Select all

#!/bin/ash

ENTRIES=10000
FILE1=/tmp/file1
FILE2=/tmp/file2
OUTFILE=/tmp/outfile

generate_file1() {
	for a in $(seq 1 $ENTRIES); do
		printf "03:50|Artist - Title|%.3d /path/Artist_Title.mp3\n" $a
	done > $FILE1
}

generate_file2() {
	for a in $(seq 1 $ENTRIES); do
		printf "%.3d /path/Artist_Title.mp3\n" $a
	done | sort -R > $FILE2
}

# generate fake data for testing
generate_file1
generate_file2

time -p -- awk -F"|" '
NR > FNR {
	# sort: the key is the number before the first space in each file2 line
	# (reassigning FS here would only take effect on the NEXT record,
	# so extract the key with sub() instead)
	key=$0
	sub(/ .*/,"",key)
	print file1[key]
	next
}
{
	# scan: index each file1 line by the number at the start of field 3
	line=$0
	sub(/ .*/,"",$3)
	file1[$3]=line
}
' $FILE1 $FILE2 > $OUTFILE