Bash: Count numbers in list

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
Message
Author
User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

Bash: Count numbers in list

#1 Post by zigbert »

Hello!
I am working on a new rating system in pMusic. I want to see how many times the track is played last week or last month.
The idea is to add 'date +%s' to the rating-index each time the track has played. The index file could look like this

Code: Select all

/root/file1.mp3| 1345641102,1345644563,1345647584,...
/root/file2.mp3| 1345623045,1345643786,1345648695,...
Now the question is:
How many times has file2 played since 1345643390?

In other words:
How many numbers in line 2 is above 1345643390?

The list will be long, so we need a fast calculation


Thank you
Sigmund

User avatar
Flash
Official Dog Handler
Posts: 13071
Joined: Wed 04 May 2005, 16:04
Location: Arizona USA

#2 Post by Flash »

So those numbers are time stamps that show each time the file was played? Are they in chronological order?

Couldn't you just find the number of commas in each line? (The reverse of counting sheep by counting their legs and dividing by four. :lol: )

User avatar
L18L
Posts: 3479
Joined: Sat 19 Jun 2010, 18:56
Location: www.eussenheim.de/

Re: Bash: Count numbers in list

#3 Post by L18L »

Code: Select all

#!/bin/ash
#
log="/root/file2.mp3| 1345623045,1345643786,1345648695"
#
#How many numbers in log is below 1345643390? 
#How many numbers in log is below oneTIME
oneTIME=1345643390 # seconds since 1970-1-1
COUNT=0
for VAR in `echo $log | cut -d'|' -f2 | tr ',' '\n'`; do
 [ $(( $VAR - $oneTIME )) -gt 0 ] &&  COUNT=$(( $COUNT + 1 ))
done
echo $COUNT numbers are above $oneTIME
The following should be faster if in chronological order:

Code: Select all

#!/bin/ash
#
log="/root/file2.mp3| 1345623045,1345643786,1345648695"
#
#How many numbers in log is above 1345643390? 
#How many numbers in log is above oneTIME
oneTIME=1345643390 # seconds since 1970-1-1
COUNT=0
for VAR in `echo $log | cut -d'|' -f2 | tr ',' '\n'`; do
 [ $(( $VAR - $oneTIME )) -lt 0 ] &&  COUNT=$(( $COUNT + 1 )) || break
done
echo $COUNT numbers are below $oneTIME
You have to subtract this from the total number of commas. 8)

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#4 Post by seaside »

zigbert,

If you don't care about dates, you might just do:

Code: Select all

grep <file name> music_play_log | wc -l 
Regards,
s

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#5 Post by zigbert »

Thank you guys for your feedback

Yes, the numbers will be in in chronological order.

@seaside
The rating system today shows how many times you have played the track - with no dates. I want more.....

@L18L
This is sure a solution, but it will be slow when rating-index grows.


Sigmund

User avatar
SFR
Posts: 1800
Joined: Wed 26 Oct 2011, 21:52

#6 Post by SFR »

zigbert wrote:[...]it will be slow when rating-index grows.
How about binary search algorithm?
Although it looks pretty complex to implement (at least for me) when using such a type of index (item1, item2, item3...), but it should be easier with arrays.
Here's the snippet I found:
http://www.bashguru.com/2011/01/binarys ... cript.html

Greetings!
[color=red][size=75][O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource[/size][/color]
[b][color=green]Omnia mea mecum porto.[/color][/b]

User avatar
L18L
Posts: 3479
Joined: Sat 19 Jun 2010, 18:56
Location: www.eussenheim.de/

Count numbers in list

#7 Post by L18L »

SFR wrote:How about binary search algorithm?
Although it looks pretty complex to implement (at least for me) when using such a type of index (item1, item2, item3...), but it should be easier with arrays.
Here's the snippet I found:
http://www.bashguru.com/2011/01/binarys ... cript.html
Here is what I made from that snippet:

Code: Select all

#!/bin/sh
# SCRIPT : derived from: binarysearch.sh
# SCRIPT : bsearch.sh
# USAGE  : bsearch.sh
# PURPOSE: Searches given number in a sorted list.
#                         \\\\ ////
#                         \\ - - //
#                            @ @
#                    ---oOOo-( )-oOOo---
#
#####################################################################
#                      Define Functions Here                        #
#####################################################################

binarysearch()
{
    status=-1
    i=1
    array=($(echo "$@"))
    LowIndex=0
    HeighIndex=$((${#array[@]}-1))
    while [ $LowIndex -lt $HeighIndex ]
    do
        MidIndex=$(($LowIndex+($HeighIndex-$LowIndex)/2))
        MidElement=${array[$MidIndex]}
        case $(($MidElement - $SearchedItem)) in
        -*) LowIndex=$(($MidIndex+1)) ;;
         0) status=0
            searches=$i
            return ;; 
        *)  HeighIndex=$(($MidIndex-1)) ;;
        esac
        let i++
    done
}

#####################################################################
#                       Variable Declaration                       #
#####################################################################

clear
echo "Enter Sorted Array Elements : "
read -a ARRAY
count=${#ARRAY[@]}

#####################################################################
#                       Main Script Starts Here                     #
#####################################################################

while [ : ]
do
     echo -n "
     Type in a number or break using CTRL-C: "
     read SearchedItem
     binarysearch "${ARRAY[@]}"
     [ $SearchedItem -gt $MidElement ] \
      && echo "nothing found equal or later than $SearchedItem" \
      || echo "$SearchedItem found after $searches searches, MidIndex=$MidIndex MidElement=$MidElement"

done

akash_rawal
Posts: 229
Joined: Wed 25 Aug 2010, 15:38
Location: ISM Dhanbad, Jharkhand, India

Binary search application

#8 Post by akash_rawal »

And here's it's application in zigbert's problem:

Code: Select all

#!/bin/bash

#usage: played_since filename time

#data stored in database.txt

file="$1";
time="$2";

ifs_bak="$IFS"
IFS=""

while read -d "|" record; do
	#check for file
	if test "$record" = "$file"; then
		IFS=","
		#Get all timestamps in array
		read -a times
		#Binary search
		i=$(( ${#times[*]}/2 ))
		d=$(( $i/2 ))
		test "$d" -eq "0" && d=1 #precaution
		ulim=$(( ${#times[*]}-1 ))
		while test "$i" -ge "0" -a "$i" -lt "$ulim"; do #check whether index is in range
			if test "$time" -lt "${times[$i]}"; then
				i=$(( $i-$d ))
			elif test "$time" -gt "${times[$(( $i+1 ))]}"; then
				i=$(( $i+$d ))
			else
				#found, tell no. of records greater than this
				echo "$(( $ulim-$i ))"
				exit
			fi
			d=$(( ($d/2) ))
			test "$d" -eq "0" && d=1 #precaution
		done
		#In case of 'out of range'
		if test "$ulim" -lt "0"; then
			echo "0"
		else
			if test "$time" -lt "$times"; then
				echo "${#times[*]}"
			elif test "$time" -gt "${times[$ulim]}"; then
				echo "0"
			else
				echo "bug" #This should not be reached
			fi
		fi
		exit
	fi
	read trash
done < database.txt

IFS="$ifs_bak"

echo "File not found"

I have found that overhead of reading data from file into array is much larger than time required for binary search, we have to do something there.

Plus, if the lists are going to be really large it would be performance advantage to store timestamps of each track in a separate file.

I am using a c++ program to generate about 87 MB test file to search in.

Code: Select all

//g++ -Wall -o data_gen source.cpp

#include <iostream>
#include <fstream>
#include <cstdlib>

using namespace std;

int main()
{
	ofstream strm("database.txt", ios::out);
	
	int i, j;
	for (i = 0; i < 100; i++)
	{
		unsigned long rand_time = 0;
		strm << "file" << i << ".mp3|";
		for (j = 0; j < 100000; j++)
		{
			rand_time += (rand() % 1000);
			strm << rand_time << ",";
		}
		rand_time += (rand() % 1000);
		strm << rand_time << "\n";
	}
	
	strm.close();
	
	return 0;
}
Time taken for last file roughly equals time to generate the database.

Code: Select all

# time ./data_gen 

real	0m6.873s
user	0m3.936s
sys	0m0.257s
# time ./played_since file0.mp3 48000000
3881

real	0m0.269s
user	0m0.173s
sys	0m0.027s
# time ./played_since file99.mp3 48000000
3672

real	0m6.849s
user	0m5.696s
sys	0m0.677s
# 

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#9 Post by zigbert »

Thank you guys
Pretty amazing you want to spend so much time helping out!!!!!

I made a test.
1.) Checking my existing rating index - there are about 2000 tracks in the index.
2.) Make a fake index based on the new structure including 2000 tracks and 25 timestamps for each track. All together 574kb.
3.) Run the played_since script to get an assumption of how fast it can build an updated list of favorite tracks.

Code: Select all

# cut -d'|' -f1 ./database.txt > /tmp/played_since; time while read I; do ./played_since.sh $I 1345643390; done < /tmp/played_since
...
...
...
24
24
24
24
24
24
24
24
24
24

real	0m30.799s
user	0m21.339s
sys	0m5.560s
# 
I have a modern system, and still it takes half a minute to build a rating list. Is this an understanding why ratings in the big music-apps uses stars for rating instead of a exact status.

But still, it would be awesome to define a given time (last week, januar 2011, or whatever) and then get My favorite tracks in that period.


Sigmund

akash_rawal
Posts: 229
Joined: Wed 25 Aug 2010, 15:38
Location: ISM Dhanbad, Jharkhand, India

#10 Post by akash_rawal »

If you want to process all files then why not go this way:

Code: Select all

#!/bin/bash

#usage: played_since_foreach time

time="$1";

ifs_bak="$IFS"
IFS=""

while read -d "|" record; do
	IFS=","
	#Get all timestamps in array
	read -a times
	#Binary search
	i=$(( ${#times[*]}/2 ))
	d=$(( $i/2 ))
	test "$d" -eq "0" && d=1 #precaution
	ulim=$(( ${#times[*]}-1 ))
	count=-1
	while test "$i" -ge "0" -a "$i" -lt "$ulim"; do #check whether index is in range
		if test "$time" -lt "${times[$i]}"; then
			i=$(( $i-$d ))
		elif test "$time" -gt "${times[$(( $i+1 ))]}"; then
			i=$(( $i+$d ))
		else
			#found, tell no. of records greater than this
			count="$(( $ulim-$i ))"
			break
		fi
		d=$(( ($d/2) ))
		test "$d" -eq "0" && d=1 #precaution
	done
	#In case of 'out of range'
	if test "$count" -lt "0"; then
		if test "$ulim" -lt "0"; then
			count=0
		else
			if test "$time" -lt "$times"; then
				count="${#times[*]}"
			elif test "$time" -gt "${times[$ulim]}"; then
				count=0
			else
				count="bug" #This should not be reached
			fi
		fi
	fi
	#Show result for this record
	echo "$record: $count"
done < database.txt

IFS="$ifs_bak"
This takes about 20 seconds to process my 84 mb test file.
zigbert wrote: 2.) Make a fake index based on the new structure including 2000 tracks and 25 timestamps for each track. All together 574kb.
I think I overestimated the size :lol:, such a file takes about 2s on my system using the above script. But I think it could be even faster.

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#11 Post by zigbert »

1.1 sec here :)
But I think it could be even faster.
Would be good for those with older hardware. - And those are many in the Puppy world.


Thanks a lot (so far)
Sigmund

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#12 Post by zigbert »

akash_rawal
I don't think 5 sec is problematic if user request building a specific rating list....

The speed issue arrives when pMusic wants to give a general rating list of the most popular songs in the Overview mode. A workaround could be to always put the last played track at the bottom of the rating list. When later requesting a quick rating-list we could just 'tail -n 200 > database.txt' before executing played_since_foreach.

That means your code will somehow do the job as is, but if speed improvements is possible, it would sure help.


Sigmund

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

#13 Post by technosaurus »

something like:

Code: Select all

LOGENTRY=`grep $filename log.txt`
#grep is faster than while read case on large files
STAMPS=${LOGENTRY#*|}
#you could use bash arrays instead - set allows "arrays" in sh, ash, etc...
set ${STAMPS//,/ }
#for speed change IFS to "," vs. the //,/ } but don't forget to reset it
while ([ $1 -lt $timestamp ]) do
shift #a hacky way to remove all entries < timestamp
done
# $# is the number of args passed or in this case set with "set ..."
echo The number of times played since $timestamp is $#
haven't tested for accuracy, but something along these lines should be pretty fast... even faster using busybox ash, based on past experience
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

User avatar
L18L
Posts: 3479
Joined: Sat 19 Jun 2010, 18:56
Location: www.eussenheim.de/

Bash: Count numbers in list

#14 Post by L18L »

technosaurus wrote:...haven't tested for accuracy, but ...
Hope this will suffice for accuracy...using zigbert´s original database
# cat database.txt
/root/file1.mp3| 1345641102,1345644563,1345647584
/root/file2.mp3| 1345623045,1345643786,1345648695
#
# ./file_times_played_since 1345644563
The number of times played /root/file1.mp3 since 1345644563 is 2
The number of times played /root/file2.mp3 since 1345644563 is 1
#
# ./file_times_played_since 1345648694
ash: 1345648694: unknown operand
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#

file_times_played_since:

Code: Select all

#!/bin/ash
timestamp=$1

ifs_bak=$IFS
IFS=","

while read LOGENTRY; do

filename=${LOGENTRY%%|*}
#LOGENTRY=`grep $filename database.txt`
#grep is faster than while read case on large files

STAMPS=${LOGENTRY#*|}
#you could use bash arrays instead - set allows "arrays" in sh, ash, etc...
#set ${STAMPS//,/ }
set ${STAMPS}
#for speed change IFS to "," vs. the //,/ } but don't forget to reset it
while ([ $1 -lt $timestamp ]) do
shift #a hacky way to remove all entries < timestamp
done
# $# is the number of args passed or in this case set with "set ..."
echo The number of times played $filename since $timestamp is $#

done < database.txt

IFS=$ifs_bak
:)
The only issue I have found is when result is zero:
ash: 1345648694: unknown operand

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

#15 Post by technosaurus »

Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#16 Post by zigbert »

This sounds interesting.
In am on the run this weekend, so I can't compare executing-time until Monday.


Sigmund

User avatar
zigbert
Posts: 6621
Joined: Wed 29 Mar 2006, 18:13
Location: Valåmoen, Norway
Contact:

#17 Post by zigbert »

technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means???? :oops:

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#18 Post by jamesbond »

My submission:

Code: Select all

#!/bin/ash
LANG=C

FIND=3421  # timestamp to find (random)
TIMES=3650 # 3650 timestamps (tracks played everyday for 10 years) - number from 1 to $TIMES
SONGS=2000 # 2000 songs

# generate fake timestamp data
tstamps=$(seq -s, 1 $TIMES)
line="/path/to/song.mp3 | $tstamps"
#echo $line

# generate fake playlist (repeat timestamp data $SONGS times)
for a in $(seq 1 $SONGS); do
	echo $line
done | 

# do actual work
awk -F, -v TS=$FIND '{
		split($1, a, "|"); $1=a[2] # fix first entry
		max=NF; min=1; 
		
		while (max-min > 1) {
			i=int( (max+min)/2 )
			if (TS >= $i) min = i
			else max = i
		}
		print a[1], min
}' 
With TIMES=3650 and SONGS=2000 as above (meaning, 3650 timestamps per song, with 2000 songs), result:

Code: Select all

# time ./fav.sh  > /dev/null

real	0m1.641s
user	0m2.916s
sys	0m0.073s
.

Using Ziggy's original parameter of 25 timestamps per song and 2000 songs (TIMES=25, SONGS=2000), result

Code: Select all

# time ./fav.sh  > /dev/null

real	0m0.068s
user	0m0.080s
sys	0m0.010s
Using 25 timestamps for 20,000 songs (TIMES=25, SONGS=20000), result:

Code: Select all

# time ./fav.sh  > /dev/null

real	0m0.479s
user	0m0.730s
sys	0m0.057s
Using 3650 timestamps for 20,000 songs (TIMES=3650, SONGS=20000), result:

Code: Select all

# time ./fav.sh  > /dev/null

real	0m16.855s
user	0m30.308s
sys	0m0.520s
Of course these are on my machine and is not directly comparable with anyone else's number. Ziggy need to run this on his own machine to test :)

cheers!
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

User avatar
L18L
Posts: 3479
Joined: Sat 19 Jun 2010, 18:56
Location: www.eussenheim.de/

Count numbers in list

#19 Post by L18L »

zigbert wrote:
technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means???? :oops:
change

Code: Select all

while ([ $1 -lt $timestamp ]) do 
to

Code: Select all

while ( [ $1 ] && [ $1 -lt $timestamp ]) do
tested successfully (no more: ash: 1345623045: unknown operand ):
# ./file_times_played_since 1345648694
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#

User avatar
technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO
Contact:

Re: Count numbers in list

#20 Post by technosaurus »

@jamesbond - over 100 lines grep becomes faster than using builtins, and it doesn't slow it down significantly for less than 100 lines, so its probably worth calling grep in this case since we can't assume a user has/plays only a few songs.
L18L wrote:
zigbert wrote:
technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means???? :oops:
change

Code: Select all

while ([ $1 -lt $timestamp ]) do 
to

Code: Select all

while ( [ $1 ] && [ $1 -lt $timestamp ]) do
tested successfully (no more: ash: 1345623045: unknown operand ):
# ./file_times_played_since 1345648694
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#
thanks, I was/am posting from my phone.
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].

Post Reply