Binary Operations in Bash and Package Manager Applications

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
Author
s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

Binary Operations in Bash and Package Manager Applications

#1 Post by s243a »

Intro

This thread should mostly focus on binary operations in Bash, but it was inspired by my thinking about how one could use more concurrency in sc0ttman's package manager (i.e. pkg) to optimize it. Regarding the package manager, this may be more an academic exercise than of practical importance, but I don't know. In my fork of pkg I plan to put my concurrency experiments in a separate file from the one that deviates less from sc0ttman's work.

What I'm thinking

Anyway, why am I thinking about binary operations? I was thinking about an efficient way of determining which packages are a priority to install/download and what we need to do to them. sc0ttman uses a simple for loop to do the recursion when listing dependencies, and this works well because determining all the dependencies shouldn't need many levels of recursion.

Code:

      # recursive search deps of deps
      for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      do
        deps_list_file="${TMPDIR}all_deps_${i}"
https://gitlab.com/sc0ttj/Pkg/blob/mast ... /pkg#L6042

Now, since this loop keeps track of the recursion depth, we can use this information to decide which package to download and/or install first. We can create a directory of all the packages to process (e.g. download or install), and in this directory put a file per package, prefixed by the recursion depth; the file would contain a set of status flags. Each bit denotes some status (e.g. things we have done or need to do). E.g.:

Bit #1 - 1=download, 0=no download
Bit #2 - 1=install, 0=no install
Bit #3 - 1=downloading, 0=not downloading
etc.

Using this type of system we can check multiple flags at once in an if statement without chaining operations like "and" (&& or -a) and "or" (|| or -o). There are also fewer bytes to read from the file.
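As a sketch of the idea (the flag names and bit layout here are my own invention, not taken from pkg), a single mask test can replace a chain of && checks:

```shell
#!/bin/bash
# Hypothetical status flags, one bit each (names invented for illustration)
FLAG_DOWNLOAD=1      # bit 1: needs download
FLAG_INSTALL=2       # bit 2: needs install
FLAG_DOWNLOADING=4   # bit 3: download in progress

status=$(( FLAG_DOWNLOAD | FLAG_INSTALL ))   # this package needs both

# "All of these flags set?" -- one mask test instead of chained && / -a
mask=$(( FLAG_DOWNLOAD | FLAG_INSTALL ))
if (( (status & mask) == mask )); then
  echo "needs download and install"
fi

# "Any of these flags set?" -- a nonzero result means at least one bit matched
if (( status & (FLAG_DOWNLOAD | FLAG_DOWNLOADING) )); then
  echo "download pending or in progress"
fi
```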

Note that these files will be stored in RAM, so their access times will be fast, but if we want something faster still we could simulate the same file-manipulation semantics using associative arrays or an in-memory database (a memory-mapped file?).
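For instance, the per-package flag files could be mirrored in a Bash 4 associative array keyed by package name; this is only a sketch of the idea, not anything pkg currently does:

```shell
#!/bin/bash
# Requires bash 4+ for associative arrays
declare -A pkg_status

FLAG_DOWNLOAD=1   # hypothetical flag values, as above
FLAG_INSTALL=2

# Set flags with | , test with & , clear with & ~flag
pkg_status[gcc]=$(( FLAG_DOWNLOAD | FLAG_INSTALL ))
pkg_status[make]=$FLAG_INSTALL

for pkg in "${!pkg_status[@]}"; do
  (( pkg_status[$pkg] & FLAG_DOWNLOAD )) && echo "$pkg: queue for download"
done

# Once gcc is fetched, clear its download bit
pkg_status[gcc]=$(( pkg_status[gcc] & ~FLAG_DOWNLOAD ))
```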

Okay, how do binary operations work in bash?

Okay, keep in mind that this is only a brainstorming thread for me but here are some things that I dug up:

Here is an example of a bitwise or operation in bash:

Code:

$ echo "$(( 0x123456 | 0x876543 ))"
9925975
https://stackoverflow.com/a/40668584

If you want to read a binary file into string form you can use hexdump. To read only the first X bytes:

Code:

head -cX binary.file | hexdump -v -e '/1 "%u\n"' | while read c; do
  echo $c
done
https://unix.stackexchange.com/a/469239

You can also stick with standard shell utilities: use head to extract a given number of bytes, and od to convert a byte into a number.

Code:

export LC_ALL=C    # make sure we aren't in a multibyte locale
n=$(head -c 1 | od -An -t u1)
string=$(head -c $n)
but you have to be careful when doing this because the $() operator strips trailing newline characters (see post).
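The stripping is easy to demonstrate; a common workaround is to append a sentinel character inside the substitution and strip it afterwards:

```shell
# $() strips ALL trailing newlines, not just the last one
s=$(printf 'abc\n\n\n')
echo "${#s}"    # 3 -- the three newlines are gone

# Workaround: print a sentinel 'x' after the data, then strip it
s=$(printf 'abc\n\n\n'; printf 'x')
s=${s%x}
echo "${#s}"    # 6 -- "abc" plus the three newlines survive
```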


Another method is to use the dd command:
...
dd reads any and all data... It certainly won't baulk at zero as a "length"... but if you have \x00 anywhere in your data, you will need to be creative in how you handle it; dd has no problems with it, but your shell script will (it depends on what you want to do with the data)... The following basically outputs each "data string" to a file, with a line divider between each string...
...

Code:

# excerpt from a loop in the linked answer ('break' exits that loop)
((count=1)) # count of bytes to read
dd if=binfile bs=1 skip=$skip count=$count of=datalen 2>/dev/null
(( $(<datalen wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
strlen=$((0x$(<datalen xxd -ps)))  # xxd is shipped as part of the 'vim-common' package

https://unix.stackexchange.com/a/10966

This is probably the best method if you don't need the result in human-readable form, and it has the added plus that you can skip over bytes of data. The ability to skip bytes might allow for fast lookups. Dare I say a database written in Bash! :o (now there's an academic exercise).

However, the dd syntax is a bit esoteric and it doesn't give the answer in human-readable form. As a consequence one might prefer hexdump.
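As a toy illustration of the skip= idea (nothing like this exists in pkg; the file layout is invented here): if records are fixed-width, dd can jump straight to the Nth record without reading the rest of the file.

```shell
#!/bin/sh
# Toy fixed-width "database": three records of exactly 16 bytes each
recfile=$(mktemp)
printf '%-16s%-16s%-16s' alpha beta gamma > "$recfile"

# Fetch record N in O(1): dd seeks past N records and reads one
get_record() {
  dd if="$recfile" bs=16 skip="$1" count=1 2>/dev/null
}

get_record 1    # prints "beta" (padded with spaces to 16 bytes)
rm -f "$recfile"
```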

Final Thoughts

As noted above, this may all be an academic exercise and a way of learning about binary operations in Bash. However, if one can make significant performance gains by using binary operations, then it can not only speed up installation times but also speed up build systems. Typically the list of installed packages isn't large, so creating a file in RAM to represent each installed package isn't a huge cost in either RAM or I/O usage. However, the biggest bottleneck -- aside from potentially download times -- is processing the package databases.

Linux distributions typically use text files in their package management systems because they are easy to troubleshoot and the quantity of data isn't that large. One might be able to convert these into a database by copying them to RAM (i.e. the /tmp folder) and then indexing each entry. This would be a very fast ad hoc database. A binary search could be used to find the starting byte of a given record in the package database, and the dd command would then seek to the appropriate point in the text file.
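A minimal sketch of that idea (the file names and "name offset length" index format are invented here; grep stands in for the binary search over the sorted index):

```shell
#!/bin/sh
# Sketch: index a package db by byte offset, then seek into it with dd
db=$(mktemp); idx=$(mktemp)
printf 'abiword|2.9\nbash|4.4\nzlib|1.2\n' > "$db"

# Build an index of "name offset length" for each record
awk -F'|' '{ print $1, off, length($0); off += length($0) + 1 }' off=0 "$db" > "$idx"

# Look up one record: find its offset in the index, then dd straight to it.
# grep is a stand-in; a real implementation would binary-search the sorted
# index (e.g. with look(1), or by halving a byte range with dd itself).
lookup() {
  entry=$(grep "^$1 " "$idx") || return 1
  set -- $entry
  dd if="$db" bs=1 skip="$2" count="$3" 2>/dev/null
}

lookup bash    # prints the full record: bash|4.4
rm -f "$db" "$idx"
```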

Yes, Bash might not be the best language for this, but the ideas could be ported to any language. The advantage of using Bash is that it is almost always available in Linux, even in very small systems, and much of the existing package management tooling is already written in Bash.
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].

disciple
Posts: 6984
Joined: Sun 21 May 2006, 01:46
Location: Auckland, New Zealand

#2 Post by disciple »

I think after reading all that, your main suggestion is that package managers could improve performance by using some kind of binary database format rather than normal text files. Correct?
The advantage of using bash is that it is almost always available in linux
But normally when you write a package management system it is to run in a specific environment, where you can ensure any dependency is present. It doesn't need to run almost anywhere. If you're that worried about performance I don't see why you would restrict yourself to bash - it probably just makes things harder than they need to be.
Do you know a good gtkdialog program? Please post a link here

Classic Puppy quotes

ROOT FOREVER
GTK2 FOREVER

jafadmin
Posts: 1249
Joined: Thu 19 Mar 2009, 15:10

#3 Post by jafadmin »

Demonstration of how to read individual bits and change them in bash
"bitdemo":

Code:

#! /bin/bash
:<<'[#####]'

	script requires a number as its only argument
	i.e.: bitdemo 255

	just extend the logic for the number of bits needed.

[#####]

BIT_0=1
BIT_1=2
BIT_2=4
BIT_3=8
BIT_4=16
BIT_5=32
BIT_6=64
BIT_7=128

MyVal="$1" # Number entered on the command line

((($MyVal&$BIT_0)==0)) && echo "bit 0 is 0" || echo "bit 0 is 1"
((($MyVal&$BIT_1)==0)) && echo "bit 1 is 0" || echo "bit 1 is 1"
((($MyVal&$BIT_2)==0)) && echo "bit 2 is 0" || echo "bit 2 is 1"
((($MyVal&$BIT_3)==0)) && echo "bit 3 is 0" || echo "bit 3 is 1"
((($MyVal&$BIT_4)==0)) && echo "bit 4 is 0" || echo "bit 4 is 1"
((($MyVal&$BIT_5)==0)) && echo "bit 5 is 0" || echo "bit 5 is 1"
((($MyVal&$BIT_6)==0)) && echo "bit 6 is 0" || echo "bit 6 is 1"
((($MyVal&$BIT_7)==0)) && echo "bit 7 is 0" || echo "bit 7 is 1"

echo
echo "to toggle bit 3 we just .."

if (($MyVal&$BIT_3)) # if bit three is set to 1
 then 
   MyVal="$(($MyVal^$BIT_3))" # flip the bit to 0
fi

echo "$MyVal" && echo
Last edited by jafadmin on Wed 09 Oct 2019, 03:02, edited 4 times in total.

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#4 Post by s243a »

disciple wrote:I think after reading all that, your main suggestion is that package managers could improve performance by using some kind of binary database format rather than normal text files. Correct?
Using a "binary database" [1] would improve a package manager's performance. Such an improvement is most relevant to a build system. However, this may be less ideal for a daily-use system, since it is harder to troubleshoot and has more dependencies.
The advantage of using bash is that it is almost always available in linux
But normally when you write a package management system it is to run in a specific environment, where you can ensure any dependency is present. It doesn't need to run almost anywhere. If you're that worried about performance I don't see why you would restrict yourself to bash - it probably just makes things harder than they need to be.
This might be the case for a specific environment, depending on the design requirements. For instance, if one design requirement is a minimal system, then it might not be ideal to pre-install a database. However, I think that SliTaz might pre-install a database, so it isn't a roadblock for a minimal system, depending on what one means by minimal.

On the flip side, when one builds a package management system one may want it to work on many systems. Python might be useful here because Python can be compiled, but one also has the option of running the code as scripts instead, to avoid having to compile separately for separate architectures.

Anyway, if one first writes code in Bash then it should be easy to port to Python, both because Python lets you call out to Bash directly and because Python has more built-in language features than Bash.

P.S. Part of my inquiry here was to learn more about binary operations in Bash, which I would like to do regardless of whether such approaches are a good or bad idea for a package management system.

notes
----------
1 - I'm not strictly referring to a binary database here. The actual content can be in ASCII or Unicode format. What matters for fast lookup is that we can index it so that one can quickly do binary searches.
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].

jafadmin
Posts: 1249
Joined: Thu 19 Mar 2009, 15:10

#5 Post by jafadmin »

I modified my code above to make it more clear and to demonstrate how to not only read a bit, but change it as well.

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#6 Post by sc0ttman »

Here is a nice way to run SQL queries on CSV: https://github.com/harelba/q

I haven't played with it yet ... But with a bit of pre-processing of the existing repo files, it should be possible to use q to query repo details, deps, etc ..

...There are similar tools for SQLite itself, and that would be very fast ... Someone would just need to convert the files in ~/.packages/ to a .sqlite file ..

If you want the fastest possible parsing of large data files, then a Golang based CSV querying thing would be best - easy to compile statically for 32 and 64 bit as well ... But large binary file sizes (like 2mb or 3mb ... but not really an issue for me tbh)..

....

s243a, can you please isolate the speed fixes you did for Pkg, and post them to me, separate from your other stuff? ...the specific changes you made to the loop/bg process stuff are lost in all these other posts... can't separate it out..

(My biggest annoyance with Pkg is it's slow dep resolution .. I'm sure if a few brains got together to clean up the dep resolution functions in Pkg, it could be made nice and fast .. without needing a new package/repo format, or bash arrays, or a database, or any extra tools/deps)
[b][url=https://bit.ly/2KjtxoD]Pkg[/url], [url=https://bit.ly/2U6dzxV]mdsh[/url], [url=https://bit.ly/2G49OE8]Woofy[/url], [url=http://goo.gl/bzBU1]Akita[/url], [url=http://goo.gl/SO5ug]VLC-GTK[/url], [url=https://tiny.cc/c2hnfz]Search[/url][/b]

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#7 Post by s243a »

sc0ttman wrote:Here is a nice way to run SQL queries on CSV: https://github.com/harelba/q

I haven't played with it yet ... But with a bit of pre-processing of the existing repo files, it should be possible to use q to query repo details, deps, etc ..

...There are similar tools for SQLite itself, and that would be very fast ... Someone would just need to convert the files in ~/.packages/ to a .sqlite file ..
Interesting. BTW, my preference would be to use SQLite since I want to learn more about it. The reason I want to learn more about SQLite is that it is used in FMS, which is a Usenet-like application for Freenet. That all said, as we both note, a database isn't strictly required for a package manager.

If you want the fastest possible parsing of large data files, then a Golang based CSV querying thing would be best - easy to compile statically for 32 and 64 bit as well ... But large binary file sizes (like 2mb or 3mb ... but not really an issue for me tbh)..
That's one thing I wonder. Given the file sizes we're talking about, perhaps the speed advantages of using a database are marginal. I do hear, though, that Golang is very fast for some applications (e.g. writing servers). I will say that being able to statically compile Golang programs is a big plus. :)
s243a, can you please isolate the speed fixes you did for Pkg, and post them to me, separate for your other stuff? ...the specific changes you made to the loop/bg process stuff are lost in all these other posts.... cant separate it out..
Okay, I will look into what I have done in this regards, and post it on gitlab. As far as what I've actually implemented though I've only scratched the surface with what can be done with concurrency.
(My biggest annoyance with Pkg is it's slow dep resolution .. I'm sure if a few brains got together to clean up the dep resolution functions in Pkg, it could be made nice and fast .. without needing a new package/repo format, or bash arrays, or a database, or any extra tools/deps)
We could do some speed comparisons between my latest fork and the official version of pkg. Note that I'm using bash instead of ash for easier use of arrays; I'm not sure how well ash supports associative arrays, and it doesn't support process substitution. I apologize for the scattered posts, but that is because threads like this are for brainstorming rather than a roadmap or a changelog.

My latest fork of pkg is in my latest version of tiny_puduan_ascii-PreAlpha11.5.iso (see post, issues, remastered version with browser).

The sourcecode for my latest fork of pkg can be found at:
https://github.com/s243a/woof-CE/blob/w ... r/sbin/pkg

I suggest that we copy this version to a "bash" branch on GitLab, because up to this point I haven't deviated much from pkg, but I plan to make radical changes to the concurrency model. I think I will create two separate files: this version, which I might call something like pkg-classic, and a more radical concurrent version that I will call pkg-concurrent.

The concurrent version will use more concurrency, which should improve speed, but with the disadvantage of being more complex and using more memory. A symlink will point to whichever of these two files is best for most users.
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].
