Unpredictable results from standard utilities

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
disciple
Posts: 6984
Joined: Sun 21 May 2006, 01:46
Location: Auckland, New Zealand

Unpredictable results from standard utilities

#1 Post by disciple »

Both busybox and coreutils utilities give different results depending on the end-of-line format of an input file, i.e. CR (Mac standard), LF (Unix standard) or CRLF (Windows standard). (Note: apparently you will also find files in the wild that use a mixture!)
See below. Note that some of the commands (file, diff) are used to compare the files, and others (cat, sort) to compare the results, because I was expecting the same results from each file.
Should this unpredictability be considered a bug? Or are you expected to always check the format of a file before doing anything to it? I.e. should you make sure files are always in LF format if you are using any standard *nix utility in any context (except perhaps OS X - I haven't checked)?

Code:

$  echo "S2
> S">lf
$  cp lf crlf
$  unix2dos crlf
unix2dos: converting file crlf to DOS format ...
$  cp lf cr
$  unix2mac cr
unix2mac: converting file cr to Mac format ...
$  file lf
lf: ASCII text
$  file crlf
crlf: ASCII text, with CRLF line terminators
$  file cr
cr: ASCII text, with CR line terminators
$  diff lf cr
1,2c1
< S2
< S
---
S S2
\ No newline at end of file
$  diff lf crlf
1,2c1,2
< S2
< S
---
> S2
> S
$  sort -V <lf
S
S2
$  sort -V <crlf
S2
S
$  sort -V <cr
S2
$  cat lf
S2
S
$  cat crlf
S2
S
$  cat cr
$  busybox cat cr
$  busybox cat crlf
S2
S
$  busybox cat lf
S2
S
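A quick way to see what differs between the three files is to look at the bytes rather than at what the terminal shows. The sketch below is an illustration, not part of the original session: it rebuilds the same three files with printf (so unix2dos and unix2mac are not needed), and `count_byte` and the scratch directory are names made up for the example.

```shell
# Work in a scratch directory so nothing in the current one is overwritten.
tmp=$(mktemp -d)

# Same two lines, three line-ending conventions, written byte for byte.
printf 'S2\nS\n'     > "$tmp/lf"
printf 'S2\r\nS\r\n' > "$tmp/crlf"
printf 'S2\rS\r'     > "$tmp/cr"

# count_byte FILE ESCAPE: how many times the byte (e.g. '\n') occurs in FILE.
count_byte() { tr -cd "$2" < "$1" | wc -c | tr -d ' '; }

for f in lf crlf cr; do
    printf '%-4s LF=%s CR=%s\n' "$f" \
        "$(count_byte "$tmp/$f" '\n')" "$(count_byte "$tmp/$f" '\r')"
done

# od -c prints every byte, so \r and \n show up as text.
od -c "$tmp/cr"
```

The cr file contains no LF byte at all, so any tool that splits its input on LF sees it as a single line with no terminator, which fits diff's "No newline at end of file" remark above.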
Last edited by disciple on Tue 16 Apr 2019, 08:21, edited 1 time in total.
Do you know a good gtkdialog program? Please post a link here

Classic Puppy quotes

ROOT FOREVER
GTK2 FOREVER

nosystemdthanks
Posts: 703
Joined: Thu 03 May 2018, 16:13

Re: Unpredictable results from standard utilities

#2 Post by nosystemdthanks »

fig handles this by converting all crlf to lf and then all cr to lf:

cat inputtext | python -c "from sys import stdin, stdout ; p = stdin.read() ; stdout.write(p.replace('\r\n','\n').replace('\r','\n')) ; stdout.flush()" # public domain

i wanted to do this with sed and tr:

cat inputtext | sed "s/\r\n/\n/g" | tr '\r' '\n'

tr only maps single characters, so it cannot collapse the two-character \r\n into \n, and sed splits its input on newlines before it applies the expression, so the pattern never sees the \n.

note that the "| tr '\r' '\n'" part covers text from very old macs, modern macs use \n like puppy does.

perl is up to this task, but i refuse. i would be interested in a python-free solution just because python is not always available.
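one python-free possibility, offered as a sketch rather than a definitive answer. it assumes the same policy as the python one-liner (every CRLF and every bare CR counts as a line ending), and `normalize_eol` is a made-up name. the trick is that sed has already split on LF, so a CR at the end of a sed line is the CR of a CRLF pair; whatever CRs remain after that are bare ones for tr to translate.

```shell
# A literal CR byte, so the sed expression does not depend on "\r" support.
cr=$(printf '\r')

# normalize_eol: read stdin, write stdout with LF line endings.
#   sed  removes a CR that sits directly before an LF (or at end of input)
#   tr   turns every remaining bare CR into LF
#   awk  reprints each line, which guarantees a final LF
normalize_eol() {
    sed "s/$cr\$//" | tr '\r' '\n' | awk '{ print }'
}

printf 'one\r\ntwo\rthree\n' | normalize_eol
```

usage would be something like `normalize_eol < inputtext > outputtext`. like the python version it cannot tell a CR that was meant as data from one that was meant as a line ending.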
The freedom to NOT run the software, to be free to avoid vendor lock-in through appropriate modularization/encapsulation and minimized dependencies; meaning any free software can be replaced with a user’s preferred alternatives.

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#3 Post by 6502coder »

The UNIX/Linux standard is to use LF as the end-of-line marker in a text file. Every Linux user should know this. If a text file doesn't obey the standard, all bets are off -- that is obvious. This is not "unpredictable" and it is certainly not a bug.

You can't expect text-handling utilities to automatically handle the 3 common EOL markers, for the simple reason that there is not, and cannot possibly be, any universal agreement on what "handle" means.

For example, it is perfectly possible that I might want to have Linux text files that contain CR chars, for some special reason. (As in fact I have, in situations involving partially encrypted text.) If one of those CR chars happens to end up preceding a LF char, I DO NOT want that CR char to be summarily discarded, just because some other OS happens to use CRLF as its EOL marker.

Since UNIX/Linux utilities cannot possibly anticipate all the ways in which CR might be used in a text file, they cannot unilaterally set policy on what to do with the sequence CRLF, other than to accept the CR verbatim.

So yes, if you routinely deal with MS-DOS text files and/or Mac text files in a Linux environment, the onus is on you to deal with it in the manner appropriate to your particular usage of those files. For many people that may mean interpreting all isolated CRs and CRLFs as EOL markers in Linux. But you cannot assume that to be a satisfactory policy for everyone.

</old-man-rant>

#4 Post by nosystemdthanks »

i should add that busybox and bash will probably never be totally compatible anyway, and my preference is to use puppies that include bash or at least python.

recent gripe: a version of which that litters into the tty when the file is not found, when the standard for years was to be silent by default. that was a prime candidate for -v, right? hateful change of default that breaks so many things. (may not affect puppy users, this is going on elsewhere.)

we interpreted the op differently (i admit i could have paid closer attention.)

you (seemingly) took it as unreasonable whinging, while i simply took it as "what should i do to fix this?"

youre right of course, what you said is valid and technically sound.

my proposal on what to do is fairly simple and probably not the best way for several things-- in the context where its used, a file needs to be processed and translated, so the output is fairly certain to differ from input already.

the onus does fall onto the user, though the first thing the user will probably want to know is what utilities can help them. leafpad or nano (probably geany) will work, on the command line a python one-liner can generally achieve what you want.

you may notice that when someone talks about cr and lf i already presume theyre familiar with the command line. a fallacy, but not a huge leap. on this forum, a few people probably learn about cr and lf before the command line though.

Re: Unpredictable results from standard utilities

#5 Post by 6502coder »

@nosystemdthanks

Well, I took the man at his word when he asked:
disciple wrote: Should this unpredictability be considered a bug? Or are you expected to always check the format of a file before doing anything to it? i.e. should you make sure files are always in LF format if you are using any standard *nix utility in any context...?
I may have jumped on him a little too hard but I did concede as much when I signed off with "old-man-rant".

As for you, NST, you are of course quite right that simple ways to deal with the problem exist. If you felt caught in the crossfire, that was unintended on my part.

Re: Unpredictable results from standard utilities

#6 Post by nosystemdthanks »

oh no problem at all, i rant all the time about software this and software that, i know there will be times when someone feels its actually directed at them when all they did was make me think of some weird software thing.

the developer/user dichotomy leads to all sorts of weird ideas.

sympathy for the user is a good basis for teaching (especially tutoring) but developers are users too, and vice-versa. so its really hard to say when to blame them and when to go easy. go easy when no harm is done, gripe when they break things for someone else is an imperfect but decent rule. certainly better than lisis "we should just kiss their arses and hope for the best." (slightly paraphrased.)

#7 Post by disciple »

Hi guys, there's nothing wrong with a good rant :)
My questions were actually serious - I was neither complaining nor looking for help to deal with these files.

I have scripts that use dos2unix so that they can process the files with e.g. sed, as nosystemdthanks described. And then use unix2dos so the output can be consumed by certain Windows applications.
But I don't remember ever coming across a script that either:
- ran dos2unix and mac2unix on input files just in case, or
- used file or something to detect the eol standard and convert appropriately.
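For what it's worth, the second option does not strictly need file. Here is a rough sketch of detecting the convention with tr and sed alone; `detect_eol` and its five answers (lf, crlf, cr, mixed, none) are invented for the example, and it holds one character per line ending in memory, so it is meant for modest files.

```shell
# detect_eol FILE: print lf, crlf, cr, mixed or none.
detect_eol() {
    # Reduce the file to its line-ending bytes, spelled C (CR) and L (LF).
    sig=$(tr -cd '\r\n' < "$1" | tr '\r\n' 'CL')
    # Whatever is left once every complete CR+LF pair has been removed.
    rest=$(printf '%s\n' "$sig" | sed 's/CL//g')
    if [ -z "$sig" ]; then
        echo none
    elif [ -z "$rest" ]; then
        echo crlf
    else
        case $sig in
            *C*L*|*L*C*) echo mixed ;;
            *C*)         echo cr ;;
            *)           echo lf ;;
        esac
    fi
}

tmp=$(mktemp -d)
printf 'a\nb\n'     > "$tmp/lf"
printf 'a\r\nb\r\n' > "$tmp/crlf"
printf 'a\rb\r'     > "$tmp/cr"
printf 'a\r\nb\n'   > "$tmp/mixed"
for f in lf crlf cr mixed; do
    printf '%s: %s\n' "$f" "$(detect_eol "$tmp/$f")"
done
```

A script could then branch on the answer, e.g. run dos2unix only when it prints crlf and refuse to guess when it prints mixed.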

And although I doubt I've seen CR files in the wild, in my experience if you feed CRLF files to linux software it generally gives the results you would expect if you didn't know anything about line endings, or it is like sed and fails in ways that at least seem to make sense.
6502coder wrote: The UNIX/Linux standard is to use LF as the end-of-line marker in a text file. Every Linux user should know this.
I imagine there are plenty now that don't - they probably wouldn't see it in many tutorials for example.
6502coder wrote: If a text file doesn't obey the standard, all bets are off -- that is obvious. This is not "unpredictable"
Sorry, but I think you're going too far there. If a tool isn't going to automatically handle different line endings, you would still expect it to behave in a way that makes sense. It isn't like CR is prohibited.
The thing is, at least on the surface these behaviours just don't make sense. Why does sort seem to work with CRLF files, but if you look closely there are a few mistakes?

Code:

#  echo "0
> 3
> 2
> 1
> A
> 2A
> A2
> A 2">test
#  cat test
0
3
2
1
A
2A
A2
A 2
#  cat test| sort -V
0
1
2
2A
3
A
A2
A 2
#  cp test testcrlf
#  unix2dos testcrlf
#  cat testcrlf| sort -V
0
1
2A
2
3
A2
A
A 2
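As far as sort is concerned, the CR that unix2dos adds is simply the last character of each line, so the keys being compared in the second listing are really "2" plus CR, "2A" plus CR, and so on. A minimal sketch, assuming the goal is CRLF in and CRLF out: strip the CR, sort, then put it back. `sort_crlf` is a hypothetical helper, and `sed -n l` is only there to make the CR visible as \r.

```shell
# A literal CR byte for use inside the sed expressions.
cr=$(printf '\r')

# sort_crlf [SORT OPTIONS]: sort CRLF text from stdin as if the CR were absent.
#   first sed   drops the CR at the end of each line
#   sort        sees clean keys; any options are passed through unchanged
#   second sed  appends the CR again so the output is still CRLF
sort_crlf() {
    sed "s/$cr\$//" | sort "$@" | sed "s/\$/$cr/"
}

printf '2A\r\n2\r\n1\r\n' | sort_crlf | sed -n l
```

With a sort that supports it, `sort_crlf -V < testcrlf` would be the equivalent of the listing above.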
Why would both implementations of cat fail to read anything from the file with CR line endings?
6502coder wrote: Since UNIX/Linux utilities cannot possibly anticipate all the ways in which CR might be used in a text file, they cannot unilaterally set policy on what to do with the sequence CRLF, other than to accept the CR verbatim.
Some do, at least by default! I would hope they have a way to override it if necessary...
Last edited by disciple on Tue 12 Jun 2018, 23:14, edited 1 time in total.

#8 Post by nosystemdthanks »

if you turn your examples into scripts that can be downloaded as attachments i will try to explain the behaviour of each, even if i have to guess.

#9 Post by disciple »

OK, some of this might be starting to make sense.
Looking at it again, I think with the CRLF input, sort is simply sorting CR after letters and numbers, but before spaces. This also explains why sort gives a slightly different result if you remove the blank line at the end of the file with CRLF line endings.
But I can't explain the output of sort when it operates on a file with CR line endings. I would have thought it might show the entire content of the file, but all on one line. And why does cat not show anything? I thought maybe it just does not accept input without a trailing LF character, but if I remove the trailing CR (not included in my example below - I used a text editor) then it shows the last line of the file. If I instead add a LF to the end of the file then it gives me the same output as sort.

Code:

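#! /bin/sh
# Build a small LF-terminated test file.
echo "0
3
2
1
A 2
2A
A2
A" >test
# Baseline: version sort of the LF file.
sort -V <test
# Same data with CRLF line endings.
cp test testcrlf
unix2dos testcrlf
sort -V <testcrlf
# Same data with CR-only line endings.
cp test testcr
unix2mac testcr
sort -V <testcr
echo "here comes cat:"
cat testcr
echo "what?  no output from cat?"
# Append a trailing LF if the file lacks one (GNU sed idiom), then try cat again.
sed -i -e '$a\' testcr
cat testcr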
Output:

Code:

$  ./example.sh
0
1
2
2A
3
A
A2
A 2
unix2dos: converting file testcrlf to DOS format ...
0
1
2A
2
3
A2
A
A 2
unix2mac: converting file testcr to Mac format ...
A22
here comes cat:
what?  no output from cat?
A22
Last edited by disciple on Wed 13 Jun 2018, 05:25, edited 1 time in total.

#10 Post by disciple »

Ah, I broke the file down into shorter versions, removing a line at a time, and I see that you have to think about this like a typewriter.
Carriage return means "go back to the beginning of the line", and line feed means "go down a line".
In a file with CR line endings this means that every time sort gets to the end of a line, it goes back to the beginning of the line and starts overwriting it.
But every time cat gets to the end of a line, for some reason it just erases everything already on the line. So in a file with a trailing CR you don't get anything.
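The typewriter model can be tried out without a terminal. The sketch below (`render_cr` is a made-up name) overlays each CR-separated piece of a line onto the same row, the way a terminal would draw it. It suggests the overwriting happens when the bytes are displayed: sort and cat themselves pass the CR bytes through unchanged, which piping their output into od -c would confirm.

```shell
# render_cr: show each input line the way a terminal row would end up,
# treating CR as "move the cursor back to column 0 and keep writing".
render_cr() {
    awk '{
        n = split($0, piece, "\r")
        row = ""
        for (i = 1; i <= n; i++) {
            # New text replaces the first length(piece[i]) columns;
            # anything further to the right stays on screen.
            row = piece[i] substr(row, length(piece[i]) + 1)
        }
        print row
    }'
}

# The bytes sort -V emits for the two CR test files in this thread.
printf 'S2\rS\r\n' | render_cr
printf '0\r3\r2\r1\rA 2\r2A\rA2\rA\r\n' | render_cr
```

This prints S2 and A22, the same as the sort transcripts. As for cat, a likely explanation is that its output ends in a CR with no LF after it, so the cursor is left at column 0 and whatever is printed next (the shell prompt, or the next echo in the script) is drawn over the text.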

#11 Post by disciple »

disciple wrote: Looking at it again, I think with the CRLF input, sort is simply sorting CR after letters and numbers, but before spaces.
So, to reconcile this apparent behaviour with the theory in my last post: in a CRLF file, when sort processes each line it hits the CR and goes back to the beginning, but then it hits the LF, so it doesn't overwrite anything.
And because cat prints all the lines of a CRLF file, when it hits a carriage return it must go back to the beginning but not delete everything on the line if it then hits a line feed.
Last edited by disciple on Wed 13 Jun 2018, 05:34, edited 1 time in total.

#12 Post by nosystemdthanks »

something else you have to consider is that back when graphics were the exception and text was the rule, the text had a lot of different modes:

echo -e "\\033[0;35mpurple\\041\\033[0m"

so when its being weird, its not always the commands youre running but the environment-- your term window settings (up to the term program you choose, but fairly reliable) and how long since youve output a lot of random things from say, a file in /usr/bin.

# reset

now and then is highly recommended, at least when youre trying to figure out text output. but it will clear your screen, including your scrollback usually.

#13 Post by disciple »

(N.B. have edited previous post)
Thanks guys.
So, although I don't fully understand cat's behaviour (i.e. with CR line endings, after going back to the beginning of the line, why does it erase everything instead of overwriting it character by character), I think it is safe to say none of this is a bug, and if you are dealing with files that may not have LF line endings and a trailing newline then yes, you should always convert them and add the newline if it is missing.

BTW what prompted this was a discussion with the author of the C implementation of natsort - which is affected by CR in the same way, unlike other implementations (well, the python version anyway). So I think the answer is that the behaviour is correct, because it is consistent with standard *nix utilities, but it is probably helpful to amend the documentation to point out that in this respect the behaviour is different to other implementations.

#14 Post by nosystemdthanks »

what happens when you add this:

| tr -d "\r"

after the command?

#15 Post by disciple »

Good point:

Code:

$  cat testcr| tr -d "\r"
0321A 22AA2A$  sort -V testcr| tr -d "\r"
0321A 22AA2A
$
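Deleting the CRs joins everything into one line because a CR-only file has no LF to fall back on. Translating instead of deleting keeps the line structure. A small sketch for comparison (`strip_cr` and `cr_to_lf` are made-up names):

```shell
# strip_cr: delete every CR. Right for CRLF input, but it glues the
# lines of a CR-only file together.
strip_cr() { tr -d '\r'; }

# cr_to_lf: turn every CR into LF. Right for CR-only input.
cr_to_lf() { tr '\r' '\n'; }

printf '0\r3\r2\r1\r' | strip_cr; echo
printf '0\r3\r2\r1\r' | cr_to_lf
```

Neither is safe to apply blindly: on CRLF input cr_to_lf would turn each line ending into two LFs, i.e. add a blank line after every line.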

#16 Post by nosystemdthanks »

does that help?

its alright that we arent totally clear on what you want to accomplish. a bias on forums is "tell us what you want precisely so we know how to help."

when i say its a bias, i mean if you say "these results are unpredictable," the fix is to either tell you what we know, or tell you how to make it more predictable. i cant guess if this solves your problem.

thats ok though. im sort of curious, but probably not to the point where i would try the program youre talking about, learn how it works, get familiar enough to solve this.

my motivation here is that im more and more dedicated to helping as many non-systemd users as possible. that doesnt mean i can help you, but if it did, fantastic.

i did help a couple people on the forum today, i solved a problem i was having too, partly-- most days wont be as productive as that.

the environment youre working in looks a lot simpler than it is-- youll be happy to know that most of the time, it is not as intriguing as this. i spent ten years learning bash and several people here can still do circles around me. https://ptpb.pw/UFuV https://kek.gg/i/4SqJ35.png

#17 Post by disciple »

I wasn't trying to accomplish anything much in particular - this behaviour seemed buggy on the face of it, but I found it hard to believe that there could be such significant bugs in coreutils, so I wanted to understand what was going on.
We've established that it is all "not a bug", and we have a reasonable understanding of what is actually happening, so I'd call that "solved". Thanks.

#18 Post by nosystemdthanks »

oh, thats no problem.

even if its buggy, it probably has more to do with the display mode / vt emulation than the utils themselves.

the utilities write very specific things to files-- including the screen. its the emulation mode of the term, not the utilities that really determines when a line clears or when a newline counts. this is probably about edge cases too, i dont come across this issue much and there are lots of ways to fix the output if it does that.

#19 Post by disciple »

Ha, I was caught out by this again.

Code:

$  cat grasseg.txt
0% 0:230:0
20% 0:160:0
35% 50:130:0
55% 120:100:30
75% 120:130:40
90% 170:160:50
100% 255:255:100
$  awk -F'\\%\\s|:' '{print $1 "," $2 "," $3 "," $4}' grasseg.txt
0,0,230,0
20,0,160,0
35,50,130,0
55,120,100,30
75,120,130,40
90,170,160,50
100,255,255,100
$  awk -F'\\%\\s|:' '{print $1 "," $2 "," $3 "," $4 ","}' grasseg.txt
,,0,230,0
,0,0,160,0
,5,50,130,0
,5,120,100,30
,5,120,130,40
,0,170,160,50
100,255,255,100,
$  awk -F'\\%\\s|:' '{print $1 "," $2 "," $3 "," $4 "," $1}' grasseg.txt
,00,230,0
,200,160,0
,3550,130,0
,55120,100,30
,75120,130,40
,90170,160,50
100,255,255,100,100
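The transcript above looks like what you would get if grasseg.txt had CRLF line endings on every line but the last: $4 then ends in a CR, and anything printed after it starts again at column 0. That is a guess from the output, not something checked against the file. Assuming that is the cause, stripping a trailing CR inside awk avoids it. `to_csv` is a hypothetical wrapper, and the field separator is simplified to '% |:' so that it does not rely on the \s extension.

```shell
# to_csv [FILE...]: split "20% 0:160:0" style lines into comma-separated
# fields, tolerating CRLF input.
to_csv() {
    awk -F'% |:' '{
        sub(/\r$/, "")    # drop a trailing CR; awk re-splits the fields
        print $1 "," $2 "," $3 "," $4 ","
    }' "$@"
}

printf '0%% 0:230:0\r\n20%% 0:160:0\r\n100%% 255:255:100\n' | to_csv
```

Running the file through dos2unix first would work just as well; doing it inside awk only saves a step when the input cannot be modified.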