All times are UTC - 4
Unpredictable results from standard utilities
disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Mon 11 Jun 2018, 23:35    Post subject:  Unpredictable results from standard utilities
Subject description: depending on end-of-line convention in input file
 

Both the busybox and coreutils utilities give different results depending on the end-of-line format of an input file, i.e. CR (classic Mac standard), LF (Unix standard) or CRLF (Windows standard). (Note: apparently you will also find files in the wild that use a mixture!)
See below. Note that some of the commands (file, diff) are used to compare the files, and others (cat, sort) to compare the results, because I was expecting the same results from each file.
Should this unpredictability be considered a bug? Or are you expected to always check the format of a file before doing anything to it? i.e. should you make sure files are always in LF format if you are using any standard *nix utility in any context (except perhaps OSX - I haven't checked)?
Code:
$  echo "S2
> S">lf
$  cp lf crlf
$  unix2dos crlf
unix2dos: converting file crlf to DOS format ...
$  cp lf cr
$  unix2mac cr
unix2mac: converting file cr to Mac format ...
$  file lf
lf: ASCII text
$  file crlf
crlf: ASCII text, with CRLF line terminators
$  file cr
cr: ASCII text, with CR line terminators
$  diff lf cr
1,2c1
< S2
< S
---
S S2
\ No newline at end of file
$  diff lf crlf
1,2c1,2
< S2
< S
---
> S2
> S
$  sort -V <lf
S
S2
$  sort -V <crlf
S2
S
$  sort -V <cr
S2
$  cat lf
S2
S
$  cat crlf
S2
S
$  cat cr
$  busybox cat cr
$  busybox cat crlf
S2
S
$  busybox cat lf
S2
S

_________________
Do you know a good gtkdialog program? Please post a link here

Classic Puppy quotes

ROOT FOREVER
GTK2 FOREVER
nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Tue 12 Jun 2018, 00:10    Post subject: Re: Unpredictable results from standard utilities

fig handles this by converting all crlf to lf and then all cr to lf:

cat inputtext | python -c "from sys import stdin, stdout ; p = stdin.read() ; stdout.write(p.replace('\r\n','\n').replace('\r','\n')) ; stdout.flush()" # public domain
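
For anyone who finds the one-liner hard to read, here is roughly the same idea spelled out as a small Python function. It is only a sketch: the name normalize_eol is made up for illustration and is not taken from fig. The order of the two replacements matters, because replacing lone CRs first would turn every CRLF into two line feeds.

```python
def normalize_eol(text):
    """Convert CRLF and lone CR line endings to LF (hypothetical helper)."""
    # Replace CRLF first; doing CR first would turn "\r\n" into "\n\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    mixed = "S2\r\nS\rS3\n"  # one line of each convention
    print(repr(normalize_eol(mixed)))
```

Lines that already end in LF pass through unchanged, so running it on a file that is already in unix format should be harmless.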

i wanted to do this with sed and tr:

cat inputtext | sed "s/\r\n/\n/g" | tr '\r' '\n'

tr only maps single characters, so it cant turn the two-character \r\n into \n, and sed strips the newline from each line before it applies the expression, so \r\n never matches.

note that the "| tr '\r' '\n'" part covers text from very old macs, modern macs use \n like puppy does.

perl is up to this task, but i refuse. i would be interested in a python-free solution just because python is not always available.

_________________
philosophy is important to software design; coding is useful for demonstrating design concepts
6502coder


Joined: 23 Mar 2009
Posts: 460
Location: Western United States

PostPosted: Tue 12 Jun 2018, 00:54    Post subject:  

The UNIX/Linux standard is to use LF as the end-of-line marker in a text file. Every Linux user should know this. If a text file doesn't obey the standard, all bets are off -- that is obvious. This is not "unpredictable" and it is certainly not a bug.

You can't expect text-handling utilities to automatically handle the 3 common EOL markers, for the simple reason that there is not, and cannot possibly be, any universal agreement on what "handle" means.

For example, it is perfectly possible that I might want to have Linux text files that contain CR chars, for some special reason. (As in fact I have, in situations involving partially encrypted text.) If one of those CR chars happens to end up preceding a LF char, I DO NOT want that CR char to be summarily discarded, just because some other OS happens to use CRLF as its EOL marker.

Since UNIX/Linux utilities cannot possibly anticipate all the ways in which CR might be used in a text file, they cannot unilaterally set policy on what to do with the sequence CRLF, other than to accept the CR verbatim.

So yes, if you routinely deal with MS-DOS text files and/or Mac text files in a Linux environment, the onus is on you to deal with it in the manner appropriate to your particular usage of those files. For many people that may mean interpreting all isolated CRs and CRLFs as EOL markers in Linux. But you cannot assume that to be a satisfactory policy for everyone.

</old-man-rant>
nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Tue 12 Jun 2018, 01:04    Post subject:  

i should add that busybox and bash will probably never be totally compatible anyway, and my preference is to use puppies that include bash or at least python.

recent gripe: a version of while that litters into tty when the file is not found, when the standard for years is to be silent by default. that was a prime candidate for -v, right? hateful change of default that breaks so many things. (may not affect puppy users, this is going on elsewhere.)

we interpreted the op differently (i admit i could have paid closer attention.)

you (seemingly) took it as unreasonable whinging, while i simply took it as "what should i do to fix this?"

youre right of course, what you said is valid and technically sound.

my proposal on what to do is fairly simple and probably not the best way for several things-- in the context where its used, a file needs to be processed and translated, so the output is fairly certain to differ from input already.

the onus does fall onto the user, though the first thing the user will probably want to know is what utilities can help them. leafpad or nano (probably geany) will work, on the command line a python one-liner can generally achieve what you want.

you may notice that when someone talks about cr and lf i already presume theyre familiar with the command line. a fallacy, but not a huge leap. on this forum, a few people probably learn about cr and lf before the command line though.

6502coder


Joined: 23 Mar 2009
Posts: 460
Location: Western United States

PostPosted: Tue 12 Jun 2018, 01:14    Post subject: Re: Unpredictable results from standard utilities

@nosystemdthanks

Well, I took the man at his word when he asked:
disciple wrote:
Should this unpredictability be considered a bug? Or are you expected to always check the format of a file before doing anything to it? i.e. should you make sure files are always in LF format if you are using any standard *nix utility in any context...?

I may have jumped on him a little too hard but I did concede as much when I signed off with "old-man-rant"

As for you, NST, you are of course quite right that simple ways to deal with the problem exist. If you felt caught in the crossfire, that was unintended on my part.
nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Tue 12 Jun 2018, 01:30    Post subject: Re: Unpredictable results from standard utilities

oh no problem at all, i rant all the time about software this and software that, i know there will be times when someone feels its actually directed at them when all they did was make me think of some weird software thing.

the developer/user dichotomy leads to all sorts of weird ideas.

sympathy for the user is a good basis for teaching (especially tutoring) but developers are users too, and vice-versa. so its really hard to say when to blame them and when to go easy. go easy when no harm is done, gripe when they break things for someone else is an imperfect but decent rule. certainly better than lisis "we should just kiss their arses and hope for the best." (slightly paraphrased.)

disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Tue 12 Jun 2018, 07:30    Post subject:  

Hi guys, there's nothing wrong with a good rant :)
My questions were actually serious - I was neither complaining nor looking for help to deal with these files.

I have scripts that use dos2unix so that they can process the files with e.g. sed, as nosystemdthanks described. And then use unix2dos so the output can be consumed by certain Windows applications.
But I don't remember ever coming across a script that either:
- ran dos2unix and mac2unix on input files just in case, or
- used file or something to detect the eol standard and convert appropriately.
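
As a rough illustration of the second option, the sketch below counts each convention in a file's raw bytes, which is enough to tell a pure LF, CRLF or CR file from a mixed one. The name detect_eol is hypothetical, and this is not how the file utility itself works; it is just one simple way to make the check.

```python
import re


def detect_eol(data):
    """Count each line-ending convention in a bytes object (hypothetical helper)."""
    crlf = data.count(b"\r\n")
    # Lone CR: a CR that is not followed by LF.
    cr = len(re.findall(rb"\r(?!\n)", data))
    # Lone LF: an LF that is not preceded by CR.
    lf = len(re.findall(rb"(?<!\r)\n", data))
    return {"CRLF": crlf, "CR": cr, "LF": lf}


if __name__ == "__main__":
    for sample in (b"S2\nS\n", b"S2\r\nS\r\n", b"S2\rS\r", b"a\r\nb\rc\n"):
        print(sample, detect_eol(sample))
```

To use it on a real file, read the file in binary mode (open(path, "rb")) so that Python does not translate the line endings before they are counted.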

And although I doubt I've seen CR files in the wild, in my experience if you feed CRLF files to linux software it generally gives the results you would expect if you didn't know anything about line endings, or it is like sed and fails in ways that at least seem to make sense.

Quote:
The UNIX/Linux standard is to use LF as the end-of-line marker in a text file. Every Linux user should know this.

I imagine there are plenty now that don't - they probably wouldn't see it in many tutorials for example.

Quote:
If a text file doesn't obey the standard, all bets are off -- that is obvious. This is not "unpredictable"

Sorry, but I think you're going too far there. If a tool isn't going to automatically handle different line endings, you would still expect it to behave in a way that makes sense. It isn't like CR is prohibited.
The thing is, at least on the surface these behaviours just don't make sense. Why does sort seem to work with CRLF files, but if you look closely there are a few mistakes?
Code:
#  echo "0
> 3
> 2
> 1
> A
> 2A
> A2
> A 2">test
#  cat test
0
3
2
1
A
2A
A2
A 2
#  cat test| sort -V
0
1
2
2A
3
A
A2
A 2
#  cp test testcrlf
#  unix2dos testcrlf
#  cat testcrlf| sort -V
0
1
2A
2
3
A2
A
A 2

Why would both implementations of cat fail to read anything from the file with CR line endings?

Quote:
Since UNIX/Linux utilities cannot possibly anticipate all the ways in which CR might be used in a text file, they cannot unilaterally set policy on what to do with the sequence CRLF, other than to accept the CR verbatim.

Some do, at least by default! I would hope they have a way to override it if necessary...

Last edited by disciple on Tue 12 Jun 2018, 19:14; edited 1 time in total
nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Tue 12 Jun 2018, 09:37    Post subject:  

if you turn your examples into scripts that can be downloaded as attachments i will try to explain the behaviour of each, even if i have to guess.
disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Tue 12 Jun 2018, 19:49    Post subject:  

OK, some of this might be starting to make sense.
Looking at it again, I think with the CRLF input, sort is simply sorting CR after letters and numbers, but before spaces. This also explains why sort gives a slightly different result if you remove the blank line at the end of the file with CRLF line endings.
But I can't explain the output of sort when it operates on a file with CR line endings. I would have thought it might show the entire content of the file, but all on one line. And why does cat not show anything? I thought maybe it just does not accept input without a trailing LF character, but if i remove the trailing CR (not included in my example below - I used a text editor) then it shows the last line of the file. If I instead add a LF to the end of the file then it gives me the same output as sort.
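
It may help to look at what a tool that only recognises LF actually receives from each file. The sketch below does not reproduce the ordering rules of sort -V (those depend on its version comparison and possibly the locale); it only shows, under the assumption that lines are split on LF alone, that CRLF input yields lines with a CR stuck on the end, and CR-only input yields one single long line. The name lf_lines is made up for the example.

```python
def lf_lines(data):
    """Split bytes into lines the way an LF-only tool would (sketch)."""
    lines = data.split(b"\n")
    # A trailing LF terminates the last line rather than starting a new one.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


if __name__ == "__main__":
    samples = [("lf", b"S2\nS\n"), ("crlf", b"S2\r\nS\r\n"), ("cr", b"S2\rS\r")]
    for name, data in samples:
        print(name, lf_lines(data))
```

With CR-only input there is nothing to reorder, because there is only one line; the CRs inside it are ordinary data as far as the splitting is concerned.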

Code:
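#! /bin/sh
# Build an 8-line test file with LF line endings.
echo "0
3
2
1
A 2
2A
A2
A">test
sort -V<test
# Same content with CRLF (DOS) line endings.
cp test testcrlf
unix2dos testcrlf
sort -V <testcrlf
# Same content with CR-only (Mac) line endings.
cp test testcr
unix2mac testcr
sort -V <testcr
echo "here comes cat:"
cat testcr
echo "what?  no output from cat?"
# Append a trailing LF if the file does not already end with one.
sed -i -e '$a\' testcr
cat testcr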

Output:
Code:
$  ./example.sh
0
1
2
2A
3
A
A2
A 2
unix2dos: converting file testcrlf to DOS format ...
0
1
2A
2
3
A2
A
A 2
unix2mac: converting file testcr to Mac format ...
A22
here comes cat:
what?  no output from cat?
A22

Last edited by disciple on Wed 13 Jun 2018, 01:25; edited 1 time in total
disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Wed 13 Jun 2018, 01:13    Post subject:  

Ah, I broke the file down into shorter versions, removing a line at a time, and I see that you have to think about this like a typewriter.
Carriage return means "go back to the beginning of the line", and line feed means "go down a line".
In a file with CR line endings this means every time sort gets to the end of the line it goes back to the beginning of the line and starts overwriting it.
But every time cat gets to the end of the line for some reason it just erases everything already on the line. So in a file with a trailing CR you don't get anything.
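
One possible explanation, assuming an ordinary terminal emulator, is that neither cat nor sort erases or overwrites anything: both pass the CR bytes through untouched, and it is the terminal that moves the cursor back to column 0 on each CR. GNU sort, as far as I know, ends its last output line with a LF, so the overwritten row stays on screen. cat's output ends with the CR itself, so whatever is printed next (the shell prompt, or the next echo) lands on top of it. The sketch below simulates this; render is a made-up name, and the model ignores everything except CR and LF.

```python
def render(stream):
    """Return the rows a simple terminal would show for the given text.

    CR moves the cursor to column 0, LF starts a new row at column 0,
    and every other character overwrites whatever is under the cursor.
    """
    rows = [[]]
    col = 0
    for ch in stream:
        if ch == "\r":
            col = 0
        elif ch == "\n":
            rows.append([])
            col = 0
        else:
            row = rows[-1]
            if col < len(row):
                row[col] = ch  # overwrite an existing character
            else:
                row.append(ch)  # extend the row
            col += 1
    return ["".join(row) for row in rows]


if __name__ == "__main__":
    cr_file = "0\r3\r2\r1\rA 2\r2A\rA2\rA\r"
    # sort: the single CR-only "line" followed by a LF
    print(render(cr_file + "\n"))
    # cat: no LF, so the following echo is drawn over the same row
    print(render(cr_file + "what?  no output from cat?\n"))
```

Fed the contents of testcr plus a trailing LF, the model gives A22, the same as the sort output above, and the diff line "> S2" followed by CR and "S" comes out as "S S2", matching the first post.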

disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Wed 13 Jun 2018, 01:28    Post subject:  

disciple wrote:
Looking at it again, I think with the CRLF input, sort is simply sorting CR after letters and numbers, but before spaces.

So, to reconcile this apparent behaviour with the theory in my last post. In a CRLF file when sort processes each line it hits the CR and goes back to the beginning, but then it hits the LF, so it doesn't overwrite anything.
So because cat prints all the lines of a CRLF file, when it hits a carriage return it goes back to the beginning, and it doesn't delete everything on the line if it then hits a line feed.

Last edited by disciple on Wed 13 Jun 2018, 01:34; edited 1 time in total
nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Wed 13 Jun 2018, 01:32    Post subject:  

something else you have to consider is that back when graphics were the exception and text was the rule, the text had a lot of different modes:

echo -e "\\033[0;35mpurple\\041\\033[0m"

so when its being weird, its not always the commands youre running but the environment-- your term window settings (up to the term program you choose, but fairly reliable) and how long since youve output a lot of random things from say, a file in /usr/bin.

# reset

now and then is highly recommended, at least when youre trying to figure out text output. but it will clear your screen, including your scrollback usually.

disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Wed 13 Jun 2018, 01:58    Post subject:  

(N.B. have edited previous post)
Thanks guys.
So, although I don't fully understand cat's behaviour (i.e. with CR line endings, after going back to the beginning of the line, why does it erase everything instead of overwriting it character by character), I think it is safe to say none of this is a bug, and if you are dealing with files that may not have LF line endings and a trailing newline then yes, you should always convert them and add the newline if it is missing.

BTW what prompted this was a discussion with the author of the C implementation of natsort - which is affected by CR in the same way, unlike other implementations (well, the python version anyway). So I think the answer is that the behaviour is correct, because it is consistent with standard *nix utilities, but it is probably helpful to amend the documentation to point out that in this respect the behaviour is different to other implementations.

nosystemdthanks

Joined: 03 May 2018
Posts: 168

PostPosted: Wed 13 Jun 2018, 02:06    Post subject:  

what happens when you add this:

| tr -d "\r"

after the command?

disciple

Joined: 20 May 2006
Posts: 6828
Location: Auckland, New Zealand

PostPosted: Wed 13 Jun 2018, 02:27    Post subject:  

Good point:
Code:
$  cat testcr| tr -d "\r"
0321A 22AA2A$  sort -V testcr| tr -d "\r"
0321A 22AA2A
$
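
The output above shows that the data was there all along, but also that deleting CR is only the right fix for CRLF files: in a CR-only file it removes the sole line separator and glues the lines together. A small sketch of the difference, with made-up function names standing in for tr -d '\r' and tr '\r' '\n':

```python
def delete_cr(text):
    # Drop every CR, like piping through: tr -d '\r'
    return text.replace("\r", "")


def cr_to_lf(text):
    # Turn every CR into LF, like piping through: tr '\r' '\n'
    return text.replace("\r", "\n")


if __name__ == "__main__":
    cr_file = "0\r3\r2\r1\rA 2\r2A\rA2\rA\r"
    crlf_file = "S2\r\nS\r\n"
    print(repr(delete_cr(cr_file)))    # lines glued together
    print(repr(cr_to_lf(cr_file)))     # lines preserved
    print(repr(delete_cr(crlf_file)))  # clean LF file
    print(repr(cr_to_lf(crlf_file)))   # every line ending doubled
```

Neither single-character operation is right for both kinds of file, which is presumably why the earlier suggestion handles CRLF first and lone CR second.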
