Puppy Linux Discussion Forum
All times are UTC - 4
fold-BB-NOTUSED - IT SHOULD BE USED!
Page 1 of 2 [18 Posts]

MochiMoppel
Posted: Wed 14 Jun 2017, 00:31
Subject description: coreutils vs. busybox
 

I sometimes wonder about these strange ...-BB-NOTUSED symlinks in bin and sbin directories. Apparently these link to busybox versions of otherwise equally named utilities. But why NOTUSED? Are their full featured cousins, e.g. those contained in coreutils, always considered preferable?

I understand that BB's utilities are bare-bones, but what if one of them offers exactly the same options as its counterpart, or just the options that the user needs? Wouldn't it be preferable then to use BB? After all, at least in frugal installs, busybox is already running, and calling one of its functions seems so much more efficient than executing the often heavy coreutils binary.

Lately I tried to use the fold utility (coreutils version 8.19) to wrap Japanese text. Without any options fold wraps text into columns 80 characters wide. When I piped the text through fold in gtkdialog, I received an error: Gtk-CRITICAL **: gtk_text_buffer_emit_insert: assertion g_utf8_validate (text, len, NULL) failed. Sometimes it worked OK, but most often it did not. I couldn't find a pattern. I then turned to busybox fold and it never failed. It appears that the coreutils version can't handle UTF-8 properly while busybox can.

Here is a test case with UTF-8 symbols instead of Japanese characters. This triggers a segmentation fault error and Leafpad will not run:
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | fold -w5 | leafpad


This works for me as expected and folds the string into 5 character wide lines:
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | busybox fold -w5 | leafpad

musher0
Posted: Wed 14 Jun 2017, 01:06

Hi MochiMoppel.

Your reasoning is sound, and we developers should indeed use the most
efficient tool for the job.

Except that coreutils is "big" only if you choose to compile it in one chunk.

Please find attached a tree of my compilation of coreutils-8.27: there are
105 binaries, the smallest being 14 KB and the largest 60 KB.

I think twice about using BB utilities: some BB utils are so trimmed down
they are almost useless. For example, the less and lsof replacements
offered by BB are really awful.

Good find, though, this fold utility. I normally use fmt for this purpose.

As to the designation "BB-NOTUSED", it's an "editorial decision" by BarryK,
inventor of PuppyLinux, and that's all it is.

BFN.
Attachment: coreutils-bin.sort.zip (zip, 801 Bytes)

_________________
musher0
~~~~~~~~~~
« Un insensé sur le trône n'est qu'un singe sur le haut d'un toit. » / "A madman
on the throne is just a monkey on top of a roof." (Bernard de Clervaux)

Sailor Enceladus
Posted: Thu 15 Jun 2017, 13:54
 

MochiMoppel wrote:
Here is a test case with UTF-8 symbols instead of Japanese characters. This triggers a segmentation fault error and Leafpad will not run:
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | fold -w5 | leafpad


This works for me as expected and folds the string into 5 character wide lines:
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | busybox fold -w5 | leafpad

Same result for me with coreutils 8.26. Using -w15 instead of -w5 made fold show up like busybox, until I changed it to this:

Code:
echo '☂☃☄hahah★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | fold -w15

MochiMoppel
Posted: Thu 15 Jun 2017, 21:52
 

Sailor Enceladus wrote:
Using -w15 instead of -w5 made fold show up like busybox,
Are you really sure? Look closer. Here the result is plain wrong. Wraps after 5, not 15 characters. Busybox wraps after 15 characters (of course).

I can tell you what does work for me in both versions: removing 1 character from my string and running fold without any options, i.e. fold has no other function than passing the string to leafpad unchanged. Useless, but successful:
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟' | fold | leafpad

Sailor Enceladus
Posted: Thu 15 Jun 2017, 22:34
 

MochiMoppel wrote:
Sailor Enceladus wrote:
Using -w15 instead of -w5 made fold show up like busybox,
Are you really sure? Look closer. Here the result is plain wrong. Wraps after 5, not 15 characters. Busybox wraps after 15 characters (of course).

Haha yes, that's what I meant. fold with -w15 gave me the same as busybox fold with -w5 for those symbols... :lol:

Edit: it seems the full fold counts literal bytes for these Unicode characters too; I had to use -w10 to wrap after 5 characters.

Code:
echo 'ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫ' | fold -w10 | leafpad
[Image: capture25954.png]


MochiMoppel
Posted: Thu 15 Jun 2017, 23:41
 

Sailor Enceladus wrote:
Edit: it seems the full fold counts literal bytes for these Unicode characters too; I had to use -w10 to wrap after 5 characters.

Code:
echo 'ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫ' | fold -w10 | leafpad

Aaaah. I see a pattern. For your particular string, consisting of 2-byte characters, you have to multiply your desired column width by 2. To wrap after 1 col you write -w2, for 2 cols -w4 etc.
For my original string (3-byte characters, like Japanese characters) you have to multiply by 3; that's why -w15 "worked" and resulted in 5 cols.
Now mix characters from my and from your string and see what happens :lol:

Bottom line: Coreutils' fold is unusable.
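
The multiply-by-bytes pattern can be reproduced without fold. A minimal sketch, assuming a shell whose head supports -c (GNU, BSD and busybox do) and a UTF-8 encoded script; first_bytes is a hypothetical helper, not part of coreutils or busybox, that cuts a string after N bytes the way a byte-counting fold would:

```shell
#!/bin/sh
# Hypothetical helper: print the first N bytes of STRING.
# head -c counts bytes, never characters, whatever the locale is.
first_bytes() {
    printf '%s' "$2" | head -c "$1"
}

# 3-byte symbols: 15 bytes hold exactly 5 characters.
first_bytes 15 '☂☃☄★☆☇☈'; echo
# 2-byte Cyrillic letters: 10 bytes hold exactly 5 characters.
first_bytes 10 'абвгдђж'; echo
# Mixed widths: 4 bytes end in the middle of the 3-byte '☂'.
# od shows the raw bytes, since the cut-off character is not printable.
first_bytes 4 'ж☂☃' | od -An -tx1
```

With characters of one byte length a suitable multiple always lands on a character boundary; once lengths are mixed, no single width does.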

misko_2083
Posted: Fri 16 Jun 2017, 01:19
 

^ fold is useless here.
awk on the other hand could do the work
Code:
echo '☂☃☄★☆☇☈☉☊☋☌☍☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫабвгдђжзијлљњпсћфцчџш'  | awk '{while (length>WIDTH) {print substr($0,1,WIDTH); $0=substr($0,WIDTH+1);} print;}' WIDTH=5 | leafpad

musher0
Posted: Fri 16 Jun 2017, 01:59

Difference between fold and fmt, using a silly sentence:
[Image: difference.jpg]



MochiMoppel
Posted: Fri 16 Jun 2017, 05:12
 

misko_2083 wrote:
^ fold is useless here.
coreutils fold is useless here.

Quote:
awk on the other hand could do the work
Sure, if you try hard enough you will always find a way to make simple things complicated :lol:

I could coerce pure bash to do the job, but what's the point?
Code:
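# foldme WIDTH STRING: loop while c < length (with <= an extra empty line is printed when the length is a multiple of WIDTH)
function foldme { for ((c=0; c<${#2}; c+=$1)); do echo "${2:$c:$1}"; done; }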
foldme 5 '☃☄★☆☇☈☉☊☋☌☍☎ ☏☔☕☖☗☘☙☚ ☛☜☝☞☟☠ȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥ' | leafpad

misko_2083
Posted: Fri 16 Jun 2017, 11:43
 

MochiMoppel wrote:

Quote:
awk on the other hand could do the work
Sure, if you try hard enough you will always find a way to make simple things complicated :lol:

It depends on your perspective. Some people use a straw to drink yoghurt. Some people use a spoon to eat soup.
The point is I like complications. :lol:

MochiMoppel
Posted: Mon 19 Jun 2017, 00:08

musher0 wrote:
Difference between fold and fmt, using a silly sentence:
Different tools for different purposes produce different results ...

Coreutils' fmt can't split strings after a defined length and (again the coreutils showstopper) like fold it can't handle Unicode.

Folding lines at spaces is not my topic here. If needed, busybox fold can do the job using the -s switch. Still, this is folding and not formatting: spaces are not replaced by newlines; they are preserved and may end up at line starts.

Code:
# echo 'abcde fghijkl mnopqrstuvwxyz' | fmt -w20
abcde fghijkl
mnopqrstuvwxyz

# echo '☂☃☄★☆☇ ☈☉☊☋☌☍ ☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | fmt -w20
☂☃☄★☆☇
☈☉☊☋☌☍
☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠

# echo '☂☃☄★☆ ☇☈☉☊☋☌☍ ☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠' | busybox fold -sw20
☂☃☄★☆ ☇☈☉☊☋☌☍
☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠

musher0
Posted: Mon 19 Jun 2017, 00:57

Hi, MochiMoppel.

Why is it then that fmt from coreutils works for the French language
with a LANG=fr_CA.utf8 environment?

Example -- Some news about violent winds, taken from Radio-Canada.ca:
Code:
echo "Des vents très violents, possiblement une tornade, ont complètement détruit une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides, dimanche. Une journée chaude et humide où le temps instable a donné lieu à une série d'alertes de tornades et d'orages de la part d'Environnement Canada." | fmt -w 80
Result:
Quote:
Des vents très violents, possiblement une tornade, ont complètement détruit
une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean,
ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides,
dimanche. Une journée chaude et humide où le temps instable a donné lieu à
une série d'alertes de tornades et d'orages de la part d'Environnement Canada.

Code:
echo "Des vents très violents, possiblement une tornade, ont complètement détruit une résidence de la municipalité d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi qu'une autre habitation à Sainte-Anne-du-Lac, dans les Laurentides, dimanche. Une journée chaude et humide où le temps instable a donné lieu à une série d'alertes de tornades et d'orages de la part d'Environnement Canada." | fmt -w 60
Result:
Quote:
Des vents très violents, possiblement une tornade, ont
complètement détruit une résidence de la municipalité
d'Hébertville, au Saguenay-Lac-Saint-Jean, ainsi
qu'une autre habitation à Sainte-Anne-du-Lac, dans les
Laurentides, dimanche. Une journée chaude et humide où
le temps instable a donné lieu à une série d'alertes
de tornades et d'orages de la part d'Environnement Canada.


Perhaps Japanese is using utf16? (Sorry if I sound ignorant. I would not
know this sort of thing.)

~~~~~~~
Various remarks:
-- Isn't utf8 chosen (or not) for one's language when one sets up the Puppy?

-- coreutils can be compiled with a "disable-nls" parameter... This means
that the developer can choose to have all his compiled coreutils
completely ignore the utf8 environment.

-- if you want to accelerate a sort or do a sort without taking into account
utf8 characters, you set LC_ALL=C and you set LC_ALL="" back on when
finished.

This is a relatively well documented trick. It also works if you wish to
greatly speed up some section of a bash script, whether this section has
some data to sort or not.

-- there is a report about the "cut" utility from coreutils misbehaving in a
UTF-8 environment here:
https://unix.stackexchange.com/questions/15961/coreutils-that-are-utf-aware
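
For the LC_ALL=C trick in the remarks above, a per-command assignment avoids having to set the variable back afterwards. A minimal sketch; c_sort is a made-up wrapper name used only for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper: run sort in the C locale for this one command only.
# The VAR=value prefix applies to that single external command, so the
# shell's own locale settings stay as they were.
c_sort() {
    LC_ALL=C sort "$@"
}

# In the C locale lines are compared byte by byte, so 'B' (0x42)
# sorts before 'a' (0x61).
printf 'b\na\nB\n' | c_sort
```

The same prefix form works for a single grep or awk call; for a whole section of a script, running that section in a subshell with LC_ALL=C exported is one way to keep the change contained.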

Thus hoping to contribute to the discussion although I do not know your
language.

Best regards.


MochiMoppel
Posted: Mon 19 Jun 2017, 08:01

OK, let's change "can't handle Unicode" to "can't handle Unicode reliably". Does this make it any better?

UTF-8 doesn't care if you use French, Greek, Russian or Japanese. What makes the difference is the number of bytes it uses to represent each character set. For French you never needed Unicode. Extended ASCII could handle it and coreutils should have no problems with French even if it is not UTF-8 aware.

UTF-8 includes the basic 128 ASCII characters (1 byte per character), all of the former extended ASCII variants (incl. French!) and some more (2 bytes per character), all kinds of symbols and, from a Western point of view, "exotic" languages like Korean or Japanese (3 bytes per character), and lastly there are even 4-byte characters, e.g. less frequently used Japanese Kanji. I expect a text manipulating tool to handle all of these characters.
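
These byte lengths can be checked from the shell. A minimal sketch, assuming the script itself is saved as UTF-8; utf8_bytes is a hypothetical helper and the four sample characters are arbitrary picks, one per length class:

```shell
#!/bin/sh
# Hypothetical helper: number of bytes in the UTF-8 encoding of STRING.
# wc -c counts bytes whatever the locale; tr strips the padding that
# some wc implementations put in front of the number.
utf8_bytes() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# ASCII letter, Cyrillic letter, symbol, rare Kanji outside the BMP.
for ch in 'a' 'ж' '☠' '𠮷'; do
    printf '%s: %s byte(s)\n' "$ch" "$(utf8_bytes "$ch")"
done
```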

fmt handles only 1 and 2-byte characters flawlessly. Take your example and change only 1 character to a symbol and you already might end up with an unexpected result.
Code:
# echo "Des vents très violents , possiblement une tornade, ont complètement détruit une résidence de la municipalité" | fmt -w 53
Des vents très violents , possiblement une tornade,
ont complètement détruit une résidence de la
municipalité

# echo "Des vents très violents , possiblement une t☠rnade, ont complètement détruit une résidence de la municipalité" | fmt -w 53
Des vents très violents , possiblement une
t☠rnade, ont complètement détruit une résidence
de la municipalité


Now, to end the discussion about fmt, here is another reason why I need fold and not fmt: fmt only wraps at spaces. Doesn't help me since Japanese text doesn't include space characters.

musher0
Posted: Mon 19 Jun 2017, 12:35

Hello, MochiMoppel.

Many thanks for the detailed explanation of utf8.
I learned something today.

Again a couple of thoughts:
-- I don't think of Japanese as "exotic", only different. Different Civilizations
make this Planet richer.

-- I believe that you should bring this 3-byte-character bug to the attention of
the authors of coreutils fold at the GNU project. It seems obvious that
they don't have testers for the Japanese language whereas the BusyBox
people do.

BFN.


misko_2083
Posted: Mon 19 Jun 2017, 12:40

In Python that would be trivial with the textwrap library.

Code:
#!/usr/bin/env python3
# Python 3: str is already Unicode, so no .decode('utf8') is needed.

import textwrap

strs = "☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠"

# fill() wraps by counting characters (code points), not bytes.
print(textwrap.fill(strs, 5))


So, there are 3 bytes per character
Code:
printf "%s" "☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠" | hexdump
0000000 98e2 e28e 8f98 98e2 e294 9598 98e2 e296
0000010 9798 98e2 e298 9998 98e2 e29a 9b98 98e2
0000020 e29c 9d98 98e2 e29e 9f98 98e2 00a0     
000002d
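
Note that hexdump's default output groups the input into 16-bit words, which on a little-endian machine shows each pair of bytes swapped (98e2 stands for the bytes e2 98). A small sketch using od to list the bytes in their real order instead; utf8_hex is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of STRING as hex, one byte per
# group, in the order they occur in the string.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1
}

# Two 3-byte characters: e2 98 8e for the first, e2 98 8f for the second.
utf8_hex '☎☏'
```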


fold from coreutils is splitting between the bytes inside the characters.
Code:
printf "%b" "☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠ " | strace -q -e write fold -w 5
write(1, "\342\230\216\342\230\n", 6☎
)   = 6
write(1, "\217\342\230\224\342\n", 6��
)   = 6
write(1, "\230\225\342\230\226\n", 6��☖
)   = 6
write(1, "\342\230\227\342\230\n", 6☗
)   = 6
write(1, "\230\342\230\231\342\n", 6�☙
)   = 6
write(1, "\230\232\342\230\233\n", 6��☛
)   = 6
write(1, "\342\230\234\342\230\n", 6☜
)   = 6
write(1, "\235\342\230\236\342\n", 6�☞
)   = 6
write(1, "\230\237\342\230\240\n", 6��☠
)   = 6
write(1, " ", 1 )                        = 1
+++ exited with 0 +++

5 bytes + newline = 6
So it works if you fold at a multiple of 3 (3, 6, 9, 12, 15, ...), but add a single 1-byte character (a letter or a space) and it makes a mess of the rest of the text.

busybox fold works as expected

By the way, wc -c behaves the same in coreutils and busybox: it counts only bytes. :)
Code:

foo='☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠ abcd'

for (( i=0; i<${#foo}; i++ ))
  do bar+="${foo:$i:1}"
      echo $i \"${foo:$i:1}\"
  done

printf "wc -c: "
echo -e "$bar" | busybox wc -c
bar=""

#Output -->
0 "☎"
1 "☏"
2 "☔"
3 "☕"
4 "☖"
5 "☗"
6 "☘"
7 "☙"
8 "☚"
9 "☛"
10 "☜"
11 "☝"
12 "☞"
13 "☟"
14 "☠"
15 " "
16 "a"
17 "b"
18 "c"
19 "d"
wc -c: 51
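
For what it's worth, -c is wc's byte-count option, so 51 is the expected answer from any implementation; characters are requested with -m, and that result depends on the locale. A minimal sketch; count_bytes and count_chars are hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helpers wrapping wc; tr strips the padding that some wc
# implementations put in front of the number.
count_bytes() { printf '%s\n' "$1" | wc -c | tr -d ' '; }
count_chars() { printf '%s\n' "$1" | wc -m | tr -d ' '; }

foo='☎☏☔☕☖☗☘☙☚☛☜☝☞☟☠ abcd'
# 15 symbols x 3 bytes + 5 ASCII bytes + 1 newline = 51
count_bytes "$foo"
# Character count as seen by the current locale
count_chars "$foo"
```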