sed scratch pad -- A thread of sed examples

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#41 Post by jamesbond »

sc0ttman wrote:...I really need to go learn how sed actually works.. :oops:
Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later :D

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#42 Post by sc0ttman »

MochiMoppel wrote:
sc0ttman wrote:It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..
I have no clue what you are talking about. What trailing spaces?
Trailing spaces in the markdown from which the HTML page is generated.
MochiMoppel wrote:
..Anyway, I spotted other issues with Burunduk's code yesterday (not stripping whitespace inside <a> tags), but I can live with it as
You mean the 10 spaces between consecutive <a> tags?

Code: Select all

          <a href="/mdsh/tags/seo.html">seo</a>,
          <a href="/mdsh/tags/shell.html">shell</a>,
          <a href="/mdsh/tags/xml.html">xml</a>,
This looks like garbage and is not removed because Burunduk may have tried to guess your requirements from your first post. Your original script (s@>\s*<@><@g) was designed to remove pure whitespace between tags, i.e. spaces, tabs or linefeeds, and no other characters. You said that this is what you want, except that you don't want to apply it to <pre> tags. This is basically what Burunduk delivered. As soon as you put any other character between tags, even a single comma, nothing is or should be removed. Apart from the funny <a> tag spacing there is more questionable code, e.g. the seemingly useless '<span></span>' combos, that could be removed.
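A minimal sketch of that point, assuming GNU sed (for \s) and two made-up one-line inputs that are not taken from the mdsh output: the pattern only collapses a gap that is pure whitespace, so a single comma between two tags is enough to stop it.

```shell
# Hypothetical sample lines; the variable names are for illustration only.
gap_only=$(printf '%s\n' '<li>a</li>   <li>b</li>' | sed 's@>\s*<@><@g')
with_comma=$(printf '%s\n' '<a>x</a>,   <a>y</a>' | sed 's@>\s*<@><@g')

echo "$gap_only"     # the whitespace between the two tags is removed
echo "$with_comma"   # the comma blocks the match, so the line is left as it was
```

Because sed works line by line, linefeeds between tags are only seen by this pattern if the whole file is first read into the pattern space, as the other scripts in this thread do with ':a;$!{N;ba;}'.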

Wouldn't it be much more effective if you clean the HTML code first?
I have updated some templates, so the whitespace inside various elements is no longer there, pre-minification,
and I have added a few sed lines of my own to remove empty tags (which are generated by Pygments)...

See, the HTML file you're looking at is the result of (more or less):

.mdsh file -> pre-markdown-parser -> mustache templater -> markdown-parser -> pygments -> minifier -> HTML file

..so the odd bit of formatting weirdness can creep in during conversion..

-----

About CSS minification, I found this: https://www.tero.co.uk/scripts/minify.php

Code: Select all

sed -e "s|/\*\(\\\\\)\?\*/|/~\1~/|g" -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" \
  -e "s|\([^:/]\)//.*$|\1|" -e "s|^//.*$||" | tr '\n' ' ' | \
  sed -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" -e "s|/\~\(\\\\\)\?\~/|/*\1*/|g" \
  -e "s|\s\+| |g" -e "s| \([{;:,]\)|\1|g" -e "s|\([{;:,]\) |\1|g"
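A usage sketch for the pipeline above, assuming GNU sed and a made-up four-line stylesheet. The function name minify_css is hypothetical; the body is the quoted pipeline unchanged.

```shell
# Wrap the quoted pipeline in a function (minify_css is a made-up name).
minify_css() {
  sed -e "s|/\*\(\\\\\)\?\*/|/~\1~/|g" -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" \
    -e "s|\([^:/]\)//.*$|\1|" -e "s|^//.*$||" | tr '\n' ' ' | \
    sed -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" -e "s|/\~\(\\\\\)\?\~/|/*\1*/|g" \
    -e "s|\s\+| |g" -e "s| \([{;:,]\)|\1|g" -e "s|\([{;:,]\) |\1|g"
}

# Hypothetical input: one comment and one rule spread over three lines.
css_out=$(printf '%s\n' '/* comment */' 'body {' '  color: red;' '}' | minify_css)
echo "$css_out"   # the rule ends up on one line, comment gone
```

The result may keep a leading or trailing space, since tr turns every newline (including the first and last) into a space before the second sed squeezes them.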

sc0ttman
Posts: 2812
Joined: Wed 16 Sep 2009, 05:44
Location: UK

#43 Post by sc0ttman »

jamesbond wrote:
sc0ttman wrote:...I really need to go learn how sed actually works.. :oops:
Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later :D
And I'm still reading it... :D

Burunduk
Posts: 80
Joined: Sun 21 Aug 2011, 21:44

#44 Post by Burunduk »

@jamesbond

What you call a defeat looks like a win to me. After some modification your regex works as expected:

Code: Select all

sed -r -e ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;' test.html
It still has nested greedy quantifiers, but for common html files that shouldn't be a problem.
Edit: Fixed a typo. I typed over earlier code and left an unnecessary part of it. Now it really works.

@MochiMoppel

It looks like you have found a beast that formally ruins the concept of nested comments:

<!-->

A simple question:

<!--> Are these two comments <!--> nested <!--> or side by side? <!-->

My editor thinks they are side by side. Firefox sees four empty comments and adds two dashes to each. It's necessary to change the delimiters or disallow this hybrid in order to support nesting.

Also, removing stray delimiters may insert a part of a comment into a document:

use<!--it carefully more or -->less-->

Maybe just report an error? sed's w command can do it.
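One way to read that suggestion, as a sketch only: strip comments with the regex posted above, then use w to copy any line that still holds a delimiter into a log. The two sample lines and the mktemp log file are assumptions; GNU sed is assumed for -r.

```shell
log=$(mktemp)   # hypothetical log file for suspicious lines

# First -e removes well-formed comments, second -e writes every line that
# still contains <!-- or --> to the log. Normal output is not affected.
cleaned=$(printf '%s\n' 'use<!--it carefully more or -->less-->' 'plain <!-- ok --> text' |
  sed -r -e 's/<!--(-?[^-]|--+[^>-])*-*>//g' -e "/<!--|-->/w $log")

echo "$cleaned"
flagged=$(cat "$log")
echo "flagged: $flagged"
rm -f "$log"
```

Only the first sample line is flagged, because a stray --> is left behind after its comment is removed. Deleting that leftover delimiter would glue "use" and "less" together, which is the risk described above.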
MochiMoppel wrote:With Burunduk's "superfluous" code it will even create an infinite loop and will not work at all.
¡Vaya! An infinite loop! /!--/ was supposed to deal with it but...
Ok, won't loop now.

Code: Select all

sed ':a;$!{N;ba;};:c;/<!--/s/-->/&&/;tb;:b;s/<!--.*-->-->//;tc' test.html
@sc0ttman

A quick fix.
Edit: Now it changes newlines to spaces except between > and <. Note that this may spoil inline js. Nested <pre> are not supported unless they are nested this way: (((...)))

Code: Select all

sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
  | sed -nr '/(^<pre|<\/pre>$)/!{s/(@n)+/@n/g;s/>\s*(@n)?\s*</></g;s/@n/ /g;
  s/\s+/ /g; # squeeze consecutive whitespaces everywhere, delete this line if unneeded
  };H;${g;s/\n//g;s/@n/\n/g;s/@a/@/g;p}' >min.html

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#45 Post by MochiMoppel »

Burunduk wrote:@jamesbond
What you call a defeat looks like a win to me. After some modification your regex works as expected
I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong. The green part in this example is what I would consider as comments:
  • Lucky <!-- Hello <!-- Dolly --> World --> Strike
    Nice <!--> Weather -->
    Good <!----> Morning -->
And this is what Geany or even w3schools flag as comments in their syntax highlighting:
  • Lucky <!-- Hello <!-- Dolly --> World --> Strike
    Nice <!--> Weather -->
    Good <!----> Morning -->
And finally, this is what browsers treat as comments:
  • Lucky <!-- Hello <!-- Dolly --> World --> Strike
    Nice <!--> Weather -->
    Good <!----> Morning -->
The HTML gods at w3.org give no clear answer. They don't forbid nesting, but the browsers treat comment nesting differently from <div> or <table> nesting. And if comments must open with '<!--' and close with '-->' then I would expect that constructs like '<!-->' or '<!--->' are not comments, but all browsers disagree with me :cry:
Attachment: w3schools.png
w3schools' editor and rendering engine have different opinions.
(Please disregard my sloppy break tags. Should be <br> or <br />)

Burunduk
Posts: 80
Joined: Sun 21 Aug 2011, 21:44

#46 Post by Burunduk »

MochiMoppel wrote:I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong.
I think this game is a puzzle not a competition. Everyone who solves it is a winner (unless you look at this philosophically).
MochiMoppel wrote:The HTML gods at w3.org give no clear answer.
The specification says:
12.1.6 Comments

Comments must have the following format:

The string "<!--".
Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".
The string "-->".

The text is allowed to end with the string "<!", as in <!--My favorite operators are > and <!-->.
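The quoted rules can be sketched as a small shell check. This is only an illustration: the function name is_valid_comment_text is made up, and it tests the text between "<!--" and "-->", not a whole document.

```shell
# Returns 0 if the text between <!-- and --> satisfies the quoted rules.
is_valid_comment_text() {
  case $1 in
    '>'* | '->'*)                   return 1 ;;  # must not start with > or ->
    *'<!--'* | *'-->'* | *'--!>'*)  return 1 ;;  # must not contain these strings
    *'<!-')                         return 1 ;;  # must not end with <!-
  esac
  return 0
}

is_valid_comment_text ' Hello '                            && echo "valid"
is_valid_comment_text 'My favorite operators are > and <!' && echo "valid"
is_valid_comment_text '> Weather '                         || echo "invalid"
is_valid_comment_text ' Hello <!-- Dolly '                 || echo "invalid"
```

By these rules neither "<!--> Weather -->" nor a nested "<!-- Hello <!-- Dolly -->" is a well-formed comment, so tools are left to decide for themselves how to recover.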
The regex from my previous post already behaves like Geany:

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;'
Lucky  World --> Strike
Nice 
Good  Morning -->
This modification follows the browsers:

Code: Select all

sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;'
Lucky  World --> Strike
Nice  Weather -->
Good  Morning -->
We can also try to handle the situation when a large block of html is commented out. This is tricky. The regex looks write-only (resembles BF really).
Known limitations: a nested opening comment delimiter <!-- is not recognized if it immediately follows <!- or <!-- , <!--- and so on. <!--> and <!---> are considered to be empty comments if outermost and closing comment delimiters if nested. To be continued :)

Code: Select all

sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!-]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc' test.html

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc'
Lucky  Strike
Nice  Weather -->
Good  Morning -->
Finally, an easy solution:
Keef wrote:Using your example, the output I get is:

Code: Select all

# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

...
Added "s" and replaced "r" with "R". Seems to work.

Code: Select all

ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html
ssed is available from the ubuntu repo.

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#47 Post by s243a »

Burunduk wrote:
Finally, an easy solution:
Keef wrote:Using your example, the output I get is:

Code: Select all

# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

...
Added "s" and replaced "r" with "R". Seems to work.

Code: Select all

ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html
ssed is available from the ubuntu repo.
I wonder what the differences between ssed and sed are. This feature looks quite useful:
\cregexpc
Match lines matching the regular expression regexp. The c may be any character.

Burunduk
Posts: 80
Joined: Sun 21 Aug 2011, 21:44

#48 Post by Burunduk »

ssed man page wrote:\cregexpc
Match lines matching the regular expression regexp. The c may be any character.
This is a standard sed feature. Normally, a regex (in s command or in an address) is delimited by / but you can use any other character. In an address the first delimiter must be escaped:

Code: Select all

sed '\%regexp_here%s!another_regexp!replacement!g'
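A runnable version of the template above, with made-up paths as input. Using % for the address and ! for the s command avoids having to escape the slashes in the paths.

```shell
# Address delimited by % (note the leading backslash), s command delimited by !
paths=$(printf '%s\n' '/usr/bin/sed' '/usr/lib/x' | sed '\%/bin/%s!/usr!/opt!')
echo "$paths"   # only the line containing /bin/ is rewritten
```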
I don't know much about ssed either. It's an old program but I found it only yesterday. Here is some info about its features. I see that I need to change my example:

Code: Select all

ssed -Re ':a;$!{N;ba;};s/<!--.*?-->//Sg' test.html
Its PCRE support seems to be the most notable difference.

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#49 Post by s243a »

Burunduk wrote: I don't know much about ssed either. ...
Its PCRE support seems to be the most notable difference.
Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
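On the second worry, a small sketch with GNU sed and made-up input: groups are numbered by the position of their opening parenthesis, so an optional group that takes no part in the match leaves its backreference empty and the later groups keep their numbers.

```shell
# Same expression on two inputs; group 1 is optional.
both=$(echo 'ABC320' | sed -r 's/([A-Z]+)?([0-9]+)/[\1][\2]/')
digits_only=$(echo '320' | sed -r 's/([A-Z]+)?([0-9]+)/[\1][\2]/')

echo "$both"          # group 1 holds the letters, group 2 the digits
echo "$digits_only"   # group 1 is empty, the digits are still in group 2
```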

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#50 Post by MochiMoppel »

Burunduk wrote:The specification says:
Thanks. This is what I was looking for and didn't find. Very clear indeed.
Burunduk wrote:This modification follows the browsers:

Code: Select all

sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html
That's great! I tried something like this and gave up after an hour. My brain is not built for those oneliners :lol:
s243a wrote:Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
I'm not sure if I understand what you mean by "assign names". An example would be nice. Subexpressions and back-references can be tricky and may give unexpected results, but IMO this is almost always the result of a wrong regex. Here are some examples and their partly pretty astonishing output (look at the very first case):

Code: Select all

#!/bin/bash
for STRING in ABC320 320 ;do
echo "+++++ Input string $STRING +++++++"
echo "$STRING" | sed -r 's/([A-Z])([0-9])/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/.*([A-Z]).*([0-9])/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z]*)([0-9]*)/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z])*([0-9])*/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z]).*([0-9])*/TXT:\1\tNUMBER:\2X/' 
done
Result:

Code: Select all

+++++ Input string ABC320 +++++++
ABTXT:C NUMBER:3X20
TXT:C   NUMBER:0X
TXT:ABC NUMBER:320X
TXT:C   NUMBER:0X
TXT:A   NUMBER:X
+++++ Input string 320 +++++++
320
320
TXT:    NUMBER:320X
TXT:    NUMBER:0X
320
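The first result is less surprising once the matched portion is made visible. A sketch using & (the whole match) instead of the backreferences, assuming GNU sed:

```shell
# Without anchors or .* the regex matches only "C3"; the rest of the line is
# copied through untouched, which is where the leading "AB" and trailing "20" come from.
first=$(echo 'ABC320' | sed -r 's/([A-Z])([0-9])/[&]/')
# With .* in front, the match starts at the beginning of the line.
second=$(echo 'ABC320' | sed -r 's/.*([A-Z]).*([0-9])/[&]/')

echo "$first"
echo "$second"
```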

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#51 Post by s243a »

MochiMoppel wrote:
s243a wrote:Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
I'm not sure if I understand what you mean by "assign names". An example would be nice. Subexpressions and back-references can be tricky and may give unexpected results, but IMO this is almost always the result of a wrong regex. Here are some examples and their partly pretty astonishing output (look at the very first case):

Code: Select all

#!/bin/bash
for STRING in ABC320 320 ;do
echo "+++++ Input string $STRING +++++++"
echo "$STRING" | sed -r 's/([A-Z])([0-9])/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/.*([A-Z]).*([0-9])/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z]*)([0-9]*)/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z])*([0-9])*/TXT:\1\tNUMBER:\2X/' 
echo "$STRING" | sed -r 's/([A-Z]).*([0-9])*/TXT:\1\tNUMBER:\2X/' 
done
Result:

Code: Select all

+++++ Input string ABC320 +++++++
ABTXT:C NUMBER:3X20
TXT:C   NUMBER:0X
TXT:ABC NUMBER:320X
TXT:C   NUMBER:0X
TXT:A   NUMBER:X
+++++ Input string 320 +++++++
320
320
TXT:    NUMBER:320X
TXT:    NUMBER:0X
320
Here's the syntax

Code: Select all

 (?<name>group) 
https://www.regular-expressions.info/named.html

and it may be referenced as follows:

Code: Select all

    \g{name}  [5]  Named backreference
    \k<name>  [5]  Named backreference
https://perldoc.perl.org/perlre.html

also in perl you can do the following:

Code: Select all

 "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
According to the first link, named capture groups (and back references) were added in Perl 5.10. So the question is whether the regexes in ssed are compatible with Perl 5.10 and later.

As for examples I'll try to think of some. I had some cases before where I want to use this.

One application I see is when you are defining variables in a script:

Code: Select all

x=1
y=2
In this case if I want to pass the values of x and y to an external function in ssed then it would be more readable if my capture groups were named 'x' and 'y'.

P.S. The syntax for named capture groups is slightly different in Python:

Code: Select all

(?P<name>group)

s243a
Posts: 2580
Joined: Tue 02 Sep 2014, 04:48
Contact:

#52 Post by s243a »

Here's another useful construct of perl compatible regular expressions:
Perl 5 introduced a much richer regex engine, which is hence standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:

Code: Select all

perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
https://stackoverflow.com/a/38220317
