sed scratch pad -- A thread of sed examples - Page 3 - (old)Puppy Linux Discussion Forum


jamesbond: Posts: 3433; Joined: Mon 26 Feb 2007, 05:02; Location: The Blue Marble

#41 Post by jamesbond » Mon 20 Jan 2020, 15:09

sc0ttman wrote:...I really need to go learn how sed actually works..

Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later.

Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

sc0ttman: Posts: 2812; Joined: Wed 16 Sep 2009, 05:44; Location: UK

#42 Post by sc0ttman » Mon 20 Jan 2020, 16:11

MochiMoppel wrote:
sc0ttman wrote:It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..
I have no clue what you are talking about. What trailing spaces?

Trailing spaces in the markdown from which the HTML page is generated.

MochiMoppel wrote:
sc0ttman wrote:..Anyway, yesterday I spotted other issues in Burunduk's code (not stripping whitespace inside <a> tags), but I can live with it as
You mean the 10 spaces between consecutive <a> tags?
Code: Select all
          <a href="/mdsh/tags/seo.html">seo</a>,
          <a href="/mdsh/tags/shell.html">shell</a>,
          <a href="/mdsh/tags/xml.html">xml</a>,
This looks like garbage and is not removed because Burunduk may have tried to guess your requirements from your first post. Your original script (s@>\s*<@><@g) was designed to remove pure whitespace between tags, i.e. spaces, tabs or linefeeds, and no other characters. You said that this is what you want, except that you don't want to apply it to <pre> tags. This is basically what Burunduk delivered. As soon as you put any other character between tags, even a single comma, nothing is or should be removed. Apart from the funny <a> tag spacings there is more questionable code, e.g. the seemingly useless '<span></span>' combos, that could be removed.
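
A minimal sketch of the behaviour described above, assuming GNU sed (\s is a GNU extension). The function name strip_gaps and the sample lines are made up for illustration. It shows that the expression s@>\s*<@><@g only closes gaps that consist purely of whitespace:

```shell
# Collapse whitespace-only gaps between a closing '>' and the next '<'.
strip_gaps() {
  sed 's@>\s*<@><@g'
}

# First line: only spaces between the tags, so the gap is removed.
# Second line: a comma sits between the tags, so nothing changes.
printf '%s\n' '<li>a</li>   <li>b</li>' '<a>x</a>,   <a>y</a>' | strip_gaps
# prints:
# <li>a</li><li>b</li>
# <a>x</a>,   <a>y</a>
```

To also swallow the comma case, the pattern itself would have to allow the comma; that is a change of requirements rather than a bug in the expression.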

Wouldn't it be much more effective if you clean the HTML code first?

I have updated some templates, so the whitespace inside various elements is no longer there, pre-minification,
and I have added a few sed lines of my own to remove empty tags (which are generated by Pygments)...

See, the HTML file you're looking at is the result of (more or less):

.mdsh file -> pre-markdown-parser -> mustache templater -> markdown-parser -> pygments -> minifier -> HTML file

..so the odd bit of formatting weirdness can creep in during conversion..
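
The sed lines used for the empty-tag cleanup are not shown in this thread. As a rough sketch only, assuming the empty tags are <span> pairs with nothing between them, something like this would remove them (drop_empty_spans and the sample input are hypothetical):

```shell
# Delete <span ...></span> pairs that enclose nothing at all.
# Spans with content, e.g. <span class="k">if</span>, are left alone.
drop_empty_spans() {
  sed -E 's@<span[^>]*></span>@@g'
}

printf '%s\n' '<pre><span class="k">if</span><span></span> x</pre>' | drop_empty_spans
# prints:
# <pre><span class="k">if</span> x</pre>
```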

-----

About CSS minification, I found this: https://www.tero.co.uk/scripts/minify.php

Code: Select all

sed -e "s|/\*\(\\\\\)\?\*/|/~\1~/|g" -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" \
  -e "s|\([^:/]\)//.*$|\1|" -e "s|^//.*$||" | tr '\n' ' ' | \
  sed -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" -e "s|/\~\(\\\\\)\?\~/|/*\1*/|g" \
  -e "s|\s\+| |g" -e "s| \([{;:,]\)|\1|g" -e "s|\([{;:,]\) |\1|g"
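
To see what the last three expressions of that pipeline do on their own, here is a cut-down sketch (GNU sed assumed; css_squeeze is a made-up name and the comment-stripping parts are deliberately left out):

```shell
# Whitespace steps only: join lines, squeeze runs of whitespace,
# then drop the single space before and after { ; : ,
css_squeeze() {
  tr '\n' ' ' | sed -e 's|\s\+| |g' -e 's| \([{;:,]\)|\1|g' -e 's|\([{;:,]\) |\1|g'
}

printf 'a { color : red ; }' | css_squeeze
# prints: a{color:red;}
```

Note that '}' is not in the bracket list, so a space before a closing brace is only removed when it follows one of the listed characters.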

[b][url=https://bit.ly/2KjtxoD]Pkg[/url], [url=https://bit.ly/2U6dzxV]mdsh[/url], [url=https://bit.ly/2G49OE8]Woofy[/url], [url=http://goo.gl/bzBU1]Akita[/url], [url=http://goo.gl/SO5ug]VLC-GTK[/url], [url=https://tiny.cc/c2hnfz]Search[/url][/b]

sc0ttman: Posts: 2812; Joined: Wed 16 Sep 2009, 05:44; Location: UK

#43 Post by sc0ttman » Mon 20 Jan 2020, 16:12

jamesbond wrote:
sc0ttman wrote:...I really need to go learn how sed actually works..
Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later

And I'm still reading it...

Burunduk: Posts: 80; Joined: Sun 21 Aug 2011, 21:44

#44 Post by Burunduk » Mon 20 Jan 2020, 21:50

@jamesbond

What you call a defeat looks like a win to me. After some modification your regex works as expected:

Code: Select all

sed -r -e ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;' test.html

It still has nested greedy quantifiers, but for common html files that shouldn't be a problem.
Edit: Fixed a typo. I typed over a previous code and left an unnecessary part of it. Now really works.
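
For anyone wondering about the ':a;$!{N;ba;}' prefix: it keeps appending the next line to the pattern space until the last line is reached, so the substitution sees the whole file at once and can match comments that span lines. A small sketch with made-up input (strip_comments is just a wrapper name, GNU sed assumed):

```shell
# Slurp the whole input, then remove comments with the regex from the post above.
strip_comments() {
  sed -r ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g'
}

# The comment starts on the first line and ends on the second.
printf '%s\n' 'a <!-- one' 'two --> b' | strip_comments
# prints: a  b
```

Without the slurp prefix sed would process the two lines separately and neither would match.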

@MochiMoppel

It looks like you have found a beast that formally ruins the concept of nested comments:

 Are these two comments  or side by side? less-->

Maybe just report an error? sed's w command can do it.

MochiMoppel wrote:With Burunduk's "superfluous" code it will even create an infinite loop and will not work at all.

¡Vaya! An infinite loop! /!--/ was supposed to deal with it but...
Ok, won't loop now.

Code: Select all

sed ':a;$!{N;ba;};:c;/<!--/s/-->/&&/;tb;:b;s/<!--.*-->-->//;tc' test.html

@sc0ttman

A quick fix.
Edit: Now it changes newlines to spaces except between > and <. Note that this may spoil inline js. Nested <pre> are not supported unless they are nested this way: (((...)))

Code: Select all

sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
  | sed -nr '/(^<pre|<\/pre>$)/!{s/(@n)+/@n/g;s/>\s*(@n)?\s*</></g;s/@n/ /g;
  s/\s+/ /g; # squeeze consecutive whitespaces everywhere, delete this line if unneeded
  };H;${g;s/\n//g;s/@n/\n/g;s/@a/@/g;p}' >min.html
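
The '@a' and '@n' placeholders above are an escaping trick: every literal @ is first turned into @a, after which @n is free to stand for a newline without clashing with anything in the input. A stripped-down sketch of just that round trip (encode and decode are made-up helper names, GNU sed assumed):

```shell
# Encode: slurp the input, escape '@' as '@a', then turn newlines into '@n'.
encode() { sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g'; }

# Decode: reverse the two steps in the opposite order.
decode() { sed 's/@n/\n/g;s/@a/@/g'; }

# The input contains a literal "@n" on its second line.
printf 'a@b\nc@nd\n' | encode
# prints: a@ab@nc@and
printf 'a@b\nc@nd\n' | encode | decode
# prints the original two lines again
```

Because the literal "@n" becomes "@an" during encoding, it is not mistaken for a newline marker on the way back.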

MochiMoppel: Posts: 2084; Joined: Wed 26 Jan 2011, 09:06; Location: Japan

#45 Post by MochiMoppel » Wed 22 Jan 2020, 06:12

Burunduk wrote:@jamesbond
What you call a defeat looks like a win to me. After some modification your regex works as expected

I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong. The green part in this example is what I would consider as comments:

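Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->

[The colour highlighting that marked the comment parts is not preserved in this text copy; all three examples use the same source lines.]

And this is what Geany or even w3schools flag as comments in their syntax highlighting:

Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->

And this is what browsers finally treat as comments:

Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->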

The HTML gods at w3.org give no clear answer. They don't forbid nesting, but the browsers treat comment nesting differently from <div> or <table> nesting. And if comments must open with '' then I would expect that constructs like '' are not comments, but all browsers disagree with me.

Attachments

w3schools.png: w3schools' editor and rendering engine have different opinions.
(Please disregard my sloppy break tags. Should be <br> or <br />)

Burunduk: Posts: 80; Joined: Sun 21 Aug 2011, 21:44

#46 Post by Burunduk » Sat 25 Jan 2020, 19:35

MochiMoppel wrote:I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong.

I think this game is a puzzle not a competition. Everyone who solves it is a winner (unless you look at this philosophically).

MochiMoppel wrote:The HTML gods at w3.org give no clear answer.

The specification says:

12.1.6 Comments

Comments must have the following format:

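1. The string "<!--".
2. Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".
3. The string "-->".

The text is allowed to end with the string "<!", as in <!--My favorite operators are > and <!-->.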

The regex from my previous post already behaves like Geany:

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;'
Lucky  World --> Strike
Nice 
Good  Morning -->

This modification follows the browsers:

Code: Select all

sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;'
Lucky  World --> Strike
Nice  Weather -->
Good  Morning -->

We can also try to handle the situation when a large block of html is commented out. This is tricky. The regex looks write-only (resembles BF really).
Known limitations: a nested opening comment delimiter  and <!---> are considered to be empty comments if outermost and closing comment delimiters if nested. To be continued

Code: Select all

sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!-]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc' test.html

Code: Select all

# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc'
Lucky  Strike
Nice  Weather -->
Good  Morning -->

Finally, an easy solution:

Keef wrote:Using your example, the output I get is:
Code: Select all
# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

...

Added "s" and replaced "r" with "R". Seems to work.

Code: Select all

ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html

ssed is available from the ubuntu repo.

s243a: Posts: 2580; Joined: Tue 02 Sep 2014, 04:48

#47 Post by s243a » Sat 25 Jan 2020, 20:06

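Burunduk wrote:
Finally, an easy solution:
Keef wrote:Using your example, the output I get is:
Code: Select all
# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
...
Added "s" and replaced "r" with "R". Seems to work.
Code: Select all
ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html
ssed is available from the ubuntu repo.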

I wonder what the differences between ssed and sed are. This feature looks quite useful:

\cregexpc
Match lines matching the regular expression regexp. The c may be any character.

Cool. What are the differences between ssed and sed?

Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].

Burunduk: Posts: 80; Joined: Sun 21 Aug 2011, 21:44

Quote

#48 Post by Burunduk » Sat 25 Jan 2020, 20:59

ssed man page wrote:\cregexpc
Match lines matching the regular expression regexp. The c may be any character.

This is a standard sed feature. Normally, a regex (in s command or in an address) is delimited by / but you can use any other character. In an address the first delimiter must be escaped:

Code: Select all

sed '\%regexp_here%s!another_regexp!replacement!g'
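
A concrete case where this pays off is text full of slashes. The sketch below uses % for the address and ! for the s command so that no slash needs escaping (fix_paths and the sample paths are made up for illustration):

```shell
# Address: lines starting with /usr/ (delimiter %, introduced by a backslash).
# Command: replace /bin with /local/bin (delimiter !, no backslash needed).
fix_paths() {
  sed '\%^/usr/%s!/bin!/local/bin!'
}

printf '%s\n' '/usr/bin' '/opt/bin' | fix_paths
# prints:
# /usr/local/bin
# /opt/bin
```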

I don't know much about ssed either. It's an old program but I found it only yesterday. Here is some info about its features. I see that I need to change my example:

Code: Select all

ssed -Re ':a;$!{N;ba;};s/<!--.*?-->//Sg' test.html

Its PCRE support seems to be the most notable difference.

s243a: Posts: 2580; Joined: Tue 02 Sep 2014, 04:48

#49 Post by s243a » Sun 26 Jan 2020, 05:44

Burunduk wrote: I don't know much about ssed too. ...
Its PCRE support seems to be the most notable difference.

Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
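
As a small check of the numbering concern, assuming GNU sed: groups are numbered by the position of their opening parenthesis, so an optional group that does not take part in the match still keeps its number and simply expands to an empty string. The wrapper name show_groups and the inputs are made up:

```shell
# Group 1 is optional; groups 2 and 3 keep their numbers either way.
show_groups() {
  sed -E 's/^(x)?(a)(b)$/1=[\1] 2=[\2] 3=[\3]/'
}

printf '%s\n' 'ab' 'xab' | show_groups
# prints:
# 1=[] 2=[a] 3=[b]
# 1=[x] 2=[a] 3=[b]
```

The limit of nine comes from the back-reference syntax \1 to \9, so names would help readability more than they would fix ambiguity.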

MochiMoppel: Posts: 2084; Joined: Wed 26 Jan 2011, 09:06; Location: Japan

#50 Post by MochiMoppel » Mon 27 Jan 2020, 05:50

Burunduk wrote:The specification says:

Thanks. This is what I was looking for and didn't find. Very clear indeed.

Burunduk wrote:This modification follows the browsers:
Code: Select all
sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html

That's great! I tried something like this and gave up after an hour. My brain is not built for those oneliners

s243a wrote:Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).

I'm not sure if I understand what you mean by "assign names". An example would be nice. Subexpressions and back-references can be tricky and may give unexpected results, but IMO this is almost always the result of a wrong regex. Here are some examples and their partly pretty astonishing output (look at the very first case):

Code: Select all

#!/bin/bash
# Compare how different groupings and quantifiers fill \1 and \2.
for STRING in ABC320 320; do
    echo "+++++ Input string $STRING +++++++"
    echo "$STRING" | sed -r 's/([A-Z])([0-9])/TXT:\1\tNUMBER:\2X/'
    echo "$STRING" | sed -r 's/.*([A-Z]).*([0-9])/TXT:\1\tNUMBER:\2X/'
    echo "$STRING" | sed -r 's/([A-Z]*)([0-9]*)/TXT:\1\tNUMBER:\2X/'
    echo "$STRING" | sed -r 's/([A-Z])*([0-9])*/TXT:\1\tNUMBER:\2X/'
    echo "$STRING" | sed -r 's/([A-Z]).*([0-9])*/TXT:\1\tNUMBER:\2X/'
done

Result:

Code: Select all

+++++ Input string ABC320 +++++++
ABTXT:C NUMBER:3X20
TXT:C   NUMBER:0X
TXT:ABC NUMBER:320X
TXT:C   NUMBER:0X
TXT:A   NUMBER:X
+++++ Input string 320 +++++++
320
320
TXT:    NUMBER:320X
TXT:    NUMBER:0X
320
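
The first case surprises because the regex is unanchored: it matches only "C3" in the middle of the string and leaves the rest untouched. A sketch of an anchored variant, under the assumption that the whole string should be "letters followed by digits" (split_id is a made-up name; -E is the same as -r in GNU sed):

```shell
# Anchor both ends and require at least one letter and one digit.
split_id() {
  sed -E 's/^([A-Z]+)([0-9]+)$/TXT:\1 NUMBER:\2/'
}

printf '%s\n' 'ABC320' '320' | split_id
# prints:
# TXT:ABC NUMBER:320
# 320
```

The second input has no letters, so the pattern does not match and the line passes through unchanged instead of producing an empty TXT field.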

s243a: Posts: 2580; Joined: Tue 02 Sep 2014, 04:48

#51 Post by s243a » Mon 27 Jan 2020, 06:18

MochiMoppel wrote:
I'm not sure if I understand what you mean by "assign names". An example would be nice.

Here's the syntax

Code: Select all

 (?<name>group)

https://www.regular-expressions.info/named.html

and it may be referenced as follows:

Code: Select all

    \g{name}  [5]  Named backreference
    \k<name>  [5]  Named backreference

https://perldoc.perl.org/perlre.html

also in perl you can do the following:

Code: Select all

 "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is

According to the first link, named capture groups (and back-references) were added in Perl 5.10. So the question is whether the regexes in "ssed" are compatible with Perl 5.10 and later.

As for examples I'll try to think of some. I had some cases before where I want to use this.

One application, that I see is if you are defining variables in a script.

Code: Select all

x=1
y=2

In this case if I want to pass the values of x and y to an external function in ssed then it would be more readable if my capture groups were named 'x' and 'y'.

P.S. the syntax for named capture groups is slightly different in python. In python it is:

Code: Select all

(?P<name>group)

s243a: Posts: 2580; Joined: Tue 02 Sep 2014, 04:48

#52 Post by s243a » Sat 07 Mar 2020, 23:27

Here's another useful construct of perl compatible regular expressions:

Perl 5 introduced a much richer regex engine, which is hence standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:
Code: Select all
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").

https://stackoverflow.com/a/38220317
