sc0ttman wrote:
  ...I really need to go learn how sed actually works..
Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later.
sed scratch pad -- A thread of sed examples
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]
MochiMoppel wrote:
  sc0ttman wrote:
    It's prettier.js that removes the trailing spaces from the source Markdown.. I disabled it ..
  I have no clue what you are talking about. What trailing spaces?
Trailing spaces in the markdown from which the HTML page is generated.
MochiMoppel wrote:
  sc0ttman wrote:
    ..Anyway, I spotted other issues Burunduks code has yesterday (not stripping whitespace inside <a> tags), but I can live with it as
  You mean the 10 spaces between consecutive <a> tags?
  Code: Select all
  <a href="/mdsh/tags/seo.html">seo</a>, <a href="/mdsh/tags/shell.html">shell</a>, <a href="/mdsh/tags/xml.html">xml</a>,
  This looks like garbage and is not removed because Burunduk may have tried to guess your requirements from your first post. Your original script (s@>\s*<@><@g) was designed to remove pure whitespace between tags, i.e. spaces, tabs or linefeeds, no other characters. You said that this is what you want, except that you don't want to apply this to <pre> tags. This is basically what Burunduk delivered.. As soon as you put any other character between tags, even a single comma, nothing is or should be removed. Apart from the funny <a> tag spacings there is more questionable code, e.g. the seemingly useless '<span></span>' combos, that could be removed.
  Wouldn't it be much more effective if you clean the HTML code first?
I have updated some templates, so the whitespace inside various elements is no longer there, pre-minification,
and I have added a few sed lines of my own to remove empty tags (which are generated by Pygments)...
See, the HTML file you're looking at is the result of (more or less):
.mdsh file -> pre-markdown-parser -> mustache templater -> markdown-parser -> pygments -> minifier -> HTML file
..so the odd bit of formatting weirdness can creep in during conversion..
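To make MochiMoppel's point about the original script concrete, here is a minimal sketch, assuming GNU sed (for `\s`). Only the expression `s@>\s*<@><@g` comes from the thread; the function name `strip_between_tags` and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper wrapping the expression quoted in the thread.
strip_between_tags() {
    # \s is a GNU sed extension that matches any whitespace character
    sed 's@>\s*<@><@g'
}

# Pure whitespace between tags is removed.
printf '%s\n' '<li>a</li>   <li>b</li>' | strip_between_tags
# prints: <li>a</li><li>b</li>

# A comma between the tags means the pattern no longer matches,
# so the line comes out unchanged.
printf '%s\n' '<a href="/x">x</a>, <a href="/y">y</a>' | strip_between_tags
```

This is why the tag list with commas survives minification: the expression only ever matches `>` followed directly by whitespace and `<`.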
-----
About CSS minification, I found this: https://www.tero.co.uk/scripts/minify.php
Code: Select all
sed -e "s|/\*\(\\\\\)\?\*/|/~\1~/|g" -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" \
-e "s|\([^:/]\)//.*$|\1|" -e "s|^//.*$||" | tr '\n' ' ' | \
sed -e "s|/\*[^*]*\*\+\([^/][^*]*\*\+\)*/||g" -e "s|/\~\(\\\\\)\?\~/|/*\1*/|g" \
-e "s|\s\+| |g" -e "s| \([{;:,]\)|\1|g" -e "s|\([{;:,]\) |\1|g"
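The last three expressions of that pipeline do the actual whitespace squeezing. A small sketch of just that step, assuming GNU sed; the function name `squeeze_css` and the sample rule are hypothetical, and the comment-handling expressions are left out on purpose.

```shell
#!/bin/sh
# Only the three expressions are taken from the pipeline above.
squeeze_css() {
    # 1. collapse every run of whitespace into one space
    # 2. drop a space that precedes { ; : or ,
    # 3. drop a space that follows  { ; : or ,
    sed -e 's|\s\+| |g' \
        -e 's| \([{;:,]\)|\1|g' \
        -e 's|\([{;:,]\) |\1|g'
}

printf '%s\n' 'a   {  color :  red ;  }' | squeeze_css
# prints: a{color:red;}
```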
[b][url=https://bit.ly/2KjtxoD]Pkg[/url], [url=https://bit.ly/2U6dzxV]mdsh[/url], [url=https://bit.ly/2G49OE8]Woofy[/url], [url=http://goo.gl/bzBU1]Akita[/url], [url=http://goo.gl/SO5ug]VLC-GTK[/url], [url=https://tiny.cc/c2hnfz]Search[/url][/b]
jamesbond wrote:
  sc0ttman wrote:
    ...I really need to go learn how sed actually works..
  Come on Scott, you don't need to be so modest. I'm still using your sjpplog for my main blog, six years later
And I'm still reading it...
@jamesbond
What you call a defeat looks like a win to me. After some modification your regex works as expected:
Code: Select all
sed -r -e ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;' test.html
It still has nested greedy quantifiers, but for common HTML files that shouldn't be a problem.
Edit: Fixed a typo. I typed over previous code and left an unnecessary part of it. Now it really works.
@MochiMoppel
It looks like you have found a beast that formally ruins the concept of nested comments:
<!-->
A simple question:
<!--> Are these two comments <!--> nested <!--> or side by side? <!-->
My editor thinks they are side by side. Firefox sees four empty comments and adds two dashes to each. To support nesting, it's necessary to change the delimiters or disallow this hybrid.
Also, removing stray delimiters may insert a part of a comment into a document:
use<!--it carefully more or -->less-->
Maybe just report an error? sed's w command can do it.
MochiMoppel wrote:
  With Burunduk's "superfluous" code it will even create an infinite loop and will not work at all.
¡Vaya! An infinite loop! /!--/ was supposed to deal with it but...
Ok, won't loop now.
Code: Select all
sed ':a;$!{N;ba;};:c;/<!--/s/-->/&&/;tb;:b;s/<!--.*-->-->//;tc' test.html
@sc0ttman
A quick fix.
Edit: Now it changes newlines to spaces except between > and <. Note that this may spoil inline JS. Nested <pre> elements are not supported unless they are nested this way: (((...)))
Code: Select all
sed ':a;$!{N;ba;};s/@/@a/g;s/\n/@n/g;s/<pre/\n&/g;s/<\/pre>/&\n/g' test.html \
| sed -nr '/(^<pre|<\/pre>$)/!{s/(@n)+/@n/g;s/>\s*(@n)?\s*</></g;s/@n/ /g;
s/\s+/ /g; # squeeze consecutive whitespaces everywhere, delete this line if unneeded
};H;${g;s/\n//g;s/@n/\n/g;s/@a/@/g;p}' >min.html
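All of these one-liners start with `:a;$!{N;ba;}`, which the thread never spells out. Below is a minimal sketch of what that prefix does, assuming GNU sed (which accepts labels followed by `;`); the function name `slurp_and_join` and the ` | ` separator are only there for illustration.

```shell
#!/bin/sh
slurp_and_join() {
    # :a        define a label named "a"
    # $!{N;ba;} on every line except the last: append the next input
    #           line to the pattern space and jump back to "a"
    # After the loop the pattern space holds the whole input, so the
    # following s command can see (and match across) the newlines.
    sed ':a;$!{N;ba;};s/\n/ | /g'
}

printf 'one\ntwo\nthree\n' | slurp_and_join
# prints: one | two | three
```

Without the loop, sed would run the `s` command once per line and a pattern could never span a multi-line comment.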
- MochiMoppel
- Posts: 2084
- Joined: Wed 26 Jan 2011, 09:06
- Location: Japan
Burunduk wrote:
  @jamesbond
  What you call a defeat looks like a win to me. After some modification your regex works as expected
I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong. The green part in this example is what I would consider as comments:
- Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->
Attachment: w3schools.png
w3schools' editor and rendering engine have different opinions.
(Please disregard my sloppy break tags. Should be <br> or <br />)
MochiMoppel wrote:
  I think there are no winners or losers in this game as there are so many different expectations and no clear answers as to what is right and what is wrong.
I think this game is a puzzle, not a competition. Everyone who solves it is a winner (unless you look at this philosophically).
MochiMoppel wrote:
  The HTML gods at w3.org give no clear answer.
The specification says:
  12.1.6 Comments
  Comments must have the following format:
  1. The string "<!--".
  2. Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".
  3. The string "-->".
  The text is allowed to end with the string "<!", as in <!--My favorite operators are > and <!-->.
The regex from my previous post already behaves like Geany:
Code: Select all
# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!--(-?[^-]|--+[^>-])*-*>//g;'
Lucky World --> Strike
Nice
Good Morning -->
This modification follows the browsers:
Code: Select all
sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html
Code: Select all
# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;'
Lucky World --> Strike
Nice Weather -->
Good Morning -->
Known limitations: a nested opening comment delimiter <!-- is not recognized if it immediately follows <!-, <!--, <!--- and so on. <!--> and <!---> are considered to be empty comments if outermost, and closing comment delimiters if nested. To be continued.
Code: Select all
sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!-]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc' test.html
Code: Select all
# echo 'Lucky <!-- Hello <!-- Dolly --> World --> Strike
Nice <!--> Weather -->
Good <!----> Morning -->' | sed -r ':a;$!{N;ba;};:c;s/<!(--+[^>-](-?[^<-]|--+[^<>-]|[<-]*<(-?[^!]|!-[^-]|--+[^>-]))*-*<?!?---?>|--+>)//g;tc'
Lucky Strike
Nice Weather -->
Good Morning -->
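The restrictions quoted from section 12.1.6 can also be checked directly, independent of any regex. A rough sketch in plain sh: `is_valid_comment_text` is a hypothetical helper that receives only the text between "<!--" and "-->", and the sample strings are taken from or modelled on the examples in this thread.

```shell
#!/bin/sh
# Returns 0 if the text obeys the quoted restrictions, 1 otherwise.
is_valid_comment_text() {
    case $1 in
        '>'* | '->'*)                   return 1 ;;  # must not start with ">" or "->"
        *'<!--'* | *'-->'* | *'--!>'*)  return 1 ;;  # must not contain these strings
        *'<!-')                         return 1 ;;  # must not end with "<!-"
    esac
    return 0
}

for text in ' Hello ' '> Weather ' ' Hello <!-- Dolly ' 'My favorite operators are > and <!'; do
    if is_valid_comment_text "$text"; then
        printf 'valid:   [%s]\n' "$text"
    else
        printf 'invalid: [%s]\n' "$text"
    fi
done
```

By these rules `<!--> Weather -->` is not a well-formed comment at all, and neither is anything with a nested `<!--`, which is why editors and browsers are left to their own error recovery.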
Finally, an easy solution:
Keef wrote:
  Using your example, the output I get is:
  Code: Select all
  # cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba' ...
Added "s" and replaced "r" with "R". Seems to work.
Code: Select all
ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html
ssed is available from the Ubuntu repo.
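The reason the lazy `*?` matters: POSIX regular expressions, which `sed -r` uses, define no lazy quantifiers, so a plain `.*` is always greedy. A small sketch of the difference on one line with two comments, using only regular sed; the sample string is made up, and the second pattern is Burunduk's earlier one.

```shell
#!/bin/sh
sample='a <!-- x --> b <!-- y --> c'

# Greedy .* runs from the first "<!--" to the last "-->", so " b " is lost too.
greedy=$(printf '%s\n' "$sample" | sed -E 's/<!--.*-->//g')

# Burunduk's pattern cannot step over a "-->", so each comment is removed
# on its own.
explicit=$(printf '%s\n' "$sample" | sed -E 's/<!--(-?[^-]|--+[^>-])*-*>//g')

printf 'greedy:   [%s]\n' "$greedy"     # prints: greedy:   [a  c]
printf 'explicit: [%s]\n' "$explicit"   # prints: explicit: [a  b  c]
```

With ssed -R the regex is handed to PCRE, where `.*?` stops at the first possible `-->`, which gives the second result with a much shorter pattern.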
Burunduk wrote:
  Finally, an easy solution:
  Keef wrote:
    Using your example, the output I get is:
    Code: Select all
    # cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba' ...
  Added "s" and replaced "r" with "R". Seems to work.
  Code: Select all
  ssed -Re ':a;$!{N;ba;};s/<!--(.|\n)*?-->//g' test.html
  ssed is available from the ubuntu repo.
Cool. I wonder what the differences between ssed and sed are. This feature looks quite useful:
  \cregexpc
  Match lines matching the regular expression regexp. The c may be any character.
Find me on [url=https://www.minds.com/ns_tidder]minds[/url] and on [url=https://www.pearltrees.com/s243a/puppy-linux/id12399810]pearltrees[/url].
ssed man page wrote:
  \cregexpc
  Match lines matching the regular expression regexp. The c may be any character.
This is a standard sed feature. Normally, a regex (in an s command or in an address) is delimited by /, but you can use any other character. In an address the first delimiter must be escaped:
Code: Select all
sed '\%regexp_here%s!another_regexp!replacement!g'
Code: Select all
ssed -Re ':a;$!{N;ba;};s/<!--.*?-->//Sg' test.html
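A runnable instance of the alternative-delimiter form, as a sketch: the function name `relocate` and the sample paths are hypothetical, chosen because patterns full of slashes are where this is most useful.

```shell
#!/bin/sh
relocate() {
    # \%^/usr/local%  address: only lines that start with /usr/local
    #                 (the first delimiter of an address is escaped)
    # s!...!...!      substitution delimited by !, so no slash needs escaping
    sed '\%^/usr/local%s!/usr/local!/opt!'
}

printf '%s\n' '/usr/local/bin/tool' '/home/me/usr/local/x' | relocate
# prints:
# /opt/bin/tool
# /home/me/usr/local/x
```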
Burunduk wrote:
  I don't know much about ssed too. ...
  Its PCRE support seems to be the most notable difference.
Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
- MochiMoppel
Burunduk wrote:
  The specification says:
Thanks. This is what I was looking for and didn't find. Very clear indeed.
Burunduk wrote:
  This modification follows the browsers:
  Code: Select all
  sed -r ':a;$!{N;ba;};s/<!(--+[^>-](-?[^-]|--+[^>-])*-*>|--+>)//g;' test.html
That's great! I tried something like this and gave up after an hour. My brain is not built for those one-liners.
s243a wrote:
  Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
I'm not sure if I understand what you mean by "assign names". An example would be nice. Subexpressions and back-references can be tricky and may give unexpected results, but IMO this is almost always the result of a wrong regex. Here are some examples and their partly pretty astonishing output (look at the very first case):
Code: Select all
#!/bin/bash
for STRING in ABC320 320 ;do
echo "+++++ Input string $STRING +++++++"
echo "$STRING" | sed -r 's/([A-Z])([0-9])/TXT:\1\tNUMBER:\2X/'
echo "$STRING" | sed -r 's/.*([A-Z]).*([0-9])/TXT:\1\tNUMBER:\2X/'
echo "$STRING" | sed -r 's/([A-Z]*)([0-9]*)/TXT:\1\tNUMBER:\2X/'
echo "$STRING" | sed -r 's/([A-Z])*([0-9])*/TXT:\1\tNUMBER:\2X/'
echo "$STRING" | sed -r 's/([A-Z]).*([0-9])*/TXT:\1\tNUMBER:\2X/'
done
Result:
Code: Select all
+++++ Input string ABC320 +++++++
ABTXT:C NUMBER:3X20
TXT:C NUMBER:0X
TXT:ABC NUMBER:320X
TXT:C NUMBER:0X
TXT:A NUMBER:X
+++++ Input string 320 +++++++
320
320
TXT: NUMBER:320X
TXT: NUMBER:0X
320
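Two things explain most of that output: the regex is not anchored, so it may match in the middle of the string, and a quantifier placed outside the parentheses makes the group keep only its last repetition. A condensed sketch of both effects; `show` is a hypothetical helper that applies one ERE to ABC320 and prints both groups in brackets.

```shell
#!/bin/sh
show() {
    # "$1" is the ERE under test; \\1 and \\2 reach sed as \1 and \2
    printf '%s\n' ABC320 | sed -E "s/$1/[\\1][\\2]/"
}

show '([A-Z])([0-9])'     # prints: AB[C][3]20   (matched "C3" in the middle)
show '^([A-Z])([0-9])'    # prints: ABC320       (anchored: no match, unchanged)
show '([A-Z])*([0-9])*'   # prints: [C][0]       (each group keeps its last repetition)
show '([A-Z]*)([0-9]*)'   # prints: [ABC][320]   (quantifier inside the group)
```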
MochiMoppel wrote:
  s243a wrote:
    Where I can see this being very useful is if it lets us assign names to capture groups. My understanding is that sed limits you to 9 capture groups and you might not know which group you got if one of the groups doesn't match (e.g. the capture group is followed by a question mark in the sed regular expression).
  I'm not sure if I understand what you mean by "assign names". An example would be nice. Subexpressions and back-references can be tricky and may give unexpected results, but IMO this is almost always the result of a wrong regex. Here are some examples and their partly pretty astonishing output (look at the very first case):
  Code: Select all
  #!/bin/bash
  for STRING in ABC320 320 ;do
  echo "+++++ Input string $STRING +++++++"
  echo "$STRING" | sed -r 's/([A-Z])([0-9])/TXT:\1\tNUMBER:\2X/'
  echo "$STRING" | sed -r 's/.*([A-Z]).*([0-9])/TXT:\1\tNUMBER:\2X/'
  echo "$STRING" | sed -r 's/([A-Z]*)([0-9]*)/TXT:\1\tNUMBER:\2X/'
  echo "$STRING" | sed -r 's/([A-Z])*([0-9])*/TXT:\1\tNUMBER:\2X/'
  echo "$STRING" | sed -r 's/([A-Z]).*([0-9])*/TXT:\1\tNUMBER:\2X/'
  done
  Result:
  Code: Select all
  +++++ Input string ABC320 +++++++
  ABTXT:C NUMBER:3X20
  TXT:C NUMBER:0X
  TXT:ABC NUMBER:320X
  TXT:C NUMBER:0X
  TXT:A NUMBER:X
  +++++ Input string 320 +++++++
  320
  320
  TXT: NUMBER:320X
  TXT: NUMBER:0X
  320
Here's the syntax:
Code: Select all
(?<name>group)
and it may be referenced as follows:
Code: Select all
\g{name} [5] Named backreference
\k<name> [5] Named backreference
Also, in Perl you can do the following:
Code: Select all
"hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is "hello"
As for examples, I'll try to think of some. I have had cases before where I wanted to use this.
One application that I see is when you are defining variables in a script:
Code: Select all
x=1
y=2
P.S. The syntax for named capture groups is slightly different in Python. In Python it is:
Code: Select all
(?P<name>group)
Here's another useful construct of Perl-compatible regular expressions:
https://stackoverflow.com/a/38220317 wrote:
  Perl 5 introduced a much richer regex engine, which is hence standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:
  Code: Select all
  perl -pe 's/(?:(?!str).)+/not/' file
  will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").