(OLD) (ARCHIVED) Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info

This forum can also be accessed as http://oldforum.puppylinux.com
It is now read-only and serves only as archives.

Please register over the NEW forum
https://forum.puppylinux.com
and continue your work there. Thank you.

 
All times are UTC - 4
sed scratch pad -- A thread of sed examples
MochiMoppel


Joined: 26 Jan 2011
Posts: 2084
Location: Japan

Posted: Tue 14 Jan 2020, 12:53

Yes, essentially the same, but the suggestions I've seen so far are either not for sed, are crap or a combination of both.
Keef


Joined: 20 Dec 2007
Posts: 1001
Location: Staffordshire

Posted: Tue 14 Jan 2020, 14:42

Is this any good?
https://stackoverflow.com/questions/4055837/delete-html-comment-tags-using-regexp

Using your example, the output I get is:

Code:
# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
     
      <TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
      <Pager/>
     
      <TaskList maxwidth="200"/>
      <Dock/>
     
   
   
   
   
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
#
sc0ttman


Joined: 16 Sep 2009
Posts: 2806
Location: UK

Posted: Tue 14 Jan 2020, 16:30    Post subject: html minifier in sed
Subject description: works fine, except with <pre> tags .. any fixes?
 

I'd love to get this working:

An HTML minifier...

This thing nearly does the job, except that it minifies stuff inside <pre> tags...

I would love love love to fix that!!


Code:
function minify_html {
  # temp fix to IFS, just in case the html files contain spaces
  OLD_IFS=$IFS
  IFS="
  "
  for html_file in $html_files
  do
    :
    # don't minify HTML until we can skip contents of <pre>..</pre>
    #sed ':a;N;$!ba;/<div class="highlight"><pre>\.*<\/pre><\/div>/! s@>\s*<@><@g' $html_file > ${html_file//.html/.minhtml}
    #mv ${html_file//.html/.minhtml} ${html_file}
  done
  IFS=$OLD_IFS
}

_________________
Pkg, mdsh, Woofy, Akita, VLC-GTK, Search
s243a

Joined: 02 Sep 2014
Posts: 2626

Posted: Wed 15 Jan 2020, 00:05

Keef wrote:
Is this any good?
https://stackoverflow.com/questions/4055837/delete-html-comment-tags-using-regexp

Using your example, the output I get is:

Code:
# cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
     
      <TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
      <Pager/>
     
      <TaskList maxwidth="200"/>
      <Dock/>
     
   
   
   
   
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
#


I get the same output with:

Code:

#Match the last line
$,/.*/ {
    H #Append new data to hold space
    x #Exchange hold space with pattern space
    s/<!--.*-->//g #Delete comment
    p #Print pattern space
}
#If we don't yet have a terminating comment just append to the hold space and start the next cycle.
/.*-->.*/! {
    H #Append pattern space to hold space
    d #Delete pattern space and start next cycle
}
#If we have a closing comment append data to hold space and copy the hold space to the pattern space to see if we can match both an opening and closing comment in pattern space.
/.*-->.*/ {
    H #Append new data to hold space
    x #Exchange hold space with pattern space
    h #Copy pattern space to hold space
}
#If this block matches the previous block has already been executed and this block will be executed next.
/.*<!--.*-->.*/ {
    s/<!--.*-->//g #Delete comment
    p #Print pattern space
    s/.*//g #delete pattern space
    x #exchange pattern space with hold space
    d #delete pattern space and start next cycle.
}


Test program:
https://pastebin.com/tNttFjyT

_________________
Find me on minds and on pearltrees.
jamesbond

Joined: 26 Feb 2007
Posts: 3475
Location: The Blue Marble

Posted: Wed 15 Jan 2020, 01:29

MochiMoppel wrote:
The challenge is to remove all comments from an XML/HTML document, using only sed.

Challenge accepted.

This removes the comments and cleans up stray newlines.
Code:
sed -n 'H;x;s/<!--.*-->//;x;${x;s/\n//;s/\n[ \n\t]*\n/\n/g;p}' test.html


If you only want to remove the comments and don't worry about how it looks, this will do.
Code:
sed -n 'H;x;s/<!--.*-->//;x;${x;p}' test.html


Confirmed to work with gnu sed and busybox sed.

_________________
Fatdog64 forum links: Latest version | Contributed packages | ISO builder
step

Joined: 04 May 2012
Posts: 1352

Posted: Wed 15 Jan 2020, 02:50

Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence. (I didn't but I remember being scorched about this before).
_________________
Fatdog64-810|+Packages|Kodi|gtkmenuplus
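
Following up on step's reminder, one way to guard against Windows line endings is to strip the trailing carriage return from every line before running any of the comment-removal commands. This is only a sketch: strip_cr is a made-up helper name, and the \r escape inside the regex is assumed to be understood by the sed in use (GNU sed accepts it; other seds may not).

```shell
# strip_cr: hypothetical helper that removes one trailing CR from each line,
# turning \r\n line endings into plain \n.
strip_cr() {
  sed 's/\r$//' "$@"
}

# Usage: normalise the input first, then pipe it into a comment stripper.
printf 'line one\r\nline two\r\n' | strip_cr
```

Doing the conversion as a separate first step keeps the comment-matching patterns themselves free of any \r handling.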
s243a

Joined: 02 Sep 2014
Posts: 2626

Posted: Wed 15 Jan 2020, 02:59    Post subject: Re: html minifier in sed
Subject description: works fine, except with <pre> tags .. any fixes?
 

sc0ttman wrote:
I'd love to get this working:

An HTML minifier...

This thing nearly does the job, except that it minifies stuff inside <pre> tags...

I would love love love to fix that!!


Code:
function minify_html {
  # temp fix to IFS, just in case the html files contain spaces
  OLD_IFS=$IFS
  IFS="
  "
  for html_file in $html_files
  do
    :
    # don't minify HTML until we can skip contents of <pre>..</pre>
    #sed ':a;N;$!ba;/<div class="highlight"><pre>\.*<\/pre><\/div>/! s@>\s*<@><@g' $html_file > ${html_file//.html/.minhtml}
    #mv ${html_file//.html/.minhtml} ${html_file}
  done
  IFS=$OLD_IFS
}


I'll think about the general problem more later, but for now I notice that in "<pre>\.*<\/pre>" you are escaping the period. I think what you actually want is "<pre>.*<\/pre>" (notice the period is not escaped), because even with "Basic Regular Expressions" the period character still has its special meaning, and in this case we want that special meaning, so we don't want to escape it.

Quote:

In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

With basic (BRE) syntax, these characters do not have special meaning unless prefixed with a backslash (‘\’); While with extended (ERE) syntax it is reversed: these characters are special unless they are prefixed with backslash (‘\’).

https://www.gnu.org/software/sed/manual/html_node/BRE-vs-ERE.html#BRE-vs-ERE

You can test this with something like:
Quote:

# echo abc | sed 's/a.c//'


BTW, why do we need the "div" tags in the above expression?
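
The difference between the escaped and the unescaped period can be checked with a quick experiment. The helper name try_pattern, the sample line and the HIT marker are all invented for this sketch.

```shell
# try_pattern: hypothetical helper; replaces whatever the given regex matches
# in a fixed sample line with the marker HIT, so a match is easy to spot.
try_pattern() {
  printf '<pre>some code</pre>\n' | sed "s/$1/HIT/"
}

try_pattern '<pre>\.*<\/pre>'   # escaped period: only literal dots allowed, no match
try_pattern '<pre>.*<\/pre>'    # bare period: any characters allowed, whole element matches
```

With the escaped form, \.* stands for "zero or more literal periods", so the pattern can only match an element whose content is empty or consists of dots.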

sc0ttman


Joined: 16 Sep 2009
Posts: 2806
Location: UK

Posted: Wed 15 Jan 2020, 03:10    Post subject: Re: html minifier in sed
Subject description: works fine, except with <pre> tags .. any fixes?
 

s243a wrote:
BTW, why do we need the "div" tags in the above expression?

mdsh/pygments generates divs around pre tags.. this minifier is for mdsh.

recobayu


Joined: 15 Sep 2010
Posts: 389
Location: indonesia

Posted: Wed 15 Jan 2020, 04:44

MochiMoppel wrote:
Looks like another abandoned thread

I'll give it a try anyway since I don't know where to ask.
The challenge is to remove all comments from an XML/HTML document, using only sed.

Example text:
Code:
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
      <!-- Additional TrayButton attribute: label -->
      <TrayButton label="Menu" icon="logo-mini.png" border="true">root:3</TrayButton>
border="true">exec:urxvt</TrayButton>
      <Pager/>
      <!-- Additional TaskList attribute: maxwidth -->
      <TaskList maxwidth="200"/>
      <Dock/>
      <!-- Additional Swallow attribute: height -->
   <!--   <Swallow name="blinky">
         blinkydelayed -bg "#DCDAD5"
      </Swallow> -->
   <!--   <Swallow name="xtmix-launcher">
         xtmix -launch
      </Swallow> -->
   <!--   <Swallow name="asapm">
         asapmshell -u 4
      </Swallow> -->
   <!--   <Swallow name="freememapplet" width="34">
         freememappletshell
      </Swallow> -->
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>

The problem is that these comments can be multiline. My rough idea is to let sed move a line to the hold buffer when a '<!--' tag is detected, then continue to fill the hold buffer until a '-->' is detected, load the hold buffer into the pattern space and remove the comment, clear the hold buffer and continue with the next cycle. May not be the right way and I'm not even close to achieving the goal. Does anybody know how to do this?


I use this code, just a one-liner, but it only deletes the comment when the <!-- and the --> are on different lines.

Code:
#sed -e '/<!--/,/-->/d' xml
<JWM>
   <Tray  autohide="false" insert="right" x="0" y="-1" border="1" height="28" >
      <TaskList maxwidth="200"/>
      <Dock/>
      <Swallow name="xload" width="32">
         xload -nolabel -bg "#888888" -fg red -hl white
      </Swallow>
      <Clock format="%H:%M">minixcal</Clock>
   </Tray>
</JWM>
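
A small experiment suggests why the output above is missing the TrayButton and Pager lines as well. In a sed range such as /<!--/,/-->/ the search for the end pattern begins on the line after the one that opened the range, so a one-line comment opens a range that runs on to the next line containing -->. The sketch below uses a made-up five-line input and a made-up helper name, range_delete.

```shell
# range_delete: hypothetical helper wrapping the one-liner from the post above.
range_delete() {
  sed -e '/<!--/,/-->/d' "$@"
}

# <b/> sits between two one-line comments. The first comment opens the range,
# and the range only closes at the NEXT line containing -->, so <b/> is
# deleted together with both comments. Only <a/> and <c/> survive.
printf '%s\n' '<a/>' '<!-- first -->' '<b/>' '<!-- second -->' '<c/>' | range_delete
```

The range form is therefore only safe when every comment spans at least two lines and nothing else shares those lines.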
MochiMoppel


Joined: 26 Jan 2011
Posts: 2084
Location: Japan

Posted: Wed 15 Jan 2020, 08:07

Keef wrote:
Is this any good?
https://stackoverflow.com/questions/4055837/delete-html-comment-tags-using-regexp
If it solves the problem it can't be that bad, right?
But frankly it's not really good: A useless cat, a useless '?' in <!--.*?--> and a strange positioning of the :a label. Still the idea is nice. No pattern space <-> hold space acrobatics, just a clever use of a label. The next suggestion in your link is better:
Code:
sed -r '
/<!--/!b
:a
/-->/! {N;ba}
s/<!--.*-->//
' "$TESTFILE"


sc0ttman wrote:
This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.

s243a wrote:
I get the same output with:
Code:

...some really ugly code here....

jamesbond wrote:
Confirmed to work with gnu sed and busybox sed.
That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption. Surely I take the blame for not providing a better example and I will think of a better one. Generally speaking an XML document is whitespace agnostic. Linefeeds don't matter and even a huge HTML page can be written as a single line (e.g. Google does this). A pattern like <!--.*--> is greedy and would eliminate everything from the first <!-- up to the last --> instead of catching only the next comment termination.
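
The greediness is easy to reproduce on a single line that holds two comments. The helper name greedy_strip and the sample text are invented for this sketch.

```shell
# greedy_strip: hypothetical helper applying the greedy pattern to each line.
greedy_strip() {
  sed 's/<!--.*-->//g' "$@"
}

# Two comments on one line: the match runs from the first <!-- to the last -->,
# so the "b" between the comments is lost as well and only "ac" remains.
printf 'a<!-- x -->b<!-- y -->c\n' | greedy_strip
```

The g flag does not help here, because the very first match already swallows everything up to the last comment terminator.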

step wrote:
Just a reminder to also test a Windows-created HTML file, for which \r\n is the line termination sequence.
Thanks for the reminder. I can imagine that Mac documents are even more fun as sed probably would treat the whole document as a single line

@recobayu Thanks, but this is just too limited

@all I now cooked my own solutions, which appear to do what I want. I'll share them if they pass my acid tests. Let's see.
jamesbond

Joined: 26 Feb 2007
Posts: 3475
Location: The Blue Marble

Posted: Wed 15 Jan 2020, 11:50

MochiMoppel wrote:
That's already a nice achievement. But here is my problem with all suggestions so far: They all assume that a line contains only 1 comment, which is a bold assumption.
True enough.
Quote:
Surely I take the blame for not providing a better example and I will think of a better one.
You did say it was for general HTML/XML. I will now take this to mean __valid__ HTML/XML, which does not allow nested comments.

My updated test case:
Code:
<p>1</p>
<!--2-->3<br>
<!--
4
--><b>5</b><!--
-6 -7 --8 -9- <10> <-11-> <u>12</u>
-->13
<!-- 14 -->15<!-- <-16-> -->17

Expected output:
Code:
<p>1</p>
3<br>
<b>5</b>13
1517


Here is my updated take on the challenge. Still works on busybox sed and gnu sed too.
Code:
sed -r -e ':a;N;$!ba;s/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g;' test.html
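
For readers who want to experiment with this pattern, here is a commented variant wrapped in a function. It is a sketch, not jamesbond's exact command: the name strip_comments is invented, and the slurp loop is written as $!{N;ba} instead of N;$!ba. The reason for that change is that GNU sed, when N is issued on the last line, prints the pattern space and exits without running the remaining commands, so the original loop would leave a one-line file untouched.

```shell
# strip_comments: hypothetical wrapper around the regex from the post above.
# The alternation lets the body of a comment consist of:
#   [^-]*          any run of non-dash characters
#   [^-]*-[^-]     a single dash followed by a non-dash
#   [^-]*--[^>]    a double dash NOT followed by >
# so a match normally ends at the nearest closing -->.
strip_comments() {
  sed -r -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
      -e 's/<!--([^-]*|[^-]*-[^-]|[^-]*--[^>])*-->//g' "$@"
}

# Two comments on a single line; only the text outside them is kept.
printf 'a<!-- x -->b<!-- y -->c\n' | strip_comments
```

Splitting the script over several -e options avoids relying on how a particular sed parses a label that is followed by ; or }.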

sc0ttman


Joined: 16 Sep 2009
Posts: 2806
Location: UK

Posted: Wed 15 Jan 2020, 15:39

MochiMoppel wrote:
sc0ttman wrote:
This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.

Sorry, I should have been more clear, I'm posting "off topic" .. not even attempting to "remove comments"...

So.. I mean it "does the job" of minifying HTML.. Nothing to do with removing comments... Though it is related (I also want to remove comments at some point), hence me posting here..

So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...

Carry on ....
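
One possible way to leave <pre> content alone, offered only as a sketch and not as the mdsh fix: give the substitutions a negated line range, so they are skipped from a line containing <pre> through the next line containing </pre>. Assumptions: the helper name minify_outside_pre is invented, the opening tag appears literally as <pre> with no attributes, and <pre> and </pre> sit on different lines. Unlike the slurping one-liner in the snippet, this works line by line, so it trims indentation and whitespace between tags but does not join lines.

```shell
# minify_outside_pre: hypothetical helper. On every line that is NOT inside a
# <pre> ... </pre> line range it strips leading whitespace and collapses
# whitespace between adjacent tags.
minify_outside_pre() {
  sed -e '/<pre>/,/<\/pre>/!{' \
      -e 's/^[[:space:]]*//' \
      -e 's/>[[:space:]]*</></g' \
      -e '}' "$@"
}

# The indented line inside <pre> keeps its spacing; the other lines are squeezed.
printf '%s\n' '  <div>   <p>hi</p>   </div>' '<pre>' '  keep   this' '</pre>' '   <p>bye</p>' |
  minify_outside_pre
```

If a <pre> element opens and closes on the same line, the range would stay open until a later </pre>, so real input should be checked for that case first.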

s243a

Joined: 02 Sep 2014
Posts: 2626

Posted: Wed 15 Jan 2020, 16:44

sc0ttman wrote:
MochiMoppel wrote:
sc0ttman wrote:
This thing nearly does the job, except that it minifies stuff inside <pre> tags..
Which job? Where do you eliminate comments? And what "stuff inside <pre> tags". Do you want to preserve comments inside pre tags? What for? The browser wouldn't show them anyway, so you might delete them as well. Unless you provide a sample of your input it is hard to tell what you are after.

Sorry, I should have been more clear, I'm posting "off topic" .. not even attempting to "remove comments"...

So.. I mean it "does the job" of minifying HTML.. Nothing to do with removing comments... Though it is related (I also want to remove comments at some point), hence me posting here..

So the snippet I posted does the job of minifying HTML, except that it _also_ minifies the contents of <pre> tags... which I don't want...

Carry on ....


Did you try my suggestion above, which was removing the backslash before the ".*" inside the pre tags? If you give us some test input then we can try some tests.

sc0ttman


Joined: 16 Sep 2009
Posts: 2806
Location: UK

Posted: Wed 15 Jan 2020, 17:37

I didn't really try anything - it's not my snippet, and already way beyond anything I know about sed (next to nothing)...

And a valid test case would be any HTML file from mdsh that contains highlighted code like this one: https://sc0ttj.github.io/mdsh/posts/2019/06/29/adding-support-for-more-embedded-languages.html

jamesbond

Joined: 26 Feb 2007
Posts: 3475
Location: The Blue Marble

Posted: Wed 15 Jan 2020, 23:19

Scott, I've read your few posts, and I still don't get it. Perhaps it would be good if you could give us a sample input and the expected output, as well as the "incorrect output" produced by the currently not-working script, so we can get an idea of what it is that you want to do. As it stands, the current sed script will more or less empty out the text in between html tags, leaving basically a blank page full of tags but no text in between. I'm not sure whether that counts as "minify". (I have heard of minifying javascript, but minifying html is news to me ...).