A sed expression to deal with parsing wikitext [SOLVED]

thunor
Posts: 350
Joined: Thu 14 Oct 2010, 15:24
Location: Minas Tirith, in the Pelennor Fields fighting the Easterlings

#1 Post by thunor »

I've written, and am still tweaking, a wikitext parser using sed. I want to make it as compatible with Creole 1.0 as possible, but I'm having problems with //italic//.

I can't find a way to deal with this:

Code: Select all

//some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text//
All I've got is this, which italicises a run of one or more non-slash characters:

Code: Select all

sed -e 's|//\([^/]\+\)//|<em>\1</em>|g'
To be honest this is acceptable anyway:

Code: Select all

//some text// [[http://www.murga-linux.com/puppy|//Puppy Linux Discussion Forum//]] //some more text//
but I just wondered if there's a sed wizard about who knows how to deal with "// .* not:// .* //", because it might help me tweak some other stuff. Basically I want to negate a sequence of more than one character, and I think a bracket expression can only negate single characters.

Regards,
Thunor
Last edited by thunor on Fri 03 May 2013, 21:45, edited 1 time in total.
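
To see where the expression in this post stops short, it can help to run it on both kinds of input. The sketch below assumes GNU sed (for "\+") and wraps the expression in a function called italic_simple, a name chosen here purely for illustration. Because [^/] refuses every slash, the "//" that follows "http:" is taken as the closing marker.

```shell
#!/bin/sh
# Assumption: GNU sed, since \+ is a GNU extension in basic regular expressions.
# italic_simple is a hypothetical helper name, not part of thunor's parser.
italic_simple() {
    sed -e 's|//\([^/]\+\)//|<em>\1</em>|g'
}

# Plain spans convert as intended.
echo '//one// and //two//' | italic_simple

# With a URL inside the span, the "//" after "http:" ends the match early.
echo '//some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text//' | italic_simple
```

The first line comes out as "<em>one</em> and <em>two</em>", while the second closes the emphasis right after "http:" and leaves the final "//" untouched.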

seaside
Posts: 934
Joined: Thu 12 Apr 2007, 00:19

#2 Post by seaside »

Hey thunor,

I don't have a sed answer, but perhaps a bash solution would do....

Code: Select all

 # line='//some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text//'
# line=${line/#\/\//<em>}  line=${line/%\/\//</em>}
# echo $line
<em>some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text</em> 
Best regards,
s
EDIT: A little experimenting and

Code: Select all

echo "$line" | sed 's|^//|<em>|;s|//$|</em>|'
works :)

#3 Post by thunor »

Thanks seaside, but it needs to deal with multiples on the same line, which I should've mentioned.

It did get me thinking, though, about dealing with it before or after sed with something like you've done, or with a case statement. Then I thought about temporarily substituting "://" and putting it back afterwards. In the end I managed it in sed using temporary string substitution:

Code: Select all

echo '//some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text//' | sed \
	-e 's|://|@COLON_SLASH_SLASH@|g' \
	-e 's|//|@SLASH_SLASH@|g' \
	-e 's|/|@SLASH@|g' \
	-e 's|@SLASH_SLASH@|//|g' \
\
	-e 's|//\([^/]\+\)//|<em>\1</em>|g' \
\
	-e 's|@SLASH@|/|g' \
	-e 's|@COLON_SLASH_SLASH@|://|g'
Cheers and regards,
Thunor
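
For anyone wanting to try the placeholder approach on a line with several italic spans, here is a minimal sketch that wraps the same expressions in a function. It assumes GNU sed, the function name italicise is made up for this example, and the @...@ placeholders are assumed never to occur in the real wikitext.

```shell
#!/bin/sh
# Assumptions: GNU sed; italicise is a hypothetical wrapper name; the @...@
# placeholders are assumed not to appear in the input text.
italicise() {
    sed \
        -e 's|://|@COLON_SLASH_SLASH@|g' \
        -e 's|//|@SLASH_SLASH@|g' \
        -e 's|/|@SLASH@|g' \
        -e 's|@SLASH_SLASH@|//|g' \
        -e 's|//\([^/]\+\)//|<em>\1</em>|g' \
        -e 's|@SLASH@|/|g' \
        -e 's|@COLON_SLASH_SLASH@|://|g'
}

# Two italic spans on one line, the first containing a URL with a path.
echo '//a [[http://example.org/p|X]] b// plain //c//' | italicise
```

Hiding the single slashes as well as "://" is what lets [^/] run across the URL path; both are restored only after the emphasis tags are in place.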

sunburnt
Posts: 5090
Joined: Wed 08 Jun 2005, 23:11
Location: Arizona, U.S.A.

#4 Post by sunburnt »

thunor; You're not very clear about what you're trying to do.

You posted an example input line, can you post what you want it to look like?

Or is this what you want?
Input: //some text [[http://www.murga-linux.com/puppy|Puppy Linux Discussion Forum]] some more text//

Output: //some text// [[http://www.murga-linux.com/puppy|//Puppy Linux Discussion Forum//]] //some more text//
If so then this does the trick: echo $input |sed 's# \[\[#// \[\[#;s#|#|//#;s#\]\] #//\]\] //#'
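You need to escape the "[" and "]" characters with "\" because sed reads "[" in a pattern as the start of a bracket expression. The single quotes already keep Bash from treating them the way it does in a test such as: [ -d /root ]&& echo GOOD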

### But maybe you`re trying to italicize the "some text" parts?
.
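
A small illustration of the bracket escaping mentioned in this post, as a sketch only. The function name mark_links and the "<<" / ">>" replacement text are invented for the example; the point is that "\[" in a sed pattern matches a literal "[" rather than opening a bracket expression such as [a-z].

```shell
#!/bin/sh
# mark_links is a hypothetical helper used only to show the escaping.
# Inside single quotes the shell passes the brackets to sed unchanged;
# it is sed that needs \[ to mean a literal "[".
mark_links() {
    sed -e 's#\[\[#<<#g' -e 's#]]#>>#g'
}

echo 'see [[target]] here' | mark_links
```

This prints "see <<target>> here". A "]" outside a bracket expression is an ordinary character, so it is left unescaped in the second expression.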

#5 Post by thunor »

sunburnt wrote:thunor; You're not very clear about what you're trying to do.

You posted an example input line, can you post what you want it to look like?...
Hi sunburnt

This (I'll give you an example using multiples on the same line which needs to be supported):

Code: Select all

//some italicised text [[http://linux.com/learn|Learn Linux]] some italicised text// some non-italicised text //some italicised text [[http://linux.com/learn|Learn Linux]] some italicised text//
to:

Code: Select all

<em>some italicised text [[http://linux.com/learn|Learn Linux]] some italicised text</em> some non-italicised text <em>some italicised text [[http://linux.com/learn|Learn Linux]] some italicised text</em>
and ultimately once I've processed the wikitext formatted external URLs it'll output as:

some italicised text Learn Linux some italicised text some non-italicised text some italicised text Learn Linux some italicised text

I did solve it by substituting the conflicting slashes with something else and then putting them back afterwards which seems the logical thing to do.

This is just an example of the problem I had. I need to be able to italicise //everything and anything// that appears inside double slashes //multiple times// on the same line.

Regards,
Thunor
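
The later link-processing step is not shown anywhere in the thread, so the following is only a hypothetical sketch of how a [[URL|text]] link might be turned into HTML with sed. The function name link_to_html and the exact expression are assumptions, not thunor's code.

```shell
#!/bin/sh
# Hypothetical sketch: the real link expression is not shown in the thread.
# "#" is the s-command delimiter so that "|" can stay a literal character.
#   \[\[          the opening brackets
#   \([^]|]*\)    the URL: everything up to the first "|" or "]"
#   |             the separator
#   \([^]]*\)     the link text: everything up to the closing brackets
link_to_html() {
    sed -e 's#\[\[\([^]|]*\)|\([^]]*\)]]#<a href="\1">\2</a>#g'
}

echo '<em>some italicised text [[http://linux.com/learn|Learn Linux]] some italicised text</em>' | link_to_html
```

Running the italic step first and the link step second keeps the two concerns apart, since by then no "//" markers remain for the URL to collide with.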

#6 Post by seaside »

Thunor,

I guess this could be done with sed pattern holds and buffer manipulations which I don't really comprehend. Your solution is to the point and much easier to understand (none of those strange char combinations that require a lookup) :)

Best Regards,
s
(You must be the sed wizard you were looking for :) )

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#7 Post by technosaurus »

I recommend posting this to stackoverflow if you can't already find the answer there

using awk and assuming they don't span lines (if they can span lines, just set RS="EOF" or something in the BEGIN section)

Code: Select all

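awk '
BEGIN{FS="//"}
{
for(i=1;i<=NF;i++){
    printf "%s", $i                  # odd field: text outside the slashes
    i++
    if(i<NF){
        printf "<em>%s</em>", $i     # even field: text between a pair of slashes
    }else if(i==NF){
        printf "//%s", $i            # unpaired "//": put it back untouched
    }
}
print ""                             # finish the output line
}
'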
Check out my [url=https://github.com/technosaurus]github repositories[/url]. I may eventually get around to updating my [url=http://bashismal.blogspot.com]blogspot[/url].
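
To make the field logic easier to follow, this sketch just prints what awk sees when "//" is the field separator. The helper name show_fields is invented for the example.

```shell
#!/bin/sh
# show_fields is a hypothetical helper that lists each field, numbered,
# as awk splits the line on "//".
show_fields() {
    awk 'BEGIN { FS = "//" } { for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

echo 'plain //italic// plain //more// end' | show_fields
```

Odd-numbered fields hold the text outside the markers and even-numbered fields hold the text between them, which is why the loop steps through the fields in pairs. A URL such as "http://x" is split the same way, into "http:" and "x".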

#8 Post by sunburnt »

That's essentially what I was going to offer up,

A Bash loop to handle the <em></em> tag pairs and ignore "http://".

techysaurus :wink: is always spot on for the most succinct script code...

#9 Post by technosaurus »

sunburnt wrote:and ignore "http://"
for that you'd need something before the i++ like:

Code: Select all

if(substr($i,length($i),1)==":"){printf "//";continue}

#10 Post by seaside »

technosaurus,

I tried to get this part -

Code: Select all

if(substr($i,length($i),1)==":"){printf "//";continue}
to work and couldn't. So here's a crossover "Thunor-@colon_slash_slash@" awk version.

Code: Select all

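awk '
BEGIN{FS="//"}

# Hide "://" first so that URLs are not split on their "//".
{gsub("://","@colon_slash_slash@")}

{
out=""
for(i=1;i<=NF;i++){
    out=out $i                          # odd field: text outside the slashes
    i++
    if(i<NF){
        out=out "<em>" $i "</em>"       # even field: text between a pair of slashes
    }else if(i==NF){
        out=out "//" $i                 # unpaired "//": leave it as it was
    }
}
gsub("@colon_slash_slash@","://",out)   # put every "://" back
print out
}
'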
No speed difference between sed and awk versions.

Best regards,
s
(Hmmm..."@colon_slash_slash@" sounds more like a colonoscopy, only more comfortable in code than in person) :)

#11 Post by technosaurus »

damn,... I was trying to do it in my head again without running the code - wasn't 100% sure continue was supported the way it is in shell ... anyhow consider it pseudo code
seaside wrote:"@colon_slash_slash@" sounds more like a colonoscopy
reminds me of a scene in the movie Seven
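
One possible way to make the colon check from earlier in the thread work without placeholders is to glue the fields back together whenever a field ends in ":". This is a sketch under that assumption, not anyone's posted solution; italic_awk, part, glue and out are names made up here. It would also glue text such as "note://x", so it treats any colon directly before "//" as part of a URL.

```shell
#!/bin/sh
# Sketch only. italic_awk, part, glue and out are names invented for this example.
italic_awk() {
    awk '
    BEGIN { FS = "//" }
    {
        # Pass 1: rebuild the pieces, gluing back any split made inside "://".
        n = 0; glue = 0
        for (i = 1; i <= NF; i++) {
            if (glue) { part[n] = part[n] "//" $i }
            else      { part[++n] = $i }
            glue = ($i ~ /:$/)   # field ended in ":", so the "//" belonged to a URL
        }
        # Pass 2: odd pieces are plain text, even pieces sat between markers.
        out = ""
        for (i = 1; i <= n; i++) {
            if (i % 2 == 1)  { out = out part[i] }
            else if (i < n)  { out = out "<em>" part[i] "</em>" }
            else             { out = out "//" part[i] }   # unpaired marker, left alone
        }
        print out
    }'
}

echo '//a [[http://example.org/p|X]] b// plain //c//' | italic_awk
```

The check has to happen before fields are classified as inside or outside a span, because the "http:" piece can land in either position depending on where the URL sits on the line.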