Puppy Linux Discussion Forum Forum Index Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
A bash script to convert .xml files to .txt
musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Wed 14 Feb 2018, 06:06    Post subject:  

Hello everyone.

I'll say it politely, but I am fuming:

I just tested them on the pekwm xml material, and xmlstarlet, xml2 and the like
are a complete waste of time and intelligence when what you want is a complete
txt file from a complete xml file.

Those utilities are basically designed to extract precise data from xml files. You
want the whole thing, you're out of luck.

This is nothing personal addressed at any of you nice people who shared your
findings. Again, thanks.

But how is it that no one in the (Linux only?) world ever thought that a complete
xml to txt conversion utility might someday become a need? Flabbergasting!!!

I'm going back to my scripts!!!

BFN.

_________________
musher0
~~~~~~~~~~
Je suis né pour aimer et non pas pour haïr. (Sophocle) /
I was born to love and not to hate. (Sophocles)
puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Wed 14 Feb 2018, 07:48    Post subject:  

Have you tried the DocBook XSLT stylesheets with those tools? An XSLT stylesheet is a set of rules for converting an XML file into another format.

https://www.oxygenxml.com/forum/topic10767.html

Code:
xml tr oxygen.xsl your-xml.xml >test.txt


where oxygen.xsl is:
Code:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="text()[string-length(normalize-space()) = 0]">
        <xsl:text>
</xsl:text>
    </xsl:template>
   
    <xsl:template match="@*"/>
</xsl:stylesheet>


The output may not be as beautiful as yours, but wait, maybe it is time to learn XSLT and write a much better XSLT sheet ;)

Or try Pandoc:

https://pandoc.org/demos.html

Example 31:

Code:
pandoc -f docbook -t markdown -s howto.xml -o example31.text

Last edited by puppy_apprentice on Wed 14 Feb 2018, 12:06; edited 3 times in total
jamesbond

Joined: 26 Feb 2007
Posts: 3165
Location: The Blue Marble

PostPosted: Wed 14 Feb 2018, 11:58    Post subject:  

My apologies for offering the wrong tool for the job. I didn't read the first post carefully.

All that xml2 does is flatten the .xml files so you can process them further with the familiar awk/sed/grep set of tools (which are line-based and cannot work with hierarchical structures such as .xml files). It's a generic tool to pre-process generic .xml files for further processing.
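
To make the "flatten, then use line tools" idea concrete, here is a minimal sketch. It does not call xml2 itself; the sample lines are hand-written in the path=value shape that such a flattener produces, and the element names are made up for illustration rather than taken from the pekwm docs.

```shell
#!/bin/sh
# Hand-written sample in a flattened path=value shape (an assumption for
# illustration; these lines are not real pekwm content).
flat='/section/title=Themes
/section/para=Themes live in their own directory.
/section/para=Each theme has one main file.'

# Once every value sits on its own line, ordinary line tools apply:
# keep the paragraph lines and strip everything up to the first "=".
paras=$(printf '%s\n' "$flat" | grep '^/section/para=' | cut -d= -f2-)
printf '%s\n' "$paras"
```

The point is only that the hierarchy has become a prefix on each line, so grep can select by path and cut can drop it.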

You need a tool to convert a bunch of very specific .xml files (that is, pekwm doc files), in a very specific way, into text files. This is a very specific requirement for which no generic tool would do, or even exists. I believe that your script is the first tool ever created to accomplish that job, and being the only tool in its class, I would say it's the best tool there is.

@puppy_apprentice: The pekwm doc is written in an old version of DocBook. With a newer DocBook all you need is an XSLT processor (the one from xmlstarlet should do) and the DocBook XSL stylesheets; but pekwm uses DSSSL, which requires the OpenJade tool to convert it into PDF or HTML. The author of pekwm said so himself: https://github.com/pekdon/pekwm/blob/master/doc/tools/mkdocs.sh.

_________________
Fatdog64, Slacko and Puppeee user. Puppy user since 2.13.
Contributed Fatdog64 packages thread.
puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Wed 14 Feb 2018, 12:04    Post subject:  

I believe it is no problem to convert those XMLs to HTML with a proper XSLT stylesheet (PDF is another story, but you can print the HTML to PDF from Firefox via CUPS-PDF, using my CSS).
musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Wed 14 Feb 2018, 15:41    Post subject:  

Many thanks for the encouragement, guys.

Text format is two-thirds done! :D

Once they are finished, I will have to "attack" >:) the original xml files with
puppy_apprentice's css code! (Again thanks.)

So I'm far from through yet, with this project. (Learning a lot along the way.)

I intend to send the final edit of the files back to the pekwm authors, and
hopefully they will like what they see.

~~~~~~~~~~~~~~

Not that puppy_apprentice is wrong, but James is also right: you can have an
excellent general tool, but there are always details to adjust. Not to mention
personal preferences of this author versus personal preferences of that other author.

General tools can and do save editors a lot of time, but the finishing touches always
have to be done by hand.

~~~~~~~~~~~~~~

Can I ask any of you guys a favour? For the last 4-5 days, I've been getting a
"site for sale" page when I go to pekwm.org. But a week ago I could still access
the pekwm docs online. Could someone double-check? TIA.

BFN.

puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Wed 14 Feb 2018, 17:12    Post subject:  

A few years ago I made a few XSLT sheets. So I looked at them again, read some Stack Exchange hints, and here is a solution:

Convert to text, save as text.xsl

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" indent="yes" encoding="UTF-8"/>
 
  <xsl:template match="title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
 
  <xsl:template match="para[1]">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
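     <!-- &#10; is a newline; a literal line break inside an attribute value is normalized to a space -->
     <xsl:value-of select="concat(normalize-space(text()), '&#10;&#10;')"/>
     <xsl:for-each select="itemizedlist/listitem/para">
       <xsl:value-of select="concat('+ ', normalize-space(text()), '&#10;')"/>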
     </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


usage:
Code:

xml tr text.xsl pek-xml.xml >test.txt


Convert to HTML, save as html.xsl
Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
 
  <xsl:template match="/">
  <html>
     <head>
        <title>HTML Page from PekWM XML Docs</title>
     </head>

     <body>
        <xsl:apply-templates/>
     </body>
  </html>
 </xsl:template>
 
  <xsl:template match="title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
 
  <xsl:template match="para[1]">
    <p><xsl:value-of select="text()"/></p>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <p><b><xsl:value-of select="text()"/></b></p>
  </xsl:template>
 
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
    <p><i><xsl:value-of select="text()"/></i></p>
    <ul>
     <xsl:for-each select="itemizedlist/listitem/para">
       <li><xsl:value-of select="text()"/></li>
     </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>


usage:
Code:

xml tr html.xsl pek-xml.xml >test.html

Attachment: xstarlet.tar.gz (129.63 KB): xstarlet and those 2 XSLT sheets
musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Wed 14 Feb 2018, 19:24    Post subject:  

puppy_apprentice,

your nick is misleading! ; ) With xml, you are top-notch!

Your code converted all the xml files in the theme section of the pekwm docs to txt
format instantly, except the top one, the one with &bla-bla-bla; entity references in
it. I did that one by hand, but it was the shortest file of the bunch.

Impressive result attached (with source files).

I concatenated all resulting files in a main "theme.txt" file, keeping individual components.
My additional step after that was to use
Code:
fmt -70 theme.txt
.
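
For anyone who wants to repeat the concatenate-and-wrap step, here is a rough sketch under stated assumptions: the two component files and their contents are stand-ins created in a scratch directory, not the real pekwm text files. It uses `fmt -w 70`, the long form of `fmt -70`.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Two stand-in component files (hypothetical names and content).
printf '%s\n' 'FIRST PART' '' \
  'This first paragraph is deliberately written as one long line so that there is something for fmt to wrap.' \
  > "$workdir/01-intro.txt"
printf '%s\n' 'SECOND PART' '' 'Another short paragraph.' > "$workdir/02-details.txt"

# Concatenate the parts, adding a blank line after each one so that fmt
# does not merge the end of one file into the start of the next.
: > "$workdir/theme.txt"
for part in "$workdir"/0*.txt; do
    cat "$part" >> "$workdir/theme.txt"
    echo >> "$workdir/theme.txt"
done

# Re-wrap every paragraph to at most 70 columns.
fmt -w 70 "$workdir/theme.txt" > "$workdir/theme.wrapped.txt"
cat "$workdir/theme.wrapped.txt"
```

Note that fmt joins adjacent non-blank lines into one paragraph, which is why the titles in the sample are followed by a blank line.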
Many thanks.

BFN.

Attachment: pekwm-theme-section.zip (14.15 KB)

puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Thu 15 Feb 2018, 04:49    Post subject:  

Those two stylesheets need some tweaking. I made them especially for your first posted example. Other XML files have other tags and a slightly different structure, so they don't come out as nicely as expected. It is possible to write a more general sheet. But that is an exercise for others ;)

Some references:
http://www.xsltfunctions.com/xsl/
http://scraping.pro/5-best-xpath-cheat-sheets-and-quick-references/

And XMLStarlet only understands XSLT v1.0, so not all functions from those resources will work with it.

So we learn here some bash, some CSS, XML, XSLT, XPath. Nice!

Attachment: test.tar.gz (3.76 KB): Destination output for all XML files should look more like this.
puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Thu 15 Feb 2018, 17:48    Post subject:  

OK musher0, I've found proper XSLT stylesheets for the PekWM docs.

Unzip the archive and go to the xsl folder. Read the info file for instructions. Now you will see nice html docs for the theme section. Could you check the rest of the PekWM xml doc files?

Attachment: temp3.png (53.61 KB): structure.html from structure.xml
Attachment: looks-good.tar.gz (181.09 KB): i think that i found proper XSLT sheets for PekWM XML docs.
musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Thu 15 Feb 2018, 19:32    Post subject:  

Hi, puppy_apprentice.

You got beautiful results indeed.

I am using your main.xsl file within the script below in dir /usr/share/doc/pekwm, crudely
drilling down in the subdirs to get the results. It's just that I am afraid there might be
links within the docs in the subdirs.

Talk with you later.

~~~~~~~~~~
Code:
#!/bin/sh
# formula-PA.sh
# Convert every .xml file in the current directory to text and HTML.
# Formula from "puppy_apprentice", of the Puppy forum.
####
for f in *.xml; do
   [ -e "$f" ] || continue       # no .xml files here: skip the unexpanded glob
   doc="${f%.xml}"               # strip only the final .xml extension
#   replaceit --input="$doc.xml" "&" "-+- "
   xml tr ~/my-applications/text.xsl "$f" > "$doc.PA.txt"
   xml tr ~/my-applications/main.xsl "$f" > "$doc.PA.html"
done
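
As a possible alternative to drilling down into each subdir by hand, here is a hedged sketch that lets find walk the tree. `convert_tree` is a hypothetical helper name, not part of xmlstarlet, and it assumes the same `xml tr` command and stylesheet locations as the script above. File names containing newlines are not handled.

```shell
#!/bin/sh
# convert_tree DIR STYLESHEET EXT   (hypothetical helper)
# Runs "xml tr" on every .xml file under DIR, including subdirectories,
# and writes NAME.PA.EXT next to each source file.
convert_tree() {
    dir=$1 sheet=$2 ext=$3
    find "$dir" -type f -name '*.xml' | while IFS= read -r f; do
        xml tr "$sheet" "$f" > "${f%.xml}.PA.$ext"
    done
}

# Usage, with the paths from the script above:
# convert_tree /usr/share/doc/pekwm ~/my-applications/text.xsl txt
# convert_tree /usr/share/doc/pekwm ~/my-applications/main.xsl html
```

This only automates the walk; it does nothing about links between documents in different subdirs.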

musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Thu 15 Feb 2018, 20:16    Post subject:  

Almost forgot: if the pekwm.org site stays down, it will be important to have those
docs in html up on another site.

I will send a PM to forum member augras to see if it is possible on his augras.eu site,
where he is hosting some of my stuff.

But where does not really matter. I am thinking that we would need some kind of
approval from the pekwm people. Maybe they can tell us what is really going on
with their site, too.

BFN.

musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Thu 15 Feb 2018, 20:55    Post subject:  

I should have provided these earlier, sorry.

Attachment: pekwm-docs.tar.gz (62.73 KB): Complete pekwm docs in xml, from the source zip archive on github.

musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Thu 15 Feb 2018, 22:01    Post subject:  

Hello puppy_apprentice.

I have tried the system you suggested today (with the db2xhtml-master, etc.), and
got nowhere. Probably there is something I am not understanding.

However, I applied the tips you suggested yesterday and got sometimes pretty
good results. Two zip archives containing the full results of that run -- and the
script I used, are attached:

-- the *.PA.txt and *.PA.html files were created with your system. As you will see,
some have come out in outstanding fashion, others not so much, and others still
were not created at all.

My initial reaction would be to finish the job with a good html editor such as the
Kompozer in SeaMonkey.

But xml/html are more your field of expertise than mine, so if you can produce
the same quality of html as you have shown above, through your "xsl" conversion,
applying it on the full pekwm xml's (in the zip archive from the pekwm source,
above), I will certainly not complain!!! Your process is blazingly fast, but it takes
time and expertise to set up.

-- the plain *.txt files in the attached were created with my script. I have also
edited them, with the help of some GNU text utilities, and manually. I feel I have
almost finished. I will use the good text files obtained with your method as
comparison basis, and filler, if needed, during a couple of final editing sessions.

~~~~~~~~

In parallel, I have now evolved a reader script and a search script for the pekwm
doc in text format. A menu acts as the table of contents, and the "real less" utility
as the reader.

I think this reader and search system is exportable | adaptable to other big text
documents. I need another manual like the pekwm manual to test the scripts on
and confirm this.
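
For readers curious what the search half of such a system could look like, here is a minimal sketch. It is not the actual script described above; `search_docs` is a hypothetical name and the layout (one directory of *.txt files) is an assumption.

```shell
#!/bin/sh
# search_docs DIR PATTERN   (hypothetical helper)
# Case-insensitive, fixed-string search through DIR/*.txt.
# Prints file name, line number and the matching line.
# /dev/null is added so grep shows the file name even when only one file matches the glob.
search_docs() {
    grep -i -n -F -- "$2" "$1"/*.txt /dev/null
}

# Usage (directory name is an assumption):
# search_docs ~/pekwm-txt "autoproperties" | less
```

Piping the result into less keeps the reader and the search tool consistent with each other.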

~~~~~~~~

This ends my "status report". I hope it will help us collaborate on this pekwm
docs project.

BFN.

Attachment: pekwm-d1.zip (135.16 KB)
Attachment: pekwm-d2.zip (141.09 KB)
Attachment: zipsplit.zip (1.52 KB): Index for the two zip archives above

puppy_apprentice


Joined: 07 Feb 2012
Posts: 212

PostPosted: Fri 16 Feb 2018, 13:29    Post subject:  

I had to do some greps, cats, replaceits and xmls, and now everything is in HTML format.
There were some problems with lines that included the "&" sign, so I had to replace it in the XMLs first with "***AND***", and then in the final HTML document change it back to "&". There were problems too with the "<simplesect>" tag in the XMLs; I had to change it to "<section>" first.
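
The workaround described above can be sketched with sed. This is a guess at the shape of the steps, not the exact commands that were used; `protect_amp` and `restore_amp` are hypothetical helper names and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# protect_amp: hide every "&" behind a placeholder and rename <simplesect>
# to <section>, so that the XSLT processor accepts the input.
protect_amp() {
    sed -e 's/&/***AND***/g' \
        -e 's/<simplesect\([ >]\)/<section\1/g' \
        -e 's/<\/simplesect>/<\/section>/g'
}

# restore_amp: put the "&" back in the generated HTML.
restore_amp() {
    sed -e 's/\*\*\*AND\*\*\*/\&/g'
}

# Usage (placeholder file names):
# protect_amp < doc.xml > doc.safe.xml
# xml tr main.xsl doc.safe.xml | restore_amp > doc.html
```

Because the replacement is blanket, standard entities such as `&amp;` are hidden and restored too, which is harmless for the round trip.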

Read through the whole HTML documents, musher0, and correct errors where needed (there will be some variables starting with "&", like &copyright, so erase them or look in the sources for what those variables mean and replace them with the proper values).
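
To find those leftover "&" variables without reading every page, something like the following sketch may help. `list_entities` is a hypothetical helper name; it relies on `grep -o`, which GNU and BSD grep provide.

```shell
#!/bin/sh
# list_entities FILE...   (hypothetical helper)
# Prints each distinct "&name" token found in the files, one per line,
# so they can be looked up in the sources or erased.
list_entities() {
    grep -o -h -e '&[A-Za-z][A-Za-z0-9._-]*' -- "$@" | sort -u
}

# Usage (placeholder file names):
# list_entities *.html
```

The list will also include ordinary HTML entities such as `&amp` or `&lt`; those are fine and can be ignored.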

Attachment: pekwm-docs-html.tar.gz (58.07 KB): Happy reading
musher0

Joined: 04 Jan 2009
Posts: 12964
Location: Gatineau (Qc), Canada

PostPosted: Fri 16 Feb 2018, 17:38    Post subject:  

Beautiful work, puppy_apprentice!

A thousand thanks for this!

I have referenced your layout on the pekwm thread, here.

BFN.
