A bash script to convert .xml files to .txt

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
Message
Author
musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#16 Post by musher0 »

Hello everyone.

I'll say it politely, but I am fuming:

I just tested them on the pekwm xml material, and xmlstarlet, xml2 and company
are a complete waste of time and intelligence when what you want is a complete
txt file from a complete xml file.

Those utilities are basically designed to extract precise data from xml files. You
want the whole thing, you're out of luck.

This is nothing personal addressed at any of you nice people who shared your
findings. Again, thanks.

But how is it that no one in the (Linux only?) world ever thought that a complete
xml to txt conversion utility might someday become a need? Flabbergasting!!!

I'm going back to my scripts!!!

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#17 Post by puppy_apprentice »

Have you used DocBook's XSLT stylesheets with those tools? An XSLT stylesheet is a set of rules that helps convert an XML file into another format.

https://www.oxygenxml.com/forum/topic10767.html

Code: Select all

xml tr oxygen.xsl your-xml.xml >test.txt
where oxygen.xsl is:

Code: Select all

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="text()[string-length(normalize-space()) = 0]">
        <xsl:text>
</xsl:text>
    </xsl:template>
   
    <xsl:template match="@*"/>
</xsl:stylesheet>
The output may not be as beautiful as yours, but wait, maybe it is time to learn XSLT and make a much better XSLT sheet ;).

Or try using Pandoc:

https://pandoc.org/demos.html

Example 31:

Code: Select all

pandoc -f docbook -t markdown -s howto.xml -o example31.text
Last edited by puppy_apprentice on Wed 14 Feb 2018, 16:06, edited 3 times in total.

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#18 Post by jamesbond »

My apologies for offering the wrong tool for the job. I didn't read the first post carefully.

All that xml2 does is flatten the .xml files so you can process them further with the familiar awk/sed/grep set of tools (which are line-based and cannot handle hierarchical structures, which is what .xml files are). It's a generic tool to pre-process generic .xml files for further processing.
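
Just to illustrate (the file name is made up; this assumes xml2's usual "path=value" output):

Code: Select all

# Flatten the XML into one "path=value" line per node,
# then use ordinary line-based tools on the result.
xml2 < basics.xml > basics.flat              # e.g. /chapter/section/title=Themes
grep '/title=' basics.flat | sed 's/.*=//'   # keep only the title texts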

You need a tool to convert a bunch of very specific .xml files (that is, pekwm doc files), in a very specific way, into text files. This is a very specific requirement that no generic tool can meet, or even exists for. I believe your script is the first tool ever created to accomplish that job, and being the only tool in its class, I would say it's the best tool there is.

@puppy_apprentice: The pekwm doc is written in an old version of DocBook. With a newer DocBook all you need is an XSLT processor (the one from xmlstarlet should do) and the DocBook XSLs; but pekwm uses DSSSL, which requires the OpenJade tool to convert it into PDF or HTML. The author of pekwm said so himself: https://github.com/pekdon/pekwm/blob/ma ... /mkdocs.sh.
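
For a modern DocBook file, the whole HTML conversion would be roughly a one-liner (the stylesheet path is only an example - it varies between distros, and the file name is made up):

Code: Select all

# xsltproc (or xmlstarlet's "xml tr") plus the stock DocBook XSL
xsltproc /usr/share/xml/docbook/stylesheet/docbook-xsl/html/docbook.xsl manual.xml > manual.html
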
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#19 Post by puppy_apprentice »

I believe it is no problem to convert those XMLs to HTML with a proper XSLT stylesheet (PDF is another story, but you can convert them to PDF using my CSS, from Firefox via CUPS-PDF).

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#20 Post by musher0 »

Many thanks for the encouragements, guys.

Text format is two-thirds done! :D

Once they are finished, I will have to "attack" :twisted: the original xml files with
puppy_apprentice's css code! (Again thanks.)

So I'm far from through with this project yet. (Learning a lot along the way.)

I intend to send the final edit of the files back to the pekwm authors, and
hopefully they will like what they see.

~~~~~~~~~~~~~~

Not that puppy_apprentice is wrong, but James is also right: you can have an
excellent general tool, but there are always details to adjust. Not to mention
personal preferences of this author versus personal preferences of that other author.

General tools can and do save editors a lot of time, but the finishing touches always
have to be done by hand.

~~~~~~~~~~~~~~

Can I ask a favour of any of you guys? For the last 4-5 days, when I go to pekwm.org,
I've been getting a site that's for sale. But a week ago I could still access
the pekwm docs online. Could someone double-check? TIA.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#21 Post by puppy_apprentice »

A few years ago I made a few XSLT stylesheets. So I've looked at them again, read some Stack Exchange hints, and here is a solution:

Convert to text, save as text.xsl

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" indent="yes" encoding="UTF-8"/>
  
  <xsl:template match="title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
  
  <xsl:template match="para[1]">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
     <xsl:value-of select="concat(normalize-space(text()), '

')"/>
     <xsl:for-each select="itemizedlist/listitem/para">
       <xsl:value-of select="concat('+ ', normalize-space(text()), '
')"/>
     </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
usage:

Code: Select all

xml tr text.xsl pek-xml.xml >test.txt
Convert to HTML, save as html.xsl

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  
  <xsl:template match="/">
  <html> 
     <head>
        <title>HTML Page from PekWM XML Docs</title>
     </head>

     <body>
        <xsl:apply-templates/>
     </body>
  </html>
 </xsl:template>
  
  <xsl:template match="title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
  
  <xsl:template match="para[1]">
    <p><xsl:value-of select="text()"/></p>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <p><b><xsl:value-of select="text()"/></b></p>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
    <p><i><xsl:value-of select="text()"/></i></p>
    <ul>
     <xsl:for-each select="itemizedlist/listitem/para">
       <li><xsl:value-of select="text()"/></li>
     </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
usage:

Code: Select all

xml tr html.xsl pek-xml.xml >test.html
Attachments
xstarlet.tar.gz
xmlstarlet and those 2 XSLT sheets
(129.63 KiB) Downloaded 197 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#22 Post by musher0 »

puppy_apprentice,

your nick is misleading! ; ) With xml, you are top-notch!

Your code performed the conversion to txt format instantly for all the xml files in the theme
section of the pekwm docs, except the top one, the one with &bla-bla-bla; references in
it. I did that one by hand, but it was the shortest file of the bunch.

Impressive result attached (with source files).

I concatenated all the resulting files into a main "theme.txt" file, keeping the individual components.
My additional step after that was to use

Code: Select all

fmt -70 theme.txt
.
Many thanks.

BFN.
Attachments
pekwm-theme-section.zip
(14.15 KiB) Downloaded 121 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#23 Post by puppy_apprentice »

Those two stylesheets need some tweaking. I made them especially for your first posted example. The other XML files have other tags and a slightly different structure, so they don't look as nice as expected. It is possible to write a more general sheet, but that is an exercise for others ;)

Some references:
http://www.xsltfunctions.com/xsl/
http://scraping.pro/5-best-xpath-cheat- ... eferences/

And XMLStarlet only understands XSLT v1.0, so not all functions from those resources will work with it.

So we learn here some bash, some CSS, XML, XSLT, XPath. Nice!
Attachments
test.tar.gz
The output for all the XML files should look more like this.
(3.76 KiB) Downloaded 170 times

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#24 Post by puppy_apprentice »

OK musher0, I've found proper XSLT stylesheets for the PekWM docs.

Unzip the archive and go to the xsl folder. Read the info file for instructions. You will then see nice html docs for the theme section. Could you check the rest of the PekWM xml doc files?
Attachments
temp3.png
structure.html from structure.xml
(53.61 KiB) Downloaded 224 times
looks-good.tar.gz
I think I found the proper XSLT sheets for the PekWM XML docs.
(181.09 KiB) Downloaded 169 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#25 Post by musher0 »

Hi, puppy_apprentice.

You did indeed get beautiful results.

I am using your main.xsl file within the script below, in dir /usr/share/doc/pekwm,
crudely drilling down into the subdirs to get the results. It's just that I am afraid
there might be links within the docs in the subdirs.

Talk with you later.

~~~~~~~~~~

Code: Select all

#!/bin/sh
# formula-PA.sh
####
# List the base names of all .xml files in the current dir (assumes one dot per name).
ls *.xml | awk -F"." '{ print $1 }' > liste
while read doc;do
#	replaceit --input=$doc.xml "&" "-+- "
	xml tr ~/my-applications/text.xsl $doc.xml > $doc.PA.txt
	xml tr ~/my-applications/main.xsl $doc.xml > $doc.PA.html
# formula from « puppy-apprentice », of the Puppy forum
done < liste
rm -f liste
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#26 Post by musher0 »

Almost forgot: if the pekwm.org site stays down, it will be important to have those
docs in html up on another site.

I will send a PM to forum member augras to see if it is possible on his augras.eu site,
where he is hosting some of my stuff.

But where does not really matter. I am thinking that we would need some kind of
approval from the pekwm people. Maybe they can tell us what is really going on
with their site, too.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#27 Post by musher0 »

I should have provided these earlier, sorry.
Attachments
pekwm-docs.tar.gz
Complete pekwm docs in xml, from the source zip archive on github.
(62.73 KiB) Downloaded 167 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#28 Post by musher0 »

Hello puppy_apprentice.

I have tried the system you suggested today (with the db2xhtml-master, etc.), and
got nowhere. Probably there is something I am not understanding.

However, I applied the tips you suggested yesterday and sometimes got pretty
good results. Two zip archives containing the full results of that run -- and the
script I used -- are attached:

-- the *.PA.txt and *.PA.html files were created with your system. As you will see,
some have come out in outstanding fashion, others not so much, and others still
were not created at all.

My initial reaction would be to finish the job with a good html editor such as the
Kompozer in SeaMonkey.

But xml/html are more your field of expertise than mine, so if you can produce
the same quality of html as you have shown above, through your "xsl" conversion,
applying it to the full pekwm xml's (in the zip archive from the pekwm source,
above), I will certainly not complain!!! Your process is blazingly fast, but it takes
time and expertise to set up.

-- the plain *.txt files in the attachments were created with my script. I have also
edited them, with the help of some GNU text utilities, and manually. I feel I have
almost finished. I will use the good text files obtained with your method as a
comparison basis, and as filler if needed, during a couple of final editing sessions.

~~~~~~~~

In parallel, I have now evolved a reader script and a search script for the pekwm
doc in text format. A menu acts as the table of contents, and the "real less" utility
as the reader.
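
(Not my actual scripts - just a rough sketch of the menu-plus-less idea, with a made-up directory name:)

Code: Select all

#!/bin/bash
# Tiny table-of-contents reader: pick a chapter from a menu, read it with less.
cd ~/pekwm-docs-txt || exit 1
select chap in *.txt Quit; do
    [ "$chap" = Quit ] && break
    [ -n "$chap" ] && less "$chap"   # inside less, /pattern searches the text
done
# A companion search could be as simple as: grep -in "pattern" ~/pekwm-docs-txt/*.txt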

I think this reader and search system is exportable | adaptable to other big text
documents. I need another manual like the pekwm manual to test the scripts on
and confirm this.

~~~~~~~~

This ends my "status report". I hope it will help us collaborate on this pekwm
docs project.

BFN.
Attachments
pekwm-d1.zip
(135.16 KiB) Downloaded 122 times
pekwm-d2.zip
(141.09 KiB) Downloaded 111 times
zipsplit.zip
Index for the two zip archives above
(1.52 KiB) Downloaded 115 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#29 Post by puppy_apprentice »

I had to do some greps, cats, replaceits and xmls, and everything is now in HTML format.
There were some problems with lines that included the "&" sign, so I had to replace it in the XMLs first with "***AND***" and then, in the final HTML document, replace it back to "&". There were also problems with the "<simplesect>" tag in the XMLs - I had to change it first to "<section>".
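
Roughly like this (not the exact commands I used - just a sketch with sed instead of replaceit, file names made up):

Code: Select all

# Hide the bare "&" and modernize <simplesect> before the XSLT pass...
sed -e 's/&/***AND***/g' \
    -e 's/<simplesect>/<section>/g' \
    -e 's|</simplesect>|</section>|g' basics.xml > basics.fixed.xml
xml tr main.xsl basics.fixed.xml > basics.html
# ...then put the ampersands back into the finished HTML.
sed -i 's/\*\*\*AND\*\*\*/\&/g' basics.html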

Read the whole HTML documents, musher0, and correct any errors (there will be some variables starting with "&", like &copyright, so erase them or find in the sources what those variables mean and replace them with the proper values).
Attachments
pekwm-docs-html.tar.gz
Happy reading
(58.07 KiB) Downloaded 329 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#30 Post by musher0 »

Beautiful work, puppy_apprentice!

A thousand thanks for this!

I have referenced your layout on the pekwm thread, here.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#31 Post by musher0 »

Hello all.

Converting xml to txt, I have often run head-on into the problem of too many --
or too few -- line spacings between the paragraphs and subtitles.

I have come up with a Bash solution to REDUCE the number of line spacings
in a text document. Since all other tips on the Internet (really; this is no exaggeration!)
are about completely removing them, there was a need for this "middle-of-the-road"
approach.

If you completely remove line spacings, it is as bad as when there are too many
of them: the reader has more trouble focusing on the content, because the difference
between foreground and background is either closer to nil -- with no line spacings --,
or too great -- when there are too many. The reader has to do a mental correction
as (s)he reads, and that gets in the way of faster and better understanding.

I have written a short article which expands on the above ideas, on the French side
of the forum. Feel free to use the DeepL Translator on that post. I am available
if there remains any confusion about some sentences; just ask.

Enjoy. BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#32 Post by MochiMoppel »

musher0 wrote:All other tips on the Internet (really; this is no exaggeration!)
being about completely removing them,
???
Look closer. One of the simplest ways is

Code: Select all

cat -s filename
I normally use sed but I know that you don't like sed.

Your code seems to destroy content.
Input file, containing 12 lines:

Code: Select all



4
5

7
8


11
12
Output file contains 6 lines:

Code: Select all


5
7


BTW: Your bash code posted here also swallows content. The sample output skips some listitems ("Desktop" etc.) present in the original XML document.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#33 Post by musher0 »

Hi MochiMoppel.

I have tried < cat -s textfile >. It does a fair job, but it does not care whether the text has
2 or 3 line spacings in a row: it condenses all of them into one line spacing. That is
OK, but in edited texts, the number of line spacings in a row means something.

Usually:
One line space between paragraphs;
Two line spaces between sections;
Three line spaces between chapters (or more major sections).

My script tries to respect that custom, whereas < cat -s > does not.
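
(For the record, this is not my script - just a minimal awk sketch of the "reduce without flattening" idea, capping runs at three blank lines:)

Code: Select all

# Keep at most 3 consecutive blank lines; shorter runs pass through untouched.
awk 'NF { blanks = 0; print; next } { if (++blanks <= 3) print }' input.txt > output.txt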

I have attached illustrations of the original txt, of the cat -s version, and of the
version produced by my script, so people can better grasp the concept. Only the
ending of the text is illustrated, but IMO it is telling enough about the line spacings,
AND about the line count. The sources for those texts are also attached.

~~~~~~~~~~~~~~~~

As to the missing content, thanks for noticing. I think I said it before: the result of
this type of tool always needs to be compared with the original by a human editor,
and corrections brought to the final draft by said human when necessary.

~~~~~~~~~~~~~~~~

That said, this forum needs more good critics like yourself. (I'm serious!) Thanks
for bringing this to my attention. I'll see what I can do to solve the content
problem from within the script.

BFN.
Attachments
sourcesadvanced.PA.zip
(7.26 KiB) Downloaded 93 times
line-spacings-from-my-script.jpg
(99.09 KiB) Downloaded 124 times
line-spacings-from-cat-s.jpg
(104.03 KiB) Downloaded 128 times
line-spacings-in-original.jpg
(69.2 KiB) Downloaded 120 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#34 Post by musher0 »

Hello all.

Here is a nice resource on how to lay out a document in text format, with the goal of
enhancing content accessibility:
https://www.w3.org/TR/WCAG-TECHS/text.html

You would follow the above guidelines once the xml to txt conversion is finished, and
you want to prettify and / or standardize your results.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#35 Post by puppy_apprentice »

musher0, in my Slacko 5.7 (I didn't check other Pups) I have found docbook.css - a stylesheet for DocBook xmls like those PekWM xmls.

Code: Select all

/usr/share/examples/xml/
As with my css, you should add these lines to the beginning of every xml file in DocBook format:

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/usr/share/examples/xml/docbook.css"?>
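
A quick way to do that for a whole directory could look like this (a sketch only; it assumes every file already starts with an <?xml ...?> declaration and that GNU sed is available):

Code: Select all

# Insert the stylesheet line right after the XML declaration of each file.
for f in *.xml; do
    sed -i '1a <?xml-stylesheet type="text/css" href="/usr/share/examples/xml/docbook.css"?>' "$f"
done
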
Attachments
xml-view.jpg
Without colors and other bells and whistles like in my css, but good enough for quick reading.
(55.12 KiB) Downloaded 77 times

Post Reply