A bash script to convert .xml files to .txt

For discussions about programming, programming questions/advice, and projects that don't really have anything to do with Puppy.
Message
Author
musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#16 Post by musher0 »

Hello everyone.

I'll say it politely, but I am fuming:

I just tested them on the pekwm xml material, and xmlstarlet, xml2 and company
are a complete waste of time and intelligence when what you want is a complete
txt file from a complete xml file.

Those utilities are basically designed to extract precise data from xml files. You
want the whole thing, you're out of luck.

This is nothing personal addressed at any of you nice people who shared your
findings. Again, thanks.

But how is it that no one in the (Linux only?) world ever thought that a complete
xml to txt conversion utility might someday become a need? Flabbergasting!!!

I'm going back to my scripts!!!

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#17 Post by puppy_apprentice »

Have you used DocBook's XSLT stylesheets with those tools? An XSLT stylesheet is a set of rules that helps convert an XML file into another format.

https://www.oxygenxml.com/forum/topic10767.html

Code: Select all

xml tr oxygen.xsl your-xml.xml >test.txt
where oxygen.xsl is:

Code: Select all

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="text()[string-length(normalize-space()) = 0]">
        <xsl:text>
</xsl:text>
    </xsl:template>
   
    <xsl:template match="@*"/>
</xsl:stylesheet>
The output may not be as beautiful as yours, but wait, maybe it is time to learn XSLT and make a much better XSLT sheet ;).

Or try using Pandoc:

https://pandoc.org/demos.html

Example 31:

Code: Select all

pandoc -f docbook -t markdown -s howto.xml -o example31.text
Last edited by puppy_apprentice on Wed 14 Feb 2018, 16:06, edited 3 times in total.

jamesbond
Posts: 3433
Joined: Mon 26 Feb 2007, 05:02
Location: The Blue Marble

#18 Post by jamesbond »

My apologies for offering the wrong tool for the job. I didn't read the first post carefully.

All that xml2 does is flatten the .xml files so you can process them further with the familiar awk/sed/grep set of tools (which are line-based and cannot handle hierarchical structures, which is what .xml files are). It's a generic tool to pre-process generic .xml files for further processing.
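
Just to illustrate (the file name is made up; this assumes xml2's usual "path=value" output):

Code: Select all

# Flatten the XML into one "path=value" line per node,
# then use ordinary line-based tools on the result.
xml2 < basics.xml > basics.flat              # e.g. /chapter/section/title=Themes
grep '/title=' basics.flat | sed 's/.*=//'   # keep only the title texts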

You need a tool to convert a bunch of very specific .xml files (that is, pekwm doc files), in a very specific way, into text files. This is a very specific requirement that no generic tool can meet, or even exists for. I believe your script is the first tool ever created to accomplish that job, and being the only tool in its class, I would say it's the best tool there is.

@puppy_apprentice: The pekwm doc is written in an old version of DocBook. With a newer DocBook all you need is an XSLT processor (the one from xmlstarlet should do) and the DocBook XSLs; but pekwm uses DSSSL, which requires the OpenJade tool to convert it into PDF or HTML. The author of pekwm said so himself: https://github.com/pekdon/pekwm/blob/ma ... /mkdocs.sh.
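
For a modern DocBook file, the whole HTML conversion would be roughly a one-liner (the stylesheet path is only an example - it varies between distros, and the file name is made up):

Code: Select all

# xsltproc (or xmlstarlet's "xml tr") plus the stock DocBook XSL
xsltproc /usr/share/xml/docbook/stylesheet/docbook-xsl/html/docbook.xsl manual.xml > manual.html
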
Fatdog64 forum links: [url=http://murga-linux.com/puppy/viewtopic.php?t=117546]Latest version[/url] | [url=https://cutt.ly/ke8sn5H]Contributed packages[/url] | [url=https://cutt.ly/se8scrb]ISO builder[/url]

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#19 Post by puppy_apprentice »

I believe it is no problem to convert those XMLs to HTML with a proper XSLT stylesheet (PDF is another story, but you can convert them to PDF using my CSS, from Firefox via CUPS-PDF).

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#20 Post by musher0 »

Many thanks for the encouragements, guys.

Text format is two-thirds done! :D

Once they are finished, I will have to "attack" :twisted: the original xml files with
puppy_apprentice's css code! (Again thanks.)

So I'm far from through with this project yet. (Learning a lot along the way.)

I intend to send the final edit of the files back to the pekwm authors, and
hopefully they will like what they see.

~~~~~~~~~~~~~~

Not that puppy_apprentice is wrong, but James is also right: you can have an
excellent general tool, but there are always details to adjust. Not to mention
personal preferences of this author versus personal preferences of that other author.

General tools can and do save editors a lot of time, but the finishing touches always
have to be done by hand.

~~~~~~~~~~~~~~

Can I ask a favour of any of you guys? For the last 4-5 days, when I go to pekwm.org,
I've been getting a site that's for sale. But a week ago I could still access
the pekwm docs online. Could someone double-check? TIA.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#21 Post by puppy_apprentice »

A few years ago I made a few XSLT stylesheets. So I've looked at them again, read some Stack Exchange hints, and here is a solution:

Convert to text, save as text.xsl

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" indent="yes" encoding="UTF-8"/>
  
  <xsl:template match="title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
  
  <xsl:template match="para[1]">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/title">
    <xsl:value-of select="translate(text(), 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <xsl:value-of select="normalize-space(text())"/>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
     <xsl:value-of select="concat(normalize-space(text()), '

')"/>
     <xsl:for-each select="itemizedlist/listitem/para">
       <xsl:value-of select="concat('+ ', normalize-space(text()), '
')"/>
     </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
usage:

Code: Select all

xml tr text.xsl pek-xml.xml >test.txt
Convert to HTML, save as html.xsl

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>
  
  <xsl:template match="/">
  <html> 
     <head>
        <title>HTML Page from PekWM XML Docs</title>
     </head>

     <body>
        <xsl:apply-templates/>
     </body>
  </html>
 </xsl:template>
  
  <xsl:template match="title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
  
  <xsl:template match="para[1]">
    <p><xsl:value-of select="text()"/></p>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/title">
    <h1><xsl:value-of select="text()"/></h1>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/term">
    <p><b><xsl:value-of select="text()"/></b></p>
  </xsl:template>
  
  <xsl:template match="para[2]/variablelist/varlistentry/listitem/para">
    <p><i><xsl:value-of select="text()"/></i></p>
    <ul>
     <xsl:for-each select="itemizedlist/listitem/para">
       <li><xsl:value-of select="text()"/></li>
     </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
usage:

Code: Select all

xml tr html.xsl pek-xml.xml >test.html
Attachments
xstarlet.tar.gz
xmlstarlet and those 2 XSLT sheets
(129.63 KiB) Downloaded 197 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#22 Post by musher0 »

puppy_apprentice,

your nick is misleading! ; ) With xml, you are top-notch!

Your code performed the conversion to txt format instantly for all the xml files in the theme
section of the pekwm docs, except the top one, the one with &bla-bla-bla; references in
it. I did that one by hand, but it was the shortest file of the bunch.

Impressive result attached (with source files).

I concatenated all the resulting files into a main "theme.txt" file, keeping the individual components.
My additional step after that was to use

Code: Select all

fmt -70 theme.txt
.
Many thanks.

BFN.
Attachments
pekwm-theme-section.zip
(14.15 KiB) Downloaded 121 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#23 Post by puppy_apprentice »

Those two stylesheets need some tweaking. I made them especially for your first posted example. The other XML files have other tags and a slightly different structure, so they don't look as nice as expected. It is possible to write a more general sheet, but that is an exercise for others ;)

Some references:
http://www.xsltfunctions.com/xsl/
http://scraping.pro/5-best-xpath-cheat- ... eferences/

And XMLStarlet only understands XSLT v1.0, so not all functions from those resources will work with it.

So we learn here some bash, some CSS, XML, XSLT, XPath. Nice!
Attachments
test.tar.gz
The output for all the XML files should look more like this.
(3.76 KiB) Downloaded 170 times

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#24 Post by puppy_apprentice »

OK musher0, I've found proper XSLT stylesheets for the PekWM docs.

Unzip the archive and go to the xsl folder. Read the info file for instructions. You will then see nice html docs for the theme section. Could you check the rest of the PekWM xml doc files?
Attachments
temp3.png
structure.html from structure.xml
(53.61 KiB) Downloaded 224 times
looks-good.tar.gz
I think I found the proper XSLT sheets for the PekWM XML docs.
(181.09 KiB) Downloaded 169 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#25 Post by musher0 »

Hi, puppy_apprentice.

You did indeed get beautiful results.

I am using your main.xsl file within the script below, in dir /usr/share/doc/pekwm,
crudely drilling down into the subdirs to get the results. It's just that I am afraid
there might be links within the docs in the subdirs.

Talk with you later.

~~~~~~~~~~

Code: Select all

#!/bin/sh
# formula-PA.sh
####
# List the base names of all .xml files in the current dir (assumes one dot per name).
ls *.xml | awk -F"." '{ print $1 }' > liste
while read doc;do
#	replaceit --input=$doc.xml "&" "-+- "
	xml tr ~/my-applications/text.xsl $doc.xml > $doc.PA.txt
	xml tr ~/my-applications/main.xsl $doc.xml > $doc.PA.html
# formula from « puppy-apprentice », of the Puppy forum
done < liste
rm -f liste
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#26 Post by musher0 »

Almost forgot: if the pekwm.org site stays down, it will be important to have those
docs in html up on another site.

I will send a PM to forum member augras to see if it is possible on his augras.eu site,
where he is hosting some of my stuff.

But where does not really matter. I am thinking that we would need some kind of
approval from the pekwm people. Maybe they can tell us what is really going on
with their site, too.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#27 Post by musher0 »

I should have provided these earlier, sorry.
Attachments
pekwm-docs.tar.gz
Complete pekwm docs in xml, from the source zip archive on github.
(62.73 KiB) Downloaded 167 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#28 Post by musher0 »

Hello puppy_apprentice.

I have tried the system you suggested today (with the db2xhtml-master, etc.), and
got nowhere. Probably there is something I am not understanding.

However, I applied the tips you suggested yesterday and sometimes got pretty
good results. Two zip archives containing the full results of that run -- and the
script I used -- are attached:

-- the *.PA.txt and *.PA.html files were created with your system. As you will see,
some have come out in outstanding fashion, others not so much, and others still
were not created at all.

My initial reaction would be to finish the job with a good html editor such as the
Kompozer in SeaMonkey.

But xml/html are more your field of expertise than mine, so if you can produce
the same quality of html as you have shown above, through your "xsl" conversion,
applying it to the full pekwm xml's (in the zip archive from the pekwm source,
above), I will certainly not complain!!! Your process is blazingly fast, but it takes
time and expertise to set up.

-- the plain *.txt files in the attachments were created with my script. I have also
edited them, with the help of some GNU text utilities, and manually. I feel I have
almost finished. I will use the good text files obtained with your method as a
comparison basis, and as filler if needed, during a couple of final editing sessions.

~~~~~~~~

In parallel, I have now evolved a reader script and a search script for the pekwm
doc in text format. A menu acts as the table of contents, and the "real less" utility
as the reader.
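
(Not my actual scripts - just a rough sketch of the menu-plus-less idea, with a made-up directory name:)

Code: Select all

#!/bin/bash
# Tiny table-of-contents reader: pick a chapter from a menu, read it with less.
cd ~/pekwm-docs-txt || exit 1
select chap in *.txt Quit; do
    [ "$chap" = Quit ] && break
    [ -n "$chap" ] && less "$chap"   # inside less, /pattern searches the text
done
# A companion search could be as simple as: grep -in "pattern" ~/pekwm-docs-txt/*.txt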

I think this reader and search system is exportable | adaptable to other big text
documents. I need another manual like the pekwm manual to test the scripts on
and confirm this.

~~~~~~~~

This ends my "status report". I hope it will help us collaborate on this pekwm
docs project.

BFN.
Attachments
pekwm-d1.zip
(135.16 KiB) Downloaded 122 times
pekwm-d2.zip
(141.09 KiB) Downloaded 111 times
zipsplit.zip
Index for the two zip archives above
(1.52 KiB) Downloaded 115 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#29 Post by puppy_apprentice »

I had to do some greps, cats, replaceits and xmls, and everything is now in HTML format.
There were some problems with lines that included the "&" sign, so I had to replace it in the XMLs first with "***AND***" and then, in the final HTML document, replace it back to "&". There were also problems with the "<simplesect>" tag in the XMLs - I had to change it first to "<section>".
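
Roughly like this (not the exact commands I used - just a sketch with sed instead of replaceit, file names made up):

Code: Select all

# Hide the bare "&" and modernize <simplesect> before the XSLT pass...
sed -e 's/&/***AND***/g' \
    -e 's/<simplesect>/<section>/g' \
    -e 's|</simplesect>|</section>|g' basics.xml > basics.fixed.xml
xml tr main.xsl basics.fixed.xml > basics.html
# ...then put the ampersands back into the finished HTML.
sed -i 's/\*\*\*AND\*\*\*/\&/g' basics.html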

Read the whole HTML documents, musher0, and correct any errors (there will be some variables starting with "&", like &copyright, so erase them or find in the sources what those variables mean and replace them with the proper values).
Attachments
pekwm-docs-html.tar.gz
Happy reading
(58.07 KiB) Downloaded 329 times

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#30 Post by musher0 »

Beautiful work, puppy_apprentice!

A thousand thanks for this!

I have referenced your layout on the pekwm thread, here.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#31 Post by musher0 »

Hello all.

Converting xml to txt, I have often run head-on into the problem of too many --
or too few -- line spacings between the paragraphs and subtitles.

I have come up with a Bash solution to REDUCE the number of line spacings
in a text document. Since all other tips on the Internet (really; this is no exaggeration!)
are about completely removing them, there was a need for this "middle-of-the-road"
approach.

If you completely remove line spacings, it is as bad as when there are too many
of them: the reader has more trouble focusing on the content, because the difference
between foreground and background is either closer to nil -- with no line spacings --,
or too great -- when there are too many. The reader has to do a mental correction
as (s)he reads, and that gets in the way of faster and better understanding.

I have written a short article which expands on the above ideas, on the French side
of the forum. Feel free to use the DeepL Translator on that post. I am available
if there remains any confusion about some sentences; just ask.

Enjoy. BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#32 Post by MochiMoppel »

musher0 wrote:All other tips on the Internet (really; this is no exaggeration!)
being about completely removing them,
???
Look closer. One of the simplest ways is

Code: Select all

cat -s filename
I normally use sed but I know that you don't like sed.

Your code seems to destroy content.
Input file, containing 12 lines:

Code: Select all



4
5

7
8


11
12
Output file contains 6 lines:

Code: Select all


5
7


BTW: Your bash code posted here also swallows content. The sample output skips some listitems ("Desktop" etc.) present in the original XML document.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#33 Post by musher0 »

Hi MochiMoppel.

I have tried < cat -s textfile >. It does a fair job, but it does not care whether the text has
2 or 3 line spacings in a row: it condenses all of them into one line spacing. That is
OK, but in edited texts, the number of line spacings in a row means something.

Usually:
One line space between paragraphs;
Two line spaces between sections;
Three line spaces between chapters (or more major sections).

My script tries to respect that custom, whereas < cat -s > does not.
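
(For the record, this is not my script - just a minimal awk sketch of the "reduce without flattening" idea, capping runs at three blank lines:)

Code: Select all

# Keep at most 3 consecutive blank lines; shorter runs pass through untouched.
awk 'NF { blanks = 0; print; next } { if (++blanks <= 3) print }' input.txt > output.txt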

I have attached illustrations of the original txt, of the cat -s version, and of the
version produced by my script, so people can better grasp the concept. Only the
ending of the text is illustrated, but IMO it is telling enough about the line spacings,
AND about the line count. The sources for those texts are also attached.

~~~~~~~~~~~~~~~~

As to the missing content, thanks for noticing. I think I said it before: the result of
this type of tool always needs to be compared with the original by a human editor,
and corrections brought to the final draft by said human when necessary.

~~~~~~~~~~~~~~~~

That said, this forum needs more good critics like yourself. (I'm serious!) Thanks
for bringing this to my attention. I'll see what I can do to solve the content
problem from within the script.

BFN.
Attachments
sourcesadvanced.PA.zip
(7.26 KiB) Downloaded 93 times
line-spacings-from-my-script.jpg
(99.09 KiB) Downloaded 124 times
line-spacings-from-cat-s.jpg
(104.03 KiB) Downloaded 128 times
line-spacings-in-original.jpg
(69.2 KiB) Downloaded 120 times
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#34 Post by musher0 »

Hello all.

Here is a nice resource on how to lay out a document in text format, with the goal of
enhancing content accessibility:
https://www.w3.org/TR/WCAG-TECHS/text.html

You would follow the above guidelines once the xml to txt conversion is finished, and
you want to prettify and / or standardize your results.

BFN.
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

User avatar
puppy_apprentice
Posts: 299
Joined: Tue 07 Feb 2012, 20:32

#35 Post by puppy_apprentice »

musher0, in my Slacko 5.7 (I didn't check other Pups) I have found docbook.css - a stylesheet for DocBook xmls like those PekWM xmls.

Code: Select all

/usr/share/examples/xml/
As with my css, you should add these lines to the beginning of every xml file in DocBook format:

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/usr/share/examples/xml/docbook.css"?>
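
A quick way to do that for a whole directory could look like this (a sketch only; it assumes every file already starts with an <?xml ...?> declaration and that GNU sed is available):

Code: Select all

# Insert the stylesheet line right after the XML declaration of each file.
for f in *.xml; do
    sed -i '1a <?xml-stylesheet type="text/css" href="/usr/share/examples/xml/docbook.css"?>' "$f"
done
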
Attachments
xml-view.jpg
Without colors and other bells and whistles like in my css, but good enough for quick reading.
(55.12 KiB) Downloaded 77 times

Post Reply