Please find below a script that I created to transform the .xml files from the
pekwm documentation into plain text files. The result is available here. The
starting .xml material is in the doc directory of the pekwm source on GitHub, under
pekdon/pekwm.
I spent a couple of hours combing the Web, but could not find any satisfactory utility.
Most required Python; others required dishing out a lot of cash... I did find a nice little
Java applet in German, but it converts from XML to CSV, which was not ideal.
Puppies do have the xmllint utility. It does a fair job of converting xml to html but
loses any paragraph structure. (A fine utility that is...) It was ok with the smaller
xml files in the pekwm docs, but not with the larger ones. That is the moment when
you realize that you are lost without the blank line separator!!!
So I took the bull by the horns, and decided to convert the pekwm docs to text
format using the replaceit utility and a couple of GNU standards.
This is the script. I know there are as many xml styles as there are stars in the
sky, but it may be useful to someone as a starting point.
It's still in crude form, I'm afraid. Any improvements and/or constructive
observations will be welcome.
BFN.
~~~~~~~~~~~~~~
#!/bin/sh
# /opt/local/bin/xml2txt.sh
# (c) musher0, 12 February 2018. GPL3
#
# Usage: enter a directory where there are xml files and
# issue the command < xml2txt.sh > (without the chevrons).
#
# All the xml files will be grouped in a text file
# bearing the name of the directory, and processed.
#
# Requires : replaceit.
#
# Comment: it leaves something to be desired, and you will have
# to refine the output; but it does a basic job of eliminating
# most (?) of the XML tags.
#
####
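# Name of the current directory, used as the base name of the output.
name="$(basename "$(pwd)")"
# Concatenate all the xml files, squeezing repeated blank lines.
cat -s *.xml > "$name.tmp"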
RPLCT="replaceit --input=$name.tmp"
$RPLCT --wholeline simplesect "#";$RPLCT --wholeline listitem "#"
$RPLCT --wholeline varlistentry "#";$RPLCT --wholeline itemizedlist "#"
$RPLCT --wholeline variablelist "#";$RPLCT --wholeline xreflabel "#"
$RPLCT --wholeline dbhtml "#";$RPLCT "<para>" " "
$RPLCT "</para>" " ";$RPLCT formalpara " "
$RPLCT "<title>" " ¤¤ ";$RPLCT "</title>" " ¤¤ "
$RPLCT "<term>" " ¤ ";$RPLCT "</term>" " ¤ "
$RPLCT "<screen>" " -=- ";$RPLCT "</screen>" " -=- "
$RPLCT "<filename>" " ";$RPLCT "</filename>" " "
$RPLCT "</chapter>" "End of chapter"
$RPLCT "<chapter>" "Beginning of chapter"
$RPLCT "</note>" "End of note"
$RPLCT "<note>" "Beginning of note"
$RPLCT "</section>" "End of section"
$RPLCT "<section>" "Section:";$RPLCT "<partintro>" "Intro"
$RPLCT "</part" "#";$RPLCT "author>" "#"
$RPLCT "object>" "#";$RPLCT "subtitle>" " # "
$RPLCT "bookinfo>" "#";$RPLCT "abstract>" "#"
$RPLCT --wholeline "<authorgroup>" "#"
# Add more XML tags to be replaced or canceled out here.
# Simply follow the replaceit pattern above.
grep -vE "^#" "$name.tmp" | tr -s "\n" > "$name.txt"
# The commands in the line above drop the lines beginning
# with a "#" and then squeeze runs of newlines, which removes
# the empty lines.
rm -f "$name.tmp" # Obviously!
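
~~~~~~~~~~~~~~
In case replaceit is not installed, here is a rough sketch of the same idea using only sed. It is not the script above, just an approximation under a few assumptions: xml2txt_sed is a made-up name, the tag list is only a small subset of the one above, and I am assuming that a whole-line replacement by "#" followed by the grep step amounts to deleting the line. It also uses cat -s instead of tr -s, so that one blank line survives between blocks.

```shell
# Hypothetical helper, not part of xml2txt.sh: reads XML on stdin and
# writes roughly de-tagged text on stdout, using sed instead of replaceit.
xml2txt_sed() {
    # 1. Delete whole lines that mention purely structural tags.
    # 2. Replace opening and closing <para>/<screen> tags with markers.
    # 3. Turn <section> boundaries into plain words.
    # 4. Squeeze runs of blank lines down to a single blank line.
    sed -e '/listitem/d' \
        -e '/varlistentry/d' \
        -e '/itemizedlist/d' \
        -e '/variablelist/d' \
        -e 's|</\{0,1\}para>| |g' \
        -e 's|</\{0,1\}screen>| -=- |g' \
        -e 's|<section>|Section:|g' \
        -e 's|</section>|End of section|g' \
      | cat -s
}
```

It could be used from a directory of xml files like this: cat *.xml | xml2txt_sed > out.txt (out.txt being any name you like). Tags that are not in the list pass through untouched, so more -e lines would be needed for real documents.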