Simple converter for .doc/.odt to HTML?

Makoto
Posts: 1665
Joined: Fri 04 Sep 2009, 01:30
Location: Out wandering... maybe.

#1 Post by Makoto »

Is there anything simple that will let me convert document files (mostly .doc and .odt, I guess), with styles etc., to an HTML file? I can use (and have used) OpenOffice/LibreOffice, but that mirrors the MS Office way of creating an HTML file from a document, and adds a LOT of unnecessary code overhead to the resulting HTML file. :evil:

SeaMonkey's Composer won't directly open the above document file types. I can use it (among other HTML editors, of course) to try to strip out what the Office programs have done to the text... but that's a massive undertaking. :mrgreen: (Though, if there's an automatic way to 'optimize' the HTML page in SeaMonkey, I wouldn't mind. :D)
[ Puppy 4.3.1 JP, Frugal install ] * [ XenialPup 7.5, Frugal install ] * [XenialPup 64 7.5, Frugal install] * [ 4GB RAM | 512MB swap ]
In memory of our beloved American Eskimo puppy (1995-2010) and black Lab puppy (1997-2011).

don570
Posts: 5528
Joined: Wed 10 Mar 2010, 19:58
Location: Ontario

#2 Post by don570 »

There is a nicely written word processor that opens Microsoft documents and saves to various formats, including HTML.

SoftMaker 2012 beta is the latest version. It's a commercial product, but there's a free trial so you can find out if it's good enough.

More info here:

http://murga-linux.com/puppy/viewtopic. ... 950#647950

__________________________________________

Makoto
Posts: 1665
Joined: Fri 04 Sep 2009, 01:30
Location: Out wandering... maybe.

#3 Post by Makoto »

Yeah, but I'm a little reluctant to install another office suite just for that one 'simple' feature. :)

I don't know why MS Office/Word and OpenOffice/LibreOffice feel they have to add that much code even for a simple page of text. I did just that recently: I had a monospace font set for the whole document, and the HTML page OpenOffice produced redefined the font on every single line of text. :roll: Among other things, of course.

...Then again, I experimented with doing it in AbiWord. Not only did I lose some of the formatting, but it also insisted on adding CSS style rules to the document. It's just plain text, with the occasional italicized, bolded and maybe underlined word. That's all. There's no real need for a stylesheet, is there? (No, really. I'm not sure.)

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#4 Post by technosaurus »

Abiword
Check out my github repositories (https://github.com/technosaurus). I may eventually get around to updating my blogspot (http://bashismal.blogspot.com).

Makoto
Posts: 1665
Joined: Fri 04 Sep 2009, 01:30
Location: Out wandering... maybe.

#5 Post by Makoto »

I did try AbiWord, as I mentioned above. Not only did it insist on adding CSS to the document, it still generated an HTML page around the same size as the versions Word and OpenOffice created. :|

All of them turned a roughly 66k (7-bit) text document into a roughly 166k HTML file. Text should not need 100k of markup. :mrgreen:
(I used to do it manually, so I should know. :P)
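
To put a number on that: 166k minus 66k leaves about 100k of markup, which is roughly 60% of the HTML file. A minimal sketch for checking any source/export pair is below. The function name markup_overhead and the file names in the usage line are made up for illustration, and it simply assumes the text itself survives unchanged in the export.

```shell
#!/bin/sh
# markup_overhead SOURCE HTML
# Hypothetical helper: reports how many bytes of the HTML file are
# not accounted for by the size of the original text.
markup_overhead() {
    src_size=$(($(wc -c < "$1")))    # $(( )) strips padding some wc versions add
    html_size=$(($(wc -c < "$2")))
    if [ "$html_size" -eq 0 ]; then
        echo "empty HTML file: $2" >&2
        return 1
    fi
    overhead=$((html_size - src_size))
    percent=$((overhead * 100 / html_size))    # integer division, rounds down
    echo "markup overhead: $overhead bytes ($percent% of the HTML file)"
}
```

Usage would be something like `markup_overhead story.txt story.html`, run once per exporter to compare them.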

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#6 Post by technosaurus »

That sounds about right, actually, if you want to preserve formatting... they have to cover cases that aren't as simple as yours. If you don't care about preserving formatting at all, convert to text and then:

Code:

echo '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
	<title></title>
</head>
<body>
<pre>' > file.html

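# Escape &, < and > first so the text is not parsed as markup inside <pre>
sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' file.txt >>file.html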

echo '</pre>
</body>
</html>' >>file.html

Makoto
Posts: 1665
Joined: Fri 04 Sep 2009, 01:30
Location: Out wandering... maybe.

#7 Post by Makoto »

I know, but there's usually something about the generated HTML that just seems... weird: much more of it than there probably needs to be. Like OpenOffice's insistence on restating the font on every single line of text (sure, I set a monospace font for the entire document, but does it really need to be restated on every line?). Or an earlier version of MS Word insisting on tokenizing practically everything. :)

Of course, I'll be the first to admit I'm not any sort of expert on HTML. :mrgreen:

technosaurus
Posts: 4853
Joined: Mon 19 May 2008, 01:24
Location: Blue Springs, MO

#8 Post by technosaurus »

Still sticking by my original suggestion. I just tested abiword-2.8.6 in Wary 5.3 on /usr/share/examples/test.doc ... just uncheck all of the boxes when you save as HTML. It actually reduced the total size fourfold and looks acceptable.
(BTW, AbiWord does have a command-line interface that you can use for batch processing.)
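
A rough sketch of that batch route, assuming AbiWord's --to=html command-line switch, which should write the converted file next to the original with an .html extension. The function name convert_docs and the CONVERTER override are invented here for illustration. The export presumably uses AbiWord's default HTML options rather than the unchecked boxes from the save dialog, so check the output size before relying on it.

```shell
#!/bin/sh
# convert_docs FILE...
# Hypothetical helper: runs AbiWord's command-line conversion on each file.
# CONVERTER is an assumed override (useful for testing); default is abiword.
convert_docs() {
    failed=0
    for doc in "$@"; do
        if [ ! -f "$doc" ]; then
            echo "skipping $doc: not a regular file" >&2
            failed=1
            continue
        fi
        # --to=html asks AbiWord to convert without opening its window
        "${CONVERTER:-abiword}" --to=html "$doc" || {
            echo "conversion failed: $doc" >&2
            failed=1
        }
    done
    return "$failed"    # 0 only if every file was converted
}
```

Usage would be something like `convert_docs *.doc *.odt` from the directory holding the documents.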

Makoto
Posts: 1665
Joined: Fri 04 Sep 2009, 01:30
Location: Out wandering... maybe.

#9 Post by Makoto »

AbiWord eats (doesn't support) some of the simple formatting elements I use, though, like horizontal lines. They disappear from the document when I load it... and, of course, aren't carried into the resulting HTML. :(

(That's aside from the fact that AbiWord usually behaves rather badly for me. I'm surprised I managed to get it to export a document to an HTML page without something bad happening, aside from the missing elements.)

Hmm... wonder how much of a dent HTML Tidy might make in it? :|
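
One way to find out, as a sketch only: it assumes HTML Tidy is installed as tidy, and uses its -clean option (replace presentational tags such as FONT with style rules) and its word-2000 setting (strip the extra markup Word adds when saving as a web page). The function name tidy_export and the TIDY override are invented for illustration. Note that -clean works by generating CSS, so it may not suit anyone who wants no stylesheet at all.

```shell
#!/bin/sh
# tidy_export INPUT.html OUTPUT.html
# Hypothetical helper: cleans an office-suite HTML export with HTML Tidy
# and prints the before/after sizes. TIDY is an assumed override.
tidy_export() {
    status=0
    "${TIDY:-tidy}" -q -clean --word-2000 yes -o "$2" "$1" || status=$?
    # tidy exits with 1 when it only printed warnings, 2 on errors
    if [ "$status" -gt 1 ]; then
        echo "tidy failed on $1 (exit $status)" >&2
        return "$status"
    fi
    echo "$1: $(($(wc -c < "$1"))) bytes -> $2: $(($(wc -c < "$2"))) bytes"
}
```

Usage would be something like `tidy_export export.html cleaned.html`; the original file is left untouched so the two can be compared in a browser.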