Passing grep to sed with .doc files

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#1 Post by slavvo67 »

The line below works with every other file-type I tried except for .doc files. While I'm probably one of the few that still occasionally uses that format, I'm trying to get this one right. Anyone have suggestions?

grep doc file.txt | sed -i 's/doc.*/doc/' file.txt

The above is searching file.txt for any line containing doc and passing to sed to remove anything after the .doc extension. Works as is with .pdf, .docx, .odt and many other formats. Doesn't work for .doc files. Does anyone know why and can anyone suggest a possible fix? Quoting didn't help, either.


Thanks,

Slavvo67

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#2 Post by musher0 »

Hi slavvo67.

I am not sure what you are talking about.

Are you talking about converting a *.doc file into a *.txt file?
Any word processor can do that, no?

If you are talking about replacing occurrences of ".doc" with ".txt" in a
specific document, replaceit can do that in a cinch using this one-liner:

Code:

replaceit --input=my-file ".doc" ".txt"
Replaceit also has a few parms you can use if you want to get really precise.
Such as: change .doc to .txt when word or syllable X is before the
occurrence, but not after; etc.
Just type < replaceit > in a console to see those parms.

Replaceit is in all recent Puppies, thanks to my insistence some time ago.
For older Puppies, the source code is available here.
I have also provided a ready-made pet archive of it, here.

IHTH. TWYL
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)

Keef
Posts: 987
Joined: Thu 20 Dec 2007, 22:12
Location: Staffordshire

#3 Post by Keef »

I understood it as removing any text on a line after ".doc".
This does work for me, if that is what is intended.
eg:

Code:

aaa bbb ccc
aaa bbb.doc ccc
.doc 1234 qwerty
ddd eee fff eee.doc ggg
ggg hhh
becomes:

Code:

aaa bbb ccc
aaa bbb.doc
.doc
ddd eee fff eee.doc
ggg hhh
"Works as is with .pdf, .docx, .odt and many other formats"
is a little confusing, as I assume it is just dealing with a text string. It could be "*.anything". But maybe I am missing something.

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#4 Post by slavvo67 »

Lemme start again.

I have a text file with a list of different documents. But it's extracted from somewhere else, so it's not clean.

So maybe a few lines read:

/root/my-documents/musher.docTTYLXXXX ======= whatever after.
/root/my-documents/Instructions.pdfYYTEK ------- whatever after.

For all file extensions so far except .doc, the example in my first post works, dropping everything after the extension name. I end up with /root/my-documents/Instructions.pdf for the .pdf file, without the "YYTEK ------- whatever after." that followed the .pdf. This isn't working for me with the .doc extension. Not sure if doc has another meaning in bash or something. Strangely, even .docx works; just not .doc.

6502coder
Posts: 677
Joined: Mon 23 Mar 2009, 18:07
Location: Western United States

#5 Post by 6502coder »

I'm sure it can't be something as silly as this, but...

If indeed your pathnames are starting out with "/root/my-documents/...",
then, well...
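
To spell out the hint, here is a minimal sketch using the two sample lines from post #4 (the variable names are only for illustration). Without a leading dot, the pattern "doc" first matches inside "my-documents", so sed cuts the line there. "pdf" and "docx" never occur earlier in the path, which would explain why those extensions appeared to work.

```shell
# Sample line in the style of post #4.
doc_line='/root/my-documents/musher.docTTYLXXXX ======= whatever after.'

# Original pattern: the first "doc" on the line is inside "my-documents".
broken=$(printf '%s\n' "$doc_line" | sed 's/doc.*/doc/')

# Same idea for .pdf: "pdf" occurs only in the extension, so it works.
pdf_line='/root/my-documents/Instructions.pdfYYTEK ------- whatever after.'
ok=$(printf '%s\n' "$pdf_line" | sed 's/pdf.*/pdf/')

printf '%s\n' "$broken" "$ok"
# prints:
# /root/my-doc
# /root/my-documents/Instructions.pdf
```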

Keef
Posts: 987
Joined: Thu 20 Dec 2007, 22:12
Location: Staffordshire

#6 Post by Keef »

Try

Code:

grep doc file.txt | sed -i 's/\.doc.*/.doc/' file.txt
Seems to be working.
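
A small check of the escaped pattern, run on a pipe rather than with -i so that no file is modified. The sample lines are assumptions modelled on post #4. Note the second case: because "\.doc.*" also matches ".docx", a .docx line ends up shortened to .doc.

```shell
# \. matches a literal dot, so "doc" inside "my-documents" no longer matches.
doc_line='/root/my-documents/musher.docTTYLXXXX ======= whatever after.'
fixed=$(printf '%s\n' "$doc_line" | sed 's/\.doc.*/.doc/')

# Caveat: the same command also rewrites .docx lines to .doc.
docx_line='/root/my-documents/report.docxZZZ'
clipped=$(printf '%s\n' "$docx_line" | sed 's/\.doc.*/.doc/')

printf '%s\n' "$fixed" "$clipped"
# prints:
# /root/my-documents/musher.doc
# /root/my-documents/report.doc
```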

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#7 Post by MochiMoppel »

slavvo67 wrote: the example in my first post works, dropping everything after the extension name.
Yeah, but you need separate sed statements for every possible extension.

This should also work. Cleans lines with extension .doc .DOC .pdf .PdF (in other words: case insensitive)

Code:

sed -i -r 's/^(.*\.(doc|pdf)).*$/\1/I' file.txt
No need for grep anywhere as sed gets input from file.txt and not from the piped grep output.
But: if your "whatever after" contains .doc we will have to think again ...
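
A sketch of that command on a throwaway file, including the "think again" case. The sample lines are assumptions, -E is used in place of -r (GNU sed accepts both), and the I flag is a GNU extension. The output is captured instead of using -i, so nothing is edited in place.

```shell
# Temporary sample list (hypothetical contents, modelled on post #4).
list=$(mktemp)
printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/Instructions.PDFyytek ------- whatever after.' \
    '/root/my-documents/a.docNOTE see b.doc for details' > "$list"

# No grep needed: sed reads the file operand directly.
cleaned=$(sed -E 's/^(.*\.(doc|pdf)).*$/\1/I' "$list")
rm -f "$list"

printf '%s\n' "$cleaned"
# prints:
# /root/my-documents/musher.doc
# /root/my-documents/Instructions.PDF
# /root/my-documents/a.docNOTE see b.doc
```

The third line shows the caveat: the leading ".*" is greedy, so the cut happens after the last ".doc" on the line, not the first.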

Last edited by MochiMoppel on Sun 12 Aug 2018, 00:29, edited 3 times in total.

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#8 Post by musher0 »

Hello slavvo67.

You can do it without grep or sed or awk,
using only bash's string handling capacity:

Code:

#!/bin/ash
# /root/my-applications/bin/afterDOC.sh
# For slavvo67
####
> new.lst # We create an empty destination list.
while IFS= read -r line; do # We read each line verbatim and dissect it.
	A="${line%%.*}" # Sub-string before the first dot.
	B="${line#*.}" # Sub-string after the first dot.
	B="${B:0:3}" # We keep the first 3 characters of $B.
	echo "$A.$B" >> new.lst # We rebuild the list.
done < /root/my-documents/document.lst # Originating list
more new.lst # We show the resulting list.
document.lst in /root/my-documents being:
/root/my-documents/musher.docTTYLXXXX
/root/my-documents/slavvo67.docYYTEK
/root/my-documents/BarryK.docBLABLEBLI
/root/my-documents/01micko.docTRALALA
/root/my-documents/OldCarExample.docBroomBROOM
Shown result will be:
/root/my-documents/musher.doc
/root/my-documents/slavvo67.doc
/root/my-documents/BarryK.doc
/root/my-documents/01micko.doc
/root/my-documents/OldCarExample.doc
It will work whatever the file extension, too, provided the file extension is
3 characters long.
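
A sketch of how these expansions behave when a line contains more than one dot, as in the sample from post #4 with its trailing period. "%" removes the shortest match from the end and "#" the shortest from the start, so the two can split at different dots; "%%" removes the longest match from the end and therefore splits at the first dot, the same place as "#*.". The variable names are illustrative, and printf '%.3s' is used here as a stand-in for ${B:0:3}, which plain POSIX sh does not offer.

```shell
line='/root/my-documents/musher.docTTYLXXXX ======= whatever after.'

B=${line#*.}                  # text after the FIRST dot
B=$(printf '%.3s' "$B")       # keep its first three characters

mixed="${line%.*}.$B"         # % cuts at the LAST dot (the trailing period)
same_dot="${line%%.*}.$B"     # %% cuts at the FIRST dot, matching B

printf '%s\n' "$mixed" "$same_dot"
# prints:
# /root/my-documents/musher.docTTYLXXXX ======= whatever after.doc
# /root/my-documents/musher.doc
```

Either way, a dot earlier in the path (a hidden directory, for example) would still confuse this approach.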

IHTH.
musher0

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#9 Post by slavvo67 »

Mochi's solution worked. Musher0, the problem with your solution is that there are at least 3 popular file formats that go out 4 characters. I'm still going to try yours and Keef's, as well.

Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats. I'll have to play a bit.....

Thank you all for your help on this!

musher0
Posts: 14629
Joined: Mon 05 Jan 2009, 00:54
Location: Gatineau (Qc), Canada

#10 Post by musher0 »

Slavvo67?

You're changing the question! The .doc extension stated in the title of this
thread has a dot + three characters.

Mochi's solution also applies to extensions with a dot + three characters.

If you wanted a solution for extensions with a dot plus 3 OR 4 characters, you
should have said so at the start. It's not that difficult to come up with a script for
that, but we need to know in advance.

But whatever... You're on your own, now, man.
musher0

MochiMoppel
Posts: 2084
Joined: Wed 26 Jan 2011, 09:06
Location: Japan

#11 Post by MochiMoppel »

slavvo67 wrote: Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats
I doubt that mixing both approaches is necessary.
If you know all the possible file extensions, then you can use my proposal and clean the file with a single command. Length of extensions doesn't matter.
If you need more flexibility and instead want to strip any extension regardless of its length, then you definitely have to provide more information about the "whatever". If paths are always lowercase and the following "whatever" always starts with an uppercase letter, as in your example, then such a catch-all solution would be possible, but your information is a bit sparse.
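
Under exactly that assumption (extension made of lowercase letters or digits, trailing junk starting with an uppercase letter), a catch-all might look like the sketch below. The assumption is read off the two sample lines in post #4 and may well not hold for the real file; strip_junk is a made-up helper name.

```shell
# Assumption: extension = lowercase letters/digits, junk starts with A-Z.
# LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
strip_junk() {
    LC_ALL=C sed -E 's/^(.*\.[a-z0-9]+)[A-Z].*$/\1/'
}

cleaned=$(printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/Instructions.pdfYYTEK ------- whatever after.' \
    '/root/my-documents/report.docxZZZ trailing text' | strip_junk)

printf '%s\n' "$cleaned"
# prints:
# /root/my-documents/musher.doc
# /root/my-documents/Instructions.pdf
# /root/my-documents/report.docx
```

Lines that do not fit the assumption are passed through unchanged, since the substitution simply does not match them.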

Still I wonder how you came up with such a messy file.txt in the first place. Could it be that your problem started much earlier?

slavvo67
Posts: 1610
Joined: Sat 13 Oct 2012, 02:07
Location: The other Mr. 305

#12 Post by slavvo67 »

Actually, the combo worked well but Musher0's example passes a bunch of errors to the terminal. I agree that it's much better to hard code the extension. Problem is, any extension can be used. So, maybe better to hard code the few 4 character extensions and let the errors roll. The combo approach actually worked, though. So grab the 4 character extensions with Mochi's example and pass the rest through Musher0's 3 character example. I am mobile but will show u what I'm doing ....
