Passing grep to sed with .doc files
The line below works with every other file-type I tried except for .doc files. While I'm probably one of the few that still occasionally uses that format, I'm trying to get this one right. Anyone have suggestions?
grep doc file.txt | sed -i 's/doc.*/doc/' file.txt
The above is searching file.txt for any line containing doc and passing to sed to remove anything after the .doc extension. Works as is with .pdf, .docx, .odt and many other formats. Doesn't work for .doc files. Does anyone know why and can anyone suggest a possible fix? Quoting didn't help, either.
Thanks,
Slavvo67
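A side note on the command itself: when sed is given a file name together with -i, it edits that file in place and does not read standard input, so the grep stage has no effect on the result. The sketch below assumes GNU sed; the scratch file and its two lines are made up for illustration. grep selects only the "doc" line, yet the .pdf line is edited anyway.

```shell
# Hypothetical scratch file, not part of the original thread.
tmp=$(mktemp)
printf '%s\n' 'notes.pdfJUNK' 'letter.docJUNK' > "$tmp"

# grep passes along only the "doc" line, but sed -i opens the file itself
# and never reads the pipe, so the .pdf line is the one that gets edited.
grep doc "$tmp" | sed -i 's/pdf.*/pdf/' "$tmp"

result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

In other words, the original line should behave the same as a plain sed -i 's/doc.*/doc/' file.txt.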
Hi slavvo67.
I am not sure what you are talking about.
Are you talking about converting a *.doc file into a *.txt file?
Any word processor can do that, no?
If you are talking about replacing occurrences of ".doc" with ".txt" in a
specific document, replaceit can do that in a cinch using this one-liner:
Code: Select all
replaceit --input=my-file ".doc" ".txt"
Replaceit also has a few parms you can use if you want to get really precise.
Such as: change .doc to .txt when word or syllable X is before the
occurrence, but not after; etc.
Just type < replaceit > in a console to see those parms.
Replaceit is in all recent Puppies, thanks to my insistence some time ago.
For older Puppies, the source code is available here.
I have also provided a ready-made pet archive of it, here.
IHTH. TWYL
musher0
~~~~~~~~~~
"You want it darker? We kill the flame." (L. Cohen)
I understood it as removing any text on a line after ".doc".
This does work for me, if that is what is intended.
eg:
Code: Select all
aaa bbb ccc
aaa bbb.doc ccc
.doc 1234 qwerty
ddd eee fff eee.doc ggg
ggg hhh
becomes:
Code: Select all
aaa bbb ccc
aaa bbb.doc
.doc
ddd eee fff eee.doc
ggg hhh
"Works as is with .pdf, .docx, .odt and many other formats" is a little confusing, as I assume it is just dealing with a text string, as I see it. Could be "*.anything". But maybe I am missing something.
Lemme start again.
I have a text file with a list of different documents. But it's extracted from somewhere else, so it's not clean.
So maybe a few lines read:
/root/my-documents/musher.docTTYLXXXX ======= whatever after.
/root/my-documents/Instructions.pdfYYTEK ------- whatever after.
With every file extension so far except .doc, the example in my first post works, dropping everything after the extension name. I end up with /root/my-documents/Instructions.pdf for the .pdf file, without the YYTEK ------- whatever after the .pdf. This isn't working for me with the .doc extension. Not sure if doc has another meaning in bash or something. Strangely, even .docx works; just not .doc.
Try
Code: Select all
grep doc file.txt | sed -i 's/\.doc.*/.doc/' file.txt
Seems to be working.
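One likely reason the escaped dot helps, judging from the sample lines posted earlier: the path /root/my-documents/ itself contains the letters "doc", and sed replaces starting at the leftmost match. A minimal sketch using one of those sample lines, printing to the terminal instead of editing a file:

```shell
line='/root/my-documents/musher.docTTYLXXXX ======= whatever after.'

# Unescaped pattern: the first "doc" sits inside "my-documents",
# so everything after it is cut off.
bare=$(printf '%s\n' "$line" | sed 's/doc.*/doc/')

# Escaped dot: only a literal ".doc" can match.
dotted=$(printf '%s\n' "$line" | sed 's/\.doc.*/.doc/')

echo "$bare"     # /root/my-doc
echo "$dotted"   # /root/my-documents/musher.doc
```

This would also explain why .docx worked with the unescaped pattern: the string "docx" does not occur in "my-documents".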
- MochiMoppel
- Posts: 2084
- Joined: Wed 26 Jan 2011, 09:06
- Location: Japan
slavvo67 wrote: the example in my first post works, dropping everything after the extension name.
Yeah, but you need separate sed statements for every possible extension.
This should also work. It cleans lines with the extensions .doc .DOC .pdf .PdF (in other words: case insensitive).
Code: Select all
sed -i -r 's/^(.*\.(doc|pdf)).*$/\1/I' file.txt
But: if your "whatever after" contains .doc we will have to think again ...
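To illustrate that caveat: the leading .* is greedy, so the capture runs up to the last .doc or .pdf on the line. The sketch below uses a made-up line. The second command is only one possible workaround, and it assumes that the extension dot is the first dot on the line.

```shell
# Made-up line whose trailing junk mentions another .doc file.
line='/root/my-documents/musher.docXYZ copy of old.doc here'

# Greedy .* keeps everything up to the LAST .doc on the line.
greedy=$(printf '%s\n' "$line" | sed -r 's/^(.*\.(doc|pdf)).*$/\1/I')

# Assumption: no dot occurs before the extension dot.
# [^.]* cannot cross a dot, so the match stops at the FIRST one.
first=$(printf '%s\n' "$line" | sed -r 's/^([^.]*\.(doc|pdf)).*$/\1/I')

echo "$greedy"   # /root/my-documents/musher.docXYZ copy of old.doc
echo "$first"    # /root/my-documents/musher.doc
```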
Last edited by MochiMoppel on Sun 12 Aug 2018, 00:29, edited 3 times in total.
Hello slavvo67.
You can do it without grep or sed or awk,
using only bash's string handling capacity:
Code: Select all
#!/bin/ash
# /root/my-applications/bin/afterDOC.sh
# For slavvo67
####
> new.lst # We create a destination list.
while read -r line;do # We read per line and dissect it.
A="${line%%.*}" # Sub-string before the first dot.
B="${line#*.}" # Sub-string after the first dot.
B="${B:0:3}" # We keep the first 3 characters of $B.
echo "$A.$B" >> new.lst # We rebuild the list.
done < /root/my-documents/document.lst # Originating list
more new.lst # We show the resulting list.
document.lst in /root/my-documents being:
/root/my-documents/musher.docTTYLXXXX
/root/my-documents/slavvo67.docYYTEK
/root/my-documents/BarryK.docBLABLEBLI
/root/my-documents/01micko.docTRALALA
/root/my-documents/OldCarExample.docBroomBROOM
Shown result will be:
/root/my-documents/musher.doc
/root/my-documents/slavvo67.doc
/root/my-documents/BarryK.doc
/root/my-documents/01micko.doc
/root/my-documents/OldCarExample.doc
It will work whatever the file extension, too, provided the file extension is
3 characters long.
IHTH.
musher0
Mochi's solution worked. Musher0, the problem with your solution is that there are at least 3 popular file formats that go out 4 characters. I'm still going to try yours and Keef's, as well.
Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats. I'll have to play a bit.....
Thank you all for your help on this!
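To make the four-character problem concrete, here is a small sketch of the same "keep three characters after the dot" idea applied to two made-up file names. The function name truncate3 is hypothetical, and printf '%.3s' stands in for ${B:0:3} so that the sketch also runs in a plain POSIX sh.

```shell
# Keep the text before the first dot plus the first 3 characters after it.
truncate3() {
    A="${1%%.*}"                    # Sub-string before the first dot.
    B="${1#*.}"                     # Sub-string after the first dot.
    printf '%s.%.3s\n' "$A" "$B"    # %.3s prints at most 3 characters of $B.
}

# Made-up names with four-character extensions.
truncate3 '/root/my-documents/report.docxJUNK'   # /root/my-documents/report.doc
truncate3 '/root/my-documents/photo.jpegJUNK'    # /root/my-documents/photo.jpe
```

Both extensions lose their fourth character, which is the limitation described above.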
Slavvo67?
You're changing the question! The .doc extension stated in the title of this
thread has a dot + three characters.
Mochi's solution also applies to extensions with a dot + three characters.
If you wanted a solution for extensions with a dot plus 3 OR 4 characters, you
should have said so at the start. It's not that difficult to come up with a script for
that, but we need to know in advance.
But whatever... You're on your own, now, man.
musher0
- MochiMoppel
slavvo67 wrote: Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats
I doubt that mixing both approaches is necessary.
If you know all the possible file extensions, then you can use my proposal and clean the file with a single command. Length of extensions doesn't matter.
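For a known set of extensions, that single command might look like the sketch below. The extension list is only an example, the function name clean_known is hypothetical, and docx is listed before doc as a precaution so that the longer name is preferred.

```shell
clean_known() {
    # Example list only; extend it with whatever extensions the file contains.
    sed -r 's/^(.*\.(docx|doc|pdf|odt)).*$/\1/I'
}

printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/Instructions.pdfYYTEK ------- whatever after.' \
    '/root/my-documents/report.docxABC' | clean_known
# /root/my-documents/musher.doc
# /root/my-documents/Instructions.pdf
# /root/my-documents/report.docx
```

To edit the list in place, the same expression can be given to sed -i -r with file.txt as its argument.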
If you need more flexibility and instead want to strip any extension regardless of its length, then you definitely have to provide more information about the "whatever". If paths are always lowercase and the following "whatever" always starts uppercase, as in your example, then such a catch-all solution would be possible, but your information is a bit sparse.
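Under exactly those assumptions (a lowercase extension, junk that starts with an uppercase letter, and no dot earlier in the path), a catch-all might be sketched as follows. The assumptions are drawn from the two sample lines only and may not hold for the real file; the function name clean_any is hypothetical.

```shell
clean_any() {
    # [^.]* runs up to the first dot; [[:lower:][:digit:]]+ then takes the
    # run of lowercase letters and digits after it, whatever its length.
    # No I flag here, because the match has to be case sensitive.
    sed -r 's/^([^.]*\.[[:lower:][:digit:]]+).*$/\1/'
}

printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/Instructions.pdfYYTEK ------- whatever after.' | clean_any
# /root/my-documents/musher.doc
# /root/my-documents/Instructions.pdf
```

Lines without a dot, or whose first dot is not followed by a lowercase letter or digit, are left unchanged.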
Still I wonder how you came up with such a messy file.txt in the first place. Could it be that your problem started much earlier?
Actually, the combo worked well but Musher0's example passes a bunch of errors to the terminal. I agree that it's much better to hard code the extension. Problem is, any extension can be used. So, maybe better to hard code the few 4 character extensions and let the errors roll. The combo approach actually worked, though. So grab the 4 character extensions with Mochi's example and pass the rest through Musher0's 3 character example. I am mobile but will show u what I'm doing ....