Puppy Linux Discussion Forum
Passing grep to sed with .doc files
slavvo67

Joined: 12 Oct 2012
Posts: 1570
Location: The other Mr. 305

Posted: Sat 11 Aug 2018, 12:08    Post subject: Passing grep to sed with .doc files

The line below works with every other file-type I tried except for .doc files. While I'm probably one of the few that still occasionally uses that format, I'm trying to get this one right. Anyone have suggestions?

grep doc file.txt | sed -i 's/doc.*/doc/' file.txt

The above is searching file.txt for any line containing doc and passing to sed to remove anything after the .doc extension. Works as is with .pdf, .docx, .odt and many other formats. Doesn't work for .doc files. Does anyone know why and can anyone suggest a possible fix? Quoting didn't help, either.


Thanks,

Slavvo67
musher0

Joined: 04 Jan 2009
Posts: 12816
Location: Gatineau (Qc), Canada

Posted: Sat 11 Aug 2018, 13:34

Hi slavvo67.

I am not sure what you are talking about.

Are you talking about converting a *.doc file into a *.txt file?
Any word processor can do that, no?

If you are talking about replacing occurrences of ".doc" with ".txt" in a
specific document, replaceit can do that in a cinch using this one-liner:
Code:
replaceit --input=my-file ".doc" ".txt"
Replaceit also has a few parms you can use if you want to get really precise.
Such as: change .doc to .txt when word or syllable X is before the
occurrence, but not after; etc.
Just type < replaceit > in a console to see those parms.

Replaceit is in all recent Puppies, thanks to my insistence some time ago.
For older Puppies, the source code is available here.
I have also provided a ready-made pet archive of it, here.

IHTH. TWYL

_________________
musher0
~~~~~~~~~~
Fidèle elle commença, ainsi elle restera. (Prov. canadien) /
Faithful she began, so will she stay. (Canadian prov.)
Keef


Joined: 20 Dec 2007
Posts: 918
Location: Staffordshire

Posted: Sat 11 Aug 2018, 14:18

I understood it as removing any text on a line after ".doc".
This does work for me, if that is what is intended.
eg:

Code:
aaa bbb ccc
aaa bbb.doc ccc
.doc 1234 qwerty
ddd eee fff eee.doc ggg
ggg hhh


becomes:
Code:
aaa bbb ccc
aaa bbb.doc
.doc
ddd eee fff eee.doc
ggg hhh


Quote:
Works as is with .pdf, .docx, .odt and many other formats
is a little confusing, as I assume it is just dealing with a text string. It could be "*.anything". But maybe I am missing something.
slavvo67

Joined: 12 Oct 2012
Posts: 1570
Location: The other Mr. 305

Posted: Sat 11 Aug 2018, 18:21

Lemme start again.

I have a text file with a list of different documents. But it's extracted from somewhere else, so it's not clean.

So maybe a few lines read:

/root/my-documents/musher.docTTYLXXXX ======= whatever after.
/root/my-documents/Instructions.pdfYYTEK ------- whatever after.

For every file extension so far except .doc, the example in my first post works, dropping everything after the extension name. For the .pdf file I end up with /root/my-documents/Instructions.pdf, without the "YYTEK ------- whatever after." that followed it. This isn't working for me with the .doc extension. Not sure if doc has another meaning in bash or something. Strangely, even .docx works; just not .doc.
6502coder


Joined: 23 Mar 2009
Posts: 478
Location: Western United States

Posted: Sat 11 Aug 2018, 19:07

I'm sure it can't be something as silly as this, but...

If indeed your pathnames are starting out with "/root/my-documents/...",
then, well...
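
Spelling the hint out: the pattern doc.* is not tied to the dot, so sed starts matching at the first "doc" on the line, and in these paths that is the one inside "my-documents". A minimal sketch, assuming the sample lines quoted earlier in this thread; strip_after is a hypothetical helper that reads stdin instead of editing file.txt in place:

```shell
#!/bin/sh
# Hypothetical helper: the substitution from the first post, with the
# extension passed as $1 and the input taken from stdin (no -i, no file.txt).
strip_after() {
    sed "s/$1.*/$1/"
}

# "doc" first occurs inside "my-documents", so the line is cut there.
printf '%s\n' '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' |
    strip_after doc    # prints /root/my-doc

# "pdf" first occurs in the extension, so this one happens to work.
printf '%s\n' '/root/my-documents/Instructions.pdfYYTEK ------- whatever after.' |
    strip_after pdf    # prints /root/my-documents/Instructions.pdf
```

That would explain why .pdf, .odt and even .docx came out fine: none of those strings appear earlier in the path.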
Keef


Joined: 20 Dec 2007
Posts: 918
Location: Staffordshire

Posted: Sat 11 Aug 2018, 19:24

Try
Code:

grep doc file.txt | sed -i 's/\.doc.*/.doc/' file.txt


Seems to be working.
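
Why the escaped dot helps: \.doc can only match a literal ".doc", so the "doc" inside "my-documents" is skipped. A small sketch of that, assuming the sample line from this thread plus a made-up report.docx entry; strip_after_doc is a hypothetical helper that reads stdin rather than editing file.txt:

```shell
#!/bin/sh
# Hypothetical helper: the substitution above, applied to stdin.
strip_after_doc() {
    sed 's/\.doc.*/.doc/'
}

printf '%s\n' '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' |
    strip_after_doc    # prints /root/my-documents/musher.doc

# Side effect: ".docx" also starts with ".doc", so it is cut down to ".doc".
printf '%s\n' '/root/my-documents/report.docxYYTEK' |
    strip_after_doc    # prints /root/my-documents/report.doc
```

So if the list mixes .doc and .docx entries, this pattern on its own would shorten the .docx names.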
MochiMoppel


Joined: 26 Jan 2011
Posts: 1662
Location: Japan

Posted: Sat 11 Aug 2018, 19:43

slavvo67 wrote:
the example in my first post works, dropping everything after the extension name.
Yeah, but you need separate sed statements for every possible extension.

This should also work. It cleans lines with the extensions .doc, .DOC, .pdf, .PdF (in other words: case-insensitive).
Code:
sed -i -r 's/^(.*\.(doc|pdf)).*$/\1/I' file.txt

No need for grep anywhere as sed gets input from file.txt and not from the piped grep output.
But: if your "whatever after" contains .doc we will have to think again ...
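
To illustrate both points, here is a sketch that runs the same expression on stdin, with no -i and no grep. The leading .* is greedy, so the match runs up to the last .doc or .pdf on the line, which is where the caveat comes from. clean_list and the notes.doc line are made up for the example; -r and the I flag are GNU sed extensions, so other sed versions may behave differently:

```shell
#!/bin/sh
# Hypothetical wrapper around the expression above; reads stdin.
clean_list() {
    sed -r 's/^(.*\.(doc|pdf)).*$/\1/I'
}

printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/Instructions.PDFYYTEK ------- whatever after.' \
    '/root/my-documents/notes.docSEE ALSO old.doc for details' |
    clean_list
# prints:
# /root/my-documents/musher.doc
# /root/my-documents/Instructions.PDF
# /root/my-documents/notes.docSEE ALSO old.doc
```

The third line shows the "think again" case: the trailing text contains another .doc, and the greedy match keeps everything up to it.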


Last edited by MochiMoppel on Sat 11 Aug 2018, 20:29; edited 3 times in total
musher0

Joined: 04 Jan 2009
Posts: 12816
Location: Gatineau (Qc), Canada

Posted: Sat 11 Aug 2018, 19:45

Hello slavvo67.

You can do it without grep or sed or awk,
using only bash's string handling capacity:
Code:
#!/bin/bash
# /root/my-applications/bin/afterDOC.sh
# For slavvo67
####
: > new.lst # We create (or empty) the destination list.
while IFS= read -r line; do # We read each line unmodified and dissect it.
   A="${line%%.*}" # Sub-string before the first dot.
   B="${line#*.}" # Sub-string after the first dot.
   B="${B:0:3}" # We keep the first 3 characters of $B.
   echo "$A.$B" >> new.lst  # We rebuild the list.
done < /root/my-documents/document.lst # Originating list
more new.lst # We show the resulting list.

document.lst in /root/my-documents being:
Quote:
/root/my-documents/musher.docTTYLXXXX
/root/my-documents/slavvo67.docYYTEK
/root/my-documents/BarryK.docBLABLEBLI
/root/my-documents/01micko.docTRALALA
/root/my-documents/OldCarExample.docBroomBROOM
Shown result will be:
Quote:
/root/my-documents/musher.doc
/root/my-documents/slavvo67.doc
/root/my-documents/BarryK.doc
/root/my-documents/01micko.doc
/root/my-documents/OldCarExample.doc

It will work whatever the file extension, too, provided the file extension is
3 characters long.

IHTH.

slavvo67

Joined: 12 Oct 2012
Posts: 1570
Location: The other Mr. 305

Posted: Tue 14 Aug 2018, 21:46

Mochi's solution worked. Musher0, the problem with your solution is that there are at least 3 popular file formats that go out 4 characters. I'm still going to try yours and Keef's, as well.

Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats. I'll have to play a bit.....

Thank you all for your help on this!
musher0

Joined: 04 Jan 2009
Posts: 12816
Location: Gatineau (Qc), Canada

Posted: Tue 14 Aug 2018, 22:35

Slavvo67?

You're changing the question! The .doc extension stated in the title of this
thread has a dot + three characters.

Mochi's solution also applies to extensions with a dot + three characters.

If you wanted a solution for extensions with a dot plus 3 OR 4 characters, you
should have said so at the start. It's not that difficult to come up with a script for
that, but we need to know in advance.

But whatever... You're on your own, now, man.

MochiMoppel


Joined: 26 Jan 2011
Posts: 1662
Location: Japan

Posted: Wed 15 Aug 2018, 00:46

slavvo67 wrote:
Maybe I'll do a combo, using yours for .### and Mochi's for hard coding the 4 character file formats

I doubt that mixing both approaches is necessary.
If you know all the possible file extensions, then you can use my proposal and clean the file with a single command. Length of extensions doesn't matter.
If you need more flexibility and instead want to strip everything after any extension, regardless of its length, then you definitely have to provide more information about the "whatever". If paths are always lowercase and the following "whatever" always starts with an uppercase letter, as in your example, then such a catch-all solution would be possible, but your information is a bit sparse.
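
A rough sketch of both options, to be taken with a grain of salt. The function names, the extension list and the report.docx sample are made up for illustration. The catch-all relies entirely on the assumption stated above (lowercase extension, trailing text starting with an uppercase letter), and both use GNU sed syntax:

```shell
#!/bin/sh
# Option 1: known extensions only. "docx" is listed before "doc" so that a
# .docx entry is not cut down to .doc. The extension list is an assumption.
clean_known() {
    sed -E 's/^(.*\.(docx|doc|pdf|odt)).*$/\1/I'
}

# Option 2: catch-all. ASSUMES the extension consists of lowercase letters or
# digits and the trailing text starts with an uppercase letter.
# LC_ALL=C keeps [a-z] and [A-Z] as plain ASCII ranges.
clean_any() {
    LC_ALL=C sed -E 's/^(.*\.[a-z0-9]+)[A-Z].*$/\1/'
}

printf '%s\n' \
    '/root/my-documents/musher.docTTYLXXXX ======= whatever after.' \
    '/root/my-documents/report.docxYYTEK' \
    '/root/my-documents/OldCarExample.docBroomBROOM' |
    clean_any
# prints:
# /root/my-documents/musher.doc
# /root/my-documents/report.docx
# /root/my-documents/OldCarExample.doc
```

Lines that do not fit the assumed shape are passed through unchanged by clean_any, so they can be spotted afterwards.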

Still I wonder how you came up with such a messy file.txt in the first place. Could it be that your problem started much earlier?
slavvo67

Joined: 12 Oct 2012
Posts: 1570
Location: The other Mr. 305

Posted: Wed 15 Aug 2018, 12:39

Actually, the combo worked well but Musher0's example passes a bunch of errors to the terminal. I agree that it's much better to hard code the extension. Problem is, any extension can be used. So, maybe better to hard code the few 4 character extensions and let the errors roll. The combo approach actually worked, though. So grab the 4 character extensions with Mochi's example and pass the rest through Musher0's 3 character example. I am mobile but will show u what I'm doing ....