Puppy Linux Discussion Forum
Puppy HOME page : puppylinux.com
"THE" alternative forum : puppylinux.info
 
All times are UTC - 4
 Forum index » Off-Topic Area » Programming
how do I remove duplicate words on a text list?[solved]
scsijon

Joined: 23 May 2007
Posts: 997
Location: the australian mallee

PostPosted: Fri 01 Mar 2013, 05:19    Post subject:  how do I remove duplicate words on a text list?[solved]  

I have a list of 24000+ 'words' in a text file, one per line.

However, 40% or more of them are duplicates and I'd like to delete them.

The 'words' can be basically anything that can be typed on a keyboard.

I have used 'return-newline' as the separator.

Does anyone have a simple script I can run to 'fix the problem'?

thanks

Last edited by scsijon on Fri 01 Mar 2013, 20:00; edited 1 time in total
SFR


Joined: 26 Oct 2011
Posts: 879

PostPosted: Fri 01 Mar 2013, 06:57    Post subject: Re: how do I remove duplicate words on a text list?  

Hey Scsijon

To make it clear - the list looks like this:
abc
def
abc
abc
blablabla
zxzx
something
blablabla
...


and should look like this:
abc
def
blablabla
zxzx
something
...


right?

If you don't mind the lines also being sorted:
Code:
sort -u input_file

But if that's a problem, here's a cute awk one-liner I just found on Stack Overflow:
Code:
awk '!_[$0]++' input_file
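
For anyone wondering how that one-liner works, here is a spelled-out sketch of the same idea. The array name `seen` and the function name `dedup_keep_order` are only illustrative choices, not part of the Stack Overflow answer. awk keeps a counter per line, and a line is printed only while its counter is still zero, i.e. the first time it shows up:

```shell
#!/bin/sh
# dedup_keep_order is a hypothetical helper name used for illustration.
# It prints each distinct line of the given file once, in original order.
dedup_keep_order() {
    awk '{
        # seen[$0] is unset (compares equal to 0) the first time a line appears
        if (seen[$0] == 0) print
        seen[$0]++
    }' "$1"
}

# Small demo on a throwaway file with the sample list from above
tmp=$(mktemp)
printf '%s\n' abc def abc abc blablabla zxzx something blablabla > "$tmp"
result=$(dedup_keep_order "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The short form `!_[$0]++` does the same thing in one expression: the array is simply named `_`, and the negated counter is used as the pattern that decides whether to print.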


Greetings!

_________________
[O]bdurate [R]ules [D]estroy [E]nthusiastic [R]ebels => [C]reative [H]umans [A]lways [O]pen [S]ource
Omnia mea mecum porto.
vovchik


Joined: 23 Oct 2006
Posts: 1285
Location: Ukraine

PostPosted: Fri 01 Mar 2013, 08:49    Post subject:  

Dear guys,

This is pretty easy too:

Code:

cat some.txt | sort | uniq


With kind regards,
vovchik
amigo

Joined: 02 Apr 2007
Posts: 2169

PostPosted: Fri 01 Mar 2013, 11:52    Post subject:  

I recently had the same problem and wrote something for it. I can't find it right now, so I've re-created it:
Code:
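#!/bin/bash
# uniq_no-sort
# print out unique lines, but without sorting them

FILE=$1

OUT="$FILE.uniq"
: > "$OUT"

# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r LINE ; do
   # -F: fixed string, -x: match the whole line, -q: quiet, test exit status only
   if ! grep -Fxq -- "$LINE" "$OUT" ; then
      printf '%s\n' "$LINE" >> "$OUT"
   fi
done < "$FILE"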
tallboy


Joined: 21 Sep 2010
Posts: 409
Location: Oslo, Norway

PostPosted: Fri 01 Mar 2013, 12:28    Post subject:  

Sorry, mistake, could not find a way to delete post.
_________________
True freedom is a live Puppy on a multisession CD/DVD.
GustavoYz


Joined: 07 Jul 2010
Posts: 886
Location: .ar

PostPosted: Fri 01 Mar 2013, 14:50    Post subject:  

vovchik wrote:

Code:

cat some.txt | sort | uniq


With kind regards,
vovchik


cat isn't really needed:
Code:
sort file.txt | uniq
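
A note on why the `sort` is there at all: `uniq` only collapses duplicate lines that sit next to each other, so unsorted input has to go through `sort` first. Here is a quick sketch on made-up sample data (the variable names are only for illustration), which also shows that `sort -u` gives the same result as `sort | uniq`:

```shell
#!/bin/sh
# Made-up sample list with non-adjacent duplicates
list=$(printf '%s\n' abc def abc zxzx def)

# uniq alone: nothing is removed, because no duplicates are adjacent
uniq_only=$(printf '%s\n' "$list" | uniq)

# sort first, then uniq: duplicates become adjacent and are collapsed
sort_uniq=$(printf '%s\n' "$list" | sort | uniq)

# sort -u does both steps in one command
sort_u=$(printf '%s\n' "$list" | sort -u)

printf 'uniq only:\n%s\n\n' "$uniq_only"
printf 'sort | uniq:\n%s\n\n' "$sort_uniq"
printf 'sort -u:\n%s\n' "$sort_u"
```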


scsijon

Joined: 23 May 2007
Posts: 997
Location: the australian mallee

PostPosted: Fri 01 Mar 2013, 18:17    Post subject:  

Sorry folks, I wish it was that simple.

I have already sorted the list; that was when I realized how many duplicates were in it.


consider this:


aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p

and on we go.

I want to remove all the duplicates found.

It's what happens when you need to rebuild a crashed component list from backups.
Keef


Joined: 20 Dec 2007
Posts: 562
Location: Staffordshire

PostPosted: Fri 01 Mar 2013, 19:15    Post subject:  

Code:

# cat list.txt
aaa
aba
ada
ad
aea
aea
aea
agd
agd
ased
ased
ased-ss
ased-ss<p
# sort list.txt | uniq
aaa
aba
ad
ada
aea
agd
ased
ased-ss
ased-ss<p
#

Seems to work for me....
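
The commands above only print the cleaned list to the terminal. As a rough sketch of how the result could be written back to the file, assuming a throwaway copy of the sample list (the variable names are made up for the example): `uniq -d` shows which entries were duplicated, and `sort -u -o` overwrites the file in place, since sort's `-o` option is allowed to name one of its own input files.

```shell
#!/bin/sh
# Throwaway copy of the sample list, so nothing real gets overwritten
list=$(mktemp)
printf '%s\n' aaa aba ada ad aea aea aea agd agd ased ased ased-ss 'ased-ss<p' > "$list"

before=$(wc -l < "$list")

# Show which lines occur more than once (uniq -d needs sorted input)
dupes=$(sort "$list" | uniq -d)

# Overwrite the file with its sorted, de-duplicated contents.
# Note: "sort -u file > file" would NOT work, because the shell
# truncates the file before sort gets to read it.
sort -u -o "$list" "$list"

after=$(wc -l < "$list")
echo "lines before: $((before + 0)), after: $((after + 0))"
echo "duplicated entries:"
echo "$dupes"
rm -f "$list"
```

On a real list it is probably wise to keep a backup copy before overwriting anything.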
scsijon

Joined: 23 May 2007
Posts: 997
Location: the australian mallee

PostPosted: Fri 01 Mar 2013, 19:59    Post subject:  

Sorry, vovchik, GustavoYz and Keef.

Yes, it does. I must have made a typo the first time I tried it.

thanks all