Remove Duplicate Rows From A Text File Using Powershell
secretGeek .:dot Nuts about dot Net:.

If your text file is already sorted... then removing duplicates is very easy.

PS:\> gc $filename | get-unique > $newfileName

(But remember, the Get-Unique command only works on sorted data!)
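To see why the sorted-data caveat matters, here is a small sketch that uses a made-up in-memory list instead of a file (the $lines and $result variables exist only for this illustration). Get-Unique compares each item with the one immediately before it, so a duplicate that is not adjacent slips through:

```powershell
# Hypothetical sample data: 'Apple' appears twice, but not on adjacent lines.
$lines = 'Apple', 'Dog', 'Dog', 'Apple'

# Get-Unique only collapses runs of identical neighbours.
$result = $lines | Get-Unique

$result   # Apple, Dog, Apple  (the second Apple survives)
```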

If the file's content is not sorted, and the final order of the lines is unimportant, then it's also easy....

Sort it -- and then use Get-Unique

gc $filename | sort | get-unique > $newfileName

(You now end up with a file that is sorted, and where every line is unique)

However... the case that one bumps into is always the tricky case...

If the file data is not 'sorted', but the order *is* important... then it's a little trickier. I've got an approach... let's turn it into a solution.

Remove Duplicate Rows From A Text File Using Powershell... unsorted file, where order is important

I'm going to add each line to a hash table.

But before adding it -- I'll check whether the line is already in the hash table.

If it's not there yet -- then I'll send that line into the new file. Here's an example:

PS H:\> $hash = @{}      # define a new empty hash table
PS H:\> gc c:\rawlist.txt | 
>> %{if($hash.$_ -eq $null) { $_ }; $hash.$_ = 1} > 
>> c:\newlist.txt

I test it out... given this input:

Apple
Dog
Dog
Carrot
Banana
Fun
Dog
Apple
Egg
Carrot
Egg

I get this output...

Apple
Dog
Carrot
Banana
Fun
Egg

Okay... I thought that was going to be really hard. Huh.

One more thing to do... see if I can comment this a little better...

PS H:\> $hash = @{}                 # Define an empty hashtable
PS H:\>  gc c:\rawlist.txt |        # Send the content of the file into the pipeline...
>>  % {                             # For each object in the pipeline...
>>                                      # note '%' is an alias of 'foreach-object'          
>>     if ($hash.$_ -eq $null) {    # if that line is not a key in our hashtable...
>>                                      # note -eq means 'equals'
>>                                      # note $_ means 'the data we got from the pipe'
>>                                      # note $null means NULL
>>         $_                       # ... send that line further along the pipe
>>     };
>>     $hash.$_ = 1                 # Add that line to the hash (so we won't send it again)
>>                                      # note that the value isn't important here,
>>                                      # only the key. ;-)
>>  } > c:\newlist.txt              # finally... redirect the pipe into a new file.
>>
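Two details of the hashtable version are worth knowing. A hashtable created with @{} compares its keys case-insensitively, so "Dog" and "dog" count as the same line. And because $hash.$_ goes through member-access syntax, a line whose text matches a hashtable property name (Count, for example) is, as far as I can tell, at risk of being skipped. Here is a minimal alternative sketch that sidesteps both, assuming PowerShell 5 or later for the ::new() syntax. The function name Select-FirstOccurrence and its -IgnoreCase switch are made up for this example; HashSet.Add returns $true only the first time it sees a value.

```powershell
function Select-FirstOccurrence {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line,          # one line from the pipeline
        [switch]$IgnoreCase     # hypothetical switch: treat 'Dog' and 'dog' as equal
    )
    begin {
        # Pick an explicit comparer so the case behaviour is a deliberate choice.
        if ($IgnoreCase) {
            $comparer = [System.StringComparer]::OrdinalIgnoreCase
        } else {
            $comparer = [System.StringComparer]::Ordinal
        }
        $seen = [System.Collections.Generic.HashSet[string]]::new($comparer)
    }
    process {
        # Add() returns $true only when the line was not already in the set.
        if ($seen.Add($Line)) { $Line }
    }
}

# The sample input from the article, as an in-memory list.
$sample = 'Apple', 'Dog', 'Dog', 'Carrot', 'Banana', 'Fun', 'Dog', 'Apple', 'Egg', 'Carrot', 'Egg'
$sample | Select-FirstOccurrence   # Apple, Dog, Carrot, Banana, Fun, Egg
```

Against a file, the usage would mirror the original: gc c:\rawlist.txt | Select-FirstOccurrence > c:\newlist.txt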




'Xaegr' on Mon, 16 Apr 2007 00:08:56 GMT, sez:

PS> measure-command { Get-Content tst.txt | Select-Object -Unique } | ... TotalMilliseconds
32,7035

PS> $hash=@{}; measure-command { Get-Content tst.txt | %{if($hash.$_ -eq $null){$_};$hash.$_=1} } | ... TotalMilliseconds
127,7818

where is the point?



'Xaegr' on Mon, 16 Apr 2007 00:10:45 GMT, sez:

Sorry, rechecked, values differ less. But select -unique faster anyway



'lb' on Mon, 16 Apr 2007 00:31:38 GMT, sez:

"Select-Object -unique"

ah, thanks Xaegr, i didn't know about that!

that looks better.



'frb' on Tue, 08 May 2007 17:50:16 GMT, sez:

"Sort -unique" is faster, and has the side effect of sorting your data.



'AP' on Wed, 23 May 2007 22:29:37 GMT, sez:

fails to work with something more complex than Apple orange da da da

<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />



'n00b' on Sun, 22 Mar 2009 11:24:10 GMT, sez:

I ran:
gc $filename | sort | get-unique > $newfileName
on a 7.5Mb text file (which is just a list - one entry per line), and out came a 13.9Mb version of said text file.

I figured it had something to do with the sort order, and ran it again. This time it came out to be 13.8Mb.

Scrolling through the list, I figured it out it was case sensitive. Is there any way to do a tolower() or something? And why was the file almost completely duplicated?
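On the two questions above: Get-Unique compares strings case-sensitively, while Sort-Object -Unique ignores case by default. As for the size, in Windows PowerShell the > redirect writes UTF-16, which takes two bytes for every plain ASCII character, so an ANSI or UTF-8 input file would roughly double in size even with duplicates removed. That is a likely explanation, not a confirmed one. A minimal sketch under those assumptions, using hypothetical temp-file names and made-up sample lines, with Set-Content and an explicit encoding in place of >:

```powershell
# Hypothetical sample files in the temp directory.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'dedupe-sample.txt'
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'dedupe-result.txt'
Set-Content -Path $source -Value 'dog', 'Apple', 'Dog', 'apple', 'Carrot' -Encoding UTF8

# Sort-Object -Unique treats 'dog' and 'Dog' as the same line by default.
# Set-Content with an explicit encoding avoids the UTF-16 output of '>'.
Get-Content -Path $source |
    Sort-Object -Unique |
    Set-Content -Path $target -Encoding UTF8

$result = Get-Content -Path $target   # three lines: one apple, one Carrot, one dog
```

Which casing of each duplicate survives is not something to rely on. If that matters, lower-casing every line first (for example with % { $_.ToLower() }) makes the outcome predictable.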



'Jeff' on Mon, 19 Jul 2010 22:12:11 GMT, sez:

Thats! Very useful info.



'generique' on Tue, 16 Nov 2010 14:36:43 GMT, sez:

Nice solution
Is it possible to filter rows by regex so that only matched rows would get into result file?
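For the regex question, one possible approach is to put a Where-Object filter with the -match operator in front of the de-duplication step, so only matching rows reach it. The pattern and the sample lines below are made up for illustration. Select-Object -Unique is used here because it keeps the first occurrence of each line in its original order.

```powershell
# Hypothetical pattern: keep only rows that start with three digits and a dash.
$pattern = '^\d{3}-'

# Made-up sample rows standing in for: gc $filename
$lines = '123-a', 'abc', '123-a', '456-b', 'xyz-9'

$result = $lines |
    Where-Object { $_ -match $pattern } |   # drop rows that do not match
    Select-Object -Unique                   # then drop repeated rows

$result   # 123-a, 456-b
```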



'silver' on Sat, 04 Dec 2010 03:36:15 GMT, sez:

was the file almost completely duplicated?



'DanGle neck' on Tue, 14 Dec 2010 23:02:33 GMT, sez:

Thanks for the blog entry -- it helped me write the following code segment

TO: 'NOOB'- I needed to ignore case also, and here is the code I came up with:

# define a new empty hash table
$hashTable = @{}

# $filesArr (the files to scan) and $regExpObj (a [regex] object) are assumed to be defined earlier
foreach ($file in $filesArr) {
    # read file contents as a single string
    $fileContentsStr = [string]::Join([Environment]::NewLine, (Get-Content -Path $file.FullName))

    $resultsObj = $regExpObj.Matches($fileContentsStr)

    foreach ($tempObj in $resultsObj) {
        $valueStr = [string]$tempObj.Value
        $keyStr = $valueStr.ToLower()

        #-- only add non-duplicate values
        if (!($hashTable.ContainsKey($keyStr))) {
            $hashTable.Add($keyStr, $valueStr)
            Write-Host -ForegroundColor DarkMagenta "===== hashTable.Add( $keyStr, $valueStr) ====="
        }
    }
}



'JonD' on Thu, 27 Jan 2011 21:14:45 GMT, sez:

Is there a way to do this, while keeping the entire line of text but only checking the first x characters (say first 7 characters)?
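One way to approach this is to use only the leading characters as the hashtable key while still sending the whole line down the pipe. A minimal sketch, where the function name Select-FirstByPrefix and its -Length parameter are hypothetical. Note that a @{} hashtable compares keys case-insensitively, so prefixes that differ only in case are treated as equal here.

```powershell
function Select-FirstByPrefix {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line,       # one line from the pipeline
        [int]$Length = 7     # how many leading characters decide uniqueness
    )
    begin { $seen = @{} }
    process {
        # Lines shorter than $Length are compared in full.
        if ($Line.Length -gt $Length) {
            $key = $Line.Substring(0, $Length)
        } else {
            $key = $Line
        }
        if (-not $seen.ContainsKey($key)) {
            $seen[$key] = $true
            $Line            # emit the entire line, not just the prefix
        }
    }
}

# Made-up sample: the first two lines share their first 7 characters.
$sample = 'ABCDEFG-one', 'ABCDEFG-two', 'short', 'ABCDEFH-three'
$sample | Select-FirstByPrefix -Length 7   # ABCDEFG-one, short, ABCDEFH-three
```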



'arnold' on Sat, 19 Mar 2011 22:57:20 GMT, sez:

heya :)
I was very lucky to find your page!
my contribution is:

cls
$path = "C:\studio"
$DateTime = get-date
$minutes = get-date -format "mm"
$VolleStunde = get-date -format "HH"
$DiffMinutes = [Int64]$VolleStunde - ([int64]$VolleStunde - [Int64]$minutes)
$files = Get-Childitem $path -recurse -Force | where {($_.LastWriteTime -ge ([DateTime]::Now.Addminutes( - $DiffMinutes)) -and ($_.LastWriteTime -ge ([DateTime]::Today)) -and ($_.length -eq 0))}

$NbrFiles = $files.Count
if ([Int]$VolleStunde -eq [Int]22)
{$files | Export-Clixml -path c:\studio\3D\png-zerro.xml
}

Write-Host
Write-Host "Abfrage: Zerro (length=0) files ?"
Write-Host
Write-Host "Datum-Uhrzeit:"$DateTime
Write-Host $DiffMinutes"': Nach "$VolleStunde":00 Uhr."

switch ([int]$NbrFiles)
{
{$_ -lt [int]1} {Write-Host "0 : Zerro (length=0) files"}
{$_ -gt [int]0} {Write-Host $NbrFiles ": Zerro (length=0) files"}
}

Write-Host "letzte(n) $DiffMinutes Minute(n) in diesen Folder: ""$path""."
Write-Host "[Facultative mit: Subfolder = parent folder]."
$files | out-file C:\Abfrage.txt
Write-Host

switch ([int]$NbrFiles)
{
{$_ -lt [int]1} {Write-Host "Namen der Files: Keine"}
{$_ -gt [int]0} {Write-Host "Zerro (length=0) files:"
Write-Host
Write-Host $files}
}

Write-Host
Write-Host "Aufgeteilte Files MMC-Namen (split, unique):"
Write-Host
$splitline = @([regex]::Split($files," "))
$splitline = $splitline[0..$NbrFiles] -replace (".png","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe2_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe2_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe3_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe8_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe9_sim_Mmc","")
$hash = @{}
$FilesName = $splitline | % {if ($hash.$_ -eq $null) { $_ };$hash.$_ = 1 }

Write-Host $FilesName.Count ": MMC"
Write-Host "MMC Namen:"
Write-Host
$FilesName


and the result, with your help, is:

Abfrage: Zerro (length=0) files ?

Datum-Uhrzeit: 19/03/2011 23:51:19
51': Nach 23:00 Uhr.
16 : Zerro (length=0) files
letzte(n) 51 Minute(n) in diesen Folder: "C:\studio".
[Facultative mit: Subfolder = parent folder].

Zerro (length=0) files:

graphe2_MmcSeti.germany.png graphe2_MmcWlkchen.png graphe2_sim_MmcKira-Casa.png graphe2_sim_MmcSeti.germany.png graphe3_sim_MmcDingakastown.png graphe3_sim_MmcGod-Guildcity.png graphe3_sim_MmcLa-Bodda.png graphe3_sim_MmcSaarlouis.png graphe8_sim_MmcDingakastown.png graphe8_sim_MmcDussel-dorf.png graphe8_sim_MmcGod-Guildcity.png graphe8_sim_MmcLa-Bodda.png graphe8_sim_MmcSeti.germany.png graphe9_sim_MmcDingakastown.png graphe9_sim_MmcGod-Guildcity.png graphe9_sim_MmcLa-Bodda.png

Aufgeteilte Files MMC-Namen (split, unique):

8 : MMC
MMC Namen:

Seti.germany
Wlkchen
Kira-Casa
Dingakastown
God-Guildcity
La-Bodda
Saarlouis
Dussel-dorf



'Mike' on Thu, 08 Sep 2011 14:08:57 GMT, sez:

Good Stuff.



'terry' on Wed, 14 Dec 2011 17:28:28 GMT, sez:

Thank you very much!

@Xaegr: hey, you are wrong anyway.
The SecretGeek script is faster than Sort+Unique.

In my test, on a text file with 313,000 lines:
SecretGeek: 27 seconds.
Sort+Unique: 32 seconds.

I ran it several times and the result was always the same.



'terry' on Wed, 14 Dec 2011 17:41:52 GMT, sez:

To add to my previous comment:
Measure-Command only captures the time the command takes to execute, not the time to write the file to disk.

So, the real measurements, which I took with the Windows clock:
SecretGeek: 1 min and 20 sec.
Sort+Unique: 1 min and 35 sec.

So, not such a big difference.
But keep in mind that the first script doesn't sort the items, and very often that is a good thing.



home .: about .: sign up .: sitemap .: secretGeek RSS .: © Leon Bambrick 2006 .: privacy