Remove Duplicate Rows From A Text File Using Powershell
If your text file is already sorted... then removing duplicates is very easy.
PS:\> gc $filename | get-unique > $newfileName
(But remember, the Get-Unique command only works on sorted data!)
If the file's content is not sorted, and the final order of the lines is unimportant, then it's also easy.... Sort it -- and then use Get-Unique
gc $filename | sort | get-unique > $newfileName
(You now end up with a file that is sorted, and where every line is unique)
However... the case that one bumps into is always the tricky case...
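As a quick check of the sorted-data caveat above: Get-Unique only compares each item with the one right before it, so duplicates that are not adjacent slip through. A minimal sketch with made-up sample values (the variable names are just for illustration):

```powershell
# Made-up sample data with duplicates that are not all adjacent.
$lines = 'Dog', 'Apple', 'Dog', 'Dog', 'Apple'

# Get-Unique only collapses runs of identical neighbours.
$unsorted = @($lines | Get-Unique)

# Sorting first puts every duplicate next to its twin.
$sorted = @($lines | Sort-Object | Get-Unique)

"Without sort: $($unsorted -join ', ')"   # Without sort: Dog, Apple, Dog, Apple
"With sort:    $($sorted -join ', ')"     # With sort:    Apple, Dog
```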
If the file data is not 'sorted', but the order *is* important... then it's a little trickier. I've got an approach... let's turn it into a solution.
Remove Duplicate Rows From A Text File Using Powershell... unsorted file, where order is important
I'm going to add each line to a hash table.
But before adding it -- I'll check if the line is already in the hash table.
If it's not there yet -- then I'll send that line into the new file. Here's an example:
PS H:\> $hash = @{} # define a new empty hash table
PS H:\> gc c:\rawlist.txt |
>> %{if($hash.$_ -eq $null) { $_ }; $hash.$_ = 1} >
>> c:\newlist.txt
I test it out... given this input:
Apple
Dog
Dog
Carrot
Banana
Fun
Dog
Apple
Egg
Carrot
Egg
I get this output...
Apple
Dog
Carrot
Banana
Fun
Egg
Okay... I thought that was going to be really hard. Huh.
One more thing to do... see if I can comment this a little better...
PS H:\> $hash = @{}                 # Define an empty hashtable
PS H:\> gc c:\rawlist.txt |         # Send the content of the file into the pipeline...
>>   % {                            # For each object in the pipeline...
>>                                  # note '%' is an alias of 'foreach-object'
>>     if ($hash.$_ -eq $null) {    # if that line is not a key in our hashtable...
>>                                  # note -eq means 'equals'
>>                                  # note $_ means 'the data we got from the pipe'
>>                                  # note $null means NULL
>>       $_                         # ... send that line further along the pipe
>>     };
>>     $hash.$_ = 1                 # Add that line to the hash (so we won't send it again)
>>                                  # note that the value isn't important here,
>>                                  # only the key. ;-)
>>   } > c:\newlist.txt             # finally... redirect the pipe into a new file.
>>
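One caveat with the `$hash.$_ -eq $null` test: dot notation on a hashtable also reaches the hashtable's own properties, so a line whose whole text is `Count` or `Keys` never looks "new" (`$hash.Count` is 0 on an empty hashtable, not $null) and gets dropped. Also, `@{}` matches keys case-insensitively. Here is a hedged sketch of a variant that avoids the first issue by using a .NET HashSet instead. This is a swapped-in technique, not the approach from the post, and it is case-sensitive. The sample data and the `$seen`, `$sample` and `$result` names are made up for illustration.

```powershell
# HashSet[string] uses exact (case-sensitive) matching by default.
$seen = New-Object 'System.Collections.Generic.HashSet[string]'

# Made-up sample, including lines that collide with hashtable property names.
$sample = 'Apple', 'Dog', 'Dog', 'Count', 'Apple', 'Keys'

# Add() returns $true only the first time a value is stored,
# so Where-Object passes each distinct line exactly once, in original order.
$result = @($sample | Where-Object { $seen.Add($_) })

$result -join ', '        # Apple, Dog, Count, Keys

# For comparison: the property lookup that trips up the hashtable version.
$hash = @{}
$null -eq $hash.Count     # False, because Count is a property with value 0
```

For a file, the same filter would sit between the read and the write, for example: Get-Content c:\rawlist.txt | Where-Object { $seen.Add($_) } | Set-Content c:\newlist.txt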
'Xaegr' on Mon, 16 Apr 2007 00:08:56 GMT, sez: PS> measure-command { Get-Content tst.txt | Select-Object -Unique } | ... TotalMilliseconds
32,7035
PS> $hash=@{}; measure-command { Get-Content tst.txt | %{if($hash.$_ -eq $null){$_};$hash.$_=1} } | ... TotalMilliseconds
127,7818
what is the point?
'Xaegr' on Mon, 16 Apr 2007 00:10:45 GMT, sez: Sorry, I rechecked and the values differ by less. But select -unique is faster anyway.
'lb' on Mon, 16 Apr 2007 00:31:38 GMT, sez: "Select-Object -unique"
ah, thanks Xaegr, i didn't know about that!
that looks better.
'frb' on Tue, 08 May 2007 17:50:16 GMT, sez: "Sort -unique" is faster, and has the side effect of sorting your data.
'AP' on Wed, 23 May 2007 22:29:37 GMT, sez: fails to work with something more complex than Apple orange da da da
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkersResponseService" value="WorkersResponse" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />
<add key="WorkerTimeOutLimit" value="600" />
'n00b' on Sun, 22 Mar 2009 11:24:10 GMT, sez: I ran:
gc $filename | sort | get-unique > $newfileName
on a 7.5Mb text file (which is just a list - one entry per line), and out came a 13.9Mb version of said text file.
I figured it had something to do with the sort order, and ran it again. This time it came out to be 13.8Mb.
Scrolling through the list, I figured out it was case sensitive. Is there any way to do a tolower() or something? And why was the file almost completely duplicated?
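A hedged note on both questions. Sort-Object compares case-insensitively by default while Get-Unique compares case-sensitively, so after a plain sort, "Apple" and "apple" can end up interleaved and Get-Unique treats each change of case as a new line. Sort-Object's own -Unique switch ignores case unless -CaseSensitive is given. As for the size: in Windows PowerShell the `>` redirect writes UTF-16, which takes two bytes per ASCII character, so that is a likely explanation for a file growing from 7.5Mb to almost double. A minimal sketch with made-up data:

```powershell
# Made-up sample: the same words in different cases.
$lines = 'apple', 'Apple', 'banana', 'APPLE', 'Banana'

# -Unique on Sort-Object ignores case by default,
# so only one spelling of each word survives.
$unique = @($lines | Sort-Object -Unique)

# Lower-casing first makes the surviving spelling predictable.
$lowered = @($lines | ForEach-Object { $_.ToLower() } | Sort-Object -Unique)

$unique.Count             # 2
$lowered -join ', '       # apple, banana
```

To control the output size, write with an explicit encoding instead of `>`, for example: gc $filename | sort -unique | Out-File $newfileName -Encoding ascii (assuming the list is plain ASCII; utf8 is the safer choice otherwise).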
'Jeff' on Mon, 19 Jul 2010 22:12:11 GMT, sez: Thats! Very useful info.
'generique' on Tue, 16 Nov 2010 14:36:43 GMT, sez: Nice solution
Is it possible to filter rows by regex so that only matched rows would get into result file?
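One way this could be done, as a sketch rather than a definitive answer: put a -match filter in front of the duplicate check, so only matching rows are considered at all. The pattern, the sample rows and the variable names below are made up for illustration; the duplicate check uses a .NET HashSet rather than the hashtable from the post.

```powershell
# Hypothetical pattern: keep only rows that end in "Service".
$pattern = 'Service$'

# Made-up sample rows.
$lines = 'WorkerService', 'Timeout', 'WorkerService', 'MailService', 'Timeout'

$seen = New-Object 'System.Collections.Generic.HashSet[string]'

# First keep rows that match the regex, then drop repeats of those rows.
$result = @($lines |
    Where-Object { $_ -match $pattern } |
    Where-Object { $seen.Add($_) })

$result -join ', '        # WorkerService, MailService
```

Note that -match is case-insensitive; -cmatch is the case-sensitive form.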
'DanGle neck' on Tue, 14 Dec 2010 23:02:33 GMT, sez: Thanks for the blog entry -- it helped me write the following code segment
TO: 'NOOB'- I needed to ignore case also, and here is the code I came up with:
# define a new empty hash table
$hashTable = @{}
# note: $filesArr (the files to scan) and $regExpObj (a [regex]) are assumed
# to be defined earlier in the script
foreach ($file in $filesArr) {
    # read file contents as a single string
    $fileContentsStr = [string]::join([environment]::newline, (get-content -path $file.fullname))
    $resultsObj = $regExpObj.Matches($fileContentsStr)
    foreach ($tempObj in $resultsObj) {
        $valueStr = [string]$tempObj.value
        # the lower-cased value is the key, so the duplicate check ignores case
        $keyStr = [string]$valueStr.toLower()
        #-- only add non-duplicate values
        if (!($hashTable.ContainsKey($keyStr))) {
            $hashTable.Add($keyStr, $valueStr)
            Write-Host -ForegroundColor DarkMagenta "===== hashTable.Add( $keyStr, $valueStr) ====="
        }
    }
}
'JonD' on Thu, 27 Jan 2011 21:14:45 GMT, sez: Is there a way to do this, while keeping the entire line of text but only checking the first x characters (say first 7 characters)?
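A possible approach, sketched under assumptions: use only the first x characters as the "have I seen this?" key, but send the whole line down the pipe. The prefix length, sample lines and variable names are made up for illustration, and the duplicate check again uses a .NET HashSet instead of the post's hashtable.

```powershell
# Number of leading characters to compare (7, as in the question).
$prefixLength = 7

# Made-up sample lines; the first two share the same 7-character prefix.
$lines = '2011-01 first entry', '2011-01 second entry', '2011-02 third entry', 'short'

$seen = New-Object 'System.Collections.Generic.HashSet[string]'

$result = @($lines | Where-Object {
    # Lines shorter than the prefix are keyed on the whole line.
    $key = $_.Substring(0, [Math]::Min($prefixLength, $_.Length))
    # Keep the full line only the first time its prefix shows up.
    $seen.Add($key)
})

$result -join ' | '       # 2011-01 first entry | 2011-02 third entry | short
```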
'arnold' on Sat, 19 Mar 2011 22:57:20 GMT, sez: heya :)
I was very lucky to find your page!
my contribution is:
cls
$path = "C:\studio"
$DateTime = get-date
$minutes = get-date -format "mm"
$VolleStunde = get-date -format "HH"
$DiffMinutes = [Int64]$VolleStunde - ([int64]$VolleStunde - [Int64]$minutes)
$files = Get-Childitem $path -recurse -Force | where {($_.LastWriteTime -ge ([DateTime]::Now.Addminutes( - $DiffMinutes)) -and ($_.LastWriteTime -ge ([DateTime]::Today)) -and ($_.length -eq 0))}
$NbrFiles = $files.Count
if ([Int]$VolleStunde -eq [Int]22)
{$files | Export-Clixml -path c:\studio\3D\png-zerro.xml
}
Write-Host
Write-Host "Abfrage: Zerro (length=0) files ?"
Write-Host
Write-Host "Datum-Uhrzeit:"$DateTime
Write-Host $DiffMinutes"': Nach "$VolleStunde":00 Uhr."
switch ([int]$NbrFiles)
{
{$_ -lt [int]1} {Write-Host "0 : Zerro (length=0) files"}
{$_ -gt [int]0} {Write-Host $NbrFiles ": Zerro (length=0) files"}
}
Write-Host "letzte(n) $DiffMinutes Minute(n) in diesen Folder: ""$path""."
Write-Host "[Facultative mit: Subfolder = parent folder]."
$files | out-file C:\Abfrage.txt
Write-Host
switch ([int]$NbrFiles)
{
{$_ -lt [int]1} {Write-Host "Namen der Files: Keine"}
{$_ -gt [int]0} {Write-Host "Zerro (length=0) files:"
Write-Host
Write-Host $files}
}
Write-Host
Write-Host "Aufgeteilte Files MMC-Namen (split, unique):"
Write-Host
$splitline = @([regex]::Split($files," "))
$splitline = $splitline[0..$NbrFiles] -replace (".png","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe2_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe2_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe3_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe8_sim_Mmc","")
$splitline = $splitline[0..$NbrFiles] -replace ("graphe9_sim_Mmc","")
$hash = @{}
$FilesName = $splitline | % {if ($hash.$_ -eq $null) { $_ };$hash.$_ = 1 }
Write-Host $FilesName.Count ": MMC"
Write-Host "MMC Namen:"
Write-Host
$FilesName
and the result, with your help, is:
Abfrage: Zerro (length=0) files ?
Datum-Uhrzeit: 19/03/2011 23:51:19
51': Nach 23:00 Uhr.
16 : Zerro (length=0) files
letzte(n) 51 Minute(n) in diesen Folder: "C:\studio".
[Facultative mit: Subfolder = parent folder].
Zerro (length=0) files:
graphe2_MmcSeti.germany.png graphe2_MmcWlkchen.png graphe2_sim_MmcKira-Casa.png graphe2_sim_MmcSeti.germany.png graphe3_sim_MmcDingakastown.png graphe3_sim_MmcGod-Guildcity.png graphe3_sim_MmcLa-Bodda.png graphe3_sim_MmcSaarlouis.png graphe8_sim_MmcDingakastown.png graphe8_sim_MmcDussel-dorf.png graphe8_sim_MmcGod-Guildcity.png graphe8_sim_MmcLa-Bodda.png graphe8_sim_MmcSeti.germany.png graphe9_sim_MmcDingakastown.png graphe9_sim_MmcGod-Guildcity.png graphe9_sim_MmcLa-Bodda.png
Aufgeteilte Files MMC-Namen (split, unique):
8 : MMC
MMC Namen:
Seti.germany
Wlkchen
Kira-Casa
Dingakastown
God-Guildcity
La-Bodda
Saarlouis
Dussel-dorf
'Mike' on Thu, 08 Sep 2011 14:08:57 GMT, sez: Good Stuff.
'terry' on Wed, 14 Dec 2011 17:28:28 GMT, sez: Thank you very much!
@ Hey, Xaegr, you are wrong anyway.
The SecretGeek script is faster than Sort+Unique.
In my test: a text file with 313.000 lines:
SecretGeek: 27 seconds.
Sort+Unique: 32 seconds.
I ran it several times and it's always the same.
'terry' on Wed, 14 Dec 2011 17:41:52 GMT, sez: To add to my previous comment:
'measure-command' only measures the time the command takes to execute, not the time to write the file to disk.
So, the real measurement, which I took with the Windows clock:
SecretGeek: 1 min and 20 sec.
Sort+Unique: 1 min and 35 sec.
So, not such a big difference.
But keep in mind that the first script doesn't sort the items, and very often that is what you want.