How do I use HashSet to remove duplicates from a text file? (C#)
I've decided to create a program that does quite a few things. Part of it is a "text tools" section that loads a text file (via one button) and has additional buttons that perform other operations on it: removing whitespace and empty lines, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.
I'm able to import the file and print the list using a foreach loop, and I believe I'm on the right track, but now I need to remove duplicates. I've decided to use HashSet after reading a thread which says it's the simplest and fastest method (my file will contain millions of lines).
The problem is that I can't figure out what I'm doing wrong. I have the event handler for the button click, I create a list of strings in memory, loop through each line in the file (adding it to the list), and then try to construct a HashSet from the list. (Sorry if that's convoluted; it doesn't work, and I don't know why.)
I've looked at every Stack Overflow question similar to this but can't find a solution, and I've also read up on HashSet in general, to no avail.
Here's my code so far:
private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
{
    List<string> list = new List<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        list.Add(line);
    }
    var DuplicatesRemoved = new HashSet<String>(list);
}
c#
stackoverflow.com/questions/31052953/… – Mitch Wheat, Nov 15 '18 at 2:13

docs.microsoft.com/en-us/dotnet/api/… – mjwills, Nov 15 '18 at 2:19

cannot convert from 'System.Collections.Generic.List<string>' to 'System.Collections.Generic.IEqualityComparer<System.Windows.Documents.List>' – College Ameteur, Nov 15 '18 at 2:20

"Respectfully I didn't open the question to ask for links that I've already found" If you are going to be snarky, at least provide the links that you have read. We aren't mind readers. :) – mjwills, Nov 15 '18 at 2:24

I'd suggest stopping using the List<string> altogether and using a HashSet<string> then. You don't need the List. Note that HashSet could, in theory, return data in a different order than in the file (it won't with the current implementation, but it could in the future). – mjwills, Nov 15 '18 at 2:36
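The compile error quoted in the comments hints at the likely cause: in a WPF code-behind, the bare name List can resolve to System.Windows.Documents.List instead of System.Collections.Generic.List<T>, so a HashSet constructed with that type sees no matching constructor. A minimal sketch of the comment's suggestion (an assumption about the fix, not the asker's confirmed code) skips the intermediate list entirely and reads straight into a HashSet<string>; FilePath is assumed to be set elsewhere, as in the question:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch: deduplicate while reading, no intermediate List needed.
// HashSet<string>.Add returns false for a line that has already been seen.
var uniqueLines = new HashSet<string>();
foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
{
    uniqueLines.Add(line);  // duplicates are silently ignored
}
// uniqueLines now holds each distinct line exactly once.
```

If both namespaces must stay imported, fully qualifying the type (System.Collections.Generic.List<string>) or adding a using alias would also resolve the ambiguity.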
asked Nov 15 '18 at 2:11, edited Nov 15 '18 at 2:28 – College Ameteur
2 Answers
To be specific to your question (and to get my last 3 points):
var lines = File.ReadAllLines("somepath");
var hashSet = new HashSet<string>(lines);
File.WriteAllLines("somepath", hashSet.ToList());
Note that there are other, possibly more performant, ways of doing this; it depends on the number of duplicates and the size of the file.

answered Nov 15 '18 at 2:29 – Michael Randall
Two things: 1) Would this write the file to the same path it was read from (just to clarify)? 2) I used ReadLines above because people said it was faster; would there be any performance difference between the two methods on a file with millions of lines? – College Ameteur, Nov 15 '18 at 2:32
@CollegeAmeteur Millions of lines is a completely different optimization problem, and there may be several things involved in making this more efficient than ReadAllLines and ReadLines. What I suggest you do is download a benchmark tool and see what works for you. – Michael Randall, Nov 15 '18 at 2:36
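Following up on the benchmarking suggestion, a rough Stopwatch sketch is enough to see the difference between the two read methods ("input.txt" is a placeholder path; a dedicated benchmark tool would give more reliable numbers):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Rough timing sketch; "input.txt" is a placeholder, not a path from the question.
var sw = Stopwatch.StartNew();
var viaReadAllLines = new HashSet<string>(File.ReadAllLines("input.txt"));
sw.Stop();
Console.WriteLine($"ReadAllLines: {sw.ElapsedMilliseconds} ms, {viaReadAllLines.Count} unique lines");

sw.Restart();
var viaReadLines = new HashSet<string>(File.ReadLines("input.txt"));
sw.Stop();
Console.WriteLine($"ReadLines:    {sw.ElapsedMilliseconds} ms, {viaReadLines.Count} unique lines");
```

The practical difference is memory rather than raw speed: ReadAllLines materializes the whole file as a string[] before the set is built, while ReadLines streams the file line by line, keeping peak memory closer to the size of the distinct lines alone.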
add a comment |
It is preferable to process the file as a stream if possible. I would not even call it an optimization; I would rather call it not being wasteful. If a streaming approach is available, the ReadAllLines approach is somewhere between almost as good and very bad, depending on the situation. It is also a good idea to preserve the order of the lines: HashSet generally does not preserve order, so if you store everything into it and then read it back, the lines can come out shuffled.
using (var outFile = new StreamWriter(outFilePath))
{
    HashSet<string> seen = new HashSet<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        if (seen.Add(line))
        {
            outFile.WriteLine(line);
        }
    }
}

answered Nov 15 '18 at 3:24 – Antonín Lejsek
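If lines that differ only by case should also count as duplicates, the set in this answer can be constructed with an IEqualityComparer<string>. A hypothetical variant (outFilePath and FilePath as in the answer above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Variant sketch: case-insensitive deduplication.
// The first occurrence wins, and its original casing is what gets written.
using (var outFile = new StreamWriter(outFilePath))
{
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        if (seen.Add(line))
        {
            outFile.WriteLine(line);
        }
    }
}
```

The same pattern accepts any comparer, so an application-specific rule (e.g. trimming whitespace before comparing) only needs a custom IEqualityComparer<string> implementation.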