Compare two large CSV files and make a third one from the difference
I need to compare two large CSV files (approx. 5 to 8 GB) and produce a third CSV file from their difference.
Any suggestions or pointers to libraries that support this, or any reference to get started, would be appreciated.
For example:
File 1.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  1|     a|  Ran1|
|  2|     b|  Ran2|
+---+------+------+
File 2.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  3|     c|  Ran3|
|  2|     b|  Ran2|
+---+------+------+
The schema of both files is the same.
Result - file 3.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  2|     b|  Ran2|
+---+------+------+
scala scala-collections
For handling the CSV files you may take a look at cormorant. Additionally, given that the files are large, I wouldn't use the standard Scala collections but some form of streaming instead (for example fs2). That said, I don't really understand what exactly you mean by "difference".
– Luis Miguel Mejía Suárez
Nov 9 at 15:45
For streaming large files, something like Akka Streams is pretty common.
– James Whiteley
Nov 9 at 16:08
By "difference", do you mean a textual difference or a content difference? "4.50" is textually different from "4.5", but the two are not different if the field is treated as a number.
– Bob Dalgleish
Nov 9 at 17:35
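To make the textual versus content distinction concrete, here is a minimal sketch using only the Scala standard library. The names `FieldCompare`, `textuallyEqual` and `numericallyEqual` are hypothetical, and treating a field as a number via `BigDecimal` is just one assumed reading of "content" comparison.

```scala
import scala.util.Try

object FieldCompare {
  // Textual comparison: the raw strings must match character for character.
  def textuallyEqual(a: String, b: String): Boolean = a == b

  // Content comparison (assumed rule): if both fields parse as decimals,
  // compare their numeric values; otherwise fall back to textual comparison.
  def numericallyEqual(a: String, b: String): Boolean = {
    val parsed = for {
      x <- Try(BigDecimal(a.trim)).toOption
      y <- Try(BigDecimal(b.trim)).toOption
    } yield x.compare(y) == 0
    parsed.getOrElse(a == b)
  }
}
```

With this sketch, `"4.50"` and `"4.5"` differ under `textuallyEqual` but match under `numericallyEqual`, so the choice of comparison changes which rows end up in the output file.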
If there is a difference, how do you resynchronize the two sources? Do you look for the next non-differing line, or do you have key fields, such as a timestamp, that guarantee ordered access?
– Bob Dalgleish
Nov 9 at 17:36
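If a key field does guarantee ordered access, resynchronization can be done with a merge-style walk over both inputs. The following is only a sketch under strong assumptions: both inputs are already sorted ascending by a numeric key in the first column (the sample `File 2.csv` in the question is not), no field contains a quoted comma, and header lines have been dropped beforehand. `SortedDiff`, `keyOf` and `diff` are hypothetical names.

```scala
object SortedDiff {
  // Hypothetical helper: the key is the first comma-separated field, read as a Long.
  // Assumes no quoted fields and no header line in the input.
  def keyOf(line: String): Long = line.takeWhile(_ != ',').trim.toLong

  // Walks two iterators that are each sorted ascending by key and emits the
  // lines that appear on only one side, or whose text differs for the same key.
  def diff(left: Iterator[String], right: Iterator[String]): Iterator[String] = {
    val l = left.buffered
    val r = right.buffered
    new Iterator[String] {
      private var pending: Option[String] = None

      // Advances both inputs until the next differing line is found.
      private def advance(): Unit = {
        while (pending.isEmpty && (l.hasNext || r.hasNext)) {
          if (!r.hasNext) pending = Some(l.next())
          else if (!l.hasNext) pending = Some(r.next())
          else {
            val lk = keyOf(l.head)
            val rk = keyOf(r.head)
            if (lk < rk) pending = Some(l.next())      // key only on the left
            else if (lk > rk) pending = Some(r.next()) // key only on the right
            else {
              val a = l.next()
              val b = r.next()
              // Same key, different text: keep the right-hand version (assumption).
              if (a != b) pending = Some(b)
            }
          }
        }
      }

      def hasNext: Boolean = { advance(); pending.nonEmpty }

      def next(): String = {
        advance()
        val out = pending.getOrElse(throw new NoSuchElementException("no more differences"))
        pending = None
        out
      }
    }
  }
}
```

Because only the current line of each input is held at a time, memory use stays constant regardless of file size; the cost is that both files must be sorted by key first.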
@BobDalgleish - The schema of both CSV files is the same; there is a textual difference between the two files.
– Shaitender Singh
Nov 11 at 0:52
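For a purely textual, line-level difference, one possible starting point with no external libraries is sketched below. It is not the cormorant, fs2 or Akka Streams approach suggested above but a plain `scala.io.Source` substitute. Assumptions: "difference" means lines of the second file that do not occur verbatim in the first, both files are UTF-8 with an identical header line, and the data lines of the first file fit in memory as a `Set` (which may well not hold at 5 to 8 GB; a sort followed by a merge would avoid that). `LineDiff` and `writeDifference` are hypothetical names.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object LineDiff {
  // Writes to `outPath` every line of `secondPath` that does not occur
  // verbatim in `firstPath`. The header is copied once from the second file
  // and excluded from the comparison.
  def writeDifference(firstPath: String, secondPath: String, outPath: String): Unit = {
    val first = Source.fromFile(firstPath, "UTF-8")
    val second = Source.fromFile(secondPath, "UTF-8")
    val out = new PrintWriter(new File(outPath), "UTF-8")
    try {
      // Caveat: this set holds every data line of the first file in memory.
      val seen = first.getLines().drop(1).toSet
      // The second file is streamed lazily, one line at a time.
      val lines = second.getLines()
      if (lines.hasNext) out.println(lines.next()) // header
      lines.filterNot(seen.contains).foreach(line => out.println(line))
    } finally {
      out.close()
      second.close()
      first.close()
    }
  }
}
```

A call such as `LineDiff.writeDifference("1.csv", "2.csv", "3.csv")` would apply it to the files from the question. Note that the expected result shown in the question contains the row common to both files; if that is what is wanted, `filterNot` would become `filter`.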
edited Nov 11 at 0:49
asked Nov 9 at 15:32
Shaitender Singh