How to split and bin a large text file using bash?

I have a very large txt file like this:



a 1 0
a 2 0
a 3 2
a 4 2
a 5 0
a 6 1
a 7 0
a 8 1
a 9 1
a 10 0
b 1 0
b 2 0
b 3 12
b 4 21
b 5 20
b 6 1
c 1 0
c 2 0
c 3 202
c 4 222
c 5 210
c 6 120
c 7 10
c 8 1
c 9 1
c 10 0
c 11 0
c 12 20
c 13 222
c 14 122
c 15 50


I want to bin this file based on the first column: each value in the first column gets 3 bins, and each bin contains the sum of the 3rd column over its rows:



a bin1 2
a bin2 3
a bin3 2
b bin1 0
b bin2 33
b bin3 21
c bin1 634
c bin2 132
c bin3 414


and save the result to a new text file.



My original file is extremely large: more than 100 million rows and over 2 GB in size. Is there a better way to do this without running out of RAM?

Thank you.

  • How are the bins grouped? There are three bins for each value in the first column, but how do you know which row goes into which of the 3 bins? Is it that, if there are n lines with a given value in the first column, the first floor(n/3) lines go to bin1, the next floor(n/3) lines go to bin2, and the last floor(n/3) + n%3 lines go to bin3? If you are concerned about memory, use split to split the file into parts, or csplit to split on the first column. What have you tried? What does not work?
    – Kamil Cuk
    Nov 7 at 22:17
  • Please show what you have tried. Maybe that will clarify what you are trying to do. idownvotedbecau.se/unclearquestion idownvotedbecau.se/noattempt
    – Paul Hodges
    Nov 7 at 22:21
  • @Kamil Cuk Hi Kamil, thank you for the reply. You understood it correctly. To answer your question about which row goes into which of the 3 bins, take the rows with value a as an example: a bin1 is row1+row2+row3 of the 3rd column; a bin2 is row4+row5+row6 of the 3rd column; a bin3 is the sum of all remaining values of the 3rd column, starting from row7. I know how to do this with Python pandas, but it is too slow and takes forever. Even a simple script like "with open('input.txt') as f: data=f.readlines() for line in data: ....." takes forever, because any for loop over that many rows is unacceptably slow.
    – stevex
    Nov 7 at 23:08
  • That's why I'm thinking of using simple bash commands to do this, or at least to split the file into smaller ones that I can then work on with pandas. I like your csplit idea, but I don't know how to split the original file. Would you mind sharing a script that splits the file on the values in the 1st column, so that I get file1 containing a......, file2 containing b....., file3 containing c....? Thank you!
    – stevex
    Nov 7 at 23:10
  • I think I would csplit the file into parts, one per unique first-column value, then split each part into 3 separate files with roughly equal numbers of lines, then sum the value column with datamash, and finally output everything. Does the second column have any significance? It looks like it can be safely removed.
    – Kamil Cuk
    Nov 7 at 23:16
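
The floor(n/3) rule from the first comment reproduces the expected output in the question (for b, with 6 rows, the bins are rows 1-2, 3-4 and 5-6). Below is a minimal sketch of that rule, assuming awk is available and the columns are whitespace-separated as in the sample. It is not the csplit/datamash pipeline suggested above: it swaps in a single awk call that reads the file twice, first counting rows per first-column value, then summing the 3rd column into bins. Memory use should therefore grow with the number of distinct first-column values rather than with the number of rows. The function name bin_sums and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: bin_sums is a hypothetical helper name, not an existing tool.
# The same file is passed to awk twice, so it is read in two passes.
bin_sums() {
    local file=$1
    awk '
        # Pass 1 (NR == FNR only while reading the first copy): rows per key.
        NR == FNR { count[$1]++; next }

        # Pass 2: place each row in a bin by its position within its key.
        {
            key = $1
            if (!(key in seen)) order[++nkeys] = key   # remember first-seen order
            seen[key]++
            size = int(count[key] / 3)                 # floor(n/3) rows in bin1 and bin2
            if (seen[key] <= size)          bin = 1
            else if (seen[key] <= 2 * size) bin = 2
            else                            bin = 3    # remaining rows, including n%3
            sum[key, bin] += $3
        }

        END {
            for (i = 1; i <= nkeys; i++)
                for (b = 1; b <= 3; b++)
                    print order[i], "bin" b, sum[order[i], b] + 0
        }
    ' "$file" "$file"
}

# Usage (hypothetical file names):
#   bin_sums input.txt > binned.txt
```

Because the row count per value is needed before any row can be assigned to a bin, two passes are used instead of holding rows in memory. Keys with fewer than 3 rows end up entirely in bin3 under this rule.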















Tags: bash

asked Nov 7 at 22:03 by stevex