How to split and bin a large text file using bash?
I have a very large txt file like this:
a 1 0
a 2 0
a 3 2
a 4 2
a 5 0
a 6 1
a 7 0
a 8 1
a 9 1
a 10 0
b 1 0
b 2 0
b 3 12
b 4 21
b 5 20
b 6 1
c 1 0
c 2 0
c 3 202
c 4 222
c 5 210
c 6 120
c 7 10
c 8 1
c 9 1
c 10 0
c 11 0
c 12 20
c 13 222
c 14 122
c 15 50
I want to bin this file by the first column: each value gets 3 bins, and each bin contains the sum of the 3rd column over its rows:
a bin1 2
a bin2 3
a bin3 2
b bin1 0
b bin2 33
b bin3 21
c bin1 634
c bin2 132
c bin3 414
and save the final file into a new txt file.
My original file is very large, with more than 100 million rows and a file size over 2 GB. Is there a way to do this without running out of RAM?
Thank you.
bash
How are the bins grouped? There are three bins for each value in the first column, but how do you know which row goes into which of the 3 bins? Is it that, if there are n lines with a given value in the first column, the first floor(n/3) lines go to bin1, the next floor(n/3) lines go to bin2, and the last floor(n/3) + n%3 lines go to bin3? If you are concerned about memory, use split to split the file into parts, or csplit to split on the first column. What have you tried? What does not work?
– Kamil Cuk
Nov 7 at 22:17
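The floor(n/3) rule proposed in this comment can be checked with a little shell arithmetic. The sketch below is only an illustration: `bin_ranges` is a hypothetical helper name, and it assumes every group has at least 3 rows (with fewer, the first two ranges come out empty).

```shell
# bin_ranges N: print the row ranges of the three bins for a group of N rows,
# using the rule from the comment: floor(N/3), floor(N/3), then the rest.
# Hypothetical helper for illustration; assumes N >= 3.
bin_ranges() {
    n=$1
    size=$(( n / 3 ))    # shell integer division floors for non-negative n
    echo "bin1 rows 1-$size"
    echo "bin2 rows $(( size + 1 ))-$(( 2 * size ))"
    echo "bin3 rows $(( 2 * size + 1 ))-$n"
}

# Group sizes taken from the sample data: a has 10 rows, b has 6, c has 15.
for n in 10 6 15; do
    echo "n=$n"
    bin_ranges "$n"
done
```

For the a rows (n = 10) this gives rows 1-3, 4-6 and 7-10, so the remainder n%3 lands in the last bin.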
Please show what you have tried. Maybe that will clarify what you are trying to do. idownvotedbecau.se/unclearquestion idownvotedbecau.se/noattempt
– Paul Hodges
Nov 7 at 22:21
@Kamil Cuk. Hi Kamil, thank you for the reply. You understood it correctly. To answer your question about which row goes into which of the 3 bins, take the a rows as an example: a bin1 is row1+row2+row3 of the 3rd column; a bin2 is row4+row5+row6 of the 3rd column; a bin3 is the sum of all remaining values of the 3rd column, starting from row7. I know how to do this with python pandas, but it is too slow and takes forever. Even a simple script like "with open('input.txt') as f: data=f.readlines() for line in data: ....." takes forever, so any explicit for loop is not acceptable.
– stevex
Nov 7 at 23:08
That's why I'm thinking of using simple bash commands to do this, or at least to split the file into smaller ones that I can then process with pandas. I like your idea of csplit but don't know how to split the original file. Would you mind showing me a script that splits the file on the values in the 1st column, so that I get file1 containing a......, file2 containing b....., file3 containing c....? Thank you!
– stevex
Nov 7 at 23:10
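One possible way to do the per-value split asked for here is a single streaming awk pass rather than csplit. This is a minimal sketch, not a tested solution for the real file: the `part_` file prefix, the temporary directory and the shortened sample input are all assumptions made for illustration, and it relies on rows with the same first column being contiguous, as they are in the sample.

```shell
# Hypothetical working directory and a shortened sample; in practice
# input.txt would be the large file.
workdir=$(mktemp -d)
printf '%s\n' 'a 1 0' 'a 2 0' 'b 1 0' 'b 2 12' 'c 1 202' > "$workdir/input.txt"

# Stream the file once and write each row to a file named after its first
# column (part_a.txt, part_b.txt, ...). Only one row is held in memory at a
# time. The previous output is closed whenever the key changes, so at most
# one output file is open; this assumes equal keys are contiguous.
awk -v dir="$workdir" '
    $1 != prev {
        if (out != "") close(out)
        out = dir "/part_" $1 ".txt"
        prev = $1
    }
    { print > out }
' "$workdir/input.txt"

ls "$workdir"
```

If a key could reappear later in the file, the `>` redirection would overwrite the earlier part, so the input would need to be sorted by the first column first.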
I think I would csplit the file into parts with a unique first column, then split each part into 3 separate files with roughly equal line counts, then sum the values in the column with datamash, and finally output everything. Does the second column have any significance? It looks like it can be safely removed.
– Kamil Cuk
Nov 7 at 23:16
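As an alternative to the csplit, split and datamash pipeline suggested above, the whole job can be sketched as a two-pass awk script that creates no intermediate files. This is a different technique from the one in the comment, shown under assumptions: the bin rule is the floor(n/3) rule confirmed by the asker, groups with fewer than 3 rows put everything in bin3, and the file names `input.txt` and `binned.txt` are placeholders. Memory grows with the number of distinct first-column values, not with the number of rows, but the file is read twice.

```shell
workdir=$(mktemp -d)

# Sample rows from the question (hypothetical file name).
cat > "$workdir/input.txt" <<'EOF'
a 1 0
a 2 0
a 3 2
a 4 2
a 5 0
a 6 1
a 7 0
a 8 1
a 9 1
a 10 0
b 1 0
b 2 0
b 3 12
b 4 21
b 5 20
b 6 1
c 1 0
c 2 0
c 3 202
c 4 222
c 5 210
c 6 120
c 7 10
c 8 1
c 9 1
c 10 0
c 11 0
c 12 20
c 13 222
c 14 122
c 15 50
EOF

# The same file is passed twice: NR == FNR is true only during the first read.
awk '
    # Pass 1: count the rows of each first-column value.
    NR == FNR { n[$1]++; next }
    # Pass 2: assign each row to a bin and accumulate the 3rd column.
    {
        k = $1
        if (!(k in seen)) { seen[k] = 1; order[++nk] = k }  # remember first-seen order
        i = ++idx[k]                 # position of this row within its group
        size = int(n[k] / 3)         # floor(n/3) rows in bin1 and in bin2
        if (size > 0 && i <= size)          b = 1
        else if (size > 0 && i <= 2 * size) b = 2
        else                                b = 3   # remainder goes to the last bin
        sum[k, b] += $3
    }
    END {
        for (j = 1; j <= nk; j++)
            for (b = 1; b <= 3; b++)
                print order[j], "bin" b, sum[order[j], b] + 0
    }
' "$workdir/input.txt" "$workdir/input.txt" > "$workdir/binned.txt"

cat "$workdir/binned.txt"
```

On the sample rows this reproduces the nine output lines listed in the question. How it performs on 100 million rows has not been measured here.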
asked Nov 7 at 22:03
stevex