How to split and bin a large text file using bash?

I have a very large txt file like this:

I want to bin this file based on the first column: each value gets 3 bins, and each bin contains the sum of the 3rd column:

a bin1 2

a bin2 3

a bin3 2

b bin1 0

b bin2 33

b bin3 21

c bin1 634

c bin2 132

c bin3 414

and save the final file into a new txt file.

My original file is extremely large: it contains more than 100 million rows and is over 2 GB in size. Is there a better way to do this without running out of RAM?

Thank you.

asked Nov 7 at 22:03

stevex

235

How are the bins grouped? There are three bins for each value in the first column, but how do you know which row goes into which of the 3 bins? Is it that, if there are n lines with a given value in the first column, the first floor(n/3) values go to bin1, the next floor(n/3) values go to bin2, and the last floor(n/3) + n%3 values go to bin3? If you are concerned about memory, use split to split the file into parts, or csplit to split on the first column. What have you tried? What does not work?
– Kamil Cuk
Nov 7 at 22:17

Please show what you have tried. Maybe that will clarify what you are trying to do. idownvotedbecau.se/unclearquestion idownvotedbecau.se/noattempt
– Paul Hodges
Nov 7 at 22:21

@Kamil Cuk. Hi Kamil, thank you for the reply. You understand it right. Your question: how do you know which row goes into which of the 3 bins? Take the rows with value a as an example: a bin1 (row1+row2+row3 in the 3rd column); a bin2 (row4+row5+row6 in the 3rd column); a bin3 (sum of all remaining values in the 3rd column, starting from row7). I know how to achieve my goal using python pandas, but it is too slow and takes forever. Even simple scripts like "with open('input.txt') as f: data=f.readlines() for line in data: ....." take forever, so any for loop is not acceptable.
– stevex
Nov 7 at 23:08
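
The rule described in the comment above (rows 1 to 3 of each key go to bin1, rows 4 to 6 to bin2, everything from row 7 on to bin3) can be sketched as a single streaming awk pass. This is only a sketch under assumptions: the sample input is not shown in the question, so it assumes whitespace-separated columns with the key in column 1 and an integer value in column 3. The function name `bin_by_key` is hypothetical. awk keeps one counter and three sums per distinct key, so memory use should grow with the number of keys rather than with the 100 million rows.

```shell
#!/usr/bin/env bash

# bin_by_key (hypothetical helper): reads rows from the given files or stdin.
# Assumed layout: column 1 = key, column 3 = integer value to sum.
bin_by_key() {
  awk '
    {
      key = $1
      # Remember keys in order of first appearance for stable output.
      if (!(key in count)) order[++nkeys] = key

      n = ++count[key]            # 1-based row index within this key
      if (n <= 3)      bin = 1    # rows 1-3
      else if (n <= 6) bin = 2    # rows 4-6
      else             bin = 3    # row 7 onward

      sum[key, bin] += $3
    }
    END {
      # Bins that received no rows print as 0.
      for (i = 1; i <= nkeys; i++)
        for (b = 1; b <= 3; b++)
          printf "%s bin%d %d\n", order[i], b, sum[order[i], b]
    }
  ' "$@"
}
```

Usage would look like `bin_by_key input.txt > binned.txt`, where `binned.txt` is a placeholder output name. Keys are printed in the order they first appear in the input.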

That's why I'm thinking of using simple bash commands to do this, or at least to split the file into smaller ones so that I can work on them with pandas. I like your idea of csplit but don't know how to split the original file. Do you mind giving me a script that splits the file based on the values in the 1st column, so that I get file1 containing a......, file2 containing b....., file3 containing c....? Thank you!
– stevex
Nov 7 at 23:10
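
For the split requested above (one file per value in the 1st column), a minimal awk sketch follows. It uses awk rather than csplit, and it assumes the keys are safe to use as file names (no slashes or spaces). The function name `split_by_key` and the `<key>.txt` naming are hypothetical. Only one output file is open at a time, which is meant to avoid hitting the open-file limit when there are many distinct keys.

```shell
#!/usr/bin/env bash

# split_by_key (hypothetical helper)
#   $1 = input file
#   $2 = output directory; one "<key>.txt" file is written per key
split_by_key() {
  local input=$1 outdir=$2
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    {
      out = dir "/" $1 ".txt"
      # When the key changes, close the previous file so that only
      # one output file is open at any time.
      if (out != prev) {
        if (prev != "") close(prev)
        prev = out
      }
      # Append, so a key that reappears later continues its file.
      print >> out
    }
  ' "$input"
}
```

Because the files are opened in append mode, the output directory should be empty before a rerun; otherwise rows are added to the files left over from the previous run.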

I think I would csplit the file into parts with a unique first column, then split each part into 3 separate files with approximately equal numbers of lines, then sum the value column with datamash, and finally output everything. Does the second column have any significance? It looks like it can be safely removed.
– Kamil Cuk
Nov 7 at 23:16
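
If the equal-thirds reading from the comments is what is wanted (first floor(n/3) rows to bin1, next floor(n/3) to bin2, the remaining floor(n/3) + n%3 to bin3), the bin boundaries depend on the row count per key, so the file has to be read twice. The sketch below does this in awk instead of the csplit/split/datamash pipeline. It assumes the same layout as before (key in column 1, integer value in column 3), and the function name `bin_equal_thirds` is hypothetical.

```shell
#!/usr/bin/env bash

# bin_equal_thirds (hypothetical helper)
#   $1 = input file (must be a regular file, since it is read twice)
bin_equal_thirds() {
  awk '
    # Pass 1: count the rows for each key.
    NR == FNR { total[$1]++; next }

    # Pass 2: assign each row to a bin by its position within the key.
    {
      key = $1
      if (!(key in seen)) order[++nkeys] = key
      i = seen[key]++                  # 0-based row index within this key
      size = int(total[key] / 3)       # floor(n/3)
      if (i < size)          bin = 1
      else if (i < 2 * size) bin = 2
      else                   bin = 3   # last floor(n/3) + n%3 rows
      sum[key, bin] += $3
    }
    END {
      for (k = 1; k <= nkeys; k++)
        for (b = 1; b <= 3; b++)
          printf "%s bin%d %d\n", order[k], b, sum[order[k], b]
    }
  ' "$1" "$1"
}
```

With fewer than 3 rows for a key, floor(n/3) is 0, so all of that key's rows land in bin3.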
