Map Reduce, does reducer automatically sorts?
There is something I don't fully understand about the overall functioning of a MapReduce programming environment.
Suppose 1k random, unsorted words in the form (word, 1) come out of one (or more) mappers, and I want the reducer to save them all into a single, huge, sorted file. How does that work? Does the reducer itself sort all the words automatically? What should the reducer function do? And what if I have just one reducer with limited RAM and disk?
hadoop mapreduce reduce
asked Nov 8 at 18:46
rollotommasi
3618
1 Answer
By the time the reducer receives the data, it has already been sorted on the map side.
The process works like this:
Map side:
1. Each InputSplit is processed by a map task, and the map output is temporarily placed in a circular in-memory buffer (the shuffle buffer; 100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default, at 80% of its capacity), a spill file is created on the local file system.
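As a rough illustration (not Hadoop source code, and the variable names here are made up), the buffer-then-spill behavior in step 1 looks something like this:

```python
# Simulate a map task buffering output and spilling a sorted run to "disk"
# whenever the buffer passes a threshold (Hadoop's default is 80% of io.sort.mb).
buffer_limit = 10          # stand-in for the io.sort.mb buffer capacity
spill_threshold = 0.8      # spill when the buffer is 80% full

buffer, spills = [], []
for i in range(25):
    buffer.append(("word%02d" % i, 1))
    if len(buffer) >= spill_threshold * buffer_limit:
        spills.append(sorted(buffer))  # each spill file is written sorted by key
        buffer = []
if buffer:                             # flush whatever remains at end of input
    spills.append(sorted(buffer))
# spills now holds several sorted runs that will later be merged into one file
```

The key point is that no single spill ever has to hold all the output; each one is a small, independently sorted run.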
2. Before writing to disk, the thread first divides the data into partitions, one per reduce task, so that each reduce task receives the data of exactly one partition. This avoids some reduce tasks being assigned huge amounts of data while others get none. The data within each partition is sorted, and if a Combiner is set, it runs on the sorted result.
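A minimal sketch of step 2 (the helper name `partition_and_sort` is invented; Hadoop's default HashPartitioner does essentially `hash(key) % numReduceTasks`):

```python
def partition_and_sort(records, num_reducers):
    """Assign each (key, value) pair to a partition by key hash,
    then sort each partition by key, as the map side does before spilling."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        # all records with the same key land in the same partition,
        # so one reduce task sees every occurrence of that key
        partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

records = [("banana", 1), ("apple", 1), ("cherry", 1), ("apple", 1)]
parts = partition_and_sort(records, num_reducers=2)
```

Because partitioning is by key hash, both `("apple", 1)` records are guaranteed to end up in the same partition, which is what makes per-key reduction possible.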
3. By the time the map task emits its last record, there may be many spill files, and these need to be merged. Sorting and combining are performed repeatedly during the merge, for two reasons: (1) to minimize the amount of data written to disk each time, and (2) to minimize the amount of data transferred over the network during the subsequent copy phase. The result is a single partitioned, sorted file. To reduce network traffic further, you can compress the map output here by setting mapred.compress.map.out to true.
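The merge in step 3 is a k-way merge of already-sorted runs; Python's `heapq.merge` can stand in for it here (again, purely illustrative, not Hadoop's implementation):

```python
import heapq

# Three sorted spill files produced by the same map task
spill_1 = [("apple", 1), ("cherry", 1)]
spill_2 = [("banana", 1), ("banana", 1)]
spill_3 = [("apple", 1), ("date", 1)]

# k-way merge: streams through the runs without loading everything at once,
# which is why merging many sorted spills is cheap compared to a full re-sort
merged = list(heapq.merge(spill_1, spill_2, spill_3))
# merged is one globally sorted run: apple, apple, banana, banana, cherry, date
```

Because each input run is sorted, the merge only ever compares the heads of the runs, so huge spill sets can be combined with very little memory.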
4. The data from each partition is copied to the corresponding reduce task.
Reduce side:
1. The reducer receives data from different map tasks, and the data arriving from each map is already sorted. If the amount of data received is small enough, it is kept in memory; once it exceeds a certain fraction of the buffer size, it is merged and written to disk.
2. As the number of spill files grows, a background thread merges them into larger, still-sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging.
3. The merge process generates many intermediate files on disk, but MapReduce keeps the amount written to disk as small as possible, and the result of the final merge is not written to disk at all: it is fed directly into the reduce function.
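This is why the reducer sees keys in sorted order and never has to sort them itself: because the merged input is sorted by key, the framework can group values per key in a single linear pass and call reduce once per key. A sketch of that grouping, using `itertools.groupby` as a stand-in (the reducer function here is a hypothetical word count):

```python
from itertools import groupby
from operator import itemgetter

# The merged, already-sorted stream arriving at a single reduce task
merged_input = [("apple", 1), ("apple", 1), ("banana", 1), ("cherry", 1)]

def reduce_word_count(key, values):
    # the user-written reduce: sum the counts for one key
    return key, sum(values)

# groupby only works because the input is sorted by key; one linear pass
# yields each key exactly once, with all of its values together
results = [reduce_word_count(k, (v for _, v in grp))
           for k, grp in groupby(merged_input, key=itemgetter(0))]
# results: [("apple", 2), ("banana", 1), ("cherry", 1)]
```

Since the keys come out in sorted order, a single reducer that simply writes its output as it goes produces exactly the single sorted file the question asks about, and the external merge described above is what keeps that workable even with limited RAM and disk.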
answered Nov 10 at 7:03
HbnKing
6021315