MapReduce: does the reducer sort automatically?
There is something I do not fully understand about the overall workings of a MapReduce programming environment.

Suppose 1k random, unsorted words in the form (word, 1) come out of one (or more than one) mapper, and I want the reducer to save them all into a single, large sorted file. How does that work? Does the reducer itself sort all the words automatically? What should the reducer function do? And what if I have just one reducer with limited RAM and disk?
hadoop mapreduce reduce
asked Nov 8 at 18:46 by rollotommasi
1 Answer
[diagram of the MapReduce shuffle and sort process]



By the time the reducer gets the data, it has already been sorted on the map side.
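As a concrete illustration for the case in the question (unsorted (word, 1) pairs, one reducer, one sorted output file), here is a minimal sketch using Hadoop's standard org.apache.hadoop.mapreduce API; class names and paths are illustrative, not taken from the question. Note that the reducer body contains no sorting code at all: with a single reduce task, the shuffle delivers every key to it already sorted, so the lone output file (part-r-00000) comes out globally sorted.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortedWordCount {

  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // emit (word, 1), in no particular order
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Keys arrive here in sorted order; the reducer does not sort anything itself.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "sorted word count");
    job.setJarByClass(SortedWordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(1);          // one reducer -> one globally sorted output file
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the shuffle's sort is an external merge sort that spills to local disk, this works even when the data does not fit in the reducer's RAM; the practical limit is local disk space rather than memory.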



In more detail, the process works like this:

Map side:

1. Each InputSplit is processed by one map task, and the map output is temporarily placed in a circular in-memory buffer used by the shuffle (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of its size), a spill file is created on the local file system.

2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that each reduce task corresponds to the data of exactly one partition; this avoids some reduce tasks being assigned huge amounts of data while others receive little or none. Within each partition the data is sorted. If a Combiner has been set, it is then applied to the sorted result.

3. When the map task has written its last record, there may be many spill files, and these need to be merged. Sorting and combining are performed repeatedly during the merge, for two purposes: to minimize the amount of data written to disk on each pass, and to minimize the amount of data transferred over the network during the next (copy) phase. The spills are finally merged into a single partitioned, sorted file. To reduce the amount of data sent over the network, you can also compress the map output here by setting mapred.compress.map.output to true (see the configuration sketch after this list).

4. The data of each partition is copied to its corresponding reduce task.
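The buffer size, spill threshold, and map-output compression mentioned in steps 1 and 3 are ordinary job configuration settings. The sketch below uses the legacy property names cited above; newer Hadoop releases expose the same knobs under renamed keys (mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, mapreduce.map.output.compress), so treat the exact names as version-dependent.

```java
import org.apache.hadoop.conf.Configuration;

// Map-side tuning sketch (legacy property names; see the note above about newer names).
public class MapSideTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 100);                       // in-memory sort buffer, in MB (default 100)
    conf.setFloat("io.sort.spill.percent", 0.80f);        // spill to disk when the buffer is 80% full
    conf.setBoolean("mapred.compress.map.output", true);  // compress spilled/shuffled map output
    return conf;
  }
}
```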

Reduce side:

1. The reducer receives data from the different map tasks, and the data coming from each map is already ordered. If the amount of data received on the reduce side is small enough, it is kept in memory; once it exceeds a certain proportion of the buffer size, it is merged and written to disk.

2. As the number of spill files grows, a background thread merges them into larger, still-sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sort-and-merge operations.

3. The merge process generates many intermediate files on disk, but MapReduce keeps the amount written to disk as small as possible, and the result of the final merge is not written to disk at all: it is fed directly into the reduce function. (The snippet after this list illustrates why such a merge needs very little memory.)
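To see why a single reducer with limited RAM can still produce one globally sorted file, remember that the merge described above is just a k-way merge of runs that are already sorted: at any moment only one record per run needs to be in memory. The standalone snippet below (plain Java, not Hadoop code; the words are made up for illustration) demonstrates the principle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeDemo {

  // Merge several individually sorted runs into one sorted list.
  static List<String> merge(List<List<String>> sortedRuns) {
    // Each heap entry is {runIndex, positionInRun}; the heap is ordered by the word it points at.
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));
    for (int r = 0; r < sortedRuns.size(); r++) {
      if (!sortedRuns.get(r).isEmpty()) {
        heap.add(new int[] {r, 0});
      }
    }
    List<String> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] e = heap.poll();
      List<String> run = sortedRuns.get(e[0]);
      out.add(run.get(e[1]));                     // smallest remaining word across all runs
      if (e[1] + 1 < run.size()) {
        heap.add(new int[] {e[0], e[1] + 1});     // advance within that run
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> runs = Arrays.asList(
        Arrays.asList("apple", "mango", "zebra"),     // e.g. sorted spill from map task 1
        Arrays.asList("banana", "cherry", "pear"));   // e.g. sorted spill from map task 2
    System.out.println(merge(runs));                  // [apple, banana, cherry, mango, pear, zebra]
  }
}
```

Hadoop performs the same kind of merge over spill files on disk, reading each run sequentially, which is why disk space rather than memory is the limiting factor.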
answered Nov 10 at 7:03 by HbnKing