Spark Group By Key to (String, Iterable)





I am trying to group URL data by key, where the values would be strings.

Sample data:

url_3 url_2
url_3 url_2
url_3 url_1
url_4 url_3
url_4 url_1

Expected result:

(url_3,(url_2,url_1))
(url_4,(url_3,url_1))

1) Load the URL data:

Dataset<String> lines = spark.read()
    .textFile("C:/Users/91984/workspace/myApp/src/test/resources/in/urldata.txt");

2) Split each line on the space separator:

Encoder<Tuple2<String, String>> encoder2 =
    Encoders.tuple(Encoders.STRING(), Encoders.STRING());
Dataset<Tuple2<String, String>> tupleRDD = lines.map(f -> {
    String[] parts = f.split(" ");
    return new Tuple2<>(parts[0], parts[1]);
}, encoder2);

3) Use groupByKey to group the tupleRDD dataset on the key:

KeyValueGroupedDataset<String, Tuple2<String, String>> keygrpDS =
    tupleRDD.groupByKey(f -> f._1, Encoders.STRING());

Can someone explain why groupByKey at step 3 returns KeyValueGroupedDataset<String, Tuple2<String, String>> instead of KeyValueGroupedDataset<String, Iterable<String>>, and what change is needed to get the expected result?
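Setting Spark aside for a moment, the grouping being asked for can be sketched in plain Java (no Spark dependency; the class and method names below are illustrative only). A LinkedHashSet both de-duplicates and preserves encounter order, which is what the expected result implies: url_2 appears twice in the input but once in the output.

```java
import java.util.*;

// Plain-Java sketch (no Spark) of the grouping described above.
// `UrlGrouping` and `groupUrls` are illustrative names, not Spark API.
public class UrlGrouping {

    // Groups "key value" lines by key, de-duplicating values in
    // encounter order via LinkedHashSet.
    static Map<String, Set<String>> groupUrls(List<String> lines) {
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>())
                   .add(parts[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "url_3 url_2", "url_3 url_2", "url_3 url_1",
            "url_4 url_3", "url_4 url_1");
        groupUrls(lines).forEach((k, v) ->
            System.out.println("(" + k + ",(" + String.join(",", v) + "))"));
        // prints:
        // (url_3,(url_2,url_1))
        // (url_4,(url_3,url_1))
    }
}
```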










  • Please accept the answer as correct, or state any issues you are still facing. – void, Nov 29 '18 at 10:10


















java apache-spark apache-spark-sql






edited Nov 24 '18 at 18:28 by Oli
asked Nov 24 '18 at 17:31 by Naresh

2 Answers






That's the way it works with Datasets in Spark. When you have a Dataset<T>, you can group it by a mapping function that takes an object of type T and returns an object of type K (the key). What you get is a KeyValueGroupedDataset<K, T>, on which you can call an aggregation function (see the Javadoc). In your case, you could use mapGroups, to which you provide a function that maps a key K and an Iterator<T> over the grouped values to a new object R of your choosing. In your code, T is Tuple2<String, String> and K is the URL key.
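To see the shape of this API without Spark on the classpath, here is a plain-Java analogue (`mapGroupsLike` is an illustrative name, not a Spark method; Map.Entry stands in for Tuple2). The point it demonstrates: the groups keep their original element type T, and the iterator over the grouped values only appears inside the callback, exactly as with mapGroups.

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.stream.*;

// Illustrative analogue of KeyValueGroupedDataset.mapGroups (not Spark API):
// the grouped values keep their original element type, and the
// (key, iterator) -> R function is where the iteration finally shows up.
public class MapGroupsSketch {

    // T = Map.Entry<String, String> stands in for Tuple2<String, String>.
    static <R> List<R> mapGroupsLike(
            List<Map.Entry<String, String>> data,
            BiFunction<String, Iterator<Map.Entry<String, String>>, R> fn) {
        Map<String, List<Map.Entry<String, String>>> groups =
            data.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, LinkedHashMap::new, Collectors.toList()));
        return groups.entrySet().stream()
            .map(e -> fn.apply(e.getKey(), e.getValue().iterator()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> tuples = Arrays.asList(
            new AbstractMap.SimpleEntry<>("url_3", "url_2"),
            new AbstractMap.SimpleEntry<>("url_3", "url_1"),
            new AbstractMap.SimpleEntry<>("url_4", "url_1"));
        List<String> out = mapGroupsLike(tuples, (key, it) -> {
            List<String> vals = new ArrayList<>();
            it.forEachRemaining(t -> vals.add(t.getValue()));
            return "(" + key + ",(" + String.join(",", vals) + "))";
        });
        System.out.println(out);
        // prints: [(url_3,(url_2,url_1)), (url_4,(url_1))]
    }
}
```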






– answered Nov 24 '18 at 18:25 by Oli




Spark requires you to follow a groupBy with an aggregation method. I would keep tupleRDD as a DataFrame like:

column1 column2

url_3 url_2
url_3 url_2
url_3 url_1
url_4 url_3
url_4 url_1

and aggregate with collect_list(column2):

df.groupBy('column1').agg(collect_list('column2'))

This example is in Python; the Scala/Java APIs are similar.

– answered Nov 24 '18 at 18:29 by void
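One caveat worth noting about the collect_list approach: collect_list keeps duplicates, so with the sample data url_2 would appear twice for url_3, while Spark's collect_set matches the de-duplicated expected output. The distinction can be sketched in plain Java without Spark (class and method names here are illustrative only):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the collect_list vs collect_set distinction (no Spark):
// collecting to a List keeps the duplicate "url_2", a LinkedHashSet
// collapses it while preserving encounter order.
public class CollectListVsSet {

    static List<String> collectListLike(List<String[]> rows, String key) {
        return rows.stream().filter(r -> r[0].equals(key))
                   .map(r -> r[1]).collect(Collectors.toList());
    }

    static Set<String> collectSetLike(List<String[]> rows, String key) {
        return rows.stream().filter(r -> r[0].equals(key))
                   .map(r -> r[1])
                   .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"url_3", "url_2"}, new String[]{"url_3", "url_2"},
            new String[]{"url_3", "url_1"});
        System.out.println(collectListLike(rows, "url_3")); // [url_2, url_2, url_1]
        System.out.println(collectSetLike(rows, "url_3"));  // [url_2, url_1]
    }
}
```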





























