Spark Group By Key to (String, Iterable<String>)
I am trying to group url data by key, where the values would be strings.
Sample data:
url_3 url_2
url_3 url_2
url_3 url_1
url_4 url_3
url_4 url_1
Expected result:
(url_3,(url_2,url_1))
(url_4,(url_3,url_1))
1) Load the urldata:
Dataset<String> lines = spark.read()
.textFile("C:/Users/91984/workspace/myApp/src/test/resources/in/urldata.txt");
2) Split each line of the dataset on the space:
Encoder<Tuple2<String, String>> encoder2 =
    Encoders.tuple(Encoders.STRING(), Encoders.STRING());
Dataset<Tuple2<String, String>> tupleRDD = lines.map(f -> {
    String[] parts = f.split(" ");          // split once instead of twice
    return new Tuple2<>(parts[0], parts[1]);
}, encoder2);
3) Use groupByKey to group the tupleRDD dataset on the key:
KeyValueGroupedDataset<String, Tuple2<String, String>> keygrpDS =
    tupleRDD.groupByKey(f -> f._1, Encoders.STRING());
Can someone explain why groupByKey at step 3 returns KeyValueGroupedDataset<String, Tuple2<String, String>> instead of KeyValueGroupedDataset<String, Iterable<String>>, and what change is needed to get the expected result?
java apache-spark apache-spark-sql
asked Nov 24 '18 at 17:31 by Naresh, edited Nov 24 '18 at 18:28 by Oli
please accept the answer as correct, or describe any issues you are still facing. – void, Nov 29 '18 at 10:10
2 Answers
That's the way it works with datasets in Spark. When you have a Dataset<T>, you group it with a mapping function that takes an object of type T and returns an object of type K (the key). What you get is a KeyValueGroupedDataset<K, T>, on which you can then call an aggregation function (see the javadoc). In your case, you could use mapGroups, to which you provide a function that maps a key K and an Iterator<T> over the group's values to a new object R of your choosing. If it helps: in your code, T is Tuple2<String, String> and K is the URL string.
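A minimal sketch of that mapGroups approach, building on keygrpDS from the question. Since the expected output lists each URL only once, this sketch dedupes with a LinkedHashSet; that dedup is an assumption on my part, not something the question states:

import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;
import java.util.LinkedHashSet;
import java.util.Set;

// Collapse each group to one formatted line: "(key,(v1,v2,...))".
// In the Java API the group's values arrive as an Iterator, not an Iterable.
Dataset<String> grouped = keygrpDS.mapGroups(
    (MapGroupsFunction<String, Tuple2<String, String>, String>) (key, values) -> {
        Set<String> urls = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        while (values.hasNext()) {
            urls.add(values.next()._2);           // keep only the value half of each tuple
        }
        return "(" + key + ",(" + String.join(",", urls) + "))";
    },
    Encoders.STRING());

grouped.show(false); // e.g. (url_3,(url_2,url_1))

The cast to MapGroupsFunction is needed so the Java compiler picks the Java-friendly overload rather than the Scala one.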
answered Nov 24 '18 at 18:25 by Oli
Spark requires you to follow your groupBy with an aggregation method. I would have tupleRDD as a DataFrame like:

column1 column2
url_3   url_2
url_3   url_2
url_3   url_1
url_4   url_3
url_4   url_1

and then aggregate with collect_list(column2), like:

df.groupBy('column1').agg(collect_list('column2'))

This example is in Python; the Scala/Java APIs should be similar, though.
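A rough Java translation of the same idea, assuming the tuple dataset is first renamed to the column names used above (column1/column2 and the urls alias are illustrative choices, not names from the question):

import static org.apache.spark.sql.functions.collect_list;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Turn the Dataset<Tuple2<String, String>> into a DataFrame with named columns,
// then collect all values per key into an array column.
Dataset<Row> df = tupleRDD.toDF("column1", "column2");
Dataset<Row> result = df.groupBy("column1")
        .agg(collect_list("column2").as("urls"));
result.show(false); // url_3 -> [url_2, url_2, url_1], url_4 -> [url_3, url_1]

Note that collect_list keeps duplicates; collect_set would be closer to the deduplicated expected output.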
answered Nov 24 '18 at 18:29 by void