Spark Group By Key to (String, Iterable)





I am trying to group URL data by key, where the values would be strings.

Sample data:

url_3 url_2
url_3 url_2
url_3 url_1
url_4 url_3
url_4 url_1

Expected result:

(url_3,(url_2,url_1))
(url_4,(url_3,url_1))

1) Load the URL data:

Dataset<String> lines = spark.read()
    .textFile("C:/Users/91984/workspace/myApp/src/test/resources/in/urldata.txt");

2) Split each line on the space separator:

Encoder<Tuple2<String, String>> encoder2 =
    Encoders.tuple(Encoders.STRING(), Encoders.STRING());
Dataset<Tuple2<String, String>> tupleRDD = lines.map(f -> {
    String[] parts = f.split(" ");
    return new Tuple2<>(parts[0], parts[1]);
}, encoder2);

3) Use groupByKey to group the tupleRDD dataset on the key:

KeyValueGroupedDataset<String, Tuple2<String, String>> keygrpDS =
    tupleRDD.groupByKey(f -> f._1, Encoders.STRING());

Can someone explain why groupByKey at step 3 returns KeyValueGroupedDataset<String, Tuple2<String, String>> instead of KeyValueGroupedDataset<String, Iterable<String>>, and what change is needed to get the expected result?
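Setting Spark aside for a moment, the grouping being asked for can be sketched in plain Java (no Spark dependency; the class and method names below are illustrative only). A LinkedHashSet both de-duplicates and preserves encounter order, which is what the expected result implies: url_2 appears twice in the input but once in the output.

```java
import java.util.*;

// Plain-Java sketch (no Spark) of the grouping described above.
// `UrlGrouping` and `groupUrls` are illustrative names, not Spark API.
public class UrlGrouping {

    // Groups "key value" lines by key, de-duplicating values in
    // encounter order via LinkedHashSet.
    static Map<String, Set<String>> groupUrls(List<String> lines) {
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>())
                   .add(parts[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "url_3 url_2", "url_3 url_2", "url_3 url_1",
            "url_4 url_3", "url_4 url_1");
        groupUrls(lines).forEach((k, v) ->
            System.out.println("(" + k + ",(" + String.join(",", v) + "))"));
        // prints:
        // (url_3,(url_2,url_1))
        // (url_4,(url_3,url_1))
    }
}
```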










  • Please accept the answer as correct, or state any issues you are still facing. – void, Nov 29 '18 at 10:10


















java apache-spark apache-spark-sql






edited Nov 24 '18 at 18:28 by Oli
asked Nov 24 '18 at 17:31 by Naresh

2 Answers






That's the way it works with Datasets in Spark. When you have a Dataset<T>, you can group it by a mapping function that takes an object of type T and returns an object of type K (the key). What you get is a KeyValueGroupedDataset<K, T>, on which you can call an aggregation function (see the Javadoc). In your case, you could use mapGroups, to which you provide a function that maps a key K and an Iterator<T> over the grouped values to a new object R of your choosing. In your code, T is Tuple2<String, String> and K is the URL key.
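To see the shape of this API without Spark on the classpath, here is a plain-Java analogue (`mapGroupsLike` is an illustrative name, not a Spark method; Map.Entry stands in for Tuple2). The point it demonstrates: the groups keep their original element type T, and the iterator over the grouped values only appears inside the callback, exactly as with mapGroups.

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.stream.*;

// Illustrative analogue of KeyValueGroupedDataset.mapGroups (not Spark API):
// the grouped values keep their original element type, and the
// (key, iterator) -> R function is where the iteration finally shows up.
public class MapGroupsSketch {

    // T = Map.Entry<String, String> stands in for Tuple2<String, String>.
    static <R> List<R> mapGroupsLike(
            List<Map.Entry<String, String>> data,
            BiFunction<String, Iterator<Map.Entry<String, String>>, R> fn) {
        Map<String, List<Map.Entry<String, String>>> groups =
            data.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, LinkedHashMap::new, Collectors.toList()));
        return groups.entrySet().stream()
            .map(e -> fn.apply(e.getKey(), e.getValue().iterator()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> tuples = Arrays.asList(
            new AbstractMap.SimpleEntry<>("url_3", "url_2"),
            new AbstractMap.SimpleEntry<>("url_3", "url_1"),
            new AbstractMap.SimpleEntry<>("url_4", "url_1"));
        List<String> out = mapGroupsLike(tuples, (key, it) -> {
            List<String> vals = new ArrayList<>();
            it.forEachRemaining(t -> vals.add(t.getValue()));
            return "(" + key + ",(" + String.join(",", vals) + "))";
        });
        System.out.println(out);
        // prints: [(url_3,(url_2,url_1)), (url_4,(url_1))]
    }
}
```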






– answered Nov 24 '18 at 18:25 by Oli




Spark requires you to follow a groupBy with an aggregation method. I would keep tupleRDD as a DataFrame like:

column1 column2

url_3 url_2
url_3 url_2
url_3 url_1
url_4 url_3
url_4 url_1

and aggregate with collect_list(column2):

df.groupBy('column1').agg(collect_list('column2'))

This example is in Python; the Scala/Java APIs are similar.

– answered Nov 24 '18 at 18:29 by void
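One caveat worth noting about the collect_list approach: collect_list keeps duplicates, so with the sample data url_2 would appear twice for url_3, while Spark's collect_set matches the de-duplicated expected output. The distinction can be sketched in plain Java without Spark (class and method names here are illustrative only):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the collect_list vs collect_set distinction (no Spark):
// collecting to a List keeps the duplicate "url_2", a LinkedHashSet
// collapses it while preserving encounter order.
public class CollectListVsSet {

    static List<String> collectListLike(List<String[]> rows, String key) {
        return rows.stream().filter(r -> r[0].equals(key))
                   .map(r -> r[1]).collect(Collectors.toList());
    }

    static Set<String> collectSetLike(List<String[]> rows, String key) {
        return rows.stream().filter(r -> r[0].equals(key))
                   .map(r -> r[1])
                   .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"url_3", "url_2"}, new String[]{"url_3", "url_2"},
            new String[]{"url_3", "url_1"});
        System.out.println(collectListLike(rows, "url_3")); // [url_2, url_2, url_1]
        System.out.println(collectSetLike(rows, "url_3"));  // [url_2, url_1]
    }
}
```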





























