Merge Apache Spark columns from array and struct inside struct array

Here is the schema of the incoming data stream. I'm using Spark 2.3.2 Streaming to process the data.



import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("status", StringType),
  StructField("data", StructType(Seq(
    StructField("resultType", StringType),
    StructField("result", ArrayType(StructType(Array(
      StructField("metric", StructType(Seq(
        StructField("application", StringType),
        StructField("component", StringType),
        StructField("instance", StringType)))),
      StructField("value", ArrayType(StringType))
    ))))
  )))
))
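For context, a hypothetical message that this schema would parse could look like the sketch below. The "status" and "resultType" values are placeholders I have made up (they are not shown in the question); the result entry mirrors the sample output further down.

// Hypothetical example message matching the schema above. The "status" and
// "resultType" values are placeholders (not shown in the question); the
// result entry mirrors the sample output further down.
val sampleJson =
  """{"status": "success",
    | "data": {"resultType": "vector",
    |          "result": [{"metric": {"application": "A", "component": "S", "instance": "tp01.net:9072"},
    |                      "value": ["1.542972576979E9", "237006995456"]}]}}""".stripMargin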


Here is how I've applied the schema to the DStream's RDD.



val df = rdd.toDS()
  .selectExpr("cast(value as string) as myData")
  .select(from_json($"myData", schema).as("myData"))
  .select($"myData.data.*")
  .select("result")
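At this point df should hold a single result column typed as an array of structs, as dictated by the schema above. A quick, optional sanity check (the printSchema output below is approximate):

// Optional sanity check: `df` should carry a single `result` column whose
// type follows from the schema above. printSchema() output, approximately:
// root
//  |-- result: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- metric: struct (nullable = true)
//  |    |    |    |-- application: string (nullable = true)
//  |    |    |    |-- component: string (nullable = true)
//  |    |    |    |-- instance: string (nullable = true)
//  |    |    |-- value: array (nullable = true)
//  |    |    |    |-- element: string (containsNull = true)
df.printSchema()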


The above code yields the following output:



{"result":[{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"A","component":"S","instance":"tp02.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"A","component":"S","instance":"tp03.net:9072"},"value":["1.542972576979E9","237006995456"]},
{"metric":{"application":"B","component":"S","instance":"bp03.net:9072"},"value":["1.542972576979E9","270860144640"]},
{"metric":{"application":"B","component":"S","instance":"bp04.net:9072"},"value":["1.542972576979E9","270860144640"]},
{"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]},
]}


But in order to extract features, I need to convert the above into the following DataFrame:



application  component  instance       value1            value2
A            S          tp01.net:9072  1.542972576979E9  237006995456
A            S          tp02.net:9072  1.542972576979E9  237006995456
A            S          tp03.net:9072  1.542972576979E9  237006995456
B            S          bp03.net:9072  1.542972576979E9  270860144640
B            S          bp04.net:9072  1.542972576979E9  270860144640
B            S          ps01.net:9072  1.542972576979E9  135177400320


As you can see, each row is effectively already an exploded row. Any ideas on how to select the array values and the struct fields into a single DataFrame?

Thanks

scala apache-spark apache-spark-sql






asked Nov 23 '18 at 13:20 by user1384205

  • Try using df.select(explode($"result").as("flat")).select($"flat.metric.*", $"flat.value".getItem(0).as("value1"), $"flat.value".getItem(1).as("value2")).show()
    – vindev, Nov 23 '18 at 13:46

  • Works perfectly. Thank you!
    – user1384205, Nov 23 '18 at 15:16
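
For readability, the approach from vindev's comment can also be written long-hand. This is just a sketch, assuming the df built in the question (a single result column) and the standard functions import:

import org.apache.spark.sql.functions.explode

// Sketch of the approach from the comment above: explode the result array
// into one row per element, then pull the metric struct fields and the two
// array entries into top-level columns.
val flattened = df
  .select(explode($"result").as("flat"))
  .select(
    $"flat.metric.*",                        // application, component, instance
    $"flat.value".getItem(0).as("value1"),   // first element of the value array
    $"flat.value".getItem(1).as("value2")    // second element of the value array
  )

flattened.show(false)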













