Spark SQL - createDataFrame wrong struct schema
When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:
    from pyspark.sql import Row
    some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
                 {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
    spark.createDataFrame([Row(**d) for d in some_data]).printSchema()
The resulting DataFrame's schema is:
    root
     |-- some-column: array (nullable = true)
     |    |-- element: map (containsNull = true)
     |    |    |-- key: string
     |    |    |-- value: long (valueContainsNull = true)
This schema is wrong: strVal is of string type, and indeed collecting this DataFrame results in nulls in that column.
I'd expect the schema to be an Array of appropriate Structs, inferred with a bit of Python reflection on the types of the values.
Why is this not the case?
Is there anything I can do besides providing the schema explicitly in this case?
apache-spark dataframe pyspark apache-spark-sql schema
edited Nov 20 '18 at 9:15 by user10465355
asked Nov 19 '18 at 23:22 by user976850
1 Answer
This happens because the structure doesn't encode what you mean. As explained in the SQL guide, a Python dict is mapped to MapType.
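To see why the value type comes out as long, here is a toy model of that inference rule in plain Python. It is a simplified sketch, not PySpark's actual code: the function name `infer_type` and the string type names are made up for illustration, and it assumes that a map's type is taken from the first non-null key/value pair of the dict, which is consistent with the schema shown in the question.

```python
def infer_type(value):
    """Toy stand-in for schema inference; returns a type name as a string."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        # Assumption: only the first non-null key/value pair decides the map type.
        for key, val in value.items():
            if key is not None and val is not None:
                return "map<{},{}>".format(infer_type(key), infer_type(val))
        return "map<null,null>"
    if isinstance(value, list):
        # Assumption: the first non-null element decides the element type.
        for item in value:
            if item is not None:
                return "array<{}>".format(infer_type(item))
        return "array<null>"
    raise TypeError("unsupported type: {}".format(type(value).__name__))


record = {'timestamp': 1353534535353, 'strVal': 'some-string'}
print(infer_type([record]))  # array<map<string,long>>
```

Because 'timestamp' is the first key, the map's value type becomes long, and a string such as 'some-string' has no valid representation in that column.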
To work with structs you should use nested Rows (namedtuples are preferred in general, but require field names that are valid identifiers, which "some-column" is not):
    from pyspark.sql import Row
    Outer = Row("some-column")
    Inner = Row("timestamp", "strVal")
    spark.createDataFrame([
        Outer([Inner(1353534535353, 'some-string')]),
        Outer([Inner(1353534535354, 'another-string')])
    ]).printSchema()
    root
     |-- some-column: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- timestamp: long (nullable = true)
     |    |    |-- strVal: string (nullable = true)
With the structure you have at the moment, the desired schema can be obtained by going through intermediate JSON:
    import json
    # spark.sparkContext is the SparkContext behind the session (often aliased as sc)
    spark.read.json(spark.sparkContext.parallelize(some_data).map(json.dumps)).printSchema()
    root
     |-- some-column: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- strVal: string (nullable = true)
     |    |    |-- timestamp: long (nullable = true)
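The JSON route works because each record is serialized to a JSON object before Spark sees it, and the JSON reader infers nested objects as structs rather than maps. A minimal sketch of just the serialization step, using only the standard library (the variable name `json_lines` is an illustrative choice, not part of any API):

```python
import json

some_data = [
    {'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
    {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]},
]

# One JSON document per record; this is the text that would be handed
# to the JSON reader in place of the original Python dicts.
json_lines = [json.dumps(record) for record in some_data]

for line in json_lines:
    print(line)
```

Each printed line is a complete JSON document, so every key of the inner object is visible to the reader, not only the first one.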
Alternatively, provide an explicit schema:
    from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType
    schema = StructType([StructField(
        "some-column", ArrayType(StructType([
            StructField("timestamp", LongType()),
            StructField("strVal", StringType())])
    ))])
    spark.createDataFrame(some_data, schema)
The last method might not be fully robust, though.
edited Nov 20 '18 at 0:58
answered Nov 20 '18 at 0:48 by user10465355
Thanks, the JSON trick is exactly what I was hoping for. Why isn't this the default behavior? E.g. for spark.createDataFrame(rdd, schema), can't the schema be inferred by doing this very same trick?
– user976850, Nov 20 '18 at 9:05
Schema is inferred according to the linked specification, so it couldn't be without turning the whole process into a pile of special cases.
– user10465355, Nov 20 '18 at 11:24