Create a dataframe from column of dictionaries in pyspark
I want to create a new DataFrame from an existing DataFrame in PySpark. The DataFrame "df" contains a column named "data" whose rows hold dictionaries; the column's schema type is string. The keys are not fixed: in the first row below they are name and address, but other rows may have different keys. For example:
+-------------------------------------------------------+
|data                                                   |
+-------------------------------------------------------+
|{"name": "sam", "address": "uk"}                       |
|{"name": "jack", "address": "aus", "occupation": "job"}|
+-------------------------------------------------------+
How do I convert this into a DataFrame with individual columns, like the following?
+----+-------+----------+
|name|address|occupation|
+----+-------+----------+
| sam|     uk|          |
|jack|    aus|       job|
+----+-------+----------+
python python-2.7 dictionary pyspark pyspark-sql
edited Nov 15 at 3:45
asked Nov 9 at 4:25
amol desai
Possible duplicate of How to convert list of dictionaries into Spark DataFrame
– pault
Nov 9 at 12:34
Or a dupe of Pyspark: explode json in column to multiple columns. It's hard to tell from your question
– pault
Nov 9 at 12:44
@pault It's not a duplicate of either of those links; I looked at both before asking. I believe the question is clear. The DataFrame "df" has a column named "data" whose rows contain dictionaries. It's not a list of dictionaries.
– amol desai
Nov 11 at 5:20
Your question is still unclear. You can't have "rows of dictionaries" in a pyspark DataFrame. Is df a pandas DataFrame? Or is the data column actually of type StringType() or MapType()? Edit your question with the output of df.select('data').printSchema(). Better yet, provide a reproducible example. Maybe you're looking for this answer.
– pault
Nov 13 at 15:55
2 Answers
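Convert data to an RDD, then use spark.read.json to turn the RDD into a DataFrame with an inferred schema.
import json
from pyspark.sql import SparkSession
data = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"},
]
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Serialize each dict to a JSON string, since spark.read.json reads an RDD of strings
json_rdd = sc.parallelize(data).map(json.dumps)
# Keys missing from a record come back as nulls; replace them with empty strings
df = spark.read.json(json_rdd).na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|     uk| sam|          |
|    aus|jack|       job|
+-------+----+----------+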
answered Nov 9 at 4:48
coldspeed
I have tried this method, but it gives py4j.Py4JException: Method __getnewargs__() does not exist. Here data is the name of a column in the DataFrame df.
– amol desai
Nov 9 at 4:56
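The error in the comment above is consistent with a Column object being serialized where plain Python data was expected. If the data column holds JSON text (StringType), one option is to pass the column's strings to spark.read.json, which infers a schema covering the keys it sees. The following is a minimal sketch under that assumption, not a tested answer to the asker's exact setup: the function name expand_json_column is made up for illustration, and spark and df stand for an existing SparkSession and the asker's DataFrame.

```python
def expand_json_column(spark, df, column="data"):
    """Return a DataFrame with one column per key found in a JSON string column.

    Assumes df[column] is StringType and every value is a single JSON object.
    """
    # Pull the raw JSON strings out of the column as an RDD of str
    json_strings = df.rdd.map(lambda row: row[column])
    # spark.read.json accepts an RDD of JSON strings and infers the schema
    return spark.read.json(json_strings)


# Usage sketch (needs a live SparkSession `spark` and the DataFrame `df`):
# wide = expand_json_column(spark, df).na.fill("")
# wide.show()
```

Note that this keeps only the parsed keys; any other columns of df are not carried over into the result.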
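If the order of rows is not important, this is another way you can do it:
from pyspark.sql import SparkSession
# toDF() is only available on RDDs once a SparkSession exists
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Put the record with every key first; toDF() appears to infer the columns from the first record
df = sc.parallelize([
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]).toDF()
df = df.na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|    aus|jack|       job|
|     uk| sam|          |
+-------+----+----------+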
answered Nov 9 at 8:24
Ali AzG
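The row-order caveat in this answer can likely be sidestepped by normalizing the dictionaries before they reach Spark, so that every record carries every key. Below is a plain-Python sketch of that step; the helper name normalize_records is hypothetical, and it assumes the dictionaries are available as a local Python list (as in both answers), not as a column of an existing DataFrame.

```python
def normalize_records(records, fill=""):
    """Give every dict the full union of keys, filling the gaps with `fill`."""
    # Collect every key that appears in any record, in sorted order
    all_keys = sorted(set().union(*(record.keys() for record in records)))
    # Rebuild each record so that it contains all keys
    return [{key: record.get(key, fill) for key in all_keys} for record in records]


records = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"},
]
normalized = normalize_records(records)
# normalized[0] == {"address": "uk", "name": "sam", "occupation": ""}
```

With every record complete, the list can be passed to sc.parallelize(...).toDF() in any order, and the separate na.fill('') step should no longer be needed.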