Create a dataframe from column of dictionaries in pyspark











I want to create a new dataframe from an existing dataframe in pyspark. The dataframe df contains a column named data whose rows hold dictionaries, and the column's schema is string. The keys of each dictionary are not fixed: for example, name and address are the keys in the first row, but other rows may have different keys. Here is an example:



------------------------------------------------------
data
------------------------------------------------------
{"name": "sam", "address": "uk"}
{"name": "jack", "address": "aus", "occupation": "job"}
------------------------------------------------------


How do I convert this into a dataframe with individual columns, like the following?



name    address    occupation
sam     uk
jack    aus        job









  • Possible duplicate of How to convert list of dictionaries into Spark DataFrame
    – pault
    Nov 9 at 12:34










  • Or a dupe of Pyspark: explode json in column to multiple columns. It's hard to tell from your question
    – pault
    Nov 9 at 12:44










  • @pault It is not a duplicate of either of those links; I referred to them before asking. The dataframe df has a column named data which contains rows of dictionaries, not a list of dictionaries.
    – amol desai
    Nov 11 at 5:20












  • Your question is still unclear. You can't have "rows of dictionaries" in a pyspark DataFrame. Is df a pandas DataFrame? Or is the data column actually of type StringType() or MapType()? Edit your question with the output of df.select('data').printSchema(). Better yet, provide a reproducible example. Maybe you're looking for this answer.
    – pault
    Nov 13 at 15:55

















Tags: python python-2.7 dictionary pyspark pyspark-sql






asked Nov 9 at 4:25 by amol desai (edited Nov 15 at 3:45)
2 Answers







Convert data to an RDD of JSON strings, then use spark.read.json to turn the RDD into a DataFrame; the schema is inferred as the union of all keys.



from pyspark.sql import SparkSession
import json

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"}
]

# Serialize each dict to a JSON string so the reader can infer the schema.
df = spark.read.json(sc.parallelize(data).map(json.dumps)).na.fill('')
df.show()

+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|     uk| sam|          |
|    aus|jack|       job|
+-------+----+----------+





answered Nov 9 at 4:48 – coldspeed
  • I have tried this method; it gives py4j.Py4JException: Method __getnewargs__() does not exist. data is a column of the dataframe df.
    – amol desai
    Nov 9 at 4:56



















If the order of rows is not important, here is another way to do this:



from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # toDF() needs an active SparkSession, not a bare SparkContext

df = sc.parallelize([
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]).toDF()

df = df.na.fill('')

df.show()

+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|    aus|jack|       job|
|     uk| sam|          |
+-------+----+----------+





answered Nov 9 at 8:24 – Ali AzG