Pyspark converting dataframe to rdd and split
I have a DataFrame that I converted to an RDD, but when I apply the split function I get an error message.
Here is my DataFrame:
df = spark.createDataFrame([(1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'),
                            (2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'),
                            (3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'),
                            (4,  '2013-07-25 12:23:00.0', 50,  'COMPLETE'),
                            (5,  '2013-07-25 12:23:00.0', 50,  'CLOSED'),
                            (6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'),
                            (7,  '2013-07-26 6:23:00.0',  10,  'PENDING PAYMENT'),
                            (8,  '2013-07-26 03:30:00.0', 5,   'PENDING PAYMENT'),
                            (9,  '2013-07-26 2:23:00.0',  20,  'COMPLETE'),
                            (10, '2013-07-26 1:23:00.0',  30,  'CLOSED')],
                           ['Id', 'Date', 'Total', 'Transaction'])
I collected it as a list of lists and parallelized it back into an RDD:
rdd = df.rdd.map(list).collect()
rdd_df = sc.parallelize(rdd)
Then I apply
rdd_df.map(lambda z: z.split(","))
which raises:
AttributeError: 'list' object has no attribute 'split'
But rdd_df is not a list; let's check:
type(rdd_df)
pyspark.rdd.RDD
What might be the problem? I would like to map over the records and aggregate on column 3 (Transaction). The desired output would look like:
(PENDING PAYMENT,4), (COMPLETE,2), (CLOSED,4)
Thank you.
Tags: list, dataframe, split, rdd
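For reference, the error is reproducible without Spark at all: `df.rdd.map(list)` yields one Python list per record, and `list` has no `split` method (`split` belongs to `str`). The rows are already split into fields, so they can be indexed directly. A minimal sketch of both points, using one of the rows above:

```python
# What df.rdd.map(list) produces for a single record: a plain list of fields.
row = [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']

# Same call the lambda makes -- fails, because split() is a string method.
try:
    row.split(',')
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'split'

# The fields are already separated, so index the status column instead:
print(row[3])  # CLOSED
```

In other words, `split(",")` is only needed when each record is a single comma-separated string (e.g. lines read from a CSV file with `sc.textFile`), not when it is already a list or `Row`.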
What exactly are you trying to accomplish? – Hoenie, Nov 7 at 9:58
I am sorry; I have added what the desired output should look like. – melik, Nov 8 at 9:24
Can you try rdd_df.map(lambda x: x[3])? – Hoenie, Nov 8 at 10:53
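Building on the `x[3]` suggestion in the comments, the desired per-status counts can be produced by keying on column 3. The counting logic is sketched below in plain Python with `collections.Counter` on the same list-of-lists data that `df.rdd.map(list).collect()` returns; the equivalent (untested here) PySpark calls are shown in the trailing comment:

```python
from collections import Counter

# The rows that df.rdd.map(list).collect() returns; the status is at index 3.
rows = [
    [1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'],
    [2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'],
    [3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'],
    [4,  '2013-07-25 12:23:00.0', 50,  'COMPLETE'],
    [5,  '2013-07-25 12:23:00.0', 50,  'CLOSED'],
    [6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'],
    [7,  '2013-07-26 6:23:00.0',  10,  'PENDING PAYMENT'],
    [8,  '2013-07-26 03:30:00.0', 5,   'PENDING PAYMENT'],
    [9,  '2013-07-26 2:23:00.0',  20,  'COMPLETE'],
    [10, '2013-07-26 1:23:00.0',  30,  'CLOSED'],
]

# Count occurrences of each status value.
counts = Counter(r[3] for r in rows)
print(dict(counts))  # {'CLOSED': 4, 'PENDING PAYMENT': 4, 'COMPLETE': 2}

# The PySpark equivalent (no collect/parallelize round-trip needed):
#   df.rdd.map(lambda x: (x[3], 1)).reduceByKey(lambda a, b: a + b).collect()
# or, more simply:
#   df.rdd.map(lambda x: x[3]).countByValue()
```

Note that the `collect()`/`parallelize()` round-trip in the question is unnecessary: `df.rdd` is already an RDD and can be mapped directly.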
asked Nov 7 at 8:29, edited Nov 8 at 13:38 by melik (616)