PySpark: converting a DataFrame to an RDD and split

I have a DataFrame that I converted to an RDD, but when I apply the split function I get an error message.



Here is my DataFrame:



df = spark.createDataFrame([(1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'),
                            (2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'),
                            (3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'),
                            (4,  '2013-07-25 12:23:00.0',  50, 'COMPLETE'),
                            (5,  '2013-07-25 12:23:00.0',  50, 'CLOSED'),
                            (6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'),
                            (7,  '2013-07-26 6:23:00.0',   10, 'PENDING PAYMENT'),
                            (8,  '2013-07-26 03:30:00.0',   5, 'PENDING PAYMENT'),
                            (9,  '2013-07-26 2:23:00.0',   20, 'COMPLETE'),
                            (10, '2013-07-26 1:23:00.0',   30, 'CLOSED')],
                           ['Id', 'Date', 'Total', 'Transaction'])


I converted the rows to lists and built a new RDD from them:



rdd = df.rdd.map(list).collect()
rdd_df = sc.parallelize(rdd)
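
(Side note: df.rdd is already an RDD, so the collect()/parallelize() round trip, which pulls every row to the driver, should not be needed; as far as I can tell, this one-liner gives the same RDD of lists while keeping it distributed:)

rdd_df = df.rdd.map(list)  # same elements, no round trip through the driver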


Then I apply:

rdd_df.map(lambda z: z.split(","))

and, once an action forces evaluation (map itself is lazy), it fails with:

AttributeError: 'list' object has no attribute 'split'



But rdd_df itself is not a list; let's check:



type(rdd_df)
# pyspark.rdd.RDD
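
Inspecting a single element shows what the lambda actually receives: each element of rdd_df is a Python list (one per row), not a comma-separated string, and split is a str method:

rdd_df.first()
# [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']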


What might be the problem? I would like to map on column 3 (the Transaction field) and add up a count for each value. The desired output would look like:



  (PENDING PAYMENT,4),(COMPLETE,2),(CLOSED,4)


Thank you.
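
For reference, a minimal sketch of one way to produce these counts from the RDD, assuming the status sits at index 3 of each list (as the comments below also suggest): key each row by that field and sum the counts with reduceByKey:

status_counts = rdd_df.map(lambda row: (row[3], 1))            # e.g. ('CLOSED', 1)
status_counts = status_counts.reduceByKey(lambda a, b: a + b)  # sum the 1s per status
status_counts.collect()
# [('PENDING PAYMENT', 4), ('COMPLETE', 2), ('CLOSED', 4)]  (order may vary)

The same aggregation is also available without leaving the DataFrame API, via df.groupBy('Transaction').count().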










Tags: list dataframe split rdd






asked Nov 7 at 8:29 by melik, edited Nov 8 at 13:38

  • What exactly are you trying to accomplish? – Hoenie, Nov 7 at 9:58
  • I am sorry, I have added what the desired output will look like. – melik, Nov 8 at 9:24
  • Can you try rdd_df.map(lambda x: x[3])? – Hoenie, Nov 8 at 10:53
















