PySpark: converting a DataFrame to an RDD and split

I have a DataFrame that I converted to an RDD, but when I apply the split function I get an error.
Here is my DataFrame:
df = spark.createDataFrame([(1, '2013-07-25 00:00:00.0',100,'CLOSED'),
(2, '2013-07-25 12:23:00.0',200,'PENDING PAYMENT'),
(3, '2013-07-25 03:30:00.0',400,'PENDING PAYMENT'),
(4, '2013-07-25 12:23:00.0',50,'COMPLETE'),
(5, '2013-07-25 12:23:00.0',50,'CLOSED'),
(6, '2013-07-26 02:00:00.0',300,'CLOSED'),
(7, '2013-07-26 6:23:00.0',10,'PENDING PAYMENT'),
(8, '2013-07-26 03:30:00.0',5,'PENDING PAYMENT'),
(9, '2013-07-26 2:23:00.0',20,'COMPLETE'),
(10,'2013-07-26 1:23:00.0',30,'CLOSED')],
['Id', 'Date', 'Total', 'Transaction'])
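As a quick sanity check (my own aside, not part of the original question), the underlying RDD holds Row objects, not strings:
df.rdd.take(2)
# [Row(Id=1, Date='2013-07-25 00:00:00.0', Total=100, Transaction='CLOSED'),
#  Row(Id=2, Date='2013-07-25 12:23:00.0', Total=200, Transaction='PENDING PAYMENT')]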
I collected it to a list of lists and parallelized it back into an RDD:
rdd = df.rdd.map(list).collect()   # collect() returns a plain Python list on the driver
rdd_df = sc.parallelize(rdd)       # parallelize() turns that list back into an RDD
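As an aside, the collect()/parallelize() round trip pulls every row to the driver and ships it back out; df.rdd.map(list) on its own already yields a distributed RDD of lists:
rdd_df = df.rdd.map(list)   # stays distributed, no driver round trip needed
rdd_df.first()
# [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']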
Then I apply:
rdd_df.map(lambda z: z.split(","))
which fails with:
AttributeError: 'list' object has no attribute 'split'
But rdd_df itself is not a list; checking:
type(rdd_df)
# pyspark.rdd.RDD
What might be the problem? I would like to map over column 3 (Transaction) and count each value. The desired output would look like:
(PENDING PAYMENT, 4), (COMPLETE, 2), (CLOSED, 4)
Thank you.
Tags: list, dataframe, split, rdd
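For reference, a minimal sketch of one way to produce that output: since each element of rdd_df is already a Python list rather than a comma-separated string, there is nothing to split; index the Transaction column and reduce by key.
from operator import add

counts = (rdd_df
    .map(lambda z: (z[3], 1))   # key each row by its Transaction value
    .reduceByKey(add))          # sum the 1s for each key
counts.collect()
# [('PENDING PAYMENT', 4), ('COMPLETE', 2), ('CLOSED', 4)]   # element order may vary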
asked Nov 7 at 8:29 by melik, edited Nov 8 at 13:38
What exactly are you trying to accomplish? – Hoenie, Nov 7 at 9:58
I am sorry, I added what the desired output will look like. – melik, Nov 8 at 9:24
Can you try rdd_df.map(lambda x: x[3])? – Hoenie, Nov 8 at 10:53
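Building on that comment, a short variant that counts the values of column 3 directly; countByValue() returns the tallies to the driver as a dict-like object:
rdd_df.map(lambda x: x[3]).countByValue()
# defaultdict(<class 'int'>, {'CLOSED': 4, 'PENDING PAYMENT': 4, 'COMPLETE': 2})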