Pyspark converting dataframe to rdd and split
I have a DataFrame that I converted to an RDD, but when I apply the split function I get an error message.
Here is my DataFrame:
df = spark.createDataFrame([(1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'),
                            (2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'),
                            (3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'),
                            (4,  '2013-07-25 12:23:00.0', 50,  'COMPLETE'),
                            (5,  '2013-07-25 12:23:00.0', 50,  'CLOSED'),
                            (6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'),
                            (7,  '2013-07-26 6:23:00.0',  10,  'PENDING PAYMENT'),
                            (8,  '2013-07-26 03:30:00.0', 5,   'PENDING PAYMENT'),
                            (9,  '2013-07-26 2:23:00.0',  20,  'COMPLETE'),
                            (10, '2013-07-26 1:23:00.0',  30,  'CLOSED')],
                           ['Id', 'Date', 'Total', 'Transaction'])
I collected it as a list of lists and parallelized it back into an RDD:
rdd = df.rdd.map(list).collect()
rdd_df = sc.parallelize(rdd)
Then I apply
rdd_df.map(lambda z: z.split(","))
which raises:
AttributeError: 'list' object has no attribute 'split'
But rdd_df is not a list; let's check:
type(rdd_df)
pyspark.rdd.RDD
What might be the problem? I would like to map over the records and aggregate on column 3 (Transaction). The desired output would look like:
(PENDING PAYMENT,4), (COMPLETE,2), (CLOSED,4)
Thank you.
Tags: list, dataframe, split, rdd
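For reference, the error is reproducible without Spark at all: `df.rdd.map(list)` yields one Python list per record, and `list` has no `split` method (`split` belongs to `str`). The rows are already split into fields, so they can be indexed directly. A minimal sketch of both points, using one of the rows above:

```python
# What df.rdd.map(list) produces for a single record: a plain list of fields.
row = [1, '2013-07-25 00:00:00.0', 100, 'CLOSED']

# Same call the lambda makes -- fails, because split() is a string method.
try:
    row.split(',')
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'split'

# The fields are already separated, so index the status column instead:
print(row[3])  # CLOSED
```

In other words, `split(",")` is only needed when each record is a single comma-separated string (e.g. lines read from a CSV file with `sc.textFile`), not when it is already a list or `Row`.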
What exactly are you trying to accomplish? – Hoenie, Nov 7 at 9:58
I am sorry; I have added what the desired output should look like. – melik, Nov 8 at 9:24
Can you try rdd_df.map(lambda x: x[3])? – Hoenie, Nov 8 at 10:53
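Building on the `x[3]` suggestion in the comments, the desired per-status counts can be produced by keying on column 3. The counting logic is sketched below in plain Python with `collections.Counter` on the same list-of-lists data that `df.rdd.map(list).collect()` returns; the equivalent (untested here) PySpark calls are shown in the trailing comment:

```python
from collections import Counter

# The rows that df.rdd.map(list).collect() returns; the status is at index 3.
rows = [
    [1,  '2013-07-25 00:00:00.0', 100, 'CLOSED'],
    [2,  '2013-07-25 12:23:00.0', 200, 'PENDING PAYMENT'],
    [3,  '2013-07-25 03:30:00.0', 400, 'PENDING PAYMENT'],
    [4,  '2013-07-25 12:23:00.0', 50,  'COMPLETE'],
    [5,  '2013-07-25 12:23:00.0', 50,  'CLOSED'],
    [6,  '2013-07-26 02:00:00.0', 300, 'CLOSED'],
    [7,  '2013-07-26 6:23:00.0',  10,  'PENDING PAYMENT'],
    [8,  '2013-07-26 03:30:00.0', 5,   'PENDING PAYMENT'],
    [9,  '2013-07-26 2:23:00.0',  20,  'COMPLETE'],
    [10, '2013-07-26 1:23:00.0',  30,  'CLOSED'],
]

# Count occurrences of each status value.
counts = Counter(r[3] for r in rows)
print(dict(counts))  # {'CLOSED': 4, 'PENDING PAYMENT': 4, 'COMPLETE': 2}

# The PySpark equivalent (no collect/parallelize round-trip needed):
#   df.rdd.map(lambda x: (x[3], 1)).reduceByKey(lambda a, b: a + b).collect()
# or, more simply:
#   df.rdd.map(lambda x: x[3]).countByValue()
```

Note that the `collect()`/`parallelize()` round-trip in the question is unnecessary: `df.rdd` is already an RDD and can be mapped directly.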
asked Nov 7 at 8:29, edited Nov 8 at 13:38 by melik (616)