Spark - optmize query with multiple “whens”
up vote
1
down vote
favorite
Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532
The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:
val smallIncrease = 5
val smallDecrease = -5
val bigIncrease = 10
val bigDecrease = -10
Based on each of the boolean column I would like to calculate a final score by  adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:
df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...
apache-spark optimization query-optimization
add a comment |
up vote
1
down vote
favorite
Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532
The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:
val smallIncrease = 5
val smallDecrease = -5
val bigIncrease = 10
val bigDecrease = -10
Based on each of the boolean column I would like to calculate a final score by  adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:
df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...
apache-spark optimization query-optimization
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532
The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:
val smallIncrease = 5
val smallDecrease = -5
val bigIncrease = 10
val bigDecrease = -10
Based on each of the boolean column I would like to calculate a final score by  adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:
df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...
apache-spark optimization query-optimization
Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532
The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:
val smallIncrease = 5
val smallDecrease = -5
val bigIncrease = 10
val bigDecrease = -10
Based on each of the boolean column I would like to calculate a final score by  adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:
df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...
apache-spark optimization query-optimization
apache-spark optimization query-optimization
asked Nov 7 at 9:26
MitakaJ9
145
145
add a comment |
add a comment |
                                1 Answer
                                1
                        
active
oldest
votes
up vote
0
down vote
How about something like this?
def myFun(b: Seq[Boolean], result: Int): Int = {
  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.
  b.zip(conversions).foldLeft(result){
    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 
  }
}
val myUdf = udf(myFun(_: Seq[Boolean], _: Int))
df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))
add a comment |
                                1 Answer
                                1
                        
active
oldest
votes
                                1 Answer
                                1
                        
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
How about something like this?
def myFun(b: Seq[Boolean], result: Int): Int = {
  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.
  b.zip(conversions).foldLeft(result){
    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 
  }
}
val myUdf = udf(myFun(_: Seq[Boolean], _: Int))
df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))
add a comment |
up vote
0
down vote
How about something like this?
def myFun(b: Seq[Boolean], result: Int): Int = {
  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.
  b.zip(conversions).foldLeft(result){
    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 
  }
}
val myUdf = udf(myFun(_: Seq[Boolean], _: Int))
df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))
add a comment |
up vote
0
down vote
up vote
0
down vote
How about something like this?
def myFun(b: Seq[Boolean], result: Int): Int = {
  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.
  b.zip(conversions).foldLeft(result){
    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 
  }
}
val myUdf = udf(myFun(_: Seq[Boolean], _: Int))
df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))
How about something like this?
def myFun(b: Seq[Boolean], result: Int): Int = {
  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.
  b.zip(conversions).foldLeft(result){
    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 
  }
}
val myUdf = udf(myFun(_: Seq[Boolean], _: Int))
df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))
answered Nov 7 at 11:42


Terry Dactyl
1,071412
1,071412
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186613%2fspark-optmize-query-with-multiple-whens%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown