Spark - optmize query with multiple “whens”

up vote
1
down vote

favorite

Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532

The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:

val smallIncrease = 5

val smallDecrease = -5

val bigIncrease = 10

val bigDecrease = -10

Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:

df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...

asked Nov 7 at 9:26

MitakaJ9

145

add a comment |

up vote
1
down vote

favorite

Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532

The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:

val smallIncrease = 5

val smallDecrease = -5

val bigIncrease = 10

val bigDecrease = -10

Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:

df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

asked Nov 7 at 9:26

MitakaJ9

145

add a comment |

up vote
1
down vote

favorite

Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532

The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:

val smallIncrease = 5

val smallDecrease = -5

val bigIncrease = 10

val bigDecrease = -10

Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:

df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

asked Nov 7 at 9:26

MitakaJ9

145

Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532

The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:

val smallIncrease = 5

val smallDecrease = -5

val bigIncrease = 10

val bigDecrease = -10

Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:

df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))

.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))

.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))

apache-spark optimization query-optimization

asked Nov 7 at 9:26

MitakaJ9

145

asked Nov 7 at 9:26

MitakaJ9

145

asked Nov 7 at 9:26

MitakaJ9

145

asked Nov 7 at 9:26

MitakaJ9

145

asked Nov 7 at 9:26

MitakaJ9

145

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

How about something like this?

def myFun(b: Seq[Boolean], result: Int): Int = {



  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.



  b.zip(conversions).foldLeft(result){

    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 

  }

}



val myUdf = udf(myFun(_: Seq[Boolean], _: Int))



df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))

answered Nov 7 at 11:42

Terry Dactyl

1,071412

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186613%2fspark-optmize-query-with-multiple-whens%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

How about something like this?

def myFun(b: Seq[Boolean], result: Int): Int = {



  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.



  b.zip(conversions).foldLeft(result){

    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 

  }

}



val myUdf = udf(myFun(_: Seq[Boolean], _: Int))



df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))

answered Nov 7 at 11:42

Terry Dactyl

1,071412

add a comment |

up vote
0
down vote

How about something like this?

def myFun(b: Seq[Boolean], result: Int): Int = {



  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.



  b.zip(conversions).foldLeft(result){

    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 

  }

}



val myUdf = udf(myFun(_: Seq[Boolean], _: Int))



df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))

answered Nov 7 at 11:42

Terry Dactyl

1,071412

add a comment |

up vote
0
down vote

How about something like this?

def myFun(b: Seq[Boolean], result: Int): Int = {



  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.



  b.zip(conversions).foldLeft(result){

    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 

  }

}



val myUdf = udf(myFun(_: Seq[Boolean], _: Int))



df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))

answered Nov 7 at 11:42

Terry Dactyl

1,071412

How about something like this?

def myFun(b: Seq[Boolean], result: Int): Int = {



  val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.



  b.zip(conversions).foldLeft(result){

    case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc) 

  }

}



val myUdf = udf(myFun(_: Seq[Boolean], _: Int))



df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))

answered Nov 7 at 11:42

Terry Dactyl

1,071412

answered Nov 7 at 11:42

Terry Dactyl

1,071412

answered Nov 7 at 11:42

Terry Dactyl

1,071412

answered Nov 7 at 11:42

Terry Dactyl

1,071412

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

wi1tfdH,vWHQ L fMZWrSw4TbmaDZErasmI2FUKZ ZdvhJ 4n,HySEVwo 6LXJLxj2

搜尋此網誌

Wsrtjtyk