Spark - optmize query with multiple “whens”











up vote
1
down vote

favorite












Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532



The query is something like this. I have a dataframe with 10 boolean columns.
I have some modifiers like:



val smallIncrease = 5
val smallDecrease = -5
val bigIncrease = 10
val bigDecrease = -10


Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
So now my query looks something like this:



df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
.withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
.withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))


Is there a way to compact this query and avoid multiple withColumns.
Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...










share|improve this question


























    up vote
    1
    down vote

    favorite












    Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532



    The query is something like this. I have a dataframe with 10 boolean columns.
    I have some modifiers like:



    val smallIncrease = 5
    val smallDecrease = -5
    val bigIncrease = 10
    val bigDecrease = -10


    Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
    So now my query looks something like this:



    df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
    .withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
    .withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
    .withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
    .withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
    .withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
    .withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))


    Is there a way to compact this query and avoid multiple withColumns.
    Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...










    share|improve this question
























      up vote
      1
      down vote

      favorite









      up vote
      1
      down vote

      favorite











      Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532



      The query is something like this. I have a dataframe with 10 boolean columns.
      I have some modifiers like:



      val smallIncrease = 5
      val smallDecrease = -5
      val bigIncrease = 10
      val bigDecrease = -10


      Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
      So now my query looks something like this:



      df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
      .withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
      .withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))


      Is there a way to compact this query and avoid multiple withColumns.
      Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...










      share|improve this question













      Is there a way to optimize this query to not use withColumn multiple times. My biggest problem is that I hit this issue: https://issues.apache.org/jira/browse/SPARK-18532



      The query is something like this. I have a dataframe with 10 boolean columns.
      I have some modifiers like:



      val smallIncrease = 5
      val smallDecrease = -5
      val bigIncrease = 10
      val bigDecrease = -10


      Based on each of the boolean column I would like to calculate a final score by adding small/big increase/decrease base on values in different columns.
      So now my query looks something like this:



      df.withColumn("result", when(col("col1"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col2"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(bigDecrease)))
      .withColumn("result", when(col("col3"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col4"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col5"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(bigDecrease)))
      .withColumn("result", when(col("col6"), col("result") + lit(bigIncrease)).otherwise(col("result") + lit(smallDecrease)))
      .withColumn("result", when(col("col7"), col("result") + lit(smallIncrease)).otherwise(col("result") + lit(smallDecrease)))


      Is there a way to compact this query and avoid multiple withColumns.
      Unfortunaltey UDF is not choise as there are more than 10 boolean column to take into account and UDFs are limited to 10 column. Maybe I can split it into 2 UDFs but this looks very ugly to me...







      apache-spark optimization query-optimization






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 7 at 9:26









      MitakaJ9

      145




      145
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote













          How about something like this?



          def myFun(b: Seq[Boolean], result: Int): Int = {

          val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.

          b.zip(conversions).foldLeft(result){
          case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc)
          }
          }

          val myUdf = udf(myFun(_: Seq[Boolean], _: Int))

          df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))





          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














             

            draft saved


            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186613%2fspark-optmize-query-with-multiple-whens%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            0
            down vote













            How about something like this?



            def myFun(b: Seq[Boolean], result: Int): Int = {

            val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.

            b.zip(conversions).foldLeft(result){
            case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc)
            }
            }

            val myUdf = udf(myFun(_: Seq[Boolean], _: Int))

            df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))





            share|improve this answer

























              up vote
              0
              down vote













              How about something like this?



              def myFun(b: Seq[Boolean], result: Int): Int = {

              val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.

              b.zip(conversions).foldLeft(result){
              case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc)
              }
              }

              val myUdf = udf(myFun(_: Seq[Boolean], _: Int))

              df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))





              share|improve this answer























                up vote
                0
                down vote










                up vote
                0
                down vote









                How about something like this?



                def myFun(b: Seq[Boolean], result: Int): Int = {

                val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.

                b.zip(conversions).foldLeft(result){
                case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc)
                }
                }

                val myUdf = udf(myFun(_: Seq[Boolean], _: Int))

                df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))





                share|improve this answer












                How about something like this?



                def myFun(b: Seq[Boolean], result: Int): Int = {

                val conversions: Seq[(Boolean, Int) => Int] = ??? // Functions to apply increase/decrease for each boolean value col1, col2 etc.

                b.zip(conversions).foldLeft(result){
                case (acc, (nextBool, nextFun)) => nextFun(nextBool, acc)
                }
                }

                val myUdf = udf(myFun(_: Seq[Boolean], _: Int))

                df.select(myUdf(array($"col1", $"col2", $"col3"...), $"result").as("result"))






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 7 at 11:42









                Terry Dactyl

                1,071412




                1,071412






























                     

                    draft saved


                    draft discarded



















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53186613%2fspark-optmize-query-with-multiple-whens%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    這個網誌中的熱門文章

                    Tangent Lines Diagram Along Smooth Curve

                    Yusuf al-Mu'taman ibn Hud

                    Zucchini