Watson Studio “Spark Environment” - how to increase `spark.driver.maxResultSize`?




















I'm running a Spark job that reads, manipulates, and merges a lot of txt files into a single file, but I'm hitting this issue:




Py4JJavaError: An error occurred while calling o8483.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 838 tasks (1025.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)




Is it possible to increase spark.driver.maxResultSize?



Note: this question is about the WS Spark “Environments” NOT about Analytics Engine.
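For reference, here is a rough sketch of the kind of call that hits this limit (the real job reads and merges many text files; the paths below are placeholders, not my actual code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder input: many text files read into a single DataFrame
    df = spark.read.text("some_dir/*.txt")

    # Pulling all rows back to the driver (toPandas goes through collectToPython)
    # is where the spark.driver.maxResultSize check fails.
    pdf = df.toPandas()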










watson-studio
edited Nov 30 '18 at 7:09 by Chris Snow

asked Nov 24 '18 at 14:34 by Chris Snow
























1 Answer
































You can increase the default value through the Ambari console if you are using an "Analytics Engine" Spark cluster instance. You can get the link and credentials for the Ambari console from the IAE instance on console.bluemix.net. From the Ambari console, add a new property under

Spark2 -> "Custom spark2-defaults" -> Add property -> spark.driver.maxResultSize = 2GB

Make sure the spark.driver.maxResultSize value is less than the driver memory, which is set in

Spark2 -> "Advanced spark2-env" -> content -> SPARK_DRIVER_MEMORY




Another suggestion, if you are just trying to create a single CSV file and don't want to change Spark configuration values because you don't know how large the final file will be, is to use a function like the one below. It uses the HDFS getmerge command to produce a single CSV file, much like writing from a pandas dataframe would.



    import os
    import tempfile

    def writeSparkDFAsCSV_HDFS(spark_df, file_location, file_name, csv_sep=',', csv_quote='"'):
        """
        Write a large Spark dataframe as a single CSV file without running into
        memory issues from converting it to a pandas dataframe.
        The dataframe is first written to a temporary HDFS location, then
        `hdfs dfs -getmerge` combines the part files into a single local file.
        After a header line is added, the merged file is moved back to HDFS.

        Args:
            spark_df (DataFrame) : Data object to be written to file.
            file_location (str)  : Directory location of the file.
            file_name (str)      : Name of file to write to.
            csv_sep (str)        : Field separator to use in the CSV file.
            csv_quote (str)      : Quote character to use in the CSV file.
        """
        # Define temporary and final paths
        file_path = os.path.join(file_location, file_name)
        temp_file_location = tempfile.NamedTemporaryFile().name
        temp_file_path = os.path.join(temp_file_location, file_name)

        print("Create directories")
        # Create the directories (local and HDFS) if they don't already exist
        !mkdir $temp_file_location
        !hdfs dfs -mkdir $file_location
        !hdfs dfs -mkdir $temp_file_location

        # Write the dataframe to the temporary HDFS location
        print("Write to temp hdfs location : {}".format("hdfs://" + temp_file_path))
        spark_df.write.csv("hdfs://" + temp_file_path, sep=csv_sep, quote=csv_quote)

        # Merge the part files from HDFS into a single local file
        print("Merge and put file at {}".format(temp_file_path))
        !hdfs dfs -getmerge $temp_file_path $temp_file_path

        # Add a header line to the merged file
        header = ",".join(spark_df.columns)
        !rm $temp_file_location/.*crc
        line_prepender(temp_file_path, header)  # helper that prepends the header (see sketch below)

        # Move the final file to HDFS
        !hdfs dfs -put -f $temp_file_path $file_path

        # Clean up temporary locations
        print("Cleanup..")
        !rm -rf $temp_file_location
        !hdfs dfs -rm -r $temp_file_location
        print("Done!")





edited Nov 30 '18 at 1:02

answered Nov 30 '18 at 0:54 by Manoj Singh































