How to pass deep learning model data to map function in Spark

I have a very simple use case: I read a large number of images from S3 as an RDD using the sc.binaryFiles method. Once this RDD is created, I pass its content to a VGG16 feature-extractor function. Feature extraction needs the model data, so I put the model data into a broadcast variable and access its value in each map function. Here is the code:



s3_files_rdd = sc.binaryFiles(RESOLVED_IMAGE_PATH)

s3_files_rdd.persist()

model_data = initVGG16()
broadcast_model = sc.broadcast(model_data)

features_rdd = s3_files_rdd.mapPartitions(extract_features_)

response_rdd = features_rdd.map(lambda x: (x[0], write_to_s3(x, OUTPUT, FORMAT_NAME)))
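
initVGG16 is not shown in the question; a minimal sketch of what it might look like, assuming the standard pre-trained Keras VGG16 is used as a feature extractor (the choice of the 'fc2' layer is an assumption, not something stated in the question):

# Hypothetical sketch of initVGG16, not the original implementation.
from keras.applications.vgg16 import VGG16
from keras.models import Model

def initVGG16():
    # Load the pre-trained network and expose the 4096-d fc2 layer as the
    # output, so model.predict() returns feature vectors instead of class scores.
    base = VGG16(weights='imagenet', include_top=True)
    return Model(inputs=base.input, outputs=base.get_layer('fc2').output)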


The extract_features_ method:



def extract_features_(xs):
    # Build the model once per partition, then run every (key, bytes) pair through it.
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features(model_data, v)


The extract_features method:



import numpy as np
from io import BytesIO

from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.applications.vgg16 import preprocess_input


def extract_features(model, obj):
    try:
        print('executing vgg16 feature extractor...')
        img = image.load_img(BytesIO(obj), target_size=(224, 224, 3))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)
        vgg16_feature = model.predict(img_data)[0]
        print('++++++++++++++++++++++++++++', vgg16_feature.shape)
        return vgg16_feature
    except Exception as e:
        print('Error......{}'.format(e.args))
        return


The write_to_s3 method:



def write_to_s3(rdd, output_path, format_name):
    # rdd here is a single (file_path, features) record of the RDD.
    file_path = rdd[0]
    file_name_without_ext = get_file_name_without_ext(file_path)
    bucket_name = output_path.split('/', 1)[0]

    final_path = 'deepak' + '/' + file_name_without_ext + '.' + format_name

    LOGGER.info("Saving to S3....")
    cci = cc.get_interface(bucket_name, ACCESS_KEY=os.environ.get("AWS_ACCESS_KEY_ID"),
                           SECRET_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY"),
                           endpoint_url='https://s3.amazonaws.com')
    response = cci.upload_npy_array(final_path, rdd[1])
    return response


Inside the write_to_s3 method I receive one record of the RDD (a key/value pair), extract the object key to save and the bucket name, and then use a library called cottoncandy to save the content, which in my case is a numpy array, directly to S3 without writing any intermediate file.
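
For reference, a minimal round-trip sketch using the same cottoncandy calls as above; the bucket and key names are placeholders, and download_npy_array is assumed to be the counterpart of upload_npy_array:

import os
import numpy as np
import cottoncandy as cc

# Placeholder bucket name; credentials come from the environment as in the question.
cci = cc.get_interface('my-bucket',
                       ACCESS_KEY=os.environ.get("AWS_ACCESS_KEY_ID"),
                       SECRET_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY"),
                       endpoint_url='https://s3.amazonaws.com')

features = np.random.rand(4096).astype('float32')
cci.upload_npy_array('deepak/example.npy', features)      # store the array directly as an S3 object
restored = cci.download_npy_array('deepak/example.npy')   # assumed counterpart of upload_npy_array
assert np.allclose(features, restored)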



I am getting the error below:



127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 600, in save_reduce
save(state)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.7/pickle.py", line 687, in _batch_setitems
save(v)
File "/usr/lib64/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
TypeError: can't pickle thread.lock objects
Traceback (most recent call last):
File "one_file5.py", line 98, in <module>
run()
File "one_file5.py", line 89, in run
LOGGER.info('features_rdd rdd created,...... %s',features_rdd.count())
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1041, in count
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1032, in sum
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 906, in fold
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 809, in collect
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2388, in _wrap_function
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2374, in _prepare_for_python_RDD
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/serializers.py", line 464, in dumps
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 704, in dumps
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 162, in dump
pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects.


When I comment out the features_rdd part of the code, the program runs fine, which suggests something is wrong in the features_rdd part. I am not sure what I am doing wrong here.



I am running the program on AWS EMR with 4 executors:
executor cores: 7
executor RAM: 8 GB
Spark version: 2.2.1

apache-spark keras pyspark deep-learning

asked Nov 7 at 10:25 by dks551 (edited Nov 8 at 0:42)

1 Answer

Replace your current code with mapPartitions:

def extract_features_(xs):
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features(model_data, v)

features_rdd = s3_files_rdd.mapPartitions(extract_features_)





answered Nov 7 at 11:54 by user10618151

The above is working fine, but when I take the resulting features_rdd and try to write to S3 by passing it through the write_to_s3 function, it returns the same error again. I have updated my question.
– dks551, Nov 8 at 0:34
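
One way to avoid hitting the same pickling error on the write step is to keep the S3 client inside mapPartitions as well, so the driver never has to serialize the model or the cottoncandy interface; the task closure then contains only plain picklable objects. This is only a rough, untested sketch that reuses the names from the question (initVGG16, extract_features, get_file_name_without_ext, OUTPUT, FORMAT_NAME) and the cottoncandy calls shown there:

def extract_and_write_(xs):
    # Build the model and the S3 interface on the executor, once per partition,
    # so nothing unpicklable is captured by the closure sent from the driver.
    model_data = initVGG16()
    cci = cc.get_interface(OUTPUT.split('/', 1)[0],
                           ACCESS_KEY=os.environ.get("AWS_ACCESS_KEY_ID"),
                           SECRET_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY"),
                           endpoint_url='https://s3.amazonaws.com')
    for file_path, content in xs:
        features = extract_features(model_data, content)
        key = 'deepak/' + get_file_name_without_ext(file_path) + '.' + FORMAT_NAME
        yield file_path, cci.upload_npy_array(key, features)

response_rdd = s3_files_rdd.mapPartitions(extract_and_write_)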










