Feature extraction from images in S3 using the Spark driver gives an error

I have a PySpark application that downloads image files from S3 and extracts features from them using Keras.
Here is the entire flow:

1. Download the images from S3 using sc.binaryFiles:

    s3_files_rdd = sc.binaryFiles(s3_path) ## [('s3n://..',bytearray)]



2. Convert the bytes in the RDD to image objects:

    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg
    from io import BytesIO

    def convert_binary_to_image_obj(obj):
        img = mpimg.imread(BytesIO(obj), 'jpg')
        return img

    images_rdd = s3_files_rdd.map(lambda x: (x[0], convert_binary_to_image_obj(x[1])))



3. Pass images_rdd to another function that extracts features using the Keras VGG16 model:

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.models import Model
    from keras.preprocessing import image

    def initVGG16():
        model = VGG16(weights='imagenet', include_top=True)
        # Use the output of the fc2 layer as the feature vector
        return Model(inputs=model.input, outputs=model.get_layer("fc2").output)

    def extract_features(img):
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)
        vgg16_feature = initVGG16().predict(img_data)[0]
        return vgg16_feature

    features_rdd = images_rdd.map(lambda x: (x[0], extract_features(x[1])))

But when I run the application, it gives the following error message:

    ValueError: Error when checking input: expected input_1 to have shape (224, 224, 3) but got array with shape (300, 200, 3)

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)

I know the error comes from the extract_features function: the model expects each image to have shape (224, 224, 3), which is not the case here. I am not saving the images to local disk; I convert the downloaded bytes directly to an image object with matplotlib, so the images keep their original size.

How can I resolve this issue? What I want is to download the image from S3, resize it in memory the way image.load_img(image_path, target_size=(224, 224)) does, and then pass the resulting image object to my extract_features function.
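For illustration, here is a minimal sketch of the in-memory resize described above. It assumes Pillow is installed on the Spark executors and swaps matplotlib's imread for PIL's Image.open; the target_size parameter is a hypothetical addition, and the resampling filter is left at Pillow's default, so the pixels may differ slightly from what image.load_img would produce.

```python
from io import BytesIO

from PIL import Image


def convert_binary_to_image_obj(obj, target_size=(224, 224)):
    """Decode raw image bytes and resize them in memory (no temp file)."""
    img = Image.open(BytesIO(obj))
    # Force three channels so grayscale or RGBA files also give (H, W, 3).
    img = img.convert("RGB")
    # PIL's resize takes (width, height); target_size is an assumed parameter.
    return img.resize(target_size)
```

The returned PIL image can be passed to image.img_to_array, so the existing mapping images_rdd = s3_files_rdd.map(lambda x: (x[0], convert_binary_to_image_obj(x[1]))) should then produce inputs of shape (224, 224, 3) without further changes.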

asked Nov 4 at 9:52

dks551

17310


keras pyspark deep-learning feature-extraction
