Feature extraction from images present in s3 using spark driver is giving error











I have a PySpark application that downloads image files from S3 and extracts features from them using Keras.
Here is the entire flow:



1. Download the images from S3 using sc.binaryFiles.
s3_files_rdd = sc.binaryFiles(s3_path)  # [('s3n://..', bytearray)]

2. Then convert the bytes inside the RDD to image objects.

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from io import BytesIO

def convert_binary_to_image_obj(obj):
    # Decode the raw JPEG bytes in memory; the resulting array keeps the image's original size.
    img = mpimg.imread(BytesIO(obj), 'jpg')
    return img


images_rdd = s3_files_rdd.map(lambda x: (x[0], convert_binary_to_image_obj(x[1])))

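3. Now pass images_rdd to another function that extracts features using the Keras VGG16 model.

# Assumed imports for the names used below
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model
from keras.preprocessing import image

def initVGG16():
    # Full VGG16 with ImageNet weights, truncated at the fc2 layer
    model = VGG16(weights='imagenet', include_top=True)
    return Model(inputs=model.input, outputs=model.get_layer("fc2").output)

def extract_features(img):
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)  # add the batch dimension
    img_data = preprocess_input(img_data)
    vgg16_feature = initVGG16().predict(img_data)[0]
    return vgg16_feature


features_rdd = images_rdd.map(lambda x: (x[0], extract_features(x[1])))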


But when I try to run the application, it gives the following error message:



ValueError: Error when checking input: expected input_1 to have shape (224, 224, 3) but got array with shape (300, 200, 3)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)


I know the error comes from the extract_features function: the model expects the image to have shape (224, 224, 3), which is not the case right now. That is because I am not saving the image to local disk; I convert the downloaded bytes directly into an image object with matplotlib, so the image keeps its original size.



How can I resolve this issue? Basically, I want to download the image from S3, resize it in memory the same way image.load_img(image_path, target_size=(224, 224)) does, and then pass the resulting image object to my extract_features function.
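
For illustration, the in-memory resize described above might look roughly like the following. This is only a sketch under assumptions: it uses Pillow (the library Keras itself relies on for load_img) instead of matplotlib, and the helper name bytes_to_resized_array is made up for this example.

```python
from io import BytesIO

import numpy as np
from PIL import Image


def bytes_to_resized_array(raw_bytes, target_size=(224, 224)):
    """Hypothetical helper: decode image bytes in memory and resize them.

    target_size is passed to Pillow as (width, height); for the square
    224x224 input that VGG16 expects, the order makes no difference.
    """
    img = Image.open(BytesIO(raw_bytes))  # decode without touching local disk
    img = img.convert("RGB")              # force exactly 3 channels
    img = img.resize(target_size)         # resize entirely in memory
    return np.asarray(img, dtype="float32")  # shape: (height, width, 3)
```

Under the same assumptions, step 2 would become images_rdd = s3_files_rdd.map(lambda x: (x[0], bytes_to_resized_array(x[1]))), and extract_features would then receive arrays of shape (224, 224, 3).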










      keras pyspark deep-learning feature-extraction






      asked Nov 4 at 9:52









      dks551
