pandas SimpleImputer: preserve data types

I am facing a simple error with the code below.

My objective is to use SimpleImputer to fill missing values of different data types in one shot.

When I try to do that, fit_transform does not seem to work as expected. When the dtype argument is not used, the code works fine, but the resulting DataFrame loses its data type information. When I pass the dtypes in the arguments, I see the error below. You should be able to reproduce the error by copying and pasting the code locally.

import numpy as np
import pandas as pd
import sklearn
from sklearn.impute import SimpleImputer

print(sklearn.__version__)
# 0.21.dev0

data = [['Alex', 'NJ', 21, 5.10], ['Mary', 'NY', 20, np.nan], ['Sam', np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=['Name', 'State', 'Age', 'Height'])

print(df.dtypes)
# Name       object
# State      object
# Age       float64
# Height    float64
# dtype: object

imp = SimpleImputer(strategy="most_frequent")

# Without the dtype argument this works, but every column comes back as object:
# df = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
# df
#    Name State Age Height
# 0  Alex    NJ  21    5.1
# 1  Mary    NY  20    5.1
# 2   Sam    NJ  20    6.3
# df.dtypes
# Name      object
# State     object
# Age       object
# Height    object
# dtype: object

The statement below fails with the error that follows (I am trying to preserve the dtypes during imputation):

df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
7
8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
337 data = {}
338 if dtype is not None:
--> 339 dtype = self._validate_dtype(dtype)
340
341 if isinstance(data, DataFrame):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
166
167 if dtype is not None:
--> 168 dtype = pandas_dtype(dtype)
169
170 # a compound dtype

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
2020 # which we safeguard against by catching them earlier and returning
2021 # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022 if dtype in [object, np.object_, 'object', 'O']:
2023 return npdtype
2024 elif npdtype.kind == 'O':

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574 raise ValueError("The truth value of a {0} is ambiguous. "
1575 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576 .format(self.__class__.__name__))
1577
1578 __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
asked Nov 11 at 23:28 by Dsfs Dsfds, edited Nov 12 at 0:21

1 Answer

If you want to preserve the dtypes, I recommend using pandas to find the mode of each column and then calling fillna:

df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
#    Name State   Age  Height
# 0  Alex    NJ  21.0     5.1
# 1  Mary    NY  20.0     5.1
# 2   Sam    NJ  20.0     6.3

print(df.dtypes)
# Name       object
# State      object
# Age       float64
# Height    float64
# dtype: object
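
The same idea can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: `fill_with_mode` is a hypothetical name, it uses `DataFrame.mode()` rather than the `agg` call above, and it assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with each column's most frequent value (hypothetical helper)."""
    # DataFrame.mode() ignores NaN by default; row 0 holds the first mode of each column.
    modes = frame.mode().iloc[0]
    # fillna with a Series matches on column names, so each column is filled separately.
    return frame.fillna(modes)


data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

filled = fill_with_mode(df)
print(filled.dtypes)
```

Because the data never leaves pandas, no dtype has to be restored afterwards. When several values tie for most frequent, this sketch takes the first one that `mode()` returns.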




Alternatively, use astype and pass a dictionary:

df = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())

print(df)
#    Name State   Age  Height
# 0  Alex    NJ  21.0     5.1
# 1  Mary    NY  20.0     5.1
# 2   Sam    NJ  20.0     6.3

print(df.dtypes)
# Name       object
# State      object
# Age       float64
# Height    float64
# dtype: object
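
For reuse, the astype approach might be packaged as below. This is a sketch under stated assumptions rather than a definitive implementation: `impute_keep_dtypes` is a hypothetical name, the imputer is assumed to return one output column per input column (a column that is entirely missing may be dropped, which would break the column alignment), and the imputed values are assumed to be castable back to the original dtypes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_keep_dtypes(frame: pd.DataFrame, imputer: SimpleImputer) -> pd.DataFrame:
    """Impute with scikit-learn, then restore per-column dtypes (hypothetical helper)."""
    # With mixed columns the imputer output carries no per-column dtype information.
    imputed = imputer.fit_transform(frame)
    restored = pd.DataFrame(imputed, columns=frame.columns, index=frame.index)
    # astype accepts a {column: dtype} mapping, unlike the DataFrame constructor.
    return restored.astype(frame.dtypes.to_dict())


data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

result = impute_keep_dtypes(df, SimpleImputer(strategy="most_frequent"))
print(result.dtypes)
```

Passing `index=frame.index` keeps the original row labels, which the bare constructor call would otherwise replace with a default range index.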


An explicit astype call is needed because, per the documentation, only a single dtype can be passed to the pd.DataFrame constructor.

?pd.DataFrame
...
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed.
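
The difference between the two call sites can be seen on a toy frame. This is a small illustrative sketch; the column names `a` and `b` and the values are made up for the example.

```python
import pandas as pd

rows = [[1, 2.5], [3, 4.5]]

# The constructor's dtype argument is a single dtype applied to every column.
all_float = pd.DataFrame(rows, columns=["a", "b"], dtype="float64")
print(all_float.dtypes.to_dict())

# astype accepts a mapping, so each column can be given its own dtype.
mixed = all_float.astype({"a": "int64", "b": "float64"})
print(mixed.dtypes.to_dict())
```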






answered Nov 12 at 0:04 by coldspeed, edited Nov 12 at 0:50

• Added version info to the original question. I want to use the imputer function, which is much more readable and has the ability to take in a dtypes parameter.
  – Dsfs Dsfds, Nov 12 at 0:24
• @DsfsDsfds See edit?
  – coldspeed, Nov 12 at 0:48
• The astype approach worked like a charm, thanks!!
  – Dsfs Dsfds, Nov 12 at 1:05