scikit-learn spectral clustering: unable to find NaN lurking in data



























I'm running spectral coclustering on this dataset of Jeopardy questions, and I've hit a frustrating issue with the data. Note that I'm only clustering the values in the 'question' column.



When I run biclustering on the dataset, I get "divide by zero" RuntimeWarnings followed by a ValueError:



/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:38: RuntimeWarning: divide by zero encountered in true_divide
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:286: RuntimeWarning: invalid value encountered in multiply
z = np.vstack((row_diag[:, np.newaxis] * u,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


The error suggests there is a NaN or infinite value lurking in my data (which is just the single column of questions). It's purely text data, and I've already tried most of the NumPy and Pandas functions for detecting NaNs and infs, as well as many solutions from Stack Overflow, but none of them turn up anything.



Just to make sure my code itself isn't at fault: the same code works perfectly on the twenty newsgroups dataset.



Here's the code on Kaggle if you want to run it and see for yourself. In case SO's policies prohibit link-only code, here it is in a nutshell:



import re
from time import time

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering

dat = pd.DataFrame(pd.read_csv('../input/jarchive_cleaned.csv'))

qlist = []

def cleanhtml(raw_html):
    # strip HTML tags from the question text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

for row in dat.iterrows():
    txt = row[1]['text'].lower()
    txt = cleanhtml(txt)
    txt = re.sub(r'[^a-z ]', "", txt)   # keep only lowercase letters and spaces
    txt = re.sub(r' ', ' ', txt)
    # txt = ' '.join([stem(w) for w in txt.split(" ")])
    qlist.append([txt, row[1]['answer'], row[1]['category']])

print(qlist[:10])

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))   # one cluster per category

mtx = tv.fit_transform(queslst)

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()
cocluster.fit(mtx)
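
For context, the NaN/inf checks I've tried are roughly of this form (a sketch, not my exact notebook code), and they all come back clean:

import numpy as np

# a sketch of the NaN/inf checks described above (not verbatim notebook code)
print(dat['text'].isna().sum())    # NaN entries in the raw question column?
print(np.isnan(mtx.data).any())    # NaN among the stored tf-idf values?
print(np.isinf(mtx.data).any())    # inf among the stored tf-idf values?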









  • Did you try to print mtx?

    – Gal Sivan
    Nov 18 '18 at 6:26











  • @GalSivan I did. It shows a matrix of term frequencies. Nothing odd and no NaNs/infs there.

    – Mayukh Nair
    Nov 18 '18 at 6:49
















python machine-learning scikit-learn cluster-analysis






2 Answers
































Some strings, e.g. 'down out', consist entirely of stop words, so TfidfVectorizer() turns them into all-zero rows. That triggers the divide-by-zero warning when those rows are normalized, which puts inf/NaN values into the quantities derived from the mtx sparse matrix and causes the subsequent ValueError.
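
For illustration, here is a minimal sketch of that effect (my own example, using sklearn's built-in English stop word list instead of the NLTK list from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# a document made only of stop words vectorizes to an all-zero row (minimal sketch)
tv_demo = TfidfVectorizer(stop_words='english')
m = tv_demo.fit_transform(["name the capital of france", "the and of"])
print(m.toarray().sum(axis=1))   # second value is 0.0 -> that row breaks the normalization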



As a workaround, you can either drop those strings before vectorizing, or remove the all-zero rows from the mtx matrix after it is created by TfidfVectorizer.fit_transform(); the latter is a bit tricky because mtx is a sparse matrix.



I went with the second option, since I didn't dig into the original task, as follows:



import numpy as np

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))

mtx = tv.fit_transform(queslst)

# collect the indices of rows whose tf-idf weights sum to zero
indices = []
for i, mx in enumerate(mtx):
    if np.sum(mx, axis=1) == 0:
        indices.append(i)

# build a boolean mask and drop the all-zero rows
mask = np.ones(mtx.shape[0], dtype=bool)
mask[indices] = False
mtx = mtx[mask]

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()

cocluster.fit(mtx)


Finally, it works. I hope it helps, good luck!
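
If you prefer to avoid the Python loop over the sparse rows, the same mask can be built in one step (a sketch, equivalent in effect to the loop above):

import numpy as np

# keep only rows whose tf-idf weights do not sum to zero (same effect as the loop above)
keep = np.asarray(mtx.sum(axis=1)).ravel() > 0
mtx = mtx[keep]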
































  • This works like a charm. You're a lifesaver @Geeocode, thank you! :)

    – Mayukh Nair
    Nov 21 '18 at 6:31






  • @MayukhNair You're welcome. I'm happy it helps! :)

    – Geeocode
    Nov 21 '18 at 10:53

































The division by zero is what produces the NaNs/infs, so you need to fix the root cause first: in floating point, 1.0/0.0 gives inf and 0.0/0.0 gives NaN.



You probably have a column or row that is all zeros.
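
A quick way to confirm that on the tf-idf matrix from the question would be something like this (a sketch; mtx is the output of fit_transform):

import numpy as np

# count all-zero rows and columns in the sparse tf-idf matrix (sketch)
print((np.asarray(mtx.sum(axis=1)).ravel() == 0).sum(), "all-zero rows")
print((np.asarray(mtx.sum(axis=0)).ravel() == 0).sum(), "all-zero columns")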





