scikit-learn spectral clustering: unable to find NaN lurking in data



























I'm running spectral coclustering on this dataset of Jeopardy questions, and I've hit a frustrating issue with the data. Note that I'm only clustering the values in the 'question' column.



When I run biclustering on the dataset, I get "divide by zero" RuntimeWarnings followed by a ValueError:



/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:38: RuntimeWarning: divide by zero encountered in true_divide
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:286: RuntimeWarning: invalid value encountered in multiply
z = np.vstack((row_diag[:, np.newaxis] * u,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


The error suggests there is a NaN or infinite value lurking in my data (which is just the single column of questions). It's purely text data, and I've already tried most of the NumPy and Pandas functions for detecting NaNs and infs, as well as many solutions from Stack Overflow, but none of them turn up anything.



Just to make sure my code itself isn't at fault: the same code works perfectly on the twenty newsgroups dataset.



Here's the code on Kaggle if you want to run it and see for yourself. In case SO's policies prohibit link-only code, here it is in a nutshell:



import re
from time import time

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering

dat = pd.DataFrame(pd.read_csv('../input/jarchive_cleaned.csv'))

qlist = []

def cleanhtml(raw_html):
    # strip HTML tags from the question text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

for row in dat.iterrows():
    txt = row[1]['text'].lower()
    txt = cleanhtml(txt)
    txt = re.sub(r'[^a-z ]', "", txt)   # keep only lowercase letters and spaces
    txt = re.sub(r' ', ' ', txt)
    # txt = ' '.join([stem(w) for w in txt.split(" ")])
    qlist.append([txt, row[1]['answer'], row[1]['category']])

print(qlist[:10])

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))   # one cluster per category

mtx = tv.fit_transform(queslst)

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()
cocluster.fit(mtx)
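
For context, the NaN/inf checks I've tried are roughly of this form (a sketch, not my exact notebook code), and they all come back clean:

import numpy as np

# a sketch of the NaN/inf checks described above (not verbatim notebook code)
print(dat['text'].isna().sum())    # NaN entries in the raw question column?
print(np.isnan(mtx.data).any())    # NaN among the stored tf-idf values?
print(np.isinf(mtx.data).any())    # inf among the stored tf-idf values?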









  • Did you try to print mtx?

    – Gal Sivan
    Nov 18 '18 at 6:26











  • @GalSivan I did. It shows a matrix of term frequencies. Nothing odd and no NaNs/infs there.

    – Mayukh Nair
    Nov 18 '18 at 6:49
















python machine-learning scikit-learn cluster-analysis






2 Answers
































Some strings, e.g. 'down out', consist entirely of stop words, so TfidfVectorizer() turns them into all-zero rows. That triggers the divide-by-zero warning when those rows are normalized, which puts inf/NaN values into the quantities derived from the mtx sparse matrix and causes the subsequent ValueError.
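
For illustration, here is a minimal sketch of that effect (my own example, using sklearn's built-in English stop word list instead of the NLTK list from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# a document made only of stop words vectorizes to an all-zero row (minimal sketch)
tv_demo = TfidfVectorizer(stop_words='english')
m = tv_demo.fit_transform(["name the capital of france", "the and of"])
print(m.toarray().sum(axis=1))   # second value is 0.0 -> that row breaks the normalization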



As a workaround, you can either drop those strings before vectorizing, or remove the all-zero rows from the mtx matrix after it is created by TfidfVectorizer.fit_transform(); the latter is a bit tricky because mtx is a sparse matrix.



I went with the second option, since I didn't dig into the original task, as follows:



import numpy as np

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))

mtx = tv.fit_transform(queslst)

# collect the indices of rows whose tf-idf weights sum to zero
indices = []
for i, mx in enumerate(mtx):
    if np.sum(mx, axis=1) == 0:
        indices.append(i)

# build a boolean mask and drop the all-zero rows
mask = np.ones(mtx.shape[0], dtype=bool)
mask[indices] = False
mtx = mtx[mask]

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()

cocluster.fit(mtx)


Finally, it works. I hope it helps, good luck!
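
If you prefer to avoid the Python loop over the sparse rows, the same mask can be built in one step (a sketch, equivalent in effect to the loop above):

import numpy as np

# keep only rows whose tf-idf weights do not sum to zero (same effect as the loop above)
keep = np.asarray(mtx.sum(axis=1)).ravel() > 0
mtx = mtx[keep]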
































  • This works like a charm. You're a lifesaver @Geeocode, thank you! :)

    – Mayukh Nair
    Nov 21 '18 at 6:31






  • @MayukhNair You're welcome. I'm happy it helps! :)

    – Geeocode
    Nov 21 '18 at 10:53

































The division by zero is what produces the NaNs/infs, so you need to fix the root cause first: in floating point, 1.0/0.0 gives inf and 0.0/0.0 gives NaN.



You probably have a column or row that is all zeros.
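
A quick way to confirm that on the tf-idf matrix from the question would be something like this (a sketch; mtx is the output of fit_transform):

import numpy as np

# count all-zero rows and columns in the sparse tf-idf matrix (sketch)
print((np.asarray(mtx.sum(axis=1)).ravel() == 0).sum(), "all-zero rows")
print((np.asarray(mtx.sum(axis=0)).ravel() == 0).sum(), "all-zero columns")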





