scikit-learn spectral clustering: unable to find NaN lurking in data
I'm running spectral coclustering on this dataset of Jeopardy questions, and I'm facing a frustrating issue with the data. Note that I'm only clustering the values in the 'question' column.
Running biclustering on the dataset raises a "divide by zero" RuntimeWarning followed by a ValueError:
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:38: RuntimeWarning: divide by zero encountered in true_divide
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:286: RuntimeWarning: invalid value encountered in multiply
z = np.vstack((row_diag[:, np.newaxis] * u,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The error suggests that there is a NaN or infinite value lurking in my data (which is only the single column of questions). I'm handling purely text data, and I've already tried most NumPy and Pandas functions for filtering NaNs and infs, as well as many solutions on Stack Overflow. None of them found anything.
To rule out a fault in my code, I ran the same thing on the twenty newsgroups dataset, where it works perfectly.
Here's the code on Kaggle if you want to run it and see for yourself. However, just in case SO's policies prohibit this, here's the code in a nutshell:
    import re
    from time import time

    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.cluster import SpectralCoclustering
    from sklearn.feature_extraction.text import TfidfVectorizer

    dat = pd.DataFrame(pd.read_csv('../input/jarchive_cleaned.csv'))
    qlist = []

    def cleanhtml(raw_html):
        # Strip anything that looks like an HTML tag
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext

    for row in dat.iterrows():
        txt = row[1]['text'].lower()
        txt = cleanhtml(txt)
        txt = re.sub(r'[^a-z ]', "", txt)
        # Collapse runs of spaces left behind by the previous substitution
        txt = re.sub(r' +', ' ', txt)
        # txt = ' '.join([stem(w) for w in txt.split(" ")])
        qlist.append([txt, row[1]['answer'], row[1]['category']])

    print(qlist[:10])

    swords = set(stopwords.words('english'))
    tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')
    queslst = [q for (q, a, c) in qlist]
    # One cluster per distinct category
    qlen = len(set([c for (q, a, c) in qlist]))
    mtx = tv.fit_transform(queslst)

    cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)
    t = time()
    cocluster.fit(mtx)
python machine-learning scikit-learn cluster-analysis
Did you try to print mtx? – Gal Sivan, Nov 18 '18 at 6:26
@GalSivan I did. It throws a matrix of term frequencies. Nothing odd and no NaNs/infs there. – Mayukh Nair, Nov 18 '18 at 6:49
asked Nov 18 '18 at 5:51 by Mayukh Nair, edited Nov 21 '18 at 6:37
2 Answers
Some strings, e.g. 'down out', produce an all-zero row from TfidfVectorizer(), most likely because every word in them is a stop word and no terms remain. The zero row sum triggers the divide-by-zero warning in the row normalization (1.0 / np.sqrt(X.sum(axis=1))), which yields inf values, and these in turn cause the second error.
As a workaround, either remove these strings from the input or remove the all-zero rows from the mtx matrix after TfidfVectorizer.fit_transform() creates it, which is a bit tricky because of the sparse matrix operations.
I went with the second option, since I didn't dig into the original task, as follows:
    import numpy as np

    swords = set(stopwords.words('english'))
    tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')
    queslst = [q for (q, a, c) in qlist]
    qlen = len(set([c for (q, a, c) in qlist]))
    mtx = tv.fit_transform(queslst)

    # Collect the indices of rows whose TF-IDF weights sum to zero
    indices = []
    for i, mx in enumerate(mtx):
        if np.sum(mx, axis=1) == 0:
            indices.append(i)

    # Keep only the rows that have at least one non-zero entry
    mask = np.ones(mtx.shape[0], dtype=bool)
    mask[indices] = False
    mtx = mtx[mask]

    cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)
    t = time()
    cocluster.fit(mtx)
With this change it finally works. I hope it helps, good luck!
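The row-by-row loop above can also be written without a Python loop. The following is only a sketch on a toy matrix, not the code from the notebook: `drop_empty_rows` and `toy` are hypothetical names introduced for illustration, and it assumes non-negative weights (as TF-IDF produces), so a row sum of zero means the row is entirely zero.

```python
import numpy as np
from scipy import sparse


def drop_empty_rows(mtx):
    """Return (matrix without all-zero rows, boolean mask of kept rows).

    Hypothetical helper; assumes non-negative entries such as TF-IDF weights.
    """
    mtx = sparse.csr_matrix(mtx)
    # Row sums come back as an (n, 1) matrix, so flatten them to 1-D
    row_sums = np.asarray(mtx.sum(axis=1)).ravel()
    keep = row_sums > 0
    return mtx[keep], keep


# Toy stand-in for the TF-IDF matrix: the middle row has no terms left
toy = sparse.csr_matrix(np.array([[0.5, 0.0, 0.5],
                                  [0.0, 0.0, 0.0],
                                  [0.0, 1.0, 0.0]]))
filtered, keep = drop_empty_rows(toy)
print(filtered.shape)  # (2, 3)
print(keep.tolist())   # [True, False, True]
```

Returning the mask as well makes it possible to filter the matching entries of qlist in the same way, for example `[q for q, k in zip(qlist, keep) if k]`, so that questions and matrix rows stay aligned.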
This works like a charm. You're a lifesaver @Geeocode, thank you! :) – Mayukh Nair, Nov 21 '18 at 6:31
@MayukhNair You're welcome. I'm happy it helps! :) – Geeocode, Nov 21 '18 at 10:53
The division by zero is causing the NaNs. You need to fix the root cause first. In floating-point arithmetic 1/0 gives inf, and multiplying inf by 0 then gives NaN.
You probably have a column or row that is all zeros.
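To make that chain concrete, here is a small NumPy-only illustration. It is a toy example, not the scikit-learn code itself: the array `X` and the names `zero_rows` and `zero_cols` are made up for the sketch, and only the `row_diag` expression is modeled on the line shown in the warning from the question.

```python
import numpy as np

# Toy dense matrix; the second row is all zeros
X = np.array([[1.0, 3.0],
              [0.0, 0.0]])

# Silence the same RuntimeWarnings that appear in the question
with np.errstate(divide="ignore", invalid="ignore"):
    # Same form as the normalization in the warning: 1 / sqrt(row sum)
    row_diag = 1.0 / np.sqrt(X.sum(axis=1))
    # Scaling the zero row by inf produces NaN (inf * 0)
    scaled = row_diag[:, np.newaxis] * X

print(np.isinf(row_diag[1]))    # True
print(np.isnan(scaled).any())   # True

# Locate the offending rows and columns directly from their sums
zero_rows = np.flatnonzero(X.sum(axis=1) == 0)
zero_cols = np.flatnonzero(X.sum(axis=0) == 0)
print(zero_rows.tolist(), zero_cols.tolist())  # [1] []
```

This also suggests why checking the input for NaN or inf finds nothing: the input itself is finite, and the non-finite values only appear once the zero sums are divided.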
The first answer (removing the all-zero rows) was posted by Geeocode on Nov 18 '18 at 20:37 and edited Nov 19 '18 at 1:13.
The second answer (division by zero as the root cause) was posted by Anony-Mousse on Nov 19 '18 at 6:56.