Text corpus clustering
I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have had pretty limited success. I have tried the following:
I used the Python Natural Language Toolkit (NLTK) to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well: all it gave me was an easy way to calculate the similarity between two text items.
I then looked at GraphAware's NLP library to annotate, enrich and calculate the cosine similarity between each pair of text items. After 4 days of processing similarity I checked the log and found that it would take 1.5 years to finish. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this volume of data.
I then wrote a custom implementation that took the same approach as the GraphAware plugin, but in much simpler form. I used scikit-learn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729000000 relationships! The intention was to take the graph into Gephi, selecting relationships above some similarity threshold X, and use modularity clustering to extract clusters. Processing for this takes around 4 days, which is much better; it is still running and has not finished yet. However, I believe that Gephi has a max edge count of 1M, so I expect this to restrict what I can do.
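One way to keep the relationship count manageable is to apply the threshold while computing the similarities instead of storing every pair first. Cosine similarity is symmetric, so only n(n-1)/2 = 364,486,500 of the pairs are distinct, and most of those would fall below any useful threshold. The following is a minimal sketch under assumptions, not the implementation described above: the function name `similarity_edges` and the `threshold` and `chunk_size` values are hypothetical and would need tuning on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def similarity_edges(texts, threshold=0.3, chunk_size=1000):
    """Return (i, j, similarity) for pairs with i < j and similarity >= threshold.

    Hypothetical helper: threshold and chunk_size are illustrative defaults.
    """
    # TfidfVectorizer L2-normalises each row by default, so the dot product
    # of two rows is their cosine similarity.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    edges = []
    for start in range(0, tfidf.shape[0], chunk_size):
        # Sparse product of one chunk of rows against all rows; pairs that
        # share no terms are never materialised.
        block = (tfidf[start:start + chunk_size] @ tfidf.T).tocoo()
        rows = block.row + start
        # Keep each unordered pair once (i < j) and drop weak similarities.
        keep = (rows < block.col) & (block.data >= threshold)
        edges.extend(zip(rows[keep].tolist(),
                         block.col[keep].tolist(),
                         block.data[keep].tolist()))
    return edges
```

The resulting edge list could then be written to Neo4j or exported for Gephi; raising `threshold` is the simplest lever for staying under an edge limit.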
So I turned to more conventional ML techniques using scikit-learn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clusters, but when they are plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
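A scatter chart of TF-IDF vectors squeezed into two dimensions can look unseparated even when the clusters are reasonable, so a numeric check may be more informative than the plot. The sketch below is one possible setup, not the code used above: it reduces TF-IDF with TruncatedSVD (latent semantic analysis), runs KMeans, and reports a silhouette score. The function name `cluster_texts` and the values of `n_clusters` and `n_components` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_texts(texts, n_clusters=2, n_components=2, seed=0):
    """Cluster texts with TF-IDF -> LSA -> KMeans; return (labels, silhouette).

    Hypothetical helper: n_clusters and n_components are illustrative and
    n_components must be smaller than the vocabulary size.
    """
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Reduce the sparse TF-IDF matrix to a few dense dimensions, then
    # re-normalise so Euclidean KMeans behaves like cosine clustering.
    lsa = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = lsa.fit_transform(tfidf)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(reduced)
    # Silhouette ranges from -1 to 1; values near 0 suggest overlapping clusters.
    score = silhouette_score(reduced, labels, metric="cosine")
    return labels, score
```

On a real corpus of this size something like `n_components=100` is a common starting point, and comparing the score across several values of `n_clusters` gives a rough idea of whether any distinct grouping exists.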
So my questions are:
- Is there a better approach to this?
- Can I expect to find distinct clusters at all in free text?
- What should my next move be?
Thank you very much.
machine-learning nlp cluster-analysis
You have not defined what you mean by similarity. Are documents which contain the same words but in different order "similar"? What about a document which is a fragment of another? What about character level -- same letters, different order?
– tripleee
Nov 21 '18 at 11:29
Maybe your visualization is worse than your clusters?
– Anony-Mousse
Nov 23 '18 at 7:53
asked Nov 18 '18 at 6:39 by Doug, edited Nov 19 '18 at 3:14
1 Answer
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. If you're not familiar with any topic models, start with LDA (latent Dirichlet allocation).
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
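To illustrate the MinHash idea without depending on any particular library, here is a small pure-Python sketch. It estimates the Jaccard similarity of two texts from the fraction of matching signature slots. The function names, the word-shingle size `k`, and `num_hashes` are all assumptions chosen for illustration; a production setup would more likely use a dedicated library together with locality-sensitive hashing.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (whole text if shorter than k)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Return one minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_hashes):
        # A different salt per slot simulates an independent hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots, which approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Usage would look like `estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))`. Because signatures have a fixed length, comparing them is cheap even when the original texts are long.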
Thank you. The data is free text responses to questions in surveys. I don't know what clusters I will find, but I am hoping to find responses which are of a similar nature so that I can then compare those with other data that we have and see if there is correlation. It's a purely exploratory process and I'm not looking for anything specific at this point. I'm looking for responses that are saying basically the same kind of thing. It may be that this isn't even possible as the responses are all to the same question, so it may just be that they are just all very similar!
– Doug
Nov 23 '18 at 3:11
answered Nov 21 '18 at 11:20 by polm23