Text corpus clustering



























I have 27000 free-text elements, each around 2-3 sentences long. I need to cluster these by similarity. So far I have had pretty limited success. I have tried the following:



I used the Python Natural Language Toolkit (NLTK) to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well, and only got me as far as easily calculating the similarity between two text items.
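For reference, the preprocessing described above can be sketched roughly as follows. The function names are my own, and the WordNet-synonym step is just one plausible reading of "semantically similar words"; it assumes the NLTK punkt, stopwords, and wordnet data have been downloaded.

```python
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup (run once): nltk.download("punkt"),
# nltk.download("stopwords"), nltk.download("wordnet")

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and lemmatize a short text."""
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop]

def expand_with_synonyms(tokens):
    """Add WordNet synonyms for each token -- the 'semantically similar words' step."""
    expanded = set(tokens)
    for token in tokens:
        for synset in wordnet.synsets(token):
            expanded.update(lemma.name() for lemma in synset.lemmas())
    return expanded
```

The synonym expansion is what inflates the vocabulary here; whether that helps or hurts downstream similarity depends heavily on the corpus.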



I then looked at GraphAware's NLP library to annotate, enrich, and calculate the cosine similarity between each text item. After 4 days of similarity processing, I checked the log and found it would take 1.5 years to complete. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this volume of data.



I then wrote a custom implementation that took the same approach as the GraphAware plugin, but in much simpler form. I used scikit-learn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item, and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729,000,000 relationships (about half that if each unordered pair is stored only once)! The intention was to load the graph into Gephi, select relationships above some similarity threshold X, and use modularity clustering to extract clusters. Processing time for this is around 4 days, which is much better; it is incomplete and currently running. However, I believe Gephi has a maximum edge count of around 1M, so I expect this to restrict what I can do.
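One way to avoid materialising all 729 million relationships is to threshold the similarities before writing anything to Neo4j, keeping each unordered pair only once. A minimal sketch of that idea (the three toy texts and the 0.5 threshold are made up for illustration; at 27000 documents the dense similarity matrix is roughly 5.8 GB as float64, so computing it in row chunks would be advisable):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox leaps over a lazy dog",
    "completely unrelated sentence about databases",
]

vectors = TfidfVectorizer().fit_transform(texts)  # sparse (n_docs, n_terms)
sim = cosine_similarity(vectors)                  # dense (n_docs, n_docs)

threshold = 0.5  # arbitrary cut-off, chosen for illustration
# Upper triangle only: each unordered pair once, no self-similarities.
pairs = [(i, j, sim[i, j])
         for i in range(len(texts))
         for j in range(i + 1, len(texts))
         if sim[i, j] > threshold]

for i, j, s in pairs:
    print(f"texts {i} and {j}: similarity {s:.2f}")
```

Only the surviving pairs would then be written to the graph, which keeps the edge count within what Gephi can handle.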



So I turned to more conventional ML techniques using scikit-learn's KMeans, DBSCAN, and MeanShift algorithms. I do get cluster assignments, but when they are plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:



[Scatter plot of the DBSCAN result: a single undifferentiated cloud of points with no visible cluster separation.]



I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
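One possible reason for the lack of visible separation is the plot itself: high-dimensional TF-IDF vectors have to be squashed into 2-D somehow, and clusters that are separable in the full space can overlap completely in a naive 2-D projection. A sketch of reducing dimensionality with TruncatedSVD before clustering (the toy texts, 2 components, and 3 clusters are illustrative only; on real data something like 100 components is a more common starting point, with the 2-D projection used just for plotting):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

texts = [
    "cheap flights to london from paris",
    "flight deals to london from paris",
    "best pizza recipe with homemade dough",
    "homemade pizza dough recipe",
    "python list comprehension syntax",
    "python syntax for list slicing",
]

X = TfidfVectorizer().fit_transform(texts)

# Reduce the sparse TF-IDF matrix to a dense low-rank space before
# clustering and plotting.
svd = TruncatedSVD(n_components=2, random_state=0)
X2 = svd.fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
print(labels)
```

If the clusters still overlap after a reduction like this, the problem is more likely the representation (TF-IDF on very short texts) than the clustering algorithm.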



So my questions are:




  • Is there a better approach to this?

  • Can I expect to find distinct clusters at all in free text?

  • What should my next move be?


Thank you very much.










  • You have not defined what you mean by similarity. Are documents which contain the same words but in different order "similar"? What about a document which is a fragment of another? What about character level -- same letters, different order?

    – tripleee
    Nov 21 '18 at 11:29













  • Maybe your visualization is worse than your clusters?

    – Anony-Mousse
    Nov 23 '18 at 7:53


















machine-learning nlp cluster-analysis






edited Nov 19 '18 at 3:14 by Doug

















asked Nov 18 '18 at 6:39 by Doug

1 Answer
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...





  1. What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?


Example useful description:




I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.





  2. Have you tried topic modelling? It's not fancy, but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.


  3. Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
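A minimal sketch of the topic-modelling suggestion, using scikit-learn's LatentDirichletAllocation (gensim is another common choice). The four toy documents and the choice of 2 topics are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the service was slow and the staff were rude",
    "rude staff and very slow service",
    "great price and fast delivery",
    "delivery was fast and the price was great",
]

# LDA works on raw term counts, not TF-IDF.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # (n_docs, n_topics) mixture weights

# Assign each document to its dominant topic -- a crude "cluster" label.
labels = doc_topics.argmax(axis=1)
print(labels)
```

On a real corpus you would try a range of topic counts and inspect the top words per topic (`lda.components_`) to judge whether the topics are meaningful.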







  • Thank you. The data is free-text responses to survey questions. I don't know what clusters I will find, but I am hoping to find responses of a similar nature so that I can compare them with other data we have and see if there is any correlation. It's a purely exploratory process and I'm not looking for anything specific at this point; I'm looking for responses that are saying basically the same kind of thing. It may be that this isn't even possible, since the responses are all to the same question, so they may all just be very similar!

    – Doug
    Nov 23 '18 at 3:11











answered Nov 21 '18 at 11:20 by polm23
