Using online LDA to predict on test data
I am using online LDA for a topic modeling task. My code is based on the implementation that accompanies the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation," NIPS, 2010), which is available at https://github.com/blei-lab/onlineldavb.
I train on a set of ~167,000 documents. The code writes lambda files as output, which I use to print the topics (printtopics.py from https://github.com/wellecks/online_lda_python). However, I am not sure how to use the trained model to find topics for new test data (similar to model.get_document_topics in gensim).
How can I infer topics for unseen documents from the lambda output?
python algorithm lda topic-modeling dirichlet
edited Nov 9 at 5:01
asked Nov 7 at 15:46
Vishnu
2 Answers
Apply the same preprocessing steps (tokenization etc.) to the test data that you applied to the training data, then use your training vocabulary to transform the test data into a gensim corpus.
Once you have the test corpus, use the trained LDA model to find the document-topic distribution. Hope this helps.
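The vocabulary-mapping step can be sketched without any library. This is a minimal illustration, not the preprocessing of either repository: the helper name `doc_to_bow`, the lowercase/letters-only tokenizer, and the toy vocabulary are all assumptions, and the tokenization must match whatever was actually used at training time. Words that are not in the training vocabulary are dropped, because the trained model has no column for them.

```python
import re
from collections import Counter


def doc_to_bow(text, vocab):
    """Map raw text to (word_ids, word_counts) using the TRAINING vocabulary.

    vocab is a dict from word to integer id. Out-of-vocabulary words are
    ignored. The tokenizer here (lowercase, runs of a-z) is an assumption.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    ids = sorted(counts)
    return ids, [counts[i] for i in ids]


# Hypothetical training vocabulary, in the order the training code saw it.
vocab = {"topic": 0, "model": 1, "word": 2}

ids, cts = doc_to_bow("Topic model, topic WORD unseen", vocab)
```

Each test document then becomes a pair of parallel lists (ids and counts), which is the bag-of-words form that topic inference operates on.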
answered Nov 9 at 6:51
Atendra
The code you already have is enough to do this. What you have is lambda (the word-topic matrix); what you want to compute is gamma (the document-topic matrix).
All you need to do is call OnlineLDA.do_e_step on the documents; the results are the topic vectors. Performance might be improved by stripping the sstats computation out of it, as those are only needed to update lambda. The result would be a function that only infers the topic vectors for the model.
You don't need to update the model, since you aren't training it; updating is what update_lambda does after calling do_e_step.
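A stripped-down inference function along these lines might look like the sketch below. It is a standalone NumPy/SciPy version of the variational E-step from the paper, not the repository's do_e_step: the function name `infer_gamma`, its parameters, the deterministic initialization of gamma to ones, and the toy lambda matrix are all assumptions made for illustration. Documents are given as (word_ids, word_counts) pairs indexed by the training vocabulary, and alpha should be the value used during training.

```python
import numpy as np
from scipy.special import psi  # digamma function


def dirichlet_expectation(a):
    """E[log x] for x ~ Dirichlet(a); applied row-wise when a is 2-D."""
    if a.ndim == 1:
        return psi(a) - psi(a.sum())
    return psi(a) - psi(a.sum(axis=1))[:, np.newaxis]


def infer_gamma(lam, docs, alpha, max_iter=100, tol=1e-3):
    """Infer gamma (documents x topics) with lambda held fixed.

    lam  : (K, W) array, the trained lambda.
    docs : list of (word_ids, word_counts) pairs.
    No sufficient statistics are accumulated, so the model is not updated.
    """
    K = lam.shape[0]
    exp_elog_beta = np.exp(dirichlet_expectation(lam))  # (K, W)
    gamma = np.ones((len(docs), K))  # assumption: fixed init, not random
    for d, (ids, cts) in enumerate(docs):
        ids = np.asarray(ids, dtype=int)
        cts = np.asarray(cts, dtype=float)
        beta_d = exp_elog_beta[:, ids]  # (K, n_d)
        gamma_d = gamma[d]
        for _ in range(max_iter):
            last = gamma_d
            exp_elog_theta = np.exp(dirichlet_expectation(gamma_d))
            # Normalizer of phi for each distinct word in the document.
            phinorm = exp_elog_theta @ beta_d + 1e-100
            gamma_d = alpha + exp_elog_theta * (beta_d @ (cts / phinorm))
            if np.mean(np.abs(gamma_d - last)) < tol:
                break
        gamma[d] = gamma_d
    return gamma


# Toy lambda: topic 0 favours words 0 and 1, topic 1 favours words 2 and 3.
lam = np.array([[50.0, 50.0, 1.0, 1.0],
                [1.0, 1.0, 50.0, 50.0]])
docs = [([0, 1], [3, 2]), ([2, 3], [1, 3])]

gamma = infer_gamma(lam, docs, alpha=0.5)
theta = gamma / gamma.sum(axis=1, keepdims=True)  # topic proportions per doc
```

Normalizing each row of gamma gives the expected topic proportions for that document, which is roughly what get_document_topics reports in gensim. If the lambda files were written with numpy.savetxt, numpy.loadtxt should read them back into the (K, W) array expected here; that is an assumption about the file format worth checking against the training script.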
answered Nov 9 at 11:04
Dan D.