Using online LDA to predict on test data











up vote
0
down vote

favorite
1












I am using online LDA to perform some topic modeling task. I am using the core code based on the paper Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. and the code is available at : https://github.com/blei-lab/onlineldavb.



I am using a train set of ~167000 documents. The code generates lambda files as output which I use to generate the topics(https://github.com/wellecks/online_lda_python , printtopics.py).But I am not sure how I can use it to find topics on new test data ( similar to model.get_document_topics in gensim ).
Please help to resolve my confusion.










share|improve this question




























    up vote
    0
    down vote

    favorite
    1












    I am using online LDA to perform some topic modeling task. I am using the core code based on the paper Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. and the code is available at : https://github.com/blei-lab/onlineldavb.



    I am using a train set of ~167000 documents. The code generates lambda files as output which I use to generate the topics(https://github.com/wellecks/online_lda_python , printtopics.py).But I am not sure how I can use it to find topics on new test data ( similar to model.get_document_topics in gensim ).
    Please help to resolve my confusion.










    share|improve this question


























      up vote
      0
      down vote

      favorite
      1









      up vote
      0
      down vote

      favorite
      1






      1





      I am using online LDA to perform some topic modeling task. I am using the core code based on the paper Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. and the code is available at : https://github.com/blei-lab/onlineldavb.



      I am using a train set of ~167000 documents. The code generates lambda files as output which I use to generate the topics(https://github.com/wellecks/online_lda_python , printtopics.py).But I am not sure how I can use it to find topics on new test data ( similar to model.get_document_topics in gensim ).
      Please help to resolve my confusion.










      share|improve this question















      I am using online LDA to perform some topic modeling task. I am using the core code based on the paper Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. and the code is available at : https://github.com/blei-lab/onlineldavb.



      I am using a train set of ~167000 documents. The code generates lambda files as output which I use to generate the topics(https://github.com/wellecks/online_lda_python , printtopics.py).But I am not sure how I can use it to find topics on new test data ( similar to model.get_document_topics in gensim ).
      Please help to resolve my confusion.







      python algorithm lda topic-modeling dirichlet






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 9 at 5:01

























      asked Nov 7 at 15:46









      Vishnu

      2716




      2716
























          2 Answers
          2






          active

          oldest

          votes

















          up vote
          0
          down vote













          Follow same data processing steps on test data i.e Tokenization etc and then use your training data vocab to transform test data into gensim corpus.



          Once you have test corpus use LDA to find document- topic distribution. Hope this helps.






          share|improve this answer




























            up vote
            0
            down vote













            In the code you already have there is enough to do this. What you have is the lambda (the word-topic matrix), what you want to compute is the gamma (the document-topic matrix).



            All you need to do is call OnlineLDA.do_e_step on the documents, the results are the topic vectors. Performance might be improved by stripping out the sstats from it as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.



            You don't need to update the model as you aren't training it which is what update_lambda does after calling do_e_step.






            share|improve this answer





















              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














               

              draft saved


              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53192902%2fusing-online-lda-to-predict-on-test-data%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes








              up vote
              0
              down vote













              Follow same data processing steps on test data i.e Tokenization etc and then use your training data vocab to transform test data into gensim corpus.



              Once you have test corpus use LDA to find document- topic distribution. Hope this helps.






              share|improve this answer

























                up vote
                0
                down vote













                Follow same data processing steps on test data i.e Tokenization etc and then use your training data vocab to transform test data into gensim corpus.



                Once you have test corpus use LDA to find document- topic distribution. Hope this helps.






                share|improve this answer























                  up vote
                  0
                  down vote










                  up vote
                  0
                  down vote









                  Follow same data processing steps on test data i.e Tokenization etc and then use your training data vocab to transform test data into gensim corpus.



                  Once you have test corpus use LDA to find document- topic distribution. Hope this helps.






                  share|improve this answer












                  Follow same data processing steps on test data i.e Tokenization etc and then use your training data vocab to transform test data into gensim corpus.



                  Once you have test corpus use LDA to find document- topic distribution. Hope this helps.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 9 at 6:51









                  Atendra

                  1318




                  1318
























                      up vote
                      0
                      down vote













                      In the code you already have there is enough to do this. What you have is the lambda (the word-topic matrix), what you want to compute is the gamma (the document-topic matrix).



                      All you need to do is call OnlineLDA.do_e_step on the documents, the results are the topic vectors. Performance might be improved by stripping out the sstats from it as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.



                      You don't need to update the model as you aren't training it which is what update_lambda does after calling do_e_step.






                      share|improve this answer

























                        up vote
                        0
                        down vote













                        In the code you already have there is enough to do this. What you have is the lambda (the word-topic matrix), what you want to compute is the gamma (the document-topic matrix).



                        All you need to do is call OnlineLDA.do_e_step on the documents, the results are the topic vectors. Performance might be improved by stripping out the sstats from it as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.



                        You don't need to update the model as you aren't training it which is what update_lambda does after calling do_e_step.






                        share|improve this answer























                          up vote
                          0
                          down vote










                          up vote
                          0
                          down vote









                          In the code you already have there is enough to do this. What you have is the lambda (the word-topic matrix), what you want to compute is the gamma (the document-topic matrix).



                          All you need to do is call OnlineLDA.do_e_step on the documents, the results are the topic vectors. Performance might be improved by stripping out the sstats from it as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.



                          You don't need to update the model as you aren't training it which is what update_lambda does after calling do_e_step.






                          share|improve this answer












                          In the code you already have there is enough to do this. What you have is the lambda (the word-topic matrix), what you want to compute is the gamma (the document-topic matrix).



                          All you need to do is call OnlineLDA.do_e_step on the documents, the results are the topic vectors. Performance might be improved by stripping out the sstats from it as those are only needed to update the lambda. The result would be a function that only infers the topic vectors for the model.



                          You don't need to update the model as you aren't training it which is what update_lambda does after calling do_e_step.







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Nov 9 at 11:04









                          Dan D.

                          50.5k107999




                          50.5k107999






























                               

                              draft saved


                              draft discarded



















































                               


                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53192902%2fusing-online-lda-to-predict-on-test-data%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              這個網誌中的熱門文章

                              Tangent Lines Diagram Along Smooth Curve

                              Yusuf al-Mu'taman ibn Hud

                              Zucchini