How to implement Latent Dirichlet Allocation in regression analysis





I have a dataset of hotel reviews, ratings, and other features such as traveler type and review word count. I want to perform topic modeling (LDA) and use the topics derived from the reviews, along with the other features, to identify which features most affect the ratings (with rating as the dependent variable).

If I want to use linear regression for this, do I have to label each review with the derived topics? Is there a way to do this in R, or will I have to label each review manually?
(I am new to text mining and data science in general.)










      r linear-regression lda topic-modeling






      asked Nov 23 '18 at 20:24









      allmine
























          1 Answer
The short answer: you don't have to label each review by hand. The topic model you train determines the topics of each review, and its output is then used to construct features for your regression model.



There is a good explanation of topic modeling, with code samples in R, at www.tidytextmining.com/topicmodeling.html. Sections 6.2.1 and 6.2.2 should help you get started quickly.



Keep the following two principles in mind:

• Every document (hotel review) is a mixture of topics

• Every topic is a mixture of words

Once a topic model has been trained on the reviews, then for every review:

• the document-topic probabilities can be used as features

• the top N terms within each topic can be used to build a document-term matrix (each review mapped to zero or more of the top terms), which can then be used as additional features
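As a rough illustration of the first bullet, here is a minimal sketch that fits an LDA model and pulls out the document-topic probabilities. It assumes the `tm` and `topicmodels` packages are installed; the six toy reviews, the choice of `k = 2`, and the column names are made up for the example and are not from your data.

```r
library(tm)           # corpus handling and DocumentTermMatrix()
library(topicmodels)  # LDA(), posterior(), terms()

# Hypothetical toy reviews standing in for the real review text column.
review_text <- c(
  "great location close to the train station and shopping",
  "friendly reception staff and quick late checkout",
  "clean room with tasteful decor",
  "pool wifi and fitness centre were excellent",
  "convenient location but the room was not clean",
  "staff were professional and the pool was great"
)

# Build a document-term matrix: one row per review, one column per term.
corpus <- VCorpus(VectorSource(review_text))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)

# k (the number of topics) is a modelling choice; 2 only suits this toy corpus.
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Document-topic probabilities: one row per review, one column per topic.
doc_topics <- posterior(lda_fit)$topics
colnames(doc_topics) <- paste0("topic_", seq_len(ncol(doc_topics)), "_probability")
print(round(doc_topics, 3))

# Top 3 terms per topic, useful for naming topics and choosing indicator terms.
print(terms(lda_fit, 3))
```

Each row of `doc_topics` sums to 1, so every review gets a score for every topic without any manual labelling. The tidytext chapter linked above shows an equivalent tidy-data route to the same probabilities.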


A simplified example: suppose the reviews broadly fall under 4 topics.

• Topic 1 might be about location (top terms: convenient, location, train_station, walk_distance, shopping, etc.)

• Topic 2 might be about hotel staff (top terms: reception, friendly, professional, quick, late_checkout, etc.)

• Topic 3 might be about hotel rooms (top terms: clean_room, decor, tasteful, etc.)

• Topic 4 might be about hotel amenities (top terms: pool, wifi, fitness_centre, etc.)


The document-topic probabilities, combined with the top terms of each topic, can be turned into features such as:

• topic_1_location_probability

• topic_2_hotel_staff_probability

• topic_3_hotel_room_probability

• topic_4_hotel_amenities_probability

• is_convenient_location

• is_train_station_nearby

• is_walk_distance

• is_clean

• is_late_checkout

• is_fitness_centre

• etc.
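To make the regression step concrete, here is a minimal base-R sketch that puts features like those listed above into one data frame and fits a linear model. Everything in it is simulated purely for illustration: the topic probabilities, the `is_clean` flag, `word_count`, `traveler_type`, and the ratings are assumptions, not results from real reviews.

```r
set.seed(42)
n <- 200

# Simulated document-topic probabilities (each row sums to 1),
# standing in for the output of a trained topic model.
raw <- matrix(rgamma(n * 4, shape = 0.5), ncol = 4)
topic_probs <- raw / rowSums(raw)
colnames(topic_probs) <- c(
  "topic_1_location_probability",
  "topic_2_hotel_staff_probability",
  "topic_3_hotel_room_probability",
  "topic_4_hotel_amenities_probability"
)

# Combine topic features with a term indicator and the other review features.
features <- data.frame(
  topic_probs,
  is_clean = rbinom(n, 1, 0.4),
  word_count = rpois(n, 80),
  traveler_type = factor(sample(c("business", "couple", "family"), n, replace = TRUE))
)

# Simulated rating, only so that the model has something to fit.
features$rating <- 3 +
  1.5 * features$topic_3_hotel_room_probability +
  0.5 * features$is_clean +
  rnorm(n, sd = 0.5)

# The topic probabilities sum to 1, so one topic is left out as the baseline;
# keeping all four alongside the intercept would make them perfectly collinear.
fit <- lm(rating ~ . - topic_4_hotel_amenities_probability, data = features)
print(summary(fit)$coefficients)
```

The coefficients on the remaining topic columns are then read relative to the omitted topic, which matters when comparing which features affect the rating most.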


For newer reviews:

• The example above shows how the initial training dataset, on which you train your models, would be created.

• For newer reviews (i.e. ones not previously used for training the models) you don't have to repeat the entire exercise above. Instead, the trained topic model can be used to identify the topics of previously unseen documents (reviews). The answers to the linked question have sample code to help do this.
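One way this could look in R is sketched below, assuming the `tm` and `topicmodels` packages. The toy training and new reviews are made up, and the approach shown (rebuilding the document-term matrix for new reviews with the training vocabulary, then calling `posterior()` with `newdata`) is one common pattern rather than the only option.

```r
library(tm)
library(topicmodels)

# Hypothetical training reviews and previously unseen reviews.
train_text <- c(
  "great location close to the train station and shopping",
  "friendly reception staff and quick late checkout",
  "clean room with tasteful decor",
  "pool wifi and fitness centre were excellent",
  "convenient location but the room was not clean",
  "staff were professional and the pool was great"
)
new_text <- c(
  "the pool and wifi were great",
  "very clean room and friendly staff"
)

prep <- list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)

# Train the topic model once on the training reviews.
dtm_train <- DocumentTermMatrix(VCorpus(VectorSource(train_text)), control = prep)
lda_fit <- LDA(dtm_train, k = 2, control = list(seed = 1234))

# Restrict the new document-term matrix to the training vocabulary so that
# its columns line up with the terms the model knows about.
dtm_new <- DocumentTermMatrix(
  VCorpus(VectorSource(new_text)),
  control = c(prep, list(dictionary = Terms(dtm_train)))
)
stopifnot(identical(Terms(dtm_new), Terms(dtm_train)))

# Infer topic probabilities for the unseen reviews without refitting the model.
new_doc_topics <- posterior(lda_fit, newdata = dtm_new)$topics
print(round(new_doc_topics, 3))
```

A new review that shares no terms with the training vocabulary would have an empty row, so it is worth filtering those out before calling `posterior()`.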


          I hope this helps you.






                answered Nov 24 '18 at 1:48









                user799188































