Using online LDA to predict on test data

I am using online LDA for a topic modeling task. My code is based on the original online LDA paper (Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation," NIPS, 2010), and the code is available at https://github.com/blei-lab/onlineldavb.

My training set has ~167,000 documents. The code writes lambda files as output, which I use to print the topics (printtopics.py from https://github.com/wellecks/online_lda_python). But I am not sure how to use the trained model to find topics for new test data (similar to model.get_document_topics in gensim).
How can I do this?

edited Nov 9 at 5:01

asked Nov 7 at 15:46

Vishnu

2716

python algorithm lda topic-modeling dirichlet
2 Answers
Apply the same preprocessing steps (tokenization etc.) to the test data that you applied to the training data, then use the training vocabulary to transform the test data into a gensim corpus.

Once you have the test corpus, use the trained LDA model to find the document-topic distribution. Hope this helps.
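
The preprocessing step can be sketched without gensim as well. The snippet below is a minimal illustration, not the code from either repository: the function names `build_vocab_index` and `doc_to_bow`, the regex tokenizer, and the four-word vocabulary are all assumptions made for the example. The key point is that words missing from the training vocabulary are dropped, and whatever tokenization was used for training should be reused here.

```python
import re
from collections import Counter


def build_vocab_index(vocab_words):
    """Map each training-vocabulary word to its column index."""
    return {word: idx for idx, word in enumerate(vocab_words)}


def doc_to_bow(text, vocab_index):
    """Convert raw text to (word_ids, word_counts) over the training vocabulary.

    The lowercase/letters-only tokenizer is an assumption; use the same
    tokenization that produced the training data.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words that are not in the training vocabulary are silently dropped.
    counts = Counter(vocab_index[t] for t in tokens if t in vocab_index)
    word_ids = sorted(counts)
    word_counts = [counts[i] for i in word_ids]
    return word_ids, word_counts


# Hypothetical training vocabulary, in the same order as the lambda columns.
vocab_index = build_vocab_index(["topic", "model", "word", "document"])
word_ids, word_counts = doc_to_bow(
    "A topic model assigns each word a topic; unknown tokens vanish.",
    vocab_index,
)
```

The resulting id and count lists are the sparse bag-of-words form that a variational E-step works on.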

answered Nov 9 at 6:51

Atendra

1318

The code you already have is enough to do this. What you have is lambda (the word-topic matrix); what you want to compute is gamma (the document-topic matrix).

All you need to do is call OnlineLDA.do_e_step on the documents; the results are the topic vectors. Performance might be improved by stripping the sstats computation out of it, since the sufficient statistics are only needed to update lambda. The result would be a function that only infers the topic vectors for the model.

You don't need to update the model, since you aren't training it; updating is what update_lambda does after calling do_e_step.
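
An inference-only E-step along these lines could look like the sketch below. It is a standalone illustration, not the code from onlineldavb: the names `digamma`, `exp_elog_dirichlet`, `infer_gamma` and `topic_proportions` are hypothetical, the toy lambda matrix is made up, and the deterministic gamma initialization is a simplification. It uses only the standard library for portability, whereas a practical version would be vectorized. `alpha` should be the value used during training.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def exp_elog_dirichlet(row):
    """exp(E[log p_i]) for p ~ Dirichlet(row)."""
    psi_total = digamma(sum(row))
    return [math.exp(digamma(v) - psi_total) for v in row]


def infer_gamma(word_ids, word_counts, lam, alpha, max_iter=100, tol=1e-3):
    """Variational E-step for one document with lambda held fixed.

    lam is a list of K rows (topics), each with one entry per vocabulary word.
    No sufficient statistics are accumulated, so the model is never updated.
    """
    num_topics = len(lam)
    # Keep only the columns for words that occur in this document.
    beta_d = [[row_exp[w] for w in word_ids]
              for row_exp in (exp_elog_dirichlet(row) for row in lam)]
    # Assumed deterministic start: spread the token mass evenly over topics.
    gamma = [alpha + sum(word_counts) / num_topics] * num_topics
    for _ in range(max_iter):
        theta_exp = exp_elog_dirichlet(gamma)
        # Normalizer of phi for each distinct word in the document.
        phinorm = [sum(theta_exp[k] * beta_d[k][i] for k in range(num_topics)) + 1e-100
                   for i in range(len(word_ids))]
        new_gamma = [alpha + theta_exp[k] * sum(c / z * b for c, z, b
                                                in zip(word_counts, phinorm, beta_d[k]))
                     for k in range(num_topics)]
        change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / num_topics
        gamma = new_gamma
        if change < tol:
            break
    return gamma


def topic_proportions(gamma):
    """Normalize gamma to the expected topic distribution of the document."""
    total = sum(gamma)
    return [g / total for g in gamma]


# Toy lambda: topic 0 favors words 0 and 1, topic 1 favors words 2 and 3.
lam = [[10.0, 10.0, 0.1, 0.1],
       [0.1, 0.1, 10.0, 10.0]]
gamma = infer_gamma([0, 1], [3, 2], lam, alpha=0.5)
theta = topic_proportions(gamma)
```

Normalizing gamma gives per-document topic proportions comparable to what get_document_topics returns in gensim. If the lambda file from your run is plain whitespace-separated text with one row per topic (an assumption about how it was saved), it can be read into a list of rows and passed as `lam`.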

answered Nov 9 at 11:04

Dan D.

50.5k107999
