NLTK MWETokenizer is failing to extract/Tag value

I am using NLTK's MWETokenizer to tag multi-word expressions. Here is my sample code:

import nltk
import pickle
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import MWETokenizer

# Initializing the WordNet lemmatizer
lmtzr = WordNetLemmatizer()

# Values to tag/extract
values = ["net income","net income (loss)" 

,"net income (loss) attributable to 'company'","net income (loss) attributable to bank","net income (loss) attributable to bank and noncontrolling (minority) interests","net income (loss) attributable to bank and noncontrolling interests","net income (loss) attributable to bank and noncontrolling minority interests","net income (loss) attributable to noncontrolling interests","net income after tax","net income associated to minority interests","net income associated to partners","net income attributable to \"company name\"","accumulated distributions in excess of net income","antidilutive securities excluded from computation of net income, per outstanding unit, amount","cash from net income","consolidated net income attributable to foreign offices","consolidated net income in foreign offices","consolidated net income of foreign offices","decrease in net income","diluted net income","diluted net income attributable to common shareholders","diluted net income per share","eliminations of net income to foreign offices","foreign offices consolidated net income","foreign offices net income","foreign offices net income before internal allocation of income and expense","income excluded from net income","increase in net income","less net income attributable to noncontrolling interests", 

"less: net income (loss) attributable income taxes to noncontrolling (minority) interests","net income attributable to bank","net income attributable to bank and minority interests","net income attributable to class a and class b common stockholders","net income attributable to common shareholders (in dollars per share) diluted","net income attributable to company", 

"net income attributable to foreign offices","net income attributable to income taxes","net income attributable to noncontrolling interests","net income attributable to noncontrolling parties","net income attributable to participating securities","net income attributed to bank","net income basic earnings","net income before given to non-controlling interests","net income before minority controlling interests","net income before non-controlling interests","net income diluted to common shareholders","net income from cash flows","net income from operations in cash flow","net income generated from investment","net income generated from joint venture partnerships","net income in foreign offices","net income including noncontrolling interests","net income loss attributable to bank and noncontrolling minority interests", 

"net income of interest", 

"net income of foreign offices before allocations","net income of interest","net income or loss attributable to bank and noncontrolling (minority) interests","net income or loss attributable to bank and noncontrolling interests","net income or loss attributed to bank","net income per share - basic (in usd per share)","net income per share - diluted (in usd per share)","net income per share diluted","net income to all parties","net income to foreign offices before internal allocations of income and expense","net income/loss to 'company name'","net income/loss to noncontrolling interests","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2012-01-31)", 

"other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, net of tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2012-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2013-01-31)","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, availabe-for-sale securities, before tax","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, held-to-maturity securities, before tax", 

"revision of net income to foreign offices","net income of foreign offices"

]



# Initializing MWETokenizer with a starter value
tokenizer = MWETokenizer([('total', 'expense')])

# Populating the tokenizer with one word tuple per expression
for item in values:
    tokenizer.add_mwe(item.split())

# Sample target sentence
sentence = 'what is the net incomes of banks of america for q2 2014'

# Splitting into tokens for the lemmatizer
tokens = sentence.split()

# Changing nouns to singular
singles = [lmtzr.lemmatize(plural, 'n') for plural in tokens]

# Trying extraction/tagging on the lemmatized tokens
result = tokenizer.tokenize(singles)

print(result)

# Result:
# Actual:   ['what', 'is', 'the', 'net', 'income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']
# Expected: ['what', 'is', 'the', 'net_income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']

The first value in my list is "net income", and the other values all contain "net income". Contrary to my expectation, the tokenizer fails to recognize the first value for some reason.

Is there a limitation that I don't know about?
How do I debug this?
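
One way to narrow this down is to reproduce the matching step outside NLTK. The sketch below is a hypothetical stand-in, not NLTK's actual code: `build_trie` and `greedy_tokenize` are illustrative names, and the sketch assumes the matcher walks a word trie as far as it can and accepts a match only if the walk stops on a complete expression. Under that assumption, registering a longer entry that shares the prefix "net income of" is enough to lose the shorter "net income" match:

```python
# Hypothetical stand-in for a trie-based MWE matcher; not NLTK's actual code.
LEAF = object()  # marks the end of a complete expression


def build_trie(expressions):
    """Build a nested-dict trie from whitespace-separated expressions."""
    root = {}
    for expression in expressions:
        node = root
        for word in expression.split():
            node = node.setdefault(word, {})
        node[LEAF] = True
    return root


def greedy_tokenize(tokens, trie, separator="_"):
    """Walk as far as the trie allows; accept only if the walk ends on a LEAF."""
    result, i = [], 0
    while i < len(tokens):
        node, j = trie, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
        if j > i and LEAF in node:
            result.append(separator.join(tokens[i:j]))
            i = j
        else:
            # No complete expression where the walk stopped: emit one token.
            result.append(tokens[i])
            i += 1
    return result


tokens = "what is the net income of bank of america".split()

# Only the short expression is registered: "net income" is merged.
only_short = greedy_tokenize(tokens, build_trie(["net income"]))
print(only_short)

# A longer expression shares the prefix "net income of": the walk stops on
# "of", which is not the end of any expression, so nothing is merged.
with_longer = greedy_tokenize(
    tokens, build_trie(["net income", "net income of interest"])
)
print(with_longer)
```

If the real tokenizer shows the same pattern when the list is cut down to these two entries, the overlapping prefixes are the likely cause rather than the lemmatizer or the size of the list.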

Also, if there is another way to do multi-word tagging, please let me know; it would be a great help.
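
As one possible alternative, a small longest-match tokenizer can be written in plain Python. This is a minimal sketch under stated assumptions: `build_trie` and `longest_match_tokenize` are hypothetical helper names, the expression list is a shortened sample of the values above, and the input is assumed to be already split and lemmatized. Unlike a purely greedy walk, it remembers the last position where a complete expression ended and falls back to it when a longer candidate does not pan out:

```python
LEAF = object()  # marks the end of a complete expression


def build_trie(expressions):
    """Build a nested-dict trie from whitespace-separated expressions."""
    root = {}
    for expression in expressions:
        node = root
        for word in expression.split():
            node = node.setdefault(word, {})
        node[LEAF] = True
    return root


def longest_match_tokenize(tokens, trie, separator="_"):
    """Merge the longest registered expression starting at each position."""
    result, i = [], 0
    while i < len(tokens):
        node, j, last_match = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if LEAF in node:
                # Remember where the most recent complete expression ended.
                last_match = j
        if last_match is not None:
            result.append(separator.join(tokens[i:last_match]))
            i = last_match
        else:
            result.append(tokens[i])
            i += 1
    return result


# Shortened sample of the expression list, for illustration only.
sample_values = [
    "net income",
    "net income of interest",
    "net income of foreign offices",
]
tokens = "what is the net income of bank of america for q2 2014".split()
merged = longest_match_tokenize(tokens, build_trie(sample_values))
print(merged)
```

It may also be worth checking whether a newer NLTK release behaves differently on the same input before replacing the tokenizer.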

asked Nov 21 '18 at 7:06

user3870821

112

nlp nltk pos-tagger pos-tagging
