NLTK MWETokenizer is failing to extract/tag a value












I am using NLTK's MWETokenizer to tag multi-word expressions. Here is my sample code:



import nltk
import pickle
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import MWETokenizer

# initializing Wordnet Lemmatizer
lmtzr = WordNetLemmatizer()

# values to tag/extract
values = ["net income","net income (loss)"
,"net income (loss) attributable to 'company'","net income (loss) attributable to bank","net income (loss) attributable to bank and noncontrolling (minority) interests","net income (loss) attributable to bank and noncontrolling interests","net income (loss) attributable to bank and noncontrolling minority interests","net income (loss) attributable to noncontrolling interests","net income after tax","net income associated to minority interests","net income associated to partners",'net income attributable to "company name"',"accumulated distributions in excess of net income","antidilutive securities excluded from computation of net income, per outstanding unit, amount","cash from net income","consolidated net income attributable to foreign offices","consolidated net income in foreign offices","consolidated net income of foreign offices","decrease in net income","diluted net income","diluted net income attributable to common shareholders","diluted net income per share","eliminations of net income to foreign offices","foreign offices consolidated net income","foreign offices net income","foreign offices net income before internal allocation of income and expense","income excluded from net income","increase in net income","less net income attributable to noncontrolling interests",
"less: net income (loss) attributable income taxes to noncontrolling (minority) interests","net income attributable to bank","net income attributable to bank and minority interests","net income attributable to class a and class b common stockholders","net income attributable to common shareholders (in dollars per share) diluted","net income attributable to company",
"net income attributable to foreign offices","net income attributable to income taxes","net income attributable to noncontrolling interests","net income attributable to noncontrolling parties","net income attributable to participating securities","net income attributed to bank","net income basic earnings","net income before given to non-controlling interests","net income before minority controlling interests","net income before non-controlling interests","net income diluted to common shareholders","net income from cash flows","net income from operations in cash flow","net income generated from investment","net income generated from joint venture partnerships","net income in foreign offices","net income including noncontrolling interests","net income loss attributable to bank and noncontrolling minority interests",
"net income of interest",
"net income of foreign offices before allocations","net income of interest","net income or loss attributable to bank and noncontrolling (minority) interests","net income or loss attributable to bank and noncontrolling interests","net income or loss attributed to bank","net income per share - basic (in usd per share)","net income per share - diluted (in usd per share)","net income per share diluted","net income to all parties","net income to foreign offices before internal allocations of income and expense","net income/loss to 'company name'","net income/loss to noncontrolling interests","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2012-01-31)",
"other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, before tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, net of tax (deprecated 2013-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2012-01-31)","other comprehensive income (loss), reclassification adjustment for write-down of securities included in net income, tax (deprecated 2013-01-31)","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, availabe-for-sale securities, before tax","other than temporary impairment losses, investments, reclassification adjustment of noncredit portion included in net income, held-to-maturity securities, before tax",
"revision of net income to foreign offices","net income of foreign offices"
]

# Initializing MWETokenizer with a starter value
tokenizer = MWETokenizer([('total', 'expense')])

# Populating the tokenizer with one multi-word expression per value
for item in values:
    tokenizer.add_mwe(item.split())

# Sample target sample
sentence = 'what is the net incomes of banks of america for q2 2014'

# Splitting into tokens for the lemmatizer
tokens = sentence.split()

# Changing plural nouns to singular
singles = [lmtzr.lemmatize(plural, 'n') for plural in tokens]

# Trying extraction/tagging on the lemmatized tokens
result = tokenizer.tokenize(singles)


print(result)
# Result:
# Actual: ['what', 'is', 'the', 'net', 'income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']
# Expected: ['what', 'is', 'the', 'net_income', 'of', 'bank', 'of', 'america', 'for', 'q2', '2014']


The first value in my list is "net income", and many of the following values also contain "net income". Contrary to my expectation, the tokenizer fails to recognize the first value for some reason.
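A smaller reproduction might help narrow this down. The sketch below assumes the cause could be the longer entries that share the "net income of" prefix with the sentence; that is only a hypothesis, not something confirmed here. It compares a tokenizer that knows only the short phrase against one that also knows a single longer phrase. Whether the two outputs differ may depend on the installed NLTK version.

```python
from nltk.tokenize import MWETokenizer

# A shortened version of the lemmatized sentence.
tokens = ['net', 'income', 'of', 'bank']

# Control: only the short phrase is registered.
short_only = MWETokenizer([('net', 'income')])
control = short_only.tokenize(tokens)

# Same short phrase plus one longer phrase sharing the "net income of" prefix.
with_longer = MWETokenizer([('net', 'income'),
                            ('net', 'income', 'of', 'interest')])
candidate = with_longer.tokenize(tokens)

print(control)
print(candidate)
# True would suggest the longer, partially matching phrase is interfering.
print('longer phrase interferes:', control != candidate)
```

If the last line prints True, the full `values` list is not needed to reproduce the problem, and the two-entry case can be used when reporting it or testing other NLTK versions.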





Is there a limitation that I don't know about?
How do I debug this?
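One way to start debugging, sketched here in plain Python without NLTK, is to list which phrases begin matching at each position of the sentence and how far each match gets. The helper name `prefix_overlaps` and the shortened phrase list are hypothetical, chosen only for illustration.

```python
def prefix_overlaps(tokens, phrases):
    """Return (position, phrase, tokens_matched, is_full_match) tuples
    for every phrase whose leading tokens match the sentence."""
    report = []
    for i in range(len(tokens)):
        for phrase in phrases:
            words = phrase.split()
            matched = 0
            # Walk forward while the sentence and the phrase agree.
            while (matched < len(words)
                   and i + matched < len(tokens)
                   and tokens[i + matched] == words[matched]):
                matched += 1
            if matched > 0:
                report.append((i, phrase, matched, matched == len(words)))
    return report


# A small subset of the values, for illustration only.
phrases = ["net income", "net income of interest", "net income after tax"]
tokens = "what is the net income of bank of america for q2 2014".split()

overlaps = prefix_overlaps(tokens, phrases)
for position, phrase, matched, is_full in overlaps:
    print(position, repr(phrase), matched, 'full' if is_full else 'partial')
```

A phrase that matches partially but further into the sentence than the fully matched one (here "net income of interest") is a candidate for what prevents the shorter phrase from being merged.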



Also, if there is another way to do multi-word tagging, please let me know; it would be a great help.
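As one possible alternative, here is a minimal sketch of a hand-rolled merger that does not use NLTK. It builds a token trie from the phrases and, while walking the sentence, remembers the last position at which a complete phrase ended, so a shorter phrase still matches when a longer one only matches partially. The function names `build_trie` and `merge_mwes` are made up for this example, and the phrase list is a small subset of the values.

```python
def build_trie(phrases):
    """Build a nested-dict trie keyed by token; the key None marks a phrase end."""
    trie = {}
    for phrase in phrases:
        node = trie
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[None] = True  # end-of-phrase marker
    return trie


def merge_mwes(tokens, trie, separator="_"):
    """Merge the longest registered phrase starting at each position."""
    result = []
    i = 0
    while i < len(tokens):
        node = trie
        j = i
        last_match = -1  # end index of the longest complete phrase so far
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                last_match = j
        if last_match > -1:
            result.append(separator.join(tokens[i:last_match]))
            i = last_match
        else:
            result.append(tokens[i])
            i += 1
    return result


phrases = ["net income", "net income of interest", "net income of foreign offices"]
tokens = "what is the net income of bank of america for q2 2014".split()
merged = merge_mwes(tokens, build_trie(phrases))
print(merged)
```

On this input the sketch merges `net` and `income` into `net_income` and leaves the remaining tokens unchanged. It expects already lemmatized tokens, so the lemmatization step from the code above would still be needed.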










      nlp nltk pos-tagger pos-tagging






      asked Nov 21 '18 at 7:06 by user3870821























