Fast removal of low frequency words in Python












1














In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question


















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17


















1














In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question


















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17
















1












1








1







In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question













In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.







python-3.x nlp nltk






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 10 at 20:03









DanielTheRocketMan

1,99242137




1,99242137








  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17
















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17










2




2




make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06






make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06






1




1




also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10




also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10




1




1




because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15






because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15






2




2




@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements
Nov 10 at 20:15






@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements
Nov 10 at 20:15






1




1




@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17






@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17














1 Answer
1






active

oldest

votes


















1














from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    from comments: @Patrick Artner
    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






    share|improve this answer




























      1














      from comments: @Patrick Artner
      make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






      share|improve this answer


























        1












        1








        1






        from comments: @Patrick Artner
        make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






        share|improve this answer














        from comments: @Patrick Artner
        make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with







        share|improve this answer














        share|improve this answer



        share|improve this answer








        answered Nov 10 at 20:31


























        community wiki





        Paritosh Singh































            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            這個網誌中的熱門文章

            Academy of Television Arts & Sciences

            L'Équipe

            1995 France bombings