Python regex remove everything except strings from list


























I have a string:

    bdv. mot. g. vns. kilm.

And a known list of strings:

    important_strings_lst = ['bdv.', 'dktv.', 'mot. g.', 'vyr. g.']

I want a regex that keeps only those strings, giving:

    bdv. mot. g.

I joined the list and tried the following (the idea was borrowed from elsewhere):

    import re
    regex = re.compile(r'\b(?!bdv.|dktv.|mot. g.|vyr. g.)\w+', re.UNICODE)
    regex.sub("", 'bdv. mot. g. vns. kilm.')

I got:

    'bdv. mot. . . .'

Variations of the regex using \s didn't work either. How can I do this?

I could use something like [x for x in important_strings_lst if x in my_string], but I need good performance, because this will be applied with str.replace to a pandas DataFrame with a million rows.
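To see where the unexpected output comes from, here is a minimal sketch that reproduces the attempt, assuming the pattern is read with \b and \w+. It lists what the pattern actually matches: only runs of word characters that do not start a listed string, so the dots are never consumed and the g inside mot. g. is removed as a word of its own.

```python
import re

# The attempt from the question, read with \b and \w+ (an assumption).
attempt = re.compile(r'\b(?!bdv.|dktv.|mot. g.|vyr. g.)\w+')
text = 'bdv. mot. g. vns. kilm.'

# \w+ matches word characters only, so every '.' stays in the string.
removed = attempt.findall(text)
result = attempt.sub('', text)

print(removed)  # ['g', 'vns', 'kilm']
print(result)   # bdv. mot. . . .
```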

















Tags: python, regex, list, replace














asked Nov 10 at 17:37 by Lukas
edited Nov 10 at 17:38 by Sandeep Kadapa
























          2 Answers
The . character has a special meaning in regular expressions (by default it matches any character except a newline). You can use re.escape to make a string "safe" for literal use in a regular expression.

    >>> import re
    >>> important_strings = ['bdv.', 'dktv.', 'mot. g.', 'vyr. g.']
    >>> regex = re.compile('|'.join(re.escape(s) for s in important_strings))
    >>> regex.findall('bdv. mot. g. vns. kilm.')
    ['bdv.', 'mot. g.']

pandas has its own Series.str.findall, which should work like re.findall.
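As a minimal sketch of how the findall result could be turned back into a single string: the helper name keep_important is hypothetical, and sorting the alternatives longest-first is an added precaution that is not part of the answer above.

```python
import re

important_strings = ['bdv.', 'dktv.', 'mot. g.', 'vyr. g.']

# Longest alternatives first, so a shorter entry can never shadow a longer
# one that starts with the same characters (an added precaution).
alternatives = sorted(important_strings, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(s) for s in alternatives))


def keep_important(text):
    """Return only the listed substrings found in text, joined by spaces."""
    return ' '.join(pattern.findall(text))


print(keep_important('bdv. mot. g. vns. kilm.'))  # bdv. mot. g.
print(repr(keep_important('vns. kilm.')))         # ''
```

The same pattern string (pattern.pattern here) should also be usable with Series.str.findall, but that is worth checking on real data.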















answered Nov 10 at 17:51 by Håken Lid (edited Nov 10 at 18:17)








• @perreal, your comment above is not clear; could you please clarify it?
  – pygo Nov 10 at 18:02

• A pandas Series does indeed have a str.findall method, and re.escape takes care of the dots. The result is a list instead of a string, but I may be able to work with that.
  – Lukas Nov 10 at 18:17

• .str.findall('|'.join(re.escape(s) for s in important_strings)).str.join(' ')
  – Lukas Nov 10 at 18:23

• You can benchmark to test whether findall is faster than your original negative lookahead. I try to avoid lookaround assertions in my regular expressions because they are often hard to read and understand, and in some cases they can be very slow if the regex engine is forced to do a lot of backtracking.
  – Håken Lid Nov 10 at 18:38
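The benchmark suggested in the last comment could be sketched with timeit as below. The sample rows are made up (the example string repeated), the function names are hypothetical, and the lookahead pattern assumes the question's regex is read with \b and \w+. Timings depend on the machine and the data, so none are shown.

```python
import re
import timeit

important_strings = ['bdv.', 'dktv.', 'mot. g.', 'vyr. g.']
pattern = re.compile('|'.join(re.escape(s) for s in important_strings))
# The question's attempt, read with \b and \w+ (an assumption).
attempt = re.compile(r'\b(?!bdv.|dktv.|mot. g.|vyr. g.)\w+')

# Made-up sample data; replace with a slice of the real column.
rows = ['bdv. mot. g. vns. kilm.'] * 10_000


def with_findall(texts):
    return [' '.join(pattern.findall(t)) for t in texts]


def with_membership(texts):
    # The list-comprehension idea from the question.
    return [' '.join(s for s in important_strings if s in t) for t in texts]


def with_lookahead(texts):
    # Timed for speed only: its output is not the desired one.
    return [attempt.sub('', t) for t in texts]


for func in (with_findall, with_membership, with_lookahead):
    seconds = timeit.timeit(lambda: func(rows), number=5)
    print(f'{func.__name__}: {seconds:.3f} s for 5 runs')
```

Note that with_findall and with_membership agree on this sample but can differ on other inputs: findall returns matches in the order they occur in the text, while the membership test follows the order of the list.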
















Maybe split the string

    bdv. mot. g. vns. kilm.

using your list, then remove from the original string whatever is left after splitting.
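One possible reading of this suggestion, as a rough sketch only: split on the important strings with re.split, then delete the leftover pieces from the original string. The helper name keep_by_splitting is hypothetical, and the approach has not been checked against awkward inputs such as overlapping entries.

```python
import re

important_strings = ['bdv.', 'dktv.', 'mot. g.', 'vyr. g.']
pattern = re.compile('|'.join(re.escape(s) for s in important_strings))


def keep_by_splitting(text):
    """Split on the important strings, then delete the leftover pieces."""
    # With no capture groups, split() returns only the text between matches.
    leftovers = [piece for piece in pattern.split(text) if piece.strip()]
    for piece in leftovers:
        # Replace each leftover once, leaving a space so neighbours stay apart.
        text = text.replace(piece, ' ', 1)
    # Collapse the remaining runs of whitespace.
    return ' '.join(text.split())


print(keep_by_splitting('bdv. mot. g. vns. kilm.'))  # bdv. mot. g.
```

This does the same matching work as findall plus extra string replacements, so it is unlikely to be faster; it is shown only to make the idea concrete.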

















answered Nov 10 at 18:08 by user10403681





























