How to collect all lines of data between keywords in a file - starting+ending at linebreaks











Score: 3

I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.



For reference, an example log is sort of like this:




garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need



What I need is to find 'keyword 1', grab the whole line keyword 1 is on (back to timestamp) and all subsequent lines until (and including) the whole line that 'keyword 2' is on (through the last bit of data).



So far I have tried a few things. I cannot get decent results with re methods (findall, match, search etc.); I cannot figure out how to grab the data before the match (even with a look behind) but more importantly, I cannot figure out how to have the capture stop at a phrase and not just a single character.



for match in re.findall('keyword1[keyword2]+|', showall.read()):


I also tried something like this:



start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)


No matter what I tried, this returned an empty list.

Finally, I tried something like this:



def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x

with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)


This last one took everything from keyword 1 to the EOF which includes almost 100,000 lines of garbage data.










  • Look, you tried checking if line contains keyword1, but your data has keyword 1. Try this parsing code.
    – Wiktor Stribiżew
    Nov 8 at 22:38












  • @WiktorStribiżew that isn't my literal code, in my actual code I have definite matching terms
    – Toenailsmcgee
    Nov 8 at 23:01










  • So, what is the problem? If these are regexps, use if re.search(rx, line) instead of if 'keyword' in line.
    – Wiktor Stribiżew
    Nov 8 at 23:33










  • First example doesn't initialise new_list as a list. Check indentation in the second example.
    – Nick
    Nov 9 at 7:15










  • @Nick these are just excerpts of the relevant code. In my real code new_list is initialized and remains empty after the block of code runs. the indentation of the second is a matter of text formatting. I didn't realize it got messed up like that on the copy/paste. Again, it is proper in my real code. I appreciate you trying to give feedback though.
    – Toenailsmcgee
    Nov 10 at 2:07















python regex python-3.x parsing






edited Nov 8 at 22:03 by Patrick Artner

asked Nov 8 at 21:48 by Toenailsmcgee












3 Answers
Score: 1

You can use a regex if you specify re.DOTALL and use lazy quantifiers (.*?) to match from the start keyword to the end keyword:



import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
            "timestamp - date - server info - 'keyword 1' - data\n"
            "more data more data more data more data\n"
            "more data more data more data more data\n"
            "more data more data 'keyword 2' - last bit of data\n"
            "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print(match.group())  # your match is the whole group


Output:



timestamp - date - server info - 'keyword 1' - data 
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data


You might need to strip('\n') from it, because the match begins with the newline that precedes the first line.

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:

\n          newline
.*?         as few as possible anythings
(keyword 1) literal text - the () are only needed if you want the group
.*?         as few as possible anythings
(keyword 2) literal text - again, the () are not needed
.*?         as few as possible anythings
$           end of line

I included the () for clarity - if you do not evaluate the groups, you can remove them.
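
One caveat with the pattern above: under re.DOTALL the lazy .*? that follows \n is allowed to cross line boundaries, so on a log with several unwanted lines before 'keyword 1' the match can start at the first newline in the text rather than on the keyword's own line. A minimal sketch of a line-anchored variant follows. It assumes the keywords are literal strings, and extract_block with its parameter names is made up for illustration:

```python
import re


def extract_block(text, start_kw, end_kw):
    """Return the first block from the start_kw line through the end_kw line, or None."""
    pattern = re.compile(
        r"^[^\n]*" + re.escape(start_kw)   # the whole line holding the start keyword
        + r".*?" + re.escape(end_kw)       # lazily up to the first end keyword (may span lines)
        + r"[^\n]*",                       # the rest of the end keyword's line
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else None


# Illustrative sample with two unwanted lines before the block.
sample = (
    "garbage I don't need\n"
    "more garbage\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage I don't need\n"
)
print(extract_block(sample, "keyword 1", "keyword 2"))
```

Because [^\n]* cannot cross a newline, the match can only begin on the line that actually contains the start keyword, and the result carries no leading newline to strip.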






answered Nov 8 at 21:57 by Patrick Artner, edited Nov 8 at 22:02

  • I am getting this error:
    – Toenailsmcgee
    Nov 8 at 22:49










  • sorry - I messed up the last edit - I substituted filename for test_str: matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:\Program Files\Python36\lib\re.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
    – Toenailsmcgee
    Nov 8 at 22:59












  • @Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
    – Patrick Artner
    Nov 9 at 6:35










  • @Patrick Artner why \n not ^ at start of regex?
    – Nick
    Nov 11 at 19:59










  • @Nick does not work - try it. it would still match but from the start of the first line instead of from the closest \n
    – Patrick Artner
    Nov 11 at 20:09
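
Following up on the comment above about files that do not fit into memory: one option is to let re search a memory-mapped file instead of the result of f.read(). The sketch below is an assumption-laden illustration, not part of the answer. It uses bytes patterns (mmap exposes bytes), expects a non-empty file, and find_block_in_file is a hypothetical helper name:

```python
import mmap
import re


def find_block_in_file(path, start_kw, end_kw):
    """Search a memory-mapped file; keywords are bytes, e.g. b'keyword 1'."""
    pattern = re.compile(
        rb"^[^\n]*" + re.escape(start_kw) + rb".*?" + re.escape(end_kw) + rb"[^\n]*",
        re.DOTALL | re.MULTILINE,
    )
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            match = pattern.search(mm)
            # group(0) copies the matched bytes out before the map is closed.
            return match.group(0) if match else None
```

A call might look like find_block_in_file('server.log', b'keyword 1', b'keyword 2'), where 'server.log' is a placeholder path. The result is bytes, so decode it if text is needed.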




















Score: 1

The following is fast for any size of file. It extracts from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.



I would not recommend using lists, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.



Test text file startstop_text:



line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output


Code:



from itertools import dropwhile


def keepuntil(contains_end_keyword, lines):
    # Yield lines up to and including the first one that contains the end keyword.
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break


with open('startstop_text', 'r') as f:
    # Skip everything before the first line that contains the start keyword.
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())


$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
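
If the log may contain several keyword1 ... keyword2 sections, the same streaming idea can be extended with a small state machine. This is a sketch rather than part of the answer above: iter_blocks and its parameters are illustrative names, and it assumes plain substring matching and blocks that do not nest.

```python
def iter_blocks(lines, start_kw, end_kw):
    """Yield each block (a list of lines) from a start_kw line through the next end_kw line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_kw in line:
            block = []  # start collecting at the start keyword's line
        if block is not None:
            block.append(line)
            if end_kw in line:
                yield block  # the end keyword's line is included
                block = None
    # A block still open at end of input (no end keyword found) is dropped.


# Illustrative input; an open file object can be passed instead of a list.
log = [
    "line 1 noise\n",
    "line 2 keyword1\n",
    "line 3 data\n",
    "line 4 keyword2\n",
    "line 5 noise\n",
    "line 6 keyword1 and keyword2 on one line\n",
]
for found in iter_blocks(log, "keyword1", "keyword2"):
    print("".join(found), end="")
    print("---")
```

Only the lines of the block currently being collected are held in memory, so passing a file object keeps the approach suitable for large logs.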





Score: -1 (accepted)

None of the other responses worked but I was able to figure it out using regex.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
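
For readers who want something runnable, here is one way the line above might be completed. It is a sketch under assumptions: the three keywords are placeholders for the real search terms (which the answer does not give), log_text is an in-memory stand-in for log_file.read(), and BLOCK_RE is an illustrative name.

```python
import re

# [\s\S] matches any character including newlines, so no re.DOTALL flag is needed;
# the leading and trailing .* extend the match to the edges of the first and last lines.
BLOCK_RE = re.compile(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*")

# Stand-in for log_file.read(); the real log contents are not shown in the answer.
log_text = (
    "noise\n"
    "ts - keyword1 - start\n"
    "data keyword2: value\n"
    "more data\n"
    "end keyword3 - last bit\n"
    "noise\n"
)

# With no capture groups in the pattern, findall returns each full match as a string.
for match in BLOCK_RE.findall(log_text):
    print(match)
```

Like any approach built on read(), this loads the whole file into memory first.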





  • Hey @Toenailsmcgee the above solution is not complete code. That makes it less helpful than it could be.
    – Nick
    Nov 11 at 22:29

  • My solution works fine for me in Python 2.7 and 3.6, on small and large files, with the extracted lines in initial, final or middle position. If there's a problem with it please let me know what error or what faulty output you're getting. Did you want to find multiple instances of keyword1 and keyword2 and extract all? If so, my solution doesn't work-- but that isn't what you asked for.
    – Nick
    Nov 11 at 22:31











    • I am getting this error:
      – Toenailsmcgee
      Nov 8 at 22:49










    • sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
      – Toenailsmcgee
      Nov 8 at 22:59












    • @Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
      – Patrick Artner
      Nov 9 at 6:35










    • @Patrick Artner why n not ^ at start of regex?
      – Nick
      Nov 11 at 19:59










    • @Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
      – Patrick Artner
      Nov 11 at 20:09




















    • I am getting this error:
      – Toenailsmcgee
      Nov 8 at 22:49










    • sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
      – Toenailsmcgee
      Nov 8 at 22:59












    • @Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
      – Patrick Artner
      Nov 9 at 6:35










    • @Patrick Artner why n not ^ at start of regex?
      – Nick
      Nov 11 at 19:59










    • @Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
      – Patrick Artner
      Nov 11 at 20:09
































    up vote
    1
    down vote













    The following is fast for files of any size. It extracted from a 250 MB log file of nearly 2 million lines in 3 seconds, with the extracted portion at the end of the file.



    I would not recommend using lists, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.



    Test text file startstop_text:



    line 1 this should not appear in output
    line 2 keyword1
    line 3 appears in output
    line 4 keyword2
    line 5 this should not appear in output


    Code:



    from itertools import dropwhile


    def keepuntil(contains_end_keyword, lines):
        # Yield lines up to and including the first one with the end keyword.
        for line in lines:
            yield line
            if contains_end_keyword(line):
                break


    with open('startstop_text', 'r') as f:
        # Lazily skip everything before the first line containing keyword1.
        from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
        extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
        for line in extracted:
            print(line.rstrip())


    $ python startstop.py
    line 2 keyword1
    line 3 appears in output
    line 4 keyword2
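
The code above stops after the first keyword2 line. If a log can contain several keyword1 ... keyword2 blocks, the same streaming idea can be extended along these lines. This is only a sketch: iter_blocks and its parameter names are hypothetical, and it assumes a block that never reaches its end keyword should be dropped.

```python
def iter_blocks(lines, start_kw, end_kw):
    """Yield each run of lines from a start_kw line through the next end_kw line.

    lines can be any iterable of strings, such as an open file, so only the
    block currently being collected is held in memory.
    """
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_kw in line:
            block = []  # a new block starts on this line
        if block is not None:
            block.append(line)
            if end_kw in line:
                yield block
                block = None  # go back to looking for the next start line
    # A block that is still open at end of input is discarded (an assumption).
```

With a real file this could be driven as `with open(path) as f: for block in iter_blocks(f, 'keyword1', 'keyword2'): ...`.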








































        edited Nov 10 at 20:06

























        answered Nov 9 at 4:48









        Nick

        472414


























            up vote
            -1
            down vote



            accepted










            None of the other responses worked but I was able to figure it out using regex.



            for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
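
For readers who want something runnable, here is a minimal sketch of the same regex idea, reduced to the two keywords from the question. The names BLOCK and extract_blocks are assumptions made for illustration, and the whole text has to fit in memory.

```python
import re

# Without re.DOTALL, "." stops at line breaks, so the leading ".*" reaches back
# only to the start of the keyword1 line and the trailing ".*" runs only to the
# end of the keyword2 line. "[\s\S]*?" matches any characters, newlines
# included, and is lazy so it stops at the first keyword2.
BLOCK = re.compile(r".*keyword1[\s\S]*?keyword2.*")


def extract_blocks(text):
    """Return every block from a keyword1 line through the next keyword2 line."""
    return BLOCK.findall(text)
```

Called as `extract_blocks(log_file.read())`, this returns a list of multi-line strings, one per block.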


























            • Hey @Toenailsmcgee the above solution is not complete code. That makes it less helpful than it could be.
              – Nick
              Nov 11 at 22:29










            • My solution works fine for me in Python 2.7 and 3.6, on small and large files, with the extracted lines in initial, final or middle position. If there's a problem with it, please let me know what error or what faulty output you're getting. Did you want to find multiple instances of keyword1 and keyword2 and extract them all? If so, my solution doesn't work, but that isn't what you asked for.
              – Nick
              Nov 11 at 22:31

























            answered Nov 11 at 5:48









            Toenailsmcgee

            143





















