How to collect all lines of data between keywords in a file - starting+ending at linebreaks

I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.

For reference, an example log is sort of like this:

garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need

What I need is to find 'keyword 1', grab the whole line keyword 1 is on (back to timestamp) and all subsequent lines until (and including) the whole line that 'keyword 2' is on (through the last bit of data).

So far I have tried a few things. I cannot get decent results with the re methods (findall, match, search, etc.): I cannot figure out how to grab the data before the match (even with a lookbehind) and, more importantly, I cannot figure out how to make the capture stop at a phrase rather than at a single character.

for match in re.findall('keyword1[keyword2]+|', showall.read()):

I also tried something like this:

start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)

No matter what I tried, this returned an empty list.
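For comparison, here is a minimal, self-contained sketch of the flag-based loop above. It assumes the list is initialised inside a function and that the keywords are plain substrings; `collect_block` and the in-memory sample are hypothetical and used only for illustration. On the sample log it returns the four wanted lines, so it may be worth checking that the search strings match the file contents exactly.

```python
import io


def collect_block(lines, start_kw, end_kw):
    """Return the lines from the first one containing start_kw
    through the next one containing end_kw (both inclusive)."""
    collected = []
    capturing = False
    for line in lines:
        if not capturing and start_kw in line:
            capturing = True
        if capturing:
            collected.append(line.rstrip("\n"))
            if end_kw in line:
                break  # stop reading as soon as the end line is stored
    return collected


# Hypothetical stand-in for the real log file
sample = io.StringIO(
    "garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need\n"
)

block = collect_block(sample, "keyword 1", "keyword 2")
print("\n".join(block))
```

Because the loop breaks at the end line, the rest of the file is never read.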

Finally, I tried something like this:

from itertools import dropwhile

def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x

with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)

This last one took everything from keyword 1 to EOF, which includes almost 100,000 lines of garbage data.

asked Nov 8 at 21:48 by Toenailsmcgee
edited Nov 8 at 22:03 by Patrick Artner

Look, you tried checking if line contains keyword1, but your data has keyword 1. Try this parsing code.
– Wiktor Stribiżew
Nov 8 at 22:38

@WiktorStribiżew that isn't my literal code, in my actual code I have definite matching terms
– Toenailsmcgee
Nov 8 at 23:01

So, what is the problem? If these are regexps, use if re.search(rx, line) instead of if 'keyword' in line.
– Wiktor Stribiżew
Nov 8 at 23:33

First example doesn't initialise new_list as a list. Check indentation in the second example.
– Nick
Nov 9 at 7:15

@Nick these are just excerpts of the relevant code. In my real code new_list is initialized and remains empty after the block of code runs. The indentation of the second example is a matter of text formatting; I didn't realize it got messed up like that on the copy/paste. Again, it is proper in my real code. I appreciate you trying to give feedback though.
– Toenailsmcgee
Nov 10 at 2:07

python regex python-3.x parsing


3 Answers

You can use regex if you specify re.DOTALL and use lazy "anything" matches (.*?) between the start and end markers:

import re

# \n starts the match at the line break just before the 'keyword 1' line
regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need")

# DOTALL lets . cross line breaks; MULTILINE lets $ match at the end of each line
matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for match in matches:
    print(match.group())  # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data

You might need to strip('\n') from it ...

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:

\n             newline
.*?            as few as possible anythings
(keyword 1)    literal text - the () are only needed if you want the group
.*?            as few as possible anythings
(keyword 2)    literal text - again, the () are not needed
.*?            as few as possible anythings
$              end of line

I included the () for clarity - if you do not evaluate the groups, you can remove them.
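As a possible variant (not the pattern from the regex101 link), the leading newline can be avoided by anchoring on the start of a line and keeping the first stretch inside that line with [^\n]*. This is only a sketch under stated assumptions: `extract_blocks` and its parameters are hypothetical names, the keywords are treated as literal text (hence re.escape), and the whole input is assumed to fit in memory.

```python
import re


def extract_blocks(text, start_kw, end_kw):
    """Return every block running from the line that contains start_kw
    through the line that contains the next end_kw."""
    pattern = re.compile(
        r"^[^\n]*" + re.escape(start_kw)  # from the start of the start_kw line
        + r".*?" + re.escape(end_kw)      # lazily across lines up to end_kw
        + r"[^\n]*$",                     # plus the rest of the end_kw line
        re.DOTALL | re.MULTILINE,
    )
    return [m.group() for m in pattern.finditer(text)]


test_str = (
    "garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need"
)

blocks = extract_blocks(test_str, "keyword 1", "keyword 2")
print(blocks[0])
```

For a real log, the text could come from f.read() on an opened file, with the same memory caveat.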

answered Nov 8 at 21:57 by Patrick Artner (edited Nov 8 at 22:02)

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

Sorry, I messed up the last edit. I substituted filename for test_str: matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:\Program Files\Python36\lib\re.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why \n and not ^ at the start of the regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. It would still match, but from the start of the first line instead of from the closest \n
– Patrick Artner
Nov 11 at 20:09


The following is fast for any size of file. It extracted the wanted lines from a 250 MB log file of nearly 2 million lines in 3 seconds, with the extracted portion at the end of the file.

I would not recommend lists, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.

Test text file startstop_text:

line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output

Code:

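from itertools import dropwhile


def keepuntil(contains_end_keyword, lines):
    """Yield lines up to and including the first one matching the end test."""
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break


with open('startstop_text', 'r') as f:
    # Skip everything before the first line containing keyword1
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    # Keep lines through the first one containing keyword2
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())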

$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
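If a file may contain several keyword1 ... keyword2 sections, the same streaming idea can be extended with a small generator. This is a hedged sketch rather than part of the code above: `iter_blocks` is a hypothetical helper, the keywords are assumed to be plain substrings, and a block still open at end of file is dropped by assumption.

```python
import io


def iter_blocks(lines, start_kw, end_kw):
    """Yield each block as a list of lines, from a line containing
    start_kw through the next line containing end_kw (inclusive)."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_kw in line:
            block = []
        if block is not None:
            block.append(line.rstrip("\n"))
            if end_kw in line:
                yield block
                block = None  # wait for the next start keyword


# Hypothetical in-memory stand-in for a log file with two sections
log = io.StringIO(
    "line 1 this should not appear in output\n"
    "line 2 keyword1\n"
    "line 3 appears in output\n"
    "line 4 keyword2\n"
    "line 5 this should not appear in output\n"
    "line 6 keyword1 again\n"
    "line 7 keyword2 again\n"
)

blocks = list(iter_blocks(log, "keyword1", "keyword2"))
for block in blocks:
    print(block)
```

Only one block is held in memory at a time, so this should keep the low memory footprint of the line-by-line approach as long as individual blocks are small.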

answered Nov 9 at 4:48 by Nick (edited Nov 10 at 20:06)

accepted

None of the other responses worked but I was able to figure it out using regex.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
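A more complete, hedged sketch of how this one-liner might be used is shown below. The sample text is invented, and keyword1, keyword2: and keyword3 are the placeholders from the pattern above rather than the real log markers; the whole input is assumed to fit in memory. Here [\s\S]*? lazily matches any characters including line breaks, while the leading and trailing .* extend the match to the full first and last lines.

```python
import re

# No flags needed: . stays within a line, [\s\S] crosses line breaks
pattern = re.compile(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*")

# Hypothetical stand-in for log_file.read()
log_text = (
    "garbage line\n"
    "timestamp - keyword1 - data\n"
    "more data\n"
    "keyword2: more data\n"
    "more data keyword3 - last bit of data\n"
    "garbage line\n"
)

matches = pattern.findall(log_text)
for match in matches:
    print(match)
```

Since the pattern has no capturing groups, findall returns each whole match as a string.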

answered Nov 11 at 5:48 by Toenailsmcgee

Hey @Toenailsmcgee the above solution is not complete code. That makes it less helpful than it could be.
– Nick
Nov 11 at 22:29

My solution works fine for me in Python 2.7 and 3.6, on small and large files, with the extracted lines in initial, final or middle position. If there's a problem with it please let me know what error or what faulty output you're getting. Did you want to find multiple instances of keyword1 and keyword2 and extract all? If so, my solution doesn't work-- but that isn't what you asked for.
– Nick
Nov 11 at 22:31

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53216657%2fhow-to-collect-all-lines-of-data-between-keywords-in-a-file-startingending-at%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
1
down vote

You can use regex if you specify re.dotall and use lazy anythings .*? to match start and end:

import re



regex = r"n.*?(keyword 1).*?(keyword 2).*?$"



test_str = ("garbage I don't need - garbage I don't needn"

    "timestamp - date - server info - 'keyword 1' - datan"

    "more data more data more data more datan"

    "more data more data more data more datan"

    "more data more data 'keyword 2' - last bit of datan"

    "garbage I don't need - garbage I don't need")



matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)



for matchNum, match in enumerate(matches):

    matchNum = matchNum + 1



    print (match.group()) # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 

more data more data more data more data

more data more data more data more data

more data more data 'keyword 2' - last bit of data

You might need to strip('n') from it ...

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the patter. The short of it:

n        newline 

   .*?    as few as possible anythings

   (keyword 1)   literal text - the () are not needed only if you want the group

   .*?    as few as possible anythings

   (keyword 2)   literal text - again () are not needed 

   .*?    as few as possible anythings

$         end of line

I included the () for clarity - you do not evaluate groups, you you remove them.

edited Nov 8 at 22:02

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why n not ^ at start of regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
– Patrick Artner
Nov 11 at 20:09

add a comment |

up vote
1
down vote

You can use regex if you specify re.dotall and use lazy anythings .*? to match start and end:

import re



regex = r"n.*?(keyword 1).*?(keyword 2).*?$"



test_str = ("garbage I don't need - garbage I don't needn"

    "timestamp - date - server info - 'keyword 1' - datan"

    "more data more data more data more datan"

    "more data more data more data more datan"

    "more data more data 'keyword 2' - last bit of datan"

    "garbage I don't need - garbage I don't need")



matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)



for matchNum, match in enumerate(matches):

    matchNum = matchNum + 1



    print (match.group()) # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 

more data more data more data more data

more data more data more data more data

more data more data 'keyword 2' - last bit of data

You might need to strip('n') from it ...

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the patter. The short of it:

n        newline 

   .*?    as few as possible anythings

   (keyword 1)   literal text - the () are not needed only if you want the group

   .*?    as few as possible anythings

   (keyword 2)   literal text - again () are not needed 

   .*?    as few as possible anythings

$         end of line

I included the () for clarity - you do not evaluate groups, you you remove them.

edited Nov 8 at 22:02

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why n not ^ at start of regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
– Patrick Artner
Nov 11 at 20:09

add a comment |

up vote
1
down vote

You can use regex if you specify re.dotall and use lazy anythings .*? to match start and end:

import re



regex = r"n.*?(keyword 1).*?(keyword 2).*?$"



test_str = ("garbage I don't need - garbage I don't needn"

    "timestamp - date - server info - 'keyword 1' - datan"

    "more data more data more data more datan"

    "more data more data more data more datan"

    "more data more data 'keyword 2' - last bit of datan"

    "garbage I don't need - garbage I don't need")



matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)



for matchNum, match in enumerate(matches):

    matchNum = matchNum + 1



    print (match.group()) # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 

more data more data more data more data

more data more data more data more data

more data more data 'keyword 2' - last bit of data

You might need to strip('n') from it ...

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the patter. The short of it:

n        newline 

   .*?    as few as possible anythings

   (keyword 1)   literal text - the () are not needed only if you want the group

   .*?    as few as possible anythings

   (keyword 2)   literal text - again () are not needed 

   .*?    as few as possible anythings

$         end of line

I included the () for clarity - you do not evaluate groups, you you remove them.

edited Nov 8 at 22:02

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

You can use regex if you specify re.dotall and use lazy anythings .*? to match start and end:

import re



regex = r"n.*?(keyword 1).*?(keyword 2).*?$"



test_str = ("garbage I don't need - garbage I don't needn"

    "timestamp - date - server info - 'keyword 1' - datan"

    "more data more data more data more datan"

    "more data more data more data more datan"

    "more data more data 'keyword 2' - last bit of datan"

    "garbage I don't need - garbage I don't need")



matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)



for matchNum, match in enumerate(matches):

    matchNum = matchNum + 1



    print (match.group()) # your match is the whole group

Output:

timestamp - date - server info - 'keyword 1' - data 

more data more data more data more data

more data more data more data more data

more data more data 'keyword 2' - last bit of data

You might need to strip('n') from it ...

You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the patter. The short of it:

n        newline 

   .*?    as few as possible anythings

   (keyword 1)   literal text - the () are not needed only if you want the group

   .*?    as few as possible anythings

   (keyword 2)   literal text - again () are not needed 

   .*?    as few as possible anythings

$         end of line

I included the () for clarity - you do not evaluate groups, you you remove them.

edited Nov 8 at 22:02

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

edited Nov 8 at 22:02

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

answered Nov 8 at 21:57

Patrick Artner

19.1k51940

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why n not ^ at start of regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
– Patrick Artner
Nov 11 at 20:09

add a comment |

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why n not ^ at start of regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
– Patrick Artner
Nov 11 at 20:09

I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49

sorry - I messed up the last edit -I substituted filename for test_str matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:Program FilesPython36libre.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59

@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35

@Patrick Artner why n not ^ at start of regex?
– Nick
Nov 11 at 19:59

@Nick does not work - try it. it would still match but from the start of the first line instead of from the closest n
– Patrick Artner
Nov 11 at 20:09

add a comment |

up vote
1
down vote

The following is fast for any size of file. It extracts from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.

I would not recommend using list, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.

Test text file startstop_text:

line 1 this should not appear in output

line 2 keyword1

line 3 appears in output

line 4 keyword2

line 5 this should not appear in output

Code:

from itertools import dropwhile





def keepuntil(contains_end_keyword, lines):

    for line in lines:

        yield line

        if contains_end_keyword(line):

            break





with open('startstop_text', 'r') as f:

    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)

    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)

    for line in extracted:

        print(line.rstrip())





>>> python startstop.py

line 2 keyword1

line 3 appears in output

line 4 keyword2

edited Nov 10 at 20:06

answered Nov 9 at 4:48

Nick

472414

add a comment |

up vote
1
down vote

The following is fast for any size of file. It extracts from a 250M log file of nearly 2 million lines in 3 seconds. The extracted portion was at the end of the file.

I would not recommend using list, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.

Test text file startstop_text:

line 1 this should not appear in output

line 2 keyword1

line 3 appears in output

line 4 keyword2

line 5 this should not appear in output

Code:

from itertools import dropwhile





def keepuntil(contains_end_keyword, lines):

    for line in lines:

        yield line

        if contains_end_keyword(line):

            break





with open('startstop_text', 'r') as f:

    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)

    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)

    for line in extracted:

        print(line.rstrip())





>>> python startstop.py

line 2 keyword1

line 3 appears in output

line 4 keyword2

edited Nov 10 at 20:06

answered Nov 9 at 4:48

Nick

472414


up vote
-1
down vote

accepted

None of the other responses worked but I was able to figure it out using regex.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
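
For anyone who wants to try the regex route, here is a minimal runnable sketch. It assumes only the two keywords from the question (the third keyword and the colon in the line above are left out), and the sample text is made up for illustration. Like any `read()`-based approach, it holds the whole input in memory.

```python
import re

# Hypothetical sample standing in for log_file.read(); a real log will differ.
sample = (
    "garbage line\n"
    "timestamp - date - 'keyword1' - data\n"
    "more data\n"
    "more data 'keyword2' - last bit of data\n"
    "garbage line\n"
)

# ".*" does not cross newlines (re.DOTALL is not set), so the leading and
# trailing ".*" stretch the match to the start and end of the keyword lines.
# "[\s\S]*?" matches any character, newlines included, as few times as
# possible, so the match ends at the first keyword2 after keyword1.
pattern = re.compile(r".*keyword1[\s\S]*?keyword2.*")

matches = pattern.findall(sample)
for match in matches:
    print(match)
```

The lazy quantifier `*?` is what makes the capture stop at a phrase rather than at a single character; a character class such as `[keyword2]` only matches one of those letters at a time.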

answered Nov 11 at 5:48

Toenailsmcgee

143

Hey @Toenailsmcgee the above solution is not complete code. That makes it less helpful than it could be.
– Nick
Nov 11 at 22:29

My solution works fine for me in Python 2.7 and 3.6, on small and large files, with the extracted lines in initial, final or middle position. If there's a problem with it, please let me know what error or faulty output you're getting. Did you want to find multiple instances of keyword1 and keyword2 and extract all of them? If so, my solution doesn't work, but that isn't what you asked for.
– Nick
Nov 11 at 22:31
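
If multiple keyword1/keyword2 pairs do need to be extracted, one possible sketch is a flag-based generator that still reads line by line. The function name `extract_blocks` and the sample lines are assumptions for illustration and are not taken from either answer.

```python
def extract_blocks(lines, start_keyword, end_keyword):
    """Yield each run of lines from a start-keyword line through the next end-keyword line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_keyword in line:
            block = []  # begin collecting at the line containing the start keyword
        if block is not None:
            block.append(line)
            if end_keyword in line:
                yield block  # the end-keyword line is included in the block
                block = None
    # A block that never reaches the end keyword is dropped in this sketch.


# Hypothetical input; any iterable of lines works, including an open file.
sample_lines = [
    "line 1 this should not appear in output\n",
    "line 2 keyword1\n",
    "line 3 appears in output\n",
    "line 4 keyword2\n",
    "line 5 this should not appear in output\n",
    "line 6 keyword1 and keyword2 on one line\n",
]

blocks = list(extract_blocks(sample_lines, "keyword1", "keyword2"))
for block in blocks:
    print("".join(block), end="")
```

Only one block is held in memory at a time, so this should behave similarly to the dropwhile version on large files as long as each individual block is small.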

