How to collect all lines of data between keywords in a file - starting+ending at linebreaks
Score: 3
I am trying to collect specific information from very large log files but cannot figure out how to get the behavior I need.
For reference, an example log is sort of like this:
    garbage I don't need - garbage I don't need
    timestamp - date - server info - 'keyword 1' - data
    more data more data more data more data
    more data more data more data more data
    more data more data 'keyword 2' - last bit of data
    garbage I don't need - garbage I don't need
What I need is to find 'keyword 1', grab the whole line keyword 1 is on (back to timestamp) and all subsequent lines until (and including) the whole line that 'keyword 2' is on (through the last bit of data).
So far I have tried a few things. I cannot get decent results with the re methods (findall, match, search, etc.). I cannot figure out how to grab the data before the match (even with a lookbehind) and, more importantly, I cannot figure out how to make the capture stop at a phrase and not just a single character.
    for match in re.findall('keyword1[keyword2]+|', showall.read()):
I also tried something like this:
    start_capture = False
    for current_line in fileName:
        if 'keyword1' in current_line:
            start_capture = True
        if start_capture:
            new_list.append(current_line)
        if 'keyword2' in current_line:
            return(new_list)
No matter what I tried, this returned an empty list.
Finally, I tried something like this:
    def takewhile_plus_next(predicate, xs):
        for x in xs:
            if not predicate(x):
                break
            yield x
        yield x

    with lastdb as f:
        lines = map(str.rstrip, f)
        skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
        lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)
This last one took everything from keyword 1 to the EOF, which includes almost 100,000 lines of garbage data.
python regex python-3.x parsing
python regex python-3.x parsing
edited Nov 8 at 22:03
Patrick Artner
19.1k51940
asked Nov 8 at 21:48
Toenailsmcgee
143
Look, you tried checking if line contains keyword1, but your data has keyword 1. Try this parsing code.
– Wiktor Stribiżew
Nov 8 at 22:38
@WiktorStribiżew that isn't my literal code; in my actual code I have definite matching terms.
– Toenailsmcgee
Nov 8 at 23:01
So, what is the problem? If these are regexps, use if re.search(rx, line) instead of if 'keyword' in line.
– Wiktor Stribiżew
Nov 8 at 23:33
First example doesn't initialise new_list as a list. Check indentation in the second example.
– Nick
Nov 9 at 7:15
@Nick these are just excerpts of the relevant code. In my real code new_list is initialized and remains empty after the block of code runs. The indentation of the second is a matter of text formatting; I didn't realize it got messed up like that on the copy/paste. Again, it is proper in my real code. I appreciate you trying to give feedback though.
– Toenailsmcgee
Nov 10 at 2:07
3 Answers
Score: 1
You can use regex if you specify re.DOTALL and use lazy anythings .*? to match start and end:
    import re

    regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

    test_str = ("garbage I don't need - garbage I don't need\n"
                "timestamp - date - server info - 'keyword 1' - data\n"
                "more data more data more data more data\n"
                "more data more data more data more data\n"
                "more data more data 'keyword 2' - last bit of data\n"
                "garbage I don't need - garbage I don't need")

    # DOTALL lets . match newlines; MULTILINE lets $ match at the end of each line
    matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

    for match in matches:
        print(match.group())  # your match is the whole group
Output:
    timestamp - date - server info - 'keyword 1' - data
    more data more data more data more data
    more data more data more data more data
    more data more data 'keyword 2' - last bit of data
You might need to strip('\n') from it, since the match starts with the newline in front of the start line.
You can view it here: https://regex101.com/r/HWIALZ/1 - it also holds the explanation of the pattern. The short of it:
    \n            newline
    .*?           as few as possible anythings
    (keyword 1)   literal text - the () are only needed if you want the group
    .*?           as few as possible anythings
    (keyword 2)   literal text - again, the () are not needed
    .*?           as few as possible anythings
    $             end of line
I included the () for clarity - if you do not evaluate groups, you can remove them.
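A caveat that may be worth testing against real logs: with re.DOTALL the leading \n.*? is free to cross line breaks, so the match starts at the first newline in the text, not necessarily the one just before the start line. With a single unwanted line in front of the block (as in the test string) that makes no difference, but with several, the extra lines end up in the match. The sketch below is one possible variation and not part of the original answer: the helper name extract_block and the sample text are made up for illustration, and it assumes both keywords are plain literals.

```python
import re


def extract_block(text, start_keyword, end_keyword):
    """Return the first block from the start-keyword line through the end-keyword line.

    Hypothetical helper for illustration; returns None when no block is found.
    """
    pattern = re.compile(
        r"^[^\n]*" + re.escape(start_keyword)  # start line only: [^\n] cannot cross a newline
        + r".*?" + re.escape(end_keyword)      # shortest stretch (may span lines) up to the end keyword
        + r"[^\n]*$",                          # remainder of the end-keyword line
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group() if match else None


# Made-up sample with two unwanted lines in front of the block.
sample = (
    "garbage line 1\n"
    "garbage line 2\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage line 3\n"
)

print(extract_block(sample, "keyword 1", "keyword 2"))
```

Because ^ (with re.MULTILINE) pins the match to the beginning of a line and [^\n]* stays on that line, the result begins at the timestamp and needs no strip afterwards.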
edited Nov 8 at 22:02
answered Nov 8 at 21:57
Patrick Artner
19.1k51940
I am getting this error:
– Toenailsmcgee
Nov 8 at 22:49
sorry - I messed up the last edit - I substituted filename for test_str: matches = re.finditer(regex, filename, re.DOTALL | re.MULTILINE) File "C:\Program Files\Python36\lib\re.py", line 229, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object
– Toenailsmcgee
Nov 8 at 22:59
@Toenailsmcgee using with open(filename,"r") as f: re.finditer(regex,f.read(), .. flags ...) should do it, unless your files are so big that they don't fit into memory
– Patrick Artner
Nov 9 at 6:35
@Patrick Artner why \n not ^ at start of regex?
– Nick
Nov 11 at 19:59
@Nick does not work - try it. It would still match, but from the start of the first line instead of from the closest \n
– Patrick Artner
Nov 11 at 20:09
Score: 1
The following is fast and works for files of any size, because it reads one line at a time. It extracted the lines from a 250 MB log file of nearly 2 million lines in 3 seconds, with the extracted portion at the end of the file.
I would not recommend using list, regexes or other in-memory techniques if there is a chance your files won't fit in available memory.
Test text file startstop_text:
    line 1 this should not appear in output
    line 2 keyword1
    line 3 appears in output
    line 4 keyword2
    line 5 this should not appear in output
Code:
    from itertools import dropwhile

    def keepuntil(contains_end_keyword, lines):
        # yield lines up to and including the first one that contains the end keyword
        for line in lines:
            yield line
            if contains_end_keyword(line):
                break

    with open('startstop_text', 'r') as f:
        # skip everything before the first line that contains the start keyword
        from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
        extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
        for line in extracted:
            print(line.rstrip())

    $ python startstop.py
    line 2 keyword1
    line 3 appears in output
    line 4 keyword2
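If the log can contain more than one start/end pair, the same line-by-line idea can be extended to yield every block instead of only the first. The following is only a sketch under that assumption: the generator name iter_blocks and the extra sample line are hypothetical, and a block whose end keyword never appears is silently dropped.

```python
def iter_blocks(lines, start_keyword, end_keyword):
    """Yield each block as a list of lines, from a start-keyword line
    through the next line that contains the end keyword (inclusive)."""
    block = None  # None means "not currently inside a block"
    for line in lines:
        if block is None and start_keyword in line:
            block = []  # begin collecting at the start-keyword line
        if block is not None:
            block.append(line)
            if end_keyword in line:
                yield block  # the end-keyword line is part of the block
                block = None


# Sample lines; the last one (both keywords on a single line) is made up.
sample_lines = [
    "line 1 this should not appear in output\n",
    "line 2 keyword1\n",
    "line 3 appears in output\n",
    "line 4 keyword2\n",
    "line 5 this should not appear in output\n",
    "line 6 keyword1 and keyword2 on one line\n",
]

for block in iter_blocks(sample_lines, "keyword1", "keyword2"):
    print("".join(block), end="")
    print("---")
```

Passing an open file object as lines should keep memory use bounded by the largest single block, since a file object hands out one line at a time.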
edited Nov 10 at 20:06
answered Nov 9 at 4:48
Nick
472414
Score: -1 (accepted)
None of the other responses worked but I was able to figure it out using regex.
for match in re.findall(".*keyword1[sS]*?keyword2:[sS]*?keyword3.*", log_file.read()):
answered Nov 11 at 5:48
Toenailsmcgee
143
Hey @Toenailsmcgee the above solution is not complete code. That makes it less helpful than it could be.
– Nick
Nov 11 at 22:29
My solution works fine for me in Python 2.7 and 3.6, on small and large files, with the extracted lines in initial, final or middle position. If there's a problem with it please let me know what error or what faulty output you're getting. Did you want to find multiple instances of keyword1 and keyword2 and extract all? If so, my solution doesn't work, but that isn't what you asked for.
– Nick
Nov 11 at 22:31
Look, you tried checking if line contains keyword1, but your data has keyword 1. Try this parsing code.
– Wiktor Stribiżew
Nov 8 at 22:38
@WiktorStribiżew that isn't my literal code, in my actual code I have definite matching terms
– Toenailsmcgee
Nov 8 at 23:01
So, what is the problem? If these are regexps, use if re.search(rx, line) instead of if 'keyword' in line.
– Wiktor Stribiżew
Nov 8 at 23:33
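A minimal sketch of that suggestion, assuming the start and end markers are regular expressions and that a file may contain more than one block. The `blocks_between` name, the sample lines and the `keyword\s*1` / `keyword\s*2` patterns are invented for illustration. It walks the input one line at a time, so the whole log never has to sit in memory.

```python
import re

def blocks_between(lines, start_rx, end_rx):
    """Yield each run of lines from one matching `start_rx` through the
    next one matching `end_rx`, both boundary lines included."""
    block = None  # None means "not currently capturing"
    for line in lines:
        if block is None and re.search(start_rx, line):
            block = []  # start line found: begin a new block
        if block is not None:
            block.append(line.rstrip("\n"))
            if re.search(end_rx, line):
                yield block  # end line found: emit and reset
                block = None

# Hypothetical sample shaped like the log in the question.
log = [
    "garbage\n",
    "ts - 'keyword 1' - data\n",
    "more data\n",
    "more data 'keyword 2' - last bit\n",
    "garbage\n",
]

for block in blocks_between(log, r"keyword\s*1", r"keyword\s*2"):
    print("\n".join(block))
```

An open file object can be passed as `lines` directly. A block that starts but never reaches an end line before EOF is silently dropped in this sketch; whether that is the right behaviour depends on the logs.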
First example doesn't initialise new_list as a list. Check indentation in the second example.
– Nick
Nov 9 at 7:15
@Nick these are just excerpts of the relevant code. In my real code new_list is initialized and remains empty after the block of code runs. The indentation of the second is a matter of text formatting; I didn't realize it got messed up like that on the copy/paste. Again, it is proper in my real code. I appreciate you trying to give feedback though.
– Toenailsmcgee
Nov 10 at 2:07