pandas read excel sheet with multiple sheets and different header offsets
up vote
1
down vote
favorite
I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument or choose at least the 2 first (non empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.
python excel pandas
|
show 1 more comment
up vote
1
down vote
favorite
I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument or choose at least the 2 first (non empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.
python excel pandas
Interesting. So theDataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find theset
of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33
At least for my current data it turns out that there are only the two variants of header:(1,2), (2,3)
. However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Ok. can you try the following:import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09
|
show 1 more comment
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument or choose at least the 2 first (non empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.
python excel pandas
I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:
pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')
Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?
I.e. how can
pd.read_excel(file_name, sheet_name=None)
be passed a varying header argument or choose at least the 2 first (non empty) rows as header?
edit
dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.
edit2
Description of exact file structure:
... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2
where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.
python excel pandas
python excel pandas
edited Nov 8 at 9:36
asked Nov 8 at 8:59
Georg Heiler
4,883551123
4,883551123
Interesting. So theDataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find theset
of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33
At least for my current data it turns out that there are only the two variants of header:(1,2), (2,3)
. However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Ok. can you try the following:import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09
|
show 1 more comment
Interesting. So theDataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find theset
of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33
At least for my current data it turns out that there are only the two variants of header:(1,2), (2,3)
. However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Ok. can you try the following:import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09
Interesting. So the
DataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set
of headers; before reading in the actual data.– Uvar
Nov 8 at 9:33
Interesting. So the
DataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set
of headers; before reading in the actual data.– Uvar
Nov 8 at 9:33
At least for my current data it turns out that there are only the two variants of header:
(1,2), (2,3)
. However, ideally this would be recognised dynamically.– Georg Heiler
Nov 8 at 9:34
At least for my current data it turns out that there are only the two variants of header:
(1,2), (2,3)
. However, ideally this would be recognised dynamically.– Georg Heiler
Nov 8 at 9:34
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Ok. can you try the following:
import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..– Uvar
Nov 8 at 10:09
Ok. can you try the following:
import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..– Uvar
Nov 8 at 10:09
|
show 1 more comment
1 Answer
1
active
oldest
votes
up vote
0
down vote
accepted
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Usinghow='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop
– Uvar
Nov 8 at 10:30
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
accepted
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Usinghow='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop
– Uvar
Nov 8 at 10:30
add a comment |
up vote
0
down vote
accepted
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Usinghow='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop
– Uvar
Nov 8 at 10:30
add a comment |
up vote
0
down vote
accepted
up vote
0
down vote
accepted
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:
import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd
Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.
edited Nov 8 at 10:34
answered Nov 8 at 10:24
Uvar
2,401520
2,401520
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Usinghow='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop
– Uvar
Nov 8 at 10:30
add a comment |
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Usinghow='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop
– Uvar
Nov 8 at 10:30
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26
Using
how='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.– Georg Heiler
Nov 8 at 10:28
Using
how='all'
will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.– Georg Heiler
Nov 8 at 10:28
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:
a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop– Uvar
Nov 8 at 10:30
That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words:
a.columns = a.iloc[1:]; a = a.iloc[2:]
in the loop– Uvar
Nov 8 at 10:30
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53204367%2fpandas-read-excel-sheet-with-multiple-sheets-and-different-header-offsets%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Interesting. So the
DataFrame
constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find theset
of headers; before reading in the actual data.– Uvar
Nov 8 at 9:33
At least for my current data it turns out that there are only the two variants of header:
(1,2), (2,3)
. However, ideally this would be recognised dynamically.– Georg Heiler
Nov 8 at 9:34
Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39
Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46
Ok. can you try the following:
import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)
? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..– Uvar
Nov 8 at 10:09