pandas read excel sheet with multiple sheets and different header offsets











up vote
1
down vote

favorite












I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:



pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')


Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?



I.e. how can



pd.read_excel(file_name, sheet_name=None)


be passed a varying header argument or choose at least the 2 first (non empty) rows as header?



edit



dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.



edit2



Description of exact file structure:



... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2


where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.










share|improve this question
























  • Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
    – Uvar
    Nov 8 at 9:33










  • At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
    – Georg Heiler
    Nov 8 at 9:34










  • Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
    – Uvar
    Nov 8 at 9:39










  • Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
    – Georg Heiler
    Nov 8 at 9:46










  • Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
    – Uvar
    Nov 8 at 10:09

















up vote
1
down vote

favorite












I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:



pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')


Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?



I.e. how can



pd.read_excel(file_name, sheet_name=None)


be passed a varying header argument or choose at least the 2 first (non empty) rows as header?



edit



dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.



edit2



Description of exact file structure:



... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2


where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.










share|improve this question
























  • Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
    – Uvar
    Nov 8 at 9:33










  • At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
    – Georg Heiler
    Nov 8 at 9:34










  • Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
    – Uvar
    Nov 8 at 9:39










  • Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
    – Georg Heiler
    Nov 8 at 9:46










  • Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
    – Uvar
    Nov 8 at 10:09















up vote
1
down vote

favorite









up vote
1
down vote

favorite











I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:



pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')


Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?



I.e. how can



pd.read_excel(file_name, sheet_name=None)


be passed a varying header argument or choose at least the 2 first (non empty) rows as header?



edit



dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.



edit2



Description of exact file structure:



... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2


where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.










share|improve this question















I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:



pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')
pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')


Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?



I.e. how can



pd.read_excel(file_name, sheet_name=None)


be passed a varying header argument or choose at least the 2 first (non empty) rows as header?



edit



dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.



edit2



Description of exact file structure:



... (varying number of empty rows)
__irrelevant_row__
HEADER_1
HEADER_2


where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.







python excel pandas






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 8 at 9:36

























asked Nov 8 at 8:59









Georg Heiler

4,883551123




4,883551123












  • Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
    – Uvar
    Nov 8 at 9:33










  • At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
    – Georg Heiler
    Nov 8 at 9:34










  • Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
    – Uvar
    Nov 8 at 9:39










  • Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
    – Georg Heiler
    Nov 8 at 9:46










  • Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
    – Uvar
    Nov 8 at 10:09




















  • Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
    – Uvar
    Nov 8 at 9:33










  • At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
    – Georg Heiler
    Nov 8 at 9:34










  • Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
    – Uvar
    Nov 8 at 9:39










  • Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
    – Georg Heiler
    Nov 8 at 9:46










  • Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
    – Uvar
    Nov 8 at 10:09


















Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33




Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33












At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34




At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34












Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39




Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39












Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46




Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46












Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09






Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09














1 Answer
1






active

oldest

votes

















up vote
0
down vote



accepted










I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:



import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd


Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.






share|improve this answer























  • Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
    – Georg Heiler
    Nov 8 at 10:26










  • Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
    – Georg Heiler
    Nov 8 at 10:28












  • That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
    – Uvar
    Nov 8 at 10:30













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53204367%2fpandas-read-excel-sheet-with-multiple-sheets-and-different-header-offsets%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
0
down vote



accepted










I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:



import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd


Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.






share|improve this answer























  • Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
    – Georg Heiler
    Nov 8 at 10:26










  • Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
    – Georg Heiler
    Nov 8 at 10:28












  • That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
    – Uvar
    Nov 8 at 10:30

















up vote
0
down vote



accepted










I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:



import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd


Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.






share|improve this answer























  • Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
    – Georg Heiler
    Nov 8 at 10:26










  • Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
    – Georg Heiler
    Nov 8 at 10:28












  • That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
    – Uvar
    Nov 8 at 10:30















up vote
0
down vote



accepted







up vote
0
down vote



accepted






I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:



import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd


Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.






share|improve this answer














I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:



import openpyxl
import pandas as pd
book = openpyxl.load_workbook(PATH_TO_FILE)
for sh in book.sheetnames:
a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)
a.columns = a.iloc[1]
a = a.iloc[2:]
a.iloc[0].index.name=sh
a["sheet"] = a.iloc[0].index.name
try:
b = b.append(a)
except NameError:
b = a.copy()
b.iloc[0].index.name = ''
print(b)
# header1 header2 sheet
#2 1 2 first
#3 3 4 first
#2 1 2 second
#3 3 4 second
#2 1 2 3rd
#3 3 4 3rd


Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 8 at 10:34

























answered Nov 8 at 10:24









Uvar

2,401520




2,401520












  • Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
    – Georg Heiler
    Nov 8 at 10:26










  • Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
    – Georg Heiler
    Nov 8 at 10:28












  • That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
    – Uvar
    Nov 8 at 10:30




















  • Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
    – Georg Heiler
    Nov 8 at 10:26










  • Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
    – Georg Heiler
    Nov 8 at 10:28












  • That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
    – Uvar
    Nov 8 at 10:30


















Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26




Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26












Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28






Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28














That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30






That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53204367%2fpandas-read-excel-sheet-with-multiple-sheets-and-different-header-offsets%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Xamarin.form Move up view when keyboard appear

Post-Redirect-Get with Spring WebFlux and Thymeleaf

Anylogic : not able to use stopDelay()