pandas read excel sheet with multiple sheets and different header offsets

up vote
1
down vote

favorite

I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:

pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')

pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')

Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?

I.e. how can

pd.read_excel(file_name, sheet_name=None)

be passed a varying header argument or choose at least the 2 first (non empty) rows as header?

edit

dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.

edit2

Description of exact file structure:

... (varying number of empty rows)

__irrelevant_row__

HEADER_1

HEADER_2

where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33

At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34

Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39

Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46

Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

|
show 1 more comment

up vote
1
down vote

favorite

I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:

pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')

pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')

Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?

I.e. how can

pd.read_excel(file_name, sheet_name=None)

be passed a varying header argument or choose at least the 2 first (non empty) rows as header?

edit

dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.

edit2

Description of exact file structure:

... (varying number of empty rows)

__irrelevant_row__

HEADER_1

HEADER_2

where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33

At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34

Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39

Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46

Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

|
show 1 more comment

up vote
1
down vote

favorite

I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:

pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')

pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')

Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?

I.e. how can

pd.read_excel(file_name, sheet_name=None)

be passed a varying header argument or choose at least the 2 first (non empty) rows as header?

edit

dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.

edit2

Description of exact file structure:

... (varying number of empty rows)

__irrelevant_row__

HEADER_1

HEADER_2

where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

I have to read an Excel sheet in pandas which contains multiple sheets.
Unfortunately, the number of white space rows before the header starts seems to be different:

pd.read_excel('foo.xlsx', header=[2,3], sheet_name='first')

pd.read_excel('foo.xlsx', header=[1,2], sheet_name='second')

Is there an elegant way to fix this and read the Excel into a pandas.Dataframe with an additional column which contains the name of each sheet?

I.e. how can

pd.read_excel(file_name, sheet_name=None)

be passed a varying header argument or choose at least the 2 first (non empty) rows as header?

edit

dynamically skip top blank rows of excel in python pandas
seems to be related but not the solution as only the first headers are accepted.

edit2

Description of exact file structure:

... (varying number of empty rows)

__irrelevant_row__

HEADER_1

HEADER_2

where currently it is either 1 or 0 empty rows. But as pointed out in the comment it would be great if that would be more dynamic.

python excel pandas

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

edited Nov 8 at 9:36

asked Nov 8 at 8:59

Georg Heiler

4,883551123

asked Nov 8 at 8:59

Georg Heiler

4,883551123

asked Nov 8 at 8:59

Georg Heiler

4,883551123

Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33

At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34

Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39

Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46

Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

|
show 1 more comment

Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33

At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34

Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39

Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46

Ok. can you try the following: import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a) ? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

Interesting. So the DataFrame constructor needs to assess at each sheet whether the current set of column names is sufficient, or it needs to expand? In that case it might be worthwhile to first loop over the sheets (could be done using parts of the linked question) to find the set of headers; before reading in the actual data.
– Uvar
Nov 8 at 9:33

At least for my current data it turns out that there are only the two variants of header: (1,2), (2,3). However, ideally this would be recognised dynamically.
– Georg Heiler
Nov 8 at 9:34

Could you perchance add a mini-example (something like 2 sheets with arbitrary headers and minimal amount of data) to create a reproducible test, where we can assess/discuss the intermediate results as well?
– Uvar
Nov 8 at 9:39

Certainly, github.com/geoHeil/pandas-dynamic-header-from-excel/blob/master/… is an sample Excel
– Georg Heiler
Nov 8 at 9:46

Ok. can you try the following:

import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)

? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

Ok. can you try the following:

import openpyxl; book=openpyxl.load_workbook(PATH_TO_FILE); for sh in book.sheetnames: a = pd.DataFrame(book[sh].values).dropna().reset_index(drop=True); print(a)

? Apologies for the chain code, but as it's not an answer yet, can't do it outside the comment box..
– Uvar
Nov 8 at 10:09

|
show 1 more comment

1 Answer
1

active

oldest

votes

up vote
0
down vote

accepted

I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:

import openpyxl

import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)

for sh in book.sheetnames:

    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)

    a.columns = a.iloc[1]

    a = a.iloc[2:]

    a.iloc[0].index.name=sh

    a["sheet"] = a.iloc[0].index.name

    try:

        b = b.append(a)

    except NameError:

        b = a.copy()

b.iloc[0].index.name = ''

print(b)

#  header1 header2   sheet

#2       1       2   first

#3       3       4   first

#2       1       2  second

#3       3       4  second

#2       1       2     3rd

#3       3       4     3rd

Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53204367%2fpandas-read-excel-sheet-with-multiple-sheets-and-different-header-offsets%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

accepted

I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:

import openpyxl

import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)

for sh in book.sheetnames:

    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)

    a.columns = a.iloc[1]

    a = a.iloc[2:]

    a.iloc[0].index.name=sh

    a["sheet"] = a.iloc[0].index.name

    try:

        b = b.append(a)

    except NameError:

        b = a.copy()

b.iloc[0].index.name = ''

print(b)

#  header1 header2   sheet

#2       1       2   first

#3       3       4   first

#2       1       2  second

#3       3       4  second

#2       1       2     3rd

#3       3       4     3rd

Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

add a comment |

up vote
0
down vote

accepted

I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:

import openpyxl

import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)

for sh in book.sheetnames:

    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)

    a.columns = a.iloc[1]

    a = a.iloc[2:]

    a.iloc[0].index.name=sh

    a["sheet"] = a.iloc[0].index.name

    try:

        b = b.append(a)

    except NameError:

        b = a.copy()

b.iloc[0].index.name = ''

print(b)

#  header1 header2   sheet

#2       1       2   first

#3       3       4   first

#2       1       2  second

#3       3       4  second

#2       1       2     3rd

#3       3       4     3rd

Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

add a comment |

up vote
0
down vote

accepted

I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:

import openpyxl

import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)

for sh in book.sheetnames:

    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)

    a.columns = a.iloc[1]

    a = a.iloc[2:]

    a.iloc[0].index.name=sh

    a["sheet"] = a.iloc[0].index.name

    try:

        b = b.append(a)

    except NameError:

        b = a.copy()

b.iloc[0].index.name = ''

print(b)

#  header1 header2   sheet

#2       1       2   first

#3       3       4   first

#2       1       2  second

#3       3       4  second

#2       1       2     3rd

#3       3       4     3rd

Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

I am certain this could be done in a more neat fashion, but a way to achieve (I think) what you want is:

import openpyxl

import pandas as pd

book = openpyxl.load_workbook(PATH_TO_FILE)

for sh in book.sheetnames:

    a = pd.DataFrame(book[sh].values).dropna(how='all').reset_index(drop=True)

    a.columns = a.iloc[1]

    a = a.iloc[2:]

    a.iloc[0].index.name=sh

    a["sheet"] = a.iloc[0].index.name

    try:

        b = b.append(a)

    except NameError:

        b = a.copy()

b.iloc[0].index.name = ''

print(b)

#  header1 header2   sheet

#2       1       2   first

#3       3       4   first

#2       1       2  second

#3       3       4  second

#2       1       2     3rd

#3       3       4     3rd

Unfortunately I have no clue how it interacts with your actual data, but I do hope this helps you in your quest.

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

edited Nov 8 at 10:34

answered Nov 8 at 10:24

Uvar

2,401520

answered Nov 8 at 10:24

Uvar

2,401520

answered Nov 8 at 10:24

Uvar

2,401520

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

add a comment |

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

Indeed, this works for the sample file. Do you see an option wo work also for the real file, i.e. for a file which contains NULL values in the data rows? Your method is currently deleting all the data rows which contain at least a single null value.
– Georg Heiler
Nov 8 at 10:26

Using how='all' will only filter completely empty rows so this is a step better. But I need to also fix the multi index. I.e. there are 2 rows of header in the real data.
– Georg Heiler
Nov 8 at 10:28

That is a nice addition. Then you can manually delete the first row which you wanted to skip. In other words: a.columns = a.iloc[1:]; a = a.iloc[2:] in the loop
– Uvar
Nov 8 at 10:30

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Wsrtjtyk

pandas read excel sheet with multiple sheets and different header offsets

edit

edit2

edit

edit2

edit

edit2

edit

edit2

1 Answer
1

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Post as a guest

這個網誌中的熱門文章

Academy of Television Arts & Sciences

L'Équipe

1995 France bombings

pandas read excel sheet with multiple sheets and different header offsets

edit

edit2

edit

edit2

edit

edit2

edit

edit2

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

這個網誌中的熱門文章

Academy of Television Arts & Sciences

L'Équipe

1995 France bombings

1 Answer
1

1 Answer
1

1 Answer
1