Weighted violinplot
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
mymean = rnd.uniform(low=2.8,high=3.2)
mysigma = rnd.uniform(low=0.6,high=1.0)
df = df.append(
pd.DataFrame({'Fuel': myfuel,
'MW': np.array(rnd.lognormal(mean=mymean,sigma=mysigma,size=1000))
}),
ignore_index=True
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or if anyone has an idea about the most elegant way to do that.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
df = df.append(
pd.DataFrame({'Fuel': myfuel,
# To make it easy to see that the violinplot of dfw (below)
# makes sense, here we'll just use a simple range list from
# 0 to 10
'MW': np.array(range(11))
}),
ignore_index=True
)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
# calculate and store the number of norm MW there are within the MW of each plant
mynum = int(round(row['MW']/norm))
# insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
dfw = dfw.append(
pd.DataFrame({'Fuel': row['Fuel'],
'MW': np.array([row['MW']]*mynum,dtype='float')
}),
ignore_index=True
)
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
item.set_fontsize(8)
item.set_rotation(30)
item.set_horizontalalignment('right')
plt.show()
python pandas seaborn violin-plot
add a comment |
I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
mymean = rnd.uniform(low=2.8,high=3.2)
mysigma = rnd.uniform(low=0.6,high=1.0)
df = df.append(
pd.DataFrame({'Fuel': myfuel,
'MW': np.array(rnd.lognormal(mean=mymean,sigma=mysigma,size=1000))
}),
ignore_index=True
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or if anyone has an idea about the most elegant way to do that.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
df = df.append(
pd.DataFrame({'Fuel': myfuel,
# To make it easy to see that the violinplot of dfw (below)
# makes sense, here we'll just use a simple range list from
# 0 to 10
'MW': np.array(range(11))
}),
ignore_index=True
)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
# calculate and store the number of norm MW there are within the MW of each plant
mynum = int(round(row['MW']/norm))
# insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
dfw = dfw.append(
pd.DataFrame({'Fuel': row['Fuel'],
'MW': np.array([row['MW']]*mynum,dtype='float')
}),
ignore_index=True
)
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
item.set_fontsize(8)
item.set_rotation(30)
item.set_horizontalalignment('right')
plt.show()
python pandas seaborn violin-plot
add a comment |
I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
mymean = rnd.uniform(low=2.8,high=3.2)
mysigma = rnd.uniform(low=0.6,high=1.0)
df = df.append(
pd.DataFrame({'Fuel': myfuel,
'MW': np.array(rnd.lognormal(mean=mymean,sigma=mysigma,size=1000))
}),
ignore_index=True
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or if anyone has an idea about the most elegant way to do that.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
df = df.append(
pd.DataFrame({'Fuel': myfuel,
# To make it easy to see that the violinplot of dfw (below)
# makes sense, here we'll just use a simple range list from
# 0 to 10
'MW': np.array(range(11))
}),
ignore_index=True
)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
# calculate and store the number of norm MW there are within the MW of each plant
mynum = int(round(row['MW']/norm))
# insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
dfw = dfw.append(
pd.DataFrame({'Fuel': row['Fuel'],
'MW': np.array([row['MW']]*mynum,dtype='float')
}),
ignore_index=True
)
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
item.set_fontsize(8)
item.set_rotation(30)
item.set_horizontalalignment('right')
plt.show()
python pandas seaborn violin-plot
I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
mymean = rnd.uniform(low=2.8,high=3.2)
mysigma = rnd.uniform(low=0.6,high=1.0)
df = df.append(
pd.DataFrame({'Fuel': myfuel,
'MW': np.array(rnd.lognormal(mean=mymean,sigma=mysigma,size=1000))
}),
ignore_index=True
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size by MW this code makes:
This violinplot is very deceptive, without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs--basically a weighted version of this first violinplot.
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or if anyone has an idea about the most elegant way to do that.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seemed slightly brute force. Any better ideas for approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
df = df.append(
pd.DataFrame({'Fuel': myfuel,
# To make it easy to see that the violinplot of dfw (below)
# makes sense, here we'll just use a simple range list from
# 0 to 10
'MW': np.array(range(11))
}),
ignore_index=True
)
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
# calculate and store the number of norm MW there are within the MW of each plant
mynum = int(round(row['MW']/norm))
# insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
dfw = dfw.append(
pd.DataFrame({'Fuel': row['Fuel'],
'MW': np.array([row['MW']]*mynum,dtype='float')
}),
ignore_index=True
)
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
item.set_fontsize(8)
item.set_rotation(30)
item.set_horizontalalignment('right')
plt.show()
python pandas seaborn violin-plot
python pandas seaborn violin-plot
edited Nov 23 '18 at 20:27
Emily Beth
asked Nov 23 '18 at 17:23
Emily BethEmily Beth
169111
169111
add a comment |
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450787%2fweighted-violinplot%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450787%2fweighted-violinplot%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown