Convert pandas.DataFrame to numpy tensor using factor levels for shape [duplicate]
This question already has an answer here:
Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array
1 answer
I have data from a full factorial experiment.  For example, for each of N samples, I have J types of measurement and K measurement loci.  I receive this data in long format, for example,
import numpy as np
import pandas as pd
import itertools
from numpy.random import normal as rnorm
# [[N], [J], [K]]
levels = [[1,2,3,4], ['start', 'stop'], ['gene1', 'gene2', 'gene3']]
# fully crossed
exp_design = list(itertools.product(*levels))
df = pd.DataFrame(exp_design, columns=["sample", "mode", "gene"])
# some fake data
df['x'] = rnorm(size=len(exp_design))
which results in 24 observations (x) with a column for each of the three factors.
> df.head()
    sample  mode    gene    x
0   1       start   gene1   -1.229370
1   1       start   gene2   1.129773
2   1       start   gene3   -1.155202
3   1       stop    gene1   -0.757551
4   1       stop    gene2   -0.166129
I want to convert these observations to the corresponding (N,J,K)-shaped tensor (numpy array). I thought that pivoting to wide format with a MultiIndex and then extracting the values would generate the correct tensor, but the result simply comes out as a column vector:
> df.pivot_table(values='x', index=['sample', 'mode', 'gene']).values
array([[-1.22936989],
       [ 1.12977346],
       [-1.15520216],
       ...,
       [-0.1031641 ],
       [ 1.1296491 ],
       [ 1.31113584]])
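To make the mismatch concrete, here is a minimal sketch that compares the shape of the pivoted values with the sizes of the MultiIndex levels. The seeded generator and the names `wide`, `pivot_shape`, and `level_sizes` are assumptions added for illustration; they are not part of the original setup.

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild the fully crossed design from the example above.
levels = [[1, 2, 3, 4], ['start', 'stop'], ['gene1', 'gene2', 'gene3']]
df = pd.DataFrame(list(itertools.product(*levels)),
                  columns=["sample", "mode", "gene"])

# Fake data; the seed is an arbitrary choice for reproducibility.
rng = np.random.default_rng(0)
df['x'] = rng.normal(size=len(df))

# Pivoting with only an index yields a one-column DataFrame:
# one row per (sample, mode, gene) combination.
wide = df.pivot_table(values='x', index=['sample', 'mode', 'gene'])
pivot_shape = wide.values.shape

# The factor sizes live in the MultiIndex, not in the shape of the values.
level_sizes = tuple(len(level) for level in wide.index.levels)

print(pivot_shape, level_sizes)
```

The values form a 2-D array with a single column, while the (N, J, K) sizes are only recorded in the index levels.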
Is there a quick way to get tensor-formatted data from a long-format pandas.DataFrame?
python pandas numpy tensor numpy-ndarray
                    marked as duplicate by merv, Community♦ Nov 29 '18 at 2:37
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
asked Nov 24 '18 at 3:59 by merv
edited Nov 24 '18 at 4:13 by merv
1 Answer
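Try with nunique, which gives the number of levels of each factor; those counts can then be used to reshape the values: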
df.agg('nunique')
Out[69]: 
sample     4
mode       2
gene       3
x         24
dtype: int64
s=df.agg('nunique')
df.x.values.reshape(s['sample'],s['mode'],s['gene'])
Out[71]: 
array([[[-2.78133759e-01, -1.42234420e+00,  5.42439121e-01],
        [ 2.15359867e+00,  6.55837886e-01, -1.01293568e+00]],
       [[ 7.92306679e-01, -1.62539763e-01, -6.13120335e-01],
        [-2.91567999e-01, -4.01257702e-01,  7.96422763e-01]],
       [[ 1.05088264e-01, -7.23400925e-02,  2.78515041e-01],
        [ 2.63088568e-01,  1.47477886e+00, -2.10735619e+00]],
       [[-1.71756374e+00,  6.12224005e-04, -3.11562798e-02],
        [ 5.26028807e-01, -1.18502045e+00,  1.88633760e+00]]])
I think it's important to note here that this assumes the data frame is first sorted, e.g., df.sort_values(by=['sample', 'mode', 'gene'])
– merv
Nov 24 '18 at 5:46
@merv yes you are right
– Wen-Ben
Nov 24 '18 at 6:33
answered Nov 24 '18 at 4:36 by Wen-Ben
edited Nov 24 '18 at 5:21
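Following the comment about row order, here is a minimal sketch of the same reshape idea that sorts by the factor columns first, so the result does not depend on the order in which the rows arrive. The stand-in values from `np.arange`, the shuffling step, and the names `factors`, `ordered`, `shape`, and `tensor` are assumptions for illustration. It also assumes the design is fully crossed with exactly one row per combination; otherwise the reshape raises a ValueError.

```python
import itertools

import numpy as np
import pandas as pd

# Fully crossed design, as in the question.
levels = [[1, 2, 3, 4], ['start', 'stop'], ['gene1', 'gene2', 'gene3']]
df = pd.DataFrame(list(itertools.product(*levels)),
                  columns=["sample", "mode", "gene"])

# Distinguishable stand-in values (assumption: replaces the random data
# so that the placement of each observation is easy to check).
df['x'] = np.arange(len(df), dtype=float)

# Simulate rows arriving in arbitrary order.
shuffled = df.sample(frac=1, random_state=0)

factors = ['sample', 'mode', 'gene']

# Sort so that the last factor varies fastest, matching NumPy's
# default row-major (C-order) reshape.
ordered = shuffled.sort_values(by=factors)

# One axis per factor, sized by its number of unique levels.
shape = tuple(ordered[f].nunique() for f in factors)
tensor = ordered['x'].to_numpy().reshape(shape)

print(tensor.shape)
```

With this layout, `tensor[i, j, k]` holds the observation for the i-th sample, j-th mode, and k-th gene in sorted level order.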