How can I improve my looping Python script, involving different mathematical operations for different...
I am posting again, as I had no luck making the following script more efficient. For more details, do check out my previous post, but the basic situation is as follows.



I have written a script to compute a score, as well as a frequency, for a list of genetic profiles.



A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs looks like the table below, which shows all possible combinations of the alleles of all three SNPs. This table was generated with itertools.product in another script (a rough sketch of how that could be done is shown after the table):



AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
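
For reference, a table like the one above could be generated with itertools.product roughly as follows (a hypothetical sketch; the author's actual generating script is not shown, and the row order may differ from the table above):

from itertools import product

snp_alleles = [("A", "T"), ("C", "G"), ("T", "A")]           # (risk, ref) pairs taken from the second file
genotypes = [[r + r, r + q, q + q] for r, q in snp_alleles]  # hom-risk, het, hom-ref per SNP
for combo in product(*genotypes):
    print(" ".join(combo))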


I then have another file with a table containing weights and frequencies for the three SNPs, such as below:



SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136


The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.



The main script takes these two files and, for each genetic profile, computes a score (the sum of the log odds ratios of the risk alleles carried at each SNP) as well as a frequency (the product of the per-SNP genotype frequencies, assuming Hardy-Weinberg equilibrium).
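
In other words, the per-SNP contribution is as follows (a sketch restating the rule above, not the actual script, which follows below; here p is the risk-allele frequency and w = log(OR)):

def snp_contribution(genotype, risk_allele, p, w):
    # number of risk-allele copies in this genotype: 0, 1 or 2
    n = genotype.count(risk_allele)
    # score adds n * log(OR); frequency multiplies by the Hardy-Weinberg genotype probability
    return n * w, ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)[n]

Note that the actual script below does not count copies; it keys off whether the second allele of the genotype is the risk allele.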



import sys

snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE

pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)

# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        snp[pos]=(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        freqs[line.split()[0]]=line.split()[pop]

        pos+=1


### compute scores for each combination
with open(sys.argv[2], 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            rsid=snp[j]
            riskallele=riskall[rsid]
            frequency=freqs[rsid]
            wei=weights[rsid]
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskallele: # homozygous for ref
                score+=0
                freq*=(1-float(frequency))*(1-float(frequency))
            elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
                score+=float(wei)
                freq*=2*(1-float(frequency))*(float(frequency))
            elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
                score+=2*float(wei)
                freq*=float(frequency)*float(frequency)

            if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))


I have added a threshold (a command-line argument) for breaking out of the loop when the frequency gets extremely low. What improvements can be made to speed up the script?



I have tried using Pandas, which was still much slower, as I am not sure whether vectorization is possible in this case. I have had issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries rather than lists, which gave a slight improvement.
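
For what it is worth, vectorization does look possible if the genotypes are first recoded as risk-allele counts. The sketch below is my own untested assumption of how the data could be laid out with NumPy; it does not reproduce the early-break threshold:

import numpy as np

def vectorized_scores(profile_rows, risk_alleles, log_or, p):
    """profile_rows: genotype rows such as "AA CC TT"; risk_alleles: e.g. ["A", "C", "T"];
    log_or, p: NumPy arrays of per-SNP log odds ratios and risk-allele frequencies."""
    # counts[i, j] = number of risk-allele copies in profile i at SNP j (0, 1 or 2)
    counts = np.array([[g.count(r) for g, r in zip(row.split(), risk_alleles)]
                       for row in profile_rows])
    scores = counts @ log_or                                  # sum of n * log(OR) per profile
    hwe = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])   # HWE probabilities, shape (3, n_snps)
    freqs = hwe[counts, np.arange(counts.shape[1])].prod(axis=1)
    return scores, freqs

# e.g., with the example tables above:
# scores, freqs = vectorized_scores(rows, ["A", "C", "T"],
#                                   np.array([0.223143551314, 0.0676586484738, 0.0769610411361]),
#                                   np.array([0.97273, 0.3, 0.1136]))

Note that this counts risk-allele copies in either position, whereas my original script assumes the risk allele appears as the second character in heterozygotes.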



The expected output from the above would be as follows:



GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633


EDIT: Added previous post link, along with expected output.

Tags: python performance loops

asked Nov 9 at 5:11 by Volka, edited Nov 9 at 6:46

  • instead of writing "do checkout my post history" you should mention your post link – Gahan, Nov 9 at 5:13

  • At a first glance - you are repeating .split() all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with. – Amadan, Nov 9 at 5:17

  • Could you provide some expected output? – Alex, Nov 9 at 5:29

  • Tried using numba? – Lijo Jose, Nov 9 at 6:18

  • @Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such: python myscript.py table2.txt table1.txt 1e-5 1 – Volka, Nov 9 at 6:41

1 Answer (accepted, score 1)

Disclaimer: I did not test this; it is more of a pseudo-code sketch.



Here are some general ideas about what is slow and fast in programming, and in Python in particular:



You should move everything that does not change inside a loop out of that loop. Also, in Python, you should try to replace explicit loops with comprehensions:
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python



[ expression for item in list if conditional ]


You should try to use map/filter functions where possible, and you can also prepare your data so that the program is more efficient.



rsid=snp[j]
riskallele=riskall[rsid]


is basically a double mapping, and it can probably be done better if you create your snp structure like this (you can use the -1 index for the last column and get rid of pop):



snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])} 
for line in map(split,f)]


and your computing loop can become something like this:



### compute scores for each combination
stop = float(sys.argv[3])
with open(sys.argv[2], 'r') as f:
    for fline in f:
        score = 0.0  # work with floats from the start
        freq = 1.0
        line = fline.split()  # split only once

        for j, field in enumerate(line):
            s = snp[j]
            riskallele = s["riskall"]
            frequency = s["freq"]
            wei = s["weight"]
            allele1, allele2 = field

            if allele2 != riskallele:  # homozygous for ref
                score += 0
                freq *= (1 - frequency) * (1 - frequency)
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is the risk allele!
                score += wei
                freq *= 2 * (1 - frequency) * frequency
            elif allele1 == riskallele:  # and allele2 == riskallele: homozygous for risk, be sure to limit risk to the second allele!
                score += 2 * wei
                freq *= frequency * frequency

            if freq < stop:  # threshold to stop the loop early in the interest of efficiency
                break

        print(','.join(line) + "\t" + str(score) + "\t" + str(freq))


The ultimate goal I would try to achieve is to convert it to some map/reduce form:



An allele pair can take [A,C,G,T] x [A,C,G,T] = 16 combinations, and we test it against a risk allele in [A,C,G,T], which gives only 64 combinations. So I could create a map of the form
(allele pair, risk allele) -> (score increment, frequency factor) and get rid of the whole if block.
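
A sketch of that lookup idea (my reading of it, untested) could look like the following; note that it counts risk-allele copies in either position, whereas the original script assumes the risk allele appears second in heterozygotes:

# Precompute, per SNP, a dict mapping the two-letter genotype string directly to
# (score increment, frequency factor), so the inner loop needs no if/elif chain.
lookup = []
for s in snp:                                   # snp built as the list of dicts above
    p, w, risk = s["freq"], s["weight"], s["riskall"]
    table = {}
    for a1 in "ACGT":
        for a2 in "ACGT":
            n = (a1 == risk) + (a2 == risk)     # 0, 1 or 2 copies of the risk allele
            table[a1 + a2] = (n * w, ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)[n])
    lookup.append(table)

# The inner loop then becomes:
#     for j, geno in enumerate(line):
#         ds, df = lookup[j][geno]
#         score += ds
#         freq *= df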



Sometimes the best approach is to split the code into small functions, reorganize it, and then merge it back together.

answered Nov 9 at 11:12 by petrch, edited Nov 16 at 13:02

  • One more thing I noticed: "This table was generated using itertool's product in another script" - I think in that case I would also merge the scripts, because you can easily generate it inside this one from the SNP file, and you don't need the generating script. – petrch, Nov 9 at 14:13

  • Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead. – Volka, Nov 16 at 7:31

  • Good to hear that. It always depends on how much time you want to invest and what a "good enough" running time is. If your input is large (e.g. 1 GB or more), IO may start to be an issue as well; see for example rabexc.org/posts/io-performance-in-python. You want to move as much code as possible out of the for loop, so anything you can precalculate may help. print() itself is a bit slow; if you only care about some of the results, print only those, or use buffered file writes and read the results from a file later. – petrch, Nov 16 at 13:22
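
To illustrate the buffered-output suggestion in the last comment, a hypothetical fragment (the names computed_profiles and scores.tsv are made up) might look like this:

# Collect all result lines first and write them in one call,
# instead of calling print() once per profile.
results = []
for fields, score, freq in computed_profiles:   # assumed iterable of (fields, score, freq)
    results.append(",".join(fields) + "\t" + str(score) + "\t" + str(freq))

with open("scores.tsv", "w") as out:            # hypothetical output file name
    out.write("\n".join(results) + "\n")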