How can I improve my looping Python script, involving different mathematical operations for different...
up vote
0
down vote
favorite
I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.
I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.
A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
I then have another file with a table containing weights and frequencies for the three SNPs, such as below:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.
The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.
import sys
snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE
pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)
# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
for line in f:
snp[pos]=(line.split()[0])
riskall[line.split()[0]]=line.split()[1]
weights[line.split()[0]]=line.split()[4]
freqs[line.split()[0]]=line.split()[pop]
pos+=1
### compute scores for each combination
with open(sys.argv[2], 'r') as f:
for line in f:
score=0
freq=1
for j in range(len(line.split())):
rsid=snp[j]
riskallele=riskall[rsid]
frequency=freqs[rsid]
wei=weights[rsid]
allele1=line.split()[j][0]
allele2=line.split()[j][1]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-float(frequency))*(1-float(frequency))
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=float(wei)
freq*=2*(1-float(frequency))*(float(frequency))
elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
score+=2*float(wei)
freq*=float(frequency)*float(frequency)
if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?
I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.
The expected output from the above would be as such:
GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633
EDIT: Added previous post link, along with expected output.
python performance loops
|
show 1 more comment
up vote
0
down vote
favorite
I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.
I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.
A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
I then have another file with a table containing weights and frequencies for the three SNPs, such as below:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.
The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.
import sys
snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE
pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)
# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
for line in f:
snp[pos]=(line.split()[0])
riskall[line.split()[0]]=line.split()[1]
weights[line.split()[0]]=line.split()[4]
freqs[line.split()[0]]=line.split()[pop]
pos+=1
### compute scores for each combination
with open(sys.argv[2], 'r') as f:
for line in f:
score=0
freq=1
for j in range(len(line.split())):
rsid=snp[j]
riskallele=riskall[rsid]
frequency=freqs[rsid]
wei=weights[rsid]
allele1=line.split()[j][0]
allele2=line.split()[j][1]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-float(frequency))*(1-float(frequency))
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=float(wei)
freq*=2*(1-float(frequency))*(float(frequency))
elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
score+=2*float(wei)
freq*=float(frequency)*float(frequency)
if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?
I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.
The expected output from the above would be as such:
GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633
EDIT: Added previous post link, along with expected output.
python performance loops
3
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
1
At a first glance - you are repeating.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.
– Amadan
Nov 9 at 5:17
1
Could you provide some expected output?
– Alex
Nov 9 at 5:29
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41
|
show 1 more comment
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.
I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.
A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
I then have another file with a table containing weights and frequencies for the three SNPs, such as below:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.
The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.
import sys
snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE
pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)
# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
for line in f:
snp[pos]=(line.split()[0])
riskall[line.split()[0]]=line.split()[1]
weights[line.split()[0]]=line.split()[4]
freqs[line.split()[0]]=line.split()[pop]
pos+=1
### compute scores for each combination
with open(sys.argv[2], 'r') as f:
for line in f:
score=0
freq=1
for j in range(len(line.split())):
rsid=snp[j]
riskallele=riskall[rsid]
frequency=freqs[rsid]
wei=weights[rsid]
allele1=line.split()[j][0]
allele2=line.split()[j][1]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-float(frequency))*(1-float(frequency))
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=float(wei)
freq*=2*(1-float(frequency))*(float(frequency))
elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
score+=2*float(wei)
freq*=float(frequency)*float(frequency)
if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?
I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.
The expected output from the above would be as such:
GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633
EDIT: Added previous post link, along with expected output.
python performance loops
I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.
I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.
A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
I then have another file with a table containing weights and frequencies for the three SNPs, such as below:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.
The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.
import sys
snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE
pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)
# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
for line in f:
snp[pos]=(line.split()[0])
riskall[line.split()[0]]=line.split()[1]
weights[line.split()[0]]=line.split()[4]
freqs[line.split()[0]]=line.split()[pop]
pos+=1
### compute scores for each combination
with open(sys.argv[2], 'r') as f:
for line in f:
score=0
freq=1
for j in range(len(line.split())):
rsid=snp[j]
riskallele=riskall[rsid]
frequency=freqs[rsid]
wei=weights[rsid]
allele1=line.split()[j][0]
allele2=line.split()[j][1]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-float(frequency))*(1-float(frequency))
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=float(wei)
freq*=2*(1-float(frequency))*(float(frequency))
elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
score+=2*float(wei)
freq*=float(frequency)*float(frequency)
if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?
I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.
The expected output from the above would be as such:
GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633
EDIT: Added previous post link, along with expected output.
python performance loops
python performance loops
edited Nov 9 at 6:46
asked Nov 9 at 5:11
Volka
317
317
3
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
1
At a first glance - you are repeating.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.
– Amadan
Nov 9 at 5:17
1
Could you provide some expected output?
– Alex
Nov 9 at 5:29
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41
|
show 1 more comment
3
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
1
At a first glance - you are repeating.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.
– Amadan
Nov 9 at 5:17
1
Could you provide some expected output?
– Alex
Nov 9 at 5:29
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41
3
3
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
1
1
At a first glance - you are repeating
.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.– Amadan
Nov 9 at 5:17
At a first glance - you are repeating
.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.– Amadan
Nov 9 at 5:17
1
1
Could you provide some expected output?
– Alex
Nov 9 at 5:29
Could you provide some expected output?
– Alex
Nov 9 at 5:29
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:
python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:
python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41
|
show 1 more comment
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
Disclaimer: I did not test this, it is rather a pseudo-code.
I provide some general ideas about what is slow/fast in programming and particularly in python:
You should try to move out of loops everything what is not changing in that loop.
Also, in python, you should try to replace loops with comprehensions
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
[ expression for item in list if conditional ]
you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient
rsid=snp[j]
riskallele=riskall[rsid]
is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop
):
snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])}
for line in map(split,f)]
and your computing loop can become something like this:
### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
for fline in f:
score=0.0 # work with floats from the start
freq=1.0
line = fline.split() # do it only once
for j,field in line:
s=snp[j]
riskallele=s["riskall"]
frequency=s["freq"]
wei=s["weight"]
(allele1,allele2) = line[j]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-frequency)*(1-frequency)
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=wei
freq*=2*(1-frequency)*frequency
elif allele1 == riskallele: # and allele2 == riskallele: # homozygous for risk, be sure to limit risk to second allele!
score+=2*wei
freq*=frequency*frequency
if freq < stop): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
The ultimate goal I would try to achieve is to convert it to some map/reduce form:
the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form
[AC,C] -> score,freq_function and I can get rid of the whole if
block.
Sometimes the best approach is to split the code to small functions, reorganize and then merge back.
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Disclaimer: I did not test this, it is rather a pseudo-code.
I provide some general ideas about what is slow/fast in programming and particularly in python:
You should try to move out of loops everything what is not changing in that loop.
Also, in python, you should try to replace loops with comprehensions
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
[ expression for item in list if conditional ]
you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient
rsid=snp[j]
riskallele=riskall[rsid]
is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop
):
snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])}
for line in map(split,f)]
and your computing loop can become something like this:
### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
for fline in f:
score=0.0 # work with floats from the start
freq=1.0
line = fline.split() # do it only once
for j,field in line:
s=snp[j]
riskallele=s["riskall"]
frequency=s["freq"]
wei=s["weight"]
(allele1,allele2) = line[j]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-frequency)*(1-frequency)
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=wei
freq*=2*(1-frequency)*frequency
elif allele1 == riskallele: # and allele2 == riskallele: # homozygous for risk, be sure to limit risk to second allele!
score+=2*wei
freq*=frequency*frequency
if freq < stop): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
The ultimate goal I would try to achieve is to convert it to some map/reduce form:
the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form
[AC,C] -> score,freq_function and I can get rid of the whole if
block.
Sometimes the best approach is to split the code to small functions, reorganize and then merge back.
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
add a comment |
up vote
1
down vote
accepted
Disclaimer: I did not test this, it is rather a pseudo-code.
I provide some general ideas about what is slow/fast in programming and particularly in python:
You should try to move out of loops everything what is not changing in that loop.
Also, in python, you should try to replace loops with comprehensions
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
[ expression for item in list if conditional ]
you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient
rsid=snp[j]
riskallele=riskall[rsid]
is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop
):
snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])}
for line in map(split,f)]
and your computing loop can become something like this:
### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
for fline in f:
score=0.0 # work with floats from the start
freq=1.0
line = fline.split() # do it only once
for j,field in line:
s=snp[j]
riskallele=s["riskall"]
frequency=s["freq"]
wei=s["weight"]
(allele1,allele2) = line[j]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-frequency)*(1-frequency)
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=wei
freq*=2*(1-frequency)*frequency
elif allele1 == riskallele: # and allele2 == riskallele: # homozygous for risk, be sure to limit risk to second allele!
score+=2*wei
freq*=frequency*frequency
if freq < stop): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
The ultimate goal I would try to achieve is to convert it to some map/reduce form:
the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form
[AC,C] -> score,freq_function and I can get rid of the whole if
block.
Sometimes the best approach is to split the code to small functions, reorganize and then merge back.
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Disclaimer: I did not test this, it is rather a pseudo-code.
I provide some general ideas about what is slow/fast in programming and particularly in python:
You should try to move out of loops everything what is not changing in that loop.
Also, in python, you should try to replace loops with comprehensions
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
[ expression for item in list if conditional ]
you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient
rsid=snp[j]
riskallele=riskall[rsid]
is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop
):
snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])}
for line in map(split,f)]
and your computing loop can become something like this:
### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
for fline in f:
score=0.0 # work with floats from the start
freq=1.0
line = fline.split() # do it only once
for j,field in line:
s=snp[j]
riskallele=s["riskall"]
frequency=s["freq"]
wei=s["weight"]
(allele1,allele2) = line[j]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-frequency)*(1-frequency)
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=wei
freq*=2*(1-frequency)*frequency
elif allele1 == riskallele: # and allele2 == riskallele: # homozygous for risk, be sure to limit risk to second allele!
score+=2*wei
freq*=frequency*frequency
if freq < stop): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
The ultimate goal I would try to achieve is to convert it to some map/reduce form:
the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form
[AC,C] -> score,freq_function and I can get rid of the whole if
block.
Sometimes the best approach is to split the code to small functions, reorganize and then merge back.
Disclaimer: I did not test this, it is rather a pseudo-code.
I provide some general ideas about what is slow/fast in programming and particularly in python:
You should try to move out of loops everything what is not changing in that loop.
Also, in python, you should try to replace loops with comprehensions
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
[ expression for item in list if conditional ]
you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient
rsid=snp[j]
riskallele=riskall[rsid]
is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop
):
snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])}
for line in map(split,f)]
and your computing loop can become something like this:
### compute scores for each combination
stop = sys.argv[3]
with open(sys.argv[2], 'r') as f:
for fline in f:
score=0.0 # work with floats from the start
freq=1.0
line = fline.split() # do it only once
for j,field in line:
s=snp[j]
riskallele=s["riskall"]
frequency=s["freq"]
wei=s["weight"]
(allele1,allele2) = line[j]
if allele2 != riskallele: # homozygous for ref
score+=0
freq*=(1-frequency)*(1-frequency)
elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
score+=wei
freq*=2*(1-frequency)*frequency
elif allele1 == riskallele: # and allele2 == riskallele: # homozygous for risk, be sure to limit risk to second allele!
score+=2*wei
freq*=frequency*frequency
if freq < stop): # threshold to stop loop in interest of efficiency
break
print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))
The ultimate goal I would try to achieve is to convert it to some map/reduce form:
the allele can have [A,C,G,T][A,C,G,T] 16 combinations and we test against it [A,C,G,T] this only 64 combinations so I can create a map in form
[AC,C] -> score,freq_function and I can get rid of the whole if
block.
Sometimes the best approach is to split the code to small functions, reorganize and then merge back.
edited Nov 16 at 13:02
answered Nov 9 at 11:12
petrch
31626
31626
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
add a comment |
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
Good to hear that. It always depends how much time you want to invest and what is "good enough" running time. If your input is large (eg. 1 GB or more) the IO may start to be an issue as well. Read for example this rabexc.org/posts/io-performance-in-python. You want to put as much code out of the for loop so if you can anything precalculate than it may help. the "print()" itself is a bit slow. Maybe if you care just about some results than print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53220198%2fhow-can-i-improve-my-looping-python-script-involving-different-mathematical-ope%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
3
instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13
1
At a first glance - you are repeating
.split()
all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.– Amadan
Nov 9 at 5:17
1
Could you provide some expected output?
– Alex
Nov 9 at 5:29
tried using numba ??
– Lijo Jose
Nov 9 at 6:18
@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such:
python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41