How can I improve my looping Python script, involving different mathematical operations for different...


I am posting again as I had no luck trying to make the following script more efficient. For more details, do check out my previous post, but the basic situation is as below.

I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.

A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. Hence, the input file for 3 SNPs is something like below, which shows all possible combinations of all alleles for all three SNPs. This table was generated using itertool's product in another script:

    AA   CC   TT

    AT   CC   TT

    TT   CC   TT

    AA   CG   TT

    AT   CG   TT

    TT   CG   TT

    AA   GG   TT

    AT   GG   TT

    TT   GG   TT

    AA   CC   TA

    AT   CC   TA

    TT   CC   TA

    AA   CG   TA

    AT   CG   TA

    TT   CG   TA

    AA   GG   TA

    AT   GG   TA

    TT   GG   TA

    AA   CC   AA

    AT   CC   AA

    TT   CC   AA

    AA   CG   AA

    AT   CG   AA

    TT   CG   AA

    AA   GG   AA

    AT   GG   AA

    TT   GG   AA

I then have another file with a table containing weights and frequencies for the three SNPs, such as below:

SNP1             A       T       1.25    0.223143551314     0.97273 

SNP2             C       G       1.07    0.0676586484738    0.3     

SNP3             T       A       1.08    0.0769610411361    0.1136

The columns are the SNP IDs, risk allele, reference allele, OR, log(OR), and population frequency. The weights are for the risk allele.

The main script takes these two files, and computes a score, based on the sum of log odds ratios for each risk allele in each SNP for each genetic profile, as well as a frequency based on multiplying the allele frequencies, assuming Hardy Weinberg equilibrium.

import sys

snp={}
riskall={}
weights={}
freqs={}    # effect allele, *MAY NOT BE MINOR ALLELE

pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)

# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        snp[pos]=(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        freqs[line.split()[0]]=line.split()[pop]

        pos+=1

### compute scores for each combination
with open(sys.argv[2], 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            rsid=snp[j]
            riskallele=riskall[rsid]
            frequency=freqs[rsid]
            wei=weights[rsid]
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskallele:      # homozygous for ref
                score+=0
                freq*=(1-float(frequency))*(1-float(frequency))
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!
                score+=float(wei)
                freq*=2*(1-float(frequency))*(float(frequency))
            elif allele1 == riskallele: # and allele2 == riskall[snp[j]]:      # homozygous for risk, be sure to limit risk to second allele!
                score+=2*float(wei)
                freq*=float(frequency)*float(frequency)

            if freq < float(sys.argv[3]):   # threshold to stop loop in interest of efficiency
                break

        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))

I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low. What improvements can be done to speed up the script?

I have tried using Pandas, which is still much slower, as I am not sure if vectorization is possible in this case. I have issues installing Dask on my Unix server. I have also made sure to use only Python dictionaries and not lists, and this gave a slight improvement.

The expected output from the above would be as such:

GG,AA,GG        0       0.000286302968304

GG,AA,GA        0.0769610411361 7.33845153414e-05

GG,AA,AA        0.153922082272  4.70243735491e-06

GG,AG,GG        0.0676586484738 0.00024540254426

GG,AG,GA        0.14461968961   6.29010131498e-05

GG,AG,AA        0.221580730746  4.03066058992e-06

GG,GG,GG        0.135317296948  5.25862594844e-05

GG,GG,GA        0.212278338084  1.34787885321e-05

GG,GG,AA        0.28923937922   8.63712983555e-07

GA,AA,GG        0.223143551314  0.0204250448374

GA,AA,GA        0.30010459245   0.00523530030129

GA,AA,AA        0.377065633586  0.000335475019306

GA,AG,GG        0.290802199788  0.0175071812892

GA,AG,GA        0.367763240924  0.00448740025824

GA,AG,AA        0.44472428206   0.000287550016548

GA,GG,GG        0.358460848262  0.00375153884769

GA,GG,GA        0.435421889398  0.000961585769624

GA,GG,AA        0.512382930534  6.16178606889e-05

AA,AA,GG        0.446287102628  0.364284082594

AA,AA,GA        0.523248143764  0.0933724543834

AA,AA,AA        0.6002091849    0.00598325294334

AA,AG,GG        0.513945751102  0.312243499367

AA,AG,GA        0.590906792238  0.0800335323286

AA,AG,AA        0.667867833374  0.00512850252286

AA,GG,GG        0.581604399576  0.0669093212928

AA,GG,GA        0.658565440712  0.0171500426418

AA,GG,AA        0.735526481848  0.00109896482633
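As a sanity check on the table above, here is a minimal sketch of the Hardy-Weinberg arithmetic, using only the log(OR) and frequency columns from the SNP table. The helper name `hwe_genotype_freqs` is made up for this illustration. The first value printed should agree with the frequency in the first row (no risk alleles at any SNP), and the second line with the score and frequency in the last row (two risk alleles at every SNP), up to rounding.

```python
def hwe_genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for a risk-allele frequency p.

    Returns (homozygous reference, heterozygous, homozygous risk).
    """
    q = 1.0 - p
    return q * q, 2.0 * p * q, p * p


# log(OR) and population frequency columns from the SNP table in the question
log_or = [0.223143551314, 0.0676586484738, 0.0769610411361]
risk_freq = [0.97273, 0.3, 0.1136]

# Profile with no risk alleles: score 0, product of the homozygous-reference terms
freq_ref = 1.0
for p in risk_freq:
    freq_ref *= hwe_genotype_freqs(p)[0]

# Profile with two risk alleles everywhere: twice each weight, product of the p*p terms
score_risk = sum(2.0 * w for w in log_or)
freq_risk = 1.0
for p in risk_freq:
    freq_risk *= hwe_genotype_freqs(p)[2]

print(freq_ref)
print(score_risk, freq_risk)
```

The three genotype frequencies for one SNP always sum to 1, which is a quick way to check the helper.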

EDIT: Added previous post link, along with expected output.

asked Nov 9 at 5:11 by Volka, edited Nov 9 at 6:46

instead of writing "do checkout my post history" you should mention your post link
– Gahan
Nov 9 at 5:13

At a first glance - you are repeating .split() all over the place. If you don't change the line, it should always split the same way, so do it once and remember the result. EDIT: also, give us the command line arguments you would execute this with.
– Amadan
Nov 9 at 5:17

Could you provide some expected output?
– Alex
Nov 9 at 5:29

tried using numba ??
– Lijo Jose
Nov 9 at 6:18

@Amadan I see, do you mean it would be better to assign the line split to a variable and only use that variable from there onwards? In terms of command line arguments, it would be like such: python myscript.py table2.txt table1.txt 1e-5 1
– Volka
Nov 9 at 6:41


python performance loops


1 Answer (accepted)

Disclaimer: I did not test this; treat it as pseudo-code.

Here are some general ideas about what is slow and what is fast in programming, and in Python in particular:

Try to move everything that does not change inside a loop out of that loop.
Also, in Python, try to replace loops with comprehensions:
https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

[ expression for item in list if conditional ]
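A small illustration of both points, under made-up inputs: the three lines mimic rows of the profile file, and `threshold_arg` stands in for `sys.argv[3]`. The conversion that never changes is done once before any looping, and the comprehension follows the template above, including the optional condition.

```python
lines = ["AA CC TT", "AT CG TA", "TT GG AA"]  # rows in the style of the profile file
threshold_arg = "1e-5"                        # stands in for sys.argv[3]

# Loop-invariant work: convert the threshold once, not on every iteration.
threshold = float(threshold_arg)

# A comprehension replaces an explicit append loop; each line is split exactly once.
profiles = [line.split() for line in lines]

# The optional condition filters items: keep profiles whose first genotype is heterozygous.
heterozygous_first = [p for p in profiles if p[0][0] != p[0][1]]

print(profiles)
print(heterozygous_first)
```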

You should also try to use the map/filter functions where possible, and you can prepare your data so that the program is more efficient. This lookup

    rsid=snp[j]

    riskallele=riskall[rsid]

is basically a double mapping, and it can probably be done better if you build your snp structure like this (if the frequency you want is in the last column, you can use index -1 and get rid of pop):

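with open(sys.argv[1], 'r') as f:
    # column 4 is log(OR), i.e. the weight; the last column is the frequency
    snp = [{"riskall": line[1], "weight": float(line[4]), "freq": float(line[-1])}
           for line in map(str.split, f)]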
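To try the structure above without any files, here is a sketch that reads the three sample rows from a string. It keeps the question's population offset instead of index -1, so tables with several frequency columns still work. The function name `load_snps` and its `population` parameter are assumptions made for this example, not part of the original script.

```python
import io

# Sample rows from the question: ID, risk allele, reference allele, OR, log(OR), frequency
TABLE = """\
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
"""


def load_snps(table_file, population=1):
    """Return one dict per SNP, in file order, with numbers already converted to float."""
    freq_col = 4 + population  # same offset as `pop` in the question's script
    return [
        {"riskall": fields[1], "weight": float(fields[4]), "freq": float(fields[freq_col])}
        for fields in map(str.split, table_file)
    ]


snps = load_snps(io.StringIO(TABLE))
print(snps[0])
```

Converting to float here means the scoring loop never has to call `float()` again.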

and your computing loop can become something like this:

### compute scores for each combination
# assumes `import sys` and the `snp` list built above
stop = float(sys.argv[3])  # convert the threshold once, outside the loops
with open(sys.argv[2], 'r') as f:
    for fline in f:
        score = 0.0  # work with floats from the start
        freq = 1.0
        line = fline.split()  # do it only once

        for j, field in enumerate(line):
            s = snp[j]
            riskallele = s["riskall"]
            frequency = s["freq"]
            wei = s["weight"]
            allele1, allele2 = field  # a genotype is a two-character string

            if allele2 != riskallele:      # homozygous for ref
                freq *= (1-frequency)*(1-frequency)
            elif allele1 != riskallele:    # heterozygous, be sure that A2 is risk allele!
                score += wei
                freq *= 2*(1-frequency)*frequency
            else:                          # homozygous for risk
                score += 2*wei
                freq *= frequency*frequency

            if freq < stop:   # threshold to stop loop in interest of efficiency
                break

        print(','.join(line) + "\t" + str(score) + "\t" + str(freq))

The ultimate goal I would aim for is to convert it to some map/reduce form:

A genotype can be any of the 16 combinations [A,C,G,T][A,C,G,T], and it is tested against one of the 4 possible risk alleles [A,C,G,T]. That is only 64 combinations, so I can create a map of the form
[AC,C] -> score,freq_function and get rid of the whole if block.
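One way the lookup idea could look in practice is sketched below. Instead of a single 64-entry map, it builds one small dictionary per SNP that maps each genotype string directly to its (score, frequency) pair, so the inner loop has no if block at all. The names `build_lookup` and `score_profile` are invented for this sketch. Note one assumption that differs from the original script: both allele orders (e.g. AT and TA) are treated as heterozygous here, whereas the script only checks the second allele.

```python
import io

# Sample rows from the question: ID, risk allele, reference allele, OR, log(OR), frequency
TABLE = """\
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
"""


def build_lookup(table_file):
    """Return one dict per SNP mapping a genotype string to (score, frequency)."""
    lookups = []
    for fields in map(str.split, table_file):
        risk, ref = fields[1], fields[2]
        weight, p = float(fields[4]), float(fields[-1])
        q = 1.0 - p
        lookups.append({
            ref + ref: (0.0, q * q),               # homozygous reference
            ref + risk: (weight, 2.0 * p * q),     # heterozygous
            risk + ref: (weight, 2.0 * p * q),     # assumption: allele order is ignored
            risk + risk: (2.0 * weight, p * p),    # homozygous risk
        })
    return lookups


def score_profile(genotypes, lookups, threshold=0.0):
    """Sum the scores and multiply the frequencies, stopping early below the threshold."""
    score, freq = 0.0, 1.0
    for genotype, lookup in zip(genotypes, lookups):
        s, f = lookup[genotype]
        score += s
        freq *= f
        if freq < threshold:
            break
    return score, freq


lookups = build_lookup(io.StringIO(TABLE))
print(score_profile("TT GG AA".split(), lookups))  # no risk alleles at any SNP
print(score_profile("AA CC TT".split(), lookups))  # two risk alleles at every SNP
```

All the float conversions and Hardy-Weinberg products are paid for once per SNP rather than once per profile, which is where most of the saving would come from.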

Sometimes the best approach is to split the code into small functions, reorganize them, and then merge them back.

answered Nov 9 at 11:12 by petrch, edited Nov 16 at 13:02

One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13
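A rough sketch of what merging the two scripts could look like: the genotype combinations are generated in memory with `itertools.product` from the risk and reference alleles, so no intermediate file is needed. The allele pairs are copied from the question's SNP table, and the helper `genotypes` is hypothetical. The row order differs from the sample file, because `product` varies the last SNP fastest.

```python
from itertools import product

# (risk allele, reference allele) per SNP, taken from the question's table
alleles = [("A", "T"), ("C", "G"), ("T", "A")]


def genotypes(risk, ref):
    """The three genotypes of one SNP, written as in the sample file (e.g. AA, AT, TT)."""
    return [risk + risk, risk + ref, ref + ref]


per_snp = [genotypes(risk, ref) for risk, ref in alleles]

# product() yields every combination lazily; list() is used here only to count them.
profiles = list(product(*per_snp))

print(len(profiles))
print(profiles[0], profiles[-1])
```

For many SNPs it would be better to iterate over `product(*per_snp)` directly rather than building the list, since the number of profiles grows as 3 to the power of the SNP count.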

Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31

Good to hear that. It always depends on how much time you want to invest and what running time is "good enough". If your input is large (e.g. 1 GB or more), I/O may start to be an issue as well. Read, for example, rabexc.org/posts/io-performance-in-python. You want to move as much code as possible out of the for loop, so anything you can precalculate may help. The print() call itself is a bit slow. If you only care about some of the results, print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
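As a sketch of the buffered-output suggestion, the rows can be formatted first and written with a single call instead of one print() per profile. The function name `write_results` and the sample rows are assumptions for this example; whether it helps in practice depends on the input size and should be measured.

```python
import io
import sys


def write_results(rows, out=sys.stdout):
    """Format every (genotypes, score, freq) row, then write them all with one call."""
    out.write("".join(
        "{}\t{}\t{}\n".format(",".join(genos), score, freq)
        for genos, score, freq in rows
    ))


# Two rows in the style of the question's expected output
rows = [
    (["GG", "AA", "GG"], 0.0, 0.000286302968304),
    (["AA", "GG", "AA"], 0.735526481848, 0.00109896482633),
]

buffer = io.StringIO()       # stands in for a file opened for writing
write_results(rows, buffer)
print(buffer.getvalue(), end="")
```

For very large result sets, writing in chunks of a few thousand rows would keep memory use bounded while still avoiding a call per line.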

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53220198%2fhow-can-i-improve-my-looping-python-script-involving-different-mathematical-ope%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

Disclaimer: I did not test this, it is rather a pseudo-code.

I provide some general ideas about what is slow/fast in programming and particularly in python:

[ expression for item in list if conditional ]

you should try to use map/filter functions if possible and you also can prepare your data so that the program is more efficient

    rsid=snp[j]

    riskallele=riskall[rsid]

is basically a double mapping and it can possibly be done better if you can create your snp structure like this (you can use -1 index for the last column and get rid of pop):

snp = [{"riskall": line[1],"freq": float(line[4]),"weight": float(line[-1])} 

         for line in map(split,f)]

and your computing loop can become something like this:

### compute scores for each combination

stop = sys.argv[3]

with open(sys.argv[2], 'r') as f:

    for fline in f:

        score=0.0 # work with floats from the start

        freq=1.0

        line = fline.split() # do it only once



        for j,field in line:

            s=snp[j]

            riskallele=s["riskall"]

            frequency=s["freq"]

            wei=s["weight"]

            (allele1,allele2) = line[j]



            if allele2 != riskallele:      # homozygous for ref

                score+=0

                freq*=(1-frequency)*(1-frequency)

            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!

                score+=wei

                freq*=2*(1-frequency)*frequency

            elif allele1 == riskallele: # and allele2 == riskallele:      # homozygous for risk, be sure to limit risk to second allele!

                score+=2*wei

                freq*=frequency*frequency



            if freq < stop):   # threshold to stop loop in interest of efficiency 

                break



        print(','.join(line.split()) + "t" + str(score) + "t" + str(freq))

The ultimate goal I would try to achieve is to convert it to some map/reduce form:

Sometimes the best approach is to split the code to small functions, reorganize and then merge back.

edited Nov 16 at 13:02

answered Nov 9 at 11:12

petrch

31626

One more thing I noticed: " This table was generated using itertool's product in another script:" - I think in that case I would also merge the scripts because you can easily generate it inside this one from the SNP file and you don't need the generating script.
– petrch
Nov 9 at 14:13

Thanks! The map form is really helpful and improved my script runtimes. I'm not too sure how to apply that to the if block, and ended up incorporating the respective weights and frequencies into the initial dictionary instead.
– Volka
Nov 16 at 7:31
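One possible reading of "incorporating the respective weights and frequencies into the initial dictionary" is a per-SNP lookup table keyed by genotype, so that the if block becomes a single dictionary access. The following is only a sketch under that assumption: build_tables and score_profile are hypothetical names, and the "ref" key is an addition that the snp structure in the answer does not have.

```python
def build_tables(snps):
    """Precompute (score, frequency) for every genotype of every SNP."""
    tables = []
    for s in snps:
        risk, ref, p, w = s["riskall"], s["ref"], s["freq"], s["weight"]
        het = (w, 2 * p * (1 - p))
        tables.append({
            ref + ref: (0.0, (1 - p) ** 2),   # homozygous for ref
            ref + risk: het,                  # heterozygous, either order
            risk + ref: het,
            risk + risk: (2 * w, p * p),      # homozygous for risk
        })
    return tables


def score_profile(tables, genotypes):
    """Sum the precomputed scores and multiply the precomputed frequencies."""
    score, freq = 0.0, 1.0
    for table, genotype in zip(tables, genotypes):
        s, f = table[genotype]
        score += s
        freq *= f
    return score, freq
```

The tables are built once, before the main loop, so the per-line work is reduced to lookups, one addition and one multiplication per SNP.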

Good to hear that. It always depends on how much time you want to invest and what running time is "good enough". If your input is large (e.g. 1 GB or more), the IO may start to be an issue as well. Read for example this: rabexc.org/posts/io-performance-in-python. You want to move as much code as possible out of the for loop, so if you can precalculate anything, then it may help. The "print()" itself is a bit slow. If you care about only some of the results, then print only those. Or use buffered file writes and read the results from a file later.
– petrch
Nov 16 at 13:22
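A small sketch of the last two suggestions, writing only the results that pass a threshold to a file-like object instead of calling print() for every profile. The function name write_results and the min_freq parameter are hypothetical, and io.StringIO stands in for a real file here. With a real file, open() already buffers writes by default, and its buffering argument can request a larger buffer.

```python
import io


def write_results(rows, out, min_freq=0.0):
    """Write profile, score and frequency as tab-separated lines; return how many were kept."""
    kept = 0
    for profile, score, freq in rows:
        if freq >= min_freq:  # skip profiles that are too rare to be of interest
            out.write(f"{','.join(profile)}\t{score}\t{freq}\n")
            kept += 1
    return kept


rows = [(("AA", "CC"), 1.5, 0.25), (("TT", "GG"), 0.0, 0.001)]
buffer = io.StringIO()
kept = write_results(rows, buffer, min_freq=0.01)
```

In the real script the second argument would be something like a file opened with open("results.tsv", "w"), where the file name is only an example.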

