
Numerical Coding Of Mutated Residues And Positions

I'm writing a Python program that has to compute a numerical coding of the mutated residues and positions in a set of strings. These strings are protein sequences. These sequences are s

Solution 1:

If you are to work with tabular data, consider pandas:

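from pandas import DataFrame

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'

# One row per sequence, one column per position (0-based)
df = DataFrame([list(row) for row in data.split(',')])

# One 0/1 column per (position, residue) pair, e.g. '1K'
encoded = DataFrame({str(col) + val: (df[col] == val).astype(int)
                     for col in df.columns for val in sorted(set(df[col]))})
print(encoded)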

output:

   0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1

If you want to drop the columns with all ones (with the 0/1 DataFrame printed above stored as encoded):

print(encoded.loc[:, ~encoded.all()])

   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1

Solution 2:

Something like this?

ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')

pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))

print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
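If only the mutated positions are of interest, the same idea can be restricted to positions where more than one residue occurs. The following is a minimal sketch for Python 3, assuming all sequences have the same length. The function name encode_variable_positions is made up for illustration and is not part of the answer above; positions are 1-based, as in the enumerate(x, 1) call.

```python
def encode_variable_positions(sequences):
    """Return (labels, rows) for positions where the sequences differ.

    Hypothetical helper for illustration. Positions are 1-based and each
    label is position + residue, e.g. '2K'.
    """
    # One set of (position, residue) pairs per sequence.
    pairs_per_seq = [set(enumerate(seq, 1)) for seq in sequences]

    # Collect the residues observed at each position.
    residues_at = {}
    for pairs in pairs_per_seq:
        for position, residue in pairs:
            residues_at.setdefault(position, set()).add(residue)

    # Keep only keys from positions that show more than one residue.
    keys = sorted(
        (position, residue)
        for position, residues in residues_at.items()
        if len(residues) > 1
        for residue in residues
    )

    labels = [str(position) + residue for position, residue in keys]
    rows = [[1 if key in pairs else 0 for key in keys] for pairs in pairs_per_seq]
    return labels, rows


sequences = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
labels, rows = encode_variable_positions(sequences)
print('\t'.join(labels))
for row in rows:
    print('\t'.join(str(bit) for bit in row))
```

Returning the labels and rows instead of printing them directly makes it easy to feed the result into a file writer or a numeric array later.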

Solution 3:

protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"
#Parse the file
proteins = protein_sequence.split(",")
#For each protein sequence remove the duplicates
proteins = map(lambda x:"".join(set(list(x))), proteins)

#Create result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    #Split the protein in amino acid
    components = list(protein)
    for amino_acid in components:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))

Solution 4:

You can use the pandas function get_dummies to do most of the hard work:

In [11]: s # a pandas Series (DataFrame's column)
Out[11]:
0    T
1    T
2    T
3    T
4    K
Name: 1

In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0

To put your data into a DataFrame you could use:

df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))

In [20]: df
Out[20]: 
   0  1  2  3  4  5
0  M  T  A  Q  D  D
1  M  T  A  Q  D  D
2  M  T  S  Q  E  D
3  M  T  A  Q  D  D
4  M  K  A  Q  H  D

And to find those columns which have differing values:

In [21]: (df.iloc[0] != df).any()
Out[21]: 
0    False
1     True
2     True
3    False
4     True
5    False

Putting this all together:

In [31]: I = df.columns[(df.iloc[0] != df).any()]

In [32]: J = (pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I)

In [33]: df[[]].join(J)
Out[33]: 
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
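The steps above can be gathered into a single function. This is a sketch for Python 3 with a recent pandas, assuming equal-length sequences. The name encode_mutations is hypothetical, and nunique is used here as a different way of finding the varying columns than the row-0 comparison. Passing dtype=int asks get_dummies for 0/1 integers, since newer pandas versions otherwise return booleans.

```python
import pandas as pd


def encode_mutations(sequences):
    """One-hot encode the positions at which the sequences differ.

    Hypothetical helper for illustration. Column labels are the 0-based
    position followed by the residue, e.g. '1K'.
    """
    # One row per sequence, one column per position.
    df = pd.DataFrame([list(seq) for seq in sequences])

    # Positions where more than one residue occurs.
    variable = [col for col in df.columns if df[col].nunique() > 1]
    if not variable:
        # Nothing varies: return an empty frame with the same row index.
        return pd.DataFrame(index=df.index)

    # One block of 0/1 indicator columns per variable position.
    dummies = [
        pd.get_dummies(df[col], prefix=str(col), prefix_sep='', dtype=int)
        for col in variable
    ]
    return pd.concat(dummies, axis=1)


sequences = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
encoded = encode_mutations(sequences)
print(encoded)
```

Converting the column name to a string before passing it as the prefix avoids relying on how get_dummies handles integer prefixes.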
