Numerical Coding Of Mutated Residues And Positions
I'm writing a Python program which has to compute a numerical coding of mutated residues and positions for a set of strings. These strings are protein sequences. These sequences are s
Solution 1:
If you are to work with tabular data, consider pandas:
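import pandas as pd
data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
# One row per sequence, one column per position
df = pd.DataFrame([list(row) for row in data.split(',')])
# One 0/1 column per (position, residue) pair
coded = pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
                      for col in df.columns for val in sorted(set(df[col]))})
print(coded)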
output:
   0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1
If you want to drop the columns with all ones, where coded holds the 0/1 DataFrame built above:
print(coded.loc[:, ~coded.all()])
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
Solution 2:
Something like this?
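ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
# One set of (position, residue) pairs per sequence; positions start at 1
pos = [set(enumerate(x, 1)) for x in ls]
# Every pair seen in any sequence, sorted by position and then residue
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))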
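The loop above prints a column for every (position, residue) pair, including positions where all sequences agree. A minimal sketch of one way to keep only the mutated positions, in the same pure-Python style; the name encode_mutations is made up for illustration, and the sequences are assumed to have equal length:

```python
def encode_mutations(seqs):
    """Return (header, rows) for a 0/1 coding of the variable positions.

    Hypothetical helper, not part of the original answer. Positions are
    1-based, matching enumerate(x, 1), and all sequences are assumed to
    have the same length.
    """
    pos = [set(enumerate(s, 1)) for s in seqs]
    # Pairs present in every sequence are not mutations, so drop them.
    shared = set.intersection(*pos)
    keys = sorted(set().union(*pos) - shared)
    header = [str(i) + aa for i, aa in keys]
    rows = [[1 if key in p else 0 for key in keys] for p in pos]
    return header, rows


header, rows = encode_mutations('MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(','))
print('\t'.join(header))
for row in rows:
    print('\t'.join(map(str, row)))
```

Returning lists rather than printing directly makes it easier to feed the coding into other code, at the cost of a few extra lines.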
Solution 3:
# Parse the file
protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"
proteins = protein_sequence.split(",")
# For each protein sequence remove the duplicates
proteins = ["".join(set(x)) for x in proteins]
# Create result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    # Split the protein into amino acids
    components = list(protein)
    # Note: this flags a residue as present anywhere in the sequence, not at a given position
    for amino_acid in components:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))
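The key_set above is typed in by hand. One possible way to derive the residues seen at mutated positions directly from the data is to walk the sequences column by column with zip. A rough sketch, assuming equal-length sequences and using the shorter example data from the other answers; variable_residues is an illustrative name, not something from the original answer:

```python
def variable_residues(seqs):
    """Map each 0-based variable position to its sorted residues.

    Hypothetical helper; assumes all sequences have the same length.
    """
    variable = {}
    # zip(*seqs) yields one tuple per position, holding that position's
    # residue from every sequence.
    for i, column in enumerate(zip(*seqs)):
        residues = set(column)
        if len(residues) > 1:  # more than one residue means a mutation
            variable[i] = sorted(residues)
    return variable


seqs = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
print(variable_residues(seqs))
# {1: ['K', 'T'], 2: ['A', 'S'], 4: ['D', 'E', 'H']}
```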
Solution 4:
You can use the pandas function get_dummies to do most of the hard work:
In [11]: s # a pandas Series (DataFrame's column)
Out[11]:
0    T
1    T
2    T
3    T
4    K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0
To put your data into a DataFrame you could use:
df = pd.DataFrame([list(seq) for seq in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
In [20]: df
Out[20]:
   0  1  2  3  4  5
0  M  T  A  Q  D  D
1  M  T  A  Q  D  D
2  M  T  S  Q  E  D
3  M  T  A  Q  D  D
4  M  K  A  Q  H  D
And to find those columns which have differing values:
In [21]: (df.iloc[0] != df).any()
Out[21]:
0    False
1     True
2     True
3    False
4     True
5    False
Putting this all together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = (pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I)
In [33]: df[[]].join(J)
Out[33]:
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
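For reference, the steps in this answer can probably be folded into a single function on a recent pandas. This is only a sketch under a few assumptions: mutation_dummies is a made-up name, variable positions are picked with nunique rather than by comparing against the first row, and dtype=int is passed because newer pandas versions return boolean dummies by default:

```python
import pandas as pd


def mutation_dummies(seqs):
    """0/1 indicator columns for positions where the sequences differ.

    Hypothetical helper; assumes equal-length sequences.
    """
    df = pd.DataFrame([list(s) for s in seqs])
    # Use string column labels so they can serve as dummy prefixes.
    df.columns = [str(c) for c in df.columns]
    # Keep only positions that show more than one residue.
    variable = df.loc[:, df.nunique() > 1]
    return pd.get_dummies(variable, prefix_sep='', dtype=int)


coded = mutation_dummies('MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(','))
print(coded)
```

Column labels here are 0-based positions, as in the session above; add 1 when building the DataFrame columns if 1-based residue numbering is wanted.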