I have two sequences, for example:

Seq 1: MAT--LA-B
Seq 2: MATATLAB

Is it possible in Python to compare the two sequences and then insert the missing portion into sequence 1 without changing the rest of sequence 1, i.e., so that the final sequence 1 is MATAT--LA-B?

The insertion could be needed at more than one position. (I have a multiple sequence alignment in which parts of the sequences were discarded, and I want to re-insert those portions.)

#### #0

I'd suggest starting with the opcodes for transforming one sequence into the other, which you can get from difflib.SequenceMatcher.get_opcodes. Each opcode is a tuple holding a tag (insert, delete, replace, or equal) and the start/stop indices in both sequences where that change must occur. One likely problem: SequenceMatcher is a heuristic matcher, and when several matches are equally good it prefers the left-most one, which could yield an unwanted result in your case. You can always write your own opcode handler function. I notice that in the example the desired result seems to need right-most matches to take precedence, so it might be obtainable with the normal opcodes by simply reversing both strings before running SequenceMatcher (and reversing the result back afterwards). Just a thought.
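To make the opcode idea concrete, here is a minimal sketch of one possible handler. The function name `reinsert_with_opcodes` and its `gap` parameter are made up for illustration, and it assumes the full sequence contains no gap characters and that the letters of the gapped sequence appear in it in order. For every non-equal opcode it emits the letters from the full sequence first and then any hyphens from the gapped one, so no reversing is needed for this example:

```python
from difflib import SequenceMatcher

def reinsert_with_opcodes(gapped, full, gap='-'):
    """Hypothetical helper: rebuild `gapped` so that it contains every letter of `full`."""
    # autojunk=False switches off difflib's "popular element" heuristic,
    # which would otherwise kick in for sequences of 200+ characters.
    matcher = SequenceMatcher(None, gapped, full, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            # Identical stretch: copy it unchanged.
            out.append(gapped[i1:i2])
        else:
            # 'replace', 'delete' or 'insert': take the letters from `full`,
            # then keep however many gap characters `gapped` had here.
            out.append(full[j1:j2])
            out.append(gap * gapped[i1:i2].count(gap))
    return ''.join(out)

print(reinsert_with_opcodes('MAT--LA-B', 'MATATLAB'))  # MATAT--LA-B
```

Because SequenceMatcher does not guarantee a minimal edit sequence, the gap positions could shift on harder inputs, so it is worth checking the output against the original alignment.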

#### #1

This is a bit less general than the preceding answer, but it looked like an interesting problem, so I figured I'd try it anyway:

    import re

    def find_start_of(needle, haystack):
        """
        @param needle    Search on first char of string
        @param haystack  Longer string to search in

        Look for first char of needle in haystack; return offset
        """

        if needle == '':
            return 0

        offs = haystack.find(needle[0])
        if offs == -1:
            return len(haystack)
        else:
            return offs

    def find_end_of(lst, letterset):
        """
        @param lst       Chars to search for
        @param letterset String to search through

        lst contains some chars of letterset in order;
        Return offset in letterset of last char of lst
        """

        offs = 0
        for ch in lst:
            t = letterset.find(ch, offs)

            if t == -1:
                raise ValueError('letterset (%s) is not an ordered superset of lst (%s)' % (letterset, lst))
            else:
                offs = t + 1

        return offs - 1

    def alignSeq(s1, s2):
        """
        @param s1 A string consisting of letters and hyphens
        @param s2 A string containing only letters

        The letters in s1 are an in-sequence subset of s2

        Returns s1 with the missing letters from s2 inserted
        in-sequence and greedily preceding hyphens.
        """

        # break s1 into letter-chunks and hyphen-chunks
        r = r'([^-]*)([-]*)'    # string of letters followed by string of hyphens
        seq = re.findall(r, s1) # break string into list of tuples
        seq = seq[:-1]          # discard final empty pair
        # eg: "MAT--LA-B" becomes [('MAT', '--'), ('LA', '-'), ('B', '')]

        # find start of corresponding letter-chunks in s2
        offs = 0
        chunkstart = []
        for letters, hyphens in seq:
            offs += find_start_of(letters, s2[offs:])
            chunkstart.append(offs)
            offs += find_end_of(letters, s2[offs:]) + 1

        # get end+1 for each letter-chunk
        chunkend = chunkstart[1:] + [len(s2)]
        # get replacement letter-chunks
        chunks = [s2[st:en] for st, en in zip(chunkstart, chunkend)]

        # do replacement for each chunk
        outp = [c + s[1] for c, s in zip(chunks, seq)]

        return ''.join(outp)

Then

    alignSeq('MAT--LA-B', 'MATATLAB')

returns

    'MATAT--LA-B'
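Since the real input is a multiple sequence alignment with many rows, it may be worth sanity-checking each restored row. The sketch below is only an illustration; `is_subsequence` and `check_reinsertion` are hypothetical helper names, not part of the code above. It checks that the restored row spells out the full sequence once the hyphens are removed, that no hyphens were added or lost, and that the original row survives unchanged apart from the insertions:

```python
def is_subsequence(small, big):
    """True if the characters of `small` occur in `big` in the same order."""
    it = iter(big)
    # `ch in it` consumes the iterator up to the match, so order is enforced.
    return all(ch in it for ch in small)

def check_reinsertion(original, restored, full, gap='-'):
    """Hypothetical check that `restored` is `original` plus the missing letters of `full`."""
    same_letters = restored.replace(gap, '') == full          # nothing missing, nothing extra
    same_gaps = restored.count(gap) == original.count(gap)    # hyphens preserved
    kept_original = is_subsequence(original, restored)        # only insertions happened
    return same_letters and same_gaps and kept_original

print(check_reinsertion('MAT--LA-B', 'MATAT--LA-B', 'MATATLAB'))  # True
print(check_reinsertion('MAT--LA-B', 'MATATLAB', 'MATATLAB'))     # False
```

Note that this only validates consistency; it does not check that the inserted letters landed before the hyphens rather than after them.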
