Inserting the missing parts of a sequence in Python

I have two sequences, for example:

Seq 1: MAT--LA-B
Seq 2: MATATLAB

Is it possible in Python to compare the two sequences and insert the missing portion into sequence 1 without changing the rest of it, i.e., so that the final sequence 1 is MATAT--LA-B?

The insertion may be needed at more than one position. (I have a multiple sequence alignment in which parts of the sequences were discarded, and I want to re-insert those portions.)

Thanks in advance!!

#0

I'd suggest starting by obtaining the opcodes for transforming one sequence into the other. They can be generated with difflib.SequenceMatcher.get_opcodes. Each opcode is a tuple holding a tag (replace, delete, insert, or equal) and the start/stop indices in both sequences where that change must occur to transform one sequence into the other. One likely problem is that, because of the way the SequenceMatcher algorithm picks matching blocks (longest block first, with ties going to the left-most one), left-most matches tend to get precedence over potential matches to their right, which could yield an unwanted result in your case. You can always write your own opcode handler function. I notice that in the example the answer requires right-most matches to have precedence, so it might be obtainable with the normal opcodes by simply reversing both strings before passing them to SequenceMatcher. Just a thought.
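To make the opcode idea concrete, here is a minimal sketch of a custom opcode handler. The function name `reinsert_missing` and its `gap` parameter are made up for illustration, and the sketch assumes that every part of the gapped sequence which fails to match consists only of gap characters; it raises an error rather than guessing when that does not hold. Since SequenceMatcher is not guaranteed to find the alignment you have in mind, treat it as a starting point, not a finished solution.

```python
from difflib import SequenceMatcher


def reinsert_missing(gapped, full, gap="-"):
    """Return `gapped` with the letters it lacks from `full` put back in.

    Hypothetical helper: missing letters are placed in front of any gap
    characters that occupy the same spot.
    """
    # autojunk=False switches off difflib's "popular element" heuristic,
    # which could otherwise ignore frequent letters in long sequences.
    matcher = SequenceMatcher(None, gapped, full, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(gapped[i1:i2])
            continue
        # Assumption: whatever did not match in `gapped` is made of gaps only.
        if gapped[i1:i2].strip(gap):
            raise ValueError(
                "unmatched letters %r in gapped sequence" % gapped[i1:i2]
            )
        # 'insert' and 'replace' carry the missing letters of `full`;
        # for 'delete' the slice full[j1:j2] is simply empty.
        out.append(full[j1:j2])
        # Keep the original gap characters ('replace' and 'delete').
        out.append(gapped[i1:i2])
    return "".join(out)


print(reinsert_missing("MAT--LA-B", "MATATLAB"))  # MATAT--LA-B
print(reinsert_missing("AB-D-F", "ABCDEF"))       # ABC-DE-F
```

The second call shows insertions at more than one position. For the question's example no string reversal is needed, because the hyphens in sequence 1 keep the "MAT" block anchored at the left.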

#1

This is a bit less general than the preceding answer, but it looked like an interesting problem, so I figured I'd try it anyway:

import re

def find_start_of(needle, haystack):
    """
    @param needle    Search on first char of string
    @param haystack  Longer string to search in

    Look for first char of needle in haystack; return offset
    """

    if needle=='':
        return 0

    offs = haystack.find(needle[0])
    if offs==-1:
        return len(haystack)
    else:
        return offs

def find_end_of(lst, letterset):
    """
    @param lst       Chars to search for
    @param letterset String to search through

    lst contains some chars of letterset in order;
    Return offset in letterset of last char of lst
    """

    offs = 0
    for ch in lst:
        t = letterset.find(ch, offs)

        if t==-1:
            raise ValueError('letterset (%s) is not an ordered superset of lst (%s)' % (letterset, lst))
        else:
            offs = t+1

    return offs-1

def alignSeq(s1, s2):
    """
    @param s1 A string consisting of letters and hyphens
    @param s2 A string containing only letters

    The letters in s1 are an in-sequence subset of s2

    Returns s1 with the missing letters from s2 inserted
    in-sequence and greedily preceding hyphens.
    """

    # break s1 into letter-chunks and hyphen-chunks
    r = '([^-]*)([-]*)'        # string of letters followed by string of hyphens
    seq = re.findall(r, s1) # break string into list of tuples
    seq = seq[:-1]          # discard final empty pair
    # eg: "MAT--LA-B" becomes [('MAT', '--'), ('LA', '-'), ('B', '')]

    # find start of corresponding letter-chunks in s2
    offs = 0
    chunkstart = []
    for letters,hyphens in seq:
        offs += find_start_of(letters, s2[offs:])
        chunkstart.append(offs)
        offs += find_end_of(letters, s2[offs:]) + 1

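    # the first chunk must start at 0, otherwise letters of s2 that
    # precede the first letter of s1 would be silently dropped
    if chunkstart:
        chunkstart[0] = 0
    # get end+1 for each letter-chunk
    chunkend = chunkstart[1:] + [len(s2)]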
    # get replacement letter-chunks
    chunks = [s2[st:en] for st,en in zip(chunkstart,chunkend)]

    # do replacement for each chunk
    outp = [c+s[1] for c,s in zip(chunks, seq)]

    return ''.join(outp)

Then

alignSeq('MAT--LA-B','MATATLAB')

returns

'MATAT--LA-B'
