CBMS-2024A-12

题目来源Question 12 日期:2024-07-31 题目主题:CS-算法-动态规划-序列比对

解题思路

本题涉及两个序列的全局和局部比对问题。题目给出了全局比对的迭代公式,并要求推导出相关的公式和算法。序列比对中,常用的评分包括匹配分、错配分和插入/删除(gap)的罚分。罚分由 gap opening penalty 和 gap extension penalty 组成。

反向互补序列是 DNA 双链结构中的一个重要概念。在 DNA 中,A 与 T 配对,C 与 G 配对,两条链的方向相反。因此,一条链的序列可以决定另一条链的序列。这个概念在本题的后半部分起到了关键作用。

Solution

1. Global Alignment with Affine Gap Penalty

1-1: Formula for the Penalty of a Gap of Length

Let’s denote the penalty for a gap of length as . From the given equations, we can see that:

  • Opening a gap costs
  • Extending a gap costs for each additional position

Therefore, the formula for the penalty of a gap of length is:

Note: This is known as an affine gap penalty model.

1-2: Initial Values for and

Given the initial conditions:

  • ,
  • for
  • for

We need to determine and .

For :

represents a gap in sequence at the beginning. According to the recurrence relation:

Therefore, .

For :

represents a gap in sequence at the beginning. It’s symmetrical to :

Hence, the initial values of and are as follows:

1-3: Iterative Equations for Local Alignment

To compute the local alignment, the iterative equations are modified to allow for the possibility of starting a new alignment, indicated by a score of 0:

1-4: Obtaining a Local Alignment with Maximum Score

To obtain a local alignment with the maximum score:

  1. Initialize all cells in the first row and column to 0.
  2. Fill the dynamic programming matrix using the equations from (1-3).
  3. Find the cell with the maximum score in the entire matrix.
  4. Perform a traceback from until reaching a cell with score 0 or the matrix boundary.
  5. The path of this traceback gives the optimal local alignment.

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

The algorithm scans a sequence for reverse complementary matches. It uses a matrix to record the length of matching substrings that are reverse complements. If the length of the match exceeds a threshold , the algorithm reports the corresponding subsequences.

Specifically:

  • stores the length of the reverse complementary match ending at and .
  • The algorithm compares with the complement of for from down to .
  • If a match is found, it extends the previous match () by 1.
  • If the length of the match () is at least , it reports the corresponding ranges.

The reported ranges and represent the start and end positions of reverse complementary subsequences of length at least .

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

The algorithm to find the maximum reverse complementary alignment score is as follows:

  1. Initialize a dynamic programming matrix dp where dp[i][j] represents the maximum reverse complementary alignment score ending at positions and .
  2. Fill the matrix using a modified Smith-Waterman algorithm, considering reverse complementary matches.
  3. Keep track of the maximum score and its position.
  4. Perform a traceback from the position of the maximum score to reconstruct the aligned subsequences.
  5. Return the pair of subsequences with the maximum reverse complementary alignment score.

The expected time complexity of this algorithm is , where is the length of the sequence . The expected space complexity is also due to the dynamic programming matrix.

Code Implementation

def max_reverse_complementary_alignment(x):
    m = len(x)
    # Initialize the dynamic programming matrix
    dp = [[0 for * in range(m+1)] for * in range(m+1)]
    max_score = 0
    max_pos = (0, 0)
    
    # Define complementary base pairs
    def comp(a):
        return {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}[a]
    
    # Define scoring function
    def s(a, b):
        return 1 if comp(a) == b else -1
    
    # Fill the dynamic programming matrix
    for i in range(1, m+1):
        for j in range(m, 0, -1):  # Note: reverse order, as we're looking for reverse complements
            match = dp[i-1][j+1] + s(x[i-1], x[j-1])
            delete = dp[i-1][j] - 1
            insert = dp[i][j+1] - 1
            dp[i][j] = max(0, match, delete, insert)
            if dp[i][j] > max_score:
                max_score = dp[i][j]
                max_pos = (i, j)
    
    # Traceback process, reconstruct optimal alignment
    i, j = max_pos
    seq1, seq2 = [], []
    while dp[i][j] > 0:
        if dp[i][j] == dp[i-1][j+1] + s(x[i-1], x[j-1]):
            seq1.append(x[i-1])
            seq2.append(x[j-1])
            i -= 1
            j += 1
        elif dp[i][j] == dp[i-1][j] - 1:
            seq1.append(x[i-1])
            seq2.append('-')
            i -= 1
        elif dp[i][j] == dp[i][j+1] - 1:
            seq1.append('-')
            seq2.append(x[j-1])
            j += 1
    return ''.join(reversed(seq1)), ''.join(seq2)

知识点

动态规划序列比对生物信息学字符串算法局部比对全局比对反向互补序列

难点思路

这道题目的难点主要在于理解和设计反向互补序列的比对算法。我们需要修改传统的局部比对算法 (Smith-Waterman 算法) 来适应这个特殊的需求。关键是要理解如何在动态规划矩阵中正确地比较序列元素,以及如何进行回溯以重构最优的子序列对。

解题技巧和信息

  1. 对于序列比对问题,通常可以考虑使用动态规划方法。
  2. 在设计动态规划算法时,要注意初始条件的设置,这往往对算法的正确性至关重要。
  3. 对于带有间隔惩罚的序列比对,通常使用仿射间隔惩罚模型 (affine gap penalty model)。
  4. 在处理 DNA 序列时,要注意互补碱基对的概念 (A-T, C-G)。
  5. 局部比对和全局比对的主要区别在于是否允许比对从序列中间开始和结束。
  6. 在处理反向互补序列时,可以通过逆序遍历一个序列来模拟反向操作,同时使用互补碱基对的映射来处理互补关系。

重点词汇

  • global alignment 全局比对
  • local alignment 局部比对
  • affine gap penalty 仿射间隔惩罚
  • reverse complementary 反向互补
  • dynamic programming 动态规划
  • traceback 回溯
  • subsequence 子序列
  • palindromic sequence 回文序列
  • nucleotide 核苷酸
  • base pair 碱基对
  • DNA strand DNA 链
  • complementary base pairing 互补碱基配对

参考资料

  1. Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press. Chapter 2-3.
  2. Gusfield, D. (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge university press. Chapter 11-12.