CBMS-2024A-12

题目来源：Question 12 日期：2024-07-31 题目主题：CS-算法-动态规划-序列比对

解题思路

本题涉及两个序列的全局和局部比对问题。题目给出了全局比对的迭代公式，并要求推导出相关的公式和算法。序列比对中，常用的评分包括匹配分、错配分和插入/删除（gap）的罚分。罚分由 gap opening penalty 和 gap extension penalty 组成。

反向互补序列是 DNA 双链结构中的一个重要概念。在 DNA 中,A 与 T 配对,C 与 G 配对,两条链的方向相反。因此,一条链的序列可以决定另一条链的序列。这个概念在本题的后半部分起到了关键作用。

Solution

1. Global Alignment with Affine Gap Penalty

1-1: Formula for the Penalty of a Gap of Length $k$

Let’s denote the penalty for a gap of length $k$ as $P (k)$ . From the given equations, we can see that:

Opening a gap costs $d$
Extending a gap costs $e$ for each additional position

Therefore, the formula for the penalty of a gap of length $k$ is:

P (k) = d + (k - 1) e

Note: This is known as an affine gap penalty model.

1-2: Initial Values for $X_{i, 0}$ and $Y_{0, j}$

Given the initial conditions:

$M_{0, 0} = 0$
$X_{0, 0} = - \infty$ , $Y_{0, 0} = - \infty$
$M_{0, j} = X_{0, j} = - \infty$ for $j = 1, \dots, n$
$M_{i, 0} = Y_{i, 0} = - \infty$ for $i = 1, \dots, m$

We need to determine $X_{i, 0}$ $(i = 1, \dots, m)$ and $Y_{0, j}$ $(j = 1, \dots, n)$ .

For $X_{i, 0}$ $(i = 1, \dots, m)$ :

$X_{i, 0}$ represents a gap in sequence $y$ at the beginning. According to the recurrence relation:

X_{i, 0} = max ⎩ ⎨ ⎧ M_{i - 1, 0} - d = - \infty X_{i - 1, 0} - e Y_{i - 1, 0} - d = - \infty

Therefore, $X_{i, 0} = X_{i - 1, 0} - e = - d - (i - 1) e$ .

For $Y_{0, j}$ $(j = 1, \dots, n)$ :

$Y_{0, j}$ represents a gap in sequence $x$ at the beginning. It’s symmetrical to $X_{i, 0}$ :

Y_{0, j} = - d - (j - 1) e

Hence, the initial values of $X_{i, 0}$ and $Y_{0, j}$ are as follows:

$X_{i, 0} = - d - (i - 1) e$
$Y_{0, j} = - d - (j - 1) e$

1-3: Iterative Equations for Local Alignment

To compute the local alignment, the iterative equations are modified to allow for the possibility of starting a new alignment, indicated by a score of 0:

M_{i, j} = max ⎩ ⎨ ⎧ 0 M_{i - 1, j - 1} + s (x_{i}, y_{j}) X_{i - 1, j - 1} + s (x_{i}, y_{j}) Y_{i - 1, j - 1} + s (x_{i}, y_{j})

X_{i, j} = max ⎩ ⎨ ⎧ 0 M_{i - 1, j} - d X_{i - 1, j} - e Y_{i - 1, j} - d

Y_{i, j} = max ⎩ ⎨ ⎧ 0 M_{i, j - 1} - d X_{i, j - 1} - d Y_{i, j - 1} - e

1-4: Obtaining a Local Alignment with Maximum Score

To obtain a local alignment with the maximum score:

Initialize all cells in the first row and column to 0.
Fill the dynamic programming matrix using the equations from (1-3).
Find the cell $(i^{*}, j^{*})$ with the maximum score in the entire matrix.
Perform a traceback from $(i^{*}, j^{*})$ until reaching a cell with score 0 or the matrix boundary.
The path of this traceback gives the optimal local alignment.

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

The algorithm scans a sequence $x$ for reverse complementary matches. It uses a matrix $H$ to record the length of matching substrings that are reverse complements. If the length of the match exceeds a threshold $k$ , the algorithm reports the corresponding subsequences.

Specifically:

$H_{i, j}$ stores the length of the reverse complementary match ending at $x_{i}$ and $x_{j}$ .
The algorithm compares $x_{i}$ with the complement of $x_{j}$ for $j$ from $m$ down to $i$ .
If a match is found, it extends the previous match ( $H_{i - 1, j + 1}$ ) by 1.
If the length of the match ( $H_{i - 1, j + 1}$ ) is at least $k$ , it reports the corresponding ranges.

The reported ranges $[i - H_{i - 1, j + 1} + 1, i - 1]$ and $[j + 1, j + H_{i - 1, j + 1}]$ represent the start and end positions of reverse complementary subsequences of length at least $k$ .

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

The algorithm to find the maximum reverse complementary alignment score is as follows:

Initialize a dynamic programming matrix dp where dp[i][j] represents the maximum reverse complementary alignment score ending at positions $i$ and $j$ .
Fill the matrix using a modified Smith-Waterman algorithm, considering reverse complementary matches.
Keep track of the maximum score and its position.
Perform a traceback from the position of the maximum score to reconstruct the aligned subsequences.
Return the pair of subsequences with the maximum reverse complementary alignment score.

The expected time complexity of this algorithm is $O (m^{2})$ , where $m$ is the length of the sequence $x$ . The expected space complexity is also $O (m^{2})$ due to the dynamic programming matrix.

Code Implementation

def max_reverse_complementary_alignment(x):
    m = len(x)
    # Initialize the dynamic programming matrix
    dp = [[0 for * in range(m+1)] for * in range(m+1)]
    max_score = 0
    max_pos = (0, 0)
    
    # Define complementary base pairs
    def comp(a):
        return {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}[a]
    
    # Define scoring function
    def s(a, b):
        return 1 if comp(a) == b else -1
    
    # Fill the dynamic programming matrix
    for i in range(1, m+1):
        for j in range(m, 0, -1):  # Note: reverse order, as we're looking for reverse complements
            match = dp[i-1][j+1] + s(x[i-1], x[j-1])
            delete = dp[i-1][j] - 1
            insert = dp[i][j+1] - 1
            dp[i][j] = max(0, match, delete, insert)
            if dp[i][j] > max_score:
                max_score = dp[i][j]
                max_pos = (i, j)
    
    # Traceback process, reconstruct optimal alignment
    i, j = max_pos
    seq1, seq2 = [], []
    while dp[i][j] > 0:
        if dp[i][j] == dp[i-1][j+1] + s(x[i-1], x[j-1]):
            seq1.append(x[i-1])
            seq2.append(x[j-1])
            i -= 1
            j += 1
        elif dp[i][j] == dp[i-1][j] - 1:
            seq1.append(x[i-1])
            seq2.append('-')
            i -= 1
        elif dp[i][j] == dp[i][j+1] - 1:
            seq1.append('-')
            seq2.append(x[j-1])
            j += 1
    return ''.join(reversed(seq1)), ''.join(seq2)

知识点

动态规划序列比对生物信息学字符串算法局部比对全局比对反向互补序列

难点思路

这道题目的难点主要在于理解和设计反向互补序列的比对算法。我们需要修改传统的局部比对算法 (Smith-Waterman 算法) 来适应这个特殊的需求。关键是要理解如何在动态规划矩阵中正确地比较序列元素,以及如何进行回溯以重构最优的子序列对。

解题技巧和信息

对于序列比对问题,通常可以考虑使用动态规划方法。
在设计动态规划算法时,要注意初始条件的设置,这往往对算法的正确性至关重要。
对于带有间隔惩罚的序列比对,通常使用仿射间隔惩罚模型 (affine gap penalty model)。
在处理 DNA 序列时,要注意互补碱基对的概念 (A-T, C-G)。
局部比对和全局比对的主要区别在于是否允许比对从序列中间开始和结束。
在处理反向互补序列时,可以通过逆序遍历一个序列来模拟反向操作,同时使用互补碱基对的映射来处理互补关系。

重点词汇

global alignment 全局比对
local alignment 局部比对
affine gap penalty 仿射间隔惩罚
reverse complementary 反向互补
dynamic programming 动态规划
traceback 回溯
subsequence 子序列
palindromic sequence 回文序列
nucleotide 核苷酸
base pair 碱基对
DNA strand DNA 链
complementary base pairing 互补碱基配对

参考资料

Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press. Chapter 2-3.
Gusfield, D. (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge university press. Chapter 11-12.

Zephyr's Notes on ISCS & CBMS, UTokyo

Explorer

Explorer

CBMS-2024-12

CBMS-2024A-12

解题思路

Solution

1. Global Alignment with Affine Gap Penalty

1-1: Formula for the Penalty of a Gap of Length $k$

1-2: Initial Values for $X_{i, 0}$ and $Y_{0, j}$

1-3: Iterative Equations for Local Alignment

1-4: Obtaining a Local Alignment with Maximum Score

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

Code Implementation

知识点

难点思路

解题技巧和信息

重点词汇

参考资料

Graph View

Table of Contents

Backlinks

Zephyr's Notes on ISCS & CBMS, UTokyo

Explorer

CBMS-2024-12

CBMS-2024A-12

解题思路

Solution

1. Global Alignment with Affine Gap Penalty

1-1: Formula for the Penalty of a Gap of Length k

1-2: Initial Values for Xi,0​ and Y0,j​

1-3: Iterative Equations for Local Alignment

1-4: Obtaining a Local Alignment with Maximum Score

2. Reverse Complementary Sequence Analysis

2-1: Explanation of the Algorithm

2-2: Algorithm for Maximum Reverse Complementary Alignment Score

Algorithm

Code Implementation

知识点

难点思路

解题技巧和信息

重点词汇

参考资料

Graph View

Table of Contents

Backlinks

1-1: Formula for the Penalty of a Gap of Length $k$

1-2: Initial Values for $X_{i, 0}$ and $Y_{0, j}$