Global Alignment 全局比对

概述 Overview

全局比对是一种序列比对方法，旨在对比两个序列的整体，并在整个序列范围内寻找最佳匹配。这种方法常用于生物信息学中的 DNA、RNA 或蛋白质序列比对。全局比对的目标是最大化匹配并最小化差异，通过插入空格（gaps）来调整序列，使得它们能够对齐。

Global alignment is a sequence alignment method that aims to compare two sequences in their entirety and find the best possible match over the entire length of the sequences. This method is commonly used in bioinformatics for DNA, RNA, or protein sequence alignment. The goal of global alignment is to maximize matches and minimize differences by introducing gaps to adjust the sequences for optimal alignment.

算法介绍 Algorithm Introduction

Needleman-Wunsch 算法 Needleman-Wunsch Algorithm

Needleman-Wunsch 算法是用于全局比对的经典算法。它使用动态规划来构建比对矩阵，并通过回溯路径来确定最佳比对。

The Needleman-Wunsch algorithm is a classic algorithm used for global alignment. It employs dynamic programming to construct an alignment matrix and determines the optimal alignment through traceback.

动态规划 Dynamic Programming

动态规划用于计算两个序列的全局比对得分矩阵 $F$ . 设两个序列分别为 $A$ 和 $B$ ，长度分别为 $m$ 和 $n$ 。初始化 $F$ 矩阵：

Dynamic programming is used to compute the global alignment score matrix $F$ . Let the two sequences be $A$ and $B$ with lengths $m$ and $n$ respectively. Initialize the matrix $F$ :

F (0, 0) = 0

F (i, 0) = i \times gap penalty, for i = 1 to m

F (0, j) = j \times gap penalty, for j = 1 to n

然后使用递推公式填充矩阵：

Then fill the matrix using the recurrence relation:

F (i, j) = max ⎩ ⎨ ⎧ F (i - 1, j - 1) + match/mismatch score (A [i], B [j]) F (i - 1, j) + gap penalty F (i, j - 1) + gap penalty

回溯 Traceback

从矩阵 $F (m, n)$ 开始回溯，确定最佳比对路径。根据得分矩阵的值，选择对应的路径（对角线、上方或左方）。

Start traceback from $F (m, n)$ to determine the optimal alignment path. Based on the score matrix values, choose the corresponding path (diagonal, up, or left).

代码实现 Code Implementation

def needleman_wunsch(seq1, seq2, match_score=1, mismatch_penalty=-1, gap_penalty=-2):
    m, n = len(seq1), len(seq2)
    # 初始化得分矩阵 Initialize the score matrix
    F = [[0] * (n + 1) for _ in range(m + 1)]
    
    # 初始化第一行和第一列 Initialize the first row and column
    for i in range(1, m + 1):
        F[i][0] = i * gap_penalty
    for j in range(1, n + 1):
        F[0][j] = j * gap_penalty
    
    # 填充得分矩阵 Fill the score matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = F[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty)
            delete = F[i-1][j] + gap_penalty
            insert = F[i][j-1] + gap_penalty
            F[i][j] = max(match, delete, insert)
    
    # 回溯以找到最佳比对 Traceback to find the best alignment
    align1, align2 = '', ''
    i, j = m, n
    while i > 0 and j > 0:
        score_current = F[i][j]
        score_diagonal = F[i-1][j-1]
        score_up = F[i][j-1]
        score_left = F[i-1][j]
        if score_current == score_diagonal + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty):
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
        elif score_current == score_up + gap_penalty:
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
    
    # 处理剩余的序列 Handle the remaining sequence
    while i > 0:
        align1 += seq1[i-1]
        align2 += '-'
        i -= 1
    while j > 0:
        align1 += '-'
        align2 += seq2[j-1]
        j -= 1
    
    # 反转字符串 Reverse the strings
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return align1, align2, F[m][n]
 
# 示例使用 Example Usage
seq1 = "GATTACA"
seq2 = "GCATGCU"
alignment = needleman_wunsch(seq1, seq2)
print("Aligned Sequences:\n", alignment[0], "\n", alignment[1])
print("Alignment Score:", alignment[2])

应用 Applications

全局比对主要应用于：

比对全基因组序列
蛋白质结构预测
进化关系分析

Global alignment is mainly used in:

Whole genome sequence alignment
Protein structure prediction
Evolutionary relationship analysis

注意事项 Considerations

全局比对适用于长度相近且整体相似的序列。对于长度差异较大或仅局部相似的序列，局部比对（如 Smith-Waterman 算法）可能更合适。

Global alignment is suitable for sequences of similar length and overall similarity. For sequences with significant length differences or only local similarities, local alignment (e.g., Smith-Waterman algorithm) may be more appropriate.

参考文献 References

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Zephyr's Notes on ISCS & CBMS, UTokyo

Explorer