序列比对方法: 全局比对和局部比对 Sequence Alignment Methods: Global and Local Alignment

全局比对 Global Alignment

概述 Overview

全局比对是一种序列比对方法,旨在对比两个序列的整体,并在整个序列范围内寻找最佳匹配。这种方法常用于生物信息学中的 DNA、RNA 或蛋白质序列比对。全局比对的目标是最大化匹配并最小化差异,通过插入空格(gaps)来调整序列,使得它们能够对齐。

Global alignment is a sequence alignment method that aims to compare two sequences in their entirety and find the best possible match over the entire length of the sequences. This method is commonly used in bioinformatics for DNA, RNA, or protein sequence alignment. The goal of global alignment is to maximize matches and minimize differences by introducing gaps to adjust the sequences for optimal alignment.

算法介绍 Algorithm Introduction

Needleman-Wunsch 算法 Needleman-Wunsch Algorithm

Needleman-Wunsch 算法是用于全局比对的经典算法。它使用动态规划来构建比对矩阵,并通过回溯路径来确定最佳比对。

The Needleman-Wunsch algorithm is a classic algorithm used for global alignment. It employs dynamic programming to construct an alignment matrix and determines the optimal alignment through traceback.

动态规划 Dynamic Programming

动态规划用于计算两个序列的全局比对得分矩阵 . 设两个序列分别为 ,长度分别为 。初始化 矩阵:

Dynamic programming is used to compute the global alignment score matrix . Let the two sequences be and with lengths and respectively. Initialize the matrix :

然后使用递推公式填充矩阵:

Then fill the matrix using the recurrence relation:

回溯 Traceback

从矩阵 开始回溯,确定最佳比对路径。根据得分矩阵的值,选择对应的路径(对角线、上方或左方)。

Start traceback from to determine the optimal alignment path. Based on the score matrix values, choose the corresponding path (diagonal, up, or left).

代码实现 Code Implementation

def needleman_wunsch(seq1, seq2, match_score=1, mismatch_penalty=-1, gap_penalty=-2):
    m, n = len(seq1), len(seq2)
    # 初始化得分矩阵 Initialize the score matrix
    F = [[0] * (n + 1) for _ in range(m + 1)]
    
    # 初始化第一行和第一列 Initialize the first row and column
    for i in range(1, m + 1):
        F[i][0] = i * gap_penalty
    for j in range(1, n + 1):
        F[0][j] = j * gap_penalty
    
    # 填充得分矩阵 Fill the score matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = F[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty)
            delete = F[i-1][j] + gap_penalty
            insert = F[i][j-1] + gap_penalty
            F[i][j] = max(match, delete, insert)
    
    # 回溯以找到最佳比对 Traceback to find the best alignment
    align1, align2 = '', ''
    i, j = m, n
    while i > 0 and j > 0:
        score_current = F[i][j]
        score_diagonal = F[i-1][j-1]
        score_up = F[i][j-1]
        score_left = F[i-1][j]
        if score_current == score_diagonal + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty):
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
        elif score_current == score_up + gap_penalty:
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
    
    # 处理剩余的序列 Handle the remaining sequence
    while i > 0:
        align1 += seq1[i-1]
        align2 += '-'
        i -= 1
    while j > 0:
        align1 += '-'
        align2 += seq2[j-1]
        j -= 1
    
    # 反转字符串 Reverse the strings
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return align1, align2, F[m][n]
 
# 示例使用 Example Usage
seq1 = "GATTACA"
seq2 = "GCATGCU"
alignment = needleman_wunsch(seq1, seq2)
print("Aligned Sequences:\n", alignment[0], "\n", alignment[1])
print("Alignment Score:", alignment[2])

应用 Applications

全局比对主要应用于:

  • 比对全基因组序列
  • 蛋白质结构预测
  • 进化关系分析

Global alignment is mainly used in:

  • Whole genome sequence alignment
  • Protein structure prediction
  • Evolutionary relationship analysis

注意事项 Considerations

全局比对适用于长度相近且整体相似的序列。对于长度差异较大或仅局部相似的序列,局部比对(如 Smith-Waterman 算法)可能更合适。

Global alignment is suitable for sequences of similar length and overall similarity. For sequences with significant length differences or only local similarities, local alignment (e.g., Smith-Waterman algorithm) may be more appropriate.

局部比对 Local Alignment

概述 Overview

局部比对是一种序列比对方法,旨在找出两个序列中最相似的片段。与全局比对不同,局部比对不考虑整个序列的匹配,而是专注于识别高度相似的子序列。这种方法特别适用于比较长度差异较大或仅在某些区域具有相似性的序列。

Local alignment is a sequence alignment method that aims to find the most similar segments between two sequences. Unlike global alignment, local alignment does not consider matching the entire sequences but focuses on identifying highly similar subsequences. This method is particularly useful for comparing sequences of different lengths or those that share similarities only in certain regions.

Smith-Waterman 算法 Smith-Waterman Algorithm

Smith-Waterman 算法是用于局部比对的经典算法。它也使用动态规划,但与 Needleman-Wunsch 算法有一些关键区别。

The Smith-Waterman algorithm is a classic algorithm used for local alignment. It also uses dynamic programming but has some key differences from the Needleman-Wunsch algorithm.

动态规划 Dynamic Programming

设两个序列分别为 ,长度分别为 。初始化得分矩阵 :

Let the two sequences be and with lengths and respectively. Initialize the score matrix :

填充矩阵的递推公式:

The recurrence relation for filling the matrix:

关键区别在于引入了 0 作为选项,这允许局部比对在任何位置开始和结束。

The key difference is the introduction of 0 as an option, which allows the local alignment to start and end at any position.

回溯 Traceback

从矩阵中的最高分数开始回溯,直到遇到 0。这确定了最佳局部比对。

Start traceback from the highest score in the matrix and continue until a 0 is encountered. This determines the best local alignment.

代码实现 Code Implementation

def smith_waterman(seq1, seq2, match_score=2, mismatch_penalty=-1, gap_penalty=-1):
    m, n = len(seq1), len(seq2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    
    # 记录最高分及其位置 Keep track of the highest score and its position
    max_score = 0
    max_pos = (0, 0)
    
    # 填充得分矩阵 Fill the score matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = H[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty)
            delete = H[i-1][j] + gap_penalty
            insert = H[i][j-1] + gap_penalty
            H[i][j] = max(0, match, delete, insert)
            
            if H[i][j] > max_score:
                max_score = H[i][j]
                max_pos = (i, j)
    
    # 回溯以找到最佳局部比对 Traceback to find the best local alignment
    align1, align2 = '', ''
    i, j = max_pos
    while H[i][j] > 0:
        score_current = H[i][j]
        score_diagonal = H[i-1][j-1]
        score_up = H[i][j-1]
        score_left = H[i-1][j]
        
        if score_current == score_diagonal + (match_score if seq1[i-1] == seq2[j-1] else mismatch_penalty):
            align1 = seq1[i-1] + align1
            align2 = seq2[j-1] + align2
            i -= 1
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1 = seq1[i-1] + align1
            align2 = '-' + align2
            i -= 1
        elif score_current == score_up + gap_penalty:
            align1 = '-' + align1
            align2 = seq2[j-1] + align2
            j -= 1
    
    return align1, align2, max_score
 
# 示例使用 Example Usage
seq1 = "ACGTACGT"
seq2 = "TACGTACG"
alignment = smith_waterman(seq1, seq2)
print("Locally Aligned Sequences:\n", alignment[0], "\n", alignment[1])
print("Alignment Score:", alignment[2])

全局比对 vs 局部比对 Global vs Local Alignment

特征 Feature全局比对 Global Alignment局部比对 Local Alignment
目标 Goal比对整个序列 Align entire sequences找出最相似的子序列 Find most similar subsequences
适用情况 Suitable for长度相近且整体相似的序列 Sequences of similar length and overall similarity长度差异大或部分相似的序列 Sequences with length differences or partial similarities
算法 AlgorithmNeedleman-WunschSmith-Waterman
得分矩阵初始化 Score matrix initialization第一行和列有惩罚 First row and column with penalties第一行和列为零 First row and column as zeros
回溯起点 Traceback start矩阵右下角 Bottom-right of matrix矩阵中的最高分 Highest score in matrix
应用 Applications全基因组比对、进化分析 Whole genome alignment, evolutionary analysis模体搜索、基因预测 Motif finding, gene prediction

注意事项 Considerations

选择全局比对还是局部比对取决于具体的生物学问题和序列特征。全局比对适用于整体相似的序列,而局部比对更适合寻找共享的功能域或保守区域。

The choice between global and local alignment depends on the specific biological problem and sequence characteristics. Global alignment is suitable for overall similar sequences, while local alignment is better for finding shared functional domains or conserved regions.

参考文献 References

  • Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.
  • Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
  • Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.