CBMS-2024B-10

题目来源：Question 10 日期：2024-08-04 题目主题：CS-算法-排序

解题思路

第一问：针对第一个问题，我们的目标是按序列最左侧碱基的字母顺序对 DNA 序列进行排序。由于碱基的种类有限（只有a, c, g, t四种），我们可以使用计数排序（Counting Sort），这种排序算法的时间复杂度为 $O (N)$ 并且适用于这种小范围有限值排序的问题。具体做法是先统计每种碱基的数量，然后根据这些计数值确定每个序列在排序后的数组中的位置。最终将结果存储在新的数组 $B$ 中。
第二问：在第二个问题中，我们限定了 DNA 序列仅包含碱基 a 和 c，要求我们按某个特定位置的碱基对序列进行排序，并且不得使用额外的数组。在这种情况下，可以使用荷兰国旗问题中的双指针方法。这种方法通过维护两个指针，一个从左至右移动，寻找需要交换的元素；另一个从右至左移动，同样寻找需要交换的元素。这种方法能够在 $O (N)$ 的时间复杂度内完成排序，并且不需要额外的存储空间。
第三问：最后一个问题要求我们按整个序列的字母顺序进行排序。此时，我们可以使用基数排序（Radix Sort），一种非比较型排序算法。该算法的关键在于从最低位（最不重要位）开始，对每一位进行排序。对于每一位的排序，我们可以使用第二问中实现的按某个位置排序的方法作为子程序。通过这种方式，我们可以保证整体排序的时间复杂度是 O(L \times N)，其中 L 是序列长度，N 是序列数量。

Solution

Question 1: Sorting by the Left-Most Base

Pseudocode:

// Pseudocode for Counting Sort based on the first base
 
function countingSortFirstBase(A, N):
    // Step 1: Initialize count array
    count = [0, 0, 0, 0] // for 'a', 'c', 'g', 't'
    B = array of size N
 
    // Step 2: Count occurrences of each base
    for i = 1 to N:
        if A[i][0] == 'a':
            count[0] += 1
        else if A[i][0] == 'c':
            count[1] += 1
        else if A[i][0] == 'g':
            count[2] += 1
        else if A[i][0] == 't':
            count[3] += 1
 
    // Step 3: Calculate starting index for each base in sorted order
    for j = 1 to 3:
        count[j] += count[j - 1]
 
    // Step 4: Place elements in sorted order
    for i = N downto 1:
        if A[i][0] == 'a':
            B[count[0] - 1] = A[i]
            count[0] -= 1
        else if A[i][0] == 'c':
            B[count[1] - 1] = A[i]
            count[1] -= 1
        else if A[i][0] == 'g':
            B[count[2] - 1] = A[i]
            count[2] -= 1
        else if A[i][0] == 't':
            B[count[3] - 1] = A[i]
            count[3] -= 1
 
    return B

Explanation:

Initialize Count Array: We first initialize a count array of size 4 to count the occurrences of the bases ‘a’, ‘c’, ‘g’, and ‘t’. The array indices represent the bases: count[0] for ‘a’, count[1] for ‘c’, count[2] for ‘g’, and count[3] for ‘t’.
Count Occurrences: We iterate over the array $A$ and count the occurrences of each base in the left-most position of the sequences.
Calculate Starting Indices: We compute the starting index for each base in the sorted array by accumulating the counts. This step helps determine where to place the sequences in array $B$ .
Place Elements in Sorted Order: We iterate over the array $A$ from right to left and place each sequence in its correct position in array $B$ based on the left-most base. The count array helps track the next available position for each base.

Question 2: Sorting by the $j$ -th Base (Only ‘a’ and ‘c’)

Pseudocode:

## Pseudocode for In-Place Sorting by the j-th base using Two-Pointer Method
 
function sortByJthBase(A, N, j):
    left = 1
    right = N
    while left <= right:
        if A[left][j] == 'a':
            left += 1
        else if A[right][j] == 'c':
            right -= 1
        else:
            swap A[left] with A[right]
            left += 1
            right -= 1

Explanation:

Initialization: We initialize two pointers, left starting from the beginning and right starting from the end of the array $A$ .
Partitioning: The goal is to move all sequences starting with ‘a’ to the left and all sequences starting with ‘c’ to the right.
Sorting Process:
- If A[left][j] is ‘a’, we move the left pointer to the right.
- If A[right][j] is ‘c’, we move the right pointer to the left.
- If A[left][j] is ‘c’ and A[right][j] is ‘a’, we swap these two sequences, move the left pointer to the right, and the right pointer to the left.

This in-place sorting ensures that all sequences are sorted based on the $j$ -th base without using additional memory.

Question 3: Sorting by Whole Sequence

Pseudocode:

// Pseudocode for Radix Sort using Question 2's Sort as Subroutine
 
function radixSortSequences(A, N, L):
    // Sort by each base position starting from the least significant
    for j = L downto 1:
        sortByJthBase(A, N, j)

Explanation:

Radix Sort: We use the Radix Sort algorithm to sort the sequences by considering each base position, starting from the least significant (rightmost) to the most significant (leftmost).
Subroutine Usage: For each base position $j$ , we call sortByJthBase(A, N, j) to sort the sequences based on the base at the $j$ -th position.
Efficiency: Since sortByJthBase runs in O(N) time, and we do this for all $L$ base positions, the overall time complexity of Radix Sort is $O (L \times N)$ , which is efficient for this problem. This ensures that the entire sorting process is stable and the final order respects the order dictated by all bases in the sequence.

知识点

排序算法计数排序基数排序荷兰国旗问题双指针

难点思路

基数排序的稳定性：需要确保每次排序时相同的元素顺序不变，这样最终排序结果是正确的。
双指针法的应用：在有限的内存情况下进行排序，尤其是排序特定的基因序列位置。

解题技巧和信息

在有限内存情况下排序时，双指针法非常有用。
基数排序对于多位数的排序非常高效，尤其在数据量较大时。

重点词汇

Counting Sort 计数排序
Radix Sort 基数排序
In-Place Sort 原地排序

参考资料

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. Chap. 8, 9.
Sedgewick, R., & Wayne, K. (2011). Algorithms (4th ed.). Addison-Wesley. Chap. 2.

Zephyr's Notes on ISCS & CBMS, UTokyo

Explorer

Explorer

CBMS-2024B-10

CBMS-2024B-10

解题思路

Solution

Question 1: Sorting by the Left-Most Base

Question 2: Sorting by the $j$ -th Base (Only ‘a’ and ‘c’)

Question 3: Sorting by Whole Sequence

知识点

难点思路

解题技巧和信息

重点词汇

参考资料

Graph View

Table of Contents

Backlinks

Zephyr's Notes on ISCS & CBMS, UTokyo

Explorer

CBMS-2024B-10

CBMS-2024B-10

解题思路

Solution

Question 1: Sorting by the Left-Most Base

Question 2: Sorting by the j-th Base (Only ‘a’ and ‘c’)

Question 3: Sorting by Whole Sequence

知识点

难点思路

解题技巧和信息

重点词汇

参考资料

Graph View

Table of Contents

Backlinks

Question 2: Sorting by the $j$ -th Base (Only ‘a’ and ‘c’)