What is the longest common subsequence problem?
Longest common subsequences (LCS) are a typical problem in IT . Approaches to solve the LCS problem often appear in interviews for programmers and algorithm courses.
What is the longest common subsequence?
The aim is to find the longest common subsequence in two sequences. This is where a subsequence is derived from a sequence. The subsequence has the same elemental order, in some instances when elements are removed. Let’s take a look at some examples to get a better idea of the principle:
Sequence X |
Sequence Y |
LCS(X, Y) |
---|---|---|
FATHER
|
MATTER
|
ATER
|
MOTHER
|
HERBAL
|
HER
|
DAVID
|
DANIEL
|
DAI
|
There is an important difference between the longest common subsequence and the longest common substring being that a substring may not have any gaps. In the example of “DAVID” and “DANIEL” the longest common substring would be “DA” since “I” is separated by “V” and “N”.
Are there any practical examples of the LCS problem?
The longest common subsequence problem is an important issue in all areas in which sequences which stem from each other are used. There are certain ways to find LCS to see whether there are similarities or differences and, in turn, discover any plagiarism. The well-known “diff” utility which checks for changes to source text files is based on the LCS problem.
In bioinformatics the longest common subsequence problem is often used when analyzing DNA sequences. DNA bases change in certain positions over time due to mutations. The presence of a long common subsequence in two sequences would suggest a genetic relationship. This can allow us to retrace evolution between species thanks to genetic development.
Linguists can also use the longest common subsequence problem to research linguistic development over centuries. If you have two words from different languages which have the same meaning and a long common subsequence, it suggests that they have the same root and etymology. This would be considered, then, a similar historical development.
What are the solutions to the longest common subsequence problem?
First of all, you can solve the LCS problem with a naïve approach. This is a simple approach without using any optimizations or special methods. To do so you simply compute all subsequences from two sequences and find the longest common subsequence. Unfortunately, this approach isn’t very efficient and is, therefore, only suitable for short sequences.
Below you will find three efficient solutions to the LCS problem:
- Recursive approach
- Optimization through memoization
- Dynamic programming
What all approaches have in common is that in regard to the two sequences, they differ in three cases:
- The last letter is the same.
- The last letter is not the same.
- One of the sequences’ length is zero.
The approaches each differ in their time complexity (asymptotic run time) and space complexity (memory use):
Approach | Run time | Memory |
---|---|---|
Naive approach | O(n * n²)
|
O(n)
|
Recursive approach | O(n²)
|
O(1)
|
Optimization via memoization | O(n *m)
|
O(n* m)
|
Dynamic programming | O(n *m)
|
O(n* m)
|
The algorithms set out below all compute the length of the longest common subsequence. There are, if applicable, multiple set subsequences of this length which can be determined using other steps.
Recursive determination of the longest common subsequence
If you look at the LCS problem, you can see that it has an “optimal substructure”. This means that the problem can be reduced to subproblems. As a solution you can use a recursive approach. Below you can find some examples of algorithms to implement this in three of the most common programming languages.
Python
def lcs(X, Y, m, n):
# Base case
if m == 0 or n == 0:
return 0
# Last elements are equal
elif X[m-1] == Y[n-1]:
return 1 + lcs(X, Y, m-1, n-1)
# Last elements differ
else:
return max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n))
X, Y = "DANIEL", "DAVID"
lcs_len = lcs(X, Y, len(X), len(Y))
print(f"Length of LCS is: {lcs_len}")
pythonJava
import java.io.*;
class LCS {
public static int lcs(String A, String B, int m, int n)
{
// Base case
if (m == 0 || n == 0)
return 0;
// Last elements are equal
if (A.charAt(m - 1) == B.charAt(n - 1))
return 1 + lcs(A, B, m - 1, n - 1);
// Last elements differ
else
return Math.max(lcs(A, B, m, n - 1),
lcs(A, B, m - 1, n));
}
// Let's test
public static void main(String[] args)
{
String X = "DAVID";
String Y = "DANIEL";
int lcsLength = LCS.lcs(X, Y, X.length(), Y.length());
System.out.println("Length of LCS is: " + lcsLength);
}
}
java- Simple registration
- Premium TLDs at great prices
- 24/7 personal consultant included
- Free privacy protection for eligible domains
C++
#include <iostream>
using namespace std;
int lcs(string X, string Y, int m, int n)
{
// Base case
if (m == 0 || n == 0) {
return 0;
}
// Last elements are equal
if (X[m - 1] == Y[n - 1]) {
return 1 + lcs(X, Y, m - 1, n - 1);
}
// Last elements differ
else {
return max(lcs(X, Y, m, n - 1), lcs(X, Y, m - 1, n));
}
}
// Let's test
int main()
{
// Initialize variables
string X = "DAVID";
string Y = "DANIEL";
// Compute and output length of LCS
cout << "Length of LCS is " << lcs(X, Y, X.size(), Y.size());
return 0;
}
c++Optimizing the recursive approach using memoization
When looking again at the recursive approach, you can see that the overlapping parts are computed. These characteristics, called “overlapping subproblems” are known from Fibonacci sequences. In this case as well recursive sequences are constantly computed for a solution. To make the process more efficient, it’s worthwhile using memoization. In other words, you can cache the subproblems that have already been computed in a two-dimensional matrix.
Python
def lcs(X, Y, m, n, table):
# Base case
if (m == 0 or n == 0):
return 0
# Already computed value at given position
if (table[m][n] != -1):
return table[m][n]
# Last elements are equal
if X[m - 1] == Y[n - 1]:
table[m][n] = 1 + lcs(X, Y, m - 1, n - 1, table)
# Last elements differ
else:
table[m][n] = max(lcs(X, Y, m, n - 1, table), lcs(X, Y, m - 1, n, table))
return table[m][n]
# Let's test
X = "DAVID"
Y = "DANIEL"
m, n = len(X), len(Y)
# Initialize table fields to `-1`
table = [[-1 for i in range(n + 1)] for j in range(m + 1)]
// Compute and output length of LCS
print(f"Length of LCS is: {lcs(X, Y, m, n, table)}")
pythonJava
import java.io.*;
class LCS {
public static int lcs(String X, String Y, int m, int n, int[][] table) {
// Base case
if (m == 0 || n == 0) {
return 0;
}
// Already computed value at given position
if (table[m][n] != -1) {
return table[m][n];
}
// Last elements are equal
if(X.charAt(m - 1) == Y.charAt(n - 1)) {
table[m][n] = 1 + lcs(X, Y, m - 1, n - 1, table);
return table[m][n];
}
// Last elements differ
else {
table[m][n] = Math.max(lcs(X, Y, m, n - 1, table),lcs(X, Y, m - 1, n, table));
return table[m][n];
}
}
// Let's test
public static void main(String args[]){
// Initialize variables
String X = "DAVID";
String Y = "DANIEL";
int m = X.length();
int n = Y.length();
int[][] table = new int[m + 1][n + 1];
// Initialize table fields to `-1`
for(int i=0; i < m + 1; i++) {
for(int j=0; j < n + 1; j++) {
table[i][j] = -1;
}
}
// Compute and output length of LCS
int lcsLength = lcs(X, Y, m, n, table);
System.out.println("Length of LCS is: " + lcsLength);
}
}
javaC++
#include <bits/stdc++.h>
using namespace std;
int lcs(char *X, char* Y, int m, int n, vector<vector<int>>& table)
{
// Base case
if (m == 0 || n == 0)
return 0;
// Already computed value at given position
if (table[m][n] != -1) {
return table[m][n];
}
// Last elements are equal
if (X[m - 1] == Y[n - 1]) {
table[m][n] = 1 + lcs(X, Y, m - 1, n - 1, table);
return table[m][n];
}
// Last elements differ
else {
table[m][n] = max(lcs(X, Y, m, n - 1, table), lcs(X, Y, m - 1, n, table));
return table;
}
}
// Let's test
int main()
{
// Initialize variables
char X[] = "DAVID";
char Y[] = "DANIEL";
int m = strlen(X);
int n = strlen(Y);
// Initialize table with `-1`
vector<vector<int>> table(m + 1, vector<int>(n + 1, -1));
// Compute and output length of LCS
cout << "Length of LCS is: " << lcs(X, Y, m, n, table);
return 0;
}
c++Dynamic programming for the longest common subsequence
Dynamic programming is a non-recursive way to solve optimization problems by dividing them into smaller subproblems and then solving them “bottom up” Among other things, dynamic programming is applied as a solution for pathfinding algorithms. The longest common subsequence problem can also be solved using dynamic programming when using a two-dimensional matrix.
Python
def lcs(X , Y, m, n):
# Initialize dynamic programming table fields to `None`
table = [[None] * (n + 1) for _ in range(m + 1)]
# Compute table values
for i in range(m + 1):
for j in range(n + 1):
# Base case
if i == 0 or j == 0 :
table[i][j] = 0
# Last elements are equal
elif X[i - 1] == Y[j - 1]:
table[i][j] = table[i - 1][j - 1]+ 1
# Last elements differ
else:
table[i][j] = max(table[i - 1][j] , table[i][j - 1])
return table[m][n]
# Let's test
X = "DAVID"
Y = "DANIEL"
# Compute and output length of LCS
lcs_len = lcs(X, Y, len(X), len(Y))
print(f"Length of LCS is: {lcs_len}")
pythonJava
import java.io.*;
class LCS {
public static int lcs(String X, String Y, int m, int n)
{
// Initialize dynamic programming table fields
int table[][] = new int[m + 1][n + 1];
for (int i = 0; i <= m; i++) {
for (int j = 0; j <= n; j++) {
// Base case
if (i == 0 || j == 0)
table[i][j] = 0;
// Last elements are equal
else if (X.charAt(i - 1) == Y.charAt(j - 1))
table[i][j] = table[i - 1][j - 1] + 1;
// Last elements differ
else
table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
}
}
return table[m][n];
}
// Let's test
public static void main(String args[]){
// Initialize variables
String X = "DAVID";
String Y = "DANIEL";
int m = X.length();
int n = Y.length();
// Compute and output length of LCS
int lcsLength = lcs(X, Y, m, n);
System.out.println("Length of LCS is: " + lcsLength);
}
}
javaC++
#include <bits/stdc++.h>
using namespace std;
int lcs(string X, string Y, int m, int n) {
// Initialize dynamic programming table
int table[m + 1][n + 1];
// Compute table values
for (int i = 0; i <= m; i++) {
for (int j = 0; j <= n; j++) {
// Base case
if (i == 0 || j == 0)
table[i][j] = 0;
// Last elements are equal
else if (X[i - 1] == Y[j - 1])
table[i][j] = table[i - 1][j - 1] + 1;
// Last elements differ
else
table[i][j] = max(table[i - 1][j], table[i][j - 1]);
}
}
return table[m][n];
}
// Let's test
int main() {
// Initialize variables
string X = "DAVID";
string Y = "DANIEL";
int m = X.size();
int n = Y.size();
// Compute and output length of LCS
cout << "Length of LCS is " << lcs(X, Y, m, n);
return 0;
}
c++