Thursday 2 January 2014

Algorithm : Longest Common Subsequence (a.k.a. LCS)


The problem statement :

We are given two strings 'S' of length 'n' and 'T' of length 'm'. Our goal is to produce the length of their longest common subsequence : the longest sequence of characters that apperar from left to right (not necessarly continos) in both strings, a.w.a. this string.
Example :   S is "ABAZDC"  &  T is :BACBAD". Their LCS string is ABAD

Time complexity :  O( m*n )

This problem is probably one the most fundamental problems in strings and is frequently asked or implemented in programming websites like codechef. Typically we might be asked to display the length of the LCS string and then display the string, or maybe even vice-versa. There are two methods through which we can do it, i will explain with code one of it which has a fair advantage of using Dynamic Programming (DP) while the other one would be for you to work upon.

Methods :
  • Bottom-Up approach
  • Top-Down approach
I will be very explanatory and describe you the approach from grass root level. I expect you have some time and patience to read and understand.
Now how can we start thinking. One way which is pretty bizarre to think and probably the framework of code wont set in mind is find all sub-sequences of both strings, store them in 2-d array locations and then compare them once we have computed all sub sequences. 

Probably as easy it is to see, we can easily infer that this is waste of space and a lot of waste of time. In tight situations we need something where we can avoid work that has nothing to do with the question. Although no doubt this problem is not the simplest of all and hence requires considerable manipulation time for the string.

1. Lets start with Bottom-Up approach. We find the length of the LCS string and then get the string. In this approach we break the solution in simpler sub problems. Once we clear the sub-problems we should be able to combine them all to get to the answer.
Let us say that LCS[ i , j ] denotes the length of LCS in string S from index 1 to i and string T from index 1 to j. For the sake of simplicity we shall denote them as LCS[ i , j ] from S[ 1...i ] and T[ 1...j ]. Since both S and T are character arrays hence for any index, say 'x', S[x] or T[y] would yield the respective charcaters at those indices. So now here there are 2 main possibilities.
  1. S[ i ]  !=  T[ j ]   :  Since these two characters are a mismatch hence we can say that at least either of them don't deserve to be in the LCS[ i , j ]. Hence one thing is sure as given by the expression below :

     LCS[ i , j ] = __MAX__ ( LCS[ i-1 , j ]  ,  LCS[ i , j-1 ] )

    where __MAX__ is a function that returns the maximum length of the two inputs. See if you can convince yourself for the argument above. The one that would produce the largest string would be automatically accepted. We have used some recursion over here as well.
  2. S[ i ] == T[ j ]  :  Since we have a match, hence it is clear that this character (common to both strings) has to be included in the LCS string and would certainly contribute to a 'unit' of length of the length of the final answer string. Hence we can certainly agree upon the expression below

     LCS[ i , j ] = 1 + LCS[ i-1 , j-1 ]
Hence proceeding this way for 'i' and 'j' to cover the length of their whole strings, we can find the length of the end LCS string.

I give here a pictorial representation of a matrix which has the string S on its left panel arranged downwards, and the string T arranged on the top panel left to right. The contents of this matrinx are nothing but the outcomes of the two inequalities we had discussed above.




Just grab a page and manipulate if you get the same output or not. Obviously if you are computing values outside the lowerbound (i.e. 0 ) then return value should be 0.
Length of the LCS : the length of the resulting LCS will be the bottom-rightmost corner of this matrix, i.e. 4.
The LCS String : Keep these two rules in mind and get the string. Start from the bottom-rightmost corner.
  • Rule 1 : if cell above or cell to the left have equal value as to the current cell, then move to that cell. If both cells have equal value to the current cell, then choose either of it.
  • Rule 2 : If both cells have value strictly less than the current cell value, then push the current cell character on to a stack and move one cell diagonally up-left (ie, i-1,j-1).
Once you have done this, pop all the elements of the stack and this gives you the lcs string, We used the stack, coz otherwise it would had given us a reversed string. Hence popping all elements of the stack would give ABAD.
I think i have given you a very fair and detailed description. I leave to code this part to you people.


2 . Top-Down DP method.
   Time Complexity : O( m*n )
We follow the approach of DP which we shall discuss here. Once you are fairly known with DP (in case you havn't ) this method would be clear in a matter of seconds. Lets first write a recursive function to get the length of the LCS, which would be a pseudo code.

typedef unsigned long long int ullint;
ullint LCS( S , n , T , m )
{

  llint result;
  if( n==0 || m==0)  return 0;  //obviously
  if( S[n] == T[m])  result= 1 + LCS( S, n-1, T, m-1);
  else result = MAX(LCS[ S, n-1, T, m] , LCS[S, n, T, m-1]);
  return result;

}

Definitely this seems now a piece of cake, but wait! For large strings it will rip you off your time bounds. We are doing a lot of work that has been done already once in the past in this recursive loop. What I am trying to say, that a lot of work that might had been done due to same calls again in sub problems, can be saved (also called memoization).


A very interesting article and explanation on memoization is given here. Click to jump in!
 (+) first let us take matrix m*n and initialize each of its element to 0 (we will refer it to as a macro UNKNOWN)
 (+) if Current LCS is unknown in the matrix (untouched or hasnt been replaced by a new data yet) then compute the result and save it at the location for further use.
 (+) if LCS is already known (the UNKNOWN was replaced by some data previously) then simple return the result.
Let us assume that there is an array 'arr' that has been initially initialized to UNKNOWN.
The above recursion can be modified as.

llint LCS ( S, n, T, m)

{
 llint result;
 if(n==0 || m==0)  return 0;
 if(arr[n][m] != UNKNOWN)  return arr[n][m];  //since it is known already
 if(S[ n ]==T[ m ]) result = 1 + LCS(S,n-1,T,m-1);
 else result = MAX( LCS[ S, n-1, T, m] , LCS[ S, n, T, m-1] );
 arr[n][m]=result;   //store this work for future
 return result;
}


If you have understood the above code then it would be a matter of minutes before you figure out where can we insert a small code so that it yields the string as well. ッ

Any suggestions are formally welcome. I hope i presented a very detailed description of this problem and must had increased your understanding.
You can also follow me on facebook!

2 comments:

  1. thanks jai for posting this , was very useful ! :)

    ReplyDelete
    Replies
    1. Sorry Melvin returned to the blog after such a long time. Thanx anyway. :)

      Delete