Friday 23 May 2014

DS : TRIE

Trie is an indexed based approach on building trees to store strings (same as that of a dictionary) and then searching them in linear time. The tree formed is different from that of the conventional binary trees that we study, but the approach is more or less quite simple.
The intention is very clear. We need to store a bunch of strings in a data structure first, and then later on when we have a sample string that has to be searched for existence our data structure would serve as a database in which we will search for the string, the same way we search word by word in a dictionary. This reduces the complexity to linear time as opposite to a string comparison search in n^2 complexity.

3 basic things we need to do when we encounter a trie:
  (+)  Tree Initialization - setup tree structure
  (+)  Filling/Insertion - insert you elements/strings
  (+)  Searching - search you favorite hero or movie name ;)

Another point to ponder -
Unlike binary trees, where we have the node of type of a struct and we needed only one of it, here we will have two types of struct.
  (-)  Type1 - With a character array of pointers that will keep track of the string letters as it passes by. And an integer value to keep track of a termination. Obviously multiple occurrences of a string automatically gets ignored.
  (-)  Type2 - With an integer value keeping track of all the nodes it encountered and a pointer to point to nodes of type 1.

The way we will design it will be clear diagrammatically below.



So lets see the code that does the trick. I have furnished it with comments as i usually do so that the picture is clear of what we are currently doing.


#include<iostream>
#include<stdio.h>
#include<string.h>
#include<stdlib.h>

//dont forget to include this, you might get strange errors, LOL :P
using namespace std;

/**
 * trie nodes
 * type1 as mentioned in text
 */
typedef struct trie_node trie_node_t;
struct trie_node
{
    /**
    * contains and integer to count the occurence of
    * any string ending and what number did that
    * string entered on.
    * Also contains an array of size 26 to keep track
    * of all strings passing by.
    **/
    int count;
    trie_node_t * children[26];
};

/**
 * trie root
 * type2 as mentioned in text
 **/
typedef struct trie trie_t;
struct trie
{
    /**
    * contains an integer to keep track of total
    * entries in the structure.
    * also contains a pointer that points to a
    * node. NOTE THE TYPE OF POINTER IT CONTAINS.
    **/
    int value;
    trie_node_t * root;
};


trie_node_t * getNewTrieNode()
{
    /**
    * this function creates a new trie node of type1.
    * all values have been flushed and returned then.
    **/
    trie_node_t * pTemp = (trie_node_t *)malloc(sizeof(trie_node_t));
    pTemp->count=0;                 //reset val
    for(int i=0;i<26;++i)
        pTemp->children[i]=NULL;    //reset val
    return pTemp;

    /**********************************************************/
}

void initializeTRIE(trie_t * pRoot)
{
    /**
    * initialize the trie root node to
    * start linking trees.
    **/
    pRoot->value=0; //reset val
    pRoot->root = getNewTrieNode();
}

void insertTRIE_NODE(trie_t * pRoot,char key[])
{
    /**
    * this function takes in a string and travels
    * through it, and places contents accordingly
    * in the trie.
    **/

    trie_node_t * pCrawl = pRoot->root; //a walking pointer
                                        //through the trie
    pRoot->value++;     //we can increase the value,
                        //as we know we are about to add a new
                        //string to the structure.
    int len = strlen(key);
    for(int i=0; i<len; ++i)    //while the string is not
                                //fully travelled
    {
        if( !pCrawl->children[ key[i]-'a' ] )   //has not been set
        {
            //then set it
            pCrawl->children[ key[i]-'a'] = getNewTrieNode();
        }
        pCrawl = pCrawl->children[key[i]-'a'];  //move ahead
    }
    pCrawl->count = pRoot->value;   //set the end point on the node
                                    //better set it as the entry
                                    //number in the structure
    /***********************************************************/
}




int searchTRIE(trie_t * pRoot,char sample[])
{
    //returns 0 on failure
    //returns 1 on success
    /**
    * first as usual get to the root node
    * and then gather data to start searching
    **/
    trie_node_t * pCrawl = pRoot->root;
    int len = strlen(sample);
    for(int i=0; i<len; ++i)
    {
        /**
        * if a mismatch is found, break off and return
        * a failure flag, else continue searching.
        **/
        if(!pCrawl->children[sample[i]-'a'])
            return 0;   //failure
        pCrawl = pCrawl->children[sample[i]-'a'];
    }

    /**
    * It is possible that the string we were looking for
    * was a substring of a another bigger string.
    * Take for example the string already entered
    * could be 'jaiwardhan', and the search string could
    * be 'jai'.
    * Had that been the case the upper loop would not
    * return a failure.
    * to check if a string did end, we simple check
    * the count value. If it is not 0 means, some
    * entry was made in the past, else there exists
    * none of it.
    **/
    return (pCrawl->count && pCrawl);

    /*******************************************************/
}


int main()
{
    trie_t pRoot;
    initializeTRIE(&pRoot); //INIT trie
    char s1[]="jaiwardhan"; //first entry
    char s2[]="jai";        //second entry
    char s3[]="jayesh";     //third entry

    insertTRIE_NODE(&pRoot,s1); //DO INSERTION
    insertTRIE_NODE(&pRoot,s2); //DO INSERTION
    insertTRIE_NODE(&pRoot,s3); //DO INSERTION

    //search now
    char s[]="jai";
    int result;
    result = searchTRIE(&pRoot,s);
    if(result)cout<<"Found"<<endl;
    else cout<<"Not Found"<<endl;
    return 0;
}
So this was a basic method to setup and use a trie, a very useful tool in programming. It has a wide application prospect and later we will come to know that a special part called suffix array (or suffix tree) with 'Compressed Trie'.
Hope you enjoyed reading it! All comments and suggestions invited :)
Haste-la-vista!!

No comments:

Post a Comment