This is based on a GitHub gist I wrote, which contains the source code. Grab it from here!

Full-text indexes are a powerful tool that helps power much of the internet. Imagine you have millions of documents, and you want to search through them by keyword and get results instantly. The naive solution would be to simply filter through them manually, doing string checks on each document - however, as most of you probably know already, that's not going to be fast enough.

So, how do we search a large collection of documents quickly? And how would we rank those documents by relevance? There are many solutions to this problem, but we will focus on one of the simplest and most popular: the tf-idf statistic.

TF-i what?

tf-idf stands for "term frequency-inverse document frequency", and is a way of determining how important a word is to a document in a collection. When you calculate the tf-idf for all words in a document, you also conveniently have a structure which you can add to a full-text index of the whole collection, allowing for searching.

There are two main variables in this statistic: term frequency (TF) and inverse document frequency (IDF).

TF simply refers to the number of times a word appears in a document, and is calculated for every word in every document. It is common to express the TF as a ratio, so tf = how_many_times_this_term_appears / document_word_count, and it is often calculated on a log scale (more about this in the code example below).

As for the IDF, this is calculated once per term and remains constant for that term (unless the collection gets updated, of course). It is based on how frequently the term appears across all documents, and tells us how common the term itself is. For example, words like 'it' and 'and' may appear in every document, and thus should carry little or no weight. Calculating it is actually very simple, and can be expressed in one line: idf = log ( total_records / matching_records ), where "total_records" is the total number of documents in the collection, and "matching_records" is the number of documents which contain the term in question.
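As a quick illustration (with made-up numbers, using JavaScript's natural-log Math.log), a very common word ends up with a tiny IDF and a rare word with a large one:

// 1,000 documents in total; suppose "and" appears in 950 of them, "aardvark" in 2
Math.log(1000 / 950); // ≈ 0.05 -> "and" carries almost no weight
Math.log(1000 / 2);   // ≈ 6.21 -> "aardvark" is a strong signal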

So, you have your TF for each term-document pair, and an IDF for each term. Now, how can we use these variables to build a full-text index?

The data structure

Every time a document is inserted, the text is tokenised (that is, the string is stripped of all punctuation except whitespace, and then turned into an array of words), and each term is given a weight. These terms are then stored in a hashmap that is easy to retrieve from and update, and which contains a reference to every document containing the term along with its weighting. For maximum performance, terms are always stored in an associative array (hashmap), so they can be recalled in O(1) time.

So we have:

term Hashmap {
    ...document_id (references documents[document_id]): Float // this is the TF weighting for that document
}
terms Hashmap {
    ...word { // one entry per word in our collection
        documents: term // a hashmap of documents containing this term with their weightings; consider a binary-searchable sorted array for larger collections of documents
        idf: Float // the IDF
    }
}
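To make that concrete, here is roughly what the terms object might look like in JavaScript after indexing a couple of documents (the words, ids and numbers are purely illustrative):

terms = {
    "cat": {
        documents: { 1: 0.31, 4: 0.19 }, // document id -> TF weighting in that document
        idf: 1.2
    },
    "mat": {
        documents: { 4: 0.25 },
        idf: 2.4
    }
}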

The insertion algorithm: calculating TF and storing the terms

This is the algorithm run every time a document is inserted. First, we tokenise the string. That removes most of the symbols, separates out the words, and makes them lowercase. If you are using Javascript, you might want to look at the newish "String.prototype.normalize" function to get rid of accents, and then simply remove all non a-zA-Z0-9\' characters with regex.
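A minimal JavaScript tokeniser along those lines might look like the following sketch (the exact regex and normalisation form are down to taste):

function tokenise(text) {
    return text
        .normalize("NFD")                 // decompose accented characters
        .replace(/[\u0300-\u036f]/g, "")  // strip the leftover combining accent marks
        .toLowerCase()
        .replace(/[^a-z0-9' ]+/g, " ")    // keep only letters, digits, apostrophes and spaces
        .split(/\s+/)
        .filter(word => word.length > 0);
}

// tokenise("Héllo, world! It's fast.") -> ["hello", "world", "it's", "fast"]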

After this, we iterate through all of the terms and give each a weight. The weight is given by the log-scaled TF formula: (log(t) + 1) / s, where t is the number of times the term appears, and s is the sum of (log(t) + 1) over all terms in the document.
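As a quick worked example (with a made-up document, using the natural logarithm), take a document whose tokens are "the cat sat on the cat":

counts: the = 2, cat = 2, sat = 1, on = 1
log(2) + 1 ≈ 1.693, log(1) + 1 = 1
s ≈ 1.693 + 1.693 + 1 + 1 = 5.386
weight(the) = weight(cat) ≈ 1.693 / 5.386 ≈ 0.314
weight(sat) = weight(on) = 1 / 5.386 ≈ 0.186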

The terms and document data are then stored in the data structure.

Pseudo-code for this may look like this:

insert(document) {
    documents[document.id] = document; // keep the document itself so we can count the collection later
    words = tokenise(all_text_in(document));
    local_terms = {}
    foreach(words as word) {
        local_terms[word] = (local_terms[word] || 0) + 1; // add one to the count for this word
    }
    logTotal = 0; // a running total of log(count) + 1 over all terms in this document
    foreach(local_terms as word -> count) {
        logTotal += log(count) + 1;
    }
    foreach(local_terms as word -> count) {
        // and again, once we have our total
        weighting = (log(count) + 1) / logTotal
        // Add this to our global objects as defined above, creating the term entry if it doesn't exist yet
        terms[word] = terms[word] || { documents: {}, idf: 0 }
        terms[word].documents[document.id] = weighting;
    }
}
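For those who prefer something runnable, a concrete JavaScript sketch of the same routine might look like this. It assumes the tokenise function from earlier, that each document keeps its searchable text in a .text property, and two module-level objects acting as the global hashmaps:

const terms = {};
const documents = {};

function insert(document) {
    documents[document.id] = document;
    const words = tokenise(document.text);
    const counts = {};
    for (const word of words) {
        counts[word] = (counts[word] || 0) + 1;
    }
    let logTotal = 0;
    for (const count of Object.values(counts)) {
        logTotal += Math.log(count) + 1;
    }
    for (const [word, count] of Object.entries(counts)) {
        const weighting = (Math.log(count) + 1) / logTotal;
        terms[word] = terms[word] || { documents: {}, idf: 0 };
        terms[word].documents[document.id] = weighting;
    }
}

// e.g. insert({ id: 1, text: "The cat sat on the mat" });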

Calculating the IDF

After your document updates are done, you'll want to calculate the IDF for each term that was updated. An alternative is to calculate the IDF at run-time every time you run a search, but that comes with a small performance sacrifice. In this example, we also subtract the count of documents containing our term from the total document count, as a small improvement to the weighting.

update(term) {
    // Referencing the global objects defined above. ~.count is simply the number of keys in a hashmap, e.g. Javascript's Object.keys(obj).length
    terms[term].idf = log( ( (documents.count - terms[term].documents.count) / terms[term].documents.count ) + 1 )
}
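Tied to the JavaScript sketch above, a (hypothetical) helper that refreshes the IDFs after a batch of inserts could look like this - here we simply recalculate every term rather than tracking which ones were touched:

function updateAllIdfs() {
    const totalDocuments = Object.keys(documents).length;
    for (const word of Object.keys(terms)) {
        const matching = Object.keys(terms[word].documents).length;
        // log( ((total - matching) / matching) + 1 ), as in the pseudo-code above
        terms[word].idf = Math.log((totalDocuments - matching) / matching + 1);
    }
}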

The find algorithm

Next is the algorithm to find and rank documents. We simply tokenise the search input and run through each term, comparing against our trusty terms hashmap. We then multiply the TF weighting for a term by our IDF score. To produce a total score for a given document in a search query, we can simply add the weightings of all matching terms in that document together. There are many techniques for this: you could multiply them together, do some stuff with log, or anything in between. There's no single correct answer and everyone will want to tweak this differently. In my gist example given at the beginning of the document, I actually store the words that follow a term in question in the documents table, so that we can enhance the weighting further - but that's getting ahead of ourselves in this article. Check the gist for more details.

Pseudo-code for this may look like this:

find(text) {
    words = tokenise(text)
    results = {}; // Our hashmap of results; we can process this into a sorted array, etc. later.
    foreach(words as word) {
        if (!terms[word]) continue; // skip words we've never seen
        foreach(terms[word].documents as document_id -> document_weight) {
            // add this term's tf-idf score to the document's running total
            results[document_id] = (results[document_id] || 0) + terms[word].idf * document_weight
        }
    }
    return results
}
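To turn that results hashmap into a ranked list (as the comment above hints), one option in JavaScript is a simple sort, assuming a find that returns the hashmap:

const results = find("cat on a mat"); // illustrative query
const ranked = Object.entries(results)
    .sort(([, scoreA], [, scoreB]) => scoreB - scoreA); // highest-scoring documents first
// ranked is now an array of [document_id, score] pairs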

For those who want a real example

Yes, yes, I apologise for simply using pseudocode. I guess I wanted to make this a universal summary of how full-text search engines work. However, I have a Javascript version on Gist which should work: click here. Note that in this version, I calculate idf at runtime, when the find algorithm is used. I am currently using this search engine for one of my static-website blogs.