`WHERE title LIKE CONCAT('%', :query, '%')`

It's convincing at first, but then a few days later the client calls you and claims that "search is broken!"
Of course, your search isn't *broken*, it's just not doing what the client wants. Regular web users don't really understand the concept of exact matches, so your search quality ends up being poor. You decide you need to use full-text search. With some MySQL fiddling you're able to set up a FULLTEXT index and use a more capable syntax: the `MATCH() ... AGAINST()` query.

Great! Problem solved. For smallish databases.

As you hit the hundreds of thousands of records, you notice that your database is sluggish. MySQL just isn't *great* at full-text search. So you grab ElasticSearch, refactor your code a bit, and deploy a Lucene-driven full-text search cluster that works wonders. It's fast and the quality of results is great.

Which leads you to ask: what the heck is Lucene doing *so right*?

This article (on TF-IDF, Okapi BM-25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.

It would be convenient to be able to define a "relevance score" that relates a document to a search query. And then, when users search for things, we can sort by the relevance score instead of sorting chronologically. That way the most relevant documents come up on top, regardless (or maybe not) of how old they are.

There are many, many ways to relate one text to another, but let's start simple and use a statistics-based approach that doesn't need to understand the language itself. Instead, it looks at the statistics of word usage and weighs documents based on the prevalence of their distinctive words.

This algorithm doesn't care about verbs or nouns or even meaning. All it cares about is the simple fact that there are common words and there are rare words, and if your search phrase includes both, you're better off ranking the documents that contain the rare word higher and putting less weight on matched common words.

The algorithm we'll use is called Okapi BM25, but it builds on two basic concepts: *term frequency* ("TF") and *inverse document frequency* ("IDF"). Together, these concepts form "TF-IDF", which is a statistical measure that represents how important a term is to a document.

*Term frequency*, abbreviated "TF", is a simple metric: it's the number of times a certain word appears in a document. You can also represent it as the fraction of the number of times a word appears over the total number of tokens (i.e., total words) in the document. Term frequency says "I'm 100 words long and 'the' shows up 8 times, so the term frequency of 'the' is 8, or 8/100, or 8%" (depending on the representation you want).

*Inverse document frequency*, abbreviated "IDF", is more involved: the *rarer* a word is, the *higher* this value. It's the log ratio of the total number of documents over the number of documents a term appears in. Rarer words, therefore, yield bigger IDFs.

If you multiply these two numbers together (TF × IDF), you get the importance of a word to a document, "importance" being defined as "how rare is this word, and how often does it appear in this document?"

You can then use this concept to relate a document to a search query. For each term in a search query, calculate its TF-IDF score, add them all up, and whichever document has the highest score is your winner.
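As a sketch, the naive TF-IDF scoring just described might look like the following. This is illustrative only (not the article's BM25 class), and it uses `Math.log10` for the log ratio, matching the logarithm used later in this article:

```javascript
// Fraction of the document's tokens that are this term.
function termFrequency(term, tokens) {
  var count = 0;
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === term) { count++; }
  }
  return count / tokens.length;
}

// Log ratio of total documents to documents containing the term.
// `documents` is an array of token arrays.
function inverseDocumentFrequency(term, documents) {
  var docsWithTerm = 0;
  for (var i = 0; i < documents.length; i++) {
    if (documents[i].indexOf(term) !== -1) { docsWithTerm++; }
  }
  if (docsWithTerm === 0) { return 0; } // unseen terms contribute nothing
  return Math.log10(documents.length / docsWithTerm);
}

// Sum of TF * IDF over each query term.
function score(queryTerms, docTokens, documents) {
  var total = 0;
  for (var i = 0; i < queryTerms.length; i++) {
    var term = queryTerms[i];
    total += termFrequency(term, docTokens) * inverseDocumentFrequency(term, documents);
  }
  return total;
}
```

A query term found in every document gets a low IDF and barely moves the score; a rare matched term dominates it.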

Cool? Cool.

The algorithm described above is *okay* but not wonderful. It does provide us with statistically derived relevance scores, but it could be improved.

Okapi BM25 is considered a state-of-the-art ranking algorithm (so says ElasticSearch). The major improvements that Okapi BM25 brings over TF-IDF are two tunable parameters, called k1 and b, that modulate "term frequency saturation" and "field-length normalization". What?

To intuit term frequency saturation, imagine two documents of roughly the same length that both talk about baseball. Imagine that the overall corpus doesn't have much to do with baseball, so the IDF of the term "baseball" is pretty high -- it's a rare, important-ish word. These two documents both talk about baseball, and talk about it a lot, but one of them uses the term "baseball" way more. Should that document *really* show up that much higher in the rankings? Both documents talk about baseball a hefty amount, and at a certain point it shouldn't really matter whether you use the word "baseball" 40 times or 80 times. Anything above 30 is enough!

This is "term frequency saturation." The naive TF-IDF algorithm doesn't saturate, so the document that uses "baseball" 80 times will have twice the score as the one that uses it 40 times. Sometimes that's desired, sometimes it's not.

Okapi BM25, on the other hand, has a parameter called "k1" that actually lets you tune how quickly term frequency will saturate. The parameter k1 is usually taken between 1.2 and 2.0. Lower values result in quicker saturation (meaning that those two documents above will have similar scores, because they both have a significant number of "baseball"s).
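To see the effect of k1, here's a minimal sketch of BM25's TF component in isolation (length normalization is ignored here, i.e. b = 0, so the component reduces to `tf * (k1 + 1) / (tf + k1)`):

```javascript
// BM25's saturating term-frequency component, with b = 0 for simplicity.
function saturatedTf(tf, k1) {
  return (tf * (k1 + 1)) / (tf + k1);
}

// With k1 = 1.2, the 40-"baseball" and 80-"baseball" documents score
// nearly identically -- the curve has already flattened out.
var low = saturatedTf(40, 1.2);
var high = saturatedTf(80, 1.2);
```

With a naive TF, the second document would score exactly twice the first; here the difference is a couple of percent. Crank k1 way up and the curve stays closer to linear, so the gap widens again.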

Field-length normalization considers the length of the document and normalizes against the average length of all documents. It's useful in single-field collections (like ours) to put documents of differing lengths on the same playing field. It's doubly useful in multiple-field collections (like "title" and "body") in putting the title and body fields on the same playing field as well. The parameter b ranges from 0 to 1, with 1 giving full normalization and 0 giving no normalization.
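Pulled out on its own, the normalization factor is the `1 - b + b * docLength / avgDocLength` expression that sits in BM25's TF denominator. A quick sketch:

```javascript
// BM25's field-length normalization factor. Longer-than-average documents
// produce a factor > 1 (a larger TF denominator, i.e. a penalty) when b > 0.
function lengthNorm(b, docLength, avgDocLength) {
  return 1 - b + b * (docLength / avgDocLength);
}
```

With b = 0 the factor is always 1 (no normalization); with b = 1 a document five times the average length is penalized by a factor of 5.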

You can see the formula for the Okapi BM25 algorithm on the Okapi BM25 Wikipedia page. Now that you know what each of the terms means, it should be pretty straightforward to understand, so we won't dive into the equation here. Let's dive into code:

```
BM25.Tokenize = function(text) {
    // Assumes an external Porter stemmer() function and a stopStems array
    // (stemmed stop-words) are in scope; see the linked source code.
    text = text
        .toLowerCase()
        .replace(/\W/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ')
        .map(function(a) { return stemmer(a); });

    // Filter out stopStems
    var out = [];
    for (var i = 0, len = text.length; i < len; i++) {
        if (stopStems.indexOf(text[i]) === -1) {
            out.push(text[i]);
        }
    }

    return out;
};
```

We define a simple `Tokenize()` static method whose purpose is to parse a string into an array of tokens. Along the way, we lower-case all the tokens (to reduce entropy), we run the Porter stemmer algorithm to reduce the entropy of the corpus and also to improve matching (so that "walking" and "walk" match the same), and we filter out stop-words (very common words) to further reduce entropy. I've written about all these concepts in depth previously, so please excuse me if I'm glossing over this section. :)

```
BM25.prototype.addDocument = function(doc) {
    if (typeof doc.id === 'undefined') { throw new Error('ID is a required property of documents.'); }
    if (typeof doc.body === 'undefined') { throw new Error('Body is a required property of documents.'); }

    // Raw tokenized list of words
    var tokens = BM25.Tokenize(doc.body);

    // Will hold unique terms and their counts and frequencies
    var _terms = {};

    // docObj will eventually be added to the documents database
    var docObj = {id: doc.id, tokens: tokens, body: doc.body};

    // Count number of terms
    docObj.termCount = tokens.length;

    // Increment totalDocuments
    this.totalDocuments++;

    // Readjust averageDocumentLength
    this.totalDocumentTermLength += docObj.termCount;
    this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

    // Calculate term frequency
    // First get terms count
    for (var i = 0, len = tokens.length; i < len; i++) {
        var term = tokens[i];
        if (!_terms[term]) {
            _terms[term] = {
                count: 0,
                freq: 0
            };
        }
        _terms[term].count++;
    }

    // Then re-loop to calculate term frequency.
    // We'll also update inverse document frequencies here.
    var keys = Object.keys(_terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];

        // Term frequency for this document.
        _terms[term].freq = _terms[term].count / docObj.termCount;

        // Inverse document frequency initialization
        if (!this.terms[term]) {
            this.terms[term] = {
                n: 0, // Number of docs this term appears in, uniquely
                idf: 0
            };
        }
        this.terms[term].n++;
    }

    // Calculate inverse document frequencies.
    // This is SLOWish, so if you want to index a big batch of documents,
    // leave this commented out and call updateIdf() once at the end of the
    // bulk run. If you're only indexing a document or two at a time, you
    // can leave it in.
    // this.updateIdf();

    // Add docObj to docs db
    docObj.terms = _terms;
    this.documents[docObj.id] = docObj;
};
```

This `addDocument()` method is where most of the magic happens. We're essentially building and maintaining two similar data structures: `this.documents` and `this.terms`.

`this.documents` is our database of individual documents, but along with storing the full, original text of each document, we also store the document length and a list of all the tokens in the document along with their count and frequency. Using this data structure we can easily and quickly (with a super-fast, O(1) hash-table lookup) answer the question "in document #3, how many times did the word 'walk' occur?"
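To make that shape concrete, here's a hand-built example of the per-document record that `addDocument()` produces (the values are illustrative, not from the demo corpus):

```javascript
// Example of the record stored in this.documents for one document.
var docObj = {
  id: 3,
  body: 'walk walked walking home',
  tokens: ['walk', 'walk', 'walk', 'home'], // after stemming
  termCount: 4,
  terms: {
    walk: { count: 3, freq: 0.75 },
    home: { count: 1, freq: 0.25 }
  }
};

// O(1) answer to "in document #3, how many times did 'walk' occur?"
var walkCount = docObj.terms['walk'].count;
```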

We also build a second data structure called `this.terms`. It represents all terms in the entire corpus. Through this O(1) data structure we can quickly answer the questions "how many documents does 'walk' appear in, and what's its IDF score?"

Finally, we record the document length for each individual document, and also maintain an average document length for the whole corpus.

Of course, you'll see above that idf is initialized to zero, and I've even commented out the `updateIdf()` call. That's because it's quite slow, and it only needs to be run once, at the end of the indexing operation. No need to run it 50,000 times when once will suffice. Leaving it commented out and running it only once at the end of a bulk index operation really speeds things up. Here's the method:

```
BM25.prototype.updateIdf = function() {
    var keys = Object.keys(this.terms);
    for (var i = 0, len = keys.length; i < len; i++) {
        var term = keys[i];
        var num = (this.totalDocuments - this.terms[term].n + 0.5);
        var denom = (this.terms[term].n + 0.5);
        this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
    }
};
```

It's a simple function, but since it loops over the entire corpus terms list, updating each one, it's a somewhat expensive operation. The implementation is the standard formula for inverse document frequency (which you can easily find on Wikipedia) -- it's the log ratio of total documents to the number of documents a term appears in. I've also modified it to always be above zero.
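As a standalone sketch, applying the same formula to a rare term and a common term shows why rare terms dominate the score (the counts here are made up for illustration):

```javascript
// The BM25 IDF formula from updateIdf(), as a free function.
// n = number of documents containing the term, N = total documents.
function bm25Idf(n, N) {
  return Math.max(Math.log10((N - n + 0.5) / (n + 0.5)), 0.01);
}

var rareIdf = bm25Idf(2, 1000);     // term appears in 2 of 1,000 docs
var commonIdf = bm25Idf(900, 1000); // term appears in 900 of 1,000 docs
```

The common term's raw log ratio is negative, so the `Math.max(..., 0.01)` clamp is what keeps every term's contribution above zero.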

```
BM25.prototype.search = function(query) {
    var queryTerms = BM25.Tokenize(query);
    var results = [];

    // Look at each document in turn. There are better ways to do this
    // with inverted indices.
    var keys = Object.keys(this.documents);
    for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
        var id = keys[j];

        // The relevance score for a document is the sum of a tf-idf-like
        // calculation for each query term.
        this.documents[id]._score = 0;

        // Calculate the score for each query term
        for (var i = 0, len = queryTerms.length; i < len; i++) {
            var queryTerm = queryTerms[i];

            // We've never seen this term before, so IDF will be 0.
            // That means we can skip the whole term: it adds nothing to
            // the score and isn't in any document.
            if (typeof this.terms[queryTerm] === 'undefined') {
                continue;
            }

            // This term isn't in the document, so the TF portion is 0 and
            // this term contributes nothing to the search score.
            if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
                continue;
            }

            // The term is in the document, let's go. The whole term is:
            // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))

            // IDF is pre-calculated for the whole docset.
            var idf = this.terms[queryTerm].idf;

            // Numerator of the TF portion.
            var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);

            // Denominator of the TF portion.
            var denom = this.documents[id].terms[queryTerm].count
                + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));

            // Add this query term to the score
            this.documents[id]._score += idf * num / denom;
        }

        if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
            results.push(this.documents[id]);
        }
    }

    results.sort(function(a, b) { return b._score - a._score; });
    return results.slice(0, 10);
};
```

Finally, the `search()` method loops through all documents and assigns a BM25 relevance score to each, sorting the highest scorers to the top. Of course, it's silly to loop through every document in the corpus when searching, but that's the subject of Part Two (inverted indices and performance).

The code above is documented inline, but the gist is as follows: for each document and for each query term, calculate the BM25 score. The IDF score for each query term is pre-calculated globally and is just a simple lookup, the term frequency is document-specific but was also pre-calculated, and the rest of the work is simply multiplication and division! We add a temporary property called `_score` to each document, then sort the results by score (descending) and return the top 10.

There are lots of ways the above can be improved, and we'll explore those in Part Two of "Full-Text Search"! Stay tuned. I hope to publish it in a few weeks. Some things we'll accomplish next time are:

- Inverted index for faster searches
- Faster indexing
- Even better search results

For this demo, I built a little Wikipedia crawler that grabs the first paragraph of a decent number (85,000) of Wikipedia articles. Since indexing all 85k documents takes about 90 seconds on my computer I've cut the corpus in half for this demo. Don't want you guys to waste your laptop battery juice on indexing Wikipedia articles for a simple full-text demo.

Because the indexing is a CPU-heavy, blocking operation, I implemented it as a Web Worker. The indexing runs in a background thread -- you can find the full source code here. You'll also find references to the stemmer algorithm and my stop-word list in the source code. The code license, as always, is free to use for educational purposes but not for any commercial purpose.

Finally, here's the demo. Once the indexing is complete, try searching for random things and phrases that Wikipedia might know about. Note that there are only 40,000 paragraphs indexed here, so you might have to try a few topics before you find one that the system knows about.

If you're just looking for the summary and code demonstration, jump down.

Compared to the other major machine learning tasks, sentiment analysis has surprisingly little written about it. This is in part because sentiment analysis (and natural language in general) is difficult, but I also suspect that many people opt to monetize their sentiment algorithms rather than publishing them.

Sentiment analysis is highly applicable to enterprise business ("alert me any time someone writes something negative about one of our products anywhere on the internet!"), and sentiment analysis is fresh enough that there's a lot of "secret sauce" still out there. And now that the world is run on social sites like Twitter, Facebook, Yelp, and Amazon (Reviews), there are literally billions of data points that can be analyzed every single day. So it's hard to blame anyone for trying to get a piece of that pie!

Today we'll discuss an "easy" but effective sentiment analysis algorithm. I put "easy" in quotes because our work today will build upon the Naive Bayes classifier we talked about last time; but this will only be "easy" if you're familiar with those concepts already -- you may want to review that article before continuing.

There are better, more sophisticated algorithms than the one we'll develop today, but this is a great place to start. As always, I'll follow up with additional articles that cover more advanced topics in sentiment analysis in the future.

I found a nice, compact data set for this experiment. It consists of 10,662 sentences from movie reviews, labeled "positive" and "negative" (and split evenly). The primary goal today is to figure out how to tweak the Naive Bayes classifier so that it works well for sentiment analysis.

"Naive Bayes Classification" is termed such because it makes the assumption that each word is statistically independent from each other word. In language, however, that assumption doesn't hold true. "England" follows the word "Queen [of]" much more commonly than most other words ("Queen of Asparagus" just isn't something you see every day), so it's clear that words are not independent of one another. Still, it's been shown that the "naive" assumption is pretty good for the purposes of document classification.

And that's good news, because sentiment analysis feels quite a bit like classification: given a document, would you label it "positive" or "negative"? The main difference between classification and sentiment analysis is the idea that classification is objective but sentiment is subjective. But let's make another "naive" assumption and pretend that we don't care about that. Let's try applying Bayes to sentiment and see what happens.

For what it's worth, lots of scientific discoveries are made by people who are willing to "just try it and see what happens". If it fails, so what? Most experiments do. If it succeeds, though, that'd be great news!

Still, I don't think we should completely ignore the fact that sentiment is subjective and classification is objective. Subjective language is trickier than objective language because it's prone to idiom, turns of phrase, sarcasm, negation, and other artifacts of emotive language. Detecting what language a sentence is in, for instance, is straightforward because it's built on *facts*. Did you observe the word "*allons-y*"? If so, probably French. But consider Borat-ifying a movie review: "I loved this movie... not!" The naive approach may see the word "loved" and label the review positive, but that "not" changes the entire meaning of the sentence.

We should try to move just a little bit away from our assumption that words are independent. One nice way of doing this -- if you have lots of training data -- would be to use bigrams instead of unigrams. If you recall, the training portion of a Bayes classifier involves looking through your training set and counting the number of times a word appears in a class of documents. (E.g., "Clinton" appeared 9% of the time in "politics" articles, but only 0.1% of the time in "cooking" articles.)

Our Bayes classifier can easily be extended to use bigrams just by tokenizing our document two words at a time instead of one! Rather than treating the words "This" "movie" "was" "not" "great" separately -- and risk getting confused by the negation -- all we have to do is record them as pairs: "This movie" "movie was" "was not" "not great". Now our classifier knows the difference between "was great" and "not great"!
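A minimal bigram tokenizer sketch (illustrative, not the article's exact tokenizer -- it skips stemming and punctuation handling to keep the pairing logic visible):

```javascript
// Pair each word with its right-hand neighbor to produce bigram tokens.
function bigrams(text) {
  var words = text.toLowerCase().split(/\s+/);
  var pairs = [];
  for (var i = 0; i < words.length - 1; i++) {
    pairs.push(words[i] + ' ' + words[i + 1]);
  }
  return pairs;
}

// bigrams('This movie was not great')
// → ['this movie', 'movie was', 'was not', 'not great']
```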

Sadly, using bigrams poses a real problem for us: it increases our entropy. Entropy is a concept used across lots of fields, but roughly speaking, it's a measure of the number of possible "states" a system can be in. If we use 7,000 different words in conversational English and build a unigram-based Bayes classifier for them, then we'll need to store data (word counts) for only 7,000 words. But if we start using bigrams, we could *potentially* need to store data for up to 49,000,000 different word *pairs* (that's our maximum possible value though; in practice we'd probably only encounter 30,000 unique word pairs or so).

But storage space and computational complexity is the least of our concerns (30,000 records is nothing!). The real problem is the lack of training data. Using bigrams and introducing more entropy into our system means that each word pair is going to be relatively rare! A unigram approach might encounter the word "great" 150 times in our training data, but how many times will it see "great movie"? Or "not great"? Or "was great"? Or "seriously great"? The bigram approach is nice, but unless you have a huge training corpus it's probably not the way to go.

Note that there are other ways to combat entropy, and that's a good thing to do even when you're using unigrams. I almost always stem words as part of the tokenization process. Stemming puts word variations like "great", "greatly", "greatest", and "greater" all into one bucket, effectively decreasing your entropy and giving you more data around the concept of "great".

It turns out that dealing with negations (like "not great") is a pretty important step in sentiment analysis. A negation word can affect the tone of all the words around it, and ignoring negations would be a pretty big oversight. Bigrams, unfortunately, require too much training data, so we've got to find a better way to consider negation terms.

Fortunately, the solution is "kick-yourself-for-not-thinking-of-it" easy. Here's how I do it: if I see a negation (like "not", "never", "no", etc), I just add an exclamation point to the beginning of every word after it!

The sentence "This movie was not great" turns into "This movie was not !great", and the token "!great" gets stored in our Bayes classifer as having appeared in a negative review. "Great" appears in positive reviews, and its negation, "!great" appears in negative ones. It's simple, it's clever, and it's effective. Best of all: you get to use a regular Naive Bayes classifer. The only modification you need to make is in your tokenizer (the code that goes through the text, pre-processes it, and splits it up in to words or "tokens"), but the Bayes classification algorithm stays the same.

I ran this experiment's data through a variety of different tokenizers, and in this case I found it's more effective to flag only the word immediately before and after the negation term, rather than all the words until the end of the sentence. My tokenizer takes the sentence "This movie was not good because of the plot" and turns it into "This movie !was not !good because of the plot".
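Here's a sketch of that negation-flagging tokenizer. The function name and the negation list are my own illustrations, not the article's exact code, and stemming is omitted for clarity:

```javascript
// Negation terms that trigger flagging; extend as needed.
var NEGATIONS = ['not', 'no', 'never', "isn't", "don't", "doesn't"];

// Prefix "!" to the word immediately before and after each negation term.
function negationTokenize(text) {
  var words = text.toLowerCase().split(/\s+/);
  var flagged = words.map(function() { return false; });
  for (var i = 0; i < words.length; i++) {
    if (NEGATIONS.indexOf(words[i]) !== -1) {
      if (i > 0) { flagged[i - 1] = true; }
      if (i < words.length - 1) { flagged[i + 1] = true; }
    }
  }
  return words.map(function(w, i) { return flagged[i] ? '!' + w : w; });
}
```

Running it on the article's example sentence produces "this movie !was not !good because of the plot" -- only the immediate neighbors of "not" get flagged.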

Another effective trick you can use has nothing to do with the Bayes algorithm itself but rather with how you handle its results. A nice feature of a Bayes classifier is that it outright *tells* you its "confidence" in its ruling (confidence is in quotes here because this isn't the *statistical* definition of confidence, I'm just editorializing and calling probability "confidence").

Recall that the output of a Bayes classifier is "the probability that this document is 'negative'". If you have the luxury of not needing to label every single document you encounter, it may be best to only label documents that have over a certain percentage probability. This approach should make intuitive sense: if you read something and you're only 51% sure it's negative, maybe you shouldn't be labeling that document at all.

In this experiment, setting the "confidence threshold" to 75% or greater increased the classifier's accuracy from 78.5% to 85%, a 6.5 percentage-point difference!

You may not always have the luxury to abstain from guessing, but it makes sense in most cases. For example, if you need to alert a client about negative reviews, you could hit close to 90% accuracy if you only sent alerts out when the probability of being negative was greater than 80% or so.
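That thresholding logic might be sketched like this. The function name is illustrative, and the input is assumed to be the classifier's output probability that a document is negative:

```javascript
// Only label a document when the classifier clears the confidence cutoff;
// otherwise abstain by returning null.
function labelWithThreshold(probNegative, threshold) {
  if (probNegative >= threshold) { return 'negative'; }
  if (probNegative <= 1 - threshold) { return 'positive'; }
  return null; // not confident enough either way
}
```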

I won't do a code walkthrough like I usually do, since the core of this project is just the Naive Bayes classifier we built last time. However, the cross-validation and test procedure is worth talking about.

Cross-validation is how you should be testing your algorithms. In short, here's what you do:

- Obtain your training data set.
- Shuffle it up.
- Set 20% (or so) of it aside. Don't use that portion for training.
- Train your algorithm on the other 80% of data.
- Test your algorithm on the 20% you set aside. You test on this portion of the data because your algorithm has never seen it before (it wasn't part of the training set), but it's pre-labeled which means you can:
- Record the algorithm's accuracy (how many did it get right?)
- Optionally, record the algorithm's accuracy more granularly by recording true positives, true negatives, false positives, and false negatives. From this data you can discern not only the overall accuracy, but also the algorithm's "precision" and "recall", which we'll talk about in-depth at a later date.
- Repeat from step 2. I tend to shuffle, retrain, and retest about 30 times in order to get a statistically significant sample. Average your individual run accuracies to get an overall algorithm accuracy.
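The steps above can be sketched as follows, with `trainFn` and `classifyFn` as placeholders for your own classifier's train and classify functions:

```javascript
// Shuffle, hold out 20%, train on the rest, test on the holdout; repeat
// `runs` times and average the accuracies.
function crossValidate(data, trainFn, classifyFn, runs) {
  var accuracies = [];
  for (var run = 0; run < runs; run++) {
    // Shuffle a copy of the data (Fisher-Yates)
    var shuffled = data.slice();
    for (var i = shuffled.length - 1; i > 0; i--) {
      var j = Math.floor(Math.random() * (i + 1));
      var tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
    }

    // Set 20% aside for testing; train on the other 80%
    var cut = Math.floor(shuffled.length * 0.2);
    var testSet = shuffled.slice(0, cut);
    var trainSet = shuffled.slice(cut);
    var model = trainFn(trainSet);

    // Score the classifier on the unseen, pre-labeled holdout
    var correct = 0;
    for (var k = 0; k < testSet.length; k++) {
      if (classifyFn(model, testSet[k].text) === testSet[k].label) { correct++; }
    }
    accuracies.push(correct / testSet.length);
  }

  // Average the per-run accuracies for the overall figure
  return accuracies.reduce(function(a, b) { return a + b; }, 0) / runs;
}
```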

Remember that your algorithm will likely be highly specific to the type of training data you gave it. The program below gets 85% on movie reviews, but if you use it to judge the sentiment of a married couple's conversation it'll probably fall short.

This experiment's goal was simple: use a Naive Bayes classifier for sentiment analysis, and figure out what we can do to boost its accuracy.

I tried and tested 18 different variants of tokenizers and the Bayes classifier, both with and without adjustments for rare tokens, resulting in 36 total tests. Each test was run 30 times, for a total of 1,080 experiments. I found that the least effective tokenizer was the "stemmed, stopword-removed bigram" tokenizer (61.4% accuracy and slow as hell), and the most effective tokenizer had these properties:

- Unigram tokenizer (look at one word at a time)
- Tokens were stemmed (with the Porter stemmer algorithm; this converts "greater", "greatest", and "greatly" all to "great", thereby giving us more data points for the concept "great" and reducing our entropy)
- Flag any word to the left and right of a negation term (i.e., put an exclamation point in front of any word next to "not" or "never" or similar; this turns that word into a different, negatively-connoted word as far as Bayes is concerned. "great" becomes distinct from "!great", which implies "not great".)
- Deal with rare tokens by pulling them towards 50% "wordicity" (a term I use in the Bayes article) with a weight of 3 (ie, words we've seen fewer times are forced closer to 50%, because we don't trust them as much).
- Only label a document if we determine the probability to be 75% or greater.

The overall accuracy on that algorithm was 85.0%, which was pleasantly surprising to me. Most machine learning algorithms max out near 90%, so this turned out to be very successful for our "introduction to sentiment analysis"!

Wait for the system to train, and then write a movie-review-like sentence and see if it gets your sentiment right.

I'll be writing at least two more articles about sentiment analysis, so please be sure to ask any questions or make any suggestions you have in the comments below.

If you like this article, or thought it was helpful, please consider:

- Sharing it with friends on Facebook and Twitter
- Discussing it on Hacker News and Reddit
- Discussing it in the comments section below
- Following me on Twitter @bkanber
- Reading my other ML in JS articles
- Signing up for the ML in JS mailing list below — I only send emails about new articles, never spam, won't ever sell your info

This article is part of the Machine Learning in Javascript series which teaches the essential machine learning algorithms using Javascript for examples. I use Javascript because it's well-known and universally supported, making it an excellent language to use for teaching. There’s a mailing list at the bottom of the page if you want to know about new articles; you can also follow me on twitter: @bkanber.

Are you just looking for the code example? Scroll down!

Document classification is one of my favorite tasks. "Document classification" is exactly what you think it is: given a document and a set of labels, apply the most appropriate label to that document. Labels (or "classes" or "categories") can be things like:

- "spam" or "not spam" (most mail clients use some form of Bayesian spam detection)
- "written by a male" or "written by a female" (yes, there are clues that can hint to the gender of the author)
- "technology", "politics", "finance", "sports" (if you need to automatically categorize old articles from a newspaper)

Today we're going to solve a simple problem: **language detection.** Put another way: "given a piece of text, determine if it's in Spanish, English, or French".

First, we'll have to change the way we think about "documents". At their most basic level, documents are just collections of arranged characters. Some characters may be letters, others are not. Letters make words, and words have meaning. Sometimes punctuation alters the meaning of words, other times it doesn't matter too much! (See what I did there?) We typically study documents by studying the individual words in them. Sometimes, if the situation calls for it, we'll look at more than one word at a time (word pairs are called "bigrams"), but since this is only "part 1" of this Bayes series we'll only consider unigrams for now.

Because natural language itself is so quirky and complicated, we generally try to simplify things; we try to reduce the *entropy* of the system. If you consider both uppercase and lowercase letters, you can create 7.3 million different four-letter words. If you limit yourself to only lowercase letters, that figure drops down to only 450,000. Why is this important? You want your system to use as much data as it can, and treating "YOU'RE" and "you're" and "You're" separately only serves to keep what you learn about each in separate buckets. Converting all of those to a simple "you're" is best, because it'll allow you to study the *concept* of the word "you're" rather than the various syntaxes of the word.
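The four-letter-word arithmetic above is just 52 characters (upper plus lower case) versus 26 (lowercase only), raised to the four positions:

```javascript
// 52 choices per position over 4 positions, vs. 26 choices per position.
var mixedCase = Math.pow(52, 4);     // 7,311,616 -- roughly 7.3 million
var lowercaseOnly = Math.pow(26, 4); // 456,976  -- roughly 450,000
```

Normalizing case shrinks the state space sixteen-fold, which is exactly the entropy reduction the paragraph describes.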

Of course, more advanced data scientists will recognize that sometimes you *do* want these quirks in your system. When detecting spam emails, for instance, it turns out that "offer" and "OFFER" are two very different things. Often, spam detection algorithms will *not* normalize based on case because of this. Similarly, "money back" and "money back!!!!" have two very different meanings, so spam detection algorithms will generally also leave the punctuation intact.

You may also decide that you also want the meanings of the words "imagine", "imagination", and "imaginary" to be lumped together. To do so, you might chop off the ending of each, converting them to "imagin" -- it's neither verb nor noun, neither singular nor plural, it's simply a concept. This is called "stemming", and it serves to treat different words with the same meaning as the same entity. These entropy-reduction techniques are very important in machine learning as a whole, since training sets for learning algorithms are usually limited.

The process of splitting a document up into discrete chunks that you can study is called "tokenization". For now, we'll keep our tokenization simple: we'll remove any punctuation, make everything lowercase, and split the document up by spaces to get our tokens.
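A minimal version of that tokenizer might look like the following (a sketch of the steps just listed, not the exact code used later):

```javascript
// Lowercase, strip punctuation, collapse whitespace, split into tokens.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z\s]/g, ' ') // punctuation and digits become spaces
    .replace(/\s+/g, ' ')
    .trim()
    .split(' ');
}
```

Note that this splits contractions ("you're" becomes "you" and "re"); whether that matters depends on your corpus, and a production tokenizer might handle apostrophes differently.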

I'm not going to talk about the math around Bayes' theorem, because interested readers can easily learn the basics of probability and Bayes' theorem on their own. Instead, we'll aim to develop an intuition about this important tool in probability.

We're trying to figure out which language a previously unseen document is in. We have a stack of pre-labeled documents in English, French, and Spanish. Since we decided above that we can learn about a document by inspecting the individual words in that document, let's start there.

If you look at a single word in a document, you can easily figure out how many times it appeared in your training data. Using that information, you can determine the probability that a certain language will use a given word. For example: "vous" is the first word in your document of unknown origin. "Vous" may show up in all of your French documents (100%), no Spanish documents, and a small number of English documents (5%). That's a great hint! The document is French!

But no, we can't stop there. Just because "vous" is clearly a French word doesn't mean the document itself is French. It may be an English novel quoting a French character. So we can't just simply look at the "probability that 'vous' is French", which is what we just tried. Instead, we need to determine the "probability that this document is French given that the word 'vous' is in it". Fortunately, Bayes' theorem does exactly that. If we apply Bayes' theorem to the fake numbers I gave above, we find that there's a 98% chance that a document is French if "vous" appears in it. The formula to calculate that is quite simple; see Bayes' theorem.
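To make that concrete, here's a quick sketch of the arithmetic, using the made-up percentages above and assuming (for now) that the three languages are equally likely a priori. With these numbers the posterior comes out around 95%; the exact figure depends on the priors and counts you assume:

```javascript
// Made-up numbers from the text: how often "vous" appears in each
// language's training documents. Equal priors are assumed, so they
// cancel out of Bayes' theorem.
var likelihoods = { french: 1.0, spanish: 0.0, english: 0.05 };

function posterior(label, likelihoods) {
    var total = 0;
    for (var l in likelihoods) {
        total += likelihoods[l];
    }
    return likelihoods[label] / total;
}

posterior('french', likelihoods); // ≈ 0.95
```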

Of course, if the next phrase in the document is "said Jean Pierre, the French museum curator", we know there's a much smaller chance of the document being French. So we look at each word in turn and calculate the "probability that the document is (French|English|Spanish) given that this word is in it". We combine those individual probabilities and end up with an overall "probability that this document is French [given that all these words are in it]". If that probability is high enough, you can act upon it.

The Naive Bayes Classifier is so named because it assumes that each word in the document has nothing to do with the next word. That's a naive assumption. But it turns out that, while naive, it's actually a great simplifying assumption; studying words separately like this actually yields very good results. You could also study bigrams or trigrams (sets of two or three words at a time), at which point the classifier is no longer "naive" but it'll require a much larger amount of training data and storage space.
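For illustration, extracting bigrams or trigrams from a token list is just a sliding window. This helper is hypothetical (it's not part of the classifier we build below):

```javascript
// Slide a window of size n across the token list, joining each
// window into a single "token". n = 2 gives bigrams, n = 3 trigrams.
function ngrams(tokens, n) {
    var result = [];
    for (var i = 0; i + n <= tokens.length; i++) {
        result.push(tokens.slice(i, i + n).join(' '));
    }
    return result;
}

ngrams(['the', 'quick', 'brown', 'fox'], 2);
// ['the quick', 'quick brown', 'brown fox']
```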

The reason the NB classifier works well for document classification is that it de-correlates the number of times a word is seen in a given language from its statistical importance. The word "a" is found in many languages. Perhaps it even appears in 100% of your English training set. But that doesn't mean that documents that have it are English. We use Bayes to convert the "probability that 'a' appears in an English document" (which is 100%) to the "probability that this document is English because it has 'a' in it" (maybe 50%).

Therefore, the common stuff that's found everywhere is given a very weak significance and the stuff that's found more uniquely across a category is given a much stronger weight. The end result is a very smart, simple algorithm that has low error rates ("low" being a relative term). It's not magic, there's no neural network, there's no "intelligence", it's just math and probability.

As usual, let's just dive in. The first thing our document classifier needs to be able to do is train itself given a piece of text and a label for that text. Since I decided to build this classifier as generalized as possible, we won't hard code the labels or even put a cap on the number of labels allowed. (Though naive Bayes classifiers with only two possible labels, like spam/ham, *are* a little bit easier to build.)

```
Bayes.train = function (text, label) {
    registerLabel(label);
    var words = tokenize(text);
    var length = words.length;
    for (var i = 0; i < length; i++) {
        incrementStem(words[i], label);
    }
    incrementDocCount(label);
};
```

The `registerLabel` function simply adds the label to the "database" (in this case, localStorage) so that we can retrieve a list of labels later.

The `tokenize` function in this case is very simple. We'll look at more interesting tokenization techniques in another article (tokenization can be an important part of these classifiers), but this one is straightforward:

```
var tokenize = function (text) {
    return text.toLowerCase().replace(/\W/g, ' ').replace(/\s+/g, ' ').trim().split(' ').unique();
};
```

In this case, we use `unique()` because we're only interested in *whether* a word shows up in a document, not the number of times it shows up. In certain situations, you may get better results by considering the number of times a word appears in a document.
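Note that `unique()` isn't a built-in array method; the JSFiddle defines its own helper. A minimal version might look like this (an assumption on my part -- the fiddle's implementation may differ):

```javascript
// De-duplicate an array while preserving order. Works for our word
// tokens, since they stringify cleanly as object keys.
Array.prototype.unique = function () {
    var seen = {};
    return this.filter(function (item) {
        if (seen.hasOwnProperty(item)) {
            return false;
        }
        seen[item] = true;
        return true;
    });
};

['le', 'chat', 'le'].unique(); // ['le', 'chat']
```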

We then loop through each word (or "token") and call `incrementStem` on it. This function is also very simple: it just records the number of times a word was seen for a given label.

Finally, we call `incrementDocCount`, which records how many documents we've seen for a given label.

The end result of training is that we have a database that stores each label we've ever seen (in our example, it'll hold "english", "spanish", and "french"), stores the number of times a word has been seen for a label (eg, "le" was seen in French documents 30 times), and stores the total number of documents for each label (eg, we saw 40 French documents).
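To make the shape of that "database" concrete, here's a minimal in-memory sketch of the three training helpers. The names match the article, but the storage layout is assumed (the fiddle uses localStorage rather than a plain object):

```javascript
// Assumed storage layout: labels seen, per-label word counts,
// and per-label document counts.
var db = { labels: [], stems: {}, docCounts: {} };

// Remember every label we've ever seen
function registerLabel(label) {
    if (db.labels.indexOf(label) === -1) {
        db.labels.push(label);
    }
}

// Count the number of documents of a label that contained a word
function incrementStem(word, label) {
    var key = label + ':' + word;
    db.stems[key] = (db.stems[key] || 0) + 1;
}

// Count total documents seen per label
function incrementDocCount(label) {
    db.docCounts[label] = (db.docCounts[label] || 0) + 1;
}
```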

Training a naive Bayes classifier is dead simple and really fast, as demonstrated above. Guessing a label given a document is a little tougher, but writing the algorithm is easy for those who understand probability. If you don't understand probability, that's OK; you can spend some time reading up on naive Bayes classifiers, and you'll always have this example to come back to and study.

The first thing we'll do in our guessing function (other than initializing variables; see the JSFiddle for the minutiae) is a little bit of bookkeeping:

```
for (var j = 0; j < labels.length; j++) {
    var label = labels[j];
    docCounts[label] = docCount(label);
    docInverseCounts[label] = docInverseCount(label);
    totalDocCount += parseInt(docCounts[label], 10);
}
```

Our goal here is to set ourselves up for calculating certain probabilities later. To do this, we need to know the number of documents we've seen for a given label (docCounts), but we also need to know the number of documents *not* in that label (docInverseCounts). Finally, we need to know the total number of documents we've ever seen.

You could flip this function upside down and get docInverseCount simply by subtracting a label's docCount from the totalDocCount -- in fact, that approach is better and faster, but I did it with an explicit docInverseCount function because it reads a little easier.

Given the above information we can determine, for example, the probability that any arbitrary document would be French. We also know the probability that any document is NOT French.

```
for (var j = 0; j < labels.length; j++) {
    var label = labels[j];
    var logSum = 0;
    ...
```

Next, we look at each label. We set up a `logSum` variable, which will store the probability that the document belongs to this label's category.

```
for (var i = 0; i < length; i++) {
    var word = words[i];
    var _stemTotalCount = stemTotalCount(word);
    if (_stemTotalCount === 0) {
        continue;
    } else {
        var wordProbability = stemLabelCount(word, label) / docCounts[label];
        var wordInverseProbability = stemInverseLabelCount(word, label) / docInverseCounts[label];
        var wordicity = wordProbability / (wordProbability + wordInverseProbability);

        wordicity = ( (1 * 0.5) + (_stemTotalCount * wordicity) ) / ( 1 + _stemTotalCount );
        if (wordicity === 0) {
            wordicity = 0.01;
        } else if (wordicity === 1) {
            wordicity = 0.99;
        }
    }
    logSum += (Math.log(1 - wordicity) - Math.log(wordicity));
}
scores[label] = 1 / ( 1 + Math.exp(logSum) );
```

The above is the meat of the algorithm. For each label we're considering, we look at each word in the document. The _stemTotalCount variable holds the number of times we've seen that word in *any* document during training. If we've never seen this word before, just skip it! We don't have any information on it, so why use it?

`wordProbability` represents the "probability that this word shows up in a [French|English|Spanish] document". If you've seen 40 French documents, and 30 of them have the word "le" in them, this value is 0.75. `wordInverseProbability` is the probability that the word shows up in any category other than the one we're considering.

The funny `wordicity` variable is what happens when you apply Bayes' theorem to the two probabilities above. While `wordProbability` represents "the probability that [le] shows up in a [French] document", `wordicity` represents "the probability that this document is [French] given that [le] is in it". The distinction is subtle but very important. If you're having trouble understanding it at this point, I strongly recommend saying those two phrases out loud and making sure you understand the difference before moving on.

The wordicity line above also makes the assumption that English, French, and Spanish documents are all equally common and starting off on the same footing. This assumption makes the calculation a little simpler, but you can consider the *a priori* probabilities of each language if you'd like. A note on that later.

The line below the wordicity definition is an optional *adjustment* for words that we've only seen in training a few times. If you've only seen a word once, for instance, you don't really have enough information about that word to determine if it's really French or Spanish. So we make a weighted adjustment: we bring the wordicity closer to 50% if we haven't seen it too many times. The "0.5" in that equation is the value we should try to adjust towards, and the "1"s in the equation are the weight -- if you increase this value, the wordicity will remain close to 0.5 longer. If you have a large training set, you can make the weight 5 or 10 or 20 or 50 (depending on how big your training set is). Since we have a very small training set, I made this value 1 but realistically I should have just omitted the line completely (my training set is only 15 paragraphs). I just wanted to show you that adjusting for rare words is something that you can do.

Below that, we avoid letting wordicity be either 0 or 1 since we're about to use a log function on our data, and either of those values would kind of mess up the results.

The logSum line isn't really a part of the mathematical equations, but is rather a practical consideration. After calculating the wordicity for each word, we need to combine those probabilities somehow. The normal mathematical way to do that would be to multiply each probability together and divide by the multiplication of all the inverses. Unfortunately, floating point math isn't perfect and you can run into "floating point underflow", where the number gets too small for floating point math to deal with. So instead, we take a log of the numerator and denominator (combined probabilities and their inverses), and add up the logs.
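You can see the underflow problem directly: multiplying even a handful of small probabilities hits zero, while the sum of logs stays perfectly representable. A quick demonstration (the probability value here is made up):

```javascript
var p = 1e-20;   // a small per-word probability
var product = 1;
var logSum = 0;
for (var i = 0; i < 20; i++) {
    product *= p;           // underflows: 1e-400 is below the double range
    logSum += Math.log(p);  // stays around -921, no trouble at all
}
// product === 0, while logSum remains usable
```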

Finally, after we've combined all the individual word probabilities with the logSum line, we undo the log function we just used to get the probability back in the 0 to 1 range. Note that this happens outside of the "look at each word" loop but still inside the "look at each label" loop.

One thing I'd like to point out is that the above makes a big simplifying assumption: we've assumed that English, French, and Spanish documents are all *equally likely* to appear. In our example, this is a good assumption since you guys are probably going to test one of each later, but in the real world this isn't necessarily true.

The Bayes classification algorithm does actually let you consider the *a priori* probability of a document's language (meaning, the probability that a document is English just based on the number of English documents out there, before considering the actual contents of the document), but I've simply left this out. It's not too hard to put in; adding just a few more terms to the wordicity calculation can do this. The next Bayes article I write will use the full form of Bayes theorem.

Finally, please note that I was *really lazy* while training this algorithm. You can see from the JSFiddle that I've only used 5 paragraphs from each language to train the thing on. That's not nearly enough to go by, as the words seen during training are the only words it knows. I've found that this example *does* work well if you type in sentences or paragraphs (try just copy/pasting stuff from news sites), but simple nouns and phrases probably won't work. For example, you'll get the wrong result for "la tortuga" ("the turtle" in Spanish) simply because we never showed it the word "tortuga" before. The algorithm will guess French in this case because it's seen slightly more "la"s in French than it's seen in Spanish. A larger training set would fix this issue.

We're basically done. All we have to do now is either report all the labels' probabilities or just pluck out the highest one. Try pasting some English, French, or Spanish news text in the JSFiddle below -- you should see the guessed language and the probability that led us to guess that language. This example works better with sentences or paragraphs; the more words you give it to guess by, the better the chance that it has seen one of those words in its limited training set.

So far, I've had 100% accuracy when copying and pasting sentences from news sites. Try it below and see for yourself!


This article is part of the Machine Learning in Javascript series. The series covers some of the essential machine learning algorithms and assumes little background knowledge. There's also a mailing list at the bottom of the page if you want to know about new articles; you can also follow me on twitter: @bkanber.

Are you just looking for the code example? Scroll down!

Today we're going to figure out how to find clusters of data points. Let's say you work at a medical imaging devices company. Imagine you already have a way to identify malignant cells from an image scan, but it would be great to automatically identify the centers of clusters of cells as well. Then a robot could go in with surgical precision and remove the problem!

What we're looking for is a clustering algorithm; today we're going to talk specifically about the k-means algorithm.

Clustering algorithms, in general, find groups of similar pieces of data. If you run an online store you might use a clustering algorithm to identify different shopper types. You may find that you have one type of visitor that just window shops through 3-5 pages of products and leaves. Another group might make meticulous purchasing decisions by looking through 15 pages of products and reviews and end up making only one, high-value purchase. And you may also identify the impulse buyer, who makes numerous small purchases without browsing too deeply. Once you've identified your e-shopper demographics, you're better able to optimize your site to increase sales. You can release features that appeal to your impulse buyers, because now you know that you *have* impulse buyers!

And while that's just one practical example of k-means, you'll find this algorithm used in multiple fields. Sometimes it's just image processing in 2 dimensions, other times it's processing huge data across dozens of dimensions and parameters. Like our k-nearest-neighbor algorithm, k-means is versatile, simple to understand and implement, and sneakily powerful.

Like k-nearest-neighbor, the "k" in k-means gives away that there's going to be some number that we're going to have to feed to our algorithm. Specifically, "k" is the number of clusters we're going to find in our data. Unfortunately, it's rarely possible to know the number of clusters before you solve the problem, so k-means is usually supplemented by another algorithm that first helps you find the best value of *k*.

The issue is this: the k-means algorithm will partition your data into "k" distinct clusters, but it does not tell you if that's the *correct* number of clusters. Your data might naturally have 5 different clusters in it, but if you feed k-means the number 3 you'll get 3 clusters back. Those clusters will be bigger, looser and more awkwardly shaped than if you had told it to find 5 clusters.

The long and short of it is this: in order to use k-means you either need to know how many clusters you're looking for at the outset, *or* you have to use a second algorithm to also guess the number of clusters. K-means just organizes your points into clusters; you need to do something else to figure out the right number of clusters.

For today we'll contrive a situation and use three clusters from the outset. Next time (in k-means part 2) we'll look at a technique you can use to automatically guess the value of "k". Most often, these algorithms rely on some kind of error analysis and multiple passes of the k-means algorithm in order to optimize for the solution with the smallest error value.

The k-means algorithm is simple but can become very powerful if you're using it on a dataset with many dimensions. Today we're going to work in 2 dimensions, but next time we'll do something more complicated. Here's what the algorithm looks like:

- Plot your data points
- Create "k" *additional* points, placing them randomly on your graph. These points are the "cluster centroids" -- the candidates for the centers of your clusters.
- Repeat the following:
  - "Assign" each data point to the cluster centroid closest to it
  - Move each centroid to the average position of all the data points assigned to it
- If any of the centroids moved in the last step, repeat. If nothing moved, exit.

It's that simple! As you can see, this is an iterative process. It may take 2 or 3 or dozens of iterations, but eventually your cluster centroids should converge to their solutions and stop moving. You then take the final tally of the assignments and then you have your clusters.

```
var data = [
    [1, 2],
    [2, 1],
    [2, 4],
    [1, 3],
    [2, 2],
    [3, 1],
    [1, 1],
    [7, 3],
    [8, 2],
    [6, 4],
    [7, 4],
    [8, 1],
    [9, 2],
    [10, 8],
    [9, 10],
    [7, 8],
    [7, 9],
    [8, 11],
    [9, 9]
];
```

Next, we define two functions that are helpful to us, but not essential. Given a list of points, I'd like to know what the max and min values for each dimension are, and what the range of each dimension is. I want to know "X ranges from 1 to 11, and Y ranges from 3 to 7". Knowing these figures helps us draw the graph on the canvas, and also helps when we initialize our random cluster centers (we'd like them to be within the range of the data points when we start them out).
Keeping in mind that we're writing this to be generic with regard to the number of dimensions it can handle:
```
function getDataRanges(extremes) {
    var ranges = [];
    for (var dimension in extremes) {
        ranges[dimension] = extremes[dimension].max - extremes[dimension].min;
    }
    return ranges;
}

function getDataExtremes(points) {
    var extremes = [];
    for (var i in points) {
        var point = points[i];
        for (var dimension in point) {
            if ( ! extremes[dimension] ) {
                extremes[dimension] = {min: 1000, max: 0};
            }
            if (point[dimension] < extremes[dimension].min) {
                extremes[dimension].min = point[dimension];
            }
            if (point[dimension] > extremes[dimension].max) {
                extremes[dimension].max = point[dimension];
            }
        }
    }
    return extremes;
}
```

The `getDataExtremes()` function loops through all the points and every dimension of each point, finding the min and max values (note the hard-coded "1000" in there, which you should change if you're using large numbers; the original body also referenced the global `data` instead of its `points` parameter, which is fixed above). The `getDataRanges()` function is just a helper that takes that output and returns the range of each dimension (the maximum value minus the minimum value).

Next up, we define a function that initializes our randomly placed cluster centroids (the "means"):

```
function initMeans(k) {
    if ( ! k ) {
        k = 3;
    }
    while (k--) {
        var mean = [];
        for (var dimension in dataExtremes) {
            mean[dimension] = dataExtremes[dimension].min + ( Math.random() * dataRange[dimension] );
        }
        means.push(mean);
    }
    return means;
}
```

We're just creating new points with random coordinates within the range and dimensions of our dataset.
Once we have our randomly seeded centroids, we enter the k-means loop. As a reminder, the loop consists of first assigning each data point to the centroid closest to it, then moving each centroid to the average position of the data points assigned to it. We repeat until the centroids stop moving.
```
function makeAssignments() {
    for (var i in data) {
        var point = data[i];
        var distances = [];
        for (var j in means) {
            var mean = means[j];
            var sum = 0;
            for (var dimension in point) {
                var difference = point[dimension] - mean[dimension];
                difference *= difference;
                sum += difference;
            }
            distances[j] = Math.sqrt(sum);
        }
        assignments[i] = distances.indexOf( Math.min.apply(null, distances) );
    }
}
```

The above function is called by our "loop" function and calculates the Euclidean distance between each point and each cluster center.

Note that the above algorithm loops through each point and then through each cluster centroid, making this an O(k*n) algorithm. It's not terrible, but it might be computationally intensive if you have a large number of data points, a large number of clusters, or both. There are ways you can optimize this, which we'll perhaps discuss in a future article. For one, we can eliminate the expensive `Math.sqrt()` call; we could also try not to iterate through every point.
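One of those optimizations is nearly free: since the square root is monotonically increasing, comparing *squared* distances picks the same nearest centroid without ever calling `Math.sqrt()`. A sketch:

```javascript
// Squared Euclidean distance: same argmin as the true distance,
// but without the sqrt call in the inner loop.
function squaredDistance(point, mean) {
    var sum = 0;
    for (var dimension = 0; dimension < point.length; dimension++) {
        var difference = point[dimension] - mean[dimension];
        sum += difference * difference;
    }
    return sum;
}

squaredDistance([1, 2], [4, 6]); // 25 (the true distance is 5)
```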
Once we have our list of assignments -- in this case, just an associative array of `point index => center index` -- we can go ahead and update the positions of the means (the cluster centers).
```
function moveMeans() {
    makeAssignments();
    var sums = Array( means.length );
    var counts = Array( means.length );
    var moved = false;

    // Zero out the sums and counts for each mean
    for (var j in means) {
        counts[j] = 0;
        sums[j] = Array( means[j].length );
        for (var dimension in means[j]) {
            sums[j][dimension] = 0;
        }
    }

    // Tally each point's coordinates into its assigned mean's sums
    for (var point_index in assignments) {
        var mean_index = assignments[point_index];
        var point = data[point_index];
        var mean = means[mean_index];
        counts[mean_index]++;
        for (var dimension in mean) {
            sums[mean_index][dimension] += point[dimension];
        }
    }

    // Divide sums by counts to get averages; randomly re-seed any
    // mean that ended up with no points assigned to it
    for (var mean_index in sums) {
        if ( 0 === counts[mean_index] ) {
            sums[mean_index] = means[mean_index];
            for (var dimension in dataExtremes) {
                sums[mean_index][dimension] = dataExtremes[dimension].min + ( Math.random() * dataRange[dimension] );
            }
            continue;
        }
        for (var dimension in sums[mean_index]) {
            sums[mean_index][dimension] /= counts[mean_index];
        }
    }

    if (means.toString() !== sums.toString()) {
        moved = true;
    }
    means = sums;
    return moved;
}
```

The `moveMeans()` function starts by calling `makeAssignments()`. Once we have our assignments, we initialize two arrays: one called "sums" and the other called "counts". Since we're calculating the arithmetic mean (or average), we need to know the sum of the points' dimensions as well as the number of points whose dimensions we're averaging.

We then hit three loops: first we loop through our means and zero out the `sums` and `counts` arrays; next we loop through the assignments, adding each point's coordinates to the sums of its assigned mean; finally we divide each sum by its count to get the average position, randomly re-seeding any mean that ended up with no points. Last, we need the `setup()` and `run()` functions that drive the whole thing:

```
function setup() {
    canvas = document.getElementById('canvas');
    ctx = canvas.getContext('2d');
    dataExtremes = getDataExtremes(data);
    dataRange = getDataRanges(dataExtremes);
    means = initMeans(3);
    makeAssignments();
    draw();
    setTimeout(run, drawDelay);
}

function run() {
    var moved = moveMeans();
    draw();
    if (moved) {
        setTimeout(run, drawDelay);
    }
}
```

`setup()` initializes everything we need, and then our `run()` function checks whether the algorithm has stopped, looping on a timer so that we can watch the algorithm do its work in a reasonable timeframe.

You're a scientist who has recently been framed for murder by an evil company. Before you flee the lab, you have an opportunity to steal 1,000 pounds (or kilograms!) of pure elements from the chemical warehouse; your plan is to later sell them and survive off of the earnings.

Given the weight and value of each element, which combination should you take to maximize the total value without exceeding the weight limit?

This is called the knapsack problem. The one above is a one-dimensional problem, meaning the only constraint is weight. We could complicate matters by also considering volume, but we need to start somewhere. Note that in our version of the problem only one piece of each element is available, and each piece has a fixed weight. There are some knapsack problems where you can take unlimited platinum or up to 3 pieces of gold or something like that, but here we only have one of each available to us.

Why is this problem tough to solve? We'll be using 118 elements. The brute-force approach would require that we test 2^118, or about 3.3 × 10^35, different combinations of elements.

A quick benchmark we'll use for our solution is called the "greedy" solution. The greedy algorithm grabs the most valued items and puts them into the knapsack until it can't fit any more.

Sometimes this works great. Sometimes it doesn't. Imagine that there's a piece of gold in the warehouse that's valued at $1,000 but weighs 600 pounds. And there's also a piece of cadmium that has a value of $950 but only weighs 300 pounds, and there are a bunch of other elements that have a pretty high value but reasonably light weights. The greedy algorithm still tosses gold in there, and all that precious available weight is taken up by something that's not really worth it.

The "naive" greedy algorithm for our dataset will give us a total value of $3649 with a total weight of 998 pounds.

You may at this point be thinking "why don't we just figure out value *per pound* for each element and use that?" Sure, that works too! It will, in fact, work way better than the above.

Using that approach, the "weighted" greedy algorithm gives us a total value of $4901 and a total weight of 969.

So those are our numbers to beat: we should expect to handily beat $3,649, and we'll be happy if we also beat $4,901.
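For reference, the weighted greedy strategy is only a few lines: sort by value-per-pound, then take items while they still fit. This is a sketch with a hypothetical item list; the article's real dataset has 118 elements:

```javascript
// Weighted greedy knapsack: highest value-per-pound first.
function weightedGreedy(items, maxWeight) {
    var sorted = items.slice().sort(function (a, b) {
        return (b.value / b.weight) - (a.value / a.weight);
    });
    var taken = [], weight = 0, value = 0;
    for (var i = 0; i < sorted.length; i++) {
        if (weight + sorted[i].weight <= maxWeight) {
            weight += sorted[i].weight;
            value += sorted[i].value;
            taken.push(sorted[i].name);
        }
    }
    return { taken: taken, weight: weight, value: value };
}

weightedGreedy([
    { name: 'Gold',    weight: 600, value: 1000 },
    { name: 'Cadmium', weight: 300, value: 950  },
    { name: 'Boron',   weight: 12,  value: 174  }
], 1000);
```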

Why does the greedy algorithm work well for this type of problem? Because the greedy algorithm is solving for the "highest value per unit weight" and that's very close to what we want to fill our knapsack with. However, the greedy algorithm will not perform well in instances where there's a large range of weights and values. That is, the greedy algorithm will perform better if the range of values (and/or weights) of the elements is between $1 - $100, but will perform worse if the range is $1 - $500.

Respond in the comments: why would the greedy algorithm perform worse with a larger range of weights and values?

Our GA might lose to the greedy algorithm from time to time, but that's ok. The GA will continue to perform well as complexity increases, but the greedy algorithm will not.

So we're still tackling a pretty simplified and contrived problem here, but it's certainly more complex and useful than our "Hello, World!" from the last article. Let's get started.

There are some major differences between our problem and the previous "Hello, World!":

- We can't use elements (genes) more than once, while "Hello, World!" can (must!) use letters more than once.
- In the "Hello, World!" example, we knew that the string needed to be 13 characters long. Here we don't know how many elements to take.
- We don't know the highest possible value ("fitness" or "score") for this problem. It could be $4,901 like the greedy algorithm found, or it could be $10,000, or $23,304.

The "Hello, World!" algorithm represented the chromosome as a string. Mutating involved randomly changing a letter. Mating (or crossover, as it's really called) involved splicing two strings together at the halfway point. We need to do things a little differently here.

It turns out that representing our solution requires a little more finesse than "Hello, World!". Since we don't know how many elements to choose, we can't use a fixed length string. (Just the first of our problems!)

Instead, I propose we use a "bitmask" of sorts. We don't have to use an actual bitmask, but my proposal is to use a representation of *all* of the available elements, marking each one as "present" or not.

Our chromosome could look like this:

Helium: present Hydrogen: not present Lithium: not present

And et cetera for all 118 elements. Or if you want to go the bitmask route:

10000011000001000100000010000010010010000

Where each bit represents a single element and the value of the bit indicates whether that element is in the knapsack or not.
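Decoding such a bit string back into a list of elements is straightforward. A hypothetical sketch with just five elements:

```javascript
var elements = ['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron'];

// Bit i of the mask says whether element i is in the knapsack.
function packed(bits, elements) {
    var result = [];
    for (var i = 0; i < bits.length; i++) {
        if (bits.charAt(i) === '1') {
            result.push(elements[i]);
        }
    }
    return result;
}

packed('10010', elements); // ['Hydrogen', 'Beryllium']
```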

Additionally, if we were to allow more than one of each element, the representation could look like so:

Helium: 0 Hydrogen: 4 Lithium: 2

What would *not* work well is the following:

In Knapsack: Helium, Lithium, Lead, Tin

The above makes mating more difficult. You can make it work, but it feels like you'd be jumping through hoops to pull it off. Structure feels better for this problem.

We have a specific difficulty with mating: we need to make sure that even after mating and mutation we still only have at most one of each element in the list. Using the bitmask approach will help us to that end, but that's a common pitfall when trying the list approach above.

The difficulty of making sure something happens only once in a chromosome is a common one. If you're familiar with the traveling salesman problem, it's easy to imagine a scenario where you mate two solutions and end up visiting the same city twice. That city appeared in the first half of the first parent and the second half of the second parent -- therefore appearing nowhere in one child but twice in the other child.

For this problem we're going to keep track of three properties of each chromosome: weight, value, and *score*.

Score is the same thing as value, with one difference: score accounts for the chromosome being overweight.

You may be tempted to throw out overweight chromosomes completely. It's a natural instinct, because overweight solutions are not acceptable solutions! But there's a good, practical reason we don't want to throw out overweight chromosomes: there will be some slightly overweight (1,001 pounds) chromosomes that have very high values and just need to be "tweaked" a little to bring them within the weight range.

There could be a lot of potential in some of the overweight chromosomes. Rather than killing them, we'll penalize them just enough that they still get to reproduce, but are unlikely to be the #1 pick. That's what we'll use "score" for. If you're underweight, then your score is just your total value. If you're overweight, however, we'll penalize you 50 points for every pound over weight. Feel free to play with this number.

Evolutionarily, this "encourages" promising chromosomes to drop some weight. All they need is a little tweaking. There's no use in throwing out a potentially strong candidate!
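The scoring rule described above can be sketched in a few lines; the 50-points-per-pound penalty matches the text, but treat this standalone helper as illustrative rather than the article's actual implementation:

```javascript
// Score equals value; overweight chromosomes lose 50 points for
// every pound over the limit instead of being discarded outright.
function score(value, weight, maxWeight) {
    var s = value;
    if (weight > maxWeight) {
        s -= (weight - maxWeight) * 50;
    }
    return s;
}

score(4000, 900, 1000);  // 4000: under the limit, no penalty
score(4000, 1001, 1000); // 3950: one pound over costs 50 points
```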

We'll cover the specifics of mating and mutation and the importance of death (really called "elitism") as we're looking at the code.

First, let's take a look at the data set. I wrote a simple PHP script that generates random weights and values (ranged 1 - 500) for each element and outputs the set as JSON. It looks something like this:

```
"Hydrogen":{
"weight":389,
"value":400},
"Helium":{
"weight":309,
"value":380},
"Lithium":{
"weight":339,
"value":424},
"Beryllium":{
"weight":405,
"value":387},
"Boron":{
"weight":12,
"value":174},
```

And so on.

We then define three quick and easy helper functions:

```
function length(obj) {
    var length = 0;
    for (var i in obj) {
        length++;
    }
    return length;
}

function clone(obj) {
    return JSON.parse(JSON.stringify(obj));
}

function pickRandomProperty(obj) {
    var result;
    var count = 0;
    for (var prop in obj) {
        if (Math.random() < 1 / ++count) {
            result = prop;
        }
    }
    return result;
}
```

The `length` property only exists for JavaScript arrays, so we create a `length()` function that works for plain objects.

We create a clone function that ensures our element objects aren't passed by reference.

Finally, we create a function that picks a random property of an object. This is an analog to PHP's 'array_rand' function, which returns a random array key. (Each property replaces the current pick with probability 1/count -- a single-pass "reservoir sampling" trick that yields a uniformly random key without knowing the object's size in advance.)

```
var Chromosome = function(members) {
    this.members = members;
    for (var element in this.members)
    {
        if (typeof this.members[element]['active'] == 'undefined')
        {
            this.members[element]['active'] = Math.round( Math.random() );
        }
    }
    this.mutate();
    this.calcScore();
};
Chromosome.prototype.weight = 0;
Chromosome.prototype.value = 0;
Chromosome.prototype.members = [];
Chromosome.prototype.maxWeight = 1000;
Chromosome.prototype.mutationRate = 0.7;
Chromosome.prototype.score = 0;
```

The chromosome constructor takes an object of 'members'. In this case, we'll either be passing our original list of elements data when we're creating a brand new chromosome, *or* we'll be passing in the results of a mating operation.

The constructor randomly activates elements if the 'active' property isn't yet defined. The end result is that this will create a random chromosome if we're creating one from scratch, and it'll leave a pre-configured chromosome alone.

The prototype also specifies some defaults. The mutationRate property is the chance that a chromosome will mutate.

```
Chromosome.prototype.mutate = function() {
    if (Math.random() > this.mutationRate)
        return false;
    var element = pickRandomProperty(this.members);
    this.members[element]['active'] = Number( ! this.members[element]['active'] );
};
```

The mutate method is the closest carryover from the "Hello, World!" example. If the chromosome is to mutate, we simply pick an element at random and toggle its 'active' property. I cast to Number here; it would have been more semantic to cast the Math.random() in the constructor to Boolean instead, but I'll let that slide since I've already pasted all the code into this post.

```
Chromosome.prototype.calcScore = function() {
    if (this.score)
        return this.score;
    this.value = 0;
    this.weight = 0;
    this.score = 0;
    for (var element in this.members)
    {
        if (this.members[element]['active'])
        {
            this.value += this.members[element]['value'];
            this.weight += this.members[element]['weight'];
        }
    }
    this.score = this.value;
    if (this.weight > this.maxWeight)
    {
        this.score -= (this.weight - this.maxWeight) * 50;
    }
    return this.score;
};
```

The calcScore method starts with a tiny performance optimization: if we've already calculated the score, just serve the cached one. This saves us from worrying about exactly where in the chromosome's life cycle the score gets calculated.

We then look through the elements and add up the value and weights for the active ones. We then apply a penalty of 50 points per overweight pound.

```
Chromosome.prototype.mateWith = function(other) {
    var child1 = {};
    var child2 = {};
    var pivot = Math.round( Math.random() * (length(this.members) - 1) );
    var i = 0;
    // Iterate this chromosome's own members, not the global elements list.
    for (var element in this.members)
    {
        if (i < pivot)
        {
            child1[element] = clone(this.members[element]);
            child2[element] = clone(other.members[element]);
        }
        else
        {
            child2[element] = clone(this.members[element]);
            child1[element] = clone(other.members[element]);
        }
        i++;
    }
    child1 = new Chromosome(child1);
    child2 = new Chromosome(child2);
    return [child1, child2];
};
```

In the "Hello, World!" example we picked the center point as the pivot point when mating two chromosomes; in this example we pick a random point instead.

This adds a little more randomness to the system and can help avoid local optima.

Once we've picked our pivot point we create two children by splicing the parents at the pivot and combining. We then use our chromosome constructor to generate chromosome objects and return them.

```
var Population = function(elements, size)
{
    if ( ! size )
        size = 20;
    this.elements = elements;
    this.size = size;
    // Give each population its own array; otherwise every instance
    // would share (and push onto) the single prototype array below.
    this.chromosomes = [];
    this.fill();
};
Population.prototype.elitism = 0.2;
Population.prototype.chromosomes = [];
Population.prototype.size = 100;
Population.prototype.elements = false;
```

The population constructor is straightforward: we give it the master list of elements and the desired population size. We also define the 'elitism' parameter; this is the percentage of chromosomes that will survive from one generation to the next.

```
Population.prototype.fill = function() {
    while (this.chromosomes.length < this.size)
    {
        if (this.chromosomes.length < this.size / 3)
        {
            this.chromosomes.push( new Chromosome( clone(this.elements) ) );
        }
        else
        {
            this.mate();
        }
    }
};
```

We use the fill method to initialize the population; we'll also use it to fill the population after killing the weakest chromosomes. A little bit of logic determines whether we should create random chromosomes or fill the population through mating instead. With a population size of 20, roughly the first third (seven chromosomes) will be random and the rest will be generated by mating. If the population ever dips below a third of its target size (perhaps due to death/elitism), new random chromosomes will be created until there's enough diversity to make babies through mating.

Yes, re-reading 'this.chromosomes.length' on every pass of a while loop is bad form. If you expect to use a large population size -- or want this to be highly optimized -- tidy this up; just remember that the array grows inside the loop, so you'd track the count in a local variable as you push rather than caching the length once up front.

```
Population.prototype.sort = function() {
    this.chromosomes.sort(function(a, b) { return b.calcScore() - a.calcScore(); });
};
Population.prototype.kill = function() {
    var target = Math.floor( this.elitism * this.chromosomes.length );
    while (this.chromosomes.length > target)
    {
        this.chromosomes.pop();
    }
};
```

The sort function above is just a helper; note that we use the calcScore method instead of accessing the 'score' property directly. If the score hasn't been calculated by this point, it will be now; if the score was already calculated we just use calcScore as an accessor.

After sorting, the kill method removes the weakest chromosomes from the bottom of the list by popping them until we reach our elitism value.

```
Population.prototype.mate = function() {
    var key1 = pickRandomProperty(this.chromosomes);
    var key2 = key1;
    while (key2 == key1)
    {
        key2 = pickRandomProperty(this.chromosomes);
    }
    var children = this.chromosomes[key1].mateWith(this.chromosomes[key2]);
    this.chromosomes = this.chromosomes.concat(children);
};
```

The mate method is always called after the kill method, so the only chromosomes allowed to reproduce are the elite ones in the population (in our example, the best 20%). Rather than mating only the best two chromosomes (like we did in "Hello, World!"), we pick any two random chromosomes to mate -- with the exception that we won't mate a chromosome with itself.

Again, this serves to add a little more randomness to the system, and will avoid stasis if the top two chromosomes remain the same for many generations -- we see that happen sometimes in the "Hello, World!" example and fix it here.

```
Population.prototype.generation = function() {
    this.sort();
    this.kill();
    this.mate();
    this.fill();
    this.sort();
};
```

Then we define a "generation". The generation starts by sorting the chromosomes in terms of score. We then kill the weakest members.

Then we do something a little intriguing: we call mate() once and then call fill(), which we know will also call mate(). The reason we call mate() explicitly is as a bit of insurance: if the elitism parameter is less than 0.3 we want to mate at least once before potentially "polluting" the population with random members. This really depends on the elitism value; if you keep it over 0.3 you don't have to call mate() explicitly, because fill() will do that for you. But if you have elitism = 0.2 like we do, then we want to run at least one mating routine that involves only the elite, and not the new random chromosomes we introduce with fill().

Finally, we sort once more at the end of the generation. We could just as easily leave this part out, but it's nice to see the chromosomes in order after every generation if you're debugging.

I briefly introduced this idea in the "Hello, World!" example: we don't know the best possible score in this problem, therefore we won't know when to stop.

The technique (one of many) that we'll use is to stop when you've had 100 (or 1,000 or 1,000,000) generations with no improvement. We'll just call this the "threshold" or the "stop threshold".

```
Population.prototype.run = function(threshold, noImprovement, lastScore, i) {
    if ( ! threshold )
        threshold = 1000;
    if ( ! noImprovement )
        noImprovement = 0;
    if ( ! lastScore )
        lastScore = false;
    if ( ! i )
        i = 0;
    if (noImprovement < threshold)
    {
        lastScore = this.chromosomes[0].calcScore();
        this.generation();
        if (lastScore >= this.chromosomes[0].calcScore())
        {
            noImprovement++;
        }
        else
        {
            noImprovement = 0;
        }
        i++;
        if (i % 10 == 0)
            this.display(i, noImprovement);
        var scope = this;
        setTimeout(function() { scope.run(threshold, noImprovement, lastScore, i); }, 1);
        return false;
    }
    this.display(i, noImprovement);
};
```

The run method is recursive. It doesn't need to be. The only reason it is in this example is that writing to the DOM in the middle of a fast-moving loop doesn't work -- the browser doesn't repaint until execution is done. It's just a JavaScript thing.

To get around the DOM limitation, we use a short setTimeout and have the run method call itself recursively until we're done. In general, however, this function could just use a while loop instead of calling itself -- but in that case you'd either need to use console.log or wait until the loop is done to see the results.
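That while-loop version can be sketched like this (hedged: `runUntilStale` is a standalone helper, not part of the Population class; it takes a `step` callback that runs one generation and returns the best score afterward):

```
// A non-recursive sketch of the stop-threshold loop. "step" runs one
// generation and returns the best score after it; we stop once we've
// seen `threshold` generations in a row with no new best score.
function runUntilStale(step, threshold) {
    var noImprovement = 0;
    var best = -Infinity;
    var generations = 0;
    while (noImprovement < threshold) {
        var score = step();
        generations++;
        if (score > best) {
            best = score;
            noImprovement = 0;   // progress! reset the staleness counter
        } else {
            noImprovement++;
        }
    }
    return { best: best, generations: generations };
}
```

With a real population you'd call it as `runUntilStale(function() { population.generation(); return population.chromosomes[0].calcScore(); }, 1000)` and log progress with console.log instead of writing to the DOM.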

Other than that DOM messiness, the run method is straightforward. We compare last generation's best score to this generation's best. If there was improvement, then we reset the 'noImprovement' counter. If we have noImprovement equal to our stop threshold, we stop.

Not shown is a simple display method used to print the results to a table on the page. We only call it every 10 generations, and then once again when we're done.

Our JSFiddle is below. You're given a slider that lets you set the stop tolerance and a big "Go" button. You then get to watch the population's evolution.

Even with a 10-generation stop threshold, the GA consistently beats the naive greedy algorithm, though this is to be expected.

At a 50-generation stop threshold, the GA also consistently beats the weighted greedy algorithm. This is a happy result. There are some datasets where the GA will never beat the greedy algorithm, and other datasets where the greedy algorithm doesn't perform well at all. This seems like an in-between. We don't *crush* the greedy algorithm, but we do outperform it by a significant margin.

The best score I've observed so far is 5944, with a weight of 992. Please let me know in the comments below if you find a better score, and I'll post the updates here.

*Update: *Sylvain Zimmer below found a solution with a score of 5968 and weight 998.

As always, feel free to fork and play with this example. I would encourage you to experiment with different values of population size, elitism, and mutation rate and observe the results.

Just looking for the example code? The JSFiddle is at the bottom of the page.

Today we're going to look at the k-nearest-neighbor algorithm (we'll abbreviate it kNN throughout this article). I love this algorithm because it's dead simple, but can still solve some exciting problems. Its strength doesn't lie in mathematical or theoretical sophistication, but rather in the fact that it can elegantly handle lots of different input parameters ("dimensions").

k-nearest-neighbor also serves an ulterior motive of mine: it's a great way to introduce the concept of "supervised learning". So let's go ahead and build the k-nearest-neighbor algorithm. And we'll graph our results, because I love visuals!

There are two giant umbrella categories within Machine Learning: supervised and unsupervised learning. Briefly, *un*supervised learning is like data mining -- look through some data and see what you can pull out of it. Typically, you don't have much additional information before you start the process. We'll be looking at an unsupervised learning algorithm called k-means next week, so we can discuss that more in the next article.

Supervised learning, on the other hand, starts with "training data". Supervised learning is what we do as children as we learn about the world. We're in the kitchen with mom. Mom shows us an apple and says the word "apple". You see an object and your mother has labeled it for you.

The next day, she shows you a different apple. It's smaller, it's less red, and it has a slightly different shape. But your mother shows it to you and says the word "apple". This process repeats for a couple of weeks. Every day your mother shows you a slightly different apple, and tells you they're apples. Through this process you come to understand what an apple is.

Not every apple is exactly the same. Had you only ever seen one apple in your life, you might assume that every apple is *identical*. But instead, your mother has trained you to recognize the overall features of an apple. You're now able to create this category, or label, of objects in your mind. When you see an apple in the future you can recognize it as an apple because you've come to understand that while all apples share some features, they don't have to be identical to still be apples.

This is called "generalization" and is a very important concept in supervised learning algorithms. We would be useless if we couldn't recognize that an iPhone was an iPhone because it had a different case, or because it had a scratch across the screen.

When we build certain types of ML algorithms we therefore need to be aware of this idea of generalization. Our algorithms should be able to generalize but not *over*-generalize (easier said than done!). We want "this is red and kind of round and waxy, it must be an apple" and *not *"this is red and round, it must be a ball; this other thing is orange and round, it must be a ball." That's overgeneralization and can be a problem. Of course, under-generalization is a problem too. This is one of the main difficulties with ML: being able to find the generalization sweet spot. There are some tests you can run as you're training an algorithm to help you find the sweet spot, but we'll talk about those in the future when we get to more advanced algorithms.

Many supervised learning problems are "classification" problems. The classification problem goes like this: there's a bucket of apples, oranges, and pears. Each piece of fruit has a sticker that tells you which fruit it is -- except one! Figure out which fruit the mystery fruit is by learning from the other fruits you're given.

The classification problem is usually very easy for humans, but tough for computers. kNN is one type of many different classification algorithms.

This is a great time to introduce another important aspect of ML: features. Features are what you break an object down into as you're processing it for ML. If you're looking to determine whether an object is an apple or an orange, for instance, you may want to look at the following features: shape, size, color, waxiness, surface texture, etc. It also turns out that sometimes an individual feature ends up not being helpful. Apples and oranges are roughly the same size, so that feature can probably be ignored and save you some computational overhead. In this case, size doesn't really add new information to the mix.

Knowing what features to look for is an important skill when designing for ML algorithms. Sometimes you can get by on intuition, but most of the time you'll want to use a separate algorithm to determine which are the most important features of a data set (we'll talk about that in a much more advanced article).

As you can imagine, features aren't always as straightforward as "color, size, shape". Processing documents is a tricky example. In some scenarios, each *word* in a document is an individual feature. Or maybe each pair of consecutive words ("bigrams") is a feature. We'll talk about document classification in a future article as well.

Given the number of rooms and area (in square feet) of a type of dwelling, figure out if it's an apartment, house, or flat.

As always, we're starting with the most contrived possible problem in order to learn the basics. The description of this problem has *given *us the features we need to look at: number of rooms, and square feet. We can also assume that, since this is a supervised learning problem, we'll be given a handful of example apartments, houses, and flats.

I think the best way to teach the kNN algorithm is to simply define what the phrase "k-nearest-neighbor" actually means.

Here's a table of the example data we're given for this problem:

Rooms | Area | Type
---|---|---
1 | 350 | apartment
2 | 300 | apartment
3 | 300 | apartment
4 | 250 | apartment
4 | 500 | apartment
4 | 400 | apartment
5 | 450 | apartment
7 | 850 | house
7 | 900 | house
7 | 1200 | house
8 | 1500 | house
9 | 1300 | house
8 | 1240 | house
10 | 1700 | house
9 | 1000 | house
1 | 800 | flat
3 | 900 | flat
2 | 700 | flat
1 | 900 | flat
2 | 1150 | flat
1 | 1000 | flat
2 | 1200 | flat
1 | 1300 | flat

We're going to plot the above as points on a graph in two dimensions, using number of rooms as the x-axis and the area as the y-axis.

When we inevitably run into a new, unlabeled data point ("mystery point"), we'll put that on the graph too. Then we'll pick a number (called "k") and just find the "k" closest points on the graph to our mystery point. If the majority of the points close to the new point are "flats", then we'll guess that our mystery point is a flat.

That's what k-nearest-neighbor means. "If the 3 (or 5 or 10, or 'k') nearest neighbors to the mystery point are two apartments and one house, then the mystery point is an apartment."

Here's the (simplified) procedure:

- Put all the data you have (including the mystery point) on a graph.
- Measure the distances between the mystery point and every other point.
- Pick a number. Three is usually good for small data sets.
- Figure out what the three closest points to the mystery point are.
- The majority of the three closest points is the answer.
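The steps above compress into a small function. This is a bare-bones sketch with plain {x, y, label} points and no normalization -- the hypothetical `knn` helper here is not the implementation we build below:

```
// Minimal kNN sketch: label a mystery point by majority vote of its
// k nearest labeled neighbors. Points are plain {x, y, label} objects.
function knn(labeled, mystery, k) {
    // Rank candidates by squared distance (order is the same as by distance).
    var ranked = labeled.slice().sort(function(a, b) {
        var da = Math.pow(a.x - mystery.x, 2) + Math.pow(a.y - mystery.y, 2);
        var db = Math.pow(b.x - mystery.x, 2) + Math.pow(b.y - mystery.y, 2);
        return da - db;
    });
    // Tally the labels of the k closest points and track the winner.
    var votes = {};
    var winner = null;
    for (var i = 0; i < k && i < ranked.length; i++) {
        var label = ranked[i].label;
        votes[label] = (votes[label] || 0) + 1;
        if (winner === null || votes[label] > votes[winner])
            winner = label;
    }
    return winner;
}
```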

If you're having trouble visualizing this, please take a quick break to scroll down to the bottom of the page and run the JS fiddle. That should illustrate the concept. Then come back up here and continue reading!

Let's start building this thing. There are a few fine points that will come out as we're implementing the algorithm, so please read the following carefully. If you skim, you'll miss out on another important concept!

I'll build objects of two different classes for this algorithm: the "Node", and a "NodeList". A Node represents a single data point from the set, whether it's a pre-labeled (training) point, or an unknown point. The NodeList manages all the nodes and also does some canvas drawing to graph them all.

The Node constructor has nothing to it. It just expects an object with the properties "type", "area", and "rooms":

```
var Node = function(object) {
    for (var key in object)
    {
        this[key] = object[key];
    }
};
```

When I build these algorithms "for realsies" I usually abstract the features a little bit more. This example requires "area" and "rooms" to be hard-coded in certain places, but I usually build a generalized kNN algorithm that can work with arbitrary features rather than our pre-defined ones here. I'll leave that as an exercise for you!

Similarly, the NodeList constructor is simple:

```
var NodeList = function(k) {
    this.nodes = [];
    this.k = k;
};
```

The NodeList constructor takes the "k" from k-nearest-neighbor as its sole argument. Please fork the JSFiddle code and experiment with different values of k. Don't forget to try the number 1 as well!

Not shown is a simple NodeList.prototype.add(node) function -- that just takes a node and pushes it onto the this.nodes array.

At this point, we could dive right in to calculating distances, but I'd like to take a quick diversion.

Look at the data in the table above. The number of rooms varies from 1 to 10, and the area ranges from 250 to 1700. What would happen if we tried to graph this data onto a chart (without scaling anything)? For the most part, the data points would be lined up in a vertical column. That'll look pretty ugly, and hard to read.

Unfortunately, that's not just an aesthetic problem. This is an issue of a large discrepancy of scale of our data features. The difference between 1 room and 10 rooms is *huge* when you consider what it means for classifying a "flat" vs a "house"! But the same difference of 9, when you're talking about square feet, is *nothing*. If you were to measure distances between Nodes right now without adjusting for that discrepancy, you'd find that the number of rooms would have almost no effect on the results because those points are so close together on the x-axis (the "rooms" axis).

Consider two dwellings with 700 square feet each: one with 1 room and one with 5 rooms. Looking at the data table by eye, you'd recognize the first to be a flat and the second to be an apartment. But if you were to graph these as-is and run kNN, it would consider them both to be flats.

So instead of looking at absolute values of number of rooms and area, we should normalize these values to be between 0 and 1. After normalization, the lowest number of rooms (1) becomes 0, and the largest number of rooms (10) becomes 1. Similarly, the smallest area (250) becomes 0 and the largest area (1700) becomes 1. That puts everything on the same playing field and will adjust for discrepancies of scale. It's a simple thing to do that makes all the difference in the world.
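That min-max normalization is a one-liner. A quick sketch, using the ranges from our table:

```
// Min-max normalization: map a raw feature value into [0, 1]
// given that feature's observed minimum and maximum.
function normalize(value, min, max) {
    return (value - min) / (max - min);
}

normalize(1, 1, 10);        // rooms: 1    -> 0
normalize(10, 1, 10);       // rooms: 10   -> 1
normalize(250, 250, 1700);  // area:  250  -> 0
normalize(1700, 250, 1700); // area:  1700 -> 1
```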

Pro-tip: you don't need to scale things evenly (into a square) like I described above. If area is more important to the problem than the number of rooms, you can scale those two features differently -- this is called "weighting", and gives more importance to one feature or another. There are also algorithms that will determine the ideal feature weights for you. All in due time...

To start normalizing our data we should give NodeList a way of finding the minimum and maximum values of each feature:

```
NodeList.prototype.calculateRanges = function() {
    this.areas = {min: 1000000, max: 0};
    this.rooms = {min: 1000000, max: 0};
    for (var i in this.nodes)
    {
        if (this.nodes[i].rooms < this.rooms.min)
        {
            this.rooms.min = this.nodes[i].rooms;
        }
        if (this.nodes[i].rooms > this.rooms.max)
        {
            this.rooms.max = this.nodes[i].rooms;
        }
        if (this.nodes[i].area < this.areas.min)
        {
            this.areas.min = this.nodes[i].area;
        }
        if (this.nodes[i].area > this.areas.max)
        {
            this.areas.max = this.nodes[i].area;
        }
    }
};
```

As I mentioned earlier, the best approach would be to abstract the features and not have areas or rooms hard-coded. But doing it this way reads a little more clearly to me.

Now that we have our minimum and maximum values, we can move along to the meat and berries of the algorithm. After we've added all our Nodes to the NodeList:

```
NodeList.prototype.determineUnknown = function() {
    this.calculateRanges();
    /*
     * Loop through our nodes and look for unknown types.
     */
    for (var i in this.nodes)
    {
        if ( ! this.nodes[i].type)
        {
            /*
             * If the node is an unknown type, clone the nodes list and then measure distances.
             */
            /* Clone nodes */
            this.nodes[i].neighbors = [];
            for (var j in this.nodes)
            {
                if ( ! this.nodes[j].type)
                    continue;
                this.nodes[i].neighbors.push( new Node(this.nodes[j]) );
            }
            /* Measure distances */
            this.nodes[i].measureDistances(this.areas, this.rooms);
            /* Sort by distance */
            this.nodes[i].sortByDistance();
            /* Guess type */
            console.log(this.nodes[i].guessType(this.k));
        }
    }
};
```

That's a mouthful. First off, we calculate the min and max ranges so that the NodeList is aware of them.

We then loop through the Nodes and look for any unknown nodes (yes, this can do more than one mystery node at a time).

When we find an unknown Node, we clone the Nodes in the NodeList and make the Node aware of them. The reason we do this is because each unknown Node will have to calculate its own distance to every other Node -- so we can't use global state here.

Finally, we call three Node methods in succession on the unknown Node: measureDistances, sortByDistance, and guessType.

```
Node.prototype.measureDistances = function(area_range_obj, rooms_range_obj) {
    var rooms_range = rooms_range_obj.max - rooms_range_obj.min;
    var area_range = area_range_obj.max - area_range_obj.min;
    for (var i in this.neighbors)
    {
        /* Just shortcut syntax */
        var neighbor = this.neighbors[i];
        var delta_rooms = (neighbor.rooms - this.rooms) / rooms_range;
        var delta_area = (neighbor.area - this.area) / area_range;
        neighbor.distance = Math.sqrt( delta_rooms * delta_rooms + delta_area * delta_area );
    }
};
```

The measureDistances function takes the NodeList's set of ranges for min and max rooms and areas. If you've abstracted away from hard-coding features, the argument to this method would just be an array of ranges for features, but here we've hardcoded it.

We quickly calculate rooms_range (here, it'll be 9) and area_range (here, it'll be 1450).

Then we loop through our Node's neighbors (we gave this Node the neighbors property above when we cloned the Nodes from NodeList). For each one of the neighbors we calculate the difference in number of rooms and area. After each of those calculations, we normalize by dividing each feature by its range.

If the difference between the number of rooms turns out to be 3 and the total range is 9, we end up with 0.333 for the value of that delta. This number should always be between -1 and +1, since we've normalized to a square for this problem.

Finally, we calculate the distance using the Pythagorean theorem. Note that if you have more than 2 features (dimensions), you still keep the Math.sqrt -- you just add all the squared features like so:

```
Math.sqrt( a*a + b*b + c*c + d*d + ... + z*z );
```

It should be pretty clear now what the true strength of this algorithm is. While our feeble minds may only be able to figure out correlations involving 5 or 10 different features, this algorithm can do that for hundreds or thousands of dimensions.
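Generalized to arbitrary dimensions, that distance is a short loop. A sketch, assuming each point is just an array of already-normalized feature values:

```
// Euclidean distance in any number of dimensions: sum the squared
// per-feature deltas, then take one square root at the end.
function euclideanDistance(a, b) {
    var sum = 0;
    for (var i = 0; i < a.length; i++) {
        var delta = a[i] - b[i];
        sum += delta * delta;
    }
    return Math.sqrt(sum);
}

euclideanDistance([0, 0], [3, 4]);  // => 5 (the classic 3-4-5 triangle)
```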

```
Node.prototype.sortByDistance = function() {
this.neighbors.sort(function (a, b) {
return a.distance - b.distance;
});
};
```

This above sortByDistance is just a helper function to sort the Node's neighbors by distance.

```
Node.prototype.guessType = function(k) {
    var types = {};
    var closest = this.neighbors.slice(0, k);
    for (var i in closest)
    {
        var neighbor = closest[i];
        if ( ! types[neighbor.type] )
        {
            types[neighbor.type] = 0;
        }
        types[neighbor.type] += 1;
    }
    var guess = {type: false, count: 0};
    for (var type in types)
    {
        if (types[type] > guess.count)
        {
            guess.type = type;
            guess.count = types[type];
        }
    }
    this.guess = guess;
    return types;
};
```

The final piece of the algorithm is the guessType method. This method accepts the value of "k" from NodeList and slices out the k closest neighbors. It then tallies up the types of those neighbors, stores the most common one as the guess, and returns the tally.

Algorithm finished. Congratulations! Now let's figure out how to graph this thing.

Graphing our results is pretty straightforward, with a few exceptions:

- We'll color code apartments = red, houses = green, flats = blue.
- We have to scale the graph into a square (for the same reasons we had to normalize the features).
- We'll also add a little bit of padding to the edges of the graph.
- We'll display the result of the kNN guess by drawing the radius that encompasses the "k" nearest neighbors, and coloring that radius with the result's color.

```
NodeList.prototype.draw = function(canvas_id) {
    var rooms_range = this.rooms.max - this.rooms.min;
    var areas_range = this.areas.max - this.areas.min;
    var canvas = document.getElementById(canvas_id);
    var ctx = canvas.getContext("2d");
    var width = 400;
    var height = 400;
    ctx.clearRect(0, 0, width, height);
    for (var i in this.nodes)
    {
        ctx.save();
        switch (this.nodes[i].type)
        {
            case 'apartment':
                ctx.fillStyle = 'red';
                break;
            case 'house':
                ctx.fillStyle = 'green';
                break;
            case 'flat':
                ctx.fillStyle = 'blue';
                break;
            default:
                ctx.fillStyle = '#666666';
        }
        var padding = 40;
        var x_shift_pct = (width - padding) / width;
        var y_shift_pct = (height - padding) / height;
        var x = (this.nodes[i].rooms - this.rooms.min) * (width / rooms_range) * x_shift_pct + (padding / 2);
        var y = (this.nodes[i].area - this.areas.min) * (height / areas_range) * y_shift_pct + (padding / 2);
        y = Math.abs(y - height);
        ctx.translate(x, y);
        ctx.beginPath();
        ctx.arc(0, 0, 5, 0, Math.PI * 2, true);
        ctx.fill();
        ctx.closePath();
        /*
         * Is this an unknown node? If so, draw the radius of influence
         */
        if ( ! this.nodes[i].type )
        {
            switch (this.nodes[i].guess.type)
            {
                case 'apartment':
                    ctx.strokeStyle = 'red';
                    break;
                case 'house':
                    ctx.strokeStyle = 'green';
                    break;
                case 'flat':
                    ctx.strokeStyle = 'blue';
                    break;
                default:
                    ctx.strokeStyle = '#666666';
            }
            var radius = this.nodes[i].neighbors[this.k - 1].distance * width;
            radius *= x_shift_pct;
            ctx.beginPath();
            ctx.arc(0, 0, radius, 0, Math.PI * 2, true);
            ctx.stroke();
            ctx.closePath();
        }
        ctx.restore();
    }
};
```

I won't describe the above in detail, but rather point out two lines that I want you to look at and figure out on your own:

- The line starting "var x = "
- The line starting "var radius = "

It's a good but potentially frustrating exercise to try and figure out what's going on in those two lines, but please don't give up until you understand them!

Finally, we'll throw some code together to create random mystery points and do that every 5 seconds:

```
var run = function() {
    nodes = new NodeList(3);
    for (var i in data)
    {
        nodes.add( new Node(data[i]) );
    }
    var random_rooms = Math.round( Math.random() * 10 );
    var random_area = Math.round( Math.random() * 2000 );
    nodes.add( new Node({rooms: random_rooms, area: random_area, type: false}) );
    nodes.determineUnknown();
    nodes.draw("canvas");
};
window.onload = function() {
    setInterval(run, 5000);
    run();
};
```

If you want to play with the value of "k", do it in the run() function above.

The kNN algorithm isn't the most sophisticated classifier there is, but it has some excellent uses. What's even more exciting is that you don't *have* to use kNN as a classifier; the concepts behind kNN are very flexible and can be used for non-classification problems. Consider the following problems that kNN might be a good candidate for (list includes both classification problems and otherwise):

- What named color (blue, red, green, yellow, grey, purple, etc) is a given RGB value closest to? This is useful if you have an image search application and want to "search for purple images" for instance.
- You're building a dating site and want to rank matches for a given profile. The features could be location, age, height, and weight, for instance. Pick the 20 closest neighbors and rank them by kNN distance.
- Quickly find 5 documents similar to a given document. Features are words in this case. Not the most sophisticated algorithm to solve this problem, but it works in a pinch.
- Given data for previous e-shoppers, figure out if the person currently browsing your site is likely to make a purchase. Features could be time of day, number of pages browsed, location, referral source, etc.

There are two issues with kNN I'd like to briefly point out. First of all, it should be pretty clear that if your training data is all over the place, this algorithm won't work well. The data needs to be "separable", or clustered somehow. Random speckles on the graph are no help, and very few ML algorithms can discern patterns from nearly-random data.

Secondly, you'll run into performance problems if you have thousands and thousands of Nodes. Calculating all those distances adds up! One way around this is to pre-filter out Nodes outside of a certain feature's range. For example, if our mystery point has # of rooms = 3, we might not even calculate distances for points with # of rooms > 6 at all.
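Such a pre-filter might look like this (a sketch; `preFilter` and its `maxDelta` cutoff are hypothetical, and you'd tune the cutoff to your data):

```
// Cheap pre-filter before the expensive distance pass: skip any
// candidate whose room count is more than maxDelta away from the
// mystery point's. Works on plain {rooms, ...} objects.
function preFilter(candidates, mystery, maxDelta) {
    var kept = [];
    for (var i = 0; i < candidates.length; i++) {
        if (Math.abs(candidates[i].rooms - mystery.rooms) <= maxDelta)
            kept.push(candidates[i]);
    }
    return kept;
}
```

Only the survivors of the pre-filter get the full distance calculation, which is a big win when you have thousands of Nodes and the filter is a single subtraction per candidate.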

Here's the JSFiddle. The known points are colored, the mystery point is grey, and the result of the kNN guess is signified by the colored radius around the mystery point. The radius encapsulates the 3 nearest neighbors. A new random mystery point is solved for every 5 seconds:

The Introduction to "Machine Learning in Javascript" post provides a nice introduction and context for this post and the rest of the series.

Genetic algorithms are inspired by nature and evolution, which is seriously cool to me. It's no surprise, either, that artificial neural networks ("NN") are also modeled from biology: evolution is the best general-purpose learning algorithm we've experienced, and the brain is the best general-purpose problem solver we know. These are two very important pieces of our biological existence, and also two rapidly growing fields of artificial intelligence and machine learning study. While I'm tempted to talk more about the distinction I make between the GA's "learning algorithm" and the NN's "problem solver" terminology, we'll drop the topic of NNs altogether and concentrate on GAs... for now.

One phrase I used above is profoundly important: "general-purpose". For almost any specific computational problem, you can probably find an algorithm that solves it more efficiently than a GA. But that's not the point of this exercise, and it's also not the point of GAs. You use the GA not when you have a complex problem, but when you have a complex problem of problems. Or you may use it when you have a complicated set of disparate parameters.

One application that comes to mind is bipedal robot walking. It's damn hard to make robots walk on two legs. Hand-coding a walking routine will almost certainly fail. Even if you succeed in making a robot walk, the next robot that comes off the line might have a slightly different center of balance, and that algorithm you slaved over no longer works. Instead of enduring the inevitable heartbreak, you might use a GA to "teach the robot to learn to walk" rather than simply "teaching the robot to walk".

Let's build a GA in Javascript.

Build a genetic algorithm in Javascript that reproduces the text "Hello, World!".

Naturally, everything starts with "Hello, World!" and so building a GA to reproduce that phrase is apropos. Note that this problem is highly contrived. At one point we're even going to type the phrase "Hello, World!" into the source code! Now that seems silly -- if you know the desired result, why program the algorithm in the first place? The answer is simple: this is a learning exercise. The next GA exercise (which will be in PHP) will be a little less contrived, but we need to start somewhere.

The basic approach to GAs is to generate a bunch of "answer candidates" and use some sort of feedback to figure out how close the candidate is to optimal. Far-from-optimal candidates literally die and are never seen again. Close-to-optimal candidates combine with each other and maybe mutate slightly; this is an attempt to modify the candidates from time to time and see if they get closer to optimal or farther from optimal.

These "answer candidates" are called **chromosomes**.

Chromosomes mate, produce offspring, and mutate. They either die off, survival-of-the-fittest style, or are allowed to produce offspring who may have more desirable traits -- natural selection at work.

This may be a strange way to think about solving for "Hello, World!" but stick with it. This example isn't the only problem that can be solved with GAs!

The chromosome is a representation of a solution candidate. In our case, the chromosome itself is a string. Let's assume that all chromosomes have a length of 13 characters (the same as "Hello, World!"). Here are some possible chromosomes that could be solution candidates for our problem:

- Gekmo+ xosmd!
- Gekln, worle"
- Fello, wosld!
- Gello, wprld!
- Hello, world!

Obviously that last one is the "correct" (or globally-optimum) chromosome. But how do we measure the optimality of a chromosome?

The cost function (or error function, or fitness function as the inverse) is some sort of measure of the optimality of a chromosome. If we're calling it "fitness function" then we're shooting for higher scores, and if we're using "cost function" then we're looking for low scores.

In this case, we might define a cost function to be something like the following:

For each character in the string, figure out the difference in ASCII representation between the candidate character and the target character, and then square it so that the "cost" is always positive.

For example, if we have a capital "A" (ASCII 65) but it's supposed to be a capital "C" (ASCII 67), then our cost for that character is 4 (67 - 65 = 2, and 2^2 = 4).

Again, the reason we're using the square of the difference is so that we never end up with a negative cost. You could just use absolute value if you want, too. Please experiment with different approaches -- that's how you learn!

Using that rule as a cost function, we can calculate the costs of the above 5 example chromosomes (in parentheses):

- Gekmo+ xosmd! **(7)**
- Gekln, worle" **(5)**
- Fello, wosld! **(5)**
- Gello, wprld! **(2)**
- Hello, world! **(0)**

In this case, since this problem is easy and contrived, we know that we're shooting for a cost of 0 and that we can stop there. Sometimes that's not the case. Sometimes you're just looking for the lowest cost you can find, and need to figure out different ways to end the calculation. Other times you're looking for the *highest* "fitness score" you can find, and similarly need to figure out some other criteria to use to stop the calculation.

The cost function is a very important aspect of GAs, because if you're clever enough, you can use it to reconcile *completely disparate parameters.* In our case, we're just looking at letters. But what if you're building a driving directions app and need to weigh tolls vs distance vs speed vs traffic lights vs bad neighborhoods vs bridges? Those are completely disparate parameters that you can reduce into one, neat, tidy cost function for a route by applying different weights to each parameter.
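As a sketch of that reconciliation trick (the weights and parameter names below are invented for illustration), a route's completely disparate attributes collapse into one comparable number:

```javascript
// Hypothetical weights; tuning these is the "clever" part.
var weights = { tolls: 10, distance: 1, trafficLights: 2 };

function routeCost(route) {
    // Each disparate parameter contributes a weighted term to one score.
    return route.tolls * weights.tolls +
           route.distance * weights.distance +
           route.trafficLights * weights.trafficLights;
}

var scenic  = { tolls: 0, distance: 30, trafficLights: 12 };
var highway = { tolls: 2, distance: 25, trafficLights: 1 };

console.log(routeCost(scenic));  // → 54
console.log(routeCost(highway)); // → 47
```

Once every route reduces to a single cost, the GA can compare, rank, kill, and mate candidates without caring what the underlying parameters mean.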

Mating is a fact of life, and we use it tons in GAs. Mating is a magical moment that two chromosomes in love with each other share. The technical term for mating is "crossover", but I'll continue calling it "mating" here, because that paints a more intuitive picture.

We haven't talked about "populations" in GAs yet (we'll get to that a bit later) but for now I'll just say that when you run a GA, you don't just look at one chromosome at a time. You might have a population of 20 or 100 or 5,000 going all at once. Just like in evolution, you might be inclined to have the best and strongest chromosomes of the population mate with each other, with the hope that their offspring will be even healthier than either parent.

Mating strings, like in our "Hello, World!" example, is pretty easy. You can pick two candidates (two strings; two chromosomes) and pick a point in the middle of the string. This point can be dead-center if you want, or randomized if you prefer. Experiment with it! Take that middle point (called a "pivot" point), and make two new chromosomes by combining the first half of one with the second half of the other and vice versa.

Take these two strings for example:

- Hello, wprld! **(1)**
- Iello, world! **(1)**

Cutting them in half and making two new strings from the alternating halves gives us these two new "children":

- Iello, wprld! **(2)**
- Hello, world! **(0)**

As you can see, the offspring includes one bastard child that has the worst traits of both parents, but also an angel child that has the best traits of both.

Mating is how you get from one generation of chromosomes to the next.

Mating alone has a problem: in-breeding. If all you do is mate your candidates to go from generation to generation, you'll get stuck near a "local optimum": an answer that's *pretty good* but not necessarily the "global optimum" (the best you can hope for).

(*I was recently informed that the above paragraph is misleading. Some readers got the impression that the most important aspect of chromosome evolution is the mating, when in actuality a GA would achieve very little if not for the combined effects of both mating and mutation. The mating helps discover more optimal solutions from already-good solutions [many problems' solutions, like our Hello World problem, can be divided into optimal sub-solutions, like "Hello" and "world" separately], but it's the mutation that pushes the search for solutions in new directions.*)

Think of the world that these chromosomes are living in as a physical setting. It's really hilly with all sorts of weird peaks and valleys. There is one valley that's the lowest of all, but there are also tons of other little valleys -- while these other valleys are lower than the land directly around them, they're still above sea-level overall. Searching for a solution is like starting a bunch of balls on the hills in *random places*. You let the balls go and they roll downhill. Eventually, the balls will get stuck in the valleys -- but many of them will get stuck in the random mini-valleys that are still pretty high up the hill (the local optima). It's your job to make sure at least one of the balls ends up in the lowest point on the whole map: the global optimum. Since the balls all start in random places, it's hard to do this from the outset, and it's impossible to predict which ball will get stuck where. But what you *can* do is visit a bunch of the balls *at random* and give them a kick. Maybe the kick will help, maybe it'll hurt -- but the idea here is to shake up the system a little bit to make sure things aren't getting stuck in local optima for too long.

This is called **mutation**. It's a completely random process by which you target an unsuspecting chromosome and blast it with just enough radiation to make one of its letters randomly change.

Here's an example to illustrate. Let's say you end up with these two chromosomes:

- Hfllp, worlb!
- Hfllp, worlb!

Again, a contrived example, but it happens. Your two chromosomes are exactly the same. That means that their children will be exactly the same as the parents, and no progress will ever be made. But if 1 in 100 chromosomes has one letter randomly mutated, it's only a matter of time before chromosome #2 above turns into "Ifllp, worlb!" -- and then the evolution continues because the children will finally be different from the parents once again. Mutation pushes the evolution forward.

How and when you mutate is up to you. Again, experiment. The code I'll give you later has a very high mutation rate (50%), but that's really just for demonstration. You might make it low, like 1%. My code makes only *one letter* move by *one ASCII code* but you can have yours be more radical. Experiment, test, and learn. It's the only way.

Chromosomes are representations of candidate solutions to your problem. They consist of the representation itself (in our case, a 13-character string), a cost or fitness score and function, the ability to mate ("crossover"), and the ability to mutate.

I like thinking about these things in OOP terms. The "Chromosome" class therefore has the following properties:

**Properties:**

- Genetic code
- Cost/fitness score

**Methods:**

- Mate
- Mutate
- Calculate Fitness Score

We'll now look at how to have chromosomes interact with each other in the final piece of the GA puzzle: the "Population".

The population is a group of chromosomes. The population generally remains the same size but will typically evolve to better average cost scores over time.

You get to choose your population size. I picked 20 for mine below, but you could choose 10 or 100 or 10,000 if you want. There are advantages and disadvantages, but as I've said a few times by now: experiment and learn for yourself!

The population experiences "generations". A typical generation may consist of:

- Calculating the cost/fitness score for each chromosome
- Sorting the chromosomes by cost/fitness score
- Killing a certain number of the weakest members -- you pick the number of chromosomes that will die
- Mating a certain number of the strongest members -- again, you pick how you do this
- Mutating members at random
- Some kind of completeness test -- ie, how do you determine when to consider the problem "solved"?

Starting a population is easy. Just fill it with completely random chromosomes. In our case, the cost scores for completely random strings will be *horrendous.* My code starts at an average cost score of 30,000. But that's ok -- that's what evolution is for. This is why we're here.

Knowing when to stop the population is a little trickier. Today's example is pretty simple: stop when you get a cost of 0. But this isn't always the case. Sometimes you don't know the minimum achievable cost. Or, if you're using fitness instead of cost, you may not know the maximum possible fitness.

In those cases you should specify a completeness criteria. This can be anything you want, but here's a starting suggestion to jump off from:

Stop the algorithm if the best score hasn't changed in 1,000 generations, and use that as your answer.

Criteria like that may mean that you never achieve the global optimum, but in many cases you don't *need* to achieve the global optimum. Sometimes "close enough" really is good enough.
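A stagnation check like that can be sketched in a few lines. The tracker object and its names are my own, not part of the code later in this article:

```javascript
// Tracks the best cost seen so far and how long it has been stuck there.
function StagnationTracker(limit) {
    this.limit = limit;     // generations to tolerate without improvement
    this.best = Infinity;
    this.stuckFor = 0;
}

StagnationTracker.prototype.shouldStop = function(currentBestCost) {
    if (currentBestCost < this.best) {
        this.best = currentBestCost;
        this.stuckFor = 0;  // any improvement resets the counter
    } else {
        this.stuckFor++;
    }
    return this.stuckFor >= this.limit;
};

var tracker = new StagnationTracker(1000);
// Call once per generation, e.g.:
// if (tracker.shouldStop(population.members[0].cost)) { /* done */ }
```

When `shouldStop` finally returns true, `tracker.best` is your "close enough" answer.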

I'll soon be writing another article on GAs (for PHP this time) with a slightly different problem, and that one will have a completeness rule similar to the above. It might be hard to swallow that "close enough is good enough" right now, but once you see the example in action hopefully you'll believe me.

Finally! I like OOP methods, but I also like rough and simple code, so I'll do this as straight-forwardly as possible though it may be a little rough around the edges.

(*Note that while I changed the occurrences of the term "gene" to "chromosome" in the text above, the code below still uses the incorrect "gene" terminology. It's a semantic and pedantic difference, but where would we be without semantic pedantics?*)

```
var Gene = function(code) {
    if (code)
        this.code = code;
    this.cost = 9999;
};
Gene.prototype.code = '';
Gene.prototype.random = function(length) {
    while (length--) {
        this.code += String.fromCharCode(Math.floor(Math.random() * 255));
    }
};
```

Simple. It's just a class that takes a string as a constructor, sets a cost, and has a helper function to create a new, random chromosome.

```
Gene.prototype.calcCost = function(compareTo) {
    var total = 0;
    for (var i = 0; i < this.code.length; i++) {
        var diff = this.code.charCodeAt(i) - compareTo.charCodeAt(i);
        total += diff * diff;
    }
    this.cost = total;
};
```

The cost function takes the "model" string as an argument, finds the differences between ASCII codes, and squares them.

```
Gene.prototype.mate = function(gene) {
    var pivot = Math.round(this.code.length / 2) - 1;
    var child1 = this.code.substr(0, pivot) + gene.code.substr(pivot);
    var child2 = gene.code.substr(0, pivot) + this.code.substr(pivot);
    return [new Gene(child1), new Gene(child2)];
};
```

The mating function takes another chromosome as an argument, finds the center point, and returns an array of two new children.

```
Gene.prototype.mutate = function(chance) {
    if (Math.random() > chance)
        return;
    var index = Math.floor(Math.random() * this.code.length);
    var upOrDown = Math.random() <= 0.5 ? -1 : 1;
    var newChar = String.fromCharCode(this.code.charCodeAt(index) + upOrDown);
    this.code = this.code.substr(0, index) + newChar + this.code.substr(index + 1);
};
```

The mutate method takes a float as an argument -- the percent chance that the chromosome will mutate. If the chromosome is to mutate we randomly decide if we're going to add or subtract one from the randomly-selected character code. I was going too fast to write a proper String.prototype.replaceAt method, so I just took an easy shortcut there.

```
var Population = function(goal, size) {
    this.members = [];
    this.goal = goal;
    this.generationNumber = 0;
    while (size--) {
        var gene = new Gene();
        gene.random(this.goal.length);
        this.members.push(gene);
    }
};
```

The Population class constructor takes the target string and population size as arguments, then fills the population with random chromosomes.

```
Population.prototype.sort = function() {
    this.members.sort(function(a, b) {
        return a.cost - b.cost;
    });
};
```

I define a Population.prototype.sort method as a helper function to sort the population by their cost score.

```
Population.prototype.generation = function() {
    for (var i = 0; i < this.members.length; i++) {
        this.members[i].calcCost(this.goal);
    }
    this.sort();
    this.display();
    var children = this.members[0].mate(this.members[1]);
    this.members.splice(this.members.length - 2, 2, children[0], children[1]);
    for (var i = 0; i < this.members.length; i++) {
        this.members[i].mutate(0.5);
        this.members[i].calcCost(this.goal);
        if (this.members[i].code == this.goal) {
            this.sort();
            this.display();
            return true;
        }
    }
    this.generationNumber++;
    var scope = this;
    setTimeout(function() { scope.generation(); }, 20);
};
```

The meatiest population method is the generation method. There's no real magic here. The display() method (not shown on this page) just renders output to the page, and I set a timeout between generations so that things don't explode.

Note that in this example I'm only mating the top two chromosomes. This doesn't have to be your approach.
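One alternative, sketched generically here (the helper name and the top-half scheme are my own inventions, not from the code above), is to let the whole strongest half pair off and replace the weakest half with their children:

```javascript
// Pairs off the strongest half of a cost-sorted population and replaces
// the weakest half with offspring. "mate" is any function that returns
// two children from two parents.
function breedTopHalf(sorted, mate) {
    var half = Math.floor(sorted.length / 2);
    var next = sorted.slice(0, half);   // survivors
    for (var i = 0; i + 1 < half && next.length < sorted.length; i += 2) {
        var kids = mate(sorted[i], sorted[i + 1]);
        next.push(kids[0]);
        if (next.length < sorted.length) next.push(kids[1]);
    }
    // Top up with clones of the best if pairing left a gap.
    while (next.length < sorted.length) next.push(sorted[0]);
    return next;
}

// Demo with plain numbers standing in for chromosomes:
var next = breedTopHalf([1, 2, 3, 4], function(a, b) { return [a + b, a - b]; });
console.log(next); // → [1, 2, 3, -1]
```

More mating pairs per generation means more of the population churns each step; whether that helps depends on the problem, so experiment.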

```
window.onload = function() {
    var population = new Population("Hello, world!", 20);
    population.generation();
};
```

That gets the ball rolling. See it in action:

It's slow -- but it's not too difficult to figure out where the inefficiencies are. If you're clever enough you can certainly make it lightning fast. As always, I encourage you to fork and experiment and learn on your own.

Notice that there's nothing in the above that can't be done in *any* programming language. But you probably expected that, since this is "Machine Learning in All Languages", after all.

Happy learning! The next post in this series is ML in JS: k-nearest-neighbor.

There's also a Genetic Algorithms Part 2, which you should read if you want to get a little more advanced.

I also happen to be a PHP and JavaScript developer. I've taught classes on both of these as well -- but like any decent software engineer I have experience with Ruby, Python, Perl, and C. I just prefer PHP and JS. (Before you flame PHP, I'll just say that while it has its problems, I like it because it gets stuff done.)

Whenever I say that Tidal Labs' ML algorithms are in PHP, they look at me funny and ask me how it's possible. Simple: it's possible to write ML algorithms in just about any language. Most people just don't care to learn the fundamentals strongly enough that they can write an algorithm from scratch. Instead, they rely on Python libraries to do the work for them, and end up not truly grasping what's happening inside the black box. Other people only know ML academically, using Octave or Matlab.

Through this series of articles, I'll teach you the fundamental machine learning algorithms using Javascript -- not Python or Octave -- as the example language. Originally I intended to write these articles in a variety of languages (PHP, JS, Perl, C, Ruby), but decided to stick with Javascript for the following reasons:

- If you're a web developer you probably already know JS, regardless of your backend expertise.
- Javascript has JSFiddle, a great tool that lets me embed executable Javascript right in my posts (hard to do that with C or Perl!)
- Several people asked me to stick to just one language.

While I'll be writing these articles with Javascript in mind, *please* re-write the examples in your language of choice as homework! Practice is how you get better, and writing the same algorithm several times in different languages really helps you understand the paradigms better.

It's possible to get excellent performance out of ML algorithms in languages like PHP and Javascript. I advocate writing ML algorithms in other languages because the practice of writing ML algorithms from scratch helps you learn them fundamentally, and it also helps you unify your backend by not requiring a Python script to do processing in the middle of a PHP application. You can do it in PHP, and cut out the (mental and computational) overhead of using another language.

... well, most of the time. There are some things you really can't do in PHP or Javascript, but those are the more advanced algorithms that require heavy matrix math. While you *can* do matrix math in JS, there is a big difference between simply "doing matrix math" and doing it efficiently. The advantage of NumPy or Matlab is not in their ability to do matrix operations, it's in the fact that they use optimized algorithms to do so -- things you wouldn't be able to do yourself unless you dedicate yourself to learning computational linear algebra. And that's not my field, so we'll just stick to the ML that doesn't require the advanced matrix math. You could try brute-forcing the matrix operations, but you'll end up with a relatively inefficient system. It's great for learning, so I'm not discouraging it -- I would just be wary of doing that in a production environment.

Keep in mind that most of the algorithms we'll look at can be solved both with and without matrix math. We'll use iterative or functional approaches here, but most of these algorithms can be done with linear algebra as well. There's more than one way to skin a cat! I encourage you to also go and learn (or figure out) the linear algebra approaches, but since that's not my strong suit I'll use other approaches.

Here are some of the algorithms I intend to cover. I'll update this list with links to the relevant articles as they're published:

- k-nearest-neighbor (Introduction)
- k-means clustering (Part 1)
- Genetic algorithms (Part 1, Part 2)
- Naive Bayes classifier (Part 1: Document Classification)
- Sentiment Analysis (Part 1)
- Full-text Search (Part 1: Relevance Scoring)
- Neural network

Happy learning!
