In its most basic form, machine learning looks like this: given some training data (text, a spreadsheet, an image), the model learns associations between features and produces an output based on those features. Typically the output of a machine learning model falls into one of two categories: a value or a classification. After the model has seen enough training data to make associations, it can make predictions about new data.

Features can be a variety of things depending on the data you want to analyze. If you used a corpus of blog entries as training data, individual words could be features, but so could things like word count, date published, or site views. The features you define for a task depend heavily on what you want the model to predict and the types of data you have access to.

Classification is a staple of machine learning that can handle a lot of diverse problems. Basically, you're building a model to predict whether the data is more like Thing A or Thing B (or Thing C...and so on). For example, I’ve been thinking about developing a model that classifies tweets as harassment vs non-harassment. Given a tweet, the model abstracts it into features and decides whether it belongs in one box or the other based on those features. I would probably define these features based on character ngrams in the user's handle and display name, the words used in the tweet, and the user's number of followers and accounts followed.
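
To make "abstracting a tweet into features" a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tweet is a plain dict, the keys (`handle`, `display_name`, `text`, `followers`, `following`) are made-up names rather than any real Twitter API, and the ngram size of 3 is arbitrary.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams in text, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(tweet):
    """Turn one tweet (a dict with hypothetical keys) into a feature dict."""
    features = Counter()
    # Character ngrams from the user's handle and display name.
    for field in ("handle", "display_name"):
        for gram in char_ngrams(tweet[field]):
            features[f"{field}_ngram={gram}"] += 1
    # One count per word in the tweet text (naive whitespace split).
    for word in tweet["text"].lower().split():
        features[f"word={word}"] += 1
    # Numeric features about the account.
    features["followers"] = tweet["followers"]
    features["following"] = tweet["following"]
    return dict(features)


# A made-up example tweet, not real data.
example = {
    "handle": "tennisfan",
    "display_name": "Tennis Fan",
    "text": "Great match today",
    "followers": 120,
    "following": 80,
}
features = extract_features(example)
```

The result is just a dictionary mapping feature names to numbers, which is roughly the shape most classifiers want before they can decide which box a tweet goes in.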

There are several ways to approach classification. One very flexible and useful model is Naive Bayes. Other useful methods are logistic regression and support vector machines; more on these later. I’ve had to suffer through years of Bayes' theorem being presented to me as a messiah, so now you do too. Here it is:

$$P(H|E) = \frac{P(E|H)P(H)}{P(E)}$$

Basically, \(H\) is one of the classification categories, and \(E\) is the data. The probability that a tweet can be categorized as harassment given the data, \(P(H|E)\), is the product of the probability of the data given the category, \(P(E|H)\), and the probability of that category at all, \(P(H)\), divided by the probability of the data, \(P(E)\). In practice, we ignore the denominator: \(P(E)\) is the same for every category, so it doesn't change which category comes out on top. The "naive" part is the assumption that the features are independent of each other given the category, which lets us multiply their individual probabilities together. This is a simple trick of probability that turns out to go a really long way in machine learning.

I'm gonna run through a quick example to illustrate how intuitive Naive Bayes is. The features here are outlook, temperature, humidity, and wind. We want to classify “yes” or “no” based on weather (ha!) it would be a good idea to play tennis on a given day.

| Outlook | Temperature | Humidity | Wind | Classification |
|---|---|---|---|---|
| sunny | hot | high | weak | no |
| rain | mild | high | strong | no |
| overcast | hot | low | weak | yes |
| rain | mild | high | weak | yes |
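
If you prefer code to tables, the four rows above can be written as a list of tuples, and the quantities a Bayes-style classifier needs are just row counts. This is a rough sketch, not a full classifier: the names `ROWS`, `FEATURES`, `prior`, and `likelihood` are my own, and the probabilities are plain frequencies with no smoothing.

```python
# The four training rows from the table:
# (outlook, temperature, humidity, wind, classification)
FEATURES = ["outlook", "temperature", "humidity", "wind"]
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("rain", "mild", "high", "strong", "no"),
    ("overcast", "hot", "low", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
]


def prior(label):
    """Fraction of training rows with this label, i.e. p(label)."""
    return sum(1 for row in ROWS if row[-1] == label) / len(ROWS)


def likelihood(feature, value, label):
    """Fraction of rows with this label whose feature equals value,
    i.e. p(value | label). Assumes the label appears at least once."""
    col = FEATURES.index(feature)
    with_label = [row for row in ROWS if row[-1] == label]
    return sum(1 for row in with_label if row[col] == value) / len(with_label)


print(prior("no"))                              # 0.5
print(likelihood("wind", "strong", "no"))       # 0.5
print(likelihood("outlook", "overcast", "no"))  # 0.0
```

Note that a value never seen with a label (like an overcast day classified "no") gets a probability of exactly zero, which is worth keeping in mind with a training set this tiny.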

Given this training data, we can classify new data probabilistically using Bayes' theorem. If we had some new data that looks like this:

        Outlook=Rain, Temperature=Mild, Humidity=High, Wind=Strong

Then p(x | rain, mild, high, strong) is, up to the denominator we're ignoring, simply: p(x) * p(rain | x) * p(mild | x) * p(high | x) * p(strong | x). For x = no this comes out to: