Supervised learning

WHAT IS SUPERVISED LEARNING?

Supervised learning teaches machines by example. During training, a system is exposed to large amounts of labeled data, for example images of handwritten digits annotated to indicate which number each one represents. Given sufficient examples, a supervised-learning system learns to recognize the clusters of pixels and shapes associated with each digit, and can eventually tell handwritten numbers apart reliably, distinguishing a 9 from a 4 or a 6 from an 8.

However, training these systems typically requires huge amounts of labeled data, with some systems needing millions of examples to master a task.

As a result, the datasets used to train these systems can be vast. Here are some examples:

Google's Open Images dataset: 9 million labeled images
Google's YouTube-8M video dataset: 7 million labeled videos
ImageNet: 14 million labeled images

The size of training datasets continues to grow, with Facebook recently announcing it had compiled 3.5 billion images publicly available on Instagram, using hashtags attached to each image as labels. Using one billion of these photos to train an image-recognition system yielded record levels of accuracy on the famous ImageNet benchmark.

The laborious process of labeling the datasets used in training is often carried out using crowdworking services such as Amazon Mechanical Turk, which provides access to a large pool of low-cost labor spread across the globe. For instance, ImageNet was put together over two years by nearly 50,000 people, mainly recruited through Amazon Mechanical Turk. However, Facebook's approach of using publicly available data to train systems could provide an alternative way of training systems using billion-strong datasets without the overhead of manual labeling.

HOW DOES SUPERVISED MACHINE LEARNING WORK?

Everything begins with training a machine-learning model, a mathematical function capable of repeatedly modifying how it operates until it can make accurate predictions when given fresh data.

Before training begins, you first have to choose which data to gather and decide which features of the data are important.

For example, we could train a machine learning model to predict how many ice cream cones will be sold based on the outside temperature.

In this example, from all the sales data and weather information available, we select two variables to train the model on: the outside temperature and the number of ice cream cones sold. The temperature is the input feature, while the number of cones sold is called the 'label', because it is the quantity we're trying to predict.
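As a minimal sketch of this selection step, assume the raw records arrive as a list of dictionaries. The field names (date, temperature_c, humidity, cones_sold), the helper name and all values below are invented for illustration. The code keeps only the temperature as the input and the number of cones sold as the label, and ignores everything else.

```python
# Hypothetical raw records: field names and values are made up for illustration.
raw_records = [
    {"date": "2023-06-01", "temperature_c": 18.0, "humidity": 0.61, "cones_sold": 120},
    {"date": "2023-06-02", "temperature_c": 22.5, "humidity": 0.55, "cones_sold": 165},
    {"date": "2023-06-03", "temperature_c": 27.0, "humidity": 0.48, "cones_sold": 210},
]


def select_feature_and_label(records, feature_key, label_key):
    """Split records into a list of input values and a list of labels."""
    features = [record[feature_key] for record in records]
    labels = [record[label_key] for record in records]
    return features, labels


# Temperature is the input; cones sold is the label we want to predict.
temperatures, cones_sold = select_feature_and_label(
    raw_records, "temperature_c", "cones_sold"
)
```

Everything not selected here, such as the date and humidity, is simply never shown to the model.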

Before training gets underway there will generally also be a data-preparation step, during which processes such as deduplication, normalization and error correction will be carried out.
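The three preparation steps named above could look roughly like the following sketch, assuming the data is a list of (temperature, cones sold) pairs. The plausibility limits are arbitrary assumptions, and "error correction" is simplified here to discarding rows that are missing values or clearly impossible. A real pipeline might repair such rows instead.

```python
def prepare(rows):
    """Clean (temperature, cones_sold) pairs and scale temperature to the range 0..1."""
    # Error handling (simplified): drop rows with missing or implausible values.
    # The limits of -50 and 60 degrees are assumptions chosen for this sketch.
    valid = [
        (temp, sold)
        for temp, sold in rows
        if temp is not None and sold is not None
        and -50.0 <= temp <= 60.0 and sold >= 0
    ]
    # Deduplication: keep the first occurrence of each row, preserving order.
    unique = list(dict.fromkeys(valid))
    if not unique:
        return []
    # Normalization: min-max scale the temperatures so they lie between 0 and 1.
    temps = [temp for temp, _ in unique]
    low, high = min(temps), max(temps)
    span = (high - low) or 1.0  # avoid dividing by zero if all temperatures match
    return [((temp - low) / span, sold) for temp, sold in unique]


# Made-up rows containing a duplicate, a missing value and a negative sales count.
rows = [(18.0, 120), (22.5, 165), (22.5, 165), (None, 140), (27.0, 210), (24.0, -5)]
cleaned = prepare(rows)
```

With these made-up rows, three clean records survive, with temperatures rescaled so that the coldest day maps to 0 and the hottest to 1.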

The next step is choosing an appropriate machine-learning algorithm from the wide variety available. Each has strengths and weaknesses depending on the type of data: some are suited to handling images, some to text, and some to purely numerical data.

The actual training process involves the machine-learning model automatically tweaking how it functions until it can make accurate predictions from data.

Let's look at that ice cream data again. Imagine plotting the dataset with the ice cream sales on the y-axis and the outside temperature on the x-axis. Each record becomes a single point, and together the points form a scatter across the chart.

To predict how many ice creams will be sold in future based on the outdoor temperature, the machine-learning algorithm will try to draw a straight line that passes as close as possible to all of these points.

The algorithm starts with a random line and then keeps tweaking its vertical position and slope, moving the line up or down a tiny bit or angling it steeper or shallower, to try to find a line that sits closer to the points.
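One common way to carry out this kind of repeated tweaking is gradient descent on the mean squared error. The text does not name a specific algorithm, so treat the following as an illustrative sketch only. The function name, learning rate, step count and data are all assumptions. Temperatures are centered around their mean purely to keep the small updates numerically stable.

```python
import random


def fit_line(temps, sales, learning_rate=0.005, steps=5000, seed=0):
    """Fit sales = slope * temp + intercept by gradient descent on mean squared error."""
    n = len(temps)
    mean_temp = sum(temps) / n
    centered = [t - mean_temp for t in temps]  # centering keeps the updates stable

    rng = random.Random(seed)
    slope = rng.uniform(-1.0, 1.0)   # random starting angle
    height = rng.uniform(-1.0, 1.0)  # line's value at the mean temperature

    for _ in range(steps):
        # How far the current line sits above (+) or below (-) each point.
        errors = [slope * x + height - y for x, y in zip(centered, sales)]
        # Gradients say which direction of tilt and shift reduces the error.
        slope_grad = 2.0 / n * sum(e * x for e, x in zip(errors, centered))
        height_grad = 2.0 / n * sum(errors)
        slope -= learning_rate * slope_grad    # angle a little steeper or shallower
        height -= learning_rate * height_grad  # move a little up or down

    intercept = height - slope * mean_temp  # undo the centering
    return slope, intercept


# Made-up data: outside temperature and cones sold on five days.
temps = [15.0, 20.0, 25.0, 30.0, 35.0]
sales = [95.0, 160.0, 190.0, 260.0, 295.0]
slope, intercept = fit_line(temps, sales)
```

The learning rate that works depends on the spread of the data. With a much wider temperature range, the value used here might overshoot and would need to be reduced.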

Eventually, the algorithm will settle on a final line that it can't improve any further. Ice cream sales can then be predicted for any temperature by finding that temperature on the x-axis, moving up to the line, and reading off the corresponding sales on the y-axis.
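Reading a value off the line amounts to evaluating the line's equation at the chosen temperature. In the short sketch below, the slope and intercept are hypothetical stand-ins for whatever values training actually produced.

```python
def predict_sales(temperature, slope, intercept):
    """Read the predicted sales off the fitted line at a given temperature."""
    return slope * temperature + intercept


# Hypothetical fitted values, chosen only to illustrate the calculation.
slope, intercept = 10.0, -50.0

forecast = predict_sales(28.0, slope, intercept)  # 10.0 * 28.0 - 50.0 = 230.0
```

A straight line is only a reasonable guide within the range of temperatures seen in training. At very low temperatures, this particular line would predict negative sales.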