Why Image Recognition and Machine Learning Go Hand-in-Hand


Six years ago, a senior engineer at Google, Yonatan Zunger, had this to say about a sticky wicket on the Google+ photo app:

“Lots of work being done, and lots still to be done. But we’re very much on it,” he tweeted. He went on to explain that “image recognition software has problems with obscured faces as well as different skin tones, and lighting.”

The sticky wicket in question was Google’s image recognition software mislabeling photos of Black people as animals.


The mislabeling was first flagged by Jacky Alcine, a New York-based programmer, and Zunger’s reply in the Twitter exchange that followed was appropriately contrite:

“Thank you for telling us so quickly. Sheesh. High on my list of bugs you *never* want to see happen. Shudder.”

Eventually, Google issued a broad apology, mentioning that its image labeling technology was “still in its infancy and so not yet perfect.” Zunger’s own tweeted response mirrored this official one, saying that these issues are down to “ordinary machine learning trouble.”

So, six years later, is machine learning more sophisticated? It should be, now that we’ve seen significant advances in computing capacities and image processing hardware. Even more importantly, any image processing initiative that began in the mid-2010s now has over six years’ worth of data to “learn” from and produce more accurate results.

Like all machine learning processes, computer vision learns from data that humans collect and label. And human judgment is inherently, well, biased. That’s why it’s more important than ever to understand how image recognition works in conjunction with machine learning, both to improve accuracy and to apply it well in web and user experiences.

What Is Image Recognition?

Humans like creating things in their own image.

So, it’s not entirely surprising that machines are equipped with the ability to view and classify images, which is called “computer vision.”

Image recognition, or image classification, is just one of the tasks of computer vision. Others include object detection, semantic segmentation, and instance segmentation. For our purposes, we’ll focus on image recognition.

In the task of image recognition, hardware and software work together to identify places, people, icons, logos, objects, buildings, and other variables that appear in digital images.

. . . And How Does Image Recognition Work?

In image recognition, the computer relies on the numerical values of the pixels that make up a digital image. As it evaluates this numerical data, it derives rules to recognize patterns and regularities. In essence, it’s building a “model” of the world (as seen in those images) to draw on in the future.
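To make the “pixels are numbers” idea concrete, here is a minimal sketch in plain Python. The 4x4 grayscale image and the helper names (normalize, column_means) are made up for illustration and do not come from any particular library. A real model would derive far richer patterns than a column average.

```python
# A hypothetical 4x4 grayscale image: 0 is black, 255 is white.
# The left half is dark and the right half is bright.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

def normalize(img):
    """Scale 8-bit pixel values into the range 0.0 to 1.0."""
    return [[pixel / 255 for pixel in row] for row in img]

def column_means(img):
    """Average brightness of each column: a very crude summary of a pattern."""
    height = len(img)
    width = len(img[0])
    return [sum(row[c] for row in img) / height for c in range(width)]

pixels = normalize(image)
print(column_means(pixels))  # [0.0, 0.0, 1.0, 1.0]
```

Even this tiny summary shows a regularity (dark on the left, bright on the right) that a rule could be built on.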

Image recognition uses labels to classify images, and the task can be either single-class or multiclass. In single-class image recognition models, the algorithm looks for a single answer per image: for example, a dog versus a cat.

In multiclass image recognition, the model would assign or “recognize” several labels, along with a “confidence” score for each possible label or class. This numerical score tells the user how sure the image recognition model is about its output.
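One common way to turn a model’s raw outputs into per-label confidence scores is the softmax function. The sketch below assumes three hypothetical labels and made-up raw scores; it is an illustration of the idea, not the output of any real model.

```python
import math

def softmax(scores):
    """Convert raw class scores into confidences that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "cat", "rabbit"]  # hypothetical classes
raw_scores = [2.0, 1.0, 0.1]       # hypothetical raw model outputs

confidences = softmax(raw_scores)

# Rank the labels from most to least confident.
ranked = sorted(zip(labels, confidences), key=lambda pair: pair[1], reverse=True)
for label, confidence in ranked:
    print(f"{label}: {confidence:.2f}")
```

The label with the highest score is the model’s best guess, and the gap between the top scores hints at how sure it is.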

Image recognition requires “training.” That’s why it’s such a perfect candidate for machine learning: in practice, both depend on neural networks to learn from and process data. In general, the more representative data the software is trained on, the better it performs.

Here’s a rough schematic of how image recognition works:

  • You would work with a dataset that features images and their labels. For example, an image of a dog would be called “dog.”
  • You would then feed this dataset into a deep neural network (DNN). The network uses the dataset to “learn,” draw conclusions, track patterns, and store what it has learned for future use.
  • Finally, you would “test” the image recognition capabilities of the algorithm by offering images not in the training set and checking whether its predictions about what is being viewed are accurate (or not).
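The three steps above can be sketched in a few lines of plain Python. To keep it short and runnable, this sketch swaps the deep neural network for a single-layer logistic model, and it uses synthetic four-pixel “images” with made-up labels (“bright” and “dark”). All names and numbers here are assumptions for illustration.

```python
import math
import random

random.seed(0)  # make the synthetic data reproducible

def make_image(label):
    """Hypothetical 4-pixel image: 'bright' images have high pixel values."""
    low, high = (0.6, 1.0) if label == "bright" else (0.0, 0.4)
    return [random.uniform(low, high) for _ in range(4)]

# Step 1: a labeled dataset of (image, label) pairs, split into train and test.
dataset = [(make_image(label), label) for label in ["bright", "dark"] * 50]
random.shuffle(dataset)
train, test = dataset[:80], dataset[80:]

# Step 2: "learn" weights from the training set with gradient descent.
weights, bias, rate = [0.0] * 4, 0.0, 0.5

def predict_prob(image):
    """Probability, according to the model, that the image is 'bright'."""
    z = sum(w * p for w, p in zip(weights, image)) + bias
    return 1 / (1 + math.exp(-z))

for _ in range(200):
    for image, label in train:
        target = 1.0 if label == "bright" else 0.0
        error = predict_prob(image) - target
        for i in range(4):
            weights[i] -= rate * error * image[i]
        bias -= rate * error

# Step 3: test on images the model has never seen.
correct = sum(
    (predict_prob(image) > 0.5) == (label == "bright") for image, label in test
)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The important part is the split: the accuracy is measured only on the 20 images held back from training, which is what tells you whether the model has learned a pattern rather than memorized examples.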

Breaking Open the Black Box

Sounds simple enough, right?

Well, yes. And no. Input-model-output is an oversimplification, and so is the slightly more detailed picture of input, hidden layers, and output.

It’s the hidden layers — yes, that’s multiple layers — where the “magic” of image recognition processing occurs.

Like adaptive user interfaces that harness machine learning to offer personalized user experiences, image recognition software relies on the architecture of neural networks.

User interfaces rely on interactions like clicks or metrics like session cookies to provide data. With image processing, it’s the pixel data and its various nuances that form the inputs the machine learns from. These nuances are what make image processing and recognition complex: the information is variable, and it’s hard to assign static rules to variable factors like:

  • Lighting
  • Composition
  • Viewpoint-dependent variability
  • Background noise
  • Image deformation

However, artificial neural networks largely remove the need for programmers and engineers to hand-code rules for these cases. Instead, machine learning-powered image recognition learns features directly from the data.

Essentially, neural networks recognize patterns. In the architecture of a neural network, each layer consists of nodes or artificial neurons. Traditional neural networks have up to three layers. Deep neural networks, however, can contain dozens or even hundreds of them, and these are the “hidden layers” where processing occurs.

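The number of layers and nodes matters because, given enough representative training data, more of them generally allow a network to capture more complex patterns and make more accurate predictions. For image recognition, that means improved accuracy and fewer issues like Google’s unfortunate snafu, although depth alone cannot fix a biased or unrepresentative training set.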

It works because each layer of nodes builds on the feature set produced by the previous layer. This is why, layer by layer, image recognition becomes more nuanced and better able to recognize complex, detailed features.
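Here is a toy sketch of that layering in plain Python. The weights are hand-picked for illustration (a real network would learn them from data), and the input is a hypothetical 2x2 image flattened into four pixels: top-left, top-right, bottom-left, bottom-right.

```python
def dense(inputs, weights, biases):
    """One fully connected layer with ReLU: each node is a weighted sum of its inputs."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU keeps only positive responses
    return outputs

# Each entry is (weights, biases) for one layer. All values are hand-picked assumptions.
layers = [
    # Layer 1: brightness of the left column, brightness of the right column.
    ([[1, 0, 1, 0], [0, 1, 0, 1]], [0, 0]),
    # Layer 2: "left brighter than right" and "right brighter than left".
    ([[1, -1], [-1, 1]], [0, 0]),
    # Output: strength of a vertical edge, in either direction.
    ([[1, 1]], [0]),
]

def forward(pixels):
    """Pass the pixels through every layer; each layer consumes the previous layer's output."""
    activations = pixels
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

print(forward([1.0, 0.0, 1.0, 0.0]))  # [2.0]  bright left, dark right: strong edge
print(forward([1.0, 1.0, 1.0, 1.0]))  # [0.0]  uniform image: no edge
```

No single layer “knows” what an edge is. The first layer only measures brightness, the second compares those measurements, and the output combines the comparisons.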

Right now, the architecture that image recognition most commonly relies on for classification and detection tasks is the Convolutional Neural Network (CNN). CNNs rely on two main types of layers to parse through image pixel data:

  • Convolutional layers: These feature a set of learnable filters, which scan through the numerical pixel data and produce feature maps. These feature maps get passed to the next layer.
  • Pooling layers: Working on a feature map, pooling layers slide a two-dimensional window over each channel and summarize the features within the region the window covers.
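Both layer types can be sketched in plain Python. This is a simplified, single-channel illustration: the filter is hand-picked rather than learned, there is no padding, and the function names (convolve2d, max_pool) are my own rather than any library’s API.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) to build a feature map.

    Like most CNN libraries, this does not flip the kernel, so strictly
    speaking it computes a cross-correlation.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Summarize each non-overlapping size x size region by its maximum value."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, kernel)
pooled = max_pool(feature_map)

print(feature_map)  # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)       # [[2, 0], [2, 0]]
```

The feature map lights up exactly where the edge sits, and pooling shrinks it from 4x4 to 2x2 while keeping the strongest response in each region.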

What becomes extremely obvious with image recognition powered by machine learning, more so than in other use cases, is that specialists are trying to mimic how the brain processes information. The convolutional layer’s operation, for example, is not unlike the response of your visual cortex to a visual stimulus.

So, unlike content personalization or adaptive user interfaces, truly accurate image recognition cannot survive without deep neural networks.

As Jason Brownlee explains, “The power of neural networks comes from their ability to learn the representation in your training data and how to best relate it to the output variable that you want to predict. In this sense, neural networks learn mapping.” 

Use Cases for Image Recognition

A few years after its major trip-up, Google shuttered the Google+ platform for consumers, along with its photo sharing and image recognition capabilities.

Around the same time, it rolled out “BERT,” an ML-powered technology that focused on natural language processing (NLP).

Their stated goal: to better map searches to user intent.

Let’s neatly sidestep the whole debate about technologies like BERT creating an echo chamber (an argument that extends to image recognition and could explain why algorithmic bias exists — in other words, machine learning-powered techniques simply perpetuate the biases we already have in the “real” world).

Instead, let’s focus on why image recognition is not only inevitable but powerful when driven by machine learning. Image recognition has numerous standalone applications that retail businesses, B2B enterprises, and even public works bodies are beginning to pursue.

Consider, for example, how useful image recognition can be for:

  • User-generated content moderation
  • Enhanced visual search
  • Interactive marketing or creative campaigns
  • Traffic surveillance to improve traffic flows and reduce rush-hour density
  • Automated photo and video tagging for rapid processing at public terminals like airports
  • Computer vision application in retail to augment heat maps for better shopper flow, recommendation engines, cashier-less checkouts, inventory management, shelf-space organization, pass-by traffic analysis, and loss and theft detection and prevention

Another use case for an ML-powered image recognition feature could be predicting customer churn.

Using data points like visual or eye-tracking on smartphones, facial recognition, and even virtual mirrors, image recognition could help provide insight into where customers are dropping off, what they most want to see, what they’re most likely to purchase, and which parts of the brand experience they’re responding well to.

But it’s not just commerce — image recognition powered by machine learning can aid in other use cases as well:

  • Helping law enforcement or legal teams to analyze and gain insight into traffic injuries or CCTV footage
  • Detecting abnormalities in medical images, such as when screening for breast cancer or monitoring its treatment
  • Monitoring livestock health and populations for changes in behavior, potential signs of disease, or birth
  • Improving ecological outcomes through accurate plant image identification and prediction of future varieties and potential growth forms that would support specific species
  • Reducing the incidence of copyright infringement for digital artworks and proprietary images


Even though machine learning significantly increases the potential of successful and accurate image recognition, limitations still need to be worked through.

Most of them relate to variations, such as viewpoint variation, scale variation, and even intra-class variation (differences among objects that share the same label). This last issue is fascinating because it raises questions about image recognition for recommendation engines.

For example, image recognition features can have trouble identifying a “handbag” because of variations in style, shape, size, and even construction. As a result, they may not accurately present relevant information to a user.

Still, it’s worth noting that machine learning image recognition whisks us handily away from the days where programmers needed to code rules and onward to the moments in which algorithms themselves are creating the rules.

The job of humans has become ensuring these rules actually reflect the world we want to see, rather than the one we’re already in.



Marketing Director -
Senior Strategist

Marc Caposino

About Author

Marc has over 20 years of senior-level creative experience, developing countless digital products, mobile and Internet applications, and marketing and outreach campaigns for numerous public and private agencies across California, Maryland, Virginia, and D.C. In 2017, Marc co-founded Fuselab Creative with the hopes of creating better user experiences online through human-centered design.
