TDWI Upside - Where Data Means Business

Win at the Game of Business with Reinforcement Learning

Machine learning is often demonstrated by teaching computers to win games. Now is the time to use machine learning to win the game of business.

Throughout the history of humankind, we've been fascinated with games. They let us set a goal and test strategies to see which are superior. Some games are relatively simple, with few winning strategies; others are more complex.

The game of Go has a practically limitless number of potential paths to success. Developed more than 2,500 years ago, it was, according to legend, intended to test the strategic thinking skills of the Chinese aristocracy. The rules are relatively simple, but the countless ways a game can unfold are what make Go so challenging.

For this reason, Go was a prime test of whether a machine could be trained to defeat a human player. In March 2016, AlphaGo, a program built by Google's DeepMind, defeated Lee Sedol, one of the world's best Go players, four games to one. This win was a landmark in artificial intelligence and machine learning because it showed in practice what reinforcement learning can do.

Until recently, most applications in AI and machine learning fell into one of two categories: supervised learning and unsupervised learning. Many remarkable advances have been made with the models and algorithms associated with these two categories, but problems exist that need a different approach.

Reinforcement learning is a third category of machine learning that is garnering attention in both academia and business. The field has been around since the 1980s, but real-world applications are only now emerging thanks to the recent explosion of computing power. It has the potential to change business as we know it.

Strategic Thinking and the Customer Journey

Using machine learning to teach computers to win at games is novel and even groundbreaking, but the real value comes when we apply the same technology and concepts to real-world business problems. What set DeepMind's Go programs apart was that they learned much the way a new player does: by trial and error, getting better with each game. The original AlphaGo was first trained on records of human expert games and then improved by playing against itself; its successor, AlphaGo Zero, learned from self-play alone.

In the business environment, there are many problems similar in nature to the game of Go. The rules are relatively simple but the number of potential paths to success is seemingly infinite. This is why reinforcement learning has such great transformational potential.

Nowhere is this more critical than in what is popularly called the customer journey. This is the set of steps a customer progresses through after being attracted by a business offering: learning more about the company, exploring its products, and making a buying decision. Companies such as Amazon have spent considerable time and money optimizing their customer journeys, which has contributed to their dominance in the marketplace.

The customer journey is much like a game where the business is seeking the optimal path for each customer. In this process, companies identify the path for moving strategically through a complex set of choices for how to engage the customer, when to engage the customer, and what to offer the customer at different points to optimize their experience. Optimization of the customer journey with machine learning is a new frontier in which reinforcement learning is a fundamental component.

The Four Components of Reinforcement Learning

To understand reinforcement learning, think of the problem as a series of events that ultimately leads to a reward. Reinforcement learning is about letting the computer test different paths and determine if the results improve. Upon finding a better path, the model is updated to account for this new knowledge.

This series of incremental improvements is what machine learning is all about. Over time, this learning creates a model that is robust enough to choose the correct path with increasing effectiveness but also abstract enough that it can predict the best path even when presented with new information.

There are four basic parts of reinforcement learning: states, actions and transitions, rewards, and policies. These come from the Markov decision process (MDP), a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker.


States

A state represents a step in the flow. For a customer journey, this is where the customer is at any given point in time. States could range from a customer who has never heard of the company and its offering to a loyal customer who has purchased in consecutive months. In each state, the customer's environment is different, and so is where they can and will go next.

Actions and Transitions

An action is the activity that occurs to move a customer from one state to the next. This action could be one that the customer instigates, such as a purchase, communication with customer service, a product return, or a social media post. It could also be an action instigated by the company, such as an email campaign, a promotion, or the fulfillment of an order. Each of these actions moves the customer from one state in the process to another.
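To make states, actions, and transitions concrete, here is a minimal sketch of a customer-journey transition model. The state names, action names, and probabilities are invented for illustration and do not come from any real data set. Each action taken in a state leads to a next state with some probability, which is the "partly random, partly controlled" idea behind an MDP.

```python
import random

# Hypothetical transition model: TRANSITIONS[state][action] maps each
# possible next state to its probability. All values are made up.
TRANSITIONS = {
    "unaware": {"ad_campaign": {"aware": 0.3, "unaware": 0.7}},
    "aware": {
        "email_campaign": {"purchased": 0.2, "aware": 0.8},
        "promotion": {"purchased": 0.4, "aware": 0.6},
    },
    "purchased": {"fulfill_order": {"loyal": 0.5, "purchased": 0.5}},
}


def step(state, action, rng):
    """Sample the customer's next state after `action` is taken in `state`."""
    outcomes = TRANSITIONS[state][action]
    next_states = list(outcomes)
    weights = list(outcomes.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same journey
print(step("aware", "promotion", rng))
```

In a real application, these probabilities would not be written by hand; they would be estimated from observed customer behavior or learned implicitly by the model.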


Rewards

A reward, whether positive or negative (a punishment), is the result of one transition or a series of transitions. In reinforcement learning, it is the reward that helps the machine learn and evolve over time, optimizing the process so that it maximizes rewards and minimizes punishments.

One important property of rewards is their discounted time value: a reward now is worth more than the same reward in the future. The model accounts for this by weighting future rewards with a discount factor.
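The time value of a reward is commonly expressed with a discount factor, often written as gamma, between 0 and 1: a reward received t steps from now is multiplied by gamma raised to the power t. The sketch below assumes a gamma of 0.9 and a hypothetical reward of 100 for a purchase; both numbers are illustrative only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, discounting each by gamma per step of delay.

    rewards[0] arrives now, rewards[1] one step later, and so on.
    """
    total = 0.0
    # Work from the last reward back to the first so that every step
    # applies exactly one more factor of gamma to everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A purchase worth 100 that arrives two steps from now is worth roughly 81 today.
print(discounted_return([0.0, 0.0, 100.0], gamma=0.9))
# The same purchase made immediately keeps its full value.
print(discounted_return([100.0], gamma=0.9))
```

A gamma close to 1 makes the model patient and willing to invest in long journeys; a smaller gamma favors actions that pay off quickly.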

At times, you cannot measure the reward at each state and might only be able to see the reward at the end of a series of transitions. A positive reward such as a customer purchase could be the result of multiple email and ad campaigns, each transitioning the customer to a new state. The company does not necessarily know which communication ultimately drove the customer to the business reward, but the ability to track which campaigns had an impact on which customers is important. Working backward from the reward, machine learning can determine which patterns consistently lead to the reward.
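One simple way to work backward from a reward is to walk each completed journey in reverse and give every touchpoint a share of the final reward, discounted by how far it was from the end. The sketch below averages that credit per touchpoint across journeys. The journey data, the touchpoint names, and the discount factor of 0.9 are all hypothetical, and this averaging scheme is only one of several possible approaches.

```python
from collections import defaultdict


def credit_by_touchpoint(journeys, gamma=0.9):
    """Average the discounted share of the final reward each touchpoint earns.

    journeys: list of (touchpoints, final_reward) pairs, where touchpoints
    is the ordered list of campaigns a customer saw before the outcome.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for touchpoints, final_reward in journeys:
        credit = final_reward
        # The last touchpoint gets full credit; earlier ones get less.
        for touchpoint in reversed(touchpoints):
            totals[touchpoint] += credit
            counts[touchpoint] += 1
            credit *= gamma
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical logged journeys: two end in a purchase worth 100, one in nothing.
journeys = [
    (["email", "ad", "email"], 100.0),
    (["ad"], 0.0),
    (["email"], 100.0),
]
print(credit_by_touchpoint(journeys))
```

With enough journeys, touchpoints that consistently appear shortly before a reward accumulate high average credit, while those that rarely lead anywhere score low.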

Policies

A policy is a set of rules that guide the action(s) and transition(s). It is the policy that evolves as the machine garners more information from transitions that result in rewards. The policy is applied to future interactions to automate decisions that will allow a company to win the game. The policy is often used to determine the customer's next-best action or where the company should focus next to improve the likelihood of earning a reward.
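As an illustration of how a policy can emerge from rewards, the sketch below applies tabular Q-learning, one common reinforcement learning algorithm, to a handful of logged transitions and then reads off the best-scoring action for each state. The states, actions, rewards, learning rate, and discount factor are all hypothetical, and the article does not prescribe this particular algorithm.

```python
from collections import defaultdict

ACTIONS = ["email", "discount", "wait"]  # hypothetical company actions
TERMINAL = {"customer"}                  # the journey ends at a purchase

# Hypothetical log of (state, action, reward, next_state) observations.
LOGGED = [
    ("prospect", "email", 0.0, "engaged"),
    ("prospect", "wait", 0.0, "prospect"),
    ("engaged", "discount", 10.0, "customer"),
    ("engaged", "wait", 0.0, "prospect"),
    ("engaged", "email", 0.0, "engaged"),
]


def learn_policy(transitions, alpha=0.5, gamma=0.9, sweeps=200):
    """Run Q-learning updates over the log and return the greedy policy."""
    q = defaultdict(lambda: {action: 0.0 for action in ACTIONS})
    for _ in range(sweeps):
        for state, action, reward, next_state in transitions:
            # Value of the best follow-up action; nothing follows a terminal state.
            future = 0.0 if next_state in TERMINAL else max(q[next_state].values())
            target = reward + gamma * future
            # Move the current estimate part of the way toward the target.
            q[state][action] += alpha * (target - q[state][action])
    # The policy picks the highest-valued action in each state seen so far.
    return {state: max(values, key=values.get) for state, values in q.items()}


policy = learn_policy(LOGGED)
print(policy)
```

Under these made-up numbers, the learned policy emails prospects and offers a discount to engaged customers, which is the "next-best action" idea in miniature. Actions never observed in a state keep a value of zero here, so a production system would also need a way to explore them.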

Playing to Win

When fully leveraged, this new form of machine learning has great value in improving how businesses operate and interact. The game of business is very similar to other games, but the stakes are usually much higher. The goal is to employ different strategies in the hopes that your strategy is better than that of your competition.

As the science and application of reinforcement learning become more popular and more powerful, the companies that master it will dominate their markets, and those that don't will slowly disappear.


About the Author

Troy Hiltbrand is the chief digital officer at Kyäni, where he is responsible for digital strategy and transformation.
