The first step in any project – no matter whether it includes machine learning, agreeing on a company strategy, or going away for the weekend with a group of friends – is the definition of the (business) problem. This step is often underestimated. Do not make the mistake of skipping it – it will come back to haunt you if you do.
After identifying the topic of my first ML project, I needed to outline my business problem. Following what I had learned in online courses and YouTube videos, I went through these 5 steps:
- Defining the business problem thoroughly
- Setting goals for my project
- Checking how to measure success, both on the technical side and for the overall project
- Getting an overview of the available data, as well as possibly missing data
- Characterizing the machine learning problem
Surprising or not – these steps do not really differ from the usual recommendations on how to start a business project.
#1: The business problem
This step is a little artificial if you’re just planning for yourself, with no strings attached and no real problem to solve. However, it is the most crucial exercise for any project! Starting without a clear vision means that you may end up with results that you cannot use at all. You will have answers without having the question. Think about the answer to the ultimate question of life, the universe and everything – was it helpful? No. It wasn’t. Because the question was missing (maybe also for other reasons, but I want to make a point here). So let’s be thorough about knowing exactly what we want to find out!
I wanted to build a recommendation system for Martin and me, which would predict whether we’d like a certain movie based on the movies that are present in our collection. The algorithm should be able to tell us how confident it is that we will like a movie or not.
#2: Setting goals
Goals should include both sides of the project: the ‘business’ side as well as the ML side. When would I consider the project finished?
- This project does not have many stakeholders – it’s Martin and me, really. The business side therefore asks when Martin and I would be happy with the results.
- The goal on the ML side is staggered – as this is my first project, my first goal really is just ‘making it happen’. As I have never worked through all the steps of an ML project, completing each step will be a milestone.
- The overall goal on the ML side, of course, is providing us with an algorithm that is reliable and can predict whether we’ll like a movie or not (which will be validated in testing phases a.k.a. ‘movie nights’).
There is no financial impact upon success or failure – as long as I do not go to the store and blindly buy every Blu-ray that the algorithm approves. It will ‘cost’ my time, which is well invested whatever the outcome, because I need the training anyway.
#3: Measuring success
It’s nice to know when you would consider a project a success – but it’s worth nothing if you cannot measure it. With a small number of stakeholders this is less important, especially if the project is basically a fun learning opportunity. However, I know that leading projects in a company environment depends heavily on how well you can measure your success. This is true for any project you might lead. The literature about measuring success is overwhelming (and still, there are many people out there who have never heard of it!).
Defining my goal as ‘making Martin and me happy with the results’ is a typical case of a result that cannot be measured. When did we reach the status of ‘happy’? What if Martin is ‘happier’ than I am, or vice versa? We clearly need a measurable definition. We could measure our happiness with one question after each new movie: Would you watch this movie again? For movies that were recommended, the number of positive answers should exceed the number of negative answers. The reverse should hold for movies that we decide to watch despite the algorithm saying that we will not like them: there, the negative answers should be in the majority.
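As a rough sketch, the ‘would you watch this movie again?’ rule could be tallied like this. The log entries and the names `movie_nights` and `success_by_group` are made up for illustration; they are not from a real movie night.

```python
from collections import Counter

# Hypothetical movie-night log:
# (what the algorithm said, answer to "would you watch this movie again?")
movie_nights = [
    ("recommended", True),
    ("recommended", True),
    ("recommended", False),
    ("not recommended", False),
    ("not recommended", True),
    ("not recommended", False),
]


def success_by_group(log):
    """For each group, report whether the answers went the expected way."""
    counts = Counter(log)
    rec_yes = counts[("recommended", True)]
    rec_no = counts[("recommended", False)]
    not_yes = counts[("not recommended", True)]
    not_no = counts[("not recommended", False)]
    return {
        # recommended movies: 'yes' answers should be in the majority
        "recommended": rec_yes > rec_no,
        # movies the algorithm advised against: 'no' answers should be in the majority
        "not recommended": not_no > not_yes,
    }


result = success_by_group(movie_nights)
print(result)
```

With only a handful of movie nights the counts are of course very noisy, so this is more of a sanity check than a real evaluation.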
From the ML perspective, we can define success as reaching a certain threshold of correct predictions. Let’s say that if the algorithm identifies more than 75% of the items in the test set correctly, this counts as a success. Defining the threshold at this point in time is simply me guessing blindly. I am planning for a binary outcome: Either the algorithm recommends a movie (‘yes’), or not (‘no’). This means that by guessing the answers at random, it would reach about 50% correct classifications on average. If most movies in the test set carry the same label, always predicting that label would score even higher, so the baseline should be checked against the actual distribution of labels. In order to count as a success, the algorithm needs to be significantly better than these baselines.
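A minimal sketch of this check, assuming predictions and true labels are available as two lists of booleans. The labels below are made up, and `THRESHOLD` simply encodes the guessed 75% from above. Besides the accuracy, the sketch computes the score of always predicting the most frequent label, which is the stricter baseline when the labels are unbalanced.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_baseline(actual):
    """Accuracy of always predicting the most frequent label."""
    most_common = max(set(actual), key=actual.count)
    return actual.count(most_common) / len(actual)


# Made-up test set: True = we liked the movie, False = we did not.
actual = [True, True, True, False, True, False, True, True]
predicted = [True, True, False, False, True, False, True, True]

THRESHOLD = 0.75  # the guessed success threshold

acc = accuracy(predicted, actual)
base = majority_baseline(actual)
is_success = acc > THRESHOLD and acc > base

print(f"accuracy={acc:.3f}, majority baseline={base:.3f}, success={is_success}")
```

In this made-up example, always answering ‘yes’ would already be right 75% of the time, which shows why the threshold should be compared with the label distribution before it is fixed.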
#4: Gathering data
Now that we have a clear definition of what we want to achieve, the follow-up question is what kind of data we need during the training phase. The algorithm has to know:
- which movies we watched, and how much we liked them
- which other movies exist that we didn’t watch, and how they are categorized
We have a database available of the movies that we bought on Blu-ray. Most of them were added to our collection after we watched them. This means that most of them are movies that we found exceptionally great. Some were also added before we watched them because we thought we would like them – not always with a positive result.
We of course watched several other movies that we didn’t add to our collection and still liked. It would be an option to put some of them into the test set, as I will be able to judge whether the algorithm decided correctly.
The training database could be based on IMDb. There is open data available, which is generally well maintained and up to date.
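To make the data overview a bit more concrete, here is one possible way to represent the records. This is only a sketch: the class `MovieRecord`, its fields and the placeholder titles are hypothetical, and the attributes of a real catalogue may look quite different.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MovieRecord:
    title: str
    genres: Tuple[str, ...]        # hypothetical attribute; real catalogue fields may differ
    in_collection: bool            # True if we own the movie
    liked: Optional[bool] = None   # None = not watched yet, so there is no label


# Placeholder entries, not real data.
records = [
    MovieRecord("Movie A", ("drama",), in_collection=True, liked=True),
    MovieRecord("Movie B", ("comedy",), in_collection=True, liked=False),
    MovieRecord("Movie C", ("sci-fi",), in_collection=False, liked=True),
    MovieRecord("Movie D", ("drama", "thriller"), in_collection=False),
]

# Movies we watched carry a label and can be used for training or testing.
labeled = [r for r in records if r.liked is not None]
# Movies we have not watched are the ones we want predictions for.
unlabeled = [r for r in records if r.liked is None]
# Watched but not in the collection: candidates for the test set.
test_candidates = [r for r in labeled if not r.in_collection]

print(len(labeled), len(unlabeled), [r.title for r in test_candidates])
```

Keeping the ‘not watched yet’ case as `None` rather than `False` avoids mixing up movies we disliked with movies we simply do not know.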
#5: The ML problem
First things first: Always check whether the problem you want to tackle actually is an ML problem at all. Machine learning is not a cure for everything, and if there is an easier way – try it first. In this case, we have a classic ML situation:
- The database is quite large, consisting of numerical and textual data.
- We have several attributes per item, which makes this a multi-dimensional ML problem.
- How the attributes combine to influence the result is not yet known – but we know the data labels, meaning that we’re dealing with a supervised ML problem.
- The outcome should be a classification of new data based on the training data (yes/no classification).
Using classic scripting or programming approaches would not lead to success in this case. We are dealing with multiple dimensions of data, with a fixed number of attributes and known labels, and we expect a classification as the result – a textbook supervised ML problem.
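To illustrate what ‘supervised yes/no classification with a confidence’ can look like, here is a toy sketch using a k-nearest-neighbours vote in plain Python. It is not the algorithm chosen for this project, only the smallest example I could think of. The two numeric features and all values are made up, and the ‘confidence’ is nothing more than the share of neighbours that agree with the predicted label.

```python
import math

# Made-up training data: (feature vector, label).
# The two features are hypothetical, e.g. a normalised rating and a genre-match score.
training = [
    ((0.9, 0.8), True),
    ((0.8, 0.9), True),
    ((0.7, 0.7), True),
    ((0.2, 0.3), False),
    ((0.3, 0.1), False),
]


def predict(features, data, k=3):
    """Return (label, confidence) from a majority vote of the k nearest neighbours."""
    nearest = sorted(data, key=lambda item: math.dist(features, item[0]))[:k]
    n = len(nearest)
    yes_votes = sum(1 for _, liked in nearest if liked)
    label = yes_votes * 2 > n
    # Share of neighbours that agree with the predicted label.
    confidence = max(yes_votes, n - yes_votes) / n
    return label, confidence


label, confidence = predict((0.85, 0.75), training)
print(label, confidence)
```

The labelled examples play the role of the movies we have already rated, and the function answers the yes/no question for a new movie. A real project would more likely rely on an established library than on a hand-written vote.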
As a complete beginner, I am thankful for all of the nice people out there who uploaded tutorials and made for an easy head start into the interesting topic of machine learning. I am following the learning path for data science according to the AWS training and certification, but also using some other resources.
Shoutout to Jason Brownlee from Machine Learning Mastery for his tutorial on how to create the first ML project. Also, a big thank you (as always) to the community at Stack Overflow, which is a treasure chest of answers to questions that I constantly have. I hope that one day, I will be in a position to provide helpful answers to other newbies!