5 Aspects You Should Consider When Choosing A Data Science Project

Following up on my narrative of why I wanted to pursue data science in the first place, I still stood at that metaphorical junction with the big heavy sign above my head, asking me what my first project would be. I looked for a project in which data played a vital role and where machine learning could be applied – mainly because data preparation, machine learning, and data visualization were the areas where I wanted to improve my skills. While I mulled over this question, five aspects stood out.

#1: Data Availability

Many of the reasons that brought me to my learning path were related to data that I couldn’t use for private purposes because it was owned by the company I worked for. I knew that I wanted to write about what I was doing, and possibly also open-source my results, so working with proprietary data – which was also highly confidential – could not qualify as my first project.

There is always the option to gather data for a specific purpose – but this would delay the start of the actual analysis by at least a few weeks, if not months.

#2: Passion

If no one is standing behind you and constantly poking you with a stick, it is quite hard to stay on track with projects. This is no news. We all know this. Procrastination was one of the major things I wanted to avoid as much as possible, especially since this project was started during the Corona lockdown, when I was at home 24/7. Yes, laundry has to be done, and yes, I have wanted to cook that extraordinary dish for a very long time, and yes, the garden always needs my attention. Nonetheless, this project is important, as it will determine whether I’ll stay with data science or not. I need to be the one poking myself with a stick. No one else will. There was also a very special deadline coming up (called a due date, after which I was sure I would not get anything done for quite a long time), and I really wanted to be clear about my future career before then.

#3: Relevance in Real Life (RL)

Related to the P-word again, working on something that is relevant for my real life is much more motivating than working on some textbook problem that is *yawn* boring AF. When I looked at my hobbies, I found that many of them did not seem viable as a data analysis problem. Why should I analyze data about my crochet or knitting patterns? What would be the benefit? (I later learned that there actually is a project called SkyKnit, a neural network trained on knitting instructions.)

Yes, you can apply deep learning to basically anything and everything, but there has to be at least a faint idea of the benefit it would bring. This also relates to the usual data science life cycle: The first step is always defining the business problem and setting the goal. So: What is my goal?

#4: Fun Factor

You need motivation to pursue something. If you are learning in order to finish a grade or earn a certificate, this might be motivation enough to get you through a course or through school. However, learning just for yourself is different. Procrastination will hit you harder when you don’t have fun with what you’re doing. The chances that you will just stop midway increase when the fun is missing. Therefore, think of the fun factor as one of the pillars that form the foundation of your learning path. There will be times when it’s more or less present, but it shouldn’t be completely absent from the beginning.

#5: Level of Complexity

As a first project, I wanted to start with something that I could actually manage. I remember looking for a viable thesis topic back in university. One of the major pitfalls that my mentor warned me about was the danger of broad topics. For a good business case (or master’s thesis), a concise and narrow topic with a good set of questions was the key ingredient. You cannot answer questions that are too broad, or that do not match the data you have. You can’t answer questions that are vague. And most importantly, you can’t answer questions that you haven’t fully understood and thought through. Therefore, breaking down whatever topic I decided on would be highly important. Additionally, you do not want to start with the most complex task as your first project. If the matter is too complex, you will not be able to identify mistakes and improve. In the worst case, you will not be able to understand your own data and its implications at all, which will remove any fun and any real-life relevance from your project.

Bringing everything together

It’s quite easy: What am I passionate about that is, at the same time, not too complex, has open data available, is relevant to my real life, and is fun to work with? Bringing all of these aspects together gives me a very clear idea of my ‘business problem’ and my goal. Let’s review some possible projects:

  • Trading recommendations
  • Movie recommendations
  • Language analysis
  • Coronavirus updates
  • Improving our garden and plant care

Applying the criteria to the topics

Language analysis is, and has always been, one of my favorite topics. There are so many unanswered questions about language! At the same time, with all the data we have at hand (thanks, Internet!), language research nowadays is completely different from what it was until a few years ago. However, language analysis is very complex. There are many factors to include, and analyzing language data with the help of machine learning is a field of study in its own right. As algorithms cannot read language data the way humans do, a whole lot of data preparation is required before you can actually start feeding your training data into an algorithm. Still sounds interesting – but for my first project, that’s a no.

The idea of diving into the topic of trading came from Martin. We would be highly motivated, as everything we would learn about the stock market would be beneficial for our financial decisions. The end product would definitely be interesting for the same reason. There is data available (and if it is not free, this is a topic where buying some would actually make sense). However, this topic sounds awfully complex. I would also have to get into an area that I sadly find very boring. Yeah, financial decisions should not be boring at all, but they are! Due to my lack of knowledge, I would most probably not be able to judge any results of my project, and I would not be able to really work with the data productively. In conclusion, that’s a no for trading as my first project.

The whole coronavirus situation has affected all of us, and following the daily updates had become one of my hobbies at that time. Additionally, several COVID-19 datasets were openly available. So, the advantages of this topic: large, free, very up-to-date datasets, an interesting subject, and surely a great way to learn data visualization. However, what I was missing was the business problem. What would I want to solve with my analysis? Would there be a need to apply machine learning at all? If so, what would I predict? It would be interesting to include the countries’ reactions to the virus in some way, and to identify effective measures in times of crisis by analyzing how Corona hit those countries. However, this would again require a lot of research. Additionally, understanding the results would be quite difficult due to the many factors to consider during the interpretation phase. Therefore, no Corona project for me.

Improving our gardening and plant care skills, based on a thorough analysis of the current situation, still sounds like the greatest project ever. It is not relevant at all for the world, but very much so for us. We love gardening, and seeing plants suffer despite good care (at least we think it is good!) is sad. It’s also a matter of cost, as buying a plant that we know will most probably suffer is basically the same as pouring money down the drain. The one and only reason why I have not pursued this topic so far is the missing data. Of course, there’s no publicly available data about our garden (anything else would be scary), so it’s up to us to gather it… and I swear this is what we will do. At some point. In the future. For now, no gardening data science for me.

And the winner is…

This leaves one topic where all five aspects apply: using movie data to build a recommendation algorithm for us. Movies definitely are something we are passionate about. There is data available that I could access, and we also have our own database of our movie collection. With a bit of refurbishing, this is something I could work with! As I know my and Martin’s movie preferences quite well, we would also be able to use the end product in our daily lives and could start empirical research on how well my algorithm fits.
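For readers who like to see such an elimination written down, here is a minimal Python sketch of the checklist. It is only an illustration: the names CRITERIA, FAILED, passes_all and viable_topics are made up for this example, and the entries simply restate which aspects each topic was rejected for in the sections above.

```python
# The five aspects from this post (labels shortened for illustration).
CRITERIA = [
    "data availability",
    "passion",
    "real-life relevance",
    "fun factor",
    "manageable complexity",
]

# For each candidate topic: the aspects it was rejected for in the text.
# This is an interpretation of the prose, not measured data.
FAILED = {
    "Trading recommendations": {"fun factor", "manageable complexity"},
    "Movie recommendations": set(),
    "Language analysis": {"manageable complexity"},
    # No clear business problem, and hard-to-interpret results.
    "Coronavirus updates": {"real-life relevance", "manageable complexity"},
    "Garden and plant care": {"data availability"},
}


def passes_all(topic):
    """Return True if the topic was not rejected on any aspect."""
    return len(FAILED[topic]) == 0


def viable_topics():
    """Return the topics that satisfy all five aspects, in listed order."""
    return [topic for topic in FAILED if passes_all(topic)]


for topic, failed in FAILED.items():
    met = [c for c in CRITERIA if c not in failed]
    print(f"{topic}: {len(met)}/{len(CRITERIA)} aspects met")

print("Viable first project(s):", viable_topics())
```

A plain pass/fail checklist is enough here because a single missing aspect was already a reason to drop a topic; weighting the aspects would only matter if several topics had survived.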

Funnily enough, it only hit me when I went through the exercise of evaluating all the topics at hand: This indeed is the optimal topic for my project! Martin and I have a little ritual where he asks me what I would like to watch in the evening. We have some series that we watch together, and even among them, I sometimes have difficulty deciding on one. Then there are the evenings when we really want to watch a movie, and the disaster starts: I struggle so much with deciding on one that we sometimes end up watching either what Martin picked or one of our usual series, simply because we are getting too annoyed to keep searching for a viable movie. How great would it be to have some help at hand… Decision made – I want to build a movie recommender that evaluates new movies based on what it has learned about our preferences.
