ML projects – Annotating data

After having merged and roughly cleaned my reference data, I was eager to start annotating.

Quick recap: The overall goal is to build a movie recommendation model via machine learning that will help me decide what to watch next Saturday. I had already merged the reference CSV files and roughly cleaned them of columns and data I didn't need.

Data Annotation

The plan for annotating the reference data (which after this point should become “the data”, as it would contain everything we’d need) was quite straightforward. The husband and I started to track our movie collection some years ago, when we accidentally bought a Blu-Ray that we already possessed. We use an app for this (this one, to be precise) that allows an export of the collection as CSV (a very strange one, to be honest). I had to rework the CSV and fix it in order to really make use of it, but in the end it provides me with the movie title and the movie ID in the IMDB database. That’s what I also have in my reference data, so here’s the very easy plan:

  1. Add a column to the reference data file called “InCollection” that takes a boolean value.
  2. Compare the movie IDs row by row and add a True or False value to this column accordingly.
  3. Check for IDs that could not be matched (hopefully none).
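
Before step 2 can happen, the collection export has to be read in and its IMDB IDs pulled into a list. Here is a rough sketch of how that could look; the file name and the column name are only placeholders for my reworked export, not its actual format:

import pandas as pd

# read the reworked collection export; path and column name are placeholders
collection = pd.read_csv("path/to/file/collection.csv")
collection_ids = collection["imdb_id"].tolist()

Later on, collection_ids is the list that gets compared against the reference data.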

Why it doesn’t always need to be the perfect solution

I am completely sure that there is a much easier or more efficient way to do this – but remember, this was my first project ever, so exploring possible options was one of the goals. I share this because I realized that I struggled to understand all the best, perfect, super-efficient solutions that I found on the internet. Yes, they were great, but when they didn't fit my case 100% (which happened frequently), I didn't have the knowledge to adapt them. Imperfect solutions can still achieve the goal, and will hopefully help other beginners get the hang of it. And what is more satisfying than coming back to your own code later, reworking it, and turning twelve lines into two?

Adding a column to a dataframe

Just inserting the column was quite easy. You need to know at which index to insert it, how to name it and which value it should take. Regarding the index, I very simply looked at my data with

print(full_reference.head())

and counted. Don’t forget to start with 0! I decided to place my new column directly after the title at loc=1. I was a bit confused by the description in the pandas documentation, because it mentions int, Series, or array-like as possible input for the value the column can take. As it turned out, I was worrying too much: with value=bool, the column displayed <class 'bool'> as intended:

full_reference.insert(loc=1, column="InCollection", value=bool)

Follow up with a print command to check whether it now looks correct.
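
That check can simply be the same head() call from above, which should now show the new column in second position:

print(full_reference.head())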

Comparing IMDB IDs

This is where the fun started. Working with a pandas dataframe turned out to be slightly more difficult than expected. The problem was not pandas itself – it was rather me, because newbie-me tried to perform some non-standard tasks. Again, answers on the internet frequently were “This is not a standard task, please do something else” – which was not helpful. I couldn’t figure out how to do something else and still achieve my goals, so I needed to go through this non-standard and horribly complicated exercise. That’s how you learn: by doing it in a way you think is good, and optimizing it later on.

After some hours of trial and error, I found a way to compare my collection with my reference data row by row, and to change the boolean value according to the result of the comparison. This is what my final code looked like:

true_count = []
i = 0
# loop once per row of the reference data (iterating the transposed frame yields the row labels)
for _ in full_reference.transpose():
    # compare the current row against every IMDB ID in the collection
    for id in collection_ids:
        # only keep working on this row as long as no match has been found yet
        if full_reference.iat[i, 1] is not True:
            # column 16 holds the IMDB ID of the reference data row
            if id == full_reference.iat[i, 16]:
                full_reference.iat[i, 1] = True
                true_count.append(full_reference.iat[i, 16])
            else:
                full_reference.iat[i, 1] = False
    i += 1

First, I transposed my reference data frame in order to be able to loop through the rows. For each row, I then looped through the IMDB IDs in my collection. This step in itself is very inefficient, and I am very sure that there is a better solution for this. However – it worked for me. My collection has around 200 entries, so for each of the 5412 rows of the reference data, we loop through up to 200 collection IDs (although, once a match is found, the remaining iterations for that row do hardly any work).

The if statement was very important: once a match was found for a row, I wanted to stop touching that row and move on to the next one. I had initially not thought about this, which resulted in most of the True values being overridden by False values again on later iterations. Once a match is found, none of the remaining collection IDs can match the same row (as IDs are unique), so the check simply skips any further work on a row as soon as it already holds a True value.
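
As a note for future-me: pandas can most likely do the whole comparison in one vectorized line. This is only a sketch, not what I ran back then, and it assumes the IMDB ID sits in column 16 of the reference data, as in the loop above:

# sketch: mark every row whose IMDB ID appears in the collection
full_reference["InCollection"] = full_reference.iloc[:, 16].isin(collection_ids)

isin() returns a boolean Series, so the column would end up holding proper True/False values right away.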

I was so excited when the loop worked for the first time that I just went on with my life (and the next steps). But I had to come back to this and add the true_count list. It showed that the number of True values found was much lower than the number of movies in my collection. This meant that not all of the movies we owned were actually found in the reference data (damn it). I had to find out why before I could go on – otherwise, I would have had to discard these movies and decrease the amount of data available for later steps. I expected that some data would have to be discarded anyway along the way, but I didn't want to sacrifice it now.
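
The check itself was nothing more than comparing the two counts, something like this (a sketch, not necessarily my exact line):

# how many matches did we find vs. how many movies are in the collection?
print(len(true_count), len(collection_ids))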

Checking for IDs that were not found

I added a step before the actual loop: I created two lists, one with the IDs in the reference data and one with the IDs in my collection data. I compared these and wrote the IDs that were in my collection, but not in the reference data, to a third list. If this list contained any entries, I automatically exited the program (because it would have been pointless to continue with the actual annotation):

import sys
import csv

# collection_ids and reference_ids are the two ID lists described above
not_found = []  # in the collection, but not in the reference data
zero_ids = []   # dummy IDs used for box sets and collections

for id in collection_ids:
    if id not in reference_ids:
        if id != "tt0000000":
            not_found.append(id)
        else:
            zero_ids.append(id)

# if anything was not found, export the IDs to a CSV file and stop the program
if len(not_found) > 0:
    with open("path/to/file/data.csv", "w", newline="") as f:
        writer = csv.writer(
            f,
            delimiter=",",
            quotechar='"',
            quoting=csv.QUOTE_MINIMAL
        )
        writer.writerow(not_found)
        f.write("\n")
    sys.exit()

I decided to export the IDs that were not found to a CSV file in order to work with them outside of the program. I also saw, when printing the IDs, that some of the movie IDs in my collection were dummy IDs (tt0000000) – that was true for box sets and collections such as our beloved Star Wars collection of the first six episodes. This left me with two tasks: first, get the data for movies that were simply not in the reference data (mainly German movies), and second, get the real IDs for the movies that had a dummy ID.

This meant that before I could continue with my initial project, I had to learn a bit more about how to use APIs.
