ML projects – Merging and cleaning several CSV files

After I had narrowed down the topic of my first machine learning project to building a movie recommendation algorithm, I quickly found some movie data from the IMDb database that I could use (I used these files). That way, I could basically skip the whole step of learning how to use an API properly and getting the data myself (haha! I thought. Read on). The main steps to having clean, usable data were:

  1. Go through the CSVs and check which ones to discard.
  2. Merge all the ones I want to keep.
  3. Delete duplicates.
  4. Add a column that indicates whether this movie is actually in my own collection.
  5. Decide which metadata to keep, and whether there needs to be another annotation step.
  6. Change textual into numerical data via the methods pandas provides (one-hot encoding, for instance; see the sketch right after this list).
  7. Feed into algorithm and see what happens.
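
Step 6 is still down the road, but as a preview, here is a minimal sketch of what one-hot encoding with pandas could look like – the "Genre" column and its values are made up for illustration:

import pandas as pd

# hypothetical toy data: the "Genre" column and its values are invented
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Drama", "Comedy", "Drama"],
})

# pd.get_dummies() replaces the categorical column with one 0/1 column per category
encoded = pd.get_dummies(movies, columns=["Genre"])
print(encoded)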

I hadn’t really thought about any details after step 6, so the main goal of this exercise was to get familiar with data preparation and data cleaning methods, as well as a bit of visualization.

Discarding data

While going through the CSVs, I noticed that all of the files I didn’t need (error logs, runtime) had far fewer columns than the ones I wanted to use. Once I realized this, it was quite easy to discard them:

import pandas as pd
from pathlib import Path

# list with all reference CSV filenames
reference_data_filenames = list(Path(".").joinpath("path/to/file/").iterdir())

# read every CSV into a dataframe and remember which file it came from
all_refdata = []
for item in reference_data_filenames:
    df = pd.read_csv(item)
    df.name = item.name
    all_refdata.append(df)

# getting rid of everything that is not needed at all, based on column count
all_refdata_clean = [s for s in all_refdata if len(s.columns) > 8]

Additionally, some CSVs contained columns that I didn’t want to use and that blew up their size, so I got rid of those, too. The result was a list containing pandas dataframes for each CSV I wanted to use.

drop_columns = ["Season", "Episode", "Runtime"]

print("\nChecking columns...")
for item in all_refdata_clean:
    # drop the unwanted columns wherever they occur; frames without them are left alone
    item.drop(columns=drop_columns, inplace=True, errors="ignore")

Merging data

There I was, one happy Python newbie, with her list of dataframes. Well, until I realized that having a list of dataframes created issues. All the beautiful pandas operations I wanted to do could not be performed, because the type of the variable was list, not DataFrame. I tried to just call the pandas operations on the element at index 0 of the list, but that turned out to be a dead end, too: it only ever touched one dataframe, never the whole collection. However, I wanted to merge everything into one big dataframe anyway, and how difficult could that be?
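
To make the type problem concrete, a quick sketch using the list from above:

print(type(all_refdata_clean))     # <class 'list'>
print(type(all_refdata_clean[0]))  # <class 'pandas.core.frame.DataFrame'>

# DataFrame methods live on the elements, not on the list itself:
# all_refdata_clean.head()  -> AttributeError: 'list' object has no attribute 'head'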

There are several methods to join dataframes (join(), merge(), concat()). Before you merge, think of the following:

  1. Do you want to merge just the columns that all of the files contain, or all available columns?
  2. How do you want to deal with duplicates?
  3. How do you want to deal with possible data loss when you remove duplicates?

… which were all things that I didn’t think about before, of course. I decided, after consulting Stack Overflow around a million times, that I would use concat() with an inner join:

full_reference = pd.concat(all_refdata_clean, join="inner")

This gave me a big dataframe with a lot of (mainly duplicated) rows (16,626, to be exact!), containing only the columns that appeared in every CSV (23 columns in total). As I needed to limit the number of columns for the project anyway, this was fine for me. Also, since I had already removed all the CSVs with very few columns in the earlier step, the inner join would not cut away too much metadata. I checked the joined columns by printing a list of the column titles:

# the merged frame already contains exactly the shared columns
joined_columns = list(full_reference.columns)
print(joined_columns)
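
To make question 1 from above concrete, here is a small sketch with two made-up frames: join="inner" keeps only the shared columns, while the default join="outer" keeps everything and fills the gaps with NaN:

left = pd.DataFrame({"Title": ["A"], "Year": [1999], "Runtime": [120]})
right = pd.DataFrame({"Title": ["B"], "Year": [2005], "Genre": ["Drama"]})

inner = pd.concat([left, right], join="inner")   # columns: Title, Year
outer = pd.concat([left, right], join="outer")   # columns: Title, Year, Runtime, Genre
print(inner.columns.tolist())
print(outer.columns.tolist())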

After this, I just deleted the duplicates from my new big dataframe. I decided to keep the last value instead of the first – this was a decision that came later in the process, when I was further along and had compared the data I got out of the two options.

# sort by IMDb ID so that keep="last" picks a deterministic row per title
full_reference.sort_values("imdbID", ascending=True, inplace=True)

# keep exactly one row per movie title, namely the last occurrence after sorting
full_reference.drop_duplicates(subset="Title", inplace=True, keep="last")
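
Comparing the two options can be done with a few lines of pandas. Here is a sketch of one way to look at the differences, run on the sorted frame before the in-place drop above (the name merged is just a stand-in for that frame):

# "merged" stands for the concatenated, sorted frame before the in-place drop above
first_kept = merged.drop_duplicates(subset="Title", keep="first")
last_kept = merged.drop_duplicates(subset="Title", keep="last")

# titles where the two strategies end up keeping a different imdbID
both = first_kept.merge(last_kept, on="Title", suffixes=("_first", "_last"))
changed = both[both["imdbID_first"] != both["imdbID_last"]]
print(len(changed), "titles differ between the two options")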

And there it was: my merged dataframe, containing all of the reference movies I wanted to work with – 5,412 movies with 23 columns.
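
A quick sanity check of those numbers (a one-liner, assuming the dataframe from above):

# should print (5412, 23) after deduplication
print(full_reference.shape)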
