Exploratory Data Analysis Guide on the Netflix Dataset

Introduction

Exploratory Data Analysis (EDA) is a crucial first step in any data science project. It helps us understand the dataset’s structure, detect patterns, and identify potential issues such as missing values or outliers. In this tutorial, we will perform a comprehensive EDA on the Netflix Movies and TV Shows dataset from Kaggle, which provides insights into content available on Netflix.

You can access the Netflix Movies and TV Shows dataset on Kaggle using the following link:

Netflix Movies and TV Shows Dataset

This dataset provides a comprehensive list of movies and TV shows available on Netflix, including details such as cast, directors, ratings, and release dates.

Why is EDA Important?

EDA allows us to:

  • Detect missing values and outliers
  • Understand data distributions
  • Identify relationships between variables
  • Generate insights that guide feature engineering

Importing Libraries

To start, we need to import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Dataset

We load the Netflix dataset.

netflix = pd.read_csv('netflix_titles.csv')

Understanding the Data Structure

We begin by checking the dataset’s structure and statistics.

netflix.info()
netflix.describe()
netflix.head()

Key observations:

  • Most columns are text (categorical); release_year is the only numeric column, so describe() summarizes it alone by default.
  • Some columns may have missing values.
  • Key columns include title, type, release_year, rating, duration, and country.
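Since describe() only covers numeric columns by default, a compact per-column overview can help with a mostly categorical dataset. The following is a minimal sketch on a few made-up rows; the sample DataFrame and the overview name are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Made-up rows that imitate a few columns of netflix_titles.csv
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["A", "B", "C"],
    "release_year": [2019, 2020, 2021],
    "rating": ["TV-MA", None, "TV-MA"],
})

# One row per column: whether it is numeric and how many distinct values it holds
overview = pd.DataFrame({
    "is_numeric": sample.dtypes.map(pd.api.types.is_numeric_dtype),
    "unique_values": sample.nunique(),  # missing values are not counted
})
print(overview)
```

Replacing sample with the netflix DataFrame would give the same overview for the full dataset.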

Checking for Missing Values

netflix.isnull().sum()

The dataset has missing values in:

  • director
  • cast
  • country
  • rating
  • duration

We will handle these appropriately.
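Raw counts are easier to judge next to percentages. As a rough sketch, a small helper (hypothetically named missing_summary here) can report both for every column that has gaps. The sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows that imitate a few columns of netflix_titles.csv
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", None, None, "John Roe"],
    "country": ["India", "United States", None, "Japan"],
})

def missing_summary(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

report = missing_summary(sample)
print(report)
```

Calling missing_summary(netflix) would produce the same report for the real data. A column that is missing in a large share of rows may call for a different treatment than one with only a few gaps.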

Handling Missing Values

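We fill missing text fields with 'Unknown' and missing country and rating values with the most frequent value (the mode):

# Assign the result back instead of calling fillna(inplace=True) on a single column,
# which newer pandas versions warn about and may not apply to the DataFrame
netflix['director'] = netflix['director'].fillna('Unknown')
netflix['cast'] = netflix['cast'].fillna('Unknown')
netflix['country'] = netflix['country'].fillna(netflix['country'].mode()[0])
netflix['rating'] = netflix['rating'].fillna(netflix['rating'].mode()[0])
netflix['duration'] = netflix['duration'].fillna('Unknown')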

Univariate Analysis

Distribution of Content Types

sns.countplot(x='type', data=netflix)
plt.title('Distribution of Movies and TV Shows')
plt.show()

Observations:

  • Netflix has more movies than TV shows.

Most Common Ratings

sns.countplot(y=netflix['rating'], order=netflix['rating'].value_counts().index)
plt.title('Most Common Content Ratings')
plt.show()

Observations:

  • Some ratings appear more frequently than others, such as TV-MA and PG-13.

Top 10 Countries Producing Content

netflix['country'].value_counts().head(10).plot(kind='bar', figsize=(12,6))
plt.title('Top 10 Countries Producing Netflix Content')
plt.show()

Observations:

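  • The United States has the highest number of titles on Netflix. Because missing countries were filled with the most frequent value, this count is somewhat inflated.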
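The country column can list several countries in one comma-separated string, and value_counts() treats each distinct string as its own category. A possible refinement, sketched here on made-up rows, is to split the strings and count each country separately.

```python
import pandas as pd

# Made-up rows; the first title is a co-production
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "United States"],
})

country_counts = (
    sample["country"]
    .str.split(",")   # "United States, India" -> ["United States", " India"]
    .explode()        # one row per (title, country) pair
    .str.strip()      # drop the space left after each comma
    .value_counts()
)
print(country_counts)
```

With this approach a co-production counts once for each participating country, so the totals can exceed the number of titles.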

Bivariate Analysis

Release Year vs. Content Type

sns.histplot(data=netflix, x='release_year', hue='type', bins=20, kde=True)
plt.title('Release Year Distribution for Movies and TV Shows')
plt.show()

Observations:

  • More movies and TV shows have been released in recent years.

Duration vs. Content Type

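# Only movie durations are given in minutes ("90 min"); TV show values such as "2 Seasons" become NaN
duration_min = pd.to_numeric(netflix['duration'].str.replace(' min', ''), errors='coerce')
sns.boxplot(x=netflix['type'], y=duration_min)
plt.title('Duration Distribution in Minutes (Movies Only)')
plt.show()

Observations:

  • Only movies have a duration in minutes. TV show durations are recorded as a number of seasons, so they drop out of this plot and cannot be compared with movie runtimes directly.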
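Because the duration column mixes two units, minutes for movies and seasons for TV shows, one option is to split it into two numeric columns before comparing anything. The sketch below uses made-up rows, and the column names minutes and seasons are illustrative choices.

```python
import pandas as pd

# Made-up rows showing the two duration formats
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "125 min", "1 Season", "3 Seasons"],
})

# Leading integer of the duration string: "125 min" -> 125, "3 Seasons" -> 3
amount = sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)

is_movie = sample["type"] == "Movie"
sample["minutes"] = amount.where(is_movie)    # NaN for TV shows
sample["seasons"] = amount.where(~is_movie)   # NaN for movies

movie_median = sample["minutes"].median()
season_median = sample["seasons"].median()
print(movie_median, season_median)
```

Each column can then be plotted or summarized on its own scale.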

Correlation Analysis

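Since most of the data is categorical, correlation analysis is limited. release_year is the only numeric column as loaded, so we first derive a numeric movie runtime from duration and then correlate the two.

# Numeric frame: release_year plus movie runtime in minutes (NaN for TV shows)
runtime = pd.to_numeric(netflix['duration'].str.replace(' min', ''), errors='coerce')
numeric = netflix[['release_year']].assign(duration_min=runtime)
plt.figure(figsize=(10, 6))
sns.heatmap(numeric.corr(), annot=True, cmap='coolwarm')
plt.show()

Observations:

  • With only two numeric variables, the heatmap reduces to a single correlation between release year and movie runtime. Few strong correlations can be expected given the categorical nature of the dataset.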

Feature Engineering

Extracting First Genre

netflix['first_genre'] = netflix['listed_in'].apply(lambda x: x.split(',')[0])

This keeps only the first genre listed for each title, which simplifies analysis at the cost of ignoring any additional genres.
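If the additional genres matter, an alternative is to keep all of them as indicator columns. This is a minimal sketch using pandas' Series.str.get_dummies on made-up rows, assuming the genres in listed_in are separated by a comma and a space.

```python
import pandas as pd

# Made-up rows with one or more genres per title
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, International Movies", "Comedies", "Dramas, Comedies"],
})

# One 0/1 column per genre; a title can have several columns set to 1
genre_flags = sample["listed_in"].str.get_dummies(sep=", ")

# Number of titles tagged with each genre
genre_totals = genre_flags.sum().sort_values(ascending=False)
print(genre_totals)
```

This counts a title under every genre it is tagged with instead of only the first.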

Encoding Categorical Variables

netflix = pd.get_dummies(netflix, columns=['first_genre', 'country'], drop_first=True)

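This converts the two categorical columns into numeric indicator columns. Note that get_dummies replaces the original first_genre and country columns, and that a column with many distinct values produces an equally large number of new columns.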
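One common way to keep the number of indicator columns manageable is to group rare categories into a single bucket before encoding. The sketch below is an assumption about how that might look; the cutoff top_n and the column name country_grouped are hypothetical, and the rows are made up.

```python
import pandas as pd

# Made-up rows with two frequent and two rare countries
sample = pd.DataFrame({
    "country": ["United States", "India", "United States", "Japan", "India", "Brazil"],
})

top_n = 2  # hypothetical cutoff: keep the two most frequent countries
top_countries = sample["country"].value_counts().nlargest(top_n).index

# Keep frequent countries as they are and label everything else "Other"
sample["country_grouped"] = sample["country"].where(
    sample["country"].isin(top_countries), "Other"
)

encoded = pd.get_dummies(sample, columns=["country_grouped"])
print(encoded.columns.tolist())
```

The original country column stays in place, so later steps can still use it.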

Advanced Visualizations

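Titles by Release Year

netflix.groupby('release_year').size().plot(kind='line', figsize=(12,6))
plt.title('Number of Netflix Titles by Release Year')
plt.show()

Most titles in the catalog were released in recent years. Because release_year records when a title was first released, not when Netflix added it, this plot shows the age of the catalog rather than the pace of additions.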
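To measure when titles joined the catalog rather than when they were first released, the date the title was added would be needed. The following sketch assumes the CSV has a date_added column holding strings such as "September 25, 2021"; both the column and its format are assumptions here, and the rows are made up.

```python
import pandas as pd

# Made-up rows; one value has a stray leading space and one is missing
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date_added": ["September 25, 2021", " August 1, 2020", "January 5, 2021", None],
})

# Parse the text dates; anything that does not match the format becomes NaT
added = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Count titles per year added, ignoring rows without a usable date
added_per_year = added.dt.year.dropna().astype(int).value_counts().sort_index()
print(added_per_year)
```

Plotting added_per_year as a line chart would then show additions over time.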

Top 10 Directors

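# Skip the 'Unknown' placeholder used for missing directors so it does not top the chart
known_directors = netflix.loc[netflix['director'] != 'Unknown', 'director']
known_directors.value_counts().head(10).plot(kind='bar')
plt.title('Top 10 Most Frequent Netflix Directors')
plt.show()

Some directors appear many times in Netflix’s content library.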

Conclusion

EDA is a fundamental step before building machine learning models or drawing insights. We identified and filled missing values, explored distributions and relationships, and engineered new features. This structured approach helps us understand the data before further analysis. As next steps, we can proceed with content recommendation models or predictive analytics.
