Clustering NBA players using dimensionality reduction in R

Karat Sidhu
9 min read · Sep 17, 2022


Comparing PCA vs UMAP vs t-SNE for analyzing NBA draft data


This is a continuation of my series on learning different algorithms in R and learning more about the Tidymodels package and its various uses. In addition to Tidymodels, I will also try out the Rtsne package for the t-SNE algorithm.

About the Data

The data set used comes from Kaggle.com, a great source for finding all kinds of data sets for data visualization and analysis. The data contains various statistics for all the NBA players drafted into the league from 1989 to 2021. It is a fairly tidy data set and requires little to no data cleaning in most cases.

Note

A link to the original kaggle.com source can be found at the following URL:

https://www.kaggle.com/datasets/mattop/nba-draft-basketball-player-data-19892021


Motivation

As stated earlier, my primary motivation for this blog post is learning how to use various methods of clustering data to find relationships between different players and their careers, to look at how closely those careers might be related, and to see what potentially influences such relations.

Each of these algorithms/methods works in a slightly different, primarily unsupervised way, and should hopefully show differences in how players are related. Furthermore, these differences will also be method-dependent and hopefully interesting to a basketball or NBA fan.

Algorithms used

Some of the methods I hope to use in this blog post include:

  • PCA
  • UMAP
  • T-SNE

Packages Used

For all of the aforementioned algorithms, R code was used to carry out the analysis.

Some of the R-packages used for data analysis as well as data modelling include:

  • tidymodels — for PCA and UMAP
  • tidyverse — for data wrangling and data tidying
  • Rtsne — for t-SNE

Data Analysis

Loading and Looking at the data

Data is available in CSV format, and can be read and inspected using the readr package in R.

Loading Libraries

Adding all the libraries needed for the analysis in a single place. The main libraries used are mentioned in the previous section.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.0 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.1 ✔ rsample 1.1.0
✔ dials 1.0.0 ✔ tune 1.0.0
✔ infer 1.0.3 ✔ workflows 1.0.0
✔ modeldata 1.0.0 ✔ workflowsets 1.0.0
✔ parsnip 1.0.1 ✔ yardstick 1.0.0
✔ recipes 1.0.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(tidytext)
library(Rtsne)
library(embed)

Loading the CSV file

The data files were downloaded from the Kaggle repo and saved in the CSV format locally for ease of use. All original data files are also available on the Kaggle repo linked above.

nba <- read_csv("/Users/karatatiwantsinghsidhu/Documents/Code/karat_codes/posts/nba-clustering/data/nbaplayersdraft.csv")

Rows: 1922 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, player, college
dbl (21): id, year, rank, overall_pick, years_active, games, minutes_played,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba |> as_tibble()

# A tibble: 1,922 × 24
id year rank overall…¹ team player college years…² games minut…³ points
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1989 1 1 SAC Pervi… Louisv… 11 474 11593 4494
2 2 1989 2 2 LAC Danny… Duke 13 917 18133 6439
3 3 1989 3 3 SAS Sean … Arizona 12 742 24502 10544
4 4 1989 4 4 MIA Glen … Michig… 15 1000 34985 18336
5 5 1989 5 5 CHH J.R. … UNC 11 672 15370 5680
6 6 1989 6 6 CHI Stace… Oklaho… 8 438 7406 2819
7 7 1989 7 7 IND Georg… Florid… 12 766 17429 6925
8 8 1989 8 8 DAL Randy… Louisi… 5 281 5382 2083
9 9 1989 9 9 WSB Tom H… Georgi… 12 687 10419 3617
10 10 1989 10 10 MIN Pooh … UCLA 10 639 19399 7083
# … with 1,912 more rows, 13 more variables: total_rebounds <dbl>,
# assists <dbl>, field_goal_percentage <dbl>, `3_point_percentage` <dbl>,
# free_throw_percentage <dbl>, average_minutes_played <dbl>,
# points_per_game <dbl>, average_total_rebounds <dbl>, average_assists <dbl>,
# win_shares <dbl>, win_shares_per_48_minutes <dbl>, box_plus_minus <dbl>,
# value_over_replacement <dbl>, and abbreviated variable names ¹​overall_pick,
# ²​years_active, ³​minutes_played

Data Cleaning & Exploratory Analysis

Clean the data set and remove the unneeded columns.

Before cleaning, let's look in a bit more detail at how each column is labelled.

colnames(nba)

 [1] "id"                        "year"
[3] "rank" "overall_pick"
[5] "team" "player"
[7] "college" "years_active"
[9] "games" "minutes_played"
[11] "points" "total_rebounds"
[13] "assists" "field_goal_percentage"
[15] "3_point_percentage" "free_throw_percentage"
[17] "average_minutes_played" "points_per_game"
[19] "average_total_rebounds" "average_assists"
[21] "win_shares" "win_shares_per_48_minutes"
[23] "box_plus_minus" "value_over_replacement"

The data is already in a tidy, one-row-per-player format, so we don't need to reshape the columns or rows and can use the given dataset as is. Furthermore, we don't need the id or college columns in the analysis.

nba <- nba  |> select(-c(id, college))

Remove all rows with missing values in the data

nba <- nba  |> na.omit()

The values are now ready to be transformed using each of the three different algorithms.

Data Modelling

Set seed

set.seed(123)

PCA

Prep the recipe using the recipes package in Tidymodels. Normalize all the columns not used as identifiers in the algorithm.

pca_rec <- recipe(~., data = nba) |> # what data to use
  update_role(player, team, new_role = "id") |>
  step_normalize(all_predictors()) |> # normalize all other columns
  step_pca(all_predictors()) # pca for all other columns

pca_prep <- prep(pca_rec)
pca_prep

Recipe
Inputs:
          role #variables
            id          2
     predictor         20

Training data contained 1529 data points and no missing data.

Operations:
Centering and scaling for year, rank, overall_pick, years_active, games, ... [trained]
PCA extraction with year, rank, overall_pick, years_active, games, m... [trained]
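Before plotting the loadings, it can be worth checking how much variance the leading components actually capture. A minimal sketch, using the `type = "variance"` option that recipes provides when tidying a trained `step_pca`:

```r
# variance accounted for by each principal component
pca_variance <- tidy(pca_prep, 2, type = "variance")

# percent of total variance captured by the first few components
pca_variance |>
  filter(terms == "percent variance") |>
  slice_head(n = 5)
```

If the first two components carry only a modest share of the variance, the 2-D biplots below should be read as a rough summary rather than a full picture of the data.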

Tidied the data

tidied_pca <- tidy(pca_prep, 2)
tidied_pca

# A tibble: 400 × 4
terms value component id
<chr> <dbl> <chr> <chr>
1 year 0.0618 PC1 pca_EoYnc
2 rank 0.172 PC1 pca_EoYnc
3 overall_pick 0.172 PC1 pca_EoYnc
4 years_active -0.263 PC1 pca_EoYnc
5 games -0.277 PC1 pca_EoYnc
6 minutes_played -0.293 PC1 pca_EoYnc
7 points -0.291 PC1 pca_EoYnc
8 total_rebounds -0.265 PC1 pca_EoYnc
9 assists -0.250 PC1 pca_EoYnc
10 field_goal_percentage -0.125 PC1 pca_EoYnc
# … with 390 more rows

Plot the Principal Components

tidied_pca |> 
filter(
component == "PC1" |
component == "PC2" |
component == "PC3" |
component == "PC4"
) |>
mutate(component = fct_inorder(component)) |>
ggplot(aes(value, terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
hrbrthemes::theme_ipsum() +
labs(y = NULL)

PCA Contributions for the first 2 Principal Components.

tidied_pca |> 
filter(component %in% paste0("PC", 1:2)) |>
group_by(component) |>
top_n(8, abs(value)) |>
ungroup() |>
mutate(terms = reorder_within(terms, abs(value), component)) |>
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y") +
scale_y_reordered() +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
) +
hrbrthemes::theme_ipsum()

PCA Biplot:

juice(pca_prep) |> 
ggplot(aes(PC1, PC2, label = player)) +
geom_point(aes(color = team),alpha = 0.7, size = 2) +
ggrepel::geom_text_repel(max.overlaps = 40) +
labs(color = NULL) + hrbrthemes::theme_ipsum()
Warning: ggrepel: 1497 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Better version:
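One way to make the biplot more readable is to label only a handful of standout players instead of every point. A sketch of that idea: join `win_shares` back onto the PCA scores and label only long, productive careers (the 100 win-share cutoff is an arbitrary choice for illustration):

```r
# re-attach win_shares to the PCA scores via the player id column
pca_scores <- juice(pca_prep) |>
  left_join(nba |> select(player, win_shares), by = "player")

pca_scores |>
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = team), alpha = 0.7, size = 2, show.legend = FALSE) +
  ggrepel::geom_text_repel(
    data = filter(pca_scores, win_shares > 100), # label only standout careers
    aes(label = player),
    size = 3
  ) +
  hrbrthemes::theme_ipsum()
```

Since only a few dozen players clear the cutoff, ggrepel no longer has to drop most of the labels.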

UMAP

UMAP Prep

Similar to the PCA analysis, the UMAP recipe is prepped using the recipes package in Tidymodels.

umap_rec <- recipe(~., data = nba) |>
update_role(player, team, new_role = "id") |>
step_normalize(all_predictors()) |>
step_umap(all_predictors())
umap_prep <- prep(umap_rec)
umap_prep

Recipe
Inputs:
          role #variables
            id          2
     predictor         20

Training data contained 1529 data points and no missing data.

Operations:
Centering and scaling for year, rank, overall_pick, years_active, games, ... [trained]
UMAP embedding for year, rank, overall_pick, years_active, games,... [trained]
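UMAP's output is quite sensitive to its hyperparameters. `step_umap()` from the embed package exposes `neighbors` and `min_dist`, which control how local or global the embedding is; the recipe above uses their defaults. A sketch of the same recipe with those knobs made explicit (the specific values here are illustrative, not tuned):

```r
# larger `neighbors` favours global structure over local detail;
# larger `min_dist` spreads points further apart in the embedding
umap_rec_tuned <- recipe(~., data = nba) |>
  update_role(player, team, new_role = "id") |>
  step_normalize(all_predictors()) |>
  step_umap(all_predictors(), neighbors = 30, min_dist = 0.1)
```

Re-prepping and juicing this recipe the same way as above would show how much the cluster shapes depend on those settings.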

Looking at the UMAP biplot

juice(umap_prep) |> 
ggplot(aes(UMAP1, UMAP2, label = player)) +
geom_point(aes(color = team), alpha = 0.7, size = 2) +
ggrepel::geom_text_repel(max.overlaps = 75) +
labs(color = NULL) + hrbrthemes::theme_ipsum()
Warning: ggrepel: 1481 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Slightly better version of the same graph:

TSNE

t-SNE in R works slightly differently from the previous two methods. Since the package used is not Tidymodels in this case, we use a slightly different approach.

First, we remove every non-numeric column from the data set. These values can, however, be re-attached once t-SNE has been applied to the data set.

nba_tsne <- nba  |> select(-c(team, player))

Carrying out the TSNE.

tsne <- Rtsne(nba_tsne, 
perplexity = 30,
eta = 100,
max_iter = 2000)
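Two practical notes on `Rtsne()`: by default it errors out if the input contains exact duplicate rows (`check_duplicates = TRUE`), and perplexity must be small relative to the number of observations (roughly 3 × perplexity < n − 1). A sketch of the same call with both made explicit; since dropping duplicate rows would desynchronize the `team`/`player` vectors extracted below, `check_duplicates = FALSE` is used here instead:

```r
# sanity check: perplexity must be small relative to the row count
stopifnot(3 * 30 < nrow(nba_tsne) - 1)

tsne <- Rtsne(
  nba_tsne,
  perplexity = 30,          # neighbourhood size: local vs. global structure
  eta = 100,                # learning rate
  max_iter = 2000,
  check_duplicates = FALSE  # keep row order aligned with nba$player
)
```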

Extract the required data from the results object. Add additional columns corresponding to the non-numeric values from the original data.

Y <- as.data.frame(tsne$Y)
teams <- nba$team
players <- nba$player

Plot the T-SNE results biplot with the first two components.

ggplot(Y, aes(x = V1, y = V2, label = players)) +
  geom_point(aes(color = teams)) +
  labs(x = "tsne-1", y = "tsne-2", color = "team") +
  ggrepel::geom_text_repel(max.overlaps = 60) +
  hrbrthemes::theme_ipsum()
Warning: ggrepel: 1521 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Slightly better version of the same graph:


Looking at the shape of TSNE 1 vs TSNE 2, the embedding appears close to a normal distribution.

Results and Conclusion

Three different results are available, one for each of the algorithms through which the data was run. Using a smaller data set, or knowing even more information such as player positions (PF, C, PG etc.), could further help with the results and with establishing a pattern of why players are linked or clustered together.

Some really interesting results emerge when you look into the details of each of the corresponding clusters. Look at the PCA results, and you see a lot of great modern point guards like Chris Paul, Steve Nash, Jason Kidd and Stephen Curry appear very close to each other. Similarly, great centers sit close to each other in the same results, and a similar trend is observed for a lot of other positions. However, this blog post was mainly an exercise in the process involved in getting to the results rather than in what the results actually mean.

Nevertheless, it was an interesting exercise for me personally, particularly for trying out t-SNE in R for the first time.

References

Original Dataset: Kaggle

T-SNE in R: Andrew Couch Youtube

PCA and UMAP using Tidymodels: Blog Post

TSNE Clearly Explained: Statquest

Note: Some of the detailed graphs look slightly different due to a different seed being used.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{sidhu2022,
author = {Karat Sidhu},
title = {Clustering {NBA} Players Using Dimensionality Reduction in
{R}},
date = {2022-09-18},
url = {https://karatsidhu.com/posts/nba-clustering/nba-clustering.html},
langid = {en}
}

For attribution, please cite this work as:

Karat Sidhu. 2022. “Clustering NBA Players Using Dimensionality Reduction in R.” September 18, 2022. https://karatsidhu.com/posts/nba-clustering/nba-clustering.html.
