Clustering NBA players using dimensionality reduction in R

Karat Sidhu
9 min read · Sep 17, 2022


Comparing PCA vs UMAP vs t-SNE for analyzing NBA draft data


This is a continuation of my series on learning different algorithms in R and learning more about the Tidymodels package and its various uses. In addition to Tidymodels, I will also try out the Rtsne package for the t-SNE algorithm.

About the Data

The data set used comes from Kaggle.com, a great source for finding all kinds of data sets for data visualization and analysis. The data contains various statistics for all the NBA players drafted into the league from 1989 to 2021. It is a fairly tidy data set and requires little to no data cleaning in most cases.

Note

A link to the original kaggle.com source can be found at the following URL:

https://www.kaggle.com/datasets/mattop/nba-draft-basketball-player-data-19892021


Motivation

As stated earlier, my primary motivation for this blog post is learning how to use various methods of clustering data to find relationships between different players and their careers, to look at how closely those careers might be related, and to see what potentially influences such relations.

Each of these algorithms/methods works in a slightly different, primarily unsupervised way, and should hopefully show differences in how players are related. Furthermore, these differences will also be method-dependent and hopefully interesting to a basketball or NBA fan.

Algorithms used

Some of the methods I hope to use in this blog post include:

  • PCA
  • UMAP
  • T-SNE

Packages Used

For all of the aforementioned algorithms, R code was used to carry out the analysis.

Some of the R-packages used for data analysis as well as data modelling include:

  • tidymodels — for PCA and UMAP
  • tidyverse — for data wrangling and data tidying
  • Rtsne — for t-SNE

Data Analysis

Loading and Looking at the data

Data is available in CSV format, and can be read and inspected using the readr package in R.

Loading Libraries

Adding all the libraries needed for the analysis in a single place. The main libraries used are mentioned in the previous section.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.0 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.1 ✔ rsample 1.1.0
✔ dials 1.0.0 ✔ tune 1.0.0
✔ infer 1.0.3 ✔ workflows 1.0.0
✔ modeldata 1.0.0 ✔ workflowsets 1.0.0
✔ parsnip 1.0.1 ✔ yardstick 1.0.0
✔ recipes 1.0.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(tidytext)
library(Rtsne)
library(embed)

Loading the CSV file

The data files were downloaded from the Kaggle repo and saved in the CSV format locally for ease of use. All original data files are also available on the Kaggle repo linked above.

nba <- read_csv("/Users/karatatiwantsinghsidhu/Documents/Code/karat_codes/posts/nba-clustering/data/nbaplayersdraft.csv")

Rows: 1922 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, player, college
dbl (21): id, year, rank, overall_pick, years_active, games, minutes_played,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba |> as_tibble()

# A tibble: 1,922 × 24
id year rank overall…¹ team player college years…² games minut…³ points
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1989 1 1 SAC Pervi… Louisv… 11 474 11593 4494
2 2 1989 2 2 LAC Danny… Duke 13 917 18133 6439
3 3 1989 3 3 SAS Sean … Arizona 12 742 24502 10544
4 4 1989 4 4 MIA Glen … Michig… 15 1000 34985 18336
5 5 1989 5 5 CHH J.R. … UNC 11 672 15370 5680
6 6 1989 6 6 CHI Stace… Oklaho… 8 438 7406 2819
7 7 1989 7 7 IND Georg… Florid… 12 766 17429 6925
8 8 1989 8 8 DAL Randy… Louisi… 5 281 5382 2083
9 9 1989 9 9 WSB Tom H… Georgi… 12 687 10419 3617
10 10 1989 10 10 MIN Pooh … UCLA 10 639 19399 7083
# … with 1,912 more rows, 13 more variables: total_rebounds <dbl>,
# assists <dbl>, field_goal_percentage <dbl>, `3_point_percentage` <dbl>,
# free_throw_percentage <dbl>, average_minutes_played <dbl>,
# points_per_game <dbl>, average_total_rebounds <dbl>, average_assists <dbl>,
# win_shares <dbl>, win_shares_per_48_minutes <dbl>, box_plus_minus <dbl>,
# value_over_replacement <dbl>, and abbreviated variable names ¹​overall_pick,
# ²​years_active, ³​minutes_played

Data Cleaning & Exploratory Analysis

Clean the data set and remove the unneeded columns.

Before cleaning, let's look in a bit more detail at how each column is labelled.

colnames(nba)

 [1] "id"                        "year"
[3] "rank" "overall_pick"
[5] "team" "player"
[7] "college" "years_active"
[9] "games" "minutes_played"
[11] "points" "total_rebounds"
[13] "assists" "field_goal_percentage"
[15] "3_point_percentage" "free_throw_percentage"
[17] "average_minutes_played" "points_per_game"
[19] "average_total_rebounds" "average_assists"
[21] "win_shares" "win_shares_per_48_minutes"
[23] "box_plus_minus" "value_over_replacement"

The data is already in a tidy, one-row-per-player format, so we don't need to reshape the columns or rows and can use the given dataset as is. Furthermore, we don't need the id or college columns in the analysis.

nba <- nba  |> select(-c(id, college))

Remove all rows with missing values in the data

nba <- nba  |> na.omit()

The values are now ready to be transformed using each of the three different algorithms.

Data Modelling

Set seed

set.seed(123)

PCA

Prep the recipe using the recipes package in Tidymodels. Normalize all the columns not used as identifiers in the algorithm.

pca_rec <- recipe(~., data = nba) |> # what data to use
  update_role(player, team, new_role = "id") |>
  step_normalize(all_predictors()) |> # normalize all other columns
  step_pca(all_predictors()) # pca for all other columns

pca_prep <- prep(pca_rec)
pca_prep

Recipe
Inputs:
          role #variables
            id          2
     predictor         20

Training data contained 1529 data points and no missing data.

Operations:
Centering and scaling for year, rank, overall_pick, years_active, games, ... [trained]
PCA extraction with year, rank, overall_pick, years_active, games, m... [trained]
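Before plotting the loadings, it can be worth checking how much variance the leading components actually capture. A minimal sketch, using the `type = "variance"` option that recipes provides when tidying a trained `step_pca`:

```r
# variance accounted for by each principal component
pca_variance <- tidy(pca_prep, 2, type = "variance")

# percent of total variance captured by the first few components
pca_variance |>
  filter(terms == "percent variance") |>
  slice_head(n = 5)
```

If the first two components carry only a modest share of the variance, the 2-D biplots below should be read as a rough summary rather than a full picture of the data.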

Tidied the data

tidied_pca <- tidy(pca_prep, 2)
tidied_pca

# A tibble: 400 × 4
terms value component id
<chr> <dbl> <chr> <chr>
1 year 0.0618 PC1 pca_EoYnc
2 rank 0.172 PC1 pca_EoYnc
3 overall_pick 0.172 PC1 pca_EoYnc
4 years_active -0.263 PC1 pca_EoYnc
5 games -0.277 PC1 pca_EoYnc
6 minutes_played -0.293 PC1 pca_EoYnc
7 points -0.291 PC1 pca_EoYnc
8 total_rebounds -0.265 PC1 pca_EoYnc
9 assists -0.250 PC1 pca_EoYnc
10 field_goal_percentage -0.125 PC1 pca_EoYnc
# … with 390 more rows

Plot the Principal Components

tidied_pca |> 
filter(
component == "PC1" |
component == "PC2" |
component == "PC3" |
component == "PC4"
) |>
mutate(component = fct_inorder(component)) |>
ggplot(aes(value, terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
hrbrthemes::theme_ipsum() +
labs(y = NULL)

PCA Contributions for the first 2 Principal Components.

tidied_pca |> 
filter(component %in% paste0("PC", 1:2)) |>
group_by(component) |>
top_n(8, abs(value)) |>
ungroup() |>
mutate(terms = reorder_within(terms, abs(value), component)) |>
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y") +
scale_y_reordered() +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
) +
hrbrthemes::theme_ipsum()

PCA Biplot:

juice(pca_prep) |> 
ggplot(aes(PC1, PC2, label = player)) +
geom_point(aes(color = team),alpha = 0.7, size = 2) +
ggrepel::geom_text_repel(max.overlaps = 40) +
labs(color = NULL) + hrbrthemes::theme_ipsum()
Warning: ggrepel: 1497 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Better version:
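One way to make the biplot more readable is to label only a handful of standout players instead of every point. A sketch of that idea: join `win_shares` back onto the PCA scores and label only long, productive careers (the 100 win-share cutoff is an arbitrary choice for illustration):

```r
# re-attach win_shares to the PCA scores via the player id column
pca_scores <- juice(pca_prep) |>
  left_join(nba |> select(player, win_shares), by = "player")

pca_scores |>
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = team), alpha = 0.7, size = 2, show.legend = FALSE) +
  ggrepel::geom_text_repel(
    data = filter(pca_scores, win_shares > 100), # label only standout careers
    aes(label = player),
    size = 3
  ) +
  hrbrthemes::theme_ipsum()
```

Since only a few dozen players clear the cutoff, ggrepel no longer has to drop most of the labels.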

UMAP

UMAP Prep

Similar to the PCA analysis, the UMAP recipe is prepped using the recipes package in Tidymodels.

umap_rec <- recipe(~., data = nba) |>
update_role(player, team, new_role = "id") |>
step_normalize(all_predictors()) |>
step_umap(all_predictors())
umap_prep <- prep(umap_rec)
umap_prep

Recipe
Inputs:
          role #variables
            id          2
     predictor         20

Training data contained 1529 data points and no missing data.

Operations:
Centering and scaling for year, rank, overall_pick, years_active, games, ... [trained]
UMAP embedding for year, rank, overall_pick, years_active, games,... [trained]
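UMAP's output is quite sensitive to its hyperparameters. `step_umap()` from the embed package exposes `neighbors` and `min_dist`, which control how local or global the embedding is; the recipe above uses their defaults. A sketch of the same recipe with those knobs made explicit (the specific values here are illustrative, not tuned):

```r
# larger `neighbors` favours global structure over local detail;
# larger `min_dist` spreads points further apart in the embedding
umap_rec_tuned <- recipe(~., data = nba) |>
  update_role(player, team, new_role = "id") |>
  step_normalize(all_predictors()) |>
  step_umap(all_predictors(), neighbors = 30, min_dist = 0.1)
```

Re-prepping and juicing this recipe the same way as above would show how much the cluster shapes depend on those settings.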

Looking at the UMAP biplot

juice(umap_prep) |> 
ggplot(aes(UMAP1, UMAP2, label = player)) +
geom_point(aes(color = team), alpha = 0.7, size = 2) +
ggrepel::geom_text_repel(max.overlaps = 75) +
labs(color = NULL) + hrbrthemes::theme_ipsum()
Warning: ggrepel: 1481 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Slightly better version of the same graph:

TSNE

t-SNE in R works slightly differently from the previous two methods. Since the package used is not Tidymodels in this case, we use a slightly different approach.

First, we remove every non-numeric column from the data set. These values can, however, be re-attached once t-SNE has been applied to the data set.

nba_tsne <- nba  |> select(-c(team, player))

Carrying out the TSNE.

tsne <- Rtsne(nba_tsne, 
perplexity = 30,
eta = 100,
max_iter = 2000)
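Two practical notes on `Rtsne()`: by default it errors out if the input contains exact duplicate rows (`check_duplicates = TRUE`), and perplexity must be small relative to the number of observations (roughly 3 × perplexity < n − 1). A sketch of the same call with both made explicit; since dropping duplicate rows would desynchronize the `team`/`player` vectors extracted below, `check_duplicates = FALSE` is used here instead:

```r
# sanity check: perplexity must be small relative to the row count
stopifnot(3 * 30 < nrow(nba_tsne) - 1)

tsne <- Rtsne(
  nba_tsne,
  perplexity = 30,          # neighbourhood size: local vs. global structure
  eta = 100,                # learning rate
  max_iter = 2000,
  check_duplicates = FALSE  # keep row order aligned with nba$player
)
```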

Extract the required data from the results object. Add additional columns corresponding to the non-numeric values from the original data.

Y <- as.data.frame(tsne$Y)
teams <- nba$team
players <- nba$player

Plot the T-SNE results biplot with the first two components.

ggplot(Y, aes(x = V1, y = V2, label = players)) +
  geom_point(aes(color = teams)) +
  labs(x = "tsne-1", y = "tsne-2", color = "team") +
  ggrepel::geom_text_repel(max.overlaps = 60) +
  hrbrthemes::theme_ipsum()
Warning: ggrepel: 1521 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Slightly better version of the same graph:


Looking at the shape of TSNE 1 vs TSNE 2, the embedding appears close to a normal distribution.

Results and Conclusion

Three different results are available, one for each of the algorithms through which the data was run. Using a smaller data set, or knowing even more information such as player positions (PF, C, PG etc.), could further help with the results and with establishing a pattern of why players are linked or clustered together.

Some really interesting results emerge when you look into the details of each of the corresponding clusters. Look at the PCA results, and you see a lot of great modern point guards like Chris Paul, Steve Nash, Jason Kidd and Stephen Curry appear very close to each other. Similarly, great centers sit close to each other in the same results, and a similar trend is observed for a lot of other positions. However, this blog post was mainly an exercise in the process involved in getting to the results rather than in what the results actually mean.

Nevertheless, it was an interesting exercise for me personally, particularly for trying out t-SNE in R for the first time.

References

Original Dataset: Kaggle

T-SNE in R: Andrew Couch Youtube

PCA and UMAP using Tidymodels: Blog Post

TSNE Clearly Explained: Statquest

Note: Some of the detailed graphs look slightly different due to a different seed being used.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{sidhu2022,
author = {Karat Sidhu},
title = {Clustering {NBA} Players Using Dimensionality Reduction in
{R}},
date = {2022-09-18},
url = {https://karatsidhu.com/posts/nba-clustering/nba-clustering.html},
langid = {en}
}

For attribution, please cite this work as:

Karat Sidhu. 2022. “Clustering NBA Players Using Dimensionality Reduction in R.” September 18, 2022. https://karatsidhu.com/posts/nba-clustering/nba-clustering.html.
