saturn·

ml 32m links

source /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/links.csv · 87,585 rows · 3 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a movie ID linkage table with 87,585 rows and 3 numeric identifier columns: movieId, imdbId, and tmdbId. As expected for ID columns, movieId and imdbId are unique per row, while tmdbId is nearly unique (87,425 distinct values) with a tiny null rate (0.0014, or 124 rows). The most notable shape is imdbId, which is heavily right-skewed (skew 2.24) with about 8.6% of values flagged as outliers, reflecting how IMDb IDs span a huge range from 1 up to ~29M. tmdbId is also right-skewed but more moderately (skew 1.25), while movieId is spread fairly evenly between 1 and 292,757 (skew -0.39). There's little analytical signal here beyond confirming the file is a clean ID crosswalk.

citing: row_count · column_count · columns.imdbId.stats.skew · columns.imdbId.stats.outlier_rate · columns.imdbId.stats.max · columns.tmdbId.stats.skew · columns.tmdbId.null_rate · columns.movieId.stats.max · columns.movieId.stats.skew
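The uniqueness and null claims in the summary can be re-checked with a few lines of standard-library Python. The sketch below is only illustrative: `summarize_ids` and the `SAMPLE` rows are hypothetical (the rows are not taken from links.csv), and it assumes empty fields represent nulls.

```python
import csv
import io

# Tiny hypothetical sample in the links.csv layout (NOT real rows from the file).
SAMPLE = """movieId,imdbId,tmdbId
1,0000101,11
2,0000102,12
5,0000103,
9,0000104,12
"""


def summarize_ids(csv_text):
    """Return the row count plus null and distinct counts for each ID column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {"row_count": len(rows)}
    for col in ("movieId", "imdbId", "tmdbId"):
        # Assumption: an empty field is how a null appears in the CSV.
        values = [row[col] for row in rows if row[col] != ""]
        summary[col] = {
            "nulls": len(rows) - len(values),
            "unique": len(set(values)),
        }
    return summary


summary = summarize_ids(SAMPLE)
print(summary)
```

To run the same check against the real file, the text of links.csv could be passed in place of `SAMPLE`; a column is a full key when `unique` equals `row_count` and `nulls` is 0.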

Schema

3 columns
Per-column summary.
column   type     nulls  unique  alerts
movieId  numeric  0.0%   87,585
imdbId   numeric  0.0%   87,585  high_skew, outliers
tmdbId   numeric  0.1%   87,425

movieId

numeric identifier
movieId is a fully unique integer key spanning 1 to 292,757 across all 87,585 rows, with no nulls or zeros. The range is about 3.3 times the row count, so the ID space is sparse (roughly 70% of the values in the range are unused), consistent with a catalogue identifier such as the MovieLens movie ID rather than a dense surrogate index. Distribution stats (mean 157,651, median 165,741, skew -0.39) are artefacts of the ID assignment and carry no analytical meaning.
Treatment: use as a join key to movie metadata; exclude from modelling features.
high · anthropic:claude-opus-4-7
n: 87,585
nulls: 0 (0.0%)
unique: 87,585
min: 1
max: 292,757
mean: 1.577e+05
median: 165,741
std: 7.901e+04
q1: 112,657
q3: 213,203
iqr: 100,546
skew: -0.3914
kurtosis: -0.5775
n_outliers: 0
outlier_rate: 0
zero_rate: 0
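The sparsity of the movieId space follows directly from the reported n, min, and max. A minimal worked calculation, using only those three figures from the stats above (the variable names are illustrative):

```python
# Figures taken from the movieId stats in this report.
n_rows = 87_585
min_id = 1
max_id = 292_757

id_space = max_id - min_id + 1   # number of possible IDs in the observed range
unused = id_space - n_rows       # IDs in the range that never appear
density = n_rows / id_space      # share of the range actually used

print(f"unused IDs: {unused:,}")
print(f"density: {density:.1%}")
```

Roughly three in ten IDs in the range are present, which is why movieId should be treated as a lookup key and never as a positional row index.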

imdbId

numeric identifier high_skew outliers
This is the IMDb title identifier: all 87,585 values are unique, with no nulls. Although flagged for skew (2.24) and 7,575 outliers (8.6% of rows), this is expected because IMDb IDs are issued roughly sequentially over time and span 1 to 29,081,098, so recent titles sit far out in the right tail. The distribution shape is an artefact of ID assignment, not a meaningful numeric signal.
Treatment: treat as a key for joins to IMDb metadata; do not use as a numeric feature.
high · anthropic:claude-opus-4-7
n: 87,585
nulls: 0 (0.0%)
unique: 87,585
min: 1
max: 2.908e+07
mean: 2.793e+06
median: 492,996
std: 4.279e+06
q1: 94,642
q3: 3.877e+06
iqr: 3.783e+06
skew: 2.243
kurtosis: 5.857
n_outliers: 7,575
outlier_rate: 0.08649
zero_rate: 0
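The report does not say which rule produces the outliers flag. Assuming it is the common 1.5 × IQR (Tukey fence) rule, which is an assumption and not something the report confirms, the thresholds implied by the rounded q1, q3, and iqr above would be roughly:

```python
# Quartiles as reported above; q3 and iqr are rounded in the report.
q1 = 94_642
q3 = 3.877e6
iqr = 3.783e6

# Assumption: the profiler uses 1.5 * IQR fences (not stated in the report).
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

print(f"upper fence ~ {upper_fence:,.0f}")
print(f"lower fence ~ {lower_fence:,.0f}")
```

Under that assumption the lower fence is negative, so no value can fall below it (the minimum is 1), and every flagged row would be a high-numbered, more recently issued ID above about 9.55 million.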

tmdbId

numeric foreign_key
This is almost certainly the TMDB movie identifier used for joins to The Movie Database. With 87,425 distinct values across 87,585 rows and 124 nulls (null rate 0.0014), it behaves as a near-unique key rather than a modelling feature: a few dozen non-null values repeat, so a join on it can match more than one row. Its numeric stats (range 2 to 1,186,337, skew 1.25) reflect ID assignment only.
Treatment: left-join on this ID, keeping rows where it is null; do not use as a numeric feature.
high · anthropic:claude-opus-4-7
n: 87,585
nulls: 124 (0.1%)
unique: 87,425
min: 2
max: 1.186e+06
mean: 2.414e+05
median: 139,272
std: 2.471e+05
q1: 46,836
q3: 381,693
iqr: 334,857
skew: 1.254
kurtosis: 0.9079
n_outliers: 2,200
outlier_rate: 0.02515
zero_rate: 0
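Because tmdbId has both nulls and a few repeated values, a left join from the links table should keep unmatched rows and tolerate several rows sharing one ID. A minimal standard-library sketch follows; `left_join_on_tmdb`, the `links` rows, and the `titles` lookup are all hypothetical and stand in for whatever TMDB metadata is actually joined.

```python
def left_join_on_tmdb(link_rows, tmdb_titles):
    """Left-join link rows to a {tmdbId: title} lookup, keeping unmatched rows."""
    joined = []
    for row in link_rows:
        raw = row["tmdbId"]
        # Parse to int so the key stays an identifier; a null stays None.
        tmdb_id = int(raw) if raw not in ("", None) else None
        joined.append({**row, "tmdbId": tmdb_id, "title": tmdb_titles.get(tmdb_id)})
    return joined


# Hypothetical rows: one null tmdbId and one repeated tmdbId.
links = [
    {"movieId": "1", "tmdbId": "11"},
    {"movieId": "2", "tmdbId": ""},
    {"movieId": "3", "tmdbId": "11"},
]
titles = {11: "Hypothetical Title"}  # stand-in for TMDB metadata

joined = left_join_on_tmdb(links, titles)
for row in joined:
    print(row)
```

All three link rows survive the join: the row with a null tmdbId gets no title, and the two rows sharing tmdbId 11 both receive the same metadata, so downstream counts should be taken per movieId rather than per tmdbId.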