saturn·

ml 32m links

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/links.csv

Saturn profiled 87,585 rows across 3 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

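[2]:
# Install the profiler (IPython shell escape; run this cell inside a notebook).
!pip install saturn-dissect
import subprocess

# Profile the CSV, write the findings to JSON, and request the LLM-written prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/links.csv",
    "--findings", "ml-32m-links.json",
    "--llm", "anthropic:claude-opus-4-7",
])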

Summary confidence: high

This is a movie ID linkage table with 87,585 rows and 3 numeric identifier columns: movieId, imdbId, and tmdbId. As expected for ID columns, movieId and imdbId are unique per row, while tmdbId is nearly unique (87,425 distinct values) with a tiny null rate (0.0014, or 124 rows). The most notable shape is imdbId, which is heavily right-skewed (skew 2.24) with about 8.6% of rows flagged as outliers, reflecting how IMDb IDs span a huge range from 1 up to ~29M. tmdbId is also right-skewed but more moderately (skew 1.25), while movieId is mildly left-skewed (skew -0.39) and tops out at 292,757. There's little analytical signal here beyond confirming the file is a clean ID crosswalk.

citing: row_count · column_count · columns.imdbId.stats.skew · columns.imdbId.stats.outlier_rate · columns.imdbId.stats.max · columns.tmdbId.stats.skew · columns.tmdbId.null_rate · columns.movieId.stats.max · columns.movieId.stats.skew
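The citing line reads like a list of dotted paths into the findings file. As a minimal sketch, assuming ml-32m-links.json is nested JSON whose keys mirror those paths (the layout is an assumption, and `resolve` is a hypothetical helper, not part of saturn-dissect), each cited stat could be looked up like this:

```python
from functools import reduce

def resolve(findings: dict, dotted_path: str):
    """Walk a nested dict along a dotted path such as 'columns.imdbId.stats.skew'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)

# Hypothetical excerpt: values are copied from this report, the nesting is assumed.
findings = {
    "row_count": 87585,
    "column_count": 3,
    "columns": {
        "imdbId": {"stats": {"skew": 2.243, "outlier_rate": 0.08649}},
        "tmdbId": {"null_rate": 0.0014, "stats": {"skew": 1.254}},
        "movieId": {"stats": {"max": 292757, "skew": -0.3914}},
    },
}

for path in ("row_count", "columns.imdbId.stats.skew", "columns.tmdbId.null_rate"):
    print(path, "=", resolve(findings, path))
```

A missing key raises KeyError, which is probably what you want when checking that every cited stat actually exists.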

Out[4]:

saturn.schema() · 3 columns

column   kind     n       null%  unique  alerts
movieId  numeric  87,585  0.0%   87,585
imdbId   numeric  87,585  0.0%   87,585  high_skew, outliers
tmdbId   numeric  87,585  0.1%   87,425
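The n, null%, and unique columns can be cross-checked without Saturn. The sketch below uses only the standard library and made-up rows with the same header as links.csv. It assumes that an empty field counts as null and that `unique` counts distinct non-null values; whether Saturn defines them that way is not stated in this report.

```python
import csv
import io

# Made-up rows with the links.csv header; not taken from the real file.
SAMPLE = """movieId,imdbId,tmdbId
10,0000101,501
20,0000202,502
30,0000303,
40,0000404,502
"""

def schema(csv_text: str) -> dict:
    """Return {column: {n, null_pct, unique}} for a small CSV held in memory."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for col in reader.fieldnames:
        values = [row[col] for row in rows]
        present = [v for v in values if v != ""]  # assumption: empty string == null
        summary[col] = {
            "n": len(values),  # n includes null rows, as in the table above
            "null_pct": 100 * (len(values) - len(present)) / len(values),
            "unique": len(set(present)),
        }
    return summary

for col, stats in schema(SAMPLE).items():
    print(col, stats)
```

For the real file, the same function could be fed the file contents; at 87,585 rows it fits comfortably in memory.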
Fig 1.
imdbId · Heavy right skew and a long tail up to ~29M; the median (492,996) sits far below the mean (about 2.79M).
Histogram bins for imdbId (median: 492996.0).
bin                      count
1 – 7.27e+05             44057
7.27e+05 – 1.454e+06     7224
1.454e+06 – 2.181e+06    6540
2.181e+06 – 2.908e+06    3749
2.908e+06 – 3.635e+06    3128
3.635e+06 – 4.362e+06    2659
4.362e+06 – 5.089e+06    2286
5.089e+06 – 5.816e+06    2278
5.816e+06 – 6.543e+06    2216
6.543e+06 – 7.27e+06     1712
7.27e+06 – 7.997e+06     1536
7.997e+06 – 8.724e+06    1290
8.724e+06 – 9.451e+06    1232
9.451e+06 – 1.018e+07    839
1.018e+07 – 1.091e+07    1095
1.091e+07 – 1.163e+07    930
1.163e+07 – 1.236e+07    678
1.236e+07 – 1.309e+07    601
1.309e+07 – 1.381e+07    674
1.381e+07 – 1.454e+07    572
1.454e+07 – 1.527e+07    531
1.527e+07 – 1.599e+07    391
1.599e+07 – 1.672e+07    156
1.672e+07 – 1.745e+07    96
1.745e+07 – 1.818e+07    79
1.818e+07 – 1.89e+07     83
1.89e+07 – 1.963e+07     56
1.963e+07 – 2.036e+07    120
2.036e+07 – 2.108e+07    85
2.108e+07 – 2.181e+07    173
2.181e+07 – 2.254e+07    135
2.254e+07 – 2.326e+07    58
2.326e+07 – 2.399e+07    55
2.399e+07 – 2.472e+07    36
2.472e+07 – 2.545e+07    18
2.545e+07 – 2.617e+07    15
2.617e+07 – 2.69e+07     65
2.69e+07 – 2.763e+07     59
2.763e+07 – 2.835e+07    52
2.835e+07 – 2.908e+07    26
Fig 2.
tmdbId · Moderately right-skewed distribution of TMDB IDs; three quarters of the values fall below about 382K (q3 = 381,693).
Histogram bins for tmdbId (median: 139272.0).
bin                      count
2 – 2.966e+04            13785
2.966e+04 – 5.932e+04    12892
5.932e+04 – 8.898e+04    8691
8.898e+04 – 1.186e+05    5333
1.186e+05 – 1.483e+05    4232
1.483e+05 – 1.78e+05     3219
1.78e+05 – 2.076e+05     2874
2.076e+05 – 2.373e+05    2446
2.373e+05 – 2.669e+05    2936
2.669e+05 – 2.966e+05    2854
2.966e+05 – 3.262e+05    2029
3.262e+05 – 3.559e+05    2342
3.559e+05 – 3.856e+05    2233
3.856e+05 – 4.152e+05    2131
4.152e+05 – 4.449e+05    2112
4.449e+05 – 4.745e+05    2038
4.745e+05 – 5.042e+05    1694
5.042e+05 – 5.339e+05    1579
5.339e+05 – 5.635e+05    1258
5.635e+05 – 5.932e+05    1452
5.932e+05 – 6.228e+05    1212
6.228e+05 – 6.525e+05    921
6.525e+05 – 6.821e+05    939
6.821e+05 – 7.118e+05    534
7.118e+05 – 7.415e+05    715
7.415e+05 – 7.711e+05    670
7.711e+05 – 8.008e+05    684
8.008e+05 – 8.304e+05    529
8.304e+05 – 8.601e+05    532
8.601e+05 – 8.898e+05    483
8.898e+05 – 9.194e+05    340
9.194e+05 – 9.491e+05    404
9.491e+05 – 9.787e+05    322
9.787e+05 – 1.008e+06    287
1.008e+06 – 1.038e+06    262
1.038e+06 – 1.068e+06    166
1.068e+06 – 1.097e+06    129
1.097e+06 – 1.127e+06    83
1.127e+06 – 1.157e+06    82
1.157e+06 – 1.186e+06    37
Fig 3.
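movieId · MovieLens IDs are spread unevenly across the range: a dense block below 7,320, an empty bin between about 14,640 and 21,960, and the bulk between roughly 117K and 220K, giving a mild left skew.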
Histogram bins for movieId (median: 165741.0).
bin                      count
1 – 7320                 7196
7320 – 1.464e+04         1111
1.464e+04 – 2.196e+04    0
2.196e+04 – 2.928e+04    1114
2.928e+04 – 3.66e+04     796
3.66e+04 – 4.391e+04     435
4.391e+04 – 5.123e+04    754
5.123e+04 – 5.855e+04    816
5.855e+04 – 6.587e+04    789
6.587e+04 – 7.319e+04    1132
7.319e+04 – 8.051e+04    1107
8.051e+04 – 8.783e+04    1398
8.783e+04 – 9.515e+04    1556
9.515e+04 – 1.025e+05    1541
1.025e+05 – 1.098e+05    1535
1.098e+05 – 1.171e+05    1847
1.171e+05 – 1.244e+05    2673
1.244e+05 – 1.317e+05    2763
1.317e+05 – 1.391e+05    3280
1.391e+05 – 1.464e+05    3282
1.464e+05 – 1.537e+05    3146
1.537e+05 – 1.61e+05     3258
1.61e+05 – 1.683e+05     3464
1.683e+05 – 1.757e+05    3492
1.757e+05 – 1.83e+05     3484
1.83e+05 – 1.903e+05     3486
1.903e+05 – 1.976e+05    3401
1.976e+05 – 2.049e+05    3369
2.049e+05 – 2.122e+05    3050
2.122e+05 – 2.196e+05    2907
2.196e+05 – 2.269e+05    2325
2.269e+05 – 2.342e+05    1839
2.342e+05 – 2.415e+05    1672
2.415e+05 – 2.488e+05    1531
2.488e+05 – 2.562e+05    1566
2.562e+05 – 2.635e+05    1679
2.635e+05 – 2.708e+05    1911
2.708e+05 – 2.781e+05    2168
2.781e+05 – 2.854e+05    2580
2.854e+05 – 2.928e+05    2132
Fig 4.
Per-column null rate across the corpus. Columns are ordered by input position.
column   kind     null %
movieId  numeric  0.0%
imdbId   numeric  0.0%
tmdbId   numeric  0.1%
Fig 5.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
         movieId  imdbId  tmdbId
movieId  +1.00    +0.51   +0.64
imdbId   +0.51    +1.00   +0.44
tmdbId   +0.64    +0.44   +1.00
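For reference, a Pearson coefficient like the ones in this matrix can be computed as below. This is a minimal standard-library sketch on made-up values. It drops any pair where either side is missing, which is an assumption: the report only says the correlation is "sampled, bounded" and does not say how Saturn handles the null tmdbId rows.

```python
import math

def pairwise_complete(xs, ys):
    """Keep only the positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up ID columns; None marks a missing tmdbId.
movie_ids = [10, 20, 30, 40, 50]
tmdb_ids = [501, 640, None, 700, 950]

xs, ys = pairwise_complete(movie_ids, tmdb_ids)
print(round(pearson(xs, ys), 2))
```

Because all three columns are identifiers, a positive coefficient here says only that larger IDs in one scheme tend to go with larger IDs in another; it is not a modelling signal.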

movieId numeric identifier

movieId is a fully unique integer key spanning 1 to 292,757 across all 87,585 rows with no nulls or zeros, consistent with an external catalogue identifier rather than a dense surrogate index. The range far exceeds the row count, indicating gaps in the ID space (likely a sparse catalogue like MovieLens). Distribution stats (mean 157,651, median 165,741, skew -0.39) are artefacts of the ID assignment and carry no analytical meaning.

Treatment: Use as a join key to movie metadata; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
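The "gaps in the ID space" remark can be made concrete with one ratio: the number of distinct IDs divided by the size of the range they span. The sketch below uses the n, min, and max reported for movieId; `id_density` is a hypothetical helper name.

```python
def id_density(n_unique: int, id_min: int, id_max: int) -> float:
    """Fraction of the integer range [id_min, id_max] that is actually used."""
    return n_unique / (id_max - id_min + 1)

# Values reported for movieId in this notebook.
density = id_density(87_585, 1, 292_757)
unused = (292_757 - 1 + 1) - 87_585

print(f"density = {density:.1%}, unused IDs = {unused:,}")
```

Roughly 30% of the range is occupied, leaving about 205,000 unused IDs, which is why movieId works as a join key but not as a dense row index.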
Out[11]:

saturn.columns["movieId"].stats

stat          value
n             87,585
nulls         0 (0.0%)
unique        87,585
min           1
max           292,757
mean          1.577e+05
median        165,741
std           7.901e+04
q1            112,657
q3            213,203
iqr           100,546
skew          -0.3914
kurtosis      -0.5775
n_outliers    0
outlier_rate  0
zero_rate     0
Fig 6.
Distribution of movieId. Vertical dash marks the median.
Histogram bins for movieId (median: 165741.0): same 40 bins as listed under Fig 3.

imdbId numeric identifier

This is the IMDb title identifier: all 87,585 rows are unique with no nulls, matching the row count exactly. Although flagged as skewed (2.24) with 7,575 outliers, that's expected since IMDb IDs are issued sequentially over time and span 1 to 29,081,098. The distribution shape is an artifact of ID assignment, not a meaningful numeric signal.

Treatment: Treat as a key for joins to IMDb metadata; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
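One practical wrinkle when joining on imdbId: the column is profiled here as a number, while IMDb's own title identifiers are strings of the form "tt" followed by at least seven digits. A minimal sketch of the conversion follows, assuming the IMDb-side table is keyed by that string form; `imdb_title_id` is a hypothetical helper.

```python
def imdb_title_id(imdb_id: int) -> str:
    """Render a numeric imdbId as an IMDb title key, zero-padded to 7 digits."""
    return f"tt{imdb_id:07d}"

# The minimum and maximum imdbId values reported in this notebook.
for value in (1, 29_081_098):
    print(value, "->", imdb_title_id(value))
```

The `07d` format pads short IDs but never truncates, so eight-digit IDs such as the maximum in this file pass through unchanged.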
Out[14]:

saturn.columns["imdbId"].stats

stat          value
n             87,585
nulls         0 (0.0%)
unique        87,585
min           1
max           2.908e+07
mean          2.793e+06
median        492,996
std           4.279e+06
q1            94,642
q3            3.877e+06
iqr           3.783e+06
skew          2.243
kurtosis      5.857
n_outliers    7,575
outlier_rate  0.08649
zero_rate     0
alert: high_skew  skew=+2.24
alert: outliers   8.6% rows beyond 1.5 IQR
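The outliers alert refers to the usual 1.5 × IQR rule (Tukey fences). As a sketch, assuming Saturn applies the standard form of that rule, the fences can be derived from the quartiles reported above. Note that q3 is shown rounded (3.877e+06), so the result is approximate.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(values, lower, upper):
    """Count values strictly outside the fences."""
    return sum(1 for v in values if v < lower or v > upper)

# Quartiles reported for imdbId (q3 is rounded in the report).
lower, upper = tukey_fences(94_642, 3.877e6)
print(f"lower fence = {lower:,.0f}, upper fence = {upper:,.0f}")

# Made-up values, only to show the counting step.
print(count_outliers([1, 500_000, 9_000_000, 12_000_000, 29_000_000], lower, upper))
```

The lower fence is negative, so with a minimum of 1 every flagged row lies in the upper tail, above roughly 9.55 million. That agrees with the histogram, where the bins starting at 9.451e+06 hold 7,678 rows, slightly more than the 7,575 reported outliers.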
Fig 7.
Distribution of imdbId. Vertical dash marks the median.
Histogram bins for imdbId (median: 492996.0): same 40 bins as listed under Fig 1.

tmdbId numeric foreign_key

This is almost certainly the TMDB movie/title identifier used for joins to The Movie Database. With 87,425 unique values across 87,585 rows and a null rate of 0.0014, it behaves as a near-unique key rather than a modelling feature, despite its numeric stats showing a wide range (2 to 1,186,337) and right skew (1.25).

Treatment: Left-join on this ID, expecting no match for the 124 null rows; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
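Two details matter for a left join on tmdbId: the 124 null rows will not match anything, and the unique count (87,425) is below the non-null count (87,585 - 124 = 87,461), so some tmdbId values appear to be shared by more than one movieId, assuming `unique` counts distinct non-null values. The sketch below uses made-up rows and a hypothetical TMDB-side lookup to show both cases with the standard library only.

```python
from collections import Counter

# Made-up rows in the shape of links.csv; None stands for a missing tmdbId.
links = [
    {"movieId": 10, "tmdbId": 501},
    {"movieId": 20, "tmdbId": 502},
    {"movieId": 30, "tmdbId": None},
    {"movieId": 40, "tmdbId": 502},  # shares a tmdbId with movieId 20
]

# Hypothetical TMDB-side metadata keyed by tmdbId.
tmdb_meta = {501: {"title": "Example A"}, 502: {"title": "Example B"}}

def left_join(rows, lookup):
    """Keep every links row; attach a title when the tmdbId resolves, else None."""
    joined = []
    for row in rows:
        meta = lookup.get(row["tmdbId"])  # a None key simply finds nothing
        joined.append({**row, "title": meta["title"] if meta else None})
    return joined

def repeated_keys(rows):
    """List the tmdbId values that occur on more than one row."""
    counts = Counter(r["tmdbId"] for r in rows if r["tmdbId"] is not None)
    return sorted(key for key, count in counts.items() if count > 1)

joined = left_join(links, tmdb_meta)
unmatched = [r["movieId"] for r in joined if r["title"] is None]
print("unmatched movieIds:", unmatched)
print("repeated tmdbIds:", repeated_keys(links))
```

With pandas, the equivalent would be `DataFrame.merge(..., how="left")`, where the `validate` and `indicator` parameters can flag repeated keys and unmatched rows.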
Out[17]:

saturn.columns["tmdbId"].stats

stat          value
n             87,585
nulls         124 (0.1%)
unique        87,425
min           2
max           1.186e+06
mean          2.414e+05
median        139,272
std           2.471e+05
q1            46,836
q3            381,693
iqr           334,857
skew          1.254
kurtosis      0.9079
n_outliers    2,200
outlier_rate  0.02515
zero_rate     0
Fig 8.
Distribution of tmdbId. Vertical dash marks the median.
Histogram bins for tmdbId (median: 139272.0): same 40 bins as listed under Fig 2.

How to cite

BibTeX
@misc{saturn-ml-32m-links-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: ml 32m links},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/ml-32m-links}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: ml 32m links. Source: /home/coolhand/html/datavis/data_trove/entertainment/movies/ml-32m/links.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/ml-32m-links