data-trove-bond-girls · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/entertainment/film/bond_girls.csv

Saturn profiled 71 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

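# Install the Saturn CLI (IPython shell escape; needed once per environment).
!pip install saturn-dissect

import subprocess

# Profile the CSV, write the findings to JSON, and select the model for the prose.
# check=True raises CalledProcessError if the saturn command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/entertainment/film/bond_girls.csv",
    "--findings", "data-trove-bond-girls.json",
    "--llm", "anthropic:default",
], check=True)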

Summary confidence: high

This dataset covers 71 Bond Girl records across 25 James Bond films, pairing each actress with her age, the film's director and Bond actor, and the film's box office and budget figures. The most striking signal is the age disparity: Bond girls average 28.9 years old while Bond actors average 43.1 years, a mean gap of roughly 14 years. The gap is not uniform from row to row, since the two age columns correlate only weakly (Pearson +0.29). Box office figures show strong right skew with outliers (actual revenue ranges from $59.5M to $1,108.6M, with a mean of $265.4M well above the $156.2M median), suggesting a handful of films massively outperformed the rest. Sean Connery leads the Bond actor column with 23 rows, and Guy Hamilton is the most frequent director with 14 rows; both counts are Bond Girl rows rather than films, and both are worth examining against box office performance.

citing: actress_age.stats.mean · actress_age.stats.median · bond_actor_age.stats.mean · box_office_actual_$.stats.max · box_office_actual_$.stats.min · box_office_actual_$.stats.mean · box_office_actual_$.stats.median · box_office_actual_$.alerts · bond_actor.top_values · director_name.top_values · row_count · film_title.n_unique

Out[4]:

saturn.schema() · 11 columns

column	kind	n	null%	unique	alerts
bond_girl_name	categorical	71	0.0%	70	long_tail
actress_age	numeric	71	0.0%	20
film_title	categorical	71	0.0%	25
film_release_year	numeric	71	0.0%	25
bond_actor	categorical	71	0.0%	6
bond_actor_age	numeric	71	0.0%	21
director_name	categorical	71	0.0%	12
box_office_actual_$	numeric	71	0.0%	25	outliers
box_office_adjusted_2005_$	numeric	71	0.0%	24	outliers
budget_actual_$	categorical	71	0.0%	23
budget_adjusted_2005_$	categorical	71	0.0%	25

Fig 1.

bond_actor · Compare how many Bond Girls are associated with each Bond actor — Connery's dominance with 23 entries stands out immediately.

Top values for bond_actor (6 unique shown, of 6 total).
value	count	share
Sean Connery	23	32.4%
Roger Moore	19	26.8%
Pierce Brosnan	12	16.9%
Daniel Craig	11	15.5%
George Lazenby	3	4.2%
Timothy Dalton	3	4.2%

Fig 2.

box_office_actual_$ · Look for the heavy right skew and outliers at the top end, where a few blockbusters dwarf the typical film's earnings.

Histogram bins for box_office_actual_$ (median: 156.2).
bin	count
59.5 – 190.6	41
190.6 – 321.8	7
321.8 – 452.9	12
452.9 – 584	3
584 – 715.2	4
715.2 – 846.3	0
846.3 – 977.5	2
977.5 – 1109	2

Fig 3.

actress_age · The distribution clusters tightly between 24 and 33, revealing how consistently young Bond Girls are cast relative to their co-stars.

Histogram bins for actress_age (median: 28.0).
bin	count
21 – 24.75	21
24.75 – 28.5	15
28.5 – 32.25	15
32.25 – 36	11
36 – 39.75	8
39.75 – 43.5	0
43.5 – 47.25	0
47.25 – 51	1
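
The bin edges in the table above are consistent with eight equal-width bins spanning the column's minimum to its maximum. A minimal sketch, assuming that is how Saturn bins numeric columns (the function name `equal_width_edges` is hypothetical and not part of Saturn):

```python
def equal_width_edges(lo: float, hi: float, n_bins: int = 8) -> list[float]:
    """Return n_bins + 1 evenly spaced edges from lo to hi inclusive."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# actress_age ranges from 21 to 51, so each bin is (51 - 21) / 8 = 3.75 wide.
edges = equal_width_edges(21, 51)
print(edges)
```

The same arithmetic gives a width of about 131.1 for box_office_actual_$ (59.5 to 1108.6), which matches the first edge of 190.6 shown in its histogram.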

Fig 4.

director_name · Guy Hamilton and John Glen lead with 25 of the 71 entries between them (35.2%); check whether their films skew toward higher or lower box office returns.

Top values for director_name (12 unique shown, of 12 total).
value	count	share
Guy Hamilton	14	19.7%
John Glen	11	15.5%
Terence Young	10	14.1%
Lewis Gilbert	10	14.1%
Martin Campbell	5	7.0%
Sam Mendes	4	5.6%
Peter R. Hunt	3	4.2%
Roger Spottiswoode	3	4.2%
Michael Apted	3	4.2%
Lee Tamahori	3	4.2%
Cary Joji Fukunaga	3	4.2%
Marc Forster	2	2.8%
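
The counts in this table can be turned into cumulative shares to see how many directors it takes to cover more than half of the rows. A minimal sketch using only the counts shown above (the variable names are illustrative):

```python
from itertools import accumulate

# Row counts copied from the director_name table (Bond Girl rows, not films).
counts = {
    "Guy Hamilton": 14, "John Glen": 11, "Terence Young": 10,
    "Lewis Gilbert": 10, "Martin Campbell": 5, "Sam Mendes": 4,
    "Peter R. Hunt": 3, "Roger Spottiswoode": 3, "Michael Apted": 3,
    "Lee Tamahori": 3, "Cary Joji Fukunaga": 3, "Marc Forster": 2,
}
total = sum(counts.values())

# Running totals over counts sorted from largest to smallest.
running = list(accumulate(sorted(counts.values(), reverse=True)))
cumulative_share = [c / total for c in running]

# Smallest number of top directors whose rows exceed half of the total.
directors_for_majority = next(
    i + 1 for i, share in enumerate(cumulative_share) if share > 0.5
)
print(total, directors_for_majority)
```

On these counts the top two directors cover 25 of 71 rows, the top three cover 35, and it takes the top four (45 rows) to pass the halfway mark.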

Fig 5.

film_release_year · See how Bond Girl appearances are distributed across six decades, with 22 of the 71 records in the 1962–1969 bin alone and 54 in the five bins before 1999.

Histogram bins for film_release_year (median: 1979.0).
bin	count
1962 – 1969	22
1969 – 1977	9
1977 – 1984	11
1984 – 1992	6
1992 – 1999	6
1999 – 2006	8
2006 – 2014	4
2014 – 2021	5

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
bond_girl_name	categorical	0.0%
actress_age	numeric	0.0%
film_title	categorical	0.0%
film_release_year	numeric	0.0%
bond_actor	categorical	0.0%
bond_actor_age	numeric	0.0%
director_name	categorical	0.0%
box_office_actual_$	numeric	0.0%
box_office_adjusted_2005_$	numeric	0.0%
budget_actual_$	categorical	0.0%
budget_adjusted_2005_$	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
	actress_age	film_release_year	bond_actor_age	box_office_actual_$	box_office_adjusted_2005_$
actress_age	+1.00	+0.31	+0.29	+0.26	-0.08
film_release_year	+0.31	+1.00	+0.50	+0.85	-0.12
bond_actor_age	+0.29	+0.50	+1.00	+0.25	-0.39
box_office_actual_$	+0.26	+0.85	+0.25	+1.00	+0.31
box_office_adjusted_2005_$	-0.08	-0.12	-0.39	+0.31	+1.00

bond_girl_name categorical label

This column contains the names of Bond girls (female characters/actresses from the James Bond film franchise), serving as a near-unique label per row across 71 records. Cardinality is 70 out of 71 rows, with Léa Seydoux being the only duplicate (appearing twice, likely reflecting her appearances in both Spectre and No Time to Die). Entropy ratio of 0.999 confirms the distribution is almost perfectly uniform — every other name appears exactly once, triggering the long-tail alert despite the dataset being small.

Treatment: Use as a display label or join key to a Bond film lookup table; not suitable as a categorical feature due to near-unique cardinality.

anthropic:default · confidence high

Out[13]:

saturn.columns["bond_girl_name"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	70
top_value	Léa Seydoux
top_rate	0.02817
cardinality	70
entropy	6.122
entropy_ratio	0.9987
alert: long_tail	69 singleton categories
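
The entropy and entropy_ratio figures above can be reproduced from the value counts alone. A minimal sketch, assuming Saturn uses Shannon entropy in bits and normalises by log2 of the number of distinct values (the function names are hypothetical; the assumption is supported only by the fact that the results agree with the table at its displayed precision):

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# bond_girl_name: one name appears twice, the other 69 appear once (71 rows).
name_counts = [2] + [1] * 69
h = entropy_bits(name_counts)
r = entropy_ratio(name_counts)
print(round(h, 3), round(r, 4))
```

Applying the same functions to the bond_actor counts (23, 19, 12, 11, 3, 3) gives roughly 2.272 bits and a ratio near 0.879, in line with that column's stats.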

Fig 8.

Top values for bond_girl_name.

Top values for bond_girl_name (20 unique shown, of 70 total).
value	count	share
Léa Seydoux	2	2.8%
Ursula Andress	1	1.4%
Zena Marshall	1	1.4%
Eunice Gayson	1	1.4%
Daniela Bianchi	1	1.4%
Martine Beswick	1	1.4%
Aliza Gur	1	1.4%
Honor Blackman	1	1.4%
Shirley Eaton	1	1.4%
Tania Mallet	1	1.4%
Nadja Regin	1	1.4%
Margaret Nolan	1	1.4%
Claudine Auger	1	1.4%
Luciana Paluzzi	1	1.4%
Molly Peters	1	1.4%
Maryse Guy Mitsouko	1	1.4%
Mie Hama	1	1.4%
Akiko Wakabayashi	1	1.4%
Tsai Chin	1	1.4%
Karin Dor	1	1.4%

actress_age numeric feature

This column records each actress's age, most likely at the time of her Bond film's release, spanning 21 to 51 years across 71 records. The 20 unique values across 71 rows reflect whole-year ages in a narrow range rather than repeat appearances: bond_girl_name has 70 distinct names, so only one actress appears twice. The distribution is moderately right-skewed (skew 0.97) with a single outlier at 51, while the middle half of values falls between 24 and 33 (IQR = 9.0) around a mean of ~28.9, consistent with a casting bias toward younger actresses.

Treatment: Use as-is or bin into age brackets; consider log-transform if used in regression given positive skew.

anthropic:default · confidence high

Out[16]:

saturn.columns["actress_age"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	20
min	21
max	51
mean	28.92
median	28
std	5.547
q1	24
q3	33
iqr	9
skew	0.9703
kurtosis	1.794
n_outliers	1
outlier_rate	0.01408
zero_rate	0

Fig 9.

Distribution of actress_age. Vertical dash marks the median.

Histogram bins for actress_age (median: 28.0).
bin	count
21 – 24.75	21
24.75 – 28.5	15
28.5 – 32.25	15
32.25 – 36	11
36 – 39.75	8
39.75 – 43.5	0
43.5 – 47.25	0
47.25 – 51	1

film_title categorical label

This column contains film titles from the James Bond franchise: 25 distinct films across 71 rows, so each film appears multiple times (averaging ~2.8 rows per title). The distribution is fairly flat: the top values 'Thunderball' and 'Goldfinger' each appear only 5 times (7.04% top_rate) and entropy_ratio is 0.984, meaning rows are spread almost evenly across titles. Given the bond_girl_name column, each row represents one Bond Girl credit rather than one film, so film-level columns (director, release year, box office, budget) repeat on every row for the same title.

Treatment: Use as a grouping key for aggregations; do not encode as a high-cardinality feature without grouping logic.

anthropic:default · confidence high

Out[19]:

saturn.columns["film_title"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	25
top_value	Thunderball
top_rate	0.07042
cardinality	25
entropy	4.568
entropy_ratio	0.9837

Fig 10.

Top values for film_title.

Top values for film_title (20 unique shown, of 25 total).
value	count	share
Thunderball	5	7.0%
Goldfinger	5	7.0%
You Only Live Twice	4	5.6%
Diamonds Are Forever	4	5.6%
From Russia with Love	3	4.2%
On Her Majesty's Secret Service	3	4.2%
Live and Let Die	3	4.2%
The Spy Who Loved Me	3	4.2%
Moonraker	3	4.2%
For Your Eyes Only	3	4.2%
A View to a Kill	3	4.2%
GoldenEye	3	4.2%
Tomorrow Never Dies	3	4.2%
The World Is Not Enough	3	4.2%
Die Another Day	3	4.2%
No Time to Die	3	4.2%
Dr. No	2	2.8%
The Man with the Golden Gun	2	2.8%
Octopussy	2	2.8%
Licence to Kill	2	2.8%
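
Because each film appears on several rows, film-level statistics are easier to reason about after collapsing to one record per title. A minimal sketch, assuming the film-level fields are identical across rows of the same film (the helper `one_row_per_film` and the demo rows are hypothetical; the demo values are placeholders, not values from the CSV):

```python
def one_row_per_film(rows, key="film_title",
                     fields=("film_release_year", "box_office_actual_$")):
    """Collapse per-Bond-Girl rows to one record per film.

    Keeps the first row seen for each film and only the film-level fields.
    """
    films = {}
    for row in rows:
        if row[key] not in films:
            films[row[key]] = {name: row[name] for name in (key, *fields)}
    return list(films.values())


# Placeholder rows for illustration only; these are not values from the CSV.
demo_rows = [
    {"film_title": "Film A", "film_release_year": 1,
     "box_office_actual_$": 10.0, "bond_girl_name": "x"},
    {"film_title": "Film A", "film_release_year": 1,
     "box_office_actual_$": 10.0, "bond_girl_name": "y"},
    {"film_title": "Film B", "film_release_year": 2,
     "box_office_actual_$": 20.0, "bond_girl_name": "z"},
]
films = one_row_per_film(demo_rows)
print(len(films))  # 2
```

With the real file, rows could come from `csv.DictReader`, which yields strings, so numeric fields would still need casting before any arithmetic.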

film_release_year numeric feature

This column records the theatrical release year of each row's film, spanning 1962 to 2021 across 71 rows. The 25 unique values match the 25 distinct film titles, which suggests the repetition comes from several Bond Girl rows sharing a film rather than from films sharing a release year. The distribution is mildly right-skewed (skew 0.60) with a median of 1979 and a mean of 1983, so rows lean toward older films: the 1962–1969 bin alone holds 22 of 71. The IQR of 30 years (1967–1997) places the middle half of rows between the late 1960s and the late 1990s.

Treatment: Use as a numeric feature or bin into decade-level categories for modelling; consider interaction with other film metadata.

anthropic:default · confidence high

Out[22]:

saturn.columns["film_release_year"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	25
min	1962
max	2021
mean	1983
median	1979
std	17.59
q1	1967
q3	1997
iqr	30
skew	0.6012
kurtosis	-0.8409
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 11.

Distribution of film_release_year. Vertical dash marks the median.

Histogram bins for film_release_year (median: 1979.0).
bin	count
1962 – 1969	22
1969 – 1977	9
1977 – 1984	11
1984 – 1992	6
1992 – 1999	6
1999 – 2006	8
2006 – 2014	4
2014 – 2021	5

bond_actor categorical label

This column identifies which actor portrayed James Bond in each record, covering all 6 canonical Bond actors across 71 rows. Sean Connery leads with 23 appearances (32.4% of records), followed by Roger Moore (19) and Pierce Brosnan (12); George Lazenby and Timothy Dalton are notably underrepresented with only 3 records each, reflecting their shorter tenures. The entropy ratio of 0.88 indicates a moderately uneven but not extreme distribution. No nulls are present and cardinality is perfectly clean.

Treatment: One-hot encode or use as a grouping/stratification variable; note class imbalance for Lazenby and Dalton (3 records each) if modelling per-actor effects.

anthropic:default · confidence high

Out[25]:

saturn.columns["bond_actor"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	6
top_value	Sean Connery
top_rate	0.3239
cardinality	6
entropy	2.272
entropy_ratio	0.8788

Fig 12.

Top values for bond_actor.

Top values for bond_actor (6 unique shown, of 6 total).
value	count	share
Sean Connery	23	32.4%
Roger Moore	19	26.8%
Pierce Brosnan	12	16.9%
Daniel Craig	11	15.5%
George Lazenby	3	4.2%
Timothy Dalton	3	4.2%

bond_actor_age numeric feature

This column records the age of the actor playing James Bond at the time of each film, repeated on every Bond Girl row for that film: 71 rows but only 21 unique values, consistent with 6 actors across 25 films. The distribution is nearly symmetric (skew 0.13) and platykurtic (kurtosis −1.02), ranging from 30 to 58 with a mean and median both near 43 and a middle half of 36 to 49, suggesting Bond actors are mostly cast and retained from their mid-30s to late 40s with no strong skew toward younger or older talent.

Treatment: Use as-is for modelling; low cardinality (21 unique values) means binning or treating as ordinal may also be appropriate.

anthropic:default · confidence high

Out[28]:

saturn.columns["bond_actor_age"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	21
min	30
max	58
mean	43.14
median	43
std	7.842
q1	36
q3	49
iqr	13
skew	0.1286
kurtosis	-1.018
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of bond_actor_age. Vertical dash marks the median.

Histogram bins for bond_actor_age (median: 43.0).
bin	count
30 – 33.5	8
33.5 – 37	10
37 – 40.5	8
40.5 – 44	10
44 – 47.5	15
47.5 – 51	6
51 – 54.5	9
54.5 – 58	5

director_name categorical label

This column contains the names of the films' directors: 12 unique directors across 71 records, consistent with the James Bond film series. Guy Hamilton leads with 14 rows, followed by John Glen (11), Terence Young (10), and Lewis Gilbert (10); together these four account for 45 of the 71 rows (63.4%). These counts are Bond Girl rows, not films, so a director whose films feature more Bond Girls is weighted more heavily. The entropy ratio of 0.917 indicates the distribution is relatively flat rather than dominated by one director. No nulls are present.

Treatment: Use as a categorical grouping variable; with only 12 distinct values and no nulls, one-hot or ordinal encoding is straightforward.

anthropic:default · confidence high

Out[31]:

saturn.columns["director_name"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	12
top_value	Guy Hamilton
top_rate	0.1972
cardinality	12
entropy	3.288
entropy_ratio	0.9172

Fig 14.

Top values for director_name.

Top values for director_name (12 unique shown, of 12 total).
value	count	share
Guy Hamilton	14	19.7%
John Glen	11	15.5%
Terence Young	10	14.1%
Lewis Gilbert	10	14.1%
Martin Campbell	5	7.0%
Sam Mendes	4	5.6%
Peter R. Hunt	3	4.2%
Roger Spottiswoode	3	4.2%
Michael Apted	3	4.2%
Lee Tamahori	3	4.2%
Cary Joji Fukunaga	3	4.2%
Marc Forster	2	2.8%

box_office_actual_$ numeric numeric_target

This column records actual box office revenue (likely in millions of dollars), repeated on every row of the same film: the 25 unique values line up with the 25 distinct films rather than indicating rounding or binning. The distribution is strongly right-skewed (skew = 1.92, kurtosis = 3.55): the median is $156.2M while the mean is pulled up to $265.4M, and the max reaches $1,108.6M, about 4.2× the mean and 7.1× the median. Four rows (5.6%) are flagged as outliers beyond the upper fence of q3 + 1.5 × IQR, roughly $699M. A standard deviation ($235.3M) nearly as large as the mean is consistent with a small number of blockbusters stretching the upper tail.

Treatment: Log-transform before regression to reduce skew; investigate the 4 outlier rows (above the upper fence) and check how many distinct films they represent; deduplicate to one row per film before computing film-level statistics.

anthropic:default · confidence high

Out[34]:

saturn.columns["box_office_actual_$"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	25
min	59.5
max	1109
mean	265.4
median	156.2
std	235.3
q1	120.5
q3	352
iqr	231.6
skew	1.923
kurtosis	3.55
n_outliers	4
outlier_rate	0.05634
zero_rate	0
alert: outliers	5.6% rows beyond 1.5 IQR
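
The outlier alert follows the usual 1.5 × IQR rule. A minimal sketch of the fence arithmetic, assuming Saturn applies standard Tukey fences and using the rounded quartiles from the table above (`tukey_fences` is a hypothetical helper, and the bin counts are copied from the box_office_actual_$ histogram):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = tukey_fences(120.5, 352.0)

# Histogram bins as (lower edge, upper edge, row count).
bins = [
    (59.5, 190.6, 41), (190.6, 321.8, 7), (321.8, 452.9, 12),
    (452.9, 584.0, 3), (584.0, 715.2, 4), (715.2, 846.3, 0),
    (846.3, 977.5, 2), (977.5, 1109.0, 2),
]

# Rows in bins that lie entirely above the upper fence.
rows_above = sum(count for lo, _, count in bins if lo >= high)
print(low, high, rows_above)
```

The lower fence is negative, so only the upper tail can be flagged. The 584–715.2 bin straddles the upper fence, and the reported n_outliers of 4 implies its four rows sit below it.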

Fig 15.

Distribution of box_office_actual_$. Vertical dash marks the median.

Histogram bins for box_office_actual_$ (median: 156.2).
bin	count
59.5 – 190.6	41
190.6 – 321.8	7
321.8 – 452.9	12
452.9 – 584	3
584 – 715.2	4
715.2 – 846.3	0
846.3 – 977.5	2
977.5 – 1109	2

box_office_adjusted_2005_$ numeric feature

This column represents inflation-adjusted box office revenue (in millions of 2005 USD), repeated on every row of the same film (n=71 rows). The 24 unique values are one fewer than the 25 distinct films, which suggests two films share an adjusted figure; the low cardinality otherwise reflects per-film repetition rather than bucketing. The distribution is moderately right-skewed (skew=0.81) with a high outlier rate of 26.8% (19 of 71 rows), driven by a narrow IQR of $104.3M that puts the 1.5 × IQR fences at roughly $283M and $700M. Values range from $250.9M to $943.5M, indicating this dataset likely covers only commercially successful titles.

Treatment: Deduplicate to one row per film before film-level analysis and check whether two films share an adjusted value; log-transform to reduce right skew before regression.

anthropic:default · confidence medium

Out[37]:

saturn.columns["box_office_adjusted_2005_$"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	24
min	250.9
max	943.5
mean	520.5
median	465.4
std	178.3
q1	439.5
q3	543.8
iqr	104.3
skew	0.8111
kurtosis	-0.1065
n_outliers	19
outlier_rate	0.2676
zero_rate	0
alert: outliers	26.8% rows beyond 1.5 IQR

Fig 16.

Distribution of box_office_adjusted_2005_$. Vertical dash marks the median.

Histogram bins for box_office_adjusted_2005_$ (median: 465.4).
bin	count
250.9 – 337.5	11
337.5 – 424.1	5
424.1 – 510.6	21
510.6 – 597.2	20
597.2 – 683.8	0
683.8 – 770.4	2
770.4 – 856.9	10
856.9 – 943.5	2

budget_actual_$ categorical feature

This column represents actual film budget figures (likely in millions of dollars), stored as strings rather than a numeric type. The likely cause is visible in the top-values table: one entry, '250.0–301.0', is a range rather than a single number, which would prevent numeric parsing of the column. With 23 unique values across 71 rows and an entropy ratio of 0.971, the distribution is nearly uniform, meaning no single budget value dominates (top value '7.0' appears only 8 times, ~11.3% of rows). Listed values run from '1.1' to the '250.0–301.0' range, a spread of more than two orders of magnitude. The categorical typing of a clearly numeric column is the primary surprise and will break any quantitative analysis until it is fixed.

Treatment: Resolve the range entry '250.0–301.0' (for example to a midpoint or a bound), cast to float, then assess distribution shape before deciding on scaling or log-transform for modelling.

anthropic:default · confidence medium

Out[40]:

saturn.columns["budget_actual_$"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	23
top_value	7.0
top_rate	0.1127
cardinality	23
entropy	4.392
entropy_ratio	0.971

Fig 17.

Top values for budget_actual_$.

Top values for budget_actual_$ (20 unique shown, of 23 total).
value	count	share
7.0	8	11.3%
6.8	5	7.0%
3.0	5	7.0%
10.3	4	5.6%
7.2	4	5.6%
2.0	3	4.2%
14.0	3	4.2%
34.0	3	4.2%
28.0	3	4.2%
30.0	3	4.2%
60.0	3	4.2%
110.0	3	4.2%
135.0	3	4.2%
142.0	3	4.2%
250.0–301.0	3	4.2%
1.1	2	2.8%
27.5	2	2.8%
36.0	2	2.8%
150.0	2	2.8%
200.0	2	2.8%
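
Casting this column to float needs a rule for the '250.0–301.0' entry listed above. A minimal sketch, assuming a range should be replaced by its midpoint (that choice is an assumption, not something the source specifies, and `parse_budget` is a hypothetical helper):

```python
import re

# Matches plain decimal numbers such as "7.0" or "250".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_budget(value: str) -> float:
    """Parse a budget string; a two-number range becomes its midpoint."""
    numbers = [float(m) for m in _NUMBER.findall(value)]
    if not numbers or len(numbers) > 2:
        raise ValueError(f"unrecognised budget value: {value!r}")
    return sum(numbers) / len(numbers)


print(parse_budget("7.0"))          # 7.0
print(parse_budget("250.0–301.0"))  # 275.5
```

Taking the lower or upper bound instead would be equally defensible; whichever rule is used should be recorded, since it changes the maximum of the column.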

budget_adjusted_2005_$ categorical feature

This column represents inflation-adjusted budget figures (in millions of 2005 USD), but it has been stored as a categorical type rather than numeric. As with budget_actual_$, the top-values table shows one range entry, '188.7–226.4', which likely blocked numeric parsing. With 71 rows and 25 unique values, cardinality matches the 25 films, and an entropy ratio of 0.984 indicates a near-uniform distribution with no dominant value (top value '41.9' appears only 5 times, a 7% top_rate). The decimal-formatted string values (e.g., '41.9', '91.5') confirm these are numeric quantities stored as categories.

Treatment: Resolve the range entry '188.7–226.4', cast to float64, verify unit scale (likely millions USD), then use directly or log-transform before modelling.

anthropic:default · confidence high

Out[43]:

saturn.columns["budget_adjusted_2005_$"].stats

stat	value
n	71
nulls	0 (0.0%)
unique	25
top_value	41.9
top_rate	0.07042
cardinality	25
entropy	4.568
entropy_ratio	0.9837

Fig 18.

Top values for budget_adjusted_2005_$.

Top values for budget_adjusted_2005_$ (20 unique shown, of 25 total).
value	count	share
41.9	5	7.0%
18.6	5	7.0%
59.9	4	5.6%
34.7	4	5.6%
12.6	3	4.2%
37.3	3	4.2%
30.8	3	4.2%
45.1	3	4.2%
91.5	3	4.2%
60.2	3	4.2%
54.5	3	4.2%
76.9	3	4.2%
133.9	3	4.2%
158.3	3	4.2%
154.2	3	4.2%
188.7–226.4	3	4.2%
7.0	2	2.8%
27.7	2	2.8%
53.9	2	2.8%
56.7	2	2.8%
