# The Python Notebook

This is a Python notebook. It is an example of "[literate programming](https://www-cs-faculty.stanford.edu/~knuth/lp.html)," a format introduced by computer scientist Donald Knuth in the 1980s. The idea behind literate programming is to produce documents that allow a user to:

1. read and edit lines of code,
2. run that code and see the results, and
3. read and write prose text explaining what is happening in that code.

Python notebooks (also called Colab notebooks or Jupyter notebooks depending on the platform you use to run them) have become increasingly important in digital humanities and data science as a means to learn programming skills and to circulate results and analysis.

This notebook is divided into what are called "cells." You're currently reading a "text cell." Try double-clicking it: you will now see text that can be edited. Press `Esc` or `Ctrl-Enter` to exit the editing view.

You can also navigate between cells by using your keyboard's up and down arrows.

## Understanding Markdown

Text cells are written in Markdown, a lightweight markup format for text that is easily readable by both humans and machines. For example:

| SYNTAX | OUTPUT |
|--------|--------|
| `*italics*` | *italics* |
| `**bold**` | **bold** |
| `[Princeton](https://www.princeton.edu)` | [Princeton](https://www.princeton.edu) |

If you double-click the text cell you're now reading, you'll see the underlying Markdown formatting. When you exit the cell, it will return to its fully rendered format.

For a basic overview of Markdown syntax, see [this cheat sheet](https://www.markdownguide.org/cheat-sheet/). An extended guide and history of the markup language is also available on that site. To test out Markdown in a live editor in your web browser, [try Dillinger](https://dillinger.io/).


## Executing Code Cells

While text cells contain Markdown-formatted text, code cells contain lines of code written in Python that can be run right in the notebook. The cell below this one is a code cell.

You can execute a code cell by hovering over its top left corner and clicking the play button. The first time you run a cell in a Jupyter notebook in VS Code, you'll be asked to select a kernel environment. Select "Python environments...", then choose the Python version you'd like to use: pick whichever one is listed as "Global Env" on the right. For me, this is Python 3.11.7.

You can also type `Ctrl-Enter` (which will run the cell and keep the same cell selected), or `Shift-Enter` (which will run the cell and then select the next cell).

Try running the following cell, selecting your Python environment, and seeing what it prints out:

In [2]:
print("Hello world!")

Hello world!


Any text in a code cell preceded by a `#` character will be interpreted as a comment (explanatory text) rather than run as code.

In [3]:
# this is a comment in a code cell

You can also evaluate arithmetic expressions:

In [4]:
3 + 4

7
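
As a small sketch of how Python evaluates arithmetic, the cell below combines several operators. The variable names are made up for illustration. A notebook displays only the value of a cell's last expression, so `print` is used here to show every result at once.

```python
# illustrative values; none of these names come from the notebook above
total = 3 + 4            # addition
product = 3 * 4          # multiplication
quotient = 7 / 2         # true division always returns a float
floor_quotient = 7 // 2  # floor division discards the remainder
power = 2 ** 5           # exponentiation

# multiplication binds more tightly than addition
precedence = 3 + 4 * 2

print(total, product, quotient, floor_quotient, power, precedence)
```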

And define variables:

In [5]:
my_sentence = 'The meaning of a word is its use in the language.'

In [6]:
print(my_sentence)

The meaning of a word is its use in the language.


The power of these notebooks is that you can edit the code, experimenting with different inputs or variables, and see how the output changes as a result. Try single-clicking on the code cell above to edit it.
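
One way to experiment is to apply a few built-in string operations to the variable. The sketch below redefines `my_sentence` so that it runs on its own; `words` and `shouted` are names invented for this example.

```python
my_sentence = 'The meaning of a word is its use in the language.'

# count the characters in the sentence, including spaces and punctuation
print(len(my_sentence))

# split the sentence into a list of words on whitespace
words = my_sentence.split()
print(words)

# convert the sentence to upper case; the original string is unchanged
shouted = my_sentence.upper()
print(shouted)
```

Try changing the sentence and re-running the cell to see how each result changes.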

## Loading Data

We can also load remote data, like the Project Gutenberg text of Willa Cather's *O Pioneers!*. You'll notice the code below contains different colors: this is called syntax highlighting, which allows the user to more easily read and distinguish different elements in the code.

In [7]:
import requests

target_url = "https://www.gutenberg.org/files/24/24-0.txt"

# load the document's URL
response = requests.get(target_url)
# detect what appears to be the file encoding
response.encoding = response.apparent_encoding

# take a 1,000-character passage from O Pioneers!,
# starting at character index 2000 of the file
passage = response.text[2000:3000]

print(passage)




PART I.
The Wild Land




I


One January day, thirty years ago, the little town of Hanover, anchored
on a windy Nebraska tableland, was trying not to be blown away. A mist
of fine snowflakes was curling and eddying about the cluster of low
drab buildings huddled on the gray prairie, under a gray sky. The
dwelling-houses were set about haphazard on the tough prairie sod; some
of them looked as if they had been moved in overnight, and others as if
they were straying off by themselves, headed straight for the open
plain. None of them had any appearance of permanence, and the howling
wind blew under them as well as over them. The main street was a deeply
rutted road, now frozen hard, which ran from the squat red railway
station and the grain “elevator” at the north end of the town to the
lumber yard and the horse pond at the south end. On either side of this
road straggled two uneven rows of wooden buildings; the general
merchandise stores, the two banks, the d
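
The slice `response.text[2000:3000]` relies on a hard-coded character offset, which is why the passage ends mid-word. A minimal sketch of an alternative, assuming you would rather locate a passage by searching for a known phrase: the `sample_text` string below is a short stand-in for the downloaded file, not the real Gutenberg text.

```python
# a short stand-in for response.text (hypothetical, so no download is needed)
sample_text = (
    "Front matter and licence notes.\n\n"
    "PART I.\nThe Wild Land\n\n"
    "One January day, thirty years ago, the little town of Hanover..."
)

# slicing by position: characters 0 through 11 (the end index is excluded)
print(sample_text[0:12])

# slicing by content: find where a known marker begins
marker = "PART I."
start = sample_text.find(marker)  # returns -1 if the marker is absent

if start != -1:
    # print 40 characters beginning at the marker
    passage = sample_text[start:start + 40]
    print(passage)
```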


We can also load tabular data, such as CSV or TSV files, using Python libraries (packages of pre-made code) like Pandas. Let's try loading the data on narrative forms from the [Early Novels Dataset](https://earlynovels.github.io/), which is distributed as a tab-separated (TSV) file.

In [8]:
# import Pandas, a Python library
import pandas as pd

# read in a remotely hosted TSV (tab-separated) file as a dataframe
df = pd.read_csv('https://raw.githubusercontent.com/earlynovels/end-dataset/master/end-dataset-master-11282018/11282018-full.tsv', sep='\t')

# view that dataframe
df

Unnamed: 0,id,leader,author name,author dates,author transcribed,title catalog,title full,title half,title series,vols,...,author claim type,author gender claim,author gender,advertisement genres,title words:other works,title words:singular nouns,title words:place names,holding institution,cataloger initials,cataloger institution
0,260496,02463cam a2200505 i 4500,"Weston, Anna Maria",,by Anna Maria Weston.,"Pleasure and pain, or The fate of Ellen : a no...","Pleasure and pain, or the fate of Ellen; A nov...",,,3.0,...,,Female,Female,,,"['pleasure', 'pain', 'fate', 'novel']",,University of Pennsylvania,,
1,260498,04218cam a2200601 i 4500,"Hawkins, Laetitia Matilda",1760-1835.,by Laetitia-Matilda Hawkins.,"Rosanne, or, A father's labour lost / by Laeti...","Rosanne; or, a father's labour lost. In three ...",Rosanne. Vol. II.,,3.0,...,Proper name,,,['Fiction'],,"['Countess', ""Ladyship's"", 'example', 'advanta...",['Waldegrave'],University of Pennsylvania,,
2,260500,02592cam a2200469 a 4500,"Orrery, Roger Boyle",1621-1679.,"wrote by a party in his pleasures, and now pub...",Royal adventures : being the amorous history o...,Royal adventures: being the amorous history of...,,,1.0,...,,Indeterminate,,,,"['adventure', 'history', 'king', 'court', 'par...",,University of Pennsylvania,,
3,260503,04923mam a2200697u 4500,Manley,1663-1724.,,Secret memoirs and manners of several persons ...,Secret memoirs and manners of several persons ...,,,4.0,...,,Male,,['Fiction'],,"['memoir', 'manner', 'person', 'sex', 'island']","['Atalantis', 'Mediterranean']",University of Pennsylvania,,
4,260508,06692cam a2200853 i 4500,"Burney, Sarah Harriet",1772-1844.,by S. H. Burney. ...,Tales of fancy / by S. H. Burney. ...,"Tales of fancy, by S.H. Burney. Author of 'Cla...","Tales of fancy, by S.H. Burney. Vol. I Contain...",,3.0,...,,Indeterminate,,['Fiction'],,['Dedication'],['Hesse Hombourg'],University of Pennsylvania,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1997,99745313503681,01316cam a2200349 a 4500,"Martin,",,by the author of Deloraine.,"Melbourne : a novel, in three volumes / by the...",Melbourne. A novel. in three volumes. By the a...,Melbourne. A novel.,,,...,Reference to other works,Indeterminate,Indeterminate,,['Deloraine'],"['Novel', 'Volume', 'Author']",,PU,SH,Swarthmore College
1998,99746073503681,01120cam a2200313 a 4500,,,,Geraldina : a novel founded on a recent event ...,"Geraldina. A Novel, Founded on A Recent Event....",Geraldina.,,,...,,,,,,"['Novel', 'Event']",,University of Pennsylvania,IC,University of Pennsylvania
1999,99865533503681,01355cam a2200373 4500,"Diderot, Denis,",1713-1784.,Translated from the French of Diderot ...,James the fatalist and his master. Translated ...,James the fatalist and his master. Translated ...,James the fatalist and his master. Vol. I.,,,...,,,,,,"['Fatalist', 'Master', 'French', 'Volume']",,PU,AS,University of Pennsylvania
2000,99869113503681,01372cam a2200337 i 4500,"Radcliffe, Ann Ward,",1764-1823.,by Ann Radcliffe ...,"The Italian, or, The confessional of the black...","The Italian, or The confessional of the Black ...",,,,...,Proper name,Female,Female,,['The mysteries of Udolpho'],"['Italian', 'Confessional', 'Penitent', 'Roman...",,PU,SH,Swarthmore College
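
The `sep='\t'` argument matters here because the file is tab-separated rather than comma-separated. A minimal sketch of the same call on a small in-memory table follows. The two rows are adapted from the output above, and `tsv_text` and `small_df` are names invented for this example.

```python
import io

import pandas as pd

# a tiny tab-separated table; columns are separated by \t, not commas
tsv_text = (
    "id\tauthor name\tvols\n"
    "260496\tWeston, Anna Maria\t3.0\n"
    "260508\tBurney, Sarah Harriet\t3.0\n"
)

# sep='\t' tells Pandas to split on tabs, so the commas inside
# the author names stay part of the cell values
small_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # column names taken from the first line
print(small_df.head(1))        # the first row only
```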


In [9]:
# Let's create a subset of that data using only certain columns
df_novels = df[['id', 'author name', 'author gender claim', 'pub date', 'pub location', 'narrative form primary']]

# get a random sample of 10 records
df_novels.sample(10)

| | id | author name | author gender claim | pub date | pub location | narrative form primary |
|---|----|-------------|---------------------|----------|--------------|------------------------|
| 729 | 380846 | Batchelder, Eugene | Male | 1850 | Cambridge | ['Third-person'] |
| 335 | 340844 | | | 1772 | London | ['Epistolary'] |
| 1933 | 991068823503681 | Crespigny, Mary Champion de | | 1796 | London | |
| 117 | 260815 | Erskine, Thomas Erskine | | 1818 | London | ['First-person'] |
| 877 | 440321 | | Indeterminate | 1812 | | ['Third-person'] |
| 942 | 440394 | Opie, Amelia | Female | 1806 | London | ['Third-person'] |
| 585 | 374934 | Briggs, Charles F | Indeterminate | 1844 | New York | ['First-person'] |
| 524 | 341622 | | | 1780 | London | ['Epistolary'] |
| 1929 | 99875323503681 | Davies, Edward, | | 1795 | Dublin | ['Epistolary'] |
| 1457 | 101332 | Cartwright, H. | Female | 1787 | | ['Epistolary'] |
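
The values in `narrative form primary` are displayed as bracketed lists such as `['Third-person']`. Assuming these are stored in the file as plain strings rather than real Python lists (worth checking in your own copy of the data), one way to unpack them is `ast.literal_eval` from the standard library. The helper `parse_list_cell` is a hypothetical name for this sketch.

```python
import ast

def parse_list_cell(cell):
    """Turn a string like "['Epistolary']" into a Python list.

    Missing or empty cells become an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    # literal_eval safely evaluates Python literals without running code
    return ast.literal_eval(cell)

# sample values copied from the table above; None stands in for a blank cell
raw_cells = ["['Third-person']", "['Epistolary']", None, "['First-person']"]

parsed = [parse_list_cell(cell) for cell in raw_cells]
print(parsed)
```

If the assumption holds, the same helper could be passed to a dataframe column's `apply` method to convert every cell.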


In [10]:
# we can use Pandas to summarize that data
df_novels.describe()

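| | id | author name | author gender claim | pub date | pub location | narrative form primary |
|---|----|-------------|---------------------|----------|--------------|------------------------|
| count | 2002 | 1640 | 1522 | 1996 | 1992 | 1877 |
| unique | 1976 | 689 | 11 | 163 | 70 | 21 |
| top | 167109 | Swift, Jonathan | Male | 1796 | London | ['Third-person'] |
| freq | 3 | 36 | 581 | 57 | 1369 | 895 |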
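
For text columns, `describe()` reports four figures: `count` (non-missing values), `unique` (distinct values), `top` (the most frequent value), and `freq` (how often `top` occurs). A rough sketch of the same arithmetic with the standard library, using a handful of made-up values rather than the real dataset:

```python
from collections import Counter

# made-up sample of a 'pub location' column; None marks a missing value
locations = ["London", "Dublin", "London", None, "New York", "London", None]

# count: how many values are present
present = [loc for loc in locations if loc is not None]
count = len(present)

# unique: how many distinct values appear
unique = len(set(present))

# top and freq: the most common value and the number of times it occurs
top, freq = Counter(present).most_common(1)[0]

print(count, unique, top, freq)
```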


## Visualizing Data

We can also use libraries like Altair to visualize our data within the notebook.

In [11]:
# import the Altair library
import altair as alt

narrative = alt.Chart(df_novels).mark_bar().encode(
    alt.X('pub date:Q', 
        axis=alt.Axis(format='t'), 
        scale=alt.Scale(zero=False,domain=(1650,1850),clamp=True), 
        title='Publication Year'
        ),
    alt.Y(aggregate='count', type='quantitative',),
    color='narrative form primary',
).properties(
    title='Narrative Forms in Early Novels',
)

narrative.interactive().properties(width=600)



Here's another example using astronomical data: a regularly updated dataset of all discovered exoplanets.

In [12]:
planet_js = pd.read_json('https://raw.githubusercontent.com/gwijthoff/exoplanets/main/exoplanets.json')
planet_js

Unnamed: 0,pl_name,discoverymethod,disc_year,disc_refname,disc_locale,disc_facility,disc_telescope,pl_orbper,pl_rade,pl_bmasse,st_spectype,sy_snum,sy_pnum,sy_mnum,cb_flag
0,OGLE-2016-BLG-1227L b,Microlensing,2020,<a refstr=HAN_ET_AL__2020 href=https://ui.adsa...,Ground,OGLE,1.3 m Warsaw University Telescope,,13.90,250.00000,,1,1,0,0
1,Kepler-276 c,Transit,2013,<a refstr=XIE_2014 href=https://ui.adsabs.harv...,Space,Kepler,0.95 m Kepler Telescope,31.884000,2.90,16.60000,,1,3,0,0
2,Kepler-829 b,Transit,2016,<a refstr=MORTON_ET_AL__2016 href=https://ui.a...,Space,Kepler,0.95 m Kepler Telescope,6.883376,2.11,5.10000,,1,1,0,0
3,K2-283 b,Transit,2018,<a refstr=LIVINGSTON_ET_AL__2018 href=https://...,Space,K2,0.95 m Kepler Telescope,1.921036,3.52,12.20000,,1,1,0,0
4,Kepler-477 b,Transit,2016,<a refstr=MORTON_ET_AL__2016 href=https://ui.a...,Space,Kepler,0.95 m Kepler Telescope,11.119907,2.07,4.94000,,2,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5582,HD 222155 b,Radial Velocity,2011,<a refstr=BOISSE_ET_AL__2012 href=https://ui.a...,Ground,Haute-Provence Observatory,1.93 m Telescope,3999.000000,13.40,581.62598,G2 V,1,1,0,0
5583,HD 88986 b,Radial Velocity,2023,<a refstr=HEIDARI_ET_AL__2023 href=https://ui....,Ground,Haute-Provence Observatory,1.93 m Telescope,146.050000,2.49,17.20000,G2 V,1,1,0,0
5584,Kepler-30 b,Transit,2012,<a refstr=SANCHIS_OJEDA_ET_AL__2012 href=https...,Space,Kepler,0.95 m Kepler Telescope,29.334340,3.90,11.30000,,1,3,0,0
5585,HD 3167 d,Radial Velocity,2017,<a refstr=CHRISTIANSEN_ET_AL__2017 href=https:...,Ground,Multiple Observatories,Multiple Telescopes,8.411200,1.92,4.33000,K0 V,1,4,0,0


In [13]:
# create a scatterplot comparing the mass of the planet, its orbital period,
# and the method used to discover it

# disable the MaxRowsError that Altair raises for datasets of more than 5,000 rows
alt.data_transformers.disable_max_rows()

galaxy = alt.Chart(planet_js).mark_circle().encode(
    alt.X('pl_orbper:Q', scale=alt.Scale(type='log'), title='Orbital Period (Earth Days)'),
    alt.Y('pl_bmasse:Q', scale=alt.Scale(type='log'), title='Mass (Earth Masses)'),
    color=alt.Color(
        'discoverymethod', 
        scale=alt.Scale(scheme='dark2'), 
        legend=alt.Legend(title='Discovery Method')
    ),
    tooltip=alt.Tooltip(['pl_name','disc_year']),
    #tooltip=['pl_name', 'pl_orbper', 'pl_rade', 'sy_pnum'],
).properties(title='Exoplanet Discoveries')

galaxy.interactive().properties(width=500, height=400)
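
Both axes use `type='log'` because orbital periods and masses span several orders of magnitude. A small sketch of the effect, using four orbital periods taken from the table above; the variable names are invented for this example.

```python
import math

# orbital periods in Earth days, copied from the exoplanet table above
orbital_periods = [1.921036, 6.883376, 146.05, 3999.0]

# on a linear axis, even the third-longest period sits at only a small
# fraction of the distance to the longest one, so short periods crowd together
linear_fraction = orbital_periods[2] / orbital_periods[-1]
print(round(linear_fraction, 3))

# on a log axis, positions are proportional to log10 of the value,
# so each power of ten takes up the same amount of space
log_positions = [round(math.log10(p), 2) for p in orbital_periods]
print(log_positions)
```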

