John Snow & the Birth of Epidemiology: Data Analysis & Visualization

In 1854, there was a major cholera outbreak in Soho, a neighborhood in London that Judith Summers described as full of “cow-sheds, animal droppings, slaughterhouses, grease-boiling dens and primitive, decaying sewers.” At the time, the cause of the outbreak was unknown because germ theory was still being developed and disease transmission was not well understood. Miasma theory was the dominant hypothesis, and it proposed that diseases, including cholera and the plague, were spread by foul gases emitted from decomposing organic matter.

John Snow, a physician known for his work with anesthetics, managed to trace the outbreak back to a water pump on Broad Street contaminated by a nearby cesspit. He did this by mapping the locations of cholera cases and then comparing infections recorded near the Broad Street pump with cases in other neighborhoods. This was a major development in the formulation of germ theory and helped convince many scientists that cholera actually spread through contaminated water instead of noxious air.

We still have Snow’s data on the infected cases and their locations in the Soho neighborhood. In this computational essay, we will explore this data with modern statistical techniques available in the Wolfram Language and the SpatialPointData functionality introduced recently.

First, we’ll define the London cholera outbreak as a spatial point configuration:

sp = SpatialPointData

The data contains 322 buildings (locations) in the region that, according to Snow, contained infected individuals (i.e. cases). For each location, the data includes the number of cases (Cases), the distance to the Broad Street pump (DistanceToContaminatedPump) and the distance to the nearest non-contaminated pump (DistanceToNonContaminatedPump). It also records whether the Broad Street pump or a non-contaminated pump was closest to the location (ContaminatedPumpClosest). The data does not directly use latitude-longitude coordinates, so the pumps’ coordinates must be scaled appropriately.
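To make the record layout concrete, here is a minimal sketch in Python rather than the Wolfram Language used in this essay. The field names mirror the annotations listed above, but the class name `CaseLocation`, the coordinates and the sample values are hypothetical. Deriving ContaminatedPumpClosest from the two distances is also an assumption; in the actual data it is stored as its own annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseLocation:
    """One building with cases (hypothetical Python mirror of the annotations)."""

    x: float  # scaled map coordinate, not latitude/longitude
    y: float
    cases: int  # Cases
    distance_to_contaminated_pump: float  # DistanceToContaminatedPump
    distance_to_non_contaminated_pump: float  # DistanceToNonContaminatedPump

    @property
    def contaminated_pump_closest(self) -> bool:
        """ContaminatedPumpClosest, derived here from the distances (an assumption)."""
        return (
            self.distance_to_contaminated_pump
            < self.distance_to_non_contaminated_pump
        )


# Made-up sample values, for illustration only (not Snow's data).
sample = [
    CaseLocation(0.2, 0.3, 3, 40.0, 120.0),
    CaseLocation(0.8, 0.9, 1, 210.0, 60.0),
]
total_cases = sum(loc.cases for loc in sample)
```

Keeping the Boolean as a derived property avoids the two distances and the flag drifting out of sync, at the cost of not matching the stored annotation exactly when distances tie.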

Now we can define the location of the contaminated Broad Street pump and other non-contaminated pumps in a manner compatible with the data:

contaminatedPump =  GeoPosition

With the Broad Street pump in red and non-contaminated pumps in green, infection locations are plotted on a map of the area:

Row

Each black dot is a house with one or more cases, the red marker is the location of the contaminated pump and the green markers are non-contaminated pumps.

We can drill deeper into the data to show the number of cases at each location:

Show

The data also contains the following annotations:

Column

ContaminatedPumpClosest is Boolean, where “true” means the contaminated pump was the closest pump, and “false” means the closest pump was non-contaminated.

In this dataset, each household (represented by a point) contains at least one case of cholera. A histogram of the number of cases per location shows what proportion of locations had each case count:

Histogram

Fit a distribution to the case counts:

edist = EstimatedDistribution

Visualize the fit together with the histogram:

Show

Test the data against the obtained distribution:

DistributionFitTest

The case counts fall off roughly exponentially as the number of cases per location increases. This could be because the number of residents per building follows approximately the same distribution, assuming the same average infection rate per building.
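As a rough illustration of fitting a decaying distribution to counts, the following Python sketch (not the Wolfram Language code used in this essay) fits a geometric distribution by maximum likelihood. The choice of the geometric family is an assumption made here as a discrete analogue of exponential decay, and the counts are made up.

```python
from collections import Counter

# Hypothetical case counts per location (not Snow's data).
counts = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]


def fit_geometric(data):
    """Maximum-likelihood p for P(k) = p * (1 - p)**(k - 1), k = 1, 2, ..."""
    return len(data) / sum(data)  # equals 1 / mean


def geometric_pmf(k, p):
    """Probability of exactly k cases under the fitted model."""
    return p * (1 - p) ** (k - 1)


p_hat = fit_geometric(counts)
observed = Counter(counts)

# Compare observed proportions with the fitted probabilities.
for k in sorted(observed):
    print(k, observed[k] / len(counts), round(geometric_pmf(k, p_hat), 3))
```

Comparing the two columns by eye is only a first check; a formal goodness-of-fit test is still needed to decide whether the family is plausible.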

As seen earlier, this dataset also includes the distance to the nearest non-contaminated pump and to the Broad Street pump, which we will explore now.

Create a distribution chart of the non-contaminated and contaminated pump distances:

With

This chart shows that because there are more non-contaminated pumps, the maximum distance to them is lower. Cases are, on average, closer to the contaminated pump, and the minimum value of DistanceToContaminatedPump is lower than that of DistanceToNonContaminatedPump.

Show how the cases are distributed in space:

Show

This smooth density histogram counts the cases at each location and creates a “density map” of the number of cases in that area. This shows us that the largest case “hotspot” takes the shape of an elongated oval approximately centered on the Broad Street pump and oriented at about a 45-degree angle from the streets. Additionally, almost all areas with high density are near the contaminated pump. This provides additional evidence that the Broad Street pump is the source of the cases.

Plot how many points are at each distance from the contaminated pump and the closest non-contaminated pump:

Module

Note that in these figures, the interquartile ranges do not overlap. Strangely enough, it appears that some infected people used the contaminated Broad Street pump despite living closer to a non-contaminated one. Perhaps the roadways made it easier to get to the Broad Street pump, or people were in the area for another reason.
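The interquartile-range comparison can be checked numerically as well as visually. This is a small Python sketch (not the Wolfram Language code used in this essay) with made-up distances in arbitrary units; the helper names `iqr` and `ranges_overlap` are hypothetical.

```python
from statistics import quantiles

# Hypothetical distances in arbitrary units (not Snow's data).
to_contaminated = [30, 45, 60, 80, 95, 110, 130, 150]
to_non_contaminated = [160, 175, 190, 200, 215, 230, 250, 270]


def iqr(data):
    """Return (Q1, Q3) using the default 'exclusive' quantile method."""
    q1, _, q3 = quantiles(data, n=4)
    return q1, q3


def ranges_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


iqr_c = iqr(to_contaminated)
iqr_n = iqr(to_non_contaminated)
print(iqr_c, iqr_n, ranges_overlap(iqr_c, iqr_n))
```

Different quantile conventions give slightly different quartiles, so a borderline overlap should not be over-interpreted.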

Now we will attempt to find the source of the cholera outbreak. First, we will determine the spatial median. Then, we will calculate how important the relative distance to the Broad Street pump is compared to a non-contaminated pump in affecting cases.

We will do most of this analysis using both unweighted data and data weighted by the number of cases at the location and with different spatial measures. It is very likely that weighting by cases gives a more accurate picture of case hotspots. It is less likely, however, to detect a phenomenon that happens rarely. While unweighted data will detect phenomena like that, it’s easily skewed by outliers.

We can calculate the SpatialMedian of the data, weighting it by the number of cases at the location:

weightedSpatialMedian =  SpatialMedian

When we weight the data, the spatial median tracks where the cases themselves are concentrated. Areas with very few cases have little influence on the result, and case hotspots have more.

When the data is left unweighted, the spatial median is more related to the range and locations of the data because a location with only one case on the edge will provide as much influence as a location with 20 cases toward the center.

The spatial median is the location that minimizes the sum of the distances to the case locations. Therefore, if the cases were primarily stemming from a single source, it would be likely that the source is near the spatial median. This makes it useful for identifying an area to examine more closely for possible sources. Note: unless otherwise specified, the default distance function is EuclideanDistance (or GeoDistance for geospatial data).
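One way to make the spatial median concrete is Weiszfeld's algorithm, which iteratively approximates the point that minimizes the (optionally weighted) sum of Euclidean distances to a set of points. The sketch below is in Python, not the Wolfram Language, and is not a description of how SpatialMedian is implemented; the function name, the tolerance defaults and the sample points are all assumptions made for illustration.

```python
import math


def spatial_median(points, weights=None, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the weighted sum of Euclidean distances."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    # Start from the weighted centroid.
    x = sum(w * p[0] for p, w in zip(points, weights)) / total
    y = sum(w * p[1] for p, w in zip(points, weights)) / total
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for (px, py), w in zip(points, weights):
            d = math.hypot(px - x, py - y)
            if d < 1e-12:
                # Skip a point that coincides with the current estimate
                # to avoid dividing by zero (a common simplification).
                continue
            num_x += w * px / d
            num_y += w * py / d
            denom += w / d
        if denom == 0.0:
            break
        new_x, new_y = num_x / denom, num_y / denom
        moved = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if moved < tol:
            break
    return x, y


# Hypothetical locations and case counts (not Snow's data):
# four busy buildings in a square plus one outlying building with one case.
locations = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (6.0, 1.0)]
cases = [5, 5, 5, 5, 1]
unweighted = spatial_median(locations)
weighted = spatial_median(locations, cases)
```

In this toy example the single outlying building pulls the unweighted median further from the center of the square than it pulls the weighted one, which is the outlier sensitivity described above.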

Find the unweighted spatial median:

unweightedSpatialMedian = SpatialMedian

Plot the location of the spatial medians (using EuclideanDistance) with the cases:

Show

Let’s use different distance functions to calculate spatial medians:

spList = GeoPosition

Plot the location of the unweighted spatial medians using different distance functions with the pump location and the cases (note: some of the spatial median locations overlap):

Row

The spatial medians, whether weighted, unweighted or with different distance functions, are near the location of the Broad Street pump and each other (see the previous visualizations). This tells us that the number of cases at a location is approximately rotationally symmetric around the spatial median. Otherwise, the unweighted spatial median would be quite different from the weighted spatial median.

Additionally, looking at the following visualization, it appears that the case locations are more common toward the center of the data. This provides some amount of anecdotal evidence that there is a single source.

Show a smooth density histogram of the data weighted by the number of cases at that location:

Show

Show a smooth density histogram of the case locations, ignoring the number of cases:

Show

Next, we will calculate how much of an effect being near the Broad Street pump instead of another pump had on the number of cases at a location. It may provide additional evidence for the Broad Street pump being responsible.

We calculate a measure that is simply the ratio of DistanceToNonContaminatedPump to DistanceToContaminatedPump. Because this ratio grows as a location gets closer to the Broad Street pump relative to the other pumps, we would expect it to increase with the number of cases, at least up to a point:

ratio = Rescale
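For readers who want to see the arithmetic, here is a hedged Python sketch of the ratio (not the Wolfram Language code used in this essay). The min-max rescaling to the interval from 0 to 1 is an assumption about how the rescaling is done, and the distances are made up.

```python
def distance_ratio(d_non_contaminated, d_contaminated):
    """Ratio above 1 means the Broad Street pump is the nearer pump."""
    return d_non_contaminated / d_contaminated


def rescale(values):
    """Min-max rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical distances in arbitrary units (not Snow's data).
d_non = [120.0, 60.0, 90.0, 200.0]
d_cont = [40.0, 210.0, 90.0, 50.0]

ratios = [distance_ratio(n, c) for n, c in zip(d_non, d_cont)]
scaled = rescale(ratios)
```

Rescaling changes only the units of the ratio, not the ordering of locations, so rank-based comparisons are unaffected by it.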

We can create a BoxWhiskerChart of cases and ratios to see if it shows a similar phenomenon or if it is unique to this one method of visualization.

Show box-whisker charts of the ratio for each number of cases:

With

Cases and ratios seem to rise together up until five cases, and although the rise in the median still seems to be linear, the 75th percentile appears to grow exponentially, which is interesting. Additionally, the lowest value and 25th percentile continue to rise when there are five cases. This suggests that our hypothesis from earlier that the low values are potentially outliers may be correct. We still do not have, however, the evidence to prove this, and there very well could be some secondary factor here impacting case numbers.

Next, we will test whether the ratio and cases are independent of each other:

IndependenceTest

Showing a lack of independence provides another piece of evidence that the ratio, and therefore the distance to the Broad Street pump and other pumps, is related to cases. Between all these visualizations and data analysis, we can say that the Broad Street pump seems more likely to be related to the cholera outbreak, while the other pumps are significantly less likely to be contributors to it.
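To illustrate what an independence test of this kind does, here is a Python sketch of a permutation test based on Spearman's rank correlation. This is a swapped-in technique chosen for clarity, not the method used by IndependenceTest in this essay, and the case counts and ratios are made up.

```python
import random


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def permutation_independence_test(x, y, n_perm=2000, seed=0):
    """Return (Spearman rho, two-sided permutation p-value)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rank_x, rank_y = ranks(x), ranks(y)
    observed = pearson(rank_x, rank_y)
    shuffled = rank_y[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rank_x, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical case counts and rescaled ratios (not Snow's data).
cases = [1, 1, 1, 2, 2, 3, 3, 4, 5, 5]
ratio = [0.05, 0.10, 0.08, 0.20, 0.25, 0.40, 0.35, 0.60, 0.90, 0.80]
rho, p_value = permutation_independence_test(cases, ratio)
```

A small p-value here only says that the pairing of cases and ratios is unlikely under shuffling; it does not by itself show which pump caused the outbreak.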

Snow’s map persuaded the Board of Guardians of St. James’s Parish, in which the pump fell, to remove the handle from the pump. While the overall effectiveness of this tactic remains in doubt because the cholera outbreak was already declining, the impact of Snow’s disease transmission discovery went far beyond the 1854 cholera outbreak.

In fact, Snow could be considered “the father of epidemiology” because, as the US Centers for Disease Control and Prevention (CDC) states on its website, “his work illustrates the classic sequence from descriptive epidemiology to hypothesis generation to hypothesis testing (analytic epidemiology) to application….”

Today, if you visit the site of the offending Broad Street pump, you’ll find a double commemoration of Snow’s work: a memorial in the form of a water pump and his very own “watering hole,” the John Snow pub (as seen thanks to Google Street View).

John Snow pub

Acknowledgments

We would like to thank Gosia Konwerska (Wolfram Research) for providing the data and helping to improve the essay and Jon McLoone (Wolfram Research) for his help on transforming data into actual geo positions and providing historical and contemporary context about John Snow’s contributions to epidemiology.

Comments

2 comments

  1. Roman and Mads
    A beautiful treatment of a great example of the triumph of a novel analytic approach over a nasty problem, and again, another example of the effectiveness of combining various Mathematica functionalities, with imagination.
    Thanks.
  2. Very nice work and use of WL features! In preparation for a math class I will be teaching to medical students, I just finished studying “The Ghost Map” by Steven Johnson that covers the socio-political aspects of this outbreak (along with the technical). I plan to point to this blog and use one or two of the figures in my class. I will also attempt to produce a Voronoi diagram of the pump locations using the walking-distance metric that John Snow employed.