Wolfram Computation Meets Knowledge

How to Make an xkcd Comic with a Tiny Tweetable Algorithm Discover Fun Facts about Namesake Cities

Did you know there are no beaches in Miami? “No, The Other One”: Miami, Arizona, a classic Western copper boomtown 80 miles east of Phoenix… but obviously not the Phoenix in Oregon.

The quotation marks indicate the title of this recent Randall Munroe xkcd comic with many well-known United States city names found in unexpected locationsoften more than once!—thanks to lesser-known places homonymous with their more famous counterparts. Munroe is an exceptional artist who can pack scientific storytelling into the frame of a comic. Can we make an algorithm do the same?

Licensing agreement "No, The Other One"

Place names are not unique. Most people will not confuse Portland, Maine—about four hundred years old and named after the Isle of Portland in the English Channel—with Portland, Oregon, named after the one in Maine but 10 times more populous. But due to the many other Portlands in both the US and the world, confusion is understandable.

A popular search engine question is “Can a US state have two cities with the same name?” The answer is yes, and sometimes even more than two. For instance, there are two Portlands in Wisconsin, one in Monroe County and the other in Dodge County, with a two-hour drive between them. In fact, Wisconsin has the most namesake cities in the US (see later discussion). Please note that the definition of what counts as a “city” can differ between data frameworks, and here we will follow the Wolfram Knowledgebase standards.

Let’s begin with a simple question: how did Randall Munroe make his map? It felt like a data story to me, and I wanted to reproduce his map right away, but in a completely automated way built by an algorithm. Here is my tentative version:

First attempt to recreate xkcd map—click to enlarge

It’s definitely not as complete and beautiful as the original xkcd comic, but it’s a start. I invite you to click and zoom in on the maps in this post and explore them carefully to see if any cities and their locations get you curious enough to check them out.

Miami, Arizona, caught my eye because of it being completely landlocked, in contrast to Miami, Florida. Learning its motto, “One mile long, 100 years of history,” and about that history and some residents, such as the celebrated Western film actor Jack Elam, made my day.

Miami sign

I got hooked on namesake cities when my friend and colleague Carlo Barbieri sent me some tweet-size Wolfram Language code partly reproducing the xkcd comic. Here is the code fed into the bot of the Tweet-a-Program project. The project allows anyone to tweet ideas in the form of short programs that are executed automatically. The results are then tweeted back, which is often quite entertaining:

If you have a list of famous cities with several corresponding namesakes, Carlo’s program automatically finds and shows some of them on a map. The key here is the intelligent Interpreter function (with its AmbiguityFunction option) that, given a generic name string, can interpret it as a list of cities with that name.

To get closer to the original xkcd comic, I wanted many more cities, but listing those cities explicitly would exceed the standard tweet size limit of 280 characters. I needed, therefore, a way to find them programmatically. Instead of using Interpreter, I resorted to processing data directly, and also added a few styling options:

This is not bad for just 195 characters of code that’s also quite readable. The obvious shortcomings are label collision and patchy nonuniform spatial distribution. To demonstrate an algorithm that can improve this map, I’ll need to move away from tweets and 280-character limits. For those interested in the so-called code golf aspect of it, however, I have a few notes for you at the end of this article.

The Entity framework provides access to the Wolfram Knowledgebase, one of the world’s largest and broadest repositories of curated data (it’s also used in Wolfram|Alpha). It’s straightforward to get the list of all cities constrained to a specific country, which is the US in our case:

cityALL =  EntityClass
&#10005


The more complete and structured the data, the simpler the algorithm necessary for processing. So we start by getting all the necessary data characteristics:

dataALL = DeleteMissing
&#10005


For these 32,732 US cities, we have three columns of data. Let’s take a look at the first five cities. You’d be right to guess they are already sorted by population size. The second and third columns are unique city and state entities that can be used to extract different information about them, such as geolocation, area and crime rates:

dataALL
&#10005


The first column lists official city names, which are not unique and can be the same for multiple cities, such as “Portland” in Oregon, Maine, Wisconsin and so on. Here’s another factoid: just as different cities can share a single name, a single city can have different nicknames:

Entity
&#10005


We need the first column to gather all cities with a specific name string. After dropping all name strings with just a single corresponding city, we are left with 4,110 ambiguous strings:

byNAME = Select
&#10005


Corresponding to those strings, there are 15,923 distinct namesake cities:

Flatten
&#10005


We cannot plot thousands of cities with labels on a map that would fit a standard screen because it would be an unreadable mess (and obviously Randall Munroe does not do that, either). So which cities out of those thousands should we pick? This is where data visualization design comes into play. I suggest the following steps:

  • Pick the most recognizable city names so most people can relate to the map. This is easy, as our original city list is sorted by population, and most famous cities are at the beginning of the list or any sublist formed during data processing.
  • Replace each famous city with its namesake (or a few) in such a way that distribution of the cities over the map is as uniform as possible. This helps avoid label collision and produces an appealing, comprehensive map design.

A simple approach to a more uniform spatial city distribution is to pick some cities in each state. For this, I will group cities by state:

bySTATE =  GroupBy
&#10005


The syntax 2;;UpTo[3] means that in each group of homonymous cities, I skip the first one, typically the most populous and famous, and take the second and third largest, if available. (Sometimes there are just two namesakes.) Considering only the second- and third-largest namesakes reduces the original total number of such cities from 15,923 to 6,240:

Length /@ bySTATE // Total
&#10005


This is still too many cities to show on the map. I can reduce this number further by picking only, for example, two cities per each of the 50 states for a total of 100 cities, which is quite reasonable for a good map layout. But states vary highly by area, from tiny ones such as Rhode Island to Alaska and Texas, which are larger than Ukraine, the largest country within continental Europe (not considering transcontinental Russia, the largest country on Earth, but with 77% of its area in Asia, not Europe):

ListLogPlot
&#10005


For small Rhode Island, two city labels are too many, while for Texas, two are too few. A simple function can help to define how many cities to pick per state depending on its area:

citiyN
&#10005


I will not go over three labels per state, but this function is quite arbitrary, and readers are welcome to experiment to find better map layouts. Here is my final list of 97 namesake cities for the map:

cityMAP =  Flatten
&#10005


The next line with a few specific style and layout options will generate the second image in this blog post, my final take on the xkcd idea:

GeoListPlot
&#10005


This is different from the xkcd map, first of all due to the choice of cities. My particular algorithmic sampling of cities led to a different map from the one produced with Randall Munroe’s data and design. I do enjoy the fact that some cities, such as Miami and Phoenix, pop up a few times like they do in the xkcd map. I am satisfied with the layout, but one could do better, and the addition of Hawaii and Alaska would be nice. I invite readers to experiment and share their ideas and results in the comments below, or post their code on Wolfram Community.

Working with namesake cities taught me a few curious facts. It is surprising, for example, how many places in the US have the same name:

namesakes = ReverseSort
&#10005


For some of the namesakes, we can see the rationale for multiple uses, such as US presidents or popular notions such as “Union” or “Portland” (“land surrounding a harbor” from the Old English portlanda):

ListPlot
US cities per name—click to enlarge
&#10005


The name “Franklin” is so popular it looks like an outlier. Perhaps in some cases it relates to Benjamin Franklin or even Franklin Roosevelt? A fun way to visualize the data diversity is a WordCloud of the four hundred most popular names scaled by the number of cities per name:

WordCloud
Namesake cities word cloud—click to enlarge

&#10005


Another interesting question is: where are most of the namesake cities located? To get an idea, I need a helper function that takes an “ambiguous” name string and “interprets” it as a set of entities for all possible cities with such a name:

ambiguousENTITY
&#10005


This structure was already used in the code that my colleague Carlo sent me for the first tweet earlier in this article. Let’s also plot the locations of the six most popular names and scale the markers by city population size:

GraphicsGrid
Six most popular city names—click to enlarge
&#10005


Even with the naked eye, you can see significant concentrations in Wisconsin. The same might be true for many other less popular names and namesakes that we did not plot. Can we prove that assumption? Easily. Count the number of namesakes per state:

perSTATE = Sort
&#10005


Wisconsin strongly leads with over a thousand namesakes, followed by New York and Illinois:

Bar chart
Namesake cities per state
&#10005


The same data can be depicted geographically to provide a better sense of the spatial distribution of namesakes:

GeoRegionValuePlot
&#10005


But why Wisconsin? One can guess that the number of namesake cities in a state is proportional to the total number of cities in the state. If Wisconsin has more cities than other states, it should also have more namesakes. And indeed, we can confirm this assumption by merging the data for total city and namesake numbers per state and looking at the scatter plot. Sure enough, Wisconsin leads in both datasets:

ListPlot
Namesake cities vs. total cities—click to enlarge
&#10005


As promised, before I conclude, here are a few thoughts on tweetable and compact code. It is generally quite hard to fit a computer program into a tweet (280 characters), assuming it does something interesting. There is a large, competitive programming scene called code golf where the goal is to compress an algorithm into its shortest form. This idea also relates to Kolmogorov complexity. Moreover, all-new programming languages were invented just for the competition, emphasizing concise expressibility of code.

The Wolfram Language is highly expressive and compact, which can be seen, for instance, in how it is favored over other languages in the Rosetta Code project. We also run our own One-Liner Competition and people often compete with the Wolfram Language on the Code Golf Stack Exchange. As you saw earlier in this post, we even launched the Tweet-a-Program project to encourage the programming community to actually tweet in code with an automated twitter-bot executing that code.

Code golfing is also a type of puzzle that trains the mind to find alternative and often unconventional solutions. But code compactness is not everything, as it can sacrifice clarity and performance. It is aesthetically rewarding to find the balance among code compactness, clarity and performance. No matter which of these qualities attracts you more, I invite you to post your own versions of the xkcd map on Wolfram Community, general- or code golf–style, and link to your work in the comments below. As usual, interesting ideas will be added to our Staff Picks editorial group.

And finally, here’s an Easter egg for you: If you mouse over anywhere on the “No, The Other One” xkcd comic on its original page, it pops up with the text “Key West, Virginia is not to be confused with Key, West Virginia.” The slight comma shift in the names corresponds to the two-and-a-half-hour drive between the cities, and neither place should be confused with the iconic destination and southernmost city in the contiguous US, Key West, Florida.

Visit Wolfram Community or the Wolfram Function Repository to embark on your own computational adventures!

Comments

Join the discussion

!Please enter your comment (at least 5 characters).

!Please enter your name.

!Please enter a valid email address.