Random and Optimal Mathematica Walks on IMDb’s Top Films
July 15, 2013 — Matthias Odisio, Mathematica Algorithm R&D
Or: How I Learned to Watch the Best Movies in the Best Way
I remember when I lived across the street from an art movie theater called Le Club, looking at the movie posters on my way back home was often enough to get me in the ticket line. The director or main actors would ring a bell, or a close friend had recommended the title. Sometimes the poster alone would be appealing enough to lure me in. Even today there are still occasions when I make decisions from limited visual information, like when flipping through movie kiosks, TV guides, or a stack of DVDs written in languages I can’t read.
So how can Mathematica help? We’ll take a look at the top 250 movies rated on IMDb. Based on their posters and genres, how can one create a program that suggests which movies to see? What is the best way to see the most popular movies in sequence?
Drawing from methods used in content-based image retrieval, I will first compute visual similarities between movie posters and display them in a graph. The graph will then be enriched with semantic non-visual information like genres, which should then yield some decent viewing recommendations.
To gather the initial dataset, I looked for lists of great movies. There are many of them, including compilations by The New York Times and by the late Roger Ebert. I felt comfortable opting for the top 250 movies rated by users from IMDb, and I started by copying and pasting the table from the site into a Mathematica notebook.
IMDb is constantly updating the content of this top-250 movies table based on new ratings from users; you should expect the current list to differ from the snapshot I took, which happened to be on June 27, 2013.
The function ImportString parsed the pasted input seamlessly and produced a list of titles. I obtained the remaining data from Wolfram|Alpha by programmatically querying for the corresponding posters and genres:
A handful of obviously incorrect images and genres came back, perhaps because the query was ambiguous for some movie titles. I did correct the data by manually massaging those specific queries. The movies without posters in Wolfram|Alpha have been disregarded. After such curation, a list of 240 movies was left:
The first step was to compute similarities between the movie posters. More precisely, I computed a visual dissimilarity for each pair of images. The function ImageDistance features 16 ways to do so. Since movie posters often have a distinct color theme, I selected the EarthMoverDistance method that estimates the dissimilarities between the color histograms of two images. A coffee break later, I had a symmetric matrix containing the visual distances between all pairs of images:
The matrix appears with dark spots corresponding to strong visual similarities between movies:
Distances between images follow a distribution defined on the domain (0, 1), which is unimodal and has positive skewness:
The distances between images provide information about the similarities of movie posters. This information can be expressed in a graph where nodes correspond to movies and edges correspond to strong similarity.
Let’s assume that only the smallest distances constitute edges. With a threshold set so that only 5% of the distances are selected, not allowing self-loops, the movies’ visual similarity graph is:
At this rendering scale, we don’t see much more beyond common sense: a large group of interconnected dark posters as well as a second group of light posters that is connected to the dark posters group through saturated colorful poster offshoots. A few movies are isolated, among which is the top-rated movie, The Shawshank Redemption.
While you could use this graph to select a sequence in which to watch the movies, relying only on visual information wouldn’t be very meaningful. For the next step, I will augment the visual graph with semantic information gathered from the genres.
To do so, I introduce new vertices in the graph, one for each movie genre, as well as edges connecting each movie poster with its associated genres. In other words, I combine two graphs—the visual graph computed from the image distances and the semantic bipartite graph where each genre is connected to a list of movies. To compute the visual weighted adjacency matrix, I simply use either the EarthMoverDistance or zero, if the distance is too high, as weights. I also normalize each row of the matrix so that it sums to 1. This allows us to think of the entries in the matrix as probabilities of going from one vertex to the other:
I do not put weight in the semantic adjacency matrix, since we don’t know much about the reliability of the genres:
Finally, the visual and semantic adjacency matrix is a linear combination of the two matrices. Here, I use the average. I obtain the corresponding graph by calling the function WeightedAdjacencyGraph, making sure to indicate an absence of edge using ∞ instead of 0:
I can use this graph to compute paths between movies. For example, the shortest path between Toy Story 3 and Citizen Kane occurs by watching Toy Story 3, looking for another comedy, then watching Roman Holiday, and finally Citizen Kane:
The function GraphDistanceMatrix allows me to discover the most dissimilar pairs of movies, and it turns out there are two of them, both including Beauty and the Beast:
Another application is to help moviegoers select their next movie. Based on what they just saw, how can I suggest a few relevant movies? I’ll simply select the five most similar vertices in the graph, implicitly assuming that the moviegoers like any movie they see. For example, to someone who just watched Life of Pi, I’d recommend Toy Story 3, The Killing, The Shawshank Redemption, or browsing the lists of adventures and dramas:
Going further, I can actually model the behavior of a movie buff as he or she sees movies in sequence. I must assume that all the movies from the list are available to my movie buff—many are not lucky enough to have such resources at their disposal.
The process of navigating such a weighted graph based on the probabilities of transitioning from one vertex to the other is called a random walk. Instead of just using the visual and semantic graph’s adjacency matrix as a transition matrix, I’m going to modify it so that it can model boredom. Sooner or later, my movie buff will get tired of watching similar movies in sequence, and looking for fresh air, will switch randomly to another movie. To take into account such behavior, I use as transition matrix a linear combination of the graph’s adjacency matrix and a constant matrix of equiprobable transitions—with the exception that the diagonal entries in this constant matrix are set to 0, because no bored viewer is going to see the same movie again right away:
I set the combination factor α to 0.85 because it seems to be a common setting in such systems. The behavior of my movie buff will be modeled thanks to the functions DiscreteMarkovProcess, which represents a random walk process, and RandomFunction, which allows simulation of such processes. For example, here are 10 possible sequences after watching 3 Idiots:
As a final application, what is the best way to see all the movies? We are looking for the optimal sequence that would visit all the nodes corresponding to movies exactly once and the nodes corresponding to genres an arbitrary number of times. This is essentially a modification of the traveling salesman problem. There is no immediate solution for this modified problem in Mathematica. Instead, I am going to start from the visual graph and fill up each of its missing edges. Each pair of movies that is too distinct visually will be assigned a distance estimated by finding the length of the shortest path between the two movies in the visual and semantic graph.
The code below is a bit more complicated than that because I must take care of scaling the visual and semantic distances in a sensible fashion. Edges connecting movies to genres are weighted such that a jump through the semantic graph is given a weight of at least the maximum visual distance.
The optimal sequence is given by the function FindShortestTour:
And here it is, from The Shawshank Redemption to Memento, the optimal sequence in which to see all IMDb’s top 250 rated movies based on visual and semantic similarity:
From an algorithmic perspective, and maybe for the marathon viewer’s pleasure, it would be interesting in future work to include additional constraints so that movie franchises like The Lord of the Rings are seen in the expected order. Download a companion notebook and start your work on these topics right away.
It was very enjoyable experimenting with these data and models. I was also able to discover new application domains through Mathematica‘s documentation, which contains extensive information on the functionality used here.
Nevertheless, looking back in time, such a program is no match compared to the selections made by the art theater across the street or my dear friends’ suggestions.