Random and Optimal Mathematica Walks on IMDb’s Top Films
Or: How I Learned to Watch the Best Movies in the Best Way
I remember when I lived across the street from an art movie theater called Le Club; looking at the movie posters on my way back home was often enough to get me into the ticket line. The director or main actors would ring a bell, or a close friend had recommended the title. Sometimes the poster alone was appealing enough to lure me in. Even today there are still occasions when I make decisions from limited visual information, like when flipping through movie kiosks, TV guides, or a stack of DVDs written in languages I can’t read.
So how can Mathematica help? We’ll take a look at the top 250 movies rated on IMDb. Based on their posters and genres, how can one create a program that suggests which movies to see? What is the best way to see the most popular movies in sequence?
Drawing from methods used in content-based image retrieval, I will first compute visual similarities between movie posters and display them in a graph. The graph will then be enriched with semantic non-visual information like genres, which should then yield some decent viewing recommendations.
To gather the initial dataset, I looked for lists of great movies. There are many of them, including compilations by The New York Times and by the late Roger Ebert. I felt comfortable opting for the top 250 movies rated by users from IMDb, and I started by copying and pasting the table from the site into a Mathematica notebook.
IMDb constantly updates this top-250 table as new user ratings come in; you should expect the current list to differ from the snapshot I took on June 27, 2013.
The function ImportString parsed the pasted input seamlessly and produced a list of titles. I obtained the remaining data from Wolfram|Alpha by programmatically querying for the corresponding posters and genres:
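The original notebook code is not reproduced here, but the gathering step might look roughly like this; the variable pastedText and the choice of Wolfram|Alpha output formats are assumptions, not the original queries:

(* Parse the pasted IMDb table into rows; the exact column layout of the paste is an assumption *)
rows = ImportString[pastedText, "Table"];

(* Query Wolfram|Alpha programmatically for one title; the poster and genres
   are then picked out of the returned pods *)
WolframAlpha["The Godfather movie", "PodImages"]
WolframAlpha["The Godfather movie", "PodPlaintext"]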
A handful of obviously incorrect images and genres came back, probably because some movie titles made the query ambiguous. I corrected those entries by manually adjusting the specific queries, and I discarded the movies for which Wolfram|Alpha had no poster. After this curation, 240 movies remained:
The first step was to compute similarities between the movie posters. More precisely, I computed a visual dissimilarity for each pair of images. The function ImageDistance features 16 ways to do so. Since movie posters often have a distinct color theme, I selected the EarthMoverDistance method that estimates the dissimilarities between the color histograms of two images. A coffee break later, I had a symmetric matrix containing the visual distances between all pairs of images:
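As a sketch of this step (posters is assumed to hold the 240 poster images; this is not the original code):

(* Pairwise visual distances between all posters; the matrix is symmetric with zeros on the diagonal,
   so only the upper triangle really needs to be computed *)
distanceMatrix = Table[
   ImageDistance[posters[[i]], posters[[j]], DistanceFunction -> "EarthMoverDistance"],
   {i, Length[posters]}, {j, Length[posters]}];

(* Visualize the matrix *)
MatrixPlot[distanceMatrix]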
The matrix appears with dark spots corresponding to strong visual similarities between movies:
Distances between images follow a distribution defined on the domain (0, 1), which is unimodal and has positive skewness:
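A quick way to check this (a sketch; upperDistances just collects the upper-triangular entries of the matrix computed above):

(* Collect the distinct pairwise distances and inspect their distribution *)
upperDistances = Flatten[Table[distanceMatrix[[i, j]],
    {i, Length[posters]}, {j, i + 1, Length[posters]}]];
Histogram[upperDistances, 50]
Skewness[upperDistances]  (* positive for this data, per the text *)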
The distances between images provide information about the similarities of movie posters. This information can be expressed in a graph where nodes correspond to movies and edges correspond to strong similarity.
Let’s assume that only the smallest distances constitute edges. With a threshold set so that only 5% of the distances are selected, not allowing self-loops, the movies’ visual similarity graph is:
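Here is a sketch of how such a thresholded graph can be built; the 5% cutoff is taken as the 5th percentile of the pairwise distances, and the variable names (titles for the 240 curated titles) are assumptions:

(* Keep only the smallest 5% of distances as edges; the strict inequality 0 < d removes self-loops *)
threshold = Quantile[upperDistances, 0.05];
visualAdjacency = Map[Boole[0 < # <= threshold] &, distanceMatrix, {2}];
visualGraph = AdjacencyGraph[titles, visualAdjacency];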
At this rendering scale, we don’t see much more beyond common sense: a large group of interconnected dark posters as well as a second group of light posters that is connected to the dark posters group through saturated colorful poster offshoots. A few movies are isolated, among which is the top-rated movie, The Shawshank Redemption.
While you could use this graph to select a sequence in which to watch the movies, relying only on visual information wouldn’t be very meaningful. For the next step, I will augment the visual graph with semantic information gathered from the genres.
To do so, I introduce new vertices in the graph, one for each movie genre, as well as edges connecting each movie poster with its associated genres. In other words, I combine two graphs—the visual graph computed from the image distances and the semantic bipartite graph where each genre is connected to a list of movies. To compute the visual weighted adjacency matrix, I simply use either the EarthMoverDistance or zero, if the distance is too high, as weights. I also normalize each row of the matrix so that it sums to 1. This allows us to think of the entries in the matrix as probabilities of going from one vertex to the other:
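A minimal sketch of this step, reusing the thresholded distances from above (not the original code):

(* Visual weights: EarthMoverDistance for close pairs, zero otherwise, with each
   nonempty row normalized to sum to 1 so entries read as transition probabilities *)
normalizeRow[row_] := If[Total[row] == 0, row, row/Total[row]];
visualWeights = normalizeRow /@ Map[If[0 < # <= threshold, #, 0] &, distanceMatrix, {2}];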
I do not put weights in the semantic adjacency matrix, since we don’t know much about the reliability of the genre data:
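The semantic part can then be a plain 0/1 movie-genre incidence matrix (movieGenres, a list of genre lists per movie, is an assumed variable name):

(* 1 if the movie belongs to the genre, 0 otherwise *)
genres = Union[Flatten[movieGenres]];
semanticWeights = Table[Boole[MemberQ[movieGenres[[i]], g]],
   {i, Length[titles]}, {g, genres}];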
Finally, the visual and semantic adjacency matrix is a linear combination of the two matrices. Here, I use the average. I obtain the corresponding graph by calling the function WeightedAdjacencyGraph, making sure to indicate the absence of an edge with ∞ instead of 0:
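Putting the two together might look like this. This is only a sketch; note that the row normalization above makes the matrix asymmetric, so the resulting graph is directed, which is fine for the path computations that follow:

(* Block matrix over movie and genre vertices: average of the visual and semantic parts *)
nM = Length[titles]; nG = Length[genres];
visualFull = ArrayPad[visualWeights, {{0, nG}, {0, nG}}];
semanticFull = ArrayFlatten[{{ConstantArray[0, {nM, nM}], semanticWeights},
    {Transpose[semanticWeights], ConstantArray[0, {nG, nG}]}}];
combined = (visualFull + semanticFull)/2;
(* 0 means "no edge" in the matrix, but WeightedAdjacencyGraph expects Infinity for that *)
movieGenreGraph = WeightedAdjacencyGraph[Join[titles, genres], combined /. 0 | 0. -> Infinity];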
I can use this graph to compute paths between movies. For example, the shortest path between Toy Story 3 and Citizen Kane consists of watching Toy Story 3, looking for another comedy, then watching Roman Holiday, and finally Citizen Kane:
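With the graph above, this is a one-liner (a sketch, assuming the vertex names are the title strings):

FindShortestPath[movieGenreGraph, "Toy Story 3", "Citizen Kane"]
(* the path passes through a comedy genre vertex and Roman Holiday, per the text *)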
The function GraphDistanceMatrix allows me to discover the most dissimilar pairs of movies, and it turns out there are two of them, both including Beauty and the Beast:
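A sketch of that lookup, restricted to the movie rows and columns of the distance matrix:

(* All-pairs graph distances; the largest movie-to-movie entries give the most dissimilar pairs *)
gdm = GraphDistanceMatrix[movieGenreGraph];
movieBlock = gdm[[;; nM, ;; nM]];
titles[[#]] & /@ Position[movieBlock, Max[movieBlock]]  (* pairs may appear in both orders *)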
Another application is to help moviegoers select their next movie. Based on what they just saw, how can I suggest a few relevant movies? I’ll simply select the five most similar vertices in the graph, implicitly assuming that the moviegoers like any movie they see. For example, to someone who just watched Life of Pi, I’d recommend Toy Story 3, The Killing, The Shawshank Redemption, or browsing the lists of adventures and dramas:
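One way to sketch such a recommender (the helper name recommend is mine, not from the post):

(* Return the n graph-nearest vertices (movies or genres) to a just-watched title *)
recommend[g_, seen_, n_: 5] := Module[{verts = VertexList[g], dists},
  dists = GraphDistance[g, seen];            (* distances from `seen` to every vertex *)
  DeleteCases[verts[[Ordering[dists]]], seen][[;; n]]]

recommend[movieGenreGraph, "Life of Pi"]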
Going further, I can actually model the behavior of a movie buff as he or she sees movies in sequence. I must assume that all the movies from the list are available to my movie buff—many are not lucky enough to have such resources at their disposal.
The process of navigating such a weighted graph based on the probabilities of transitioning from one vertex to the other is called a random walk. Instead of just using the visual and semantic graph’s adjacency matrix as a transition matrix, I’m going to modify it so that it can model boredom. Sooner or later, my movie buff will get tired of watching similar movies in sequence, and looking for fresh air, will switch randomly to another movie. To take into account such behavior, I use as transition matrix a linear combination of the graph’s adjacency matrix and a constant matrix of equiprobable transitions—with the exception that the diagonal entries in this constant matrix are set to 0, because no bored viewer is going to see the same movie again right away:
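A sketch of that transition matrix, following the description above; alpha and the matrix names are assumptions, and the rows are renormalized at the end so they sum to 1:

(* Mix the graph transitions with uniform "bored" jumps that never repeat the current vertex *)
alpha = 0.85;
n = Length[combined];
boredJumps = (ConstantArray[1., {n, n}] - IdentityMatrix[n])/(n - 1);
transition = #/Total[#] & /@ (alpha combined + (1 - alpha) boredJumps);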
I set the combination factor α to 0.85 because it seems to be a common setting in such systems. The behavior of my movie buff will be modeled thanks to the functions DiscreteMarkovProcess, which represents a random walk process, and RandomFunction, which allows simulation of such processes. For example, here are 10 possible sequences after watching 3 Idiots:
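A sketch of the simulation; state indices follow the Join[titles, genres] ordering used above:

(* Ten random viewing sequences of five steps each, starting from "3 Idiots" *)
start = Position[Join[titles, genres], "3 Idiots"][[1, 1]];
walker = DiscreteMarkovProcess[start, transition];
paths = RandomFunction[walker, {0, 5}, 10]["Paths"][[All, All, 2]];
Join[titles, genres][[Round[#]]] & /@ paths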
As a final application, what is the best way to see all the movies? We are looking for the optimal sequence that would visit all the nodes corresponding to movies exactly once and the nodes corresponding to genres an arbitrary number of times. This is essentially a modification of the traveling salesman problem. There is no immediate solution for this modified problem in Mathematica. Instead, I am going to start from the visual graph and fill up each of its missing edges. Each pair of movies that is too distinct visually will be assigned a distance estimated by finding the length of the shortest path between the two movies in the visual and semantic graph.
The actual code is a bit more complicated than this description, because I must take care of scaling the visual and semantic distances in a sensible fashion. Edges connecting movies to genres are weighted such that a jump through the semantic graph is given a weight of at least the maximum visual distance.
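A simplified sketch of the filling step; it reuses the graph distance matrix from above and skips the careful rescaling described in the text:

(* Movie-to-movie distances: keep the visual distance when it is below the threshold,
   otherwise fall back to the shortest-path distance through the visual and semantic graph *)
completed = Table[
   If[0 < distanceMatrix[[i, j]] <= threshold, distanceMatrix[[i, j]], gdm[[i, j]]],
   {i, nM}, {j, nM}];
completed = (completed + Transpose[completed])/2;  (* symmetrize for the tour search *)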
The optimal sequence is given by the function FindShortestTour:
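For instance (a sketch; FindShortestTour returns a closed tour, which can then be cut open into a viewing sequence):

{tourLength, order} = FindShortestTour[Range[nM], DistanceFunction -> (completed[[#1, #2]] &)];
titles[[order]]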
And here it is, from The Shawshank Redemption to Memento, the optimal sequence in which to see IMDb’s top-rated movies (the 240 that survived curation) based on visual and semantic similarity:
From an algorithmic perspective, and maybe for the marathon viewer’s pleasure, it would be interesting in future work to include additional constraints so that movie franchises like The Lord of the Rings are seen in the expected order. Download a companion notebook and start your work on these topics right away.
It was very enjoyable experimenting with these data and models. I was also able to discover new application domains through Mathematica’s documentation, which contains extensive information on the functionality used here.
Nevertheless, looking back, such a program is no match for the selections made by the art theater across the street or for my dear friends’ suggestions.
Download this post as a Computable Document Format (CDF) file.
Some IMDb and math mix.
People usually watch movies they already know about, so what they watch is biased toward what they know. And what they like is biased toward that as well.
One way to fix that is to pick a random movie from the set of all movies and watch it.
This isn’t biased toward the movies you know exist (you don’t know all the movies on IMDb), it isn’t biased by how much time you have to discover things, and it isn’t biased by circumstance (you might only ever discover some niche fetish movie if you happened to be on a site about that fetish), and so on…
So a way to fix those problems would be to watch random movies from IMDb.
But the thing with this way of discovering movies is that when you hit, you hit hard; most of the time, though, you will miss and not get good movies. Your movie taste is just one in a world of many different movie tastes.
A way to solve that is to check a movie’s info before you watch it and only watch the ones you think will be good. But sometimes you think a movie will be bad and it turns out to be good. How do you fix this problem?
Math can help; here is the entire process (a rough sketch of this loop appears after the list):
1. Find a way to get a random IMDb movie.
2. Get a random movie and watch it (watch the first couple of random ones you can find).
3. Keep a count of the movies you were forced to watch (X) and of how many of those forced movies you liked (Y).
4. Compute 100/((X+1)/(Y+1)).
4.1. If you hated the first movie, the value will be 50; that is 100/((1+1)/(0+1)).
5. Get a new random movie.
6. Check whether you think you’ll like it by looking at its info.
7. If you like the look of it, watch it.
8. Remember the value from step 4? If you didn’t like the look of it, that value is the percent chance that you are forced to watch it anyway. Roll a random number to decide.
9. Go back to step 3.
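As promised above, here is a rough sketch of that loop, taking the formula at face value; randomIMDbTitle, movieLooksGood, and watchMovie are hypothetical helpers, not real functions:

(* Forced-watch chance in percent, per step 4: 100/((X+1)/(Y+1)) *)
forcedChance[x_, y_] := 100./((x + 1)/(y + 1));

x = 1; y = 0;            (* step 3: one forced movie watched, none liked *)
forcedChance[x, y]       (* 50., matching step 4.1 *)

next = randomIMDbTitle[];                         (* hypothetical helper: step 5 *)
If[movieLooksGood[next],                          (* hypothetical helper: steps 6-7 *)
  watchMovie[next],
  If[RandomReal[{0, 100}] < forcedChance[x, y],   (* step 8: roll a random number *)
    watchMovie[next]; x++]]
(* then update y if you liked a forced movie, and go back to step 3 *)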
Thanks for the kind words. You are asking an interesting question. The concept of graph centrality is what you are after, though it can’t answer your question directly. Mathematica features various measures of centrality, including ClosenessCentrality (http://reference.wolfram.com/mathematica/ref/ClosenessCentrality.html). In the last graph, the one I used to compute the shortest tour, the most central nodes are “Memento”, “Rear Window”, “Full Metal Jacket”, “A Clockwork Orange”, and “The Lion King.”
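For instance, closeness centrality could be inspected like this (finalGraph is a placeholder for that last graph; a sketch, not the code used for the post):

(* The five vertices with the highest closeness centrality *)
VertexList[finalGraph][[Ordering[ClosenessCentrality[finalGraph], -5]]]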