Automotive Reliability in the Wolfram Language
This post originally appeared on Wolfram Community, where the conversation about reliable cars continues. Be sure to check out that conversation and more—we can’t wait to see what you come up with!
For the past couple of years, I’ve spent my free time collecting and analyzing data from used car auctions with an automotive journalist named Steve Lang, to get an idea of what the used car market looks like in terms of long-term vehicle reliability. I figured it was about time I showed off some of the ways the Wolfram Language has allowed us to parse information on over one million vehicles (and counting).
I’ll start off by saying that there isn’t anything terribly elaborate about the process we’re using to collect and analyze the information on these vehicles. It’s mostly a matter of reading in reports from our data provider, cleaning up the data, and then cross-referencing it with various automotive APIs to get additional information. The results get dumped into a database that we use for our analysis. Having all of the tools we need built into the Wolfram Language means the entire operation can be scripted, which greatly streamlines the process. I’ll have to skip over some of the details or this will be a very long post, but I’ll try to cover most of the key elements.
The data comes from a third-party provider that manages used car auctions around the country, and it arrives once a week as a text file report. (Unfortunately, our licensing agreement doesn’t allow me to share the data right now.) It’s not very computable at first:
Fortunately, parsing this sort of log-like data into individual records is easy in the Wolfram Language using basic string patterns:
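The original Wolfram Language code is not reproduced in this text. As a rough illustration of the same idea, here is a minimal Python sketch that pulls records out of a log-like report with a regular expression. The report layout, the field names and the sample rows are all invented for the example, since the real feed is not shown.

```python
import re

# Hypothetical report layout (the real feed is not shown in the post):
# one vehicle per line, columns separated by runs of two or more spaces.
SAMPLE_REPORT = """\
WEEKLY AUCTION REPORT
1HGCM82633A004352  2003 HONDA ACCORD  112,450 mi  TRANS SLIPS
1M8GDM9AXKP042788  1989 MCI COACH     301,200 mi  NO ISSUES
-- end of report --
"""

# A record line starts with a 17-character VIN (VINs never use I, O or Q).
RECORD = re.compile(
    r"^(?P<vin>[A-HJ-NPR-Z0-9]{17}) {2,}"
    r"(?P<year>\d{4}) +(?P<make>\S+) +(?P<model>.+?) {2,}"
    r"(?P<mileage>[\d,]+) *mi {2,}"
    r"(?P<notes>.*)$",
    re.MULTILINE,
)

def parse_report(text):
    """Return one dict of string fields per line that matches the record pattern."""
    return [match.groupdict() for match in RECORD.finditer(text)]

records = parse_report(SAMPLE_REPORT)
for record in records:
    print(record)
```

Header and footer lines simply fail to match, so they drop out without any special handling.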
Then it’s mostly a matter of cleaning up the individual records into something more standardized (I’ll spare you the hacky details needed to work around artifacts in the data feed). You’ll end up with something like the following:
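The cleaned-up record from the original post is not shown here. To give a feel for what this kind of standardization involves, here is a small Python sketch. The field names and the rule for flagging a transmission issue are assumptions made for the example, not our actual cleanup logic.

```python
def clean_record(raw):
    """Convert the string fields of a parsed record into typed, normalized values."""
    notes = raw["notes"].strip().upper()
    return {
        "vin": raw["vin"].strip().upper(),
        "year": int(raw["year"]),
        "make": raw["make"].strip().title(),
        "model": raw["model"].strip().title(),
        # Mileage arrives as text with thousands separators, e.g. "112,450".
        "mileage": int(raw["mileage"].replace(",", "")),
        # Hypothetical rule: flag a transmission issue when the notes mention "TRANS".
        "transmission_issue": "TRANS" in notes,
    }

# Invented raw record with the kind of inconsistencies a text feed tends to have.
raw = {"vin": "1hgcm82633a004352", "year": "2003", "make": "HONDA",
       "model": "ACCORD ", "mileage": "112,450", "notes": "trans slips"}
cleaned = clean_record(raw)
print(cleaned)
```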
From there, we use the handy Edmunds vehicle API to get more information on an individual vehicle using their VIN decoder:
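The API call itself is not reproduced in this text, and I won't guess at its endpoint or parameters. One related step that can be sketched without any API is a local sanity check on the VIN before sending it to a decoder. The Python below implements the standard North American check-digit rule (position 9 of the VIN). It is an illustration rather than part of our pipeline, and VINs from markets that don't require the check digit can legitimately fail it.

```python
# Letter values used by the North American VIN check-digit scheme.
# The letters I, O and Q never appear in a VIN.
TRANSLITERATION = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
# One weight per VIN position; position 9 (the check digit itself) has weight 0.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

def vin_check_digit_ok(vin):
    """Return True if the 9th character matches the computed check digit."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        return False
    total = 0
    for char, weight in zip(vin, WEIGHTS):
        if char in "0123456789":
            value = int(char)
        elif char in TRANSLITERATION:
            value = TRANSLITERATION[char]
        else:
            return False  # I, O, Q or any other invalid character
        total += value * weight
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected

print(vin_check_digit_ok("1M8GDM9AXKP042788"))
print(vin_check_digit_ok("1M8GDM9A1KP042788"))
```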
We then insert the records into an HSQL database (conveniently included with Mathematica), resulting in an easy way to search for the records we want:
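The database code from the original post is not reproduced here. The sketch below shows the same store-then-query pattern using Python's built-in sqlite3 module as a stand-in for HSQL. The table layout, column names and rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the HSQL database in the post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vehicles (
        vin TEXT PRIMARY KEY,
        year INTEGER,
        make TEXT,
        model TEXT,
        mileage INTEGER,
        transmission_issue INTEGER
    )
""")

# Made-up rows; the VINs are placeholders, not real vehicles.
rows = [
    ("VIN00000000000001", 2003, "Honda", "Accord", 112450, 1),
    ("VIN00000000000002", 2003, "Honda", "Accord", 98000, 0),
    ("VIN00000000000003", 2005, "Ford", "Focus", 150300, 1),
    ("VIN00000000000004", 2004, "Honda", "Civic", 201000, 1),
]
conn.executemany("INSERT INTO vehicles VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Parameterized query: Hondas with a recorded transmission issue, lowest mileage first.
matches = conn.execute(
    "SELECT vin, mileage FROM vehicles "
    "WHERE make = ? AND transmission_issue = 1 ORDER BY mileage",
    ("Honda",),
).fetchall()
print(matches)
```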
From there, we can take a quick look at metrics using larger datasets, such as the number of transmission issues for a given set of vehicles for different model years:
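The original plot is not included in this text. The counting behind a chart like this can be sketched in a few lines of Python. The records are made up, and reporting a rate per model year alongside the raw count is my own choice for the example.

```python
from collections import Counter

# Made-up records: (model year, has a transmission issue).
vehicles = [
    (2003, True), (2003, False), (2003, True),
    (2004, False), (2004, True),
    (2005, False),
]

# Count all vehicles and vehicles with an issue, per model year.
totals = Counter(year for year, _ in vehicles)
issues = Counter(year for year, has_issue in vehicles if has_issue)

# Share of vehicles with an issue in each model year.
issue_rate = {year: issues[year] / totals[year] for year in sorted(totals)}

for year, rate in issue_rate.items():
    print(year, issues[year], f"{rate:.0%}")
```

Dividing by the number of vehicles seen in each model year keeps a popular year from looking worse just because more of those cars pass through the auctions.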
Or a histogram of those issues broken down by vehicle mileage:
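The histogram itself is not reproduced here. As a sketch of the binning involved, the Python below groups the mileage at which an issue was recorded into fixed-width bins and prints a text histogram. The 25,000-mile bin width and the mileage values are assumptions for the example.

```python
from collections import Counter

BIN_WIDTH = 25_000  # assumed bin width in miles

# Made-up mileages at which a transmission issue was recorded.
issue_mileages = [48_200, 61_000, 74_999, 75_000, 112_450, 118_300, 201_000]

def mileage_bin(mileage, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains this mileage."""
    return (mileage // width) * width

histogram = Counter(mileage_bin(m) for m in issue_mileages)

# Print one row per non-empty bin, lowest mileage first.
for start in sorted(histogram):
    end = start + BIN_WIDTH - 1
    print(f"{start:>7,}-{end:>7,} mi  {'#' * histogram[start]}")
```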
It also lets us look at industry-wide trends, so we can develop a baseline for what the expected rate of defects for an average vehicle (or vehicle of a certain class) should be:
We can then compare a given vehicle to that model:
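The charts from the original post are not included in this text. A simple way to express this kind of comparison is a ratio of the model's defect rate to the industry baseline for the same model year. The Python below is a sketch with made-up counts, and the ratio is an assumed measure for illustration, not necessarily the one we use.

```python
def defect_rate(defects, total):
    """Fraction of vehicles with a defect; 0.0 when no vehicles were seen."""
    return defects / total if total else 0.0

# Made-up counts per model year: (vehicles with a defect, vehicles seen).
industry = {2003: (300, 1000), 2004: (240, 1000), 2005: (180, 1000)}
model = {2003: (45, 100), 2004: (24, 100), 2005: (9, 100)}

# Industry-wide baseline rate for each model year.
baseline = {year: defect_rate(*counts) for year, counts in industry.items()}

# Ratio above 1 means the model is worse than the baseline, below 1 means better.
# This assumes every baseline rate is nonzero.
relative = {year: defect_rate(*model[year]) / baseline[year] for year in model}

for year in sorted(relative):
    print(year, round(relative[year], 2))
```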
We then use that model, as well as other information, to generate a statistical index. We use that index to give vehicles an overall quality rating based on their historical reliability, which ranges from a score of 0 (chronic reliability issues) to 100 (exceptional reliability), with the industry average hovering right around 50:
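The formula behind the index is not spelled out in this post, so the following is only one plausible way to build a 0 to 100 score centered on 50, not our actual method. The sketch standardizes a model's defect rate against the spread of rates across all models and maps it through a normal distribution, so an average rate scores 50 and lower defect rates score higher. The model names and rates are invented.

```python
from statistics import NormalDist, mean, stdev

# Made-up defect rates per model.
rates = {"Model A": 0.10, "Model B": 0.20, "Model C": 0.30}

# Assumed approach: treat defect rates as roughly normally distributed across models.
values = list(rates.values())
distribution = NormalDist(mean(values), stdev(values))

def quality_score(rate):
    """Map a defect rate to 0-100, where the average rate scores 50."""
    return 100 * (1 - distribution.cdf(rate))

for name, rate in rates.items():
    print(name, round(quality_score(rate), 1))
```

A percentile-style mapping like this keeps the score bounded and makes 50 mean "typical" by construction, but it says nothing about how the real index weights mileage, age or defect type.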
We also use various gauges to put together informative visualizations of defect rates and the overall quality:
There is a lot more we do to pull all of this together, like the Wolfram Language templating we use to generate the HTML pages and reports. Honestly, there is a whole lot more we could do: my background in statistics is limited, so most of this is fairly rudimentary, and I’m sure others here already have ideas for improving how some of this data is presented. If you’d like to take a look at the site, it’s freely available (Steve has a nice introduction to the site here, and he also writes articles for the page related to practical uses for our findings).
Our original site, the Long-Term Quality Index, is still live, but it showed off my lack of experience in HTML development. We recently rolled out a newer, WordPress-based venture, Dashboard Light, which also includes insights from our auto journalist on his experiences running an independent used car dealership.
This is essentially a two-man project that Steve and I handle in our (limited) free time, and we’re still getting a handle on presenting the data in a useful way, so if anyone has any suggestions or questions about our methodology, feel free to reach out to us.
Cheers!
Continue the conversation at Wolfram Community.
Can you post code for your yearly defect ratio plot?
I just wanted to say thanks for your hard work on this useful and interesting project. One suggestion that I have is for you to refine the analysis to include analysis by model generation. For example, I’m in the market for a gently used late model small suv and was considering the Jeep Cherokee because I like the lines, but learned there were significant power train issues in the 2014-2016 model years. I was surprised by your data showing exceptional reliability for the Cherokee, but it appears that earlier generation Cherokees were carrying the day. Well, just my two cents. Thanks again for your efforts and all the useful information provided.
Best,
Eric