Nutrients by the Numbers: Food and Nutrition Statistics with Wolfram Language
Statistical analysis is an important tool in food science. It can uncover patterns and relationships in food and nutrition data, leading to advances in food manufacturing, nutrition counseling, food safety and new product development. Wolfram Language offers built-in functions for all standard statistical distributions. Here, we’ll use some of these functions to evaluate relationships between nutrients and visualize the data distributions with informative plots and histograms.
Interpreter for Food Entities
Use Interpreter to gather and group the entities for the foods you want to explore. The “yellow box” entities contain the nutritional data for each food type:
T-Tests for Zinc and Folate
A t-test is a statistical tool used to answer the question “Is the difference in the averages (means) of two groups statistically significant, or are the means different due to random chance?” Let’s use the TTest function to determine if the zinc and folate in berries are significantly different from the zinc and folate in green vegetables.
Berries and green vegetables are not significant sources of zinc, but we can use statistics to evaluate and compare trace amounts of this vital nutrient. Start with the null hypothesis that there’s no meaningful difference between berries and green vegetables in terms of their zinc content. Next, obtain the zinc amounts for each of the food types in both groups. The t-test does not require the sample lengths to be equal. Get only the values, not the units, using the QuantityMagnitude function:
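As a minimal sketch of the unit-stripping step, the input below applies QuantityMagnitude to a short list of quantities. The name berriesZincQ and the three values are placeholders made up for illustration, not values from the Wolfram Knowledgebase:

```wolfram
(* Placeholder zinc quantities (hypothetical numbers, per 100 g of food) *)
berriesZincQ = {
   Quantity[0.16, "Milligrams"],
   Quantity[0.42, "Milligrams"],
   Quantity[0.53, "Milligrams"]};

(* Keep only the numeric magnitudes so the statistics functions get plain numbers *)
berriesZinc = QuantityMagnitude /@ berriesZincQ
```

The result is a plain list of numbers that can be passed directly to functions such as Mean or TTest.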
What is the average (mean) zinc content for each group?
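One way to get both means at once is to map Mean over the two lists. The lists here are hypothetical placeholder values (deliberately of different lengths), and the names berriesZinc and greensZinc are assumptions made for this sketch:

```wolfram
(* Placeholder zinc values in mg; NOT real nutrition data *)
berriesZinc = {0.16, 0.42, 0.53, 0.14, 0.23};
greensZinc = {0.53, 0.41, 0.76, 0.18, 0.56, 0.44};

(* Mean of each group, returned as {berries mean, greens mean} *)
means = Mean /@ {berriesZinc, greensZinc}
```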
The t-test does require normal distribution of the data. The TTest function automatically tests for normal distribution, but you can check it yourself using the DistributionFitTest function. This function returns a p-value, which is the probability of seeing data at least as extreme as the sample if the null hypothesis is true. The default null hypothesis for DistributionFitTest is that the data comes from a normal distribution:
We will use the common significance level α of 0.05, or 5%, to decide whether to reject the null hypothesis. Because both of these p-values from DistributionFitTest are greater than 0.05, we fail to reject the null hypothesis: the zinc data for berries and green vegetables is consistent with a normal distribution. Therefore, the t-test is appropriate to use:
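A hedged sketch of the normality check, using a made-up sample (the name sample and its values are placeholders). With a single argument, DistributionFitTest tests against a normal distribution and returns a p-value:

```wolfram
(* Placeholder sample; NOT real nutrition data *)
sample = {0.53, 0.41, 0.76, 0.18, 0.56, 0.44, 0.62, 0.35};

(* p-value for the null hypothesis that the sample is normally distributed *)
pNormal = DistributionFitTest[sample]

(* Compare against the 5% significance level *)
pNormal > 0.05
```

A result of True in the last line means the sample gives no grounds to reject normality at the 5% level.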
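As an illustration, a two-sample t-test takes the two lists wrapped in one outer list and returns a p-value. The data and variable names below are placeholders assumed for this sketch, so the p-value will not match the one discussed in the text:

```wolfram
(* Placeholder zinc values in mg; NOT real nutrition data *)
berriesZinc = {0.16, 0.42, 0.53, 0.14, 0.23};
greensZinc = {0.53, 0.41, 0.76, 0.18, 0.56, 0.44};

(* Null hypothesis: the two group means are equal *)
pZinc = TTest[{berriesZinc, greensZinc}]
```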
The p-value from the t-test is less than 0.05. Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the average zinc content of berries versus green vegetables. Easily visualize this difference using PairedSmoothHistogram:
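A rough sketch of the paired plot, again with placeholder data (the names and values are assumptions, not Knowledgebase values). PairedSmoothHistogram draws the smoothed distribution of the first list on one side of a shared axis and the second list on the other:

```wolfram
(* Placeholder zinc values in mg; NOT real nutrition data *)
berriesZinc = {0.16, 0.42, 0.53, 0.14, 0.23};
greensZinc = {0.53, 0.41, 0.76, 0.18, 0.56, 0.44};

(* Back-to-back smooth histograms of the two groups *)
zincPlot = PairedSmoothHistogram[berriesZinc, greensZinc]
```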
Next, we examine the difference in average folate content:
As with zinc, the t-test p-value is below 0.05, so we can reject the null hypothesis and conclude that the folate difference between berries and green vegetables is statistically significant. Wolfram Language provides both full and shortened conclusions of the test:
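The conclusions can be requested as properties of the test. In this sketch, the folate lists are hypothetical placeholder values and the second argument 0 is the hypothesized difference between the means:

```wolfram
(* Placeholder folate values in micrograms; NOT real nutrition data *)
berriesFolate = {6., 21., 24., 25., 34.};
greensFolate = {63., 108., 141., 61., 194., 80.};

(* Full-sentence conclusion at the default 5% level *)
TTest[{berriesFolate, greensFolate}, 0, "TestConclusion"]

(* Short form of the same conclusion *)
short = TTest[{berriesFolate, greensFolate}, 0, "ShortTestConclusion"]
```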
A paired histogram illustrates this difference in the two datasets:
Mann–Whitney Test for Iron
There are multiple ways to visualize the distribution of datasets. A number line plot is a compact way to compare the distribution of two datasets:
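For example, passing a list of two datasets to NumberLinePlot places each dataset on its own row above a shared number line. The iron values and names below are placeholders assumed for illustration:

```wolfram
(* Placeholder iron values in mg; NOT real nutrition data *)
berriesIron = {0.28, 0.41, 0.62, 0.69, 0.36};
greensIron = {0.73, 2.71, 1.47, 0.47, 0.86, 1.03};

(* Each dataset is drawn as a row of points over the same axis *)
ironLine = NumberLinePlot[{berriesIron, greensIron}]
```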
Scatter plots and bar charts are also effective visuals, with multiple options to customize the charts:
A related plot is a box-and-whisker chart. The box represents the middle 50% of the data values; the white line in the box represents the median. The vertical lines are the whiskers, which show the range of values, excluding any outliers (there is an option to include the outliers in the chart):
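A minimal sketch of the box-and-whisker chart with outliers shown, using placeholder data (names and values are assumptions). The "Outliers" specification asks BoxWhiskerChart to mark outlying points individually:

```wolfram
(* Placeholder iron values in mg; NOT real nutrition data *)
berriesIron = {0.28, 0.41, 0.62, 0.69, 0.36};
greensIron = {0.73, 2.71, 1.47, 0.47, 0.86, 1.03};

(* One box per dataset, with outliers drawn as separate markers *)
ironBoxes = BoxWhiskerChart[{berriesIron, greensIron}, "Outliers",
  ChartLabels -> {"berries", "green vegetables"}]
```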
Let’s evaluate the average iron difference for berries versus green vegetables by first checking for normal distribution:
The green vegetables iron data has a p-value below 0.05, so we reject the hypothesis that it is normally distributed. When the sample data is skewed rather than normally distributed, you can use the Mann–Whitney U test to determine whether two population distributions have roughly the same shape and location. It is a nonparametric test, so unlike the t-test, it does not require normally distributed data:
The resulting p-value is slightly greater than our chosen significance level α of 5%. Therefore, we fail to reject the null hypothesis: the data does not show a statistically significant difference in the iron content of berries versus green vegetables. A smooth histogram is a good way to view the overlap between the two datasets:
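The call has the same shape as TTest. This sketch uses placeholder iron values (hypothetical names and numbers), so its p-value will differ from the one discussed in the text:

```wolfram
(* Placeholder iron values in mg; NOT real nutrition data *)
berriesIron = {0.28, 0.41, 0.62, 0.69, 0.36};
greensIron = {0.73, 2.71, 1.47, 0.47, 0.86, 1.03};

(* Nonparametric comparison of the two samples; returns a p-value *)
pIron = MannWhitneyTest[{berriesIron, greensIron}]
```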
Use the TrimmedMean function to remove data outliers that may be skewing a result. In this example, we trim the outlying 10% of data from both ends and obtain a new mean:
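To see the effect of trimming, this sketch compares the ordinary mean and the 10% trimmed mean of a small made-up list with one extreme value. The list values is purely illustrative:

```wolfram
(* Illustrative list with one extreme value at the top end *)
values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};

(* Ordinary mean, pulled upward by the value 100 *)
plain = Mean[values]

(* Drop 10% of the elements from each end (here 1 and 100), then average *)
trimmed = TrimmedMean[values, 0.1]
```

The trimmed mean sits much closer to the bulk of the data because the single extreme value no longer contributes.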
Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) compares the means of three or more groups to determine if there are statistically significant differences among them. Let’s load the Analysis of Variance package and analyze the means for iron content in berries, meats and fish:
This ANOVA test is called a one-way analysis of variance because there is one categorical variable in the data. We have already defined berriesIron. We need iron content for meats and fish:
Like other parametric tests, ANOVA requires a normal distribution of the data:
The ANOVA table includes the means of the samples and the overall mean (grand mean) of all the data. In the following example, the p-value of less than 0.05 indicates that we can reject the null hypothesis and conclude that there is a significant difference among the means for iron content in berries, meats and fish:
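A hedged sketch of a one-way ANOVA with the ANOVA package. The package expects rows of the form {group, value}, so the three lists are first tagged with group numbers 1, 2 and 3. All values and the names meatsIron, fishIron and anovaData are placeholders assumed for illustration:

```wolfram
Needs["ANOVA`"]

(* Placeholder iron values in mg; NOT real nutrition data *)
berriesIron = {0.28, 0.41, 0.62, 0.69};
meatsIron = {2.6, 1.9, 3.1, 2.2};
fishIron = {0.8, 1.3, 0.9, 1.6};

(* Tag every value with its group number: {1, v}, {2, v}, {3, v} *)
anovaData = Join[
   Thread[{1, berriesIron}],
   Thread[{2, meatsIron}],
   Thread[{3, fishIron}]];

(* One-way analysis of variance: ANOVA table plus group means *)
result = ANOVA[anovaData]
```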
ANOVA does not specify which group means are significantly different. After ANOVA, you can use post hoc tests to make pairwise comparisons and determine which groups are statistically different from each other.
Linear correlation is the statistical relationship between two variables in which changes in one variable are associated with proportional changes in another variable. A positive correlation suggests that as one variable increases, the other variable tends to also increase. A negative correlation implies that as one variable increases, the other variable tends to decrease.
Let’s examine the correlation between fat and calories in meats. First, obtain the quantitative data:
Use the Transpose function to pair the fat and calorie values for each type of meat, and then plot the pairs:
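The pairing step can be sketched as follows. The fat and calorie lists are hypothetical placeholder values; Transpose turns two parallel lists into a list of {fat, calories} pairs that ListPlot can draw as points:

```wolfram
(* Placeholder values per 100 g; NOT real nutrition data *)
fat = {3.5, 8.2, 12.0, 15.4, 21.0};
calories = {120, 165, 205, 240, 290};

(* {{fat1, cal1}, {fat2, cal2}, ...} *)
pairs = Transpose[{fat, calories}]

(* Scatter plot of the pairs *)
fatPlot = ListPlot[pairs, AxesLabel -> {"fat (g)", "calories"}]
```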
Because the plot points generally slope upward, we can conclude that the fat and calories in meats are positively correlated. As total fat increases, so do calories. If the points slope generally downward, the variables are negatively correlated. If the points are scattered, with no upward or downward trend, the variables are uncorrelated.
The positive correlation between fat and calories is not surprising, but this process can be replicated to explore a wide range of nutrients. Vitamin C and potassium are vital nutrients in citrus fruits, but are they correlated? They generally are not associated with one another. Is there a hidden statistical correlation?
The list plot shows no apparent correlation between the amounts of vitamin C and potassium in citrus fruits.
Linear regression is another way of modeling relationships between quantitative variables. The goal of linear regression is to find the best-fitting straight line that represents the relationship between the two variables. Let’s use linear regression to model the relationship between saturated fat and monounsaturated fat in meats:
The following input uses the LinearModelFit function to model the relationship using a straight line:
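A minimal sketch of the fit, assuming placeholder fat values (the names satFat, monoFat and model are made up for illustration). The second argument x is the basis function and the third names the variable, which together specify a straight line with an intercept:

```wolfram
(* Placeholder fat values in grams; NOT real nutrition data *)
satFat = {1.2, 2.5, 3.1, 4.8, 6.0};
monoFat = {1.5, 2.9, 3.0, 5.1, 6.6};
pairs = Transpose[{satFat, monoFat}];

(* Fit monoFat as a linear function of satFat *)
model = LinearModelFit[pairs, x, x];

(* {intercept, slope} of the best-fit line *)
params = model["BestFitParameters"]

(* Data points with the fitted line drawn over them *)
Show[ListPlot[pairs], Plot[model[x], {x, 0, 7}]]
```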
Use the Correlation function to get the correlation coefficient, which indicates the strength and direction of the linear relationship between two variables. The coefficient is a number between –1 and 1, where 1 indicates perfect positive correlation and –1 indicates perfect negative correlation. A general guideline is that correlation above 0.5 or below –0.5 is strong correlation, and –0.5 to 0.5 is weak correlation or no correlation:
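For instance, Correlation takes the two lists directly. The fat values below are hypothetical placeholders, and the second call is a sanity check on a perfectly linear pair of lists:

```wolfram
(* Placeholder fat values in grams; NOT real nutrition data *)
satFat = {1.2, 2.5, 3.1, 4.8, 6.0};
monoFat = {1.5, 2.9, 3.0, 5.1, 6.6};

(* Correlation coefficient, always between -1 and 1 *)
r = Correlation[satFat, monoFat]

(* Sanity check: an exactly linear relationship gives 1 *)
rPerfect = Correlation[{1, 2, 3, 4}, {2, 4, 6, 8}]
```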
The correlation coefficient of 0.9 indicates a strong positive correlation between the amount of saturated fat and monounsaturated fat in meats. Easily visualize this relationship with SmoothHistogram3D:
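As a rough sketch, SmoothHistogram3D takes the list of {x, y} pairs and draws a smoothed density surface over them. The data is again a placeholder, and with so few points the surface is only a crude illustration:

```wolfram
(* Placeholder fat values in grams; NOT real nutrition data *)
satFat = {1.2, 2.5, 3.1, 4.8, 6.0, 2.0, 3.9, 5.2};
monoFat = {1.5, 2.9, 3.0, 5.1, 6.6, 2.4, 4.1, 5.0};

(* Smoothed 3D density of the {saturated, monounsaturated} pairs *)
fatSurface = SmoothHistogram3D[Transpose[{satFat, monoFat}]]
```

A ridge running along the diagonal of the surface is what a strong positive correlation looks like in this kind of plot.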
Not all correlations are positive. We can reasonably assume that the correlation between sugar and fiber in breakfast cereals is a negative one—as sugar goes up, fiber goes down. Let’s test if our assumption is correct. First, use Interpreter to get the implicit entity (“yellow box”) for the food type "breakfast cereal". The implicit entity is a compilation of the nutrition data for the 230+ specific breakfast cereals that make up the entity:
Next, request the EntityList of the 230+ breakfast cereals attached to the yellow box. We use the semicolon after EntityList so that the actual (very long) list will be suppressed:
As we did in the previous examples, we get the relative sugar and fiber values for each of the 230+ breakfast cereals, then transform those values into a list of pairs:
Test the correlation:
The correlation coefficient of –0.4 confirms a negative correlation, although it’s somewhat weak. The linear regression “best-fit” model gives the intercept (0.12) and slope (–0.17) of the line:
Learn More at Wolfram U
To learn more about statistical analysis with Wolfram Language, visit Wolfram U and choose from the free, self-paced Wolfram Language statistics courses, which range from basic (elementary algebra) to more advanced (statistical distributions) topics. Other related online courses include: