Data scientists collect, analyze, distribute, and interpret data from a wide variety of sources. To do this well, they need an intimate familiarity with statistics.
When we think about statistics in everyday life, we might think of simple averages – the central value of a set of numbers that’s obtained by adding all the numbers together and dividing by the number of values there are.
But in data science, the use of statistics goes much deeper (and wider) than calculating averages.
Instead, data scientists rely on a variety of statistical methods to analyze and interpret vast amounts of data. Let’s explore some of the ways in which statistical methods are used by data scientists to make sense of data.
But First, What is Statistics?
Before we examine some of the statistics that are commonly used in data science, we first need to establish a working definition of statistics.
In its simplest form, statistics is the collection, analysis, and interpretation of data. Furthermore, statistics gives data scientists a means of communicating what the data means.
Because of this, statistics are an integral component of data science. It’s what allows data scientists to mine incredible amounts of data and present what that data means in easy-to-understand ways.
How are Statistics Used in Data Science?
Statistics is what helps data scientists make sense of large quantities of data – it allows scientists to discover trends or patterns in the data and find meaning in all those numbers. So, the first step is using statistics to analyze the data.
Next, data scientists use statistics to derive meaning from the data they’ve analyzed. That is, statistics can be used to drill down and work with the data in a targeted manner. So, rather than relying only on the big-picture views that tools like graphical representations provide, statistical analysis lets data scientists work directly with the underlying numbers.
It’s important to be able to get this targeted view because the more detailed data scientists can be in their analyses, the more applicable their findings will be in real life.
For example, a data scientist might collect data on the delivery times of a shipping company’s products and how delivery times change from month to month. They might generate a graph or chart that quickly shows that delivery times drastically increase during the holidays, which would make sense given the much larger number of orders that many businesses process during the holiday shopping season.
However, to get a better understanding of the situation, a data scientist might use statistical analyses to identify other unseen patterns in the data that might account for delayed deliveries, such as increased illnesses among delivery drivers in the winter. This is a simplistic example, but it nevertheless illustrates how data science and statistics can reveal different layers of meaning from data that can be valuable to businesses and organizations.
Statistics is also used in data science to mine the information in big data for insights into customers’ buying behavior.
For example, a business owner might complain that some of their customers only buy certain products when they’re on sale. So, a data scientist can analyze all the customer data that the business owner has collected and help identify which customers only buy products when they’re on sale. Furthermore, data scientists can group customers together based on buying habits and examine how changes in the business’ operation (e.g., sales) affect certain groups.
To do so, data scientists might use any number of statistical methods, including dimensionality reduction, clustering, or latent variable analysis.
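As a toy illustration of the clustering idea, the sketch below separates customers into two groups using a minimal one-dimensional k-means loop. The feature (share of purchases made on sale) and all values are hypothetical, and a real analysis would use a library such as scikit-learn rather than hand-rolled code.

```python
import statistics

# Hypothetical feature: fraction of each customer's purchases made on sale.
sale_share = [0.05, 0.10, 0.08, 0.85, 0.90, 0.80, 0.12, 0.95]

# Minimal 1-D k-means with k=2: start centroids at the extremes,
# then alternate between assigning points and recomputing centroids.
c1, c2 = min(sale_share), max(sale_share)
for _ in range(10):
    g1 = [v for v in sale_share if abs(v - c1) <= abs(v - c2)]
    g2 = [v for v in sale_share if abs(v - c1) > abs(v - c2)]
    c1, c2 = statistics.mean(g1), statistics.mean(g2)

# g1 ends up holding the "everyday" buyers, g2 the "sale-only" buyers.
```

The two resulting groups could then be compared to see how a change such as a sale affects each one.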
Another important way data scientists use statistics is in the development of machine learning, a subfield of artificial intelligence. A data scientist might take the numbers found within various sets of statistics and use them within algorithms to aid a company or an organization in the decision-making process.
Data scientists also use statistics in the development of software for the insurance industry.
A data scientist can take the statistics for any variable and add those numbers to other data that has been gathered. While statistics might show that cars that are bright red are more likely to be caught speeding than any other color, data scientists can delve deeper into the statistics to find out whether the age of the driver or the location of the violation might also have something to do with receiving a ticket.
Statistics can also be used by data scientists to help businesses understand what their customers really want.
For example, a data scientist might examine data on customers and their interactions with the business’ online advertisements. By exploring which users interacted with which ads and for what products, data scientists can provide a clearer picture for companies in terms of what products and services are getting the most attention from customers.
In this example, data scientists might use similar statistical methods as mentioned above, including dimensionality reduction, clustering, or latent variable analysis. They might also use predictive modeling or collaborative filtering.
While all of these statistical approaches might seem complex and confusing (and very different from one another), at the end of the day, statistics are used to tell a story, and it’s the job of data scientists to be the storyteller.
It does no good for a company to hire a data scientist if they can’t effectively explain what they’re finding in the data. Statistical methods like those described above (and below) are the storytelling tools that data scientists need to communicate what’s going on with the data and what’s important. From this, businesses and organizations can plot a course that’s based on actionable insights. So in a very real sense, statistics serve as a compass for companies – one that can point them in the direction of their greatest potential.
Types of Statistical Methods Used in Data Science
As noted earlier, there is a wealth of statistical methods that data scientists can use to glean more information from their data. Below are just a few common statistical tools used in data science.
Descriptive statistics enables data scientists to summarize data, organize it and describe it using graphs, tables, and the like. It gives meaning to vast amounts of data.
Descriptive statistics can take many forms. They can be measures of central tendency or measures of frequency. Descriptive statistics can also describe measures of position and measures of dispersion.
For example, measures of central tendency – mean, median, and mode – inform us as to the average value of a data set, the middle number in a data set, and the number that occurs most often in a data set, respectively.
Here’s a simple example: a company wants to know how a new group of recent hires scored on a basic employment test. Providing the mean, median, and mode of the group’s scores allows company executives to quickly see what the scores look like.
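This can be sketched with Python’s standard-library statistics module; the test scores below are hypothetical.

```python
import statistics

# Hypothetical employment-test scores for a group of recent hires
scores = [72, 85, 85, 90, 64, 78, 85, 70]

mean_score = statistics.mean(scores)      # central value: 78.625
median_score = statistics.median(scores)  # middle value: 81.5
mode_score = statistics.mode(scores)      # most frequent value: 85
```

Together, these three numbers give executives a quick summary of where the scores sit and how a typical hire performed.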
As another example, data scientists can use measures of dispersion (e.g., range, variance, standard deviation, and skew) to show how data is spread out. Showing a group of stakeholders the range of a data set is an easy way to communicate what the lowest and highest values are in a data set. Showing the standard deviation of a data set in a graph is informative because it measures how dispersed the data is in comparison to the mean.
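A quick sketch of those dispersion measures, again using the standard library and made-up delivery-time figures:

```python
import statistics

# Hypothetical delivery times (in hours); values are illustrative only
times = [12, 15, 9, 20, 14, 11, 18]

spread = max(times) - min(times)       # range: highest minus lowest value
variance = statistics.variance(times)  # sample variance
std_dev = statistics.stdev(times)      # typical distance from the mean
```

The range (11 hours here) communicates the extremes at a glance, while the standard deviation summarizes how tightly the values cluster around the mean.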
Again, the whole point with descriptive statistics is to present data in a more meaningful way, in this case, through specific statistical measures that can then be plotted in a graph or chart.
Where descriptive statistics are used to summarize a data set’s characteristics, inferential statistics are used to determine whether data is representative of a larger population.
In other words, once data has been collected and analyzed, data scientists make generalizations from a representative data sample about a larger population.
For example, if a data scientist wants to know what the average salary is for data scientists in the United States, they would collect income data on a random sample of data scientists and then generalize from that sample to the larger population of data scientists in the country.
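One common inferential tool for exactly this situation is a confidence interval around the sample mean. The sketch below computes a 95% interval using the normal approximation; the salary figures are hypothetical.

```python
import statistics
from statistics import NormalDist

# Hypothetical salaries (thousands of USD) from a random sample of data scientists
sample = [110, 125, 98, 140, 132, 105, 118, 121, 115, 128]

n = len(sample)
sample_mean = statistics.mean(sample)
sem = statistics.stdev(sample) / n ** 0.5  # standard error of the mean
z = NormalDist().inv_cdf(0.975)            # ~1.96 for a 95% interval
low, high = sample_mean - z * sem, sample_mean + z * sem
```

The interval (low, high) expresses the inference: based on this sample, the population mean plausibly lies within that band. (With a sample this small, a t-interval would be more appropriate in practice.)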
One area many data scientists focus on is Bayesian thinking, which utilizes probabilities that are placed on parameters. In a nutshell, what we currently know is expressed in a probability distribution called the prior distribution. When new information is discovered, that information is denoted as a “likelihood,” which is combined with prior knowledge to generate an updated distribution called the posterior distribution.
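The prior-likelihood-posterior update can be shown with the textbook Beta-binomial example, where the arithmetic is simple enough to do by hand. All the numbers below are illustrative assumptions, not figures from any real business.

```python
# Conjugate Beta-binomial update: prior Beta(a, b), then k successes in n trials.
a, b = 2, 2            # prior distribution: initial belief about a rate
k, n = 7, 10           # likelihood: 7 successes observed in 10 new trials

# Combining prior and likelihood yields the posterior distribution Beta(9, 5)
a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)  # updated estimate: 9/14 ≈ 0.643
```

Each new batch of data repeats the same step: yesterday’s posterior becomes today’s prior, which is what makes the approach naturally incremental.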
In other words, a data scientist might use this type of statistics to help businesses and organizations adapt their business approach and business model to changing times. As more and more new information is gathered, new directions can be taken and businesses can become more flexible and responsive to the needs of their customers. This type of statistics is actually a key function of machine learning as used within various business models.
Data scientists use probability distributions to more clearly define the chances that a specific event will occur.
Probability is measured on a scale of 0 to 1, with 0 meaning that there is no chance that the event will occur and 1 meaning that the event will certainly occur. So, a data scientist might use a probability distribution to express that there is a 25 percent chance that a customer returns to their online shopping cart to complete their purchase. Of course, this means there’s a far greater chance that the customer would not return to their cart.
The most familiar probability distribution is the normal, or bell, curve. Probability distributions are commonly characterized by statistics such as the mean, standard deviation, skewness, and kurtosis, which describe their center, spread, and shape.
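Python’s standard library includes a normal distribution object that makes these probability calculations concrete; the mean and standard deviation below are assumed values for illustration.

```python
from statistics import NormalDist

# Normal distribution with an assumed mean of 100 and standard deviation of 15
dist = NormalDist(mu=100, sigma=15)

p_below = dist.cdf(85)                      # P(X < 85), about 0.16
p_one_sigma = dist.cdf(115) - dist.cdf(85)  # P(85 < X < 115), about 0.68
```

The second result recovers the familiar rule of thumb that roughly 68 percent of normally distributed values fall within one standard deviation of the mean.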
Other popular statistical tools in data science include:
- Over- and under-sampling – This can be used when the groups or classes in a dataset are unevenly represented. To create a more even balance, data scientists might oversample (duplicate or synthesize records from) the smaller group or under-sample the larger group to get a more equal distribution between the two.
- Dimension reduction – This is used to reduce the number of random variables under consideration and is done via feature selection and feature extraction. Feature selection involves choosing a smaller subset of the most relevant original features, while feature extraction involves creating new features as functions of the original ones. Usually, this method is used to simplify data models before building algorithms.
- Measures of association – Two popular types are chi-square tests and correlations, both of which tell data scientists the extent to which two variables are related, if they are related at all.
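As one example of a measure of association, the Pearson correlation coefficient can be computed by hand from two small lists of hypothetical values:

```python
import statistics

# Two hypothetical variables, e.g. ad views (x) and purchases (y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = statistics.mean(x), statistics.mean(y)
# Sample covariance: average product of paired deviations from the means
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
# Pearson's r: covariance scaled by both standard deviations, so -1 <= r <= 1
r = cov / (statistics.stdev(x) * statistics.stdev(y))  # ≈ 0.77 here
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.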
Are Data Science and Statistics the Same?
There appears to be some skepticism as to whether data science is all that different from statistics.
In fact, some experts believe that statisticians have simply created the term “data science” to give the profession a more contemporary-sounding marketing appeal. Either way, there is little doubt statistics forms an integral part of data science.
One key difference between data science and statistics, though, is that statistics is geared towards predicting probability, while data science focuses on gathering all available data and presenting it to organizations and individuals.
What’s more, statistics are just one tool in a data scientist’s toolbox, though as discussed earlier, they do have a variety of applications.
The point is that data science and statistics are not the same. Yes, there are many ways in which these disciplines overlap, but as shown above, statistics are just part of the data science approach.
Drawbacks of Statistics in Data Science
One of the drawbacks for data scientists today is that they can rarely gather and work with statistical data on their own, without relying on big data tools.
Due to the incredible amount of data that’s available, data scientists are forced to rely upon software programs to give them the needed information. As Forbes has pointed out, however, this also means a great many data scientists are unable to actually “look under the hood” of the algorithmic programs they are using.
While statistics might no longer be the only tool that data scientists have available in today’s technology marketplace, they still play an important role in what data scientists do.
Statistics can aid in the development of new software, help companies devise new marketing techniques, and can assist in creating a more effective business plan for a new business. This wide applicability makes statistical methods and analyses a critical area of knowledge for data scientists.