Data visualizations are a crucial part of any data science project. We use them to confirm relationships in the data, discover new ones, identify outliers, and communicate a message about the data. Many visualizations share similar attributes and can communicate the same message, so data scientists can lean on the particular techniques that work best for them. As I develop visualizations for my own data science projects, there are a few methods I keep returning to at the start of a project to help identify relationships in the data. The early-stage techniques I like to use are scatter plots, histograms, scatter matrices, and heat maps.
One of the first visualizations you will learn to use is the scatter plot, which is used to examine the relationship between two variables. A scatter plot is a graph in which two variables are plotted in two dimensions (x axis and y axis), with each observation represented by a dot. When you plot all the observations of two variables from a large data set in one graph, you can start to see whether there is a relationship between the variables, linear or otherwise. A scatter plot is a good starting point for determining what kind of relationship, if any, exists between variables, and it can help you decide what kind of modeling methodology to pursue.
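To make this concrete, here is a minimal sketch of a scatter plot in Python. It assumes matplotlib and NumPy are available, which is my choice for illustration and not something the techniques depend on. The data is synthetic, and the names `x` and `y` are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration): y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(0, 2.0, size=200)

# One dot per observation: x on the horizontal axis, y on the vertical axis.
fig, ax = plt.subplots()
ax.scatter(x, y, s=12, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot of y against x")
fig.savefig("scatter.png")

# A numeric companion to the picture: the Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A cloud of dots that rises from left to right, as it should here, suggests a positive linear relationship. A curved or fan-shaped cloud would be a hint to look beyond a plain linear model.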
Histograms are another great basic visualization, used to plot a single variable. A histogram is a two-dimensional graph, similar to a bar chart, in which each bar represents how frequently a value of the variable occurs in the data set. Each bar can represent a single value of the variable, or a bin (a range of values) if the variable is continuous. Histograms show us the distribution of the variable’s values and can help us determine whether the data follows a roughly normal or a skewed distribution. Once we understand the distribution, we can start to estimate the probability of certain values occurring in the data set. It is a great start for deciding which variables to use in a predictive model.
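As a rough sketch of the idea, the example below draws histograms for two synthetic variables, one roughly normal and one right-skewed, again assuming matplotlib and NumPy. The helper `sample_skewness` is a hypothetical function written for this example, not a library call, and the bin count of 30 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
symmetric = rng.normal(50, 10, size=1000)  # roughly normal
skewed = rng.exponential(10, size=1000)    # long right tail

# Each bar counts how many observations fall into one bin of values.
fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
counts_sym, edges_sym, _ = axes[0].hist(symmetric, bins=30)
counts_skew, edges_skew, _ = axes[1].hist(skewed, bins=30)
axes[0].set_title("Roughly normal")
axes[1].set_title("Right-skewed")
for ax in axes:
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
fig.tight_layout()
fig.savefig("histograms.png")


def sample_skewness(values):
    """Third standardized moment: near 0 for symmetric data, positive for a right tail."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print(f"skewness (normal sample):      {sample_skewness(symmetric):.2f}")
print(f"skewness (exponential sample): {sample_skewness(skewed):.2f}")
```

The skewness number is only there to back up what the picture shows: the first histogram should look like a bell, while the second piles up on the left and trails off to the right.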
A scatter matrix is essentially a combination of the scatter plot and the histogram: you select multiple variables, and a pair-wise matrix is built with a scatter plot for each pair of those variables. In the diagonal cells, which would compare a variable to itself, a histogram of that variable is drawn instead. A scatter matrix can tell us whether variables are correlated and whether that correlation is positive or negative (via the scatter plots), and what the distribution of values is for each variable (via the histograms). This technique is a great way to compare the relationships among all the variables in a data set in one shot. The scatter matrix can also be used to spot collinearity among predictor variables when building a linear regression model, and to look for signs of heteroskedasticity, such as a spread in the target variable that widens or narrows as a predictor increases.
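One way to sketch a scatter matrix is with `scatter_matrix` from `pandas.plotting`, assuming pandas, matplotlib, and NumPy are installed. The data frame below is synthetic, and the column names `height`, `weight`, and `age` are hypothetical, chosen only so that one pair of variables is clearly correlated and another is not.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data (an assumption for illustration): weight depends on height, age is independent.
rng = np.random.default_rng(2)
n = 300
height = rng.normal(170, 10, size=n)
weight = 0.9 * height - 85 + rng.normal(0, 6, size=n)
age = rng.uniform(18, 70, size=n)
df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# Off-diagonal cells are pair-wise scatter plots; diagonal cells are histograms.
axes = scatter_matrix(df, diagonal="hist", alpha=0.6, figsize=(7, 7))
fig = axes[0, 0].get_figure()
fig.savefig("scatter_matrix.png")

print(axes.shape)  # one cell per pair of columns
```

In the resulting grid, the height/weight cells should show an upward-sloping cloud, while the cells involving age should look like shapeless blobs. If two predictors showed a tight line against each other, that would be the collinearity warning mentioned above.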
A heat map is a diagram that represents the value of a particular measure between data points using color. The measure can take many forms, but by computing it across multiple variables you get a range of values that can be compared by degree. A heat map typically includes a color key (legend) that shows which color corresponds to which value of the measure. What’s great about heat maps is that you can visually compare the value of a relationship between variables through more than one medium. Although other visualization techniques, like a scatter matrix, can show relationships based on values, heat maps add color as a medium for comparison. When trying to determine variable relationships, I like to create a heat map of the correlations between variables. This type of heat map helps me determine which predictor variables have the strongest linear correlation with the target variable, how the predictor variables correlate with each other, and how much the correlations vary.
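Here is a minimal sketch of a correlation heat map, assuming pandas, matplotlib, and NumPy. The data is synthetic, and the column names `x1`, `x2`, `x3`, and `target` are hypothetical. The `coolwarm` color map and the choice to print each value inside its cell are my own preferences, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): x3 is nearly a copy of x1,
# and target depends positively on x1 and negatively on x2.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + 0.2 * rng.normal(size=n)
target = 3.0 * x1 - 2.0 * x2 + rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "target": target})

# Pair-wise Pearson correlations between all columns.
corr = df.corr()

# Draw the matrix as colored cells, with a color bar acting as the heat key.
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write the numeric value in each cell so color and number can be read together.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.tight_layout()
fig.savefig("correlation_heatmap.png")

# Rank the predictors by the strength of their linear correlation with the target.
ranking = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

Reading the `target` row tells me which predictors are most strongly related to the target, and the strongly colored x1/x3 cell flags two predictors that carry nearly the same information.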
As you can see above, these are just a few of the many visualization techniques we can use in our data science pursuits. The techniques above are the ones I use most to determine relationships in the data in the initial stages of a project, but there are many others, suited to other purposes. Also, many of the techniques can convey the same underlying message, so data scientists can use some of them interchangeably. Moving forward, I’m hoping to develop my skills and build upon these techniques to solidify my understanding of the relationships in the data I am working with. I am always looking for more techniques to use, and as I expand my understanding of data visualization, I will be sure to share what I learn.