Not-So-Common Plots in Statistics

Let's explore the world of data analysis, where we'll not only dive into regression but also venture into Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression. Along the way, we'll encounter various plots that reveal important patterns.

In regression, we'll look at classic residual plots. In PCA, we'll explore scree plots, loading plots, and biplots that help us understand data structure. For PLS regression, we'll examine model selection plots, response plots, and coefficient plots that show how different factors affect outcomes. Additionally, we'll explore distance plots and component evaluation plots to understand relationships and model performance.

Regression Analysis

Regression analysis is a statistical technique used to understand and quantify the relationship between one or more independent variables and a dependent variable. Its primary goal is to predict the value of the dependent variable based on the values of the independent variables. By fitting a regression model to observed data, analysts can uncover patterns, make predictions, and quantify associations. Causal conclusions need additional assumptions or a suitable study design.

A linear regression model is assumed to satisfy the following:

The relationship between the predictors and the response is linear
The errors have constant variance
The errors are normally distributed
The errors are independent of each other

To check whether the regression model is appropriate, we examine four kinds of residual plots.

Residuals vs. Fits Plot
Normal Probability Plot of Residuals
Residuals vs. Order Plot
Histogram of Standardized Residuals

Residuals vs. Fits Plot

Plot Significance

This plot shows the residuals (the differences between observed and predicted values) on the y-axis and the predicted values (the fitted values) on the x-axis.
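
To make the two axes concrete, here is a minimal sketch of how the plotted quantities can be computed with NumPy. The data, the seed, and the variable names are made up for illustration; only the definitions of fitted values and residuals come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one predictor with a linear signal plus noise.
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=50)

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta        # x-axis of the Residuals vs. Fits Plot
residuals = y - fitted   # y-axis of the Residuals vs. Fits Plot
```

Plotting `residuals` against `fitted` with any plotting library gives the Residuals vs. Fits Plot. With an intercept in the model, the residuals always average to zero, so the thing to look at is their pattern and spread, not their mean.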

Ideal Scenario

Ideally, the residuals should be randomly scattered around zero without any clear pattern and with a roughly constant spread. The constant spread is often called homoscedasticity.

Non-Ideal Scenario

If you notice a pattern, the model might not be appropriate for the data. A curve suggests a violation of the linearity assumption, while a funnel shape or an increasing/decreasing spread suggests non-constant variance, often called heteroscedasticity.

Corrective Measure in Case of Non-Ideal Scenario

Transform the response variable (more on this at the end of this post) or add polynomial terms to better capture the relationship between the predictors and the response. Outliers or influential points may need to be investigated and potentially removed or accounted for in the model.

Normal Probability Plot of Residuals

Plot Significance

In statistics, the assumption of normality of error terms is crucial for several reasons, particularly in the context of linear regression models and hypothesis testing.

This plot is used to assess the normality assumption of the residuals. It compares the distribution of the residuals to a normal distribution. If the error terms follow a normal distribution with mean 0 and variance σ², then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles of the residuals should be approximately linear.
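
The pairing of theoretical and observed percentiles can be sketched as follows. The residuals here are simulated, and the plotting positions `(i - 0.5) / n` are one common convention among several, so treat both as assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
residuals = rng.normal(0, 2, size=200)   # hypothetical residuals

n = residuals.size
# Plotting positions (i - 0.5) / n for the ordered residuals.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
observed = np.sort(residuals)

# Correlation between the two sets of percentiles: values near 1 mean
# the points lie close to a straight line.
r = np.corrcoef(theoretical, observed)[0, 1]
```

Plotting `observed` against `theoretical` gives the normal probability plot. The correlation `r` is only a rough numerical summary of straightness and does not replace looking at the plot.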

Ideal Scenario

If the residuals follow a normal distribution, the points on the plot should fall approximately along a straight line. If the points follow the line closely, it indicates that the assumption of normality is reasonable.

Plot where residuals are normally distributed

Non-Ideal Scenario

Deviations from a straight line suggest departures from normality. In that case, the p-values and confidence intervals from linear regression models and parametric hypothesis tests may be unreliable, particularly in small samples.

A heavy-tailed plot for non-normal residuals

Corrective Measure in Case of Non-Ideal Scenario

If the points deviate significantly from a straight line, one may consider transforming the response variable or applying a robust regression method that is less sensitive to departures from normality. Alternatively, one can use a non-parametric regression technique if the normality assumption cannot be met.

Residuals vs. Order Plot

Plot Significance

This plot helps to identify patterns in the residuals based on the order of the data points in time or space, namely serial correlation. The plot is only useful when the order matters and is known.

Ideal Scenario

A randomly scattered pattern suggests that the residuals are independent of the order of the data points.

Non-Ideal Scenario

If there's a clear pattern or trend in the plot, it suggests that there might be autocorrelation or dependence among the residuals, indicating a violation of the independence assumption.
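
A numerical companion to this plot is the Durbin-Watson statistic, which is not part of the plot itself but summarizes the same kind of dependence between consecutive residuals. The sketch below uses simulated residuals; the AR(1) coefficient of 0.8 is an arbitrary choice for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals,
    toward 0 for positive and toward 4 for negative serial correlation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
independent = rng.normal(size=500)

# Hypothetical AR(1) residuals: each value carries over 0.8 of the previous one.
correlated = np.zeros(500)
noise = rng.normal(size=500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
```

Here `dw_independent` should land close to 2, while `dw_correlated` should fall well below it, mirroring the trend one would see in the Residuals vs. Order Plot.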

Corrective Measure in Case of Non-Ideal Scenario

One may need to account for this serial correlation by incorporating time series techniques or including additional predictor variables that capture the temporal structure of the data. Residuals might also indicate a need for a more complex model that includes lagged terms or other time-dependent predictors.

Histogram of Standardized Residuals

Plot Significance

This plot displays the distribution of the standardized residuals, which are the residuals divided by their standard deviation.
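
As a rough sketch of that definition, the snippet below divides the residuals by the residual standard deviation and bins them. The data are simulated, and some statistical packages additionally adjust each residual for its leverage, which this simplified version does not do.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=100)   # hypothetical data

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Residual standard deviation with n - p degrees of freedom
# (p = number of fitted coefficients, including the intercept).
n, p = X.shape
s = np.sqrt(np.sum(residuals ** 2) / (n - p))
standardized = residuals / s

# Bin counts and bin edges are what a histogram draws.
counts, edges = np.histogram(standardized, bins=10)
```

Most standardized residuals from a well-behaved model fall between about -2 and 2, so bars far outside that range are worth a closer look.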

Ideal Scenario

Ideally, the histogram should resemble a bell-shaped curve centered around zero, indicating that the residuals are approximately normally distributed.

Normally distributed standardized residuals

Non-Ideal Scenario

If the histogram shows a non-ideal shape (e.g., skewed, multi-modal, or with outliers), it suggests a violation of the normality assumption.

Left skewed histogram of standardized residuals

Corrective Measure in Case of Non-Ideal Scenario

Transform the response variable or use a robust regression method if the residuals are clearly non-normal. If the spread of the residuals is also uneven, weighted regression, robust standard errors, or a generalized least squares approach can handle the heteroscedasticity.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction and data exploration. It works by transforming the original variables into a new set of uncorrelated variables called principal components. These components are ordered by the amount of variance they explain in the data, allowing for the most significant patterns and relationships to be captured in fewer dimensions. By retaining the most informative features while discarding noise and redundancy, PCA aids in simplifying data interpretation and improving the efficiency of subsequent analyses.

Scree Plot

Plot Significance

A Scree Plot is used in PCA to visualize the variance explained by each principal component. The x-axis shows the component number, while the y-axis shows the eigenvalue or the proportion of variance explained by that component.
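
The values on the y-axis can be computed directly from a singular value decomposition. This is a minimal sketch on simulated data: the two hidden factors, the mixing weights, and the noise level are all invented to produce an obvious elbow.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 6 observed variables driven mostly by 2 hidden factors.
factors = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.9, 0.8, 0.1, 0.0, 0.2],
                   [0.0, 0.2, 0.1, 1.0, 0.9, 0.8]])
X = factors @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize, then take the SVD. Squared singular values divided by
# n - 1 are the eigenvalues of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
singular_values = np.linalg.svd(Z, compute_uv=False)
eigenvalues = singular_values ** 2 / (Z.shape[0] - 1)

explained_ratio = eigenvalues / eigenvalues.sum()   # y-axis of the scree plot
cumulative = np.cumsum(explained_ratio)
```

Plotting `explained_ratio` against the component number 1 to 6 gives the scree plot. In this constructed example, the first two components should carry nearly all of the variance.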

Ideal Scenario

In an ideal scenario, the Scree Plot exhibits a steep drop-off in variance explained after the initial components, forming a clear "elbow" or bend in the plot. This indicates that a small number of components capture the majority of the variance in the data, allowing for effective dimensionality reduction without significant loss of information.

Elbowed Scree Plot that has 3 principal components explaining all variance

Non-Ideal Scenario

A Scree Plot lacking a clear elbow suggests that each additional component contributes roughly equally to the total variance explained. This indicates that the data may not have clear underlying structures, making it challenging to reduce dimensionality without losing important information.

A less pronounced elbow indicates an unstructured PCA model with unclear principal components

Corrective Measure in Case of Non-Ideal Scenario

If the Scree Plot lacks a distinct elbow, further exploration may be necessary to understand the underlying data structure. This could involve examining alternative dimensionality reduction techniques or considering different feature selection methods. Additionally, collecting more data or refining data preprocessing steps may help uncover clearer patterns and improve the interpretability of the results obtained from PCA.

Loading Plot

Plot Significance

A Loading Plot is a visual representation often utilized in PCA to elucidate the relationship between the original variables and the principal components. Each variable is represented as a vector in the loading plot, indicating its contribution and direction to the principal components.
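
The vectors can be sketched from an eigendecomposition of the correlation matrix. Software differs on whether a "loading" is the raw eigenvector coefficient or that coefficient scaled by the square root of the eigenvalue; this sketch assumes the unscaled coefficients, and the data and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical variables: the first two move together, the third is independent.
a = rng.normal(size=300)
X = np.column_stack([a + 0.1 * rng.normal(size=300),
                     a + 0.1 * rng.normal(size=300),
                     rng.normal(size=300)])
names = ["var_a", "var_b", "var_c"]   # illustrative names

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = (Z.T @ Z) / (Z.shape[0] - 1)

# eigh returns eigenvalues in ascending order, so reorder to put the largest first.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order]   # column k holds the loadings on component k + 1

# Each variable's arrow ends at (loading on PC1, loading on PC2).
arrows = {name: (loadings[i, 0], loadings[i, 1]) for i, name in enumerate(names)}
```

In this constructed example, `var_a` and `var_b` should point in nearly the same direction along the first component, while `var_c` should align with the second. The sign of each component is arbitrary, so a mirrored plot carries the same information.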

Ideal Scenario

In an optimal scenario, a Loading Plot showcases clear patterns where variables cluster together or align along certain directions. This indicates that the original variables are well-represented by the principal components, allowing for straightforward interpretation and understanding of the underlying data structure. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

While age is closest to first component, debt is close to second component

Non-Ideal Scenario

A Loading Plot lacking distinct patterns or with erratic, scattered vectors suggests that the original variables may not be effectively captured by the principal components.

This plot has multiple unclear principal components

Corrective Measure in Case of Non-Ideal Scenario

Explore alternative dimensionality reduction techniques, refine feature selection methods, or address issues such as multicollinearity among variables.

Score Plot

Plot Significance

A Score Plot is a graphical representation commonly used in multivariate analysis techniques such as PCA and PLS regression. It visualizes the scores of individual observations or samples along the principal components or latent variables. Each observation is represented as a point in the plot space, with its position determined by its scores on the selected components.
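
For PCA, the scores are simply the standardized data projected onto the principal directions. The sketch below uses two simulated groups of observations; the group sizes and means are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical data: two groups of observations with different means.
group_a = rng.normal(loc=0.0, size=(40, 4))
group_b = rng.normal(loc=5.0, size=(40, 4))
X = np.vstack([group_a, group_b])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
# Rows of Vt are the principal directions; projecting onto them gives the scores.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

pc1, pc2 = scores[:, 0], scores[:, 1]   # coordinates of each point in the score plot
```

Plotting `pc2` against `pc1` gives the score plot. In this example the two groups should separate along the first component, which is the kind of clustering the plot is meant to reveal.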

Ideal Scenario

In an ideal scenario, a Score Plot exhibits clear patterns or clusters of observations, indicating distinct groups or relationships in the data. These patterns can provide valuable insights into the underlying structure or characteristics of the dataset, facilitating interpretation and decision-making.

Score plot helps to observe clustering of data

Non-Ideal Scenario

A non-ideal Score Plot may show scattered or overlapping points without clear patterns or groupings.

Corrective Measure in Case of Non-Ideal Scenario

Explore alternative visualization techniques, refine data cleaning processes, or consider different dimensionality reduction methods.

Biplot

Plot Significance

A Biplot is a graphical representation commonly employed in PCA. It simultaneously displays both the observations and variables in the same plot space. In a Biplot, the observations are represented as points, while the variables are depicted as vectors. The direction and length of the vectors indicate the contribution and importance of each variable to the principal components, while the proximity of the observations to the vectors reflects their relationships with the variables. It is also a useful plot to observe outliers.

Ideal Scenario

An ideal Biplot exhibits clear patterns where variables and observations cluster together or align in specific directions. This indicates strong relationships between variables and observations. A Biplot overlays the score plot and the loading plot.

Two clusters are observable - on top left and on the right. Bottom right outlier is also observable.

Non-Ideal Scenario

A Biplot lacking distinct patterns or with scattered observations and vectors suggests complexity or ambiguity in the data relationships.

Randomized observations showing no data clustering

Corrective Measure in Case of Non-Ideal Scenario

Explore alternative visualization techniques, refine data cleaning processes, or consider different dimensionality reduction methods.

Partial Least Squares (PLS) Regression

Partial Least Squares (PLS) regression is a powerful statistical technique used for modeling relationships between a set of predictor variables and a response variable, especially in situations where there are multicollinearities or when the number of predictors exceeds the number of observations. Unlike traditional regression methods, PLS regression constructs latent variables, known as components, which are linear combinations of the original predictors. These components are optimized to maximize the covariance between the predictors and the response variable.
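
To show how such components can be built, here is a minimal sketch of PLS with a single response using the NIPALS iteration. It is one standard way to compute PLS, not necessarily what any particular package does internally, and the function name `pls1_fit` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Sketch of single-response PLS via NIPALS on centered data.
    Returns coefficients for centered X and the score matrix T."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, Q, T = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)       # unit direction of maximum covariance with y
        t = Xk @ w                   # component scores
        tt = t @ t
        p = Xk.T @ t / tt            # X loadings
        q = yk @ t / tt              # y loading
        Xk = Xk - np.outer(t, p)     # remove what this component explains
        yk = yk - q * t
        W.append(w); P.append(p); Q.append(q); T.append(t)
    W, P, Q, T = np.array(W).T, np.array(P).T, np.array(Q), np.array(T).T
    coef = W @ np.linalg.solve(P.T @ W, Q)
    return coef, T

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=60)   # two correlated predictors
y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(0, 0.5, size=60)

coef_2, T = pls1_fit(X, y, n_components=2)
coef_full, _ = pls1_fit(X, y, n_components=4)
```

With as many components as predictors, the PLS coefficients coincide with ordinary least squares. The interest of PLS lies in using fewer components, which trades a little bias for more stable estimates when predictors are correlated.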

Model Selection Plots

Plot Significance

A Model Selection Plot typically displays the coefficient of determination (R²) as a function of the number of components included in the model. This plot helps to evaluate and compare different models based on their explanatory power relative to the number of components used.

Ideal Scenario

An ideal model selection plot in Minitab reveals a clear pattern where the R² value increases steadily with the addition of more components, reaching a plateau or stabilizing at a certain point. This indicates that the model's explanatory power improves with the inclusion of additional components, up to a certain limit, beyond which adding more components does not significantly enhance the model's performance.

Almost 100% R-Sq is achieved within 3 components

Non-Ideal Scenario

A non-ideal model selection plot may exhibit erratic fluctuations or a lack of clear trends in the R² values as the number of components increases.

Cross-validation leads to decreased R-sq after 4 components

Cross-validation involves partitioning the dataset into multiple subsets, typically referred to as folds. The PLS model is then trained on a subset of the data (training set) and validated on the remaining data (validation set). This process is repeated multiple times, each time with a different partitioning of the data into training and validation sets.
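
The fold-by-fold procedure can be sketched as below. To keep the example short, ordinary least squares stands in for the PLS model, so this illustrates the cross-validation loop and not PLS itself. The function names and the simulated data are assumptions.

```python
import numpy as np

def fit_r2(X, y):
    """In-sample R-squared of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - np.sum((y - A @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

def kfold_r2(X, y, n_folds=5, seed=0):
    """Cross-validated R-squared: every observation is predicted by a
    model that was fitted without it."""
    order = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        A = np.column_stack([np.ones(train.size), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(held_out.size), X[held_out]])
        press += np.sum((y[held_out] - B @ beta) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(11)
X = rng.normal(size=(40, 12))                # only the first two columns matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=40)

r2_fit_all = fit_r2(X, y)
r2_cv_all = kfold_r2(X, y)
r2_cv_small = kfold_r2(X[:, :2], y)
```

The fitted R² can only go up as terms are added, whereas the cross-validated R² can go down once the extra terms start fitting noise. That gap is what the cross-validated curve in a model selection plot exposes.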

Corrective Measure in Case of Non-Ideal Scenario

Revise the model specifications, explore alternative dimensionality reduction techniques, or consider different variable selection methods. Additionally, conduct cross-validation or validation studies that can help assess the generalizability of the models and improve the robustness of the results obtained from the model selection plot.

Response Plots

Plot Significance

A Response Plot in the context of PLS regression is a graphical representation that compares the actual response values with the corresponding predicted or calculated response values obtained from the PLS regression model. In these plots, the actual response values are plotted on the x-axis, while the predicted response values generated by the PLS regression model are plotted on the y-axis.

Ideal Scenario

In an ideal scenario, a Response Plot in PLS regression reveals a strong linear relationship between the actual and predicted response values, with points clustering closely around the diagonal line representing perfect agreement.

Non-Ideal Scenario

A non-ideal Response Plot may show scattered points with deviations from the diagonal line, suggesting discrepancies between the actual and predicted response values. This indicates potential limitations or deficiencies in the PLS regression model, such as bias, underfitting, or overfitting, which can compromise the accuracy and validity of the predictions.

Corrective Measure in Case of Non-Ideal Scenario

When confronted with a non-ideal Response Plot in PLS regression, further analysis or model refinement may be necessary. This could involve adjusting the model parameters, exploring alternative preprocessing techniques, or considering different sets of predictor variables. Additionally, conducting diagnostic checks, such as cross-validation or bootstrapping, can help assess the robustness of the PLS regression model and identify opportunities for improvement in predictive performance and interpretability.

Coefficients and Standardized Coefficients Plots

Plot Significance

Coefficients and Standardized Coefficients Plots are graphical representations commonly used in statistical modeling to visualize the coefficients of predictor variables and their standardized counterparts. In these plots, the predictor variables are typically listed on the y-axis, while the coefficients are displayed on the x-axis. The coefficients represent the estimated impact of each predictor variable on the response variable, while the standardized coefficients provide a measure of the relative importance of each predictor variable after standardizing their scales.
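
The effect of standardizing can be seen in a small sketch. It uses ordinary least squares as a stand-in for the PLS fit, and the predictors `income` and `age`, their scales, and the true coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical predictors on very different scales.
income = rng.normal(50_000, 10_000, size=200)
age = rng.normal(40, 10, size=200)
y = 0.0002 * income + 0.3 * age + rng.normal(0, 1, size=200)

X = np.column_stack([income, age])
A = np.column_stack([np.ones(len(y)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
raw = beta[1:]   # coefficients in the original units

# Rescale each coefficient by sd(predictor) / sd(response).
standardized = raw * X.std(axis=0, ddof=1) / y.std(ddof=1)
```

The raw coefficient for `income` looks negligible next to the one for `age` only because income is measured in much larger units. After standardizing, the two become comparable, which is why the standardized plot is the one to use for judging relative importance.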

Ideal Scenario

An ideal Coefficients and Standardized Coefficients Plot reveals clear and interpretable patterns where the coefficients exhibit consistent magnitudes and directions across predictor variables.

Non-Ideal Scenario

A non-ideal Coefficients and Standardized Coefficients Plot may show erratic or inconsistent patterns in the coefficients across predictor variables.

A typical Standardized Coefficients Plot

Corrective Measure in Case of Non-Ideal Scenario

Address issues such as multicollinearity through variable selection techniques, refine model specifications, or consider alternative modeling approaches. Conducting sensitivity analyses or diagnostic checks can help assess the stability and robustness of the coefficient estimates.

Distance Plot

Plot Significance

In Partial Least Squares (PLS) regression, a Distance Plot visualizes the distances between observations in the predictor space (X) and the response space (Y). These plots typically display the distances from the observations to the predictor space (X) on the x-axis and the distances to the response space (Y) on the y-axis.

When examining this plot, look for points with distances greater than other points on the x- or y-axis. Observations with greater distances from the y-model may be outliers and observations with greater distances from the x-model may be leverage points.

Leverage points, in the context of statistical modeling, refer to observations or data points that have a significant influence on the estimation of model parameters. These points can exert a disproportionate impact on the fitted regression model, potentially affecting the estimated coefficients, predicted values, and overall model performance.
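
For ordinary least squares, leverage has a simple closed form: the diagonal of the hat matrix. The sketch below illustrates that idea on simulated data. It is not the distance from the x-model that a PLS Distance Plot shows, and the cutoff of 2p/n is only a common rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(0, 1, size=30)
x[0] = 8.0   # one hypothetical observation far from the rest
X = np.column_stack([np.ones_like(x), x])

# Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'.
# A QR decomposition gives the same diagonal without an explicit inverse.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q ** 2, axis=1)

# Rule of thumb: flag observations whose leverage exceeds 2p/n.
n, p = X.shape
flagged = np.where(leverage > 2 * p / n)[0]
```

The leverages always add up to the number of fitted coefficients, so one observation with a large share leaves less for everyone else. Here the first observation should stand out clearly.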

Ideal Scenario

In an ideal scenario, a Distance Plot in PLS regression exhibits clear and interpretable patterns where observations are well-separated in both the predictor and response spaces.

Distance Plot shows some clustering to bottom right, indicating many leverage points

Non-Ideal Scenario

A non-ideal Distance Plot may show overlapping or scattered observations without clear patterns or groupings in either the predictor or response spaces.

Corrective Measure in Case of Non-Ideal Scenario

Revise the model specifications, explore alternative preprocessing techniques, or consider different sets of predictor variables.

Component Evaluation Plots

Component evaluation plots in Partial Least Squares (PLS) regression analysis provide valuable insights into the relationships between predictor variables and the response variable.

The Score Plot visualizes the scores of individual observations along the principal components, helping to identify patterns or clusters in the data. For more information, refer to Score Plot in PCA.

A scattered Score Plot signifies weak model fit, multicollinearity, and presence of outliers and leverage points

The 3D Score Plot extends this visualization into three dimensions, facilitating a deeper understanding of the data structure.

PLS 3D Score Plot shows scores of data points across three principal components

Loading Plots display the weights or contributions of predictor variables to the principal components, aiding in variable selection and interpretation. For more information, refer to Loading Plot in PCA.

The Residual X Plot highlights any systematic patterns or trends in the residuals, enabling the detection of model inadequacies or outliers.

Most variation in residuals is due to predictor variables 9 and 10

Finally, the Calculated X Plot illustrates the predicted values of the predictor variables based on the PLS regression model, facilitating the assessment of model accuracy and performance.

Most variation in the actual response is due to predictor variables 9 and 10

Transforming the Response Variable

Log Transformation

Used in the Scenario :

Relationship between the response variable and predictors appears to be exponentially growing or decaying
Residuals exhibit heteroscedasticity

Example :

If the response variable is positively skewed, you can try applying the natural logarithm transformation (logarithm with base e) to make the distribution more symmetric.
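
A quick way to see the effect is to compare skewness before and after the transformation. The response below is simulated from a log-normal distribution, which is an assumption chosen because its logarithm is exactly normal.

```python
import numpy as np

def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, positive for a long right tail."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(10)
y = rng.lognormal(mean=1.0, sigma=0.8, size=1000)   # hypothetical skewed response
log_y = np.log(y)                                    # natural log, requires y > 0

skew_before = skewness(y)
skew_after = skewness(log_y)
```

The skewness should drop from a clearly positive value to something close to zero. Note that the logarithm is only defined for strictly positive responses.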

Square Root Transformation

Used in the Scenario :

When variance increases with the level of the response variable

Reciprocal Transformation

Used in the Scenario :

Relationship between the response variable and predictors appears to be inversely proportional

Example :

If the response variable decreases as the predictor increases, you can try taking the reciprocal of the response variable.

Box-Cox Transformation

Used in the Scenario :

A family of power transformations, with Log, Square Root, and Reciprocal as special cases, indexed by a parameter λ.
The optimal transformation (that is, the value of λ) is chosen to make the residuals as close to normal as possible.

Example:

One can use statistical techniques, such as maximum likelihood estimation, to estimate the optimal value of λ that maximizes the normality of the residuals.
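
A minimal sketch of that estimation is a grid search over λ using the profile log-likelihood. For brevity it assumes an intercept-only model, so the variance of the transformed response plays the role of the residual variance. The grid, the simulated data, and the function names are illustrative choices.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of positive values y: (y^lam - 1) / lam, or log(y) when lam is 0."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def profile_log_likelihood(y, lam):
    """Log-likelihood of lam, up to a constant, for an intercept-only normal model."""
    y = np.asarray(y, dtype=float)
    z = box_cox(y, lam)
    return -0.5 * y.size * np.log(np.var(z)) + (lam - 1.0) * np.sum(np.log(y))

rng = np.random.default_rng(12)
y = rng.lognormal(mean=1.0, sigma=0.8, size=1000)   # hypothetical positive response

grid = np.linspace(-2.0, 2.0, 81)
log_likelihoods = [profile_log_likelihood(y, lam) for lam in grid]
best_lambda = grid[int(np.argmax(log_likelihoods))]
```

Because the simulated response is log-normal, the selected λ should land near 0, which corresponds to the log transformation. Up to a shift and rescaling, λ = 0.5 corresponds to the square root and λ = -1 to the reciprocal.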
