# An Alternative to the Correlation Coefficient That Works For Numeric and Categorical Variables

by Rama Ramakrishnan

Dr. Rama Ramakrishnan is Professor of the Practice at MIT Sloan School of Management where he teaches courses in Data Science, Optimization and applied Machine Learning.

When starting to work with a new dataset, it is useful to quickly pinpoint which pairs of variables appear to be strongly related. It helps you spot data issues, make better modeling decisions, and ultimately arrive at better answers.

The correlation coefficient is used widely for this purpose, but it is well-known that it can’t detect non-linear relationships. Take a look at this scatterplot of two variables $$x$$ and $$y$$.

set.seed(42)
x <- seq(-1,1,0.01)
y <- sqrt(1 - x^2) + rnorm(length(x),mean = 0, sd = 0.05)

ggplot(mapping = aes(x, y)) +
geom_point() 

It is obvious to the human eye that $$x$$ and $$y$$ have a strong relationship but the correlation coefficient between $$x$$ and $$y$$ is only -0.01.

Further, if either variable of the pair is categorical, we can’t use the correlation coefficient. We will have to turn to other metrics. If $$x$$ and $$y$$ are both categorical, we can try Cramer’s V or the phi coefficient. If $$x$$ is continuous and $$y$$ is binary, we can use the point-biserial correlation coefficient.

But using different metrics is problematic. Since they are derived from different assumptions, we can’t compare the resulting numbers with one another. If the correlation coefficient between continuous variables $$x$$ and $$y$$ is 0.6 and the phi coefficient between categorical variables $$u$$ and $$v$$ is also 0.6, can we safely conclude that the relationships are equally strong? According to Wikipedia,

The correlation coefficient ranges from −1 to +1, where ±1 indicates perfect agreement or disagreement, and 0 indicates no relationship. The phi coefficient has a maximum value that is determined by the distribution of the two variables if one or both variables can take on more than two values.

A phi coefficient value of 0.6 between $$u$$ and $$v$$ may not mean much if its maximum possible value in this particular situation is much higher. Perhaps we can normalize the phi coefficient to map it to the 0-1 range? But what if that modification introduces biases?

Wouldn’t it be nice if we had one uniform approach that was easy to understand, worked for continuous and categorical variables alike, and could detect linear and nonlinear relationships?

(BTW, when $$x$$ and $$y$$ are continuous, looking at a scatter plot of $$x$$ vs $$y$$ can be very effective since the human brain can detect linear and non-linear patterns very quickly. But even if you are lucky and all your variables are continuous, looking at scatterplots of all pairs of variables is hard when you have lots of variables in your dataset; with just 100 predictors (say), you will need to look through 4950 scatterplots and this obviously isn’t practical)

### A Potential Solution

To devise a metric that satisfies the requirements we listed above, let’s invert the problem: What does it mean to say that $$x$$ and $$y$$ don’t have a strong relationship?

Intuitively, if there’s no relationship between $$x$$ and $$y$$, we would expect to see no patterns in a scatterplot of $$x$$ and $$y$$ - no lines, curves, groups etc. It will be a cloud of points that appears to be randomly scattered, perhaps something like this:

x <- seq(-1,1,0.01)
y <- runif(length(x),min = -1, max = 1)

ggplot(mapping = aes(x, y)) +
geom_point() 

In this situation, does knowing the value of $$x$$ give us any information on $$y$$?

Clearly not. $$y$$ seems to be somewhere between -1 and 1 with no particular pattern, regardless of the value of $$x$$. Knowing $$x$$ does not seem to help reduce our uncertainty about the value of $$y$$.

In contrast, look at the first picture again.

Here, knowing the value of $$x$$ does help. If we know that $$x$$ is around 0.0, for example, from the graph we will guess that $$y$$ is likely near 1.0 (the red dots). We can be confident that $$y$$ is not between 0 and 0.8. Knowing $$x$$ helps us eliminate certain values of $$y$$, reducing our uncertainty about the values $$y$$ might take.

This notion - that knowing something reduces our uncertainty about something else - is exactly the idea behind mutual information from Information Theory.

According to Wikipedia (emphasis mine),

Intuitively, mutual information measures the information that $$X$$ and $$Y$$ share: It measures how much knowing one of these variables reduces uncertainty about the other. For example, if $$X$$ and $$Y$$ are independent, then knowing $$X$$ does not give any information about $$Y$$ and vice versa, so their mutual information is zero.

Furthermore,

Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair $$(X,Y)$$ is to the product of the marginal distributions of $$X$$ and $$Y$$.

This is very promising!

As it turns out, however, implementing mutual information is not so simple. We first need to estimate the joint probabilities (i.e., the joint probability density/mass function) of $$x$$ and $$y$$ before we can calculate their Mutual Information. If $$x$$ and $$y$$ are categorical, this is easy but if one or both of them is continuous, it is more involved.

But we can use the basic insight behind mutual information – that knowing $$x$$ may reduce our uncertainty about $$y$$ – in a different way.

### The X2Y Metric

Consider three variables $$x$$, $$y$$ and $$z$$. If knowing $$x$$ reduces our uncertainty about $$y$$ by 70% but knowing $$z$$ reduces our uncertainty about $$y$$ by only 40%, we will intuitively expect that the association between $$x$$ and $$y$$ will be stronger than the association between $$z$$ and $$y$$.

So, if we can quantify the reduction in uncertainty, that can be used as a measure of the strength of the association. One way to do so is to measure $$x$$’s ability to predict $$y$$ - after all, if $$x$$ reduces our uncertainty about $$y$$, knowing $$x$$ should help us predict $$y$$ better than if we didn’t know $$x$$.

Stated another way, we can think of reduction in prediction error $$\approx$$ reduction in uncertainty $$\approx$$ strength of association.

This suggests the following approach:

1. Predict $$y$$ without using $$x$$.
• If $$y$$ is continuous, we can simply use the average value of $$y$$.
• If $$y$$ is categorical, we can use the most frequent value of $$y$$.
• These are sometimes referred to as a baseline model.
2. Predict $$y$$ using $$x$$
• We can take any of the standard predictive models out there (Linear/Logistic Regression, CART, Random Forests, SVMs, Neural Networks, Gradient Boosting etc.), set $$x$$ as the independent variable and $$y$$ as the dependent variable, fit the model to the data, and make predictions. More on this below.
3. Calculate the % decrease in prediction error when we go from (1) to (2)
• If $$y$$ is continuous, we can use any of the familiar error metrics like RMSE, SSE, MAE etc. I prefer mean absolute error (MAE) since it is less susceptible to outliers and is in the same units as $$y$$ but this is a matter of personal preference.
• If $$y$$ is categorical, we can use Misclassification Error (= 1 - Accuracy) as the error metric.

In summary, the % reduction in error when we go from a baseline model to a predictive model measures the strength of the relationship between $$x$$ and $$y$$. We will call this metric x2y since it measures the ability of $$x$$ to predict $$y$$.

(This definition is similar to R-Squared from Linear Regression. In fact, if $$y$$ is continuous and we use the Sum of Squared Errors as our error metric, the x2y metric is equal to R-Squared.)

To implement (2) above, we need to pick a predictive model to use. Let’s remind ourselves of what the requirements are:

• If there’s a non-linear relationship between $$x$$ and $$y$$, the model should be able to detect it
• It should be able to handle all possible $$x$$-$$y$$ variable types: continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical
• We may have hundreds (if not thousands) of pairs of variables we want to analyze so we want this to be quick

Classification and Regression Trees (CART) satisfies these requirements very nicely and that’s the one I prefer to use. That said, you can certainly use other models if you like.

Let’s try this approach on the ‘semicircle’ dataset from above. We use CART to predict $$y$$ using $$x$$ and here’s how the fitted values look:

# Let's generate the data again
set.seed(42)
x <- seq(-1,1,0.01)

d <- data.frame(x = x,
y = sqrt(1 - x^2) + rnorm(length(x),mean = 0, sd = 0.05))

library(rpart)
preds <- predict(rpart(y~x, data = d, method = "anova"), type = "vector")

# Set up a chart
ggplot(data = d, mapping = aes(x = x)) +
geom_point(aes(y = y), size = 0.5) +
geom_line(aes(y=preds, color = '2')) +
scale_color_brewer(name = "", labels='CART', palette="Set1")

Visually, the CART predictions seem to approximate the semi-circular relationship between $$x$$ and $$y$$. To confirm, let’s calculate the x2y metric step by step.

• The MAE from using the average of $$y$$ to predict $$y$$ is 0.19.
• The MAE from using the CART predictions to predict $$y$$ is 0.06.
• The % reduction in MAE is 68.88%.

Excellent!

If you are familiar with CART models, it is straightforward to implement the x2y metric in the Machine Learning environment of your choice. An R implementation is here and details can be found in the appendix but, for now, I want to highlight two functions from the R script that we will use in the examples below:

• x2y(u, v) calculates the x2y metric between two vectors $$u$$ and $$v$$
• dx2y(d) calculates the x2y metric between all pairs of variables in a dataframe $$d$$

### Two Caveats

Before we demonstrate the x2y metric on a couple of datasets, I want to highlight two aspects of the x2y approach.

Unlike metrics like the correlation coefficient, the x2y metric is not symmetric with respect to $$x$$ and $$y$$. The extent to which $$x$$ can predict $$y$$ can be different from the extent to which $$y$$ can predict $$x$$. For the semi-circle dataset, x2y(x,y) is 68.88% but x2y(y,x) is only 10.2%.

This shouldn’t come as a surprise, however. Let’s look at the scatterplot again but with the axes reversed.

ggplot(data = d, mapping = aes(x = y)) +
geom_point(aes(y = x), size = 0.5)  +
geom_point(data = d[abs(d$x) < 0.05,], aes(x = y, y = x), color = "orange" ) + geom_point(data = d[abs(d$y-0.6) < 0.05,], aes(x = y, y = x), color = "red" )

When $$x$$ is around 0.0, for instance, $$y$$ is near 1.0 (the orange dots). But when $$y$$ is around 0.6, $$x$$ can be in the (-0.75, - 1.0) range or in the (0.5, 0.75) range (the red dots). Knowing $$x$$ reduces the uncertainty about the value of $$y$$ a lot more than knowing $$y$$ reduces the uncertainty about the value of $$x$$.

But there’s an easy solution if you must have a symmetric metric for your application: just take the average of x2y(x,y) and x2y(y,x).

The second aspect worth highlighting is about the comparability of the x2y metric across variable pairs. All x2y values where the $$y$$ variable is continuous will be measuring a % reduction in MAE. All x2y values where the $$y$$ variable is categorical will be measuring a % reduction in Misclassification Error. Is a 30% reduction in MAE equal to a 30% reduction in Misclassification Error? It is problem dependent, there’s no universal right answer.

On the other hand, since (1) all x2y values are on the same 0-100% scale (2) are conceptually measuring the same thing, i.e., reduction in prediction error and (3) our objective is to quickly scan and identify strongly-related pairs (rather than conduct an in-depth investigation), the x2y approach may be adequate.

### Application to the Iris Dataset

The iris flower dataset is iconic in the statistics/ML communities and is widely used to illustrate basic concepts. The dataset consists of 150 observations in total and each observation has four continuous variables - the length and the width of petals and sepals - and a categorical variable indicating the species of iris.

Let’s take a look at 10 randomly chosen rows.

iris %>% sample_n(10) %>% pander
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.9 3 5.1 1.8 virginica
5.5 2.6 4.4 1.2 versicolor
6.1 2.8 4 1.3 versicolor
5.9 3.2 4.8 1.8 versicolor
7.7 2.6 6.9 2.3 virginica
5.7 4.4 1.5 0.4 setosa
6.5 3 5.2 2 virginica
5.2 2.7 3.9 1.4 versicolor
5.6 2.7 4.2 1.3 versicolor
7.2 3.2 6 1.8 virginica

We can calculate the x2y values for all pairs of variables in iris by running dx2y(iris) in R (details of how to use the dx2y() function are in the appendix).

dx2y(iris) %>% pander
x y perc_of_obs x2y
Petal.Width Species 100 94
Petal.Length Species 100 93
Petal.Width Petal.Length 100 80.73
Species Petal.Length 100 79.72
Petal.Length Petal.Width 100 77.32
Species Petal.Width 100 76.31
Sepal.Length Petal.Length 100 66.88
Sepal.Length Species 100 62
Petal.Length Sepal.Length 100 60.98
Sepal.Length Petal.Width 100 54.36
Petal.Width Sepal.Length 100 48.81
Species Sepal.Length 100 42.08
Sepal.Width Species 100 39
Petal.Width Sepal.Width 100 31.75
Petal.Length Sepal.Width 100 30
Sepal.Width Petal.Length 100 28.16
Sepal.Width Petal.Width 100 23.02
Species Sepal.Width 100 22.37
Sepal.Length Sepal.Width 100 18.22
Sepal.Width Sepal.Length 100 12.18

The first two columns in the output are self-explanatory. The third column - perc_of_obs - is the % of observations in the dataset that was used to calculate that row’s x2y value. When a dataset has missing values, only observations that have values present for both $$x$$ and $$y$$ will be used to calculate the x2y metrics for that variable pair. The iris dataset has no missing values so this value is 100% for all rows. The fourth column is the value of the x2y metric and the results are sorted in descending order of this value.

Looking at the numbers, both Petal.Length and Petal.Width seem to be highly associated with Species (and with each other). In contrast, it appears that Sepal.Length and Sepal.Width are very weakly associated with each other.

Note that even though Species is categorical and the other four variables are continuous, we could simply “drop” the iris dataframe into the dx2y() function and calculate the associations between all the variables.

### Application to a COVID-19 Dataset

Next, we examine a COVID-19 dataset that was downloaded from the COVID-19 Clinical Data Repository in April 2020. This dataset contains clinical characteristics and COVID-19 test outcomes for 352 patients. Since it has a good mix of continuous and categorical variables, having something like the x2y metric that can work for any type of variable pair is convenient.

Let’s read in the data and take a quick look at the columns.

df <- read.csv("covid19.csv", stringsAsFactors = FALSE)
str(df) 
## 'data.frame':    352 obs. of  45 variables:
##  $date_published : chr "2020-04-14" "2020-04-14" "2020-04-14" "2020-04-14" ... ##$ clinic_state                 : chr  "CA" "CA" "CA" "CA" ...
##  $test_name : chr "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" ... ##$ swab_type                    : chr  "" "Nasopharyngeal" "Nasal" "" ...
##  $covid_19_test_results : chr "Negative" "Negative" "Negative" "Negative" ... ##$ age                          : int  30 77 49 42 37 23 71 28 55 51 ...
##  $high_risk_exposure_occupation: logi TRUE NA NA FALSE TRUE FALSE ... ##$ high_risk_interactions       : logi  FALSE NA NA FALSE TRUE TRUE ...
##  $diabetes : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ chd                          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $htn : logi FALSE TRUE FALSE TRUE FALSE FALSE ... ##$ cancer                       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $asthma : logi TRUE TRUE FALSE TRUE FALSE FALSE ... ##$ copd                         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $autoimmune_dis : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ temperature                  : num  37.1 36.8 37 36.9 37.3 ...
##  $pulse : int 84 96 79 108 74 110 78 NA 97 66 ... ##$ sys                          : int  117 128 120 156 126 134 144 NA 160 98 ...
##  $dia : int 69 73 80 89 67 79 85 NA 97 65 ... ##$ rr                           : int  NA 16 18 14 16 16 15 NA 16 16 ...
##  $sats : int 99 97 100 NA 99 98 96 97 99 100 ... ##$ rapid_flu                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $rapid_flu_results : chr "" "" "" "" ... ##$ rapid_strep                  : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
##  $rapid_strep_results : chr "" "Negative" "" "" ... ##$ ctab                         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $labored_respiration : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ rhonchi                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $wheezes : logi FALSE FALSE FALSE TRUE FALSE FALSE ... ##$ cough                        : logi  FALSE NA TRUE TRUE TRUE TRUE ...
##  $cough_severity : chr "" "" "" "Mild" ... ##$ fever                        : logi  NA NA NA FALSE FALSE TRUE ...
##  $sob : logi FALSE NA FALSE FALSE TRUE TRUE ... ##$ sob_severity                 : chr  "" "" "" "" ...
##  $diarrhea : logi NA NA NA TRUE NA NA ... ##$ fatigue                      : logi  NA NA NA NA TRUE TRUE ...
##  $headache : logi NA NA NA NA TRUE TRUE ... ##$ loss_of_smell                : logi  NA NA NA NA NA NA ...
##  $loss_of_taste : logi NA NA NA NA NA NA ... ##$ runny_nose                   : logi  NA NA NA NA NA TRUE ...
##  $muscle_sore : logi NA NA NA TRUE NA TRUE ... ##$ sore_throat                  : logi  TRUE NA NA NA NA TRUE ...
##  $cxr_findings : chr "" "" "" "" ... ##$ cxr_impression               : chr  "" "" "" "" ...
##  $cxr_link : chr "" "" "" "" ... #%>% pander There are lots of missing values (denoted by ‘NA’) and lots of blanks as well - for example, see the first few values of the rapid_flu_results field above. We will convert the blanks to NAs so that all the missing values can be treated consistently. Also, the rightmost three columns are free-text fields so we will remove them from the dataframe. df <- read.csv("covid19.csv", stringsAsFactors = FALSE, na.strings=c("","NA") # read in blanks as NAs )%>% select(-starts_with("cxr")) # remove the chest x-ray note fields str(df)  ## 'data.frame': 352 obs. of 42 variables: ##$ date_published               : chr  "2020-04-14" "2020-04-14" "2020-04-14" "2020-04-14" ...
##  $clinic_state : chr "CA" "CA" "CA" "CA" ... ##$ test_name                    : chr  "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" "Rapid COVID-19 Test" ...
##  $swab_type : chr NA "Nasopharyngeal" "Nasal" NA ... ##$ covid_19_test_results        : chr  "Negative" "Negative" "Negative" "Negative" ...
##  $age : int 30 77 49 42 37 23 71 28 55 51 ... ##$ high_risk_exposure_occupation: logi  TRUE NA NA FALSE TRUE FALSE ...
##  $high_risk_interactions : logi FALSE NA NA FALSE TRUE TRUE ... ##$ diabetes                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $chd : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ htn                          : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
##  $cancer : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ asthma                       : logi  TRUE TRUE FALSE TRUE FALSE FALSE ...
##  $copd : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ autoimmune_dis               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $temperature : num 37.1 36.8 37 36.9 37.3 ... ##$ pulse                        : int  84 96 79 108 74 110 78 NA 97 66 ...
##  $sys : int 117 128 120 156 126 134 144 NA 160 98 ... ##$ dia                          : int  69 73 80 89 67 79 85 NA 97 65 ...
##  $rr : int NA 16 18 14 16 16 15 NA 16 16 ... ##$ sats                         : int  99 97 100 NA 99 98 96 97 99 100 ...
##  $rapid_flu : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ##$ rapid_flu_results            : chr  NA NA NA NA ...
##  $rapid_strep : logi FALSE TRUE FALSE FALSE FALSE TRUE ... ##$ rapid_strep_results          : chr  NA "Negative" NA NA ...
##  $ctab : logi TRUE TRUE TRUE TRUE TRUE TRUE ... ##$ labored_respiration          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $rhonchi : logi FALSE FALSE FALSE TRUE FALSE FALSE ... ##$ wheezes                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $cough : logi FALSE NA TRUE TRUE TRUE TRUE ... ##$ cough_severity               : chr  NA NA NA "Mild" ...
##  $fever : logi NA NA NA FALSE FALSE TRUE ... ##$ sob                          : logi  FALSE NA FALSE FALSE TRUE TRUE ...
##  $sob_severity : chr NA NA NA NA ... ##$ diarrhea                     : logi  NA NA NA TRUE NA NA ...
##  $fatigue : logi NA NA NA NA TRUE TRUE ... ##$ headache                     : logi  NA NA NA NA TRUE TRUE ...
##  $loss_of_smell : logi NA NA NA NA NA NA ... ##$ loss_of_taste                : logi  NA NA NA NA NA NA ...
##  $runny_nose : logi NA NA NA NA NA TRUE ... ##$ muscle_sore                  : logi  NA NA NA TRUE NA TRUE ...
##  \$ sore_throat                  : logi  TRUE NA NA NA NA TRUE ...
#%>% pander

Now, let’s run it through the x2y approach. We are particularly interested in non-zero associations between the covid_19_test_results field and the other fields so we zero in on those by running dx2y(df, target = "covid_19_test_results") in R (details in the appendix) and filtering out the zero associations.

dx2y(df, target = "covid_19_test_results") %>%
filter(x2y >0) %>%
pander
x y perc_of_obs x2y
covid_19_test_results loss_of_smell 21.88 18.18
covid_19_test_results loss_of_taste 22.73 12.5
covid_19_test_results sats 92.9 2.24

Only three of the 41 variables have a non-zero association with covid_19_test_results. Disappointingly, the highest x2y value is an unimpressive 18%. It is based on just 22% of the observations (since the other 78% of observations had missing values) and makes one wonder if this modest association is real or if it is just due to chance.

If we were working with the correlation coefficient, we could easily calculate a confidence interval for it and gauge if what we are seeing is real or not. Can we do the same thing for the x2y metric?

We can, by using bootstrapping. Given $$x$$ and $$y$$, we can sample with replacement a 1000 times (say) and calculate the x2y metric each time. With these 1000 numbers, we can construct a confidence interval easily (this is available as an optional confidence argument in the R functions we have been using; please see the appendix).

Let’s re-do the earlier calculation with “confidence intervals” turned on by running dx2y(df, target = "covid_19_test_results", confidence = TRUE) in R.

dx2y(df, target = "covid_19_test_results", confidence = TRUE) %>%
filter(x2y >0) %>%
pander(split.tables = Inf)
x y perc_of_obs x2y CI_95_Lower CI_95_Upper
covid_19_test_results loss_of_smell 21.88 18.18 -8.08 36.36
covid_19_test_results loss_of_taste 22.73 12.5 -11.67 25
covid_19_test_results sats 92.9 2.24 -1.85 4.48

The 95% confidence intervals all contain 0.0, so none of these associations appear to be real.

Let’s see what the top 10 associations are, between any pair of variables.

dx2y(df) %>%head(10) %>% pander
x y perc_of_obs x2y
loss_of_smell loss_of_taste 20.17 100
loss_of_taste loss_of_smell 20.17 100
fatigue sore_throat 27.84 89.58
sore_throat fatigue 27.84 88.89
runny_nose fatigue 25.57 84.44

Interesting. loss_of_smell and loss_of_taste are perfectly associated with each other. Let’s look at the raw data.

with(df, table(loss_of_smell, loss_of_taste))
##              loss_of_taste
## loss_of_smell FALSE TRUE
##         FALSE    55    0
##         TRUE      0   16

They agree for every observation in the dataset and, as a result, their x2y is 100%.

Moving down the x2y ranking, we see a number of variables - fatigue, headache, sore_throat, and runny_nose - that are all strongly associated with each other, as if they are all connected by a common cause.

When the number of variable combinations is high and there are lots of missing values, it can be helpful to scatterplot x2y vs perc_of_obs.

ggplot(data = dx2y(df), aes(y=x2y, x = perc_of_obs)) +
geom_point()
## Warning: Removed 364 rows containing missing values (geom_point).

Unfortunately, the top-right quadrant is empty: there are no strongly-related variable pairs that are based on at least 50% of the observations. There are some variable pairs with x2y values > 75% but none of them are based on more than 40% of the observations.

### Conclusion

Using an insight from Information Theory, we devised a new metric - the x2y metric - that quantifies the strength of the association between pairs of variables.

The x2y metric has several advantages:

• It works for all types of variable pairs (continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical)
• It captures linear and non-linear relationships
• Perhaps best of all, it is easy to understand and use.

I hope you give it a try in your work.

(If you found this note helpful, you may find these of interest)

### Acknowledgements

Thanks to Amr Farahat for helpful feedback on an earlier draft.

### Appendix: How to use the R script

The R script depends on two R packages - rpart and dplyr - so please ensure that they are installed in your environment.

The script has two key functions: x2y() and dx2y().

#### Using the x2y() function

Usage: x2y(u, v, confidence = FALSE)

Arguments:

• u, v: two vectors of equal length
• confidence: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.

Value: A list with the following elements:

• perc_of_obs: the % of total observations that were used to calculate x2y. If some observations are missing for either $$u$$ or $$v$$, this will be less than 100%.
• x2y: the x2y metric for using $$u$$ to predict $$v$$

Additionally, if x2y() was called with confidence = TRUE:

• CI_95_Lower: the lower end of a 95% confidence interval for the x2y metric estimated by bootstrapping 1000 samples
• CI_95_Upper: the upper end of a 95% confidence interval for the x2y metric estimated by bootstrapping 1000 samples

#### Using the dx2y() function

Usage: dx2y(d, target = NA, confidence = FALSE)

Arguments:

• d: a dataframe
• target: (OPTIONAL) if you are only interested in the x2y values between a particular variable in d and all other variables, set target equal to the name of the variable you are interested in. Default is NA.
• confidence: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.

Value: A dataframe with each row containing the output of running x2y(u, v, confidence) for u and v chosen from the dataframe. Since this is just a standard R dataframe, it can be sliced, sorted, filtered, plotted etc.

Update on April 16, 2021: I learned from a commenter that a similar approach was proposed in April 2020, and that the R package ppsr which implements that approach is now available on CRAN.

Share Comments · · · ·