<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Correlation on R Views</title>
    <link>https://rviews.rstudio.com/tags/correlation/</link>
    <description>Recent content in Correlation on R Views</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 15 Apr 2021 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://rviews.rstudio.com/tags/correlation/" rel="self" type="application/rss+xml" />
    
    
    
    
    <item>
      <title>An Alternative to the Correlation Coefficient That Works For Numeric and Categorical Variables</title>
      <link>https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/</link>
      <pubDate>Thu, 15 Apr 2021 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/</guid>
      <description>
        
&lt;script src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;Dr. Rama Ramakrishnan is Professor of the Practice at MIT Sloan School of Management where he teaches courses in Data Science, Optimization and applied Machine Learning.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When starting to work with a new dataset, it is useful to quickly pinpoint which pairs of variables appear to be &lt;em&gt;strongly related&lt;/em&gt;. It helps you spot data issues, make better modeling decisions, and ultimately arrive at better answers.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Correlation_coefficient&#34;&gt;&lt;em&gt;correlation coefficient&lt;/em&gt;&lt;/a&gt; is widely used for this purpose, but it is well known that it can miss non-linear relationships. Take a look at this scatterplot of two variables &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

set.seed(42)
x &amp;lt;- seq(-1, 1, 0.01)
y &amp;lt;- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)

ggplot(mapping = aes(x, y)) +
  geom_point()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It is obvious to the human eye that &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; have a strong relationship, but the correlation coefficient between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is only -0.01.&lt;/p&gt;
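&lt;p&gt;As a quick check, here is a minimal sketch that recomputes the coefficient for the same simulated data using base R only. The Spearman line is an addition for comparison and is not part of the original analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- seq(-1, 1, 0.01)
y &amp;lt;- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)

# Pearson correlation: close to zero even though y clearly depends on x
cor(x, y)

# Rank (Spearman) correlation should also be near zero, since the
# relationship is not monotonic
cor(x, y, method = &amp;quot;spearman&amp;quot;)&lt;/code&gt;&lt;/pre&gt;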
&lt;p&gt;Further, if either variable of the pair is &lt;em&gt;categorical&lt;/em&gt;, we can’t use the correlation coefficient. We will have to turn to other metrics. If &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; are &lt;strong&gt;both&lt;/strong&gt; categorical, we can try &lt;a href=&#34;https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V&#34;&gt;Cramér’s V&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/Phi_coefficient&#34;&gt;the phi coefficient&lt;/a&gt;. If &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; is continuous and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is binary, we can use the &lt;a href=&#34;https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient&#34;&gt;point-biserial correlation coefficient&lt;/a&gt;.&lt;/p&gt;
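&lt;p&gt;To make these alternatives concrete, here is a rough base-R sketch of two of them. The choice of the built-in &lt;code&gt;mtcars&lt;/code&gt; data and the variable names are assumptions made purely for illustration; they do not appear elsewhere in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Both variables categorical: Cramér&amp;#39;s V from the chi-squared statistic
tab &amp;lt;- table(mtcars$cyl, mtcars$gear)
chi2 &amp;lt;- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
cramers_v &amp;lt;- sqrt(unname(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v

# Continuous x and binary y: the point-biserial coefficient is the
# Pearson correlation with y coded as 0/1
point_biserial &amp;lt;- cor(mtcars$mpg, mtcars$am)
point_biserial&lt;/code&gt;&lt;/pre&gt;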
&lt;p&gt;But using different metrics is problematic. Since they are derived from different assumptions, we can’t &lt;strong&gt;compare the resulting numbers with one another&lt;/strong&gt;. If the correlation coefficient between continuous variables &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is 0.6 and the phi coefficient between categorical variables &lt;span class=&#34;math inline&#34;&gt;\(u\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; is also 0.6, can we safely conclude that the relationships are equally strong? According to &lt;a href=&#34;https://en.wikipedia.org/wiki/Phi_coefficient&#34;&gt;Wikipedia&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The correlation coefficient ranges from −1 to +1, where ±1 indicates perfect agreement or disagreement, and 0 indicates no relationship. The phi coefficient has a maximum value that is determined by the distribution of the two variables if one or both variables can take on more than two values.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A phi coefficient value of 0.6 between &lt;span class=&#34;math inline&#34;&gt;\(u\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; may not mean much if its maximum possible value in this particular situation is much higher. Perhaps we can normalize the phi coefficient to map it to the 0-1 range? But what if that modification introduces biases?&lt;/p&gt;
&lt;p&gt;Wouldn’t it be nice if we had &lt;strong&gt;one&lt;/strong&gt; uniform approach that was easy to understand, worked for continuous &lt;strong&gt;and&lt;/strong&gt; categorical variables alike, and could detect linear &lt;strong&gt;and&lt;/strong&gt; nonlinear relationships?&lt;/p&gt;
&lt;p&gt;(BTW, when &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; are continuous, looking at a scatter plot of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; vs &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; can be very effective since the human brain can detect linear and non-linear patterns very quickly. But even if you are lucky and &lt;em&gt;all&lt;/em&gt; your variables are continuous, looking at scatterplots of &lt;em&gt;all&lt;/em&gt; pairs of variables is hard when you have lots of variables in your dataset. With just 100 predictors (say), you will need to look through 4950 scatterplots, which obviously isn’t practical.)&lt;/p&gt;
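&lt;p&gt;The 4950 figure is simply the number of unordered pairs among 100 variables, which can be confirmed in one line of base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Number of distinct unordered pairs among 100 variables: 100 * 99 / 2
n_pairs &amp;lt;- choose(100, 2)
n_pairs  # 4950&lt;/code&gt;&lt;/pre&gt;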
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;div id=&#34;a-potential-solution&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;A Potential Solution&lt;/h3&gt;
&lt;p&gt;To devise a metric that satisfies the requirements we listed above, let’s &lt;em&gt;invert&lt;/em&gt; the problem: What does it mean to say that &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; &lt;strong&gt;don’t&lt;/strong&gt; have a strong relationship?&lt;/p&gt;
&lt;p&gt;Intuitively, if there’s no relationship between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, we would expect to see no patterns in a scatterplot of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; - no lines, curves, groups etc. It will be a cloud of points that appears to be randomly scattered, perhaps something like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x &amp;lt;- seq(-1,1,0.01)
y &amp;lt;- runif(length(x),min = -1, max = 1)

ggplot(mapping = aes(x, y)) +
  geom_point() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this situation, does knowing the value of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; give us any information on &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;?&lt;/p&gt;
&lt;p&gt;Clearly not. &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; seems to be somewhere between -1 and 1 with no particular pattern, regardless of the value of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;. Knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; does not seem to help &lt;em&gt;reduce our uncertainty&lt;/em&gt; about the value of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;In contrast, look at the first picture again.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here, knowing the value of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; &lt;em&gt;does&lt;/em&gt; help. If we know that &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; is around 0.0, for example, from the graph we will guess that &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is likely near 1.0 (the red dots). We can be confident that &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is &lt;strong&gt;not&lt;/strong&gt; between 0 and 0.8. Knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; helps us eliminate certain values of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, &lt;strong&gt;reducing our uncertainty&lt;/strong&gt; about the values &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; might take.&lt;/p&gt;
&lt;p&gt;This notion - that knowing something reduces our uncertainty about something else - is exactly the idea behind &lt;a href=&#34;https://en.wikipedia.org/wiki/Mutual_information&#34;&gt;mutual information&lt;/a&gt; from &lt;a href=&#34;https://en.wikipedia.org/wiki/Information_theory&#34;&gt;Information Theory&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;According to &lt;a href=&#34;https://en.wikipedia.org/wiki/Mutual_information&#34;&gt;Wikipedia&lt;/a&gt; (emphasis mine),&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Intuitively, mutual information measures the information that &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; share: It measures &lt;strong&gt;how much knowing one of these variables reduces uncertainty about the other&lt;/strong&gt;. For example, if &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; are independent, then knowing &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; does not give any information about &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; and vice versa, so their mutual information is zero.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Furthermore,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Not limited to real-valued random variables and linear dependence like the correlation coefficient&lt;/strong&gt;, MI is more general and determines how different the joint distribution of the pair &lt;span class=&#34;math inline&#34;&gt;\((X,Y)\)&lt;/span&gt; is to the product of the marginal distributions of &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is very promising!&lt;/p&gt;
&lt;p&gt;As it turns out, however, implementing mutual information is not so simple. We first need to estimate the joint probabilities (i.e., the joint probability density/mass function) of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; before we can calculate their mutual information. If &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; are both categorical, this is easy, but if one or both of them are continuous, it is more involved.&lt;/p&gt;
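&lt;p&gt;For the easy (categorical) case, a plug-in estimate can be sketched in a few lines of base R. The function name &lt;code&gt;mutual_info&lt;/code&gt; is hypothetical, and the sketch assumes two equal-length vectors with no missing values; it is meant as an illustration of the idea rather than a robust estimator.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plug-in estimate of mutual information (in bits) for two categorical vectors
mutual_info &amp;lt;- function(u, v) {
  counts &amp;lt;- table(u, v)
  joint &amp;lt;- counts / sum(counts)                    # estimated joint pmf
  indep &amp;lt;- outer(rowSums(joint), colSums(joint))   # product of the marginals
  nz &amp;lt;- joint &amp;gt; 0                                  # skip empty cells
  sum(joint[nz] * log2(joint[nz] / indep[nz]))
}

u &amp;lt;- c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;)
mutual_info(u, c(&amp;quot;x&amp;quot;, &amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;, &amp;quot;y&amp;quot;))  # v is determined by u: 1 bit
mutual_info(u, c(&amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;, &amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;))  # independent in this sample: 0 bits&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For continuous variables there is no contingency table to count, which is why density estimation or binning would be needed first.&lt;/p&gt;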
&lt;p&gt;But we can use the basic insight behind mutual information – that knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; may reduce our uncertainty about &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; – in a different way.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-x2y-metric&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The X2Y Metric&lt;/h3&gt;
&lt;p&gt;Consider three variables &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt;. If knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; reduces our uncertainty about &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; by 70% but knowing &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt; reduces our uncertainty about &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; by only 40%, we will intuitively expect that the association between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; will be stronger than the association between &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;So, if we can &lt;em&gt;quantify&lt;/em&gt; the reduction in uncertainty, that can be used as a measure of the strength of the association. One way to do so is to measure &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;’s ability to &lt;em&gt;predict&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; - after all, &lt;strong&gt;if &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; reduces our uncertainty about &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; should help us predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; better than if we didn’t know &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Stated another way, we can think of reduction in prediction error &lt;span class=&#34;math inline&#34;&gt;\(\approx\)&lt;/span&gt; reduction in uncertainty &lt;span class=&#34;math inline&#34;&gt;\(\approx\)&lt;/span&gt; strength of association.&lt;/p&gt;
&lt;p&gt;This suggests the following approach:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; &lt;em&gt;without using&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;.
&lt;ul&gt;
&lt;li&gt;If &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is continuous, we can simply use the average value of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is categorical, we can use the most frequent value of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;These are sometimes referred to as a &lt;em&gt;baseline&lt;/em&gt; model.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; &lt;em&gt;using&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;.
&lt;ul&gt;
&lt;li&gt;We can take any of the standard predictive models out there (Linear/Logistic Regression, CART, Random Forests, SVMs, Neural Networks, Gradient Boosting etc.), set &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; as the independent variable and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; as the dependent variable, fit the model to the data, and make predictions. More on this below.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Calculate the &lt;strong&gt;% decrease in prediction error&lt;/strong&gt; when we go from (1) to (2).
&lt;ul&gt;
&lt;li&gt;If &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is continuous, we can use any of the familiar error metrics like RMSE, SSE, MAE etc. I prefer mean absolute error (MAE) since it is less susceptible to outliers and is in the same units as &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; but this is a matter of personal preference.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is categorical, we can use Misclassification Error (= 1 - Accuracy) as the error metric.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
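&lt;p&gt;The baseline in step (1) can be sketched as follows, assuming the built-in &lt;code&gt;iris&lt;/code&gt; data as an example. The variable names are illustrative only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Continuous y: predict the mean everywhere and score with MAE
y_num &amp;lt;- iris$Sepal.Length
baseline_mae &amp;lt;- mean(abs(y_num - mean(y_num)))

# Categorical y: predict the most frequent class and score with
# misclassification error (1 - accuracy)
y_cat &amp;lt;- iris$Species
most_frequent &amp;lt;- names(which.max(table(y_cat)))
baseline_misclass &amp;lt;- mean(y_cat != most_frequent)

baseline_mae
baseline_misclass  # 2/3 here, since the three species are equally frequent&lt;/code&gt;&lt;/pre&gt;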
&lt;blockquote&gt;
&lt;p&gt;In summary, the % reduction in error when we go from a baseline model to a predictive model measures the strength of the relationship between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;. We will call this metric &lt;code&gt;x2y&lt;/code&gt; since it measures the ability of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; to predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(This definition is similar to &lt;a href=&#34;https://en.wikipedia.org/wiki/Coefficient_of_determination&#34;&gt;&lt;em&gt;R-Squared&lt;/em&gt;&lt;/a&gt; from Linear Regression. In fact, if &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is continuous and we use the Sum of Squared Errors as our error metric, the &lt;code&gt;x2y&lt;/code&gt; metric is equal to R-Squared.)&lt;/p&gt;
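&lt;p&gt;The connection to R-Squared can be checked numerically. The sketch below assumes a simple linear regression on the built-in &lt;code&gt;mtcars&lt;/code&gt; data (an arbitrary choice for illustration) and compares the % reduction in the Sum of Squared Errors with the R-Squared reported by &lt;code&gt;lm()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;- lm(mpg ~ wt, data = mtcars)

sse_baseline &amp;lt;- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # predict the mean
sse_model &amp;lt;- sum(residuals(fit)^2)                      # predict with the model

reduction &amp;lt;- 1 - sse_model / sse_baseline
all.equal(reduction, summary(fit)$r.squared)  # TRUE&lt;/code&gt;&lt;/pre&gt;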
&lt;p&gt;To implement (2) above, we need to pick a predictive model to use. Let’s remind ourselves of what the requirements are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If there’s a non-linear relationship between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, the model should be able to detect it&lt;/li&gt;
&lt;li&gt;It should be able to handle all possible &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;-&lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; variable types: continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical&lt;/li&gt;
&lt;li&gt;We may have hundreds (if not thousands) of pairs of variables we want to analyze so we want this to be quick&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Decision_tree_learning&#34;&gt;Classification and Regression Trees (CART)&lt;/a&gt; satisfies these requirements very nicely and that’s the one I prefer to use. That said, you can certainly use other models if you like.&lt;/p&gt;
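&lt;p&gt;Putting the three steps together with CART might look roughly like the sketch below. &lt;code&gt;x2y_sketch&lt;/code&gt; is a hypothetical helper written for this illustration and is not the implementation linked later in the post. It assumes &lt;code&gt;rpart&lt;/code&gt; defaults, no missing values, a non-constant &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, and errors measured on the same data used for fitting.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rpart)

# Hypothetical helper: % reduction in error when moving from a baseline
# model to a CART model of y on x (both scored on the training data)
x2y_sketch &amp;lt;- function(x, y) {
  d &amp;lt;- data.frame(x = x, y = y)
  if (is.numeric(y)) {
    fit &amp;lt;- rpart(y ~ x, data = d, method = &amp;quot;anova&amp;quot;)
    err_base &amp;lt;- mean(abs(y - mean(y)))             # MAE of the mean
    err_model &amp;lt;- mean(abs(y - predict(fit, d)))    # MAE of CART
  } else {
    d$y &amp;lt;- factor(d$y)
    fit &amp;lt;- rpart(y ~ x, data = d, method = &amp;quot;class&amp;quot;)
    err_base &amp;lt;- 1 - max(table(d$y)) / nrow(d)      # most frequent class
    err_model &amp;lt;- mean(predict(fit, d, type = &amp;quot;class&amp;quot;) != d$y)
  }
  100 * (1 - err_model / err_base)
}

x2y_sketch(iris$Petal.Width, iris$Species)   # continuous x, categorical y
x2y_sketch(iris$Species, iris$Petal.Length)  # categorical x, continuous y&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same function covers all four combinations of variable types because &lt;code&gt;rpart&lt;/code&gt; accepts both numeric and factor predictors, and only the type of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; decides which error metric is used.&lt;/p&gt;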
&lt;p&gt;Let’s try this approach on the ‘semicircle’ dataset from above. We use CART to predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; using &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and here’s how the fitted values look:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(rpart)

# Let&amp;#39;s generate the data again
set.seed(42)
x &amp;lt;- seq(-1, 1, 0.01)

d &amp;lt;- data.frame(x = x,
                y = sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05))

# Fit a regression tree and get its fitted values on the same data
preds &amp;lt;- predict(rpart(y ~ x, data = d, method = &amp;quot;anova&amp;quot;), type = &amp;quot;vector&amp;quot;)

# Plot the data points with the CART predictions overlaid as a line
ggplot(data = d, mapping = aes(x = x)) +
  geom_point(aes(y = y), size = 0.5) +
  geom_line(aes(y = preds, color = &amp;#39;2&amp;#39;)) +
  scale_color_brewer(name = &amp;quot;&amp;quot;, labels = &amp;#39;CART&amp;#39;, palette = &amp;quot;Set1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Visually, the CART predictions seem to approximate the semi-circular relationship between &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;. To confirm, let’s calculate the &lt;code&gt;x2y&lt;/code&gt; metric step by step.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The MAE from using the average of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; to predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is 0.19.&lt;/li&gt;
&lt;li&gt;The MAE from using the CART predictions to predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is 0.06.&lt;/li&gt;
&lt;li&gt;The % reduction in MAE is 68.88%.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Excellent!&lt;/p&gt;
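&lt;p&gt;For readers who want to reproduce the three numbers above, the calculation can be sketched as follows. It regenerates the semicircle data with the same seed and uses &lt;code&gt;rpart&lt;/code&gt; defaults; exact values may vary slightly with package versions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rpart)

set.seed(42)
x &amp;lt;- seq(-1, 1, 0.01)
d &amp;lt;- data.frame(x = x,
                y = sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05))

preds &amp;lt;- predict(rpart(y ~ x, data = d, method = &amp;quot;anova&amp;quot;), type = &amp;quot;vector&amp;quot;)

mae_baseline &amp;lt;- mean(abs(d$y - mean(d$y)))   # step 1: predict the average
mae_cart &amp;lt;- mean(abs(d$y - preds))           # step 2: predict with CART
pct_reduction &amp;lt;- 100 * (1 - mae_cart / mae_baseline)  # step 3

round(c(baseline = mae_baseline, cart = mae_cart, pct = pct_reduction), 2)&lt;/code&gt;&lt;/pre&gt;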
&lt;p&gt;If you are familiar with CART models, it is straightforward to implement the &lt;code&gt;x2y&lt;/code&gt; metric in the Machine Learning environment of your choice. An R implementation is &lt;a href=&#34;x2y.R&#34;&gt;here&lt;/a&gt; and details can be found in the &lt;a href=&#34;#appendix&#34;&gt;appendix&lt;/a&gt; but, for now, I want to highlight two functions from the R script that we will use in the examples below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x2y(u, v)&lt;/code&gt; calculates the &lt;code&gt;x2y&lt;/code&gt; metric between two vectors &lt;span class=&#34;math inline&#34;&gt;\(u\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dx2y(d)&lt;/code&gt; calculates the &lt;code&gt;x2y&lt;/code&gt; metric between all pairs of variables in a dataframe &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;two-caveats&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Two Caveats&lt;/h3&gt;
&lt;p&gt;Before we demonstrate the &lt;code&gt;x2y&lt;/code&gt; metric on a couple of datasets, I want to highlight two aspects of the &lt;code&gt;x2y&lt;/code&gt; approach.&lt;/p&gt;
&lt;p&gt;Unlike metrics like the correlation coefficient, the &lt;code&gt;x2y&lt;/code&gt; metric is &lt;strong&gt;not&lt;/strong&gt; symmetric with respect to &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;. The extent to which &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; can predict &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; can be different from the extent to which &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; can predict &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;. For the semi-circle dataset, &lt;code&gt;x2y(x,y)&lt;/code&gt; is 68.88% but &lt;code&gt;x2y(y,x)&lt;/code&gt; is only 10.2%.&lt;/p&gt;
&lt;p&gt;This shouldn’t come as a surprise, however. Let’s look at the scatterplot again but with the axes reversed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = d, mapping = aes(x = y)) +
  geom_point(aes(y = x), size = 0.5)  +
  geom_point(data = d[abs(d$x) &amp;lt; 0.05,], aes(x = y, y = x), color = &amp;quot;orange&amp;quot; ) +
  geom_point(data = d[abs(d$y-0.6) &amp;lt; 0.05,], aes(x = y, y = x), color = &amp;quot;red&amp;quot; )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;When &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; is around 0.0, for instance, &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is near 1.0 (the orange dots). But when &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; is around 0.6, &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; can be in the (-1.0, -0.75) range &lt;em&gt;or&lt;/em&gt; in the (0.75, 1.0) range (the red dots), since the semicircle is symmetric about &lt;span class=&#34;math inline&#34;&gt;\(x = 0\)&lt;/span&gt;. Knowing &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; reduces the uncertainty about the value of &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; a lot more than knowing &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; reduces the uncertainty about the value of &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;But there’s an easy solution if you &lt;em&gt;must&lt;/em&gt; have a symmetric metric for your application: just take the average of &lt;code&gt;x2y(x,y)&lt;/code&gt; and &lt;code&gt;x2y(y,x)&lt;/code&gt;.&lt;/p&gt;
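&lt;p&gt;A minimal sketch of the asymmetry and of the symmetric average, for the semicircle data, is shown below. &lt;code&gt;mae_reduction&lt;/code&gt; is a hypothetical helper for two numeric vectors and stands in for the &lt;code&gt;x2y()&lt;/code&gt; function, whose actual implementation may differ.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rpart)

set.seed(42)
x &amp;lt;- seq(-1, 1, 0.01)
y &amp;lt;- sqrt(1 - x^2) + rnorm(length(x), mean = 0, sd = 0.05)

# Hypothetical helper: % reduction in MAE when CART predicts `to` from `from`
mae_reduction &amp;lt;- function(from, to) {
  d &amp;lt;- data.frame(from = from, to = to)
  preds &amp;lt;- predict(rpart(to ~ from, data = d, method = &amp;quot;anova&amp;quot;))
  100 * (1 - mean(abs(to - preds)) / mean(abs(to - mean(to))))
}

forward &amp;lt;- mae_reduction(x, y)    # how well x predicts y
backward &amp;lt;- mae_reduction(y, x)   # how well y predicts x
symmetric &amp;lt;- (forward + backward) / 2

c(forward = forward, backward = backward, symmetric = symmetric)&lt;/code&gt;&lt;/pre&gt;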
&lt;p&gt;The second aspect worth highlighting is the comparability of the &lt;code&gt;x2y&lt;/code&gt; metric across variable pairs. All &lt;code&gt;x2y&lt;/code&gt; values where the &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; variable is continuous will be measuring a % reduction in MAE. All &lt;code&gt;x2y&lt;/code&gt; values where the &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; variable is categorical will be measuring a % reduction in Misclassification Error. Is a 30% reduction in MAE equal to a 30% reduction in Misclassification Error? It is problem-dependent; there’s no universal right answer.&lt;/p&gt;
&lt;p&gt;On the other hand, since (1) &lt;em&gt;all&lt;/em&gt; &lt;code&gt;x2y&lt;/code&gt; values are on the same 0-100% scale, (2) they all conceptually measure the same thing, i.e., reduction in prediction error, and (3) our objective is to quickly scan and identify strongly related pairs (rather than conduct an in-depth investigation), the &lt;code&gt;x2y&lt;/code&gt; approach may be adequate.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;application-to-the-iris-dataset&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Application to the Iris Dataset&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Iris_flower_data_set&#34;&gt;iris flower dataset&lt;/a&gt; is iconic in the statistics/ML communities and is widely used to illustrate basic concepts. The dataset consists of 150 observations in total and each observation has four continuous variables - the length and the width of petals and sepals - and a categorical variable indicating the species of iris.&lt;/p&gt;
&lt;p&gt;Let’s take a look at 10 randomly chosen rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iris %&amp;gt;% sample_n(10) %&amp;gt;% pander&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;20%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;20%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Sepal.Length&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Sepal.Width&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Petal.Length&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Petal.Width&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Species&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;virginica&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;versicolor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6.1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;versicolor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;versicolor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;7.7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;virginica&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;setosa&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6.5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;virginica&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;versicolor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;versicolor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;7.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;virginica&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can calculate the &lt;code&gt;x2y&lt;/code&gt; values for all pairs of variables in &lt;code&gt;iris&lt;/code&gt; by running &lt;code&gt;dx2y(iris)&lt;/code&gt; in R (details of how to use the &lt;code&gt;dx2y()&lt;/code&gt; function are in the &lt;a href=&#34;#appendix&#34;&gt;appendix&lt;/a&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dx2y(iris) %&amp;gt;% pander&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:72%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;20%&#34; /&gt;
&lt;col width=&#34;20%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;x&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;perc_of_obs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;x2y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;80.73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;79.72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;77.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;76.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;66.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;60.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;54.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;48.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;42.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;31.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;28.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;23.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Species&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;22.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sepal.Length&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.18&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The first two columns in the output are self-explanatory. The third column, &lt;code&gt;perc_of_obs&lt;/code&gt;, is the percentage of observations in the dataset that were used to calculate that row’s &lt;code&gt;x2y&lt;/code&gt; value. When a dataset has missing values, only observations that have values present for both &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; are used to calculate the &lt;code&gt;x2y&lt;/code&gt; metric for that variable pair. The &lt;code&gt;iris&lt;/code&gt; dataset has no missing values, so this value is 100% for all rows. The fourth column is the value of the &lt;code&gt;x2y&lt;/code&gt; metric, and the results are sorted in descending order of this value.&lt;/p&gt;
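&lt;p&gt;As a minimal sketch of how such a percentage can be computed (an assumption about the general approach, not the script’s actual code), the hypothetical vectors &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; below each have one missing value, and &lt;code&gt;complete.cases()&lt;/code&gt; flags the positions where both are present.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy vectors, each with one missing value
u &amp;lt;- c(1, 2, NA, 4, 5)
v &amp;lt;- c(&amp;quot;a&amp;quot;, NA, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;a&amp;quot;)

# TRUE only where both u and v are observed
both_present &amp;lt;- complete.cases(u, v)

# Share of observations usable for this pair, as a percentage
perc_of_obs &amp;lt;- round(100 * mean(both_present), 2)
perc_of_obs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here three of the five positions are complete, so only 60% of the observations would be available for this pair.&lt;/p&gt;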
&lt;p&gt;Looking at the numbers, both &lt;code&gt;Petal.Length&lt;/code&gt; and &lt;code&gt;Petal.Width&lt;/code&gt; seem to be highly associated with &lt;code&gt;Species&lt;/code&gt; (and with each other). In contrast, it appears that &lt;code&gt;Sepal.Length&lt;/code&gt; and &lt;code&gt;Sepal.Width&lt;/code&gt; are very weakly associated with each other.&lt;/p&gt;
&lt;p&gt;Note that even though &lt;code&gt;Species&lt;/code&gt; is categorical and the other four variables are continuous, we could simply “drop” the &lt;code&gt;iris&lt;/code&gt; dataframe into the &lt;code&gt;dx2y()&lt;/code&gt; function and calculate the associations between all the variables.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;application-to-a-covid-19-dataset&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Application to a COVID-19 Dataset&lt;/h3&gt;
&lt;p&gt;Next, we examine a &lt;a href=&#34;https://github.com/rama100/x2y/blob/main/covid19.csv&#34;&gt;COVID-19 dataset&lt;/a&gt; that was downloaded from the &lt;a href=&#34;https://github.com/mdcollab/covidclinicaldata/&#34;&gt;COVID-19 Clinical Data Repository&lt;/a&gt; in April 2020. This dataset contains clinical characteristics and COVID-19 test outcomes for 352 patients. Since it has a good mix of continuous and categorical variables, having something like the &lt;code&gt;x2y&lt;/code&gt; metric that can work for any type of variable pair is convenient.&lt;/p&gt;
&lt;p&gt;Let’s read in the data and take a quick look at the columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- read.csv(&amp;quot;covid19.csv&amp;quot;, stringsAsFactors = FALSE)
str(df) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    352 obs. of  45 variables:
##  $ date_published               : chr  &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; ...
##  $ clinic_state                 : chr  &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; ...
##  $ test_name                    : chr  &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; ...
##  $ swab_type                    : chr  &amp;quot;&amp;quot; &amp;quot;Nasopharyngeal&amp;quot; &amp;quot;Nasal&amp;quot; &amp;quot;&amp;quot; ...
##  $ covid_19_test_results        : chr  &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; ...
##  $ age                          : int  30 77 49 42 37 23 71 28 55 51 ...
##  $ high_risk_exposure_occupation: logi  TRUE NA NA FALSE TRUE FALSE ...
##  $ high_risk_interactions       : logi  FALSE NA NA FALSE TRUE TRUE ...
##  $ diabetes                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ chd                          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ htn                          : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
##  $ cancer                       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ asthma                       : logi  TRUE TRUE FALSE TRUE FALSE FALSE ...
##  $ copd                         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ autoimmune_dis               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ temperature                  : num  37.1 36.8 37 36.9 37.3 ...
##  $ pulse                        : int  84 96 79 108 74 110 78 NA 97 66 ...
##  $ sys                          : int  117 128 120 156 126 134 144 NA 160 98 ...
##  $ dia                          : int  69 73 80 89 67 79 85 NA 97 65 ...
##  $ rr                           : int  NA 16 18 14 16 16 15 NA 16 16 ...
##  $ sats                         : int  99 97 100 NA 99 98 96 97 99 100 ...
##  $ rapid_flu                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rapid_flu_results            : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...
##  $ rapid_strep                  : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
##  $ rapid_strep_results          : chr  &amp;quot;&amp;quot; &amp;quot;Negative&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...
##  $ ctab                         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ labored_respiration          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rhonchi                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ wheezes                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ cough                        : logi  FALSE NA TRUE TRUE TRUE TRUE ...
##  $ cough_severity               : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;Mild&amp;quot; ...
##  $ fever                        : logi  NA NA NA FALSE FALSE TRUE ...
##  $ sob                          : logi  FALSE NA FALSE FALSE TRUE TRUE ...
##  $ sob_severity                 : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...
##  $ diarrhea                     : logi  NA NA NA TRUE NA NA ...
##  $ fatigue                      : logi  NA NA NA NA TRUE TRUE ...
##  $ headache                     : logi  NA NA NA NA TRUE TRUE ...
##  $ loss_of_smell                : logi  NA NA NA NA NA NA ...
##  $ loss_of_taste                : logi  NA NA NA NA NA NA ...
##  $ runny_nose                   : logi  NA NA NA NA NA TRUE ...
##  $ muscle_sore                  : logi  NA NA NA TRUE NA TRUE ...
##  $ sore_throat                  : logi  TRUE NA NA NA NA TRUE ...
##  $ cxr_findings                 : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...
##  $ cxr_impression               : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...
##  $ cxr_link                     : chr  &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; &amp;quot;&amp;quot; ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are lots of missing values (denoted by ‘NA’) and lots of blanks as well - for example, see the first few values of the &lt;code&gt;rapid_flu_results&lt;/code&gt; field above. We will convert the blanks to NAs so that all the missing values can be treated consistently. Also, the rightmost three columns are free-text fields so we will remove them from the dataframe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- read.csv(&amp;quot;covid19.csv&amp;quot;,
               stringsAsFactors = FALSE,
               na.strings = c(&amp;quot;&amp;quot;, &amp;quot;NA&amp;quot;)  # read in blanks as NAs
               ) %&amp;gt;%
  select(-starts_with(&amp;quot;cxr&amp;quot;))  # remove the chest x-ray note fields

str(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    352 obs. of  42 variables:
##  $ date_published               : chr  &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; &amp;quot;2020-04-14&amp;quot; ...
##  $ clinic_state                 : chr  &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; &amp;quot;CA&amp;quot; ...
##  $ test_name                    : chr  &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; &amp;quot;Rapid COVID-19 Test&amp;quot; ...
##  $ swab_type                    : chr  NA &amp;quot;Nasopharyngeal&amp;quot; &amp;quot;Nasal&amp;quot; NA ...
##  $ covid_19_test_results        : chr  &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; &amp;quot;Negative&amp;quot; ...
##  $ age                          : int  30 77 49 42 37 23 71 28 55 51 ...
##  $ high_risk_exposure_occupation: logi  TRUE NA NA FALSE TRUE FALSE ...
##  $ high_risk_interactions       : logi  FALSE NA NA FALSE TRUE TRUE ...
##  $ diabetes                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ chd                          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ htn                          : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
##  $ cancer                       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ asthma                       : logi  TRUE TRUE FALSE TRUE FALSE FALSE ...
##  $ copd                         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ autoimmune_dis               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ temperature                  : num  37.1 36.8 37 36.9 37.3 ...
##  $ pulse                        : int  84 96 79 108 74 110 78 NA 97 66 ...
##  $ sys                          : int  117 128 120 156 126 134 144 NA 160 98 ...
##  $ dia                          : int  69 73 80 89 67 79 85 NA 97 65 ...
##  $ rr                           : int  NA 16 18 14 16 16 15 NA 16 16 ...
##  $ sats                         : int  99 97 100 NA 99 98 96 97 99 100 ...
##  $ rapid_flu                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rapid_flu_results            : chr  NA NA NA NA ...
##  $ rapid_strep                  : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
##  $ rapid_strep_results          : chr  NA &amp;quot;Negative&amp;quot; NA NA ...
##  $ ctab                         : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ labored_respiration          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rhonchi                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ wheezes                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ cough                        : logi  FALSE NA TRUE TRUE TRUE TRUE ...
##  $ cough_severity               : chr  NA NA NA &amp;quot;Mild&amp;quot; ...
##  $ fever                        : logi  NA NA NA FALSE FALSE TRUE ...
##  $ sob                          : logi  FALSE NA FALSE FALSE TRUE TRUE ...
##  $ sob_severity                 : chr  NA NA NA NA ...
##  $ diarrhea                     : logi  NA NA NA TRUE NA NA ...
##  $ fatigue                      : logi  NA NA NA NA TRUE TRUE ...
##  $ headache                     : logi  NA NA NA NA TRUE TRUE ...
##  $ loss_of_smell                : logi  NA NA NA NA NA NA ...
##  $ loss_of_taste                : logi  NA NA NA NA NA NA ...
##  $ runny_nose                   : logi  NA NA NA NA NA TRUE ...
##  $ muscle_sore                  : logi  NA NA NA TRUE NA TRUE ...
##  $ sore_throat                  : logi  TRUE NA NA NA NA TRUE ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s run it through the &lt;code&gt;x2y&lt;/code&gt; approach. We are particularly interested in non-zero associations between the &lt;code&gt;covid_19_test_results&lt;/code&gt; field and the other fields so we zero in on those by running &lt;code&gt;dx2y(df, target = &#34;covid_19_test_results&#34;)&lt;/code&gt; in R (details in the &lt;a href=&#34;#appendix&#34;&gt;appendix&lt;/a&gt;) and filtering out the zero associations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dx2y(df, target = &amp;quot;covid_19_test_results&amp;quot;) %&amp;gt;% 
  filter(x2y &amp;gt;0) %&amp;gt;% 
  pander&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:86%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;33%&#34; /&gt;
&lt;col width=&#34;22%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;x&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;perc_of_obs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;x2y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_smell&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;21.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_taste&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;22.73&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sats&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;92.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.24&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only &lt;em&gt;three&lt;/em&gt; of the 41 variables have a non-zero association with &lt;code&gt;covid_19_test_results&lt;/code&gt;. Disappointingly, the highest &lt;code&gt;x2y&lt;/code&gt; value is an unimpressive 18%. It is based on just 22% of the observations (since the other 78% of observations had missing values) and makes one wonder if this modest association is real or if it is just due to chance.&lt;/p&gt;
&lt;p&gt;If we were working with the correlation coefficient, we could easily calculate a &lt;em&gt;confidence interval&lt;/em&gt; for it and gauge if what we are seeing is real or not. Can we do the same thing for the &lt;code&gt;x2y&lt;/code&gt; metric?&lt;/p&gt;
&lt;p&gt;We can, by using &lt;a href=&#34;https://en.wikipedia.org/wiki/Bootstrapping_(statistics)&#34;&gt;bootstrapping&lt;/a&gt;. Given &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt;, we can resample the observations with replacement, say, 1000 times and calculate the &lt;code&gt;x2y&lt;/code&gt; metric each time. With these 1000 numbers, we can easily construct a confidence interval (this is available as an optional &lt;code&gt;confidence&lt;/code&gt; argument in the R functions we have been using; please see the &lt;a href=&#34;#appendix&#34;&gt;appendix&lt;/a&gt;).&lt;/p&gt;
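&lt;p&gt;The resampling idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: &lt;code&gt;bootstrap_ci()&lt;/code&gt; is a hypothetical helper, it builds a simple percentile interval (the script may construct its interval differently), and the stand-in metric &lt;code&gt;r2_pct()&lt;/code&gt; is a squared correlation, not the &lt;code&gt;x2y&lt;/code&gt; metric itself.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: percentile bootstrap interval for any pairwise metric
bootstrap_ci &amp;lt;- function(u, v, metric_fn, n_boot = 1000, level = 0.95) {
  n &amp;lt;- length(u)
  boot_vals &amp;lt;- replicate(n_boot, {
    idx &amp;lt;- sample.int(n, size = n, replace = TRUE)  # resample pairs with replacement
    metric_fn(u[idx], v[idx])                       # recompute the metric on the resample
  })
  alpha &amp;lt;- (1 - level) / 2
  quantile(boot_vals, probs = c(alpha, 1 - alpha), na.rm = TRUE, names = FALSE)
}

# Stand-in metric (NOT x2y): squared Pearson correlation as a percentage
r2_pct &amp;lt;- function(a, b) 100 * cor(a, b)^2

set.seed(42)
u &amp;lt;- rnorm(200)
v &amp;lt;- 2 * u + rnorm(200)

ci &amp;lt;- bootstrap_ci(u, v, r2_pct)
ci  # lower and upper ends of the 95% interval&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Passing the metric in as a function keeps the resampling logic separate from the metric, so the same helper would work for any function of two equal-length vectors.&lt;/p&gt;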
&lt;p&gt;Let’s re-do the earlier calculation with “confidence intervals” turned on by running &lt;code&gt;dx2y(df, target = &#34;covid_19_test_results&#34;, confidence = TRUE)&lt;/code&gt; in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dx2y(df, target = &amp;quot;covid_19_test_results&amp;quot;, confidence = TRUE) %&amp;gt;% 
  filter(x2y &amp;gt;0) %&amp;gt;% 
  pander(split.tables = Inf)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;26%&#34; /&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;x&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;perc_of_obs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;x2y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;CI_95_Lower&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;CI_95_Upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_smell&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;21.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18.18&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-8.08&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;36.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_taste&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;22.73&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.5&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-11.67&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;covid_19_test_results&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sats&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;92.9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.24&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.85&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.48&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;The 95% confidence intervals all contain 0.0&lt;/em&gt;, so we cannot conclude that any of these associations are real.&lt;/p&gt;
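&lt;p&gt;The same check can be done programmatically. The sketch below re-enters the three rows from the table above into a hand-built data frame (the name &lt;code&gt;res&lt;/code&gt; is hypothetical) and keeps only rows whose interval lies entirely above zero.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Rows copied by hand from the table above
res &amp;lt;- data.frame(
  y           = c(&amp;quot;loss_of_smell&amp;quot;, &amp;quot;loss_of_taste&amp;quot;, &amp;quot;sats&amp;quot;),
  x2y         = c(18.18, 12.5, 2.24),
  CI_95_Lower = c(-8.08, -11.67, -1.85),
  CI_95_Upper = c(36.36, 25, 4.48)
)

# Keep only associations whose 95% interval excludes zero
supported &amp;lt;- subset(res, CI_95_Lower &amp;gt; 0)
nrow(supported)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No rows survive the filter, which matches the reading of the table.&lt;/p&gt;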
&lt;p&gt;Let’s see what the top 10 associations are, between &lt;em&gt;any&lt;/em&gt; pair of variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dx2y(df) %&amp;gt;% head(10) %&amp;gt;% pander&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:75%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;22%&#34; /&gt;
&lt;col width=&#34;22%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;x&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;perc_of_obs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;x2y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_smell&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_taste&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20.17&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_taste&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;loss_of_smell&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20.17&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;fatigue&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;headache&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.06&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;90.91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;headache&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;fatigue&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.06&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;90.91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;fatigue&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sore_throat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;27.84&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;89.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;headache&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sore_throat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;89.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;sore_throat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;fatigue&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;27.84&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;88.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;sore_throat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;headache&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;88.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;runny_nose&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;fatigue&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25.57&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;84.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;runny_nose&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;headache&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25.57&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;84.09&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Interesting. &lt;code&gt;loss_of_smell&lt;/code&gt; and &lt;code&gt;loss_of_taste&lt;/code&gt; are &lt;em&gt;perfectly&lt;/em&gt; associated with each other. Let’s look at the raw data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;with(df, table(loss_of_smell, loss_of_taste))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              loss_of_taste
## loss_of_smell FALSE TRUE
##         FALSE    55    0
##         TRUE      0   16&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They agree for &lt;em&gt;every&lt;/em&gt; observation in which both are recorded (71 of the 352 patients) and, as a result, their &lt;code&gt;x2y&lt;/code&gt; is 100%.&lt;/p&gt;
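&lt;p&gt;The counts in the cross-tabulation also account for the &lt;code&gt;perc_of_obs&lt;/code&gt; value reported for this pair. The sketch below re-enters the four cell counts by hand (the object name &lt;code&gt;tab&lt;/code&gt; is hypothetical) and derives both the agreement rate and the share of the 352 patients with both fields recorded.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Cell counts copied from the cross-tabulation above
tab &amp;lt;- matrix(c(55, 0, 0, 16), nrow = 2,
              dimnames = list(loss_of_smell = c(&amp;quot;FALSE&amp;quot;, &amp;quot;TRUE&amp;quot;),
                              loss_of_taste = c(&amp;quot;FALSE&amp;quot;, &amp;quot;TRUE&amp;quot;)))

# Fraction of complete pairs on the diagonal, i.e. where the two fields agree
agreement &amp;lt;- sum(diag(tab)) / sum(tab)

# Share of all 352 patients with both fields recorded
perc_of_obs &amp;lt;- round(100 * sum(tab) / 352, 2)

c(agreement = agreement, perc_of_obs = perc_of_obs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agreement rate is 1 and the share of usable observations is 20.17%, so the perfect score rests on roughly a fifth of the data.&lt;/p&gt;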
&lt;p&gt;Moving down the &lt;code&gt;x2y&lt;/code&gt; ranking, we see a number of variables - &lt;code&gt;fatigue&lt;/code&gt;, &lt;code&gt;headache&lt;/code&gt;, &lt;code&gt;sore_throat&lt;/code&gt;, and &lt;code&gt;runny_nose&lt;/code&gt; - that are &lt;em&gt;all strongly associated with each other&lt;/em&gt;, as if they are all connected by a common cause.&lt;/p&gt;
&lt;p&gt;When the number of variable combinations is high and there are lots of missing values, it can be helpful to scatterplot &lt;code&gt;x2y&lt;/code&gt; vs &lt;code&gt;perc_of_obs&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = dx2y(df), aes(y=x2y, x = perc_of_obs)) +
         geom_point()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Removed 364 rows containing missing values (geom_point).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/index_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, the top-right quadrant is empty: there are no strongly related variable pairs that are based on at least 50% of the observations. There &lt;em&gt;are&lt;/em&gt; some variable pairs with &lt;code&gt;x2y&lt;/code&gt; values &amp;gt; 75%, but none of them are based on more than 40% of the observations.&lt;/p&gt;
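&lt;p&gt;The top-right quadrant can also be checked with a filter instead of by eye. The sketch below uses a hand-built data frame (hypothetical name &lt;code&gt;top_pairs&lt;/code&gt;) holding four rows from the top-10 table above; on the full output of &lt;code&gt;dx2y(df)&lt;/code&gt; the same &lt;code&gt;subset()&lt;/code&gt; call would apply, assuming the column names shown in that table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Four rows copied by hand from the top-10 table above
top_pairs &amp;lt;- data.frame(
  x           = c(&amp;quot;loss_of_smell&amp;quot;, &amp;quot;fatigue&amp;quot;, &amp;quot;fatigue&amp;quot;, &amp;quot;runny_nose&amp;quot;),
  y           = c(&amp;quot;loss_of_taste&amp;quot;, &amp;quot;headache&amp;quot;, &amp;quot;sore_throat&amp;quot;, &amp;quot;fatigue&amp;quot;),
  perc_of_obs = c(20.17, 40.06, 27.84, 25.57),
  x2y         = c(100, 90.91, 89.58, 84.44)
)

# Strong association AND based on at least half of the observations
top_right &amp;lt;- subset(top_pairs, x2y &amp;gt; 75 &amp;amp; perc_of_obs &amp;gt;= 50)
nrow(top_right)&lt;/code&gt;&lt;/pre&gt;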
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Using an insight from Information Theory, we devised a new metric - the &lt;code&gt;x2y&lt;/code&gt; metric - that quantifies the strength of the association between pairs of variables.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;x2y&lt;/code&gt; metric has several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It works for all types of variable pairs (continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical)&lt;/li&gt;
&lt;li&gt;It captures linear and non-linear relationships&lt;/li&gt;
&lt;li&gt;Perhaps best of all, it is easy to understand and use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I hope you give it a try in your work.&lt;/p&gt;
&lt;p&gt;(If you found this note helpful, you may find &lt;a href=&#34;https://rama100.github.io/lecture-notes/&#34;&gt;these lecture notes&lt;/a&gt; of interest.)&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgements&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://mitsloan.mit.edu/faculty/directory/amr-farahat&#34;&gt;Amr Farahat&lt;/a&gt; for helpful feedback on an earlier draft.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Appendix: How to use the R script&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/rama100/x2y/blob/main/x2y.R&#34;&gt;R script&lt;/a&gt; depends on two R packages - &lt;code&gt;rpart&lt;/code&gt; and &lt;code&gt;dplyr&lt;/code&gt; - so please ensure that they are installed in your environment.&lt;/p&gt;
&lt;p&gt;The script has two key functions: &lt;code&gt;x2y()&lt;/code&gt; and &lt;code&gt;dx2y()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;div id=&#34;using-the-x2y-function&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Using the &lt;code&gt;x2y()&lt;/code&gt; function&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Usage&lt;/em&gt;: &lt;code&gt;x2y(u, v, confidence = FALSE)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Arguments&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;u&lt;/code&gt;, &lt;code&gt;v&lt;/code&gt;: two vectors of equal length&lt;/li&gt;
&lt;li&gt;&lt;code&gt;confidence&lt;/code&gt;: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Value&lt;/em&gt;: A list with the following elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;perc_of_obs&lt;/code&gt;: the % of total observations that were used to calculate &lt;code&gt;x2y&lt;/code&gt;. If some observations are missing for either &lt;span class=&#34;math inline&#34;&gt;\(u\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;, this will be less than 100%.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x2y&lt;/code&gt;: the &lt;code&gt;x2y&lt;/code&gt; metric for using &lt;span class=&#34;math inline&#34;&gt;\(u\)&lt;/span&gt; to predict &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, if &lt;code&gt;x2y()&lt;/code&gt; was called with &lt;code&gt;confidence = TRUE&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CI_95_Lower&lt;/code&gt;: the lower end of a 95% confidence interval for the &lt;code&gt;x2y&lt;/code&gt; metric estimated by &lt;a href=&#34;https://en.wikipedia.org/wiki/Bootstrapping_(statistics)&#34;&gt;bootstrapping&lt;/a&gt; 1000 samples&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CI_95_Upper&lt;/code&gt;: the upper end of a 95% confidence interval for the &lt;code&gt;x2y&lt;/code&gt; metric estimated by bootstrapping 1000 samples&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-dx2y-function&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Using the &lt;code&gt;dx2y()&lt;/code&gt; function&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Usage&lt;/em&gt;: &lt;code&gt;dx2y(d, target = NA, confidence = FALSE)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Arguments&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;d&lt;/code&gt;: a dataframe&lt;/li&gt;
&lt;li&gt;&lt;code&gt;target&lt;/code&gt;: (OPTIONAL) if you are only interested in the &lt;code&gt;x2y&lt;/code&gt; values between a &lt;em&gt;particular variable&lt;/em&gt; in &lt;code&gt;d&lt;/code&gt; and all other variables, set &lt;code&gt;target&lt;/code&gt; equal to the name of the variable you are interested in. Default is NA.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;confidence&lt;/code&gt;: (OPTIONAL) a boolean that indicates if a confidence interval is needed. Default is FALSE.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Value&lt;/em&gt;: A dataframe with each row containing the output of running &lt;code&gt;x2y(u, v, confidence)&lt;/code&gt; for &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; chosen from the dataframe. Since this is just a standard R dataframe, it can be sliced, sorted, filtered, plotted etc.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update on April 16, 2021&lt;/strong&gt;: I learned from a commenter that a &lt;a href=&#34;https://paulvanderlaken.com/2020/05/04/predictive-power-score-finding-patterns-dataset/&#34;&gt;similar approach&lt;/a&gt; was proposed in April 2020, and that the R package &lt;a href=&#34;https://cran.r-project.org/package=ppsr&#34;&gt;ppsr&lt;/a&gt; which implements that approach is now available on CRAN.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

      </description>
    </item>
    
  </channel>
</rss>
