Introduction

Understanding and quantifying the relationship between categorical variables is one of the most important tasks in data science. This is useful not just in building predictive models, but also in data science research work. One statistical test that does this is the chi-square test of independence, which is used to determine whether there is an association between two categorical variables. In this guide, you will learn how to perform the chi-square test using R.

Data

In this guide, we will be using fictitious data of loan applicants containing 200 observations and ten variables, as described below:
Let's start by loading the required libraries and the data.
The output shows that the data has five numerical variables (labeled as 'int', 'dbl') and five character variables (labeled as 'chr'). We will convert the character variables into factor variables.
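A minimal sketch of this conversion in base R is shown here. The guide's own data file is not reproduced, so the small data frame and its column names (`marital_status`, `approval_status`, `income`) are hypothetical stand-ins.

```r
# Toy stand-in for the loan data; column names and values are hypothetical.
dat <- data.frame(
  marital_status  = c("Yes", "No", "Divorced", "Yes", "No", "Yes"),
  approval_status = c("Yes", "No", "Yes", "Yes", "No", "No"),
  income          = c(4200, 3100, 5600, 3900, 2800, 4700),
  stringsAsFactors = FALSE
)

# Flag the character columns and convert only those to factors.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], factor)

# The two text columns are now factors; income is still numeric.
str(dat)
```

Converting only the character columns leaves the numerical variables untouched, which matters if they are used later in other analyses.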
Frequency Table

Before diving into the chi-square test, it's important to understand the frequency table or matrix that is used as an input for the chi-square function in R. Frequency tables are an effective way of finding dependence, or the lack of it, between two categorical variables. They also give a first-level view of the relationship between the variables.
The column percentages show that the divorced applicants have a higher probability (at 56.8 percent) of getting loan approvals compared to the married applicants. To test whether this difference is statistically significant, we conduct the chi-square test of independence.

Steps

We'll be using the chi-square test to determine the association between the two categorical variables, marital status and approval status.

Null Hypothesis H0: The two variables are independent of each other.

Alternate Hypothesis H1: The two variables are related to each other.

The first step is to create a two-way table between the variables under study.
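As a sketch of how such a two-way table and its proportion views can be built, the following uses base R's `table()` and `prop.table()`. The counts are invented for illustration and do not reproduce the guide's loan data; the variable names are assumptions as well.

```r
# Hypothetical data: 120 applicants, invented counts (not the guide's data).
marital_status <- factor(rep(c("Divorced", "No", "Yes"), times = c(20, 40, 60)))
approval_status <- factor(c(
  rep(c("No", "Yes"), times = c(8, 12)),   # divorced applicants
  rep(c("No", "Yes"), times = c(24, 16)),  # unmarried applicants
  rep(c("No", "Yes"), times = c(18, 42))   # married applicants
))

# Two-way frequency table: rows are marital status, columns are approval status.
tab <- table(marital_status, approval_status)
tab

prop.table(tab)              # each cell as a share of all applicants
prop.table(tab, margin = 1)  # row proportions: each row sums to 1
prop.table(tab, margin = 2)  # column proportions: each column sums to 1
```

The `margin` argument decides which total each cell is divided by, so it is worth checking that the proportions being compared match the question being asked.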
The next step is to perform the chi-square test on this two-way table.
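A minimal sketch of this step, assuming base R's `chisq.test()` and a small table of invented counts (not the guide's data), might look like the following. It also recomputes the statistic by hand as the sum of (observed - expected)^2 / expected over all cells, to show what the function reports.

```r
# Hypothetical 3 x 2 table of counts (invented, not the guide's data).
tab <- matrix(
  c(8, 24, 18,    # approval_status = "No"
    12, 16, 42),  # approval_status = "Yes"
  nrow = 3,
  dimnames = list(
    marital_status  = c("Divorced", "No", "Yes"),
    approval_status = c("No", "Yes")
  )
)

# Chi-square test of independence on the two-way table.
res <- chisq.test(tab)
res

# Expected counts under independence: row total * column total / grand total.
res$expected

# The statistic recomputed by hand; for a table larger than 2 x 2 no
# continuity correction is applied, so this matches res$statistic.
manual_stat <- sum((tab - res$expected)^2 / res$expected)
manual_stat

# Degrees of freedom: (rows - 1) * (columns - 1).
res$parameter
```

Note that for 2 x 2 tables `chisq.test()` applies Yates' continuity correction by default (`correct = TRUE`), so a hand calculation would only match after setting `correct = FALSE`.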
Interpretation: Since the p-value is less than 0.05, we reject the null hypothesis that the marital status of the applicants is not associated with the approval status.

Another way of using the function is to pass the variables under study directly as arguments.
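Passing the two variables directly can be sketched as follows, again with invented data and assumed variable names. When given two factors, `chisq.test()` builds the contingency table internally, so the result matches the table-based call.

```r
# Hypothetical data: invented counts, not the guide's loan data.
marital_status <- factor(rep(c("Divorced", "No", "Yes"), times = c(20, 40, 60)))
approval_status <- factor(c(
  rep(c("No", "Yes"), times = c(8, 12)),   # divorced applicants
  rep(c("No", "Yes"), times = c(24, 16)),  # unmarried applicants
  rep(c("No", "Yes"), times = c(18, 42))   # married applicants
))

# Variant 1: pass the two factors directly.
res_vec <- chisq.test(marital_status, approval_status)

# Variant 2: build the two-way table first, then test it.
res_tab <- chisq.test(table(marital_status, approval_status))

# Both calls report the same statistic and p-value.
all.equal(unname(res_vec$statistic), unname(res_tab$statistic))
all.equal(res_vec$p.value, res_tab$p.value)
```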
This produces the same test results, as expected. Similarly, we can test the relationship between other categorical features.

Conclusion

In this guide, you have learned about the techniques of finding relationships in data for categorical variables. You also learned about the simple but effective chi-square test of independence.