How to calculate linear correlation coefficient. Multiple correlation coefficient in Excel (Excel)

Today's article will talk about how variables can be related to each other. Using correlation, we can determine whether there is a relationship between the first and second variable. I hope you find this activity just as fun as the previous ones!

Correlation measures the strength and direction of the relationship between x and y. The figure shows different types of correlation in the form of scatter plots of ordered pairs (x, y). Traditionally, the x variable is placed on the horizontal axis and the y variable is placed on the vertical axis.

Graph A is an example of a positive linear correlation: as x increases, y also increases, and linearly. Graph B shows us an example of a negative linear correlation, where as x increases, y decreases linearly. In graph C we see that there is no correlation between x and y: the variables show no relationship to each other.

Finally, Graph D is an example of non-linear relationships between variables. As x increases, y first decreases, then changes direction and increases.

The remainder of the article focuses on linear relationships between the dependent and independent variables.

Correlation coefficient

The correlation coefficient, r, provides us with both the strength and direction of the relationship between the independent and dependent variables. Values of r range between -1.0 and +1.0. When r is positive, the relationship between x and y is positive (graph A in the figure), and when r is negative, the relationship is negative (graph B). A correlation coefficient close to zero indicates that there is no linear relationship between x and y (graph C).

The strength of the relationship between x and y is determined by how close the correlation coefficient is to -1.0 or +1.0. Study the following figure.

Graph A shows a perfect positive correlation between x and y at r = +1.0. Graph B shows a perfect negative correlation between x and y at r = -1.0. Graphs C and D are examples of weaker relationships between the dependent and independent variables.

The correlation coefficient, r, determines both the strength and direction of the relationship between the dependent and independent variables. The r values range from -1.0 (strong negative relationship) to +1.0 (strong positive relationship). When r = 0 there is no linear relationship between the variables x and y.

We can calculate the actual correlation coefficient using the following equation:

Well well! I know this equation looks like a scary jumble of strange symbols, but before we panic, let's apply the example of an exam grade to it. Let's say I want to determine whether there is a relationship between the number of hours a student devotes to studying statistics and a final exam score. The table below will help us break down this equation into several simple calculations and make them more manageable.

As you can see, there is a very strong positive correlation between the number of hours devoted to studying a subject and the exam grade. Teachers will be very happy to know about this.
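
As a minimal sketch of this kind of table-based calculation, assuming the usual computational form of the Pearson coefficient, r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²)), the sums can be built up step by step in Python. The hours and grades below are hypothetical values chosen for illustration; they are not the data from the original table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation via the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # column "xy" of the table
    sum_x2 = sum(x * x for x in xs)               # column "x squared"
    sum_y2 = sum(y * y for y in ys)               # column "y squared"
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: hours of study and exam grade for six students.
hours = [1, 2, 3, 4, 5, 6]
grades = [60, 65, 72, 78, 85, 93]

r = pearson_r(hours, grades)
print(f"r = {r:.3f}")
```

Each intermediate sum corresponds to one column total of a hand-calculation table, which is why this form is convenient for manual work.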

What is the benefit of establishing relationships between similar variables? Great question. If a relationship is found to exist, we can predict exam results based on a certain number of hours spent studying the subject. Simply put, the stronger the connection, the more accurate our prediction will be.

Using Excel to Calculate Correlation Coefficients

I'm sure that after looking at these terrible correlation coefficient calculations, you will be truly delighted to know that Excel can do all this work for you using the CORREL function with the following characteristics:

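CORREL(array1; array2),

where array1 is the data range for the first variable and array2 is the data range for the second variable. The argument separator (a semicolon or a comma) depends on Excel's regional settings.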

For example, the figure shows the CORREL function used to calculate the correlation coefficient for the exam grade example.

A quantitative characteristic of the relationship can be obtained by calculating the correlation coefficient.

Correlation Analysis in Excel

The function itself has the general form CORREL(array1, array2). In the “Array1” field, enter the range of cells for one of the variables whose dependence is to be determined, and in the “Array2” field, the range for the other. The correlation coefficient then appears as a number in the cell selected beforehand.

The second method uses the Correlation tool from the data analysis package. A window with the correlation analysis parameters opens. Unlike the previous method, the “Input Range” field takes not each column separately but the range of all columns that participate in the analysis. Excel thus offers two methods of correlation analysis.
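
When all columns are analyzed together, the result is a table of pairwise coefficients. A minimal sketch of that idea follows, assuming the deviation form of the Pearson coefficient; the column names and values are hypothetical and only illustrate the layout.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict with r for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns, invented for illustration.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 4, 5],
    "x3": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The diagonal is always 1, and the matrix is symmetric, so only the values below the diagonal carry new information.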

How to calculate correlation coefficient in Excel

If the coefficient is 0, there is no linear relationship between the values. To find the relationship between the variables x and y, use the built-in Microsoft Excel CORREL function. For example, select the y values for "Array1" and the x values for "Array2". The result is the correlation coefficient calculated by the program.

To calculate the coefficient manually, first find the averages of x and y, then calculate the difference between each x and the average of x, and between each y and the average of y. In the selected cells, write the formulas for $x - \bar{x}$ and $y - \bar{y}$. Don't forget to pin the cells with the averages (use absolute references). The sum of the products of these differences, divided by the square root of the product of the sums of their squares, is the desired correlation coefficient.

The above formula for calculating the Pearson coefficient shows how labor-intensive this process is if done manually.

DIY: Calculating Currency Correlations Using Excel

For this example we use Microsoft Excel, but any other program that offers a correlation formula will do.

7. After this, select the cells with the EUR/USD data.

9. Press Enter to calculate the correlation coefficient for EUR/USD and USD/JPY.

It's not worth updating the numbers every day (well, unless you're obsessed with currency correlations).

1. Open Excel.

2. Create the data columns. In our example, we will consider the relationship, or correlation, between aggressiveness and self-doubt in first-graders. 30 children participated in the experiment; the data are presented in the Excel table:

Column 1 - subject number

Column 2 - aggressiveness in points

Column 3 - diffidence (self-doubt) in points

3. Select an empty cell next to the table and click the fx icon in the Excel formula bar.

4. The function menu opens. Select the Statistical category, then find CORREL in the alphabetical list of functions and click OK.

5. The function arguments menu opens, where you select the data columns you need. To select the first column, Aggressiveness, click the blue button next to the Array1 line.

6. Select the data for Array1 from the Aggressiveness column and click the blue button in the dialog box.

7. Then, as with Array1, click the blue button next to the Array2 line.

8. Select the data for Array2, the Diffidence column, press the blue button again, and then click OK.

9. The Pearson correlation coefficient r is now calculated and written in the selected cell. In our case, it is positive and approximately equal to 0.225. This indicates a weak positive relationship between aggressiveness and self-doubt in first-graders.

Thus, the statistical conclusion of the experiment is: r = 0.225; a weak positive relationship between the variables Aggressiveness and Diffidence was revealed.

Some studies require the p-level of significance of the correlation coefficient to be reported, but Excel, unlike SPSS, does not provide it directly. That's okay, there is a way to obtain it (A.D. Nasledov).

You can also attach it to the research results.
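
One common way to judge significance, sketched here as an assumption rather than as the method the cited material uses, is to convert r into Student's t statistic, t = r·sqrt(n - 2) / sqrt(1 - r²), and compare it with the tabulated critical value for n - 2 degrees of freedom. The critical value 2.048 (two-sided, 5% level, 28 degrees of freedom) is taken from standard tables.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for H0: the true correlation is zero (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r, n = 0.225, 30                    # values from the example above
t_stat = correlation_t_statistic(r, n)

# Two-sided 5% critical value of Student's t for df = 28, from standard tables.
T_CRITICAL_DF28 = 2.048
is_significant = abs(t_stat) > T_CRITICAL_DF28

print(f"t = {t_stat:.2f}, significant at the 5% level: {is_significant}")
```

With 30 subjects, a coefficient of 0.225 gives a t statistic below the critical value, so under these assumptions the relationship would not be considered significant at the 5% level.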

Have you already encountered the need to calculate the degree of connection between two statistical quantities and determine the formula by which they correlate? A normal person might ask why this would be necessary at all. Oddly enough, this is actually necessary. Knowing reliable correlations can help you make crazy money if you're, say, a stock trader. The problem is that for some reason no one reveals these correlations (surprising, isn’t it?).

Let's count them ourselves! For example, I decided to try to calculate the correlation of the ruble to the dollar through the euro. Let's look at how this is done in detail.

This article is designed for advanced levels of Microsoft Excel proficiency. If you don't have time to read the entire article, you can download the file and figure it out yourself.

If you often find yourself needing to do something like this, I highly recommend that you consider purchasing the book Statistical Calculations in Excel.

What is important to know about correlations

To calculate a reliable correlation, you need to have a reliable sample; the larger it is, the more reliable the result will be. For the purposes of this example, I took a daily sample of exchange rates over 10 years. The data is freely available, I took it from the site http://oanda.com.

What did I actually do

(1) Once I had the raw data, I started by checking the degree of correlation between the two data sets. To do this, I used the CORREL function, which returns the degree of correlation between two data ranges. The result, frankly speaking, was not particularly impressive (only about 70%). Generally speaking, the share of variation in one quantity that is explained by the other is given by the square of the correlation coefficient, so the relationship turned out to be reliable to only about 49% (0.7² = 0.49). This is very little!

(2) This seemed very strange to me. What errors could have crept into my calculations? So I decided to plot a graph and see what was going on. The graph is deliberately broken down by year, so that you can see where the correlation breaks down. It turned out like this:

(3) It is obvious from the graph that in the range of about 35 rubles per euro the correlation begins to break into two parts. Because of this, it turned out to be unreliable. It was necessary to determine why this was happening.

(4) The color shows that these data refer to 2007, 2008, 2009. Certainly! Periods of economic peaks and recessions are usually statistically unreliable, which is what happened in this case. Therefore, I tried to exclude these periods from the data (and to check, I checked the degree of correlation of the data in this period). The degree of correlation of these data alone is 0.01%, that is, it is completely absent. But without them, the data correlates by approximately 81%. This is already a fairly reliable correlation. Here is a graph with the function.
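
The comparison described in step (4), correlation over the full sample versus correlation with selected years excluded, can be sketched as follows. The helper name correlation_excluding and every number in the records list are hypothetical, invented only to show the mechanics; they are not exchange-rate data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_excluding(records, excluded_years=()):
    """Correlation of the two series after dropping rows from the given years."""
    kept = [(a, b) for year, a, b in records if year not in excluded_years]
    series_a = [a for a, _ in kept]
    series_b = [b for _, b in kept]
    return pearson_r(series_a, series_b)

# Hypothetical (year, series A, series B) rows; 2008 deliberately breaks the pattern.
records = [
    (2005, 1.0, 2.0), (2005, 2.0, 4.1),
    (2006, 3.0, 6.0), (2006, 4.0, 8.2),
    (2008, 5.0, 3.0), (2008, 6.0, 2.5),
    (2010, 7.0, 14.1), (2010, 8.0, 15.9),
]

r_all = correlation_excluding(records)
r_without_2008 = correlation_excluding(records, excluded_years={2008})
print(f"all years: {r_all:.2f}, without 2008: {r_without_2008:.2f}")
```

Excluding periods should be justified by something outside the data, as the crisis years are here; dropping points only because they lower the coefficient would make the result meaningless.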

Next steps

Theoretically, the correlation function can be refined by converting it from linear to exponential or logarithmic. In this case, the statistical reliability of the correlation increases by approximately one percent, but the complexity of applying the formula increases incredibly. Therefore, I ask myself the question: is this really necessary? It's up to you to decide - for each specific case.
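
A simple way to check whether a non-linear form is worth the trouble is to correlate x with a transformed y and compare the result with the plain linear coefficient. The sketch below assumes an exponential relationship and uses hypothetical, exactly exponential data, so the effect is far stronger than the one percent reported above.

```python
from math import exp, log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data following y = exp(0.5 * x) exactly.
xs = [1, 2, 3, 4, 5, 6]
ys = [exp(0.5 * x) for x in xs]

r_linear = pearson_r(xs, ys)                       # straight-line fit to raw y
r_log = pearson_r(xs, [log(y) for y in ys])        # straight-line fit to ln(y)
print(f"linear: {r_linear:.3f}, after log transform: {r_log:.3f}")
```

If the two coefficients are close, as in the currency example, the linear form is usually the more practical choice.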

For the territories of the region, data for 200X is provided.

Region number | Average per capita living wage per day of one able-bodied person, rub. (x) | Average daily wage, rub. (y)
1  | 78  | 133
2  | 82  | 148
3  | 87  | 134
4  | 79  | 154
5  | 89  | 162
6  | 106 | 195
7  | 67  | 139
8  | 88  | 158
9  | 73  | 152
10 | 87  | 162
11 | 76  | 159
12 | 115 | 173

Exercise:

1. Construct a correlation field and formulate a hypothesis about the form of the connection.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) elasticity coefficient, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the forecast confidence interval for the significance level.

Solution:

Let's solve this problem using Excel.

1. Comparing the available x and y data, for example by ranking them in increasing order of factor x, reveals a direct relationship between the characteristics: as the average per capita subsistence level increases, the average daily wage also increases. Based on this, we can assume that the relationship between the characteristics is direct and can be described by the equation of a straight line. Graphical analysis confirms the same conclusion.

To build a correlation field, you can use Excel. Enter the initial data in sequence: first x, then y.

Select the range of cells that contains the data.

Then choose: Insert / Scatter Plot / Scatter with Markers as shown in Figure 1.

Figure 1 Construction of the correlation field

Analysis of the correlation field shows a dependence that is close to linear, since the points lie almost along a straight line.

2. To calculate the parameters of the linear regression equation, let's use the built-in statistical function LINEST.

For this:

1) Open an existing file containing the analyzed data;
2) Select a 5x2 area of empty cells (5 rows, 2 columns) to display the results of the regression statistics;
3) Activate the Function Wizard: in the main menu select Formulas / Insert Function;
4) In the Category box select Statistical, and in the function list select LINEST. Click OK as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known_y's - the range containing the data of the resultant attribute;

Known_x's - the range containing the data of the factor attribute;

Const - a logical value that indicates the presence or absence of a free term in the equation; if Const = 1, the free term is calculated in the usual way, and if Const = 0, the free term is 0;

Stats - a logical value that indicates whether to display additional regression statistics. If Stats = 1, the additional information is displayed; if Stats = 0, only the estimates of the equation parameters are displayed.

Click OK;

Figure 3 LINEST Function Arguments Dialog Box

6) The first element of the final table will appear in the upper left cell of the selected area. To display the entire table, press F2 and then the key combination Ctrl+Shift+Enter.

Additional regression statistics are output in the following layout:

Value of coefficient b | Value of coefficient a
Standard error of b | Standard error of a
Coefficient of determination R² | Standard error of y
F-statistic | Number of degrees of freedom
Regression sum of squares | Residual sum of squares

Figure 4 Result of calculating the LINEST function

We obtained the regression equation: $\hat{y} = 76.98 + 0.92x$

We conclude: with an increase in the average per capita subsistence level of 1 rub., the average daily wage increases by an average of 0.92 rub.

The coefficient of determination is $R^2 = 0.52$. This means that 52% of the variation in wages (y) is explained by the variation of factor x, the average per capita living wage, and 48% by other factors not included in the model.

Using the calculated coefficient of determination, the correlation coefficient can be calculated: $r = \sqrt{R^2} = \sqrt{0.52} \approx 0.72$.

The connection is assessed as close.
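
The same estimates can be cross-checked outside Excel. The sketch below applies ordinary least squares to the 12 regions from the table; it assumes the standard formulas b = Sxy / Sxx and a = ȳ - b·x̄ and is meant as a check, not as a description of how LINEST works internally.

```python
from math import sqrt

# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)                          # spread of x
syy = sum((yi - mean_y) ** 2 for yi in y)                          # spread of y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))   # joint spread

b = sxy / sxx                    # slope (regression coefficient)
a = mean_y - b * mean_x          # intercept (free term)
r_squared = sxy ** 2 / (sxx * syy)
r = sqrt(r_squared) if b >= 0 else -sqrt(r_squared)

print(f"y = {a:.2f} + {b:.2f}x, R^2 = {r_squared:.2f}, r = {r:.2f}")
```

On this data the slope comes out close to 0.92 and R² close to 0.52, in line with the values quoted in the text.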

4. Using the average (general) elasticity coefficient, we determine the strength of the factor’s influence on the result.

For a straight line equation, we determine the average (total) elasticity coefficient using the formula: $\bar{E} = b \cdot \bar{x} / \bar{y}$

We find the average values by selecting the range of cells with the x values and choosing Formulas / AutoSum / Average, and we do the same with the y values.

Figure 5 Calculation of the average values of the function and the argument

Thus, if the average per capita cost of living changes by 1% from its average value, the average daily wage will change by an average of 0.51%.
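
A short sketch of the elasticity calculation on the same 12 regions, assuming the linear-model formula E = b·x̄ / ȳ with b estimated by least squares:

```python
# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope of y on x.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))

# Average elasticity: percent change in y per 1% change in x at the means.
elasticity = b * mean_x / mean_y
print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, E = {elasticity:.2f}")
```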

The Regression data analysis tool provides:
- results of regression statistics,
- results of analysis of variance,
- confidence intervals,
- residual plots and regression line fit plots,
- residuals and normal probability output.

The procedure is as follows:

1) Check access to the Analysis ToolPak. In the main menu, select File / Options / Add-ins.

2) In the Manage drop-down list, select Excel Add-ins and click Go.

3) In the Add-ins window, check the Analysis ToolPak box and then click OK.

If Analysis ToolPak is not in the Add-ins available list, click Browse to search for it.

If you receive a message indicating that the Analysis ToolPak is not installed on your computer, click Yes to install it.

4) In the main menu, select Data / Data Analysis / Analysis Tools / Regression and then click OK.

5) Fill out the data input and output parameters dialog box:

Input Y Range - the range containing the data of the resultant attribute;

Input X Range - the range containing the data of the factor attribute;

Labels - a flag that indicates whether the first row contains column names;

Constant is Zero - a flag indicating that the equation has no free term;

Output Range - it is enough to specify the upper left cell of the future range;

New Worksheet Ply - you can specify an arbitrary name for the new sheet.

6) Then click OK.

Figure 6 Dialog box for entering parameters for the Regression tool

The results of the regression analysis for the problem data are presented in Figure 7.

Figure 7 Result of using the regression tool

5. Let us evaluate the quality of the equations using the average approximation error. Let's use the results of the regression analysis presented in Figure 8.

Figure 8 Result of using the Regression tool, “Residual Output”

Let's create a new table as shown in Figure 9. In column C, we calculate the relative approximation error using the formula: $A_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100\%$

Figure 9 Calculation of average approximation error

The average approximation error is calculated using the formula: $\bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i$

The quality of the constructed model is assessed as good, since $\bar{A}$ does not exceed 8-10%.
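
The approximation error can be checked with a few lines of Python. This sketch assumes the usual definition, the mean of |y - ŷ| / y expressed in percent, and refits the line by least squares on the 12 regions.

```python
# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Relative error of each fitted value, in percent (column C of the worksheet).
relative_errors = [abs((yi - (a + b * xi)) / yi) * 100 for xi, yi in zip(x, y)]
mean_approximation_error = sum(relative_errors) / n

print(f"average approximation error = {mean_approximation_error:.2f}%")
```

The result stays below the 8-10% threshold mentioned above.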

6. From the table with regression statistics (Figure 4) we take the actual value of Fisher’s F-test, $F_{fact}$.

Because $F_{fact}$ exceeds the tabulated value $F_{table}$ at the 5% significance level, we can conclude that the regression equation is significant (the relationship is confirmed).
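
For a regression with one factor, the F statistic can be recovered from the coefficient of determination as F = R² / (1 - R²) · (n - 2). The sketch below assumes that formula and takes the critical value 4.96 (5% level, 1 and 10 degrees of freedom) from standard tables.

```python
# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r_squared = sxy ** 2 / (sxx * syy)
f_actual = r_squared / (1 - r_squared) * (n - 2)

# Tabulated 5% critical value for (1, n - 2) = (1, 10) degrees of freedom.
F_TABLE = 4.96
equation_is_significant = f_actual > F_TABLE

print(f"F = {f_actual:.2f}, significant at the 5% level: {equation_is_significant}")
```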

8. We will assess the statistical significance of the regression parameters using Student’s t-statistics and by calculating the confidence interval of each indicator.

We put forward the hypothesis H0 that the indicators do not differ from zero in a statistically significant way: $a = b = r_{xy} = 0$.

The tabulated value $t_{table}$ is taken for the number of degrees of freedom $df = n - 2 = 12 - 2 = 10$.

Figure 7 gives the actual values of the t-statistics.

The t-test for the correlation coefficient can be calculated in two ways:

Method I: $t_r = r / m_r$, where $m_r = \sqrt{(1 - r^2) / (n - 2)}$ is the random error of the correlation coefficient.

We will take the data for calculation from the table in Figure 7.

Method II:

The actual t-statistic values exceed the tabulated value.

Therefore, the hypothesis H0 is rejected, that is, the regression parameters and the correlation coefficient do not differ from zero by chance, but are statistically significant.
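
The t-statistics can be reproduced from the data. This sketch assumes the standard error formulas for simple linear regression, m_b = s / sqrt(Sxx) and m_a = s · sqrt(Σx² / (n · Sxx)), where s is the residual standard error, and the tabulated value 2.228 (two-sided, 5% level, 10 degrees of freedom).

```python
from math import sqrt

# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / sqrt(sxx * syy)

s = sqrt((syy - b * sxy) / (n - 2))                   # residual standard error
m_b = s / sqrt(sxx)                                   # standard error of b
m_a = s * sqrt(sum(xi ** 2 for xi in x) / (n * sxx))  # standard error of a

t_b = b / m_b
t_a = a / m_a
t_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

T_TABLE = 2.228  # two-sided 5% critical value for df = 10, from standard tables
print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}, t_r = {t_r:.2f}, t_table = {T_TABLE}")
```

In simple regression t_r and t_b coincide, and both equal the square root of the F statistic, which gives a quick consistency check on the worksheet.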

The confidence interval for parameter a is defined as $a \pm t_{table} \cdot m_a$, where $m_a$ is the standard error of a.

The 95% limits for parameter a are shown in Figure 7.

The confidence interval for the regression coefficient is defined as $b \pm t_{table} \cdot m_b$, where $m_b$ is the standard error of b.

The 95% limits for the regression coefficient b are shown in Figure 7.

Analysis of the upper and lower limits of the confidence intervals leads to the conclusion that, with probability $p = 1 - \alpha = 0.95$, the parameters a and b, lying within the specified limits, do not take zero values, i.e. they are statistically significant and differ significantly from zero.
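
A sketch of the interval calculation, assuming the form estimate ± t_table · standard error with t_table = 2.228 (5% level, 10 degrees of freedom, from standard tables) and the same standard error formulas as in ordinary simple regression:

```python
from math import sqrt

# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt((syy - b * sxy) / (n - 2))                   # residual standard error
m_b = s / sqrt(sxx)
m_a = s * sqrt(sum(xi ** 2 for xi in x) / (n * sxx))

T_TABLE = 2.228  # two-sided 5% critical value for df = 10
a_low, a_high = a - T_TABLE * m_a, a + T_TABLE * m_a
b_low, b_high = b - T_TABLE * m_b, b + T_TABLE * m_b

print(f"a: [{a_low:.2f}, {a_high:.2f}]")
print(f"b: [{b_low:.3f}, {b_high:.3f}]")
```

Neither interval contains zero, which matches the conclusion drawn from the t-statistics.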

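7. The obtained estimates of the regression equation allow it to be used for forecasting. If the predicted cost of living is $x_p = 1.1 \cdot \bar{x}$,

then the predicted value of the average daily wage will be $\hat{y}_p = a + b \cdot x_p$.

We calculate the forecast error using the formula: $m_{\hat{y}_p} = \sigma_{res} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$,

where $\sigma_{res}^2$ is the residual variance per degree of freedom and $\sum (x_i - \bar{x})^2$ is the sum of squared deviations of the factor x, obtained from its variance.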

We will also calculate the variance using Excel. For this:

1) Activate Function Wizard: in the main menu select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor characteristic. Click OK.

Figure 10 Calculation of variance

We got the variance value

To calculate the residual variance per degree of freedom, we will use the results of analysis of variance as shown in Figure 7.

Confidence intervals for predicting individual values of y with a probability of 0.95 are determined by the expression: $\hat{y}_p \pm t_{table} \cdot m_{\hat{y}_p}$

The interval is quite wide, primarily because of the small number of observations. In general, the forecast for the average daily wage turned out to be reliable.
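
The forecast step can be sketched in the same way. The code assumes the factor is raised 10% above its mean, as the exercise states, and uses the standard prediction-interval formula for an individual value in simple regression with t_table = 2.228; it is an illustration under those assumptions, not a reproduction of the worksheet.

```python
from math import sqrt

# Data from the table of 12 regions.
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt((syy - b * sxy) / (n - 2))        # residual standard error

x_forecast = 1.1 * mean_x                  # factor 10% above its average level
y_forecast = a + b * x_forecast            # point forecast of the daily wage

# Standard error of the forecast for an individual value of y.
forecast_error = s * sqrt(1 + 1 / n + (x_forecast - mean_x) ** 2 / sxx)

T_TABLE = 2.228  # two-sided 5% critical value for df = 10
low = y_forecast - T_TABLE * forecast_error
high = y_forecast + T_TABLE * forecast_error

print(f"x_p = {x_forecast:.2f}, y_p = {y_forecast:.2f}, interval = [{low:.2f}, {high:.2f}]")
```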

The problem statement is taken from: Workshop on Econometrics: Textbook / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. I.I. Eliseeva. - M.: Finance and Statistics, 2003. - 192 p.: ill.