Polynomial regression is a technique for modelling the dependence between variables in data collected from an experiment. The dependence is represented by a polynomial $$f(x)=a_0x^{n}+a_1x^{n-1}+\cdots+a_{n-1}x+a_n,$$ where \(x\) is a real number. The integer \(n\geq 0\) is called the order of the polynomial, defined as the highest power among the terms with non-zero coefficients. For example, $$-5,\quad x+1,\quad 2x^3+x^2+1$$ are zeroth (\(n=0\)), first (\(n=1\)) and third (\(n=3\)) order polynomials, respectively. Note that \(0\cdot x^2+x+1\) is a first order polynomial since the first term has a zero coefficient.

## Illustrative Example
To illustrate the idea of polynomial regression, we consider data collected during an experiment to determine the change in thrust efficiency (in percent) as the divergence angle of a rocket nozzle changes (see [1], p. 530).

Let \((x_i,y_i)\) be a given data point for \(i=1,2,\ldots,N\). We want to find a polynomial \(f\) such that \(y_i\approx f(x_i)\) for all \(i\).

## Interpolation
Interpolation requires the polynomial to pass through all given data points, i.e., \(y_i=f(x_i)\) for all \(i\). However, this often requires a high order polynomial because the order grows with the number of data points:
In the figure above, the blue dots represent samples of the black curve. There are 10 points in total, and hence interpolation in general requires a 9th order polynomial, as shown by the red curve. We see that the red curve is oscillatory while the original curve (black) is not.

## Regression
As we have seen, it is sometimes undesirable to force a polynomial to pass exactly through all data points. A better approach is to look for a polynomial that is "good enough". That is, we find a polynomial such that the error $$e=\sqrt{\sum_{i=1}^N\left[y_i-f(x_i)\right]^2}$$ is minimised. For interpolation, the error is zero. When interpolation is undesirable, we accept a non-zero but minimised error \(e\).

To perform polynomial regression:

1. Select a suitable polynomial order \(n\). The order \(n\) should be small, such as 1, 2 or 3.
2. Find the coefficients \(a_0,a_1,\ldots,a_n\) such that the error \(e\) defined above is minimised.
3. Evaluate the goodness of fit of the regression. If needed, choose a different \(n\) and perform Step 2 again.
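
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `polyfit`, `polyval` and `fit_error` are hypothetical helpers written for this sketch (they are not the SIMO or Console functions), the data are made up rather than the nozzle measurements, and the least-squares problem is solved through the normal equations with Gaussian elimination for brevity instead of a QR decomposition.

```python
import math

def polyfit(xs, ys, n):
    """Least-squares coefficients [a_0, ..., a_n], highest power first."""
    m = n + 1
    # Vandermonde matrix: row i is [x_i^n, ..., x_i, 1].
    V = [[x ** (n - j) for j in range(m)] for x in xs]
    # Normal equations (V^T V) a = V^T y, stored as an augmented matrix.
    A = [[sum(row[j] * row[k] for row in V) for k in range(m)]
         + [sum(row[j] * y for row, y in zip(V, ys))]
         for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for k in range(c, m + 1):
                A[r][k] -= f * A[c][k]
    # Back substitution.
    a = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(A[r][k] * a[k] for k in range(r + 1, m))
        a[r] = (A[r][m] - s) / A[r][r]
    return a

def polyval(a, x):
    """Evaluate a_0 x^n + ... + a_n by Horner's rule."""
    result = 0.0
    for c in a:
        result = result * x + c
    return result

def fit_error(a, xs, ys):
    """Error e: square root of the sum of squared residuals."""
    return math.sqrt(sum((y - polyval(a, x)) ** 2 for x, y in zip(xs, ys)))

# Hypothetical data for illustration only (NOT the data from the post).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 4.9, 10.2, 16.8, 26.1, 37.0]

# Steps 1 to 3: try small orders and compare the resulting errors.
for n in (1, 2):
    a = polyfit(xs, ys, n)
    print(n, a, fit_error(a, xs, ys))
```

Because a first order polynomial is a special case of a second order one, the minimised error can never increase when \(n\) is raised; whether the reduction is worth the extra coefficient is the judgement made in Step 3. The normal equations become ill-conditioned for larger \(n\), which is one reason a QR-based solver is usually preferred in practice.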

The coefficients \(a_0,a_1,\ldots,a_n\) in Step 2 can be obtained by solving a least-squares problem using QR decomposition. A brief description can be found in the documentation page of polyfit included in SIMO or Console. The coefficients can be computed in SIMO or Console using the function polyfit. For a first order fit, the output coefficients are 17.680815 and -47.420696. Therefore, the polynomial obtained is $$f(x)=17.680815x-47.420696.$$ The error \(e\) is given by S.normr, which is 6.9987482. The error can be reduced to 4.971942 by using a second order polynomial $$f(x)=1.4670472x^2+1.3837119x-4.4594937.$$ For the meanings of s.R and s.df, please see the documentation page of polyfit included in the apps.

That is it for this post. If you have any questions, feel free to leave a comment. Thanks.

## References
[1] Douglas C. Montgomery, George C. Runger and Norma F. Hubele, Engineering Statistics, 5th edition, Wiley, 2011.

[2] Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th edition, The Johns Hopkins University Press, 2013.
Linear regression is commonly used to model the relationship between two variables, for example, the size of an apartment and its electrical energy consumption. Another example is the current drawn in a magnetic winding and the supply voltage [1]. Here, the independent variable is the supply voltage \(x\), whereas the dependent variable is the current \(y\). Values of \(x\) and \(y\) measured in 10 tests are given in the table and scatter plot below.

As we can see from the scatter plot above, the data \((x,y)\) tend to fall along a line with positive slope. This might suggest that \(x\) and \(y\) are somewhat linearly correlated. However, we have to be careful when making such an assumption. Its validity remains doubtful until we have conducted analyses to establish the adequacy of the linear model. Such analyses can be performed using the function corrcoef in SIMO or Console. This is explained in the following section.

## Correlation Coefficient
The correlation coefficient measures the strength of the linear relationship between two variables. For the variables \(x\) and \(y\), their correlation coefficient is defined as $$\rho_{xy}=\frac{\sigma_{xy}}{\sigma_x\sigma_y},$$ where \(\sigma_{xy}\) is the covariance, and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\), respectively.

- The variables \(x\) and \(y\) are called correlated if \(\rho_{xy}\neq 0\).
- It always holds that \(-1\leq\rho_{xy}\leq 1\).
- Complete positive correlation means \(\rho_{xy}=1\), which suggests that \((x,y)\) fall along a line with positive slope.
- Complete negative correlation means \(\rho_{xy}=-1\), which suggests that \((x,y)\) fall along a line with negative slope.
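
The definition above can be illustrated with a short Python sketch of the sample correlation coefficient. The helper name `sample_corr` and the data are hypothetical and are not the winding measurements or the corrcoef implementation in the apps. The sample covariance and the sample standard deviations share the same \(N-1\) factor, so it cancels in the ratio and is omitted.

```python
import math

def sample_corr(xs, ys):
    """Sample correlation coefficient r_xy of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

print(sample_corr(x, y))                   # close to +1: nearly on a rising line
print(sample_corr(x, [-v for v in y]))     # same magnitude, opposite sign
```

Points lying exactly on a line with positive slope give a value of 1, and points on a line with negative slope give -1, matching the two limiting cases in the list above.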
[Figure: input and output of corrcoef for the data above]
The value in R(1,2) gives the sample correlation coefficient \(r_{xy}\approx 0.9479\). Since \(x\) and \(y\) are real, the matrix R is symmetric. If \(x\) or \(y\) were complex, R(1,2) would be the complex conjugate of R(2,1).
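
Given the sample value \(r_{xy}\approx 0.9479\) from 10 tests, an approximate confidence interval for \(\rho_{xy}\) can be sketched with the Fisher z-transformation. This is a standard textbook approximation; that corrcoef in SIMO or Console uses exactly this method is an assumption, and the helper name `fisher_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate 100(1-alpha)% confidence interval for rho.

    r is the sample correlation coefficient and n the number of pairs.
    """
    z = math.atanh(r)                      # Fisher z-transformation of r
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Transform the interval for z back to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# r and n taken from the example in this post; alpha = 0.05 by default.
lower, upper = fisher_ci(0.9479, 10)
print(lower, upper)   # roughly 0.79 and 0.99
```

The interval is not symmetric about \(r_{xy}\) because the transformation stretches values near \(\pm 1\); a smaller `alpha` widens it.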

## Hypothesis Testing
Now, we perform further analysis to confirm that \(x\) and \(y\) are indeed linearly correlated. We test the null hypothesis \(H_0:\rho_{xy}=0\) against \(H_1:\rho_{xy}\neq 0\). The null hypothesis \(H_0\) states that \(x\) and \(y\) are uncorrelated, whereas the alternative hypothesis \(H_1\) states that they are linearly correlated to some degree. The null hypothesis \(H_0\) can be rejected if the \(p\)-value is less than a given significance level \(\alpha\). A low \(p\)-value means that data at least as extreme as those observed would be unlikely if \(H_0\) were true. The \(p\)-value is given by P(2,1), which is almost zero and therefore well below \(\alpha=0.05\). We therefore reject \(H_0\) and conclude that \(x\) and \(y\) are linearly correlated.

If \(\alpha\) is not specified in the input argument, the default value \(\alpha=0.05\) is used. To specify a custom \(\alpha\), say, 0.1, use corrcoef(x,y,'alpha',0.1).

## Confidence Interval
Finally, we obtain the \(100(1-\alpha)\%\) confidence interval of \(\rho_{xy}\) with the default \(\alpha=0.05\), i.e., the 95% confidence interval. This is given by RL(2,1) and RU(2,1) in the above example. As a result, we conclude that $$0.7895\leq\rho_{xy}\leq 0.9879.$$

😎 To learn more about the topic and the usage of corrcoef, check out the documentation pages in our apps. Thanks. 😎

## References
[1] Douglas Montgomery and George Runger, Applied Statistics and Probability for Engineers, 6th edition, Wiley, 2014.
## Mark Lau
SIMO Developer