1,2 Agricultural Engineering College and Research Institute, Tamil Nadu Agricultural University, Coimbatore, India

3 Agricultural College and Research Institute, Tamil Nadu Agricultural University, Madurai, India

4 Agricultural College and Research Institute, Tamil Nadu Agricultural University, Coimbatore, India

Corresponding Author Email: vasanthi@tnau.ac.in

DOI : https://doi.org/10.58321/AATCCReview.2022.10.04.01

Abstract

Fitting a regression model raises problems of non-linearity, multicollinearity, serial correlation and heteroscedasticity, whose treatment involves long and complex calculations and analysis. This study focuses on improving the model fit as measured by the R2 value. An attempt is made to identify the outliers in a data set and to increase the R2 value by removing them. A hypothetical data set is considered, in which consumption is the dependent variable and Income and Food size are the independent variables. The regression model fitted to the actual data gives an R2 value of 0.455. After removal of the outliers identified using Cook's distance, the revised R2 value is 0.579. This indicates that outliers in the data set play a vital role in the model fit. It is therefore necessary to remove outliers, if any, from the data before proceeding to further analysis.

Introduction

The main aim of regression modeling and analysis is to develop a good predictive relationship between the dependent (response) and independent (predictor) variables. Regression diagnostics play a vital role in finding and validating such a relationship. Once a regression model has been constructed, it is important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analysis of the pattern of the residuals, and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters. Interpretation of these diagnostic tests rests heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of the t-tests and the F-test are sometimes difficult to interpret if the model assumptions are violated. The purpose of regression diagnostics is to identify influential data. Diagnostics are quantities computed from the data with the purpose of pinpointing influential points, after which these outliers can be removed or corrected.

Review of Literature

A study of Bauchi Local Government Area determined the costs and returns of rice production among farmers. Primary data were collected with structured questionnaires administered to fifty (50) purposively selected rice farmers. The input variables seeds, herbicides, and farm size were significant, and with these variables the R2 value was 0.931 [1].

Resource use efficiency in rice production was examined in Kwande Local Government Area of Benue State, Nigeria. The data for the study were collected from 100 rice farmers in the four districts of the study area using a simple random sampling technique. The variables land and fertilizer were significant, with an R2 value of 0.895 [2].

The resource-use efficiency of sorghum production in the coastal region of Andhra Pradesh was examined. Data for the study were collected from 100 sorghum producers in seven villages of the study area for the 2008-09 crop season. All the variables were significant except expenditure on seeds, and with these variables the R2 value was 0.718 [3].

An investigation undertaken in central Gujarat estimated the technical efficiency of rice production and assessed the effect of farm-specific socio-economic factors on this technical efficiency. A stochastic frontier production function was estimated to determine the technical efficiency of individual farms, and variance and regression analyses were carried out to find the influence of socio-economic factors. The variables operational area, experience in rice cultivation, education level of the farmer, number of working family members, and distance of the field from the canal irrigation structure were significant. With these variables, the R2 value was 0.3174 [4].

From the above articles, it is seen that the rice production studies show high beta coefficients for closely related variables and give higher R2 values, whereas the technical efficiency study includes variables that lead to a lower R2 value. Hence the value of R2 clearly depends on the selection of independent variables, and a diagnostic check is needed to justify the fitted model.

Materials & Methods

Diagnostics:

(i) Hat matrix: The hat matrix comes from the formula for the fitted values of the regression of Y on X:

$$\hat{Y} = X(X'X)^{-1}X'Y = HY$$

where $H = X(X'X)^{-1}X'$ is the hat matrix.

The hat matrix transforms Y into the predicted scores. The diagonal elements $h_{ii}$ of the hat matrix indicate which observations are potential outliers.
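As a minimal sketch of this step, assuming the usual form H = X(X'X)^{-1}X' with an intercept column in X, the hat matrix and its diagonal can be computed with NumPy. The toy data and the helper name `hat_matrix` are hypothetical and are not taken from the study, which used a MATLAB program.

```python
import numpy as np

def hat_matrix(X):
    """Return H = X (X'X)^{-1} X' for a design matrix X that includes an intercept column."""
    # pinv is used instead of inv so the sketch tolerates a poorly conditioned X'X
    return X @ np.linalg.pinv(X.T @ X) @ X.T

# Hypothetical toy data (not the paper's data set): intercept plus one predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 1.9, 3.2, 3.9, 10.5])

H = hat_matrix(X)
y_hat = H @ y            # fitted values, Y-hat = HY
leverage = np.diag(H)    # diagonal elements h_ii
```

In this toy example the last observation, whose predictor value lies far from the others, has the largest diagonal element.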

(ii) Outliers: An outlier is an observation that is substantially different from all other observations and can make a large difference in the results of a regression analysis [5]. It is a data point that diverges from the overall pattern in a sample. In linear regression, an outlier is an observation with a large residual, that is, a large distance between the predicted value and the observed value (y). In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error or some other problem [5]. Outliers lower the significance of the fit of a statistical model because they do not coincide with the model's prediction.

(iii) Cook's distance ($D_i$): The purpose of Cook's distance is to evaluate large or unusual observations in regression models. It is a summary measure of the influence of a single case (observation) based on the total changes in all other residuals when the case is deleted from the estimation process.

Cook's distance can be calculated using the following formula:

$$D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where

$h_{ii}$ is the i-th diagonal element of the hat matrix;

$e_i$ is the crude residual (i.e., the difference between the observed value and the value fitted by the proposed model);

MSE is the mean square error of the regression model;

p is the number of fitted parameters in the model.

To identify potential outliers, one rule of thumb is to treat point i as an outlier when:

$$D_i > \frac{4}{n - k - 1}$$

where

n is the number of observations;

k is the number of parameters (independent variables) in the model.
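The calculation can be sketched in Python as follows. This is an illustration, not the authors' MATLAB program: it assumes the standard form D_i = e_i^2 h_ii / (p MSE (1 - h_ii)^2) and the rule-of-thumb limit 4/(n - k - 1), with k taken as the number of independent variables. The function names and the toy data are hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of the design matrix X (intercept column included)."""
    n, p = X.shape                          # p = number of fitted parameters
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # crude residuals e_i
    mse = (e @ e) / (n - p)                 # mean square error
    return (e ** 2 / (p * mse)) * h / (1.0 - h) ** 2

def cook_cutoff(n, k):
    """Rule-of-thumb upper limit 4 / (n - k - 1), assuming k independent variables."""
    return 4.0 / (n - k - 1)

# Hypothetical toy data: an exact line with the last point shifted upward
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x
y[9] += 20.0
X = np.column_stack([np.ones_like(x), x])

D = cooks_distance(X, y)
flagged = np.flatnonzero(D > cook_cutoff(n=len(y), k=1))
```

Here only the shifted point (index 9) exceeds the limit. With n = 50 and k = 2, `cook_cutoff` gives about 0.0851.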

Results & Discussion

Diagnostic analysis was performed on the hypothetical data set (Annexure I), in which consumption is the dependent variable and Income and Food size are the independent variables. The leverage points, Cook's distances and the upper limit for Cook's distance were computed using a MATLAB program, and the regression analysis was carried out with and without the outliers. The results are tabulated below.

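Table 1-Cook's distance

| S.No | Cook's Distance | S.No | Cook's Distance | S.No | Cook's Distance | S.No | Cook's Distance | S.No | Cook's Distance |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0037 | 11 | 0.0068 | 21 | 0.0028 | 31 | 0.0054 | 41 | 0.0113 |
| 2 | 0.0037 | 12 | 0.1882 | 22 | 0.0019 | 32 | 0.0093 | 42 | 0.0999 |
| 3 | 0.0037 | 13 | 0.0769 | 23 | 0.0087 | 33 | 0.0059 | 43 | 0.0029 |
| 4 | 0.0037 | 14 | 0.0093 | 24 | 0.0275 | 34 | 0.0042 | 44 | 0.1018 |
| 5 | 0.0037 | 15 | 0.0000 | 25 | 0.0053 | 35 | 0.0070 | 45 | 0.0258 |
| 6 | 0.0037 | 16 | 0.0007 | 26 | 0.0011 | 36 | 0.0028 | 46 | 0.0038 |
| 7 | 0.0037 | 17 | 0.0011 | 27 | 0.0002 | 37 | 0.0088 | 47 | 0.0187 |
| 8 | 0.0037 | 18 | 0.0182 | 28 | 0.0093 | 38 | 0.0076 | 48 | 0.0167 |
| 9 | 0.0037 | 19 | 0.0123 | 29 | 0.0000 | 39 | 0.0453 | 49 | 0.1974 |
| 10 | 0.0037 | 20 | 0.0202 | 30 | 0.0010 | 40 | 0.0128 | 50 | 0.0026 |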

The upper limit for Cook's distance was worked out as 0.0851 using the formula given above. According to the rule of thumb, observations whose Cook's distance exceeds the upper limit are classified as outliers. From Table 1, four observations in the data set, namely 12, 42, 44 and 49, are outliers. Their Cook's distances are 0.1882, 0.0999, 0.1018 and 0.1974 respectively, all of which are greater than the upper limit.
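The classification can be checked with a few lines of Python. The sketch assumes the upper limit is 4/(n - k - 1) with n = 50 observations and k = 2 independent variables, which reproduces the reported 0.0851, and it uses only the six largest Cook's distances from Table 1. The variable names are illustrative.

```python
# Assumed rule of thumb: upper limit = 4 / (n - k - 1)
n, k = 50, 2
cutoff = 4 / (n - k - 1)

# Six largest Cook's distances from Table 1, keyed by observation number
cooks = {12: 0.1882, 13: 0.0769, 39: 0.0453, 42: 0.0999, 44: 0.1018, 49: 0.1974}

# Observations whose Cook's distance exceeds the upper limit
flagged = sorted(obs for obs, d in cooks.items() if d > cutoff)
print(round(cutoff, 4), flagged)
```

Observation 13, with a Cook's distance of 0.0769, is the closest value below the limit and is therefore retained.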

Table 2-Regression Analysis with Outliers

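SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.67467256 |
| R Square | 0.455183063 |
| Adjusted R Square | 0.431999363 |
| Standard Error | 504.7628577 |
| Observations | 50 |

ANOVA

| Source | Df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 10004794 | 5002397 | 19.63375 | 6.34E-07 |
| Residual | 47 | 11974920 | 254785.5 | | |
| Total | 49 | 21979714 | | | |

| Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% |
|---|---|---|---|---|---|
| Intercept | 331.3086797 | 254.3287 | 1.302679 | 0.199031 | -180.335 |
| Income | 0.056091264 | 0.011326 | 4.952342 | 9.88E-06 | 0.033306 |
| Fsize | 129.5656701 | 36.13522 | 3.585578 | 0.000798 | 56.87098 |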

Table 3-Regression Analysis without Outliers

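SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.760803 |
| R Square | 0.578822 |
| Adjusted R Square | 0.559232 |
| Standard Error | 414.1317 |
| Observations | 46 |

ANOVA

| Source | Df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 10135011 | 5067505 | 29.54726 | 8.43E-09 |
| Residual | 43 | 7374719 | 171505.1 | | |
| Total | 45 | 17509730 | | | |

| Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% |
|---|---|---|---|---|---|
| Intercept | 273.24 | 211.2963 | 1.29316 | 0.20286 | -152.88 |
| Income | 0.063436 | 0.010091 | 6.286159 | 1.4E-07 | 0.043085 |
| Fsize | 107.0002 | 32.58815 | 3.283408 | 0.002043 | 41.27993 |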

The original data had 50 observations, four of which were outliers. Table 2 shows that the R2 value for the 50 observations was 0.455. Table 3 indicates that after removing the outliers there were 46 observations and the revised R2 value was 0.579. The adjusted R2 value also increased, from 0.432 to 0.559. This shows that the removal of outliers improves the explanatory power of the model, the precision of the regression coefficients and the R2 value [7].
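The reported R2 and adjusted R2 values can be reproduced from the sums of squares in the ANOVA panels of Tables 2 and 3. The sketch below assumes the usual definitions R2 = SS(regression)/SS(total) and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), with k = 2 independent variables. The function names are illustrative.

```python
def r_squared(ss_regression, ss_total):
    """Share of the total sum of squares explained by the regression."""
    return ss_regression / ss_total

def adjusted_r_squared(r2, n, k):
    """R2 penalised for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Sums of squares from the ANOVA panels of Table 2 and Table 3
r2_with = r_squared(10004794, 21979714)      # 50 observations
r2_without = r_squared(10135011, 17509730)   # 46 observations

adj_with = adjusted_r_squared(r2_with, n=50, k=2)
adj_without = adjusted_r_squared(r2_without, n=46, k=2)

print(round(r2_with, 3), round(adj_with, 3))
print(round(r2_without, 3), round(adj_without, 3))
```

The regression sum of squares changes little between the two fits, while the total sum of squares falls from 21979714 to 17509730 once the four outliers are removed, which accounts for most of the rise in R2.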

Summary and conclusion

Diagnostic checks are very important in regression analysis. To arrive at a suitable multiple linear regression model for a data set, the researcher has to carry out the diagnostic checks, remove any outliers from the data set, and use the updated data set for further analysis. In this regard, Cook's distance is a very useful statistic for identifying outliers. In this study, the diagnostic checks have been illustrated with a hypothetical data set. The study shows the importance of diagnostic checks in fitting regression models.

Future Scope of the Study

Residual analysis and the noting of outlying cases can lead to valuable insights for strengthening the model. Identifying outliers in this way gives insight into how to modify the model or fit the correct model for the analysis of the data. Finally, outliers can be detected and their influence on the parameter estimates assessed.

Conflict of Interest

The authors declare that there is no conflict of interest. The authors had full access to all of the data in this study and take complete responsibility for the integrity of the data and the accuracy of the data analysis.

Acknowledgment

I acknowledge the Department of Physical Science and Information Technology, Agricultural Engineering College and Research Institute, Tamil Nadu Agricultural University, Coimbatore, for providing the necessary facilities to carry out the work.

References:

  1. Sani RM, Malumfashi AI, Daneji MI, Alao OO (2007) Economics of rice production: a case study of Bauchi local government area, Bauchi state, Nigeria. Continental J. Agricultural Economics 1:7-13.
  2. David Terfa Akighir, Terwase Shabu (2011) Efficiency of Resource use in Rice Farming Enterprise in Kwande Local Government Area of Benue State, Nigeria. International Journal of Humanities and Social Science 1(3).
  3. Chapke RR, Biswajit Mondal, Mishra JS (2011) Resource-use Efficiency of Sorghum (Sorghum bicolor) Production in Rice (Oryza sativa)-fallows in Andhra Pradesh, India. J Hum Ecol 34(2):87-90.
  4. Anuradha Narala, Zala YC (2010) Technical Efficiency of Rice Farms under Irrigated Conditions in Central Gujarat. Agricultural Economics Research Review 23:375-381.
  5. Fox AJ (1972) Outliers in time series. Journal of the Royal Statistical Society, Series B 34:350-363.
  6. Kalman RE, et al. (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1):35-45.
  7. Stephen Raj S, Senthamarai Kannan K (2017) Detection of Outliers in Regression Model for Medical Data. International Journal of Medical Research & Health Sciences 6(7):50-56.
