Figures 4.2 to 4.11 show that most of the variables have a number of outliers in the right
tail. These outliers may make the variables more positively skewed than they would
otherwise be. For example, the variable MORTDUE appears to have a number of outliers in the
right tail. For the variables DELINQ and DEROG, the majority of the values are zero. The
question now arises whether these are legitimate observations or whether they were caused
by errors in recording. This is addressed when the models are fitted.
The data set was randomly split into four sets:
- The “old” data set contains 2,759 observations, of which 565 are bad.
- The “validation” data set contains 549 observations, of which 109 are bad.
- The “new” data set contains 566 observations, of which 114 are bad.
- The “test” data set contains 1,662 observations, of which 340 are bad.
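A random split of this kind can be sketched as follows. This is only an illustration: the set sizes are those listed above, but the seed, the function name random_split and the use of row indices are assumptions, and the procedure actually used to split the data is not described here.

```python
import random


def random_split(n_rows, sizes, seed=0):
    """Shuffle the row indices and cut them into consecutive blocks of the given sizes."""
    if sum(sizes.values()) != n_rows:
        raise ValueError("the set sizes must add up to the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed (assumed) for reproducibility
    parts, start = {}, 0
    for name, size in sizes.items():
        parts[name] = indices[start:start + size]
        start += size
    return parts


# Set sizes as reported in the list above.
sizes = {"old": 2759, "validation": 549, "new": 566, "test": 1662}
parts = random_split(sum(sizes.values()), sizes)
for name, rows in parts.items():
    print(name, len(rows))
```

Because the indices are shuffled once and then cut into blocks, every observation lands in exactly one of the four sets.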
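The missing values in the data set were replaced by class-conditional means. For each
variable, one mean was calculated over the observations with the target variable (BAD)
equal to 1 and another over the observations with BAD equal to 0, and each missing value
was replaced by the mean of the class to which the observation belongs. The missing values
were thus replaced by two means for each variable.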
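The imputation described above can be sketched as follows. The rows are made-up toy values, the function name impute_by_class_mean is hypothetical, and missing values are assumed to be coded as None; only numerical variables are handled.

```python
def impute_by_class_mean(rows, target="BAD"):
    """Replace None by the mean of the same variable within the same target class."""
    sums, counts = {}, {}
    for row in rows:
        for name, value in row.items():
            if name != target and value is not None:
                key = (row[target], name)
                sums[key] = sums.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
    filled = []
    for row in rows:
        new_row = dict(row)
        for name, value in row.items():
            if name != target and value is None:
                # Raises KeyError if the class has no observed value for this variable.
                key = (row[target], name)
                new_row[name] = sums[key] / counts[key]
        filled.append(new_row)
    return filled


# Toy rows (not taken from the data set); None marks a missing value.
rows = [
    {"BAD": 0, "YOJ": 10.0, "DEBTINC": 30.0},
    {"BAD": 0, "YOJ": None, "DEBTINC": 34.0},
    {"BAD": 0, "YOJ": 6.0, "DEBTINC": None},
    {"BAD": 1, "YOJ": 2.0, "DEBTINC": None},
    {"BAD": 1, "YOJ": None, "DEBTINC": 40.0},
    {"BAD": 1, "YOJ": 4.0, "DEBTINC": 44.0},
]
filled = impute_by_class_mean(rows)
for row in filled:
    print(row)
```

In the toy example the missing YOJ value of a good applicant is filled with the mean YOJ of the good applicants, and that of a bad applicant with the mean YOJ of the bad applicants.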
4.2 Logistic Regression Model on “old” Data
A logistic regression model was fitted on the “old” data. This model represents the model
fitted on the data available in the home country. The fitting algorithm needed six Fisher
scoring iterations to converge. The estimated parameters of the model are given in
Table 4.3.
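To illustrate what a Fisher scoring iteration does, a minimal sketch for a logistic regression is given below. It is not the software used to fit the model in Table 4.3. The toy data, the function names and the stopping rule (largest change in a parameter below a tolerance) are assumptions; statistical packages often monitor the change in deviance instead.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(x_rows, y, tol=1e-8, max_iter=25):
    """Fit a logistic regression by Fisher scoring; returns (beta, iterations used)."""
    p = len(x_rows[0])
    beta = [0.0] * p
    for iteration in range(1, max_iter + 1):
        score = [0.0] * p                      # gradient of the log-likelihood
        info = [[0.0] * p for _ in range(p)]   # Fisher information matrix
        for row, target in zip(x_rows, y):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = 1.0 / (1.0 + math.exp(-eta))  # fitted probability
            weight = mu * (1.0 - mu)
            for j in range(p):
                score[j] += row[j] * (target - mu)
                for k in range(p):
                    info[j][k] += weight * row[j] * row[k]
        step = solve_linear(info, score)       # beta_new = beta + info^{-1} score
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta, iteration
    raise RuntimeError("Fisher scoring did not converge")


# Toy data (hypothetical): an intercept column and one explanatory variable.
x_rows = [[1.0, float(v)] for v in range(8)]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta, n_iter = fit_logistic(x_rows, y)
print("estimates:", beta, "iterations:", n_iter)
```

Each iteration solves a weighted linear system, which is why the procedure is also known as iteratively reweighted least squares.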
Table 4.3
Logistic regression model fitted on the “old” data.

Variable        Estimate     Std. Error   z value   Pr(>|z|)   Significance
(Intercept)     -7.19E+00    5.64E-01     -12.765   < 2e-16    Significant
LOAN            -2.37E-05    6.50E-06      -3.642   0.000271   Significant
MORTDUE         -3.71E-06    2.28E-06      -1.625   0.104238   Insignificant
VALUE            3.03E-06    1.60E-06       1.902   0.057212   Insignificant
REASONHomeImp    2.03E-01    1.35E-01       1.504   0.132632   Insignificant
JOBOffice       -6.82E-01    2.25E-01      -3.038   0.002382   Significant
JOBOther         1.72E-02    1.79E-01       0.096   0.923139   Insignificant
JOBProfExe       4.76E-02    2.10E-01       0.227   0.820586   Insignificant
JOBSales         4.02E-01    4.25E-01       0.948   0.343111   Insignificant
JOBSelf          4.02E-01    3.80E-01       1.057   0.290496   Insignificant
YOJ             -1.62E-02    9.14E-03      -1.768   0.077093   Insignificant
DEROG            7.34E-01    8.06E-02       9.098   < 2e-16    Significant
DELINQ           8.04E-01    6.42E-02      12.53    < 2e-16    Significant
CLAGE           -5.22E-03    8.65E-04      -6.038   1.56E-09   Significant
NINQ             1.37E-01    3.20E-02       4.272   1.94E-05   Significant
CLNO            -2.82E-02    6.79E-03      -4.148   3.36E-05   Significant
DEBTINC          1.91E-01    1.38E-02      13.868   < 2e-16    Significant

Several variables are significant at the 5% level of significance, namely LOAN, JOBOffice,
DEROG, DELINQ, CLAGE, NINQ, CLNO and DEBTINC. This indicates that many of the variables
included in the model are significant in explaining whether an applicant will be good or
bad. The residual deviance of the model is 1,866.7 with 2,742 degrees of freedom.
Interpretation is now given for the parameters of LOAN, DEROG and DEBTINC.
- The parameter of LOAN is -2.37E-05 and is significant at the 5% significance level.
  LOAN represents the amount of the loan requested. A unit increase in LOAN, with all
  other variables held fixed, corresponds to a decrease of 2.37E-05 in the log-odds of
  default.
- The parameter of DEROG is 7.34E-01 and is significant at the 5% significance level.
  DEROG represents the number of major derogatory reports. A unit increase in DEROG, with
  all other variables held fixed, corresponds to an increase of 7.34E-01 in the log-odds
  of default.
- The parameter of DEBTINC is 1.91E-01 and is significant at the 5% significance level.
  DEBTINC represents the debt-to-income ratio of the applicant. A unit increase in
  DEBTINC, with all other variables held fixed, corresponds to an increase of 1.91E-01 in
  the log-odds of default.
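Changes in log-odds are easier to read as odds ratios, obtained by exponentiating the parameter. The sketch below does this for the three parameters interpreted above and also recomputes the z value (estimate divided by standard error) and the two-sided p-value under a standard normal reference distribution. The estimates, standard errors and z values are the rounded figures from Table 4.3, so the recomputed z values differ slightly from the reported ones. The function names are illustrative only.

```python
import math

# (estimate, standard error, reported z value), copied from Table 4.3.
coefficients = {
    "LOAN": (-2.37e-05, 6.50e-06, -3.642),
    "DEROG": (7.34e-01, 8.06e-02, 9.098),
    "DEBTINC": (1.91e-01, 1.38e-02, 13.868),
}


def odds_ratio(estimate, units=1.0):
    """Multiplicative change in the odds of default for an increase of `units`."""
    return math.exp(estimate * units)


def two_sided_p(z):
    """Two-sided p-value of a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


for name, (estimate, std_error, z_reported) in coefficients.items():
    z = estimate / std_error  # Wald statistic from the rounded table entries
    print(
        f"{name}: odds ratio {odds_ratio(estimate):.6f}, "
        f"z {z:.3f} (reported {z_reported}), p {two_sided_p(z_reported):.3g}"
    )
```

Under these figures, one additional major derogatory report multiplies the odds of default by about 2.08, and a unit increase in DEBTINC multiplies them by about 1.21. The odds ratio for LOAN is very close to 1 because the parameter measures the effect of a single unit of the loan amount.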
In order to check the adequacy of the model, collinearity of the independent variables,
outliers and influential observations are considered. The correlation matrix of the
numerical independent variables is given in Table 4.4.
From this correlation matrix, we see that there are no large pairwise correlations. The
largest correlation is 0.78, between VALUE and MORTDUE. A correlation between two variables
becomes worrying when it is greater than 0.9. The variance inflation factors for each
numerical variable are given in Table 4.5.
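The link between a pairwise correlation and the variance inflation factor (VIF) can be sketched as follows. The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. Since this R² is at least the squared correlation with any single other variable, 1 / (1 - r²) is a lower bound for the VIF. The function names and the toy vectors are assumptions, and the sketch does not reproduce the values in Table 4.5, which depend on all the variables jointly.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_lower_bound(r):
    """Lower bound on the VIF implied by a single pairwise correlation r."""
    return 1.0 / (1.0 - r * r)


# Toy vectors (hypothetical) to show the correlation calculation.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 4.0]))

# Largest correlation reported in the text, and the 0.9 threshold.
print(vif_lower_bound(0.78))
print(vif_lower_bound(0.9))
```

For the correlation of 0.78 between VALUE and MORTDUE the bound is roughly 2.6, whereas a correlation of 0.9 implies a VIF of at least about 5.3.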