ASSESSMENT OF ECONOMICAL STABILITY OF PROJECT INVESTORS BY MEANS OF HYBRID TECHNIQUES.


Mª Teresa Rodríguez*, Villanueva, Joaquin**; Menendez, Cesar*; Alonso, Cristina**
*Universidad de Oviedo, Departamento de Matemáticas
**Universidad de Oviedo, Área de Proyectos de Ingeniería
C/ Independencia, Oviedo. mayte@api.uniovi.es

SUMMARY

Carrying out projects usually requires a search for external financing. The characteristics of the projects and the existing uncertainties introduce financial risks that banks must evaluate in order to grant financing. This paper presents the use of data mining techniques as an aid for evaluating the risk of a promoter entering a liquidity crisis. To this end, a set of 1, real cases provided by a bank has been analysed, including 10% of cases that entered a liquidity crisis. For each case, information on 26 variables was collected. The file was protected by hiding the identity of the companies and the meaning of the variables. The data file was pre-processed, selecting the most relevant variables by means of an iterative pruning strategy. The final model was generated with multivariate adaptive techniques, providing banks with a valuable tool for evaluating the risk of a liquidity crisis when financing projects.

Keywords: data mining, liquidity crisis, project finance

ABSTRACT

In general, the development of a project involves a search for finance. The characteristics of a project, specifically its risk, introduce considerable uncertainty that banks must weigh when deciding whether to provide finance. Promoters getting into a liquidity crisis is one of the major problems banks have to face today.
A liquidity crisis occurs whenever a company is unable to pay its bills on time or lacks sufficient cash to expand inventory and production. Banks need to establish and implement prudent liquidity management policies to assess the liquidity of companies and to protect their own positions. Detecting the risk of a liquidity crisis in a company is a hard task, because companies with the best credit do not need loans and companies with the worst credit are not likely to repay them; a bank's best customers are in the middle. This paper shows how to use data mining techniques to predict the risk of a project getting into a liquidity crisis. A historical dataset relative to 1 cases has been analyzed.

The file was protected to guarantee customer privacy by blinding the names of the variables and of the companies. The model was created with a combination of several data mining techniques, which makes it possible to identify the most relevant variables and to extract rules that facilitate the interpretation of the data. The final model was created using multivariate adaptive techniques, providing banks with an important tool for assessing their own risk in project finance.

Key Words: Data mining, liquidity crisis, project finance

2. INTRODUCTION

A great number of projects need bank financing. Financial institutions must analyze the liquidity risk of the partners taking part in a project, ensuring that they will have sufficient liquidity to meet liabilities when due, under both normal and stressed conditions. In general it is hard to determine whether a company will get into a liquidity crisis. Data mining techniques applied to the historical information of previous projects can help to identify potential risks during the loan approval cycle. These techniques allow the automated analysis of large data sets, finding patterns and trends that might otherwise go undiscovered.

The liquidity crisis problem can be framed as a simple classification task: predicting whether a promoter presents a good or a poor credit risk. In this paper we explain how data mining techniques have been used to predict credit risk from a set of 26 variables describing attributes of the companies, with unknown semantics. The training data set contained 2, samples, with a class distribution of 10% positive cases, i.e. cases where a liquidity crisis occurred, and 90% negative cases. The test data consisted of 1, unlabeled samples. The objective was to capture the greatest number of true positive cases within the 2, samples regarded as most likely to enter a liquidity crisis.
The following sections explain the work carried out to create the data mining model for assessing the economic stability of investors, and the results achieved. We start by describing the characteristics of the data set and the main pre-processing tasks. Next, we describe the techniques used and the reasons for choosing them. We then explain the methodology followed to select the most relevant variables for modelling. Finally, the results and conclusions are presented.

1. DATA UNDERSTANDING

The data set used in this work was provided by the Deutsche Sparkassen- und Giroverband (DSGV) bank. It is a blind historical set of 2, samples, each formed by 26 variables describing attributes of the companies and a binary variable recording whether the company got into a liquidity crisis.

The first step in creating the model was to get familiar with the data, identify data quality problems, gain first insights into the data, and detect interesting subsets from which to form hypotheses about hidden information. The variables of the process were studied in order to determine their influence on the output, detect relations between attributes, examine the quality of the data, and analyze possible transformations and other data preparation needed for further analysis.

The preliminary analysis detected, on average, 24% of missing values per feature, with a great number of the missing data registered in the file by means of a sentinel value. The results of data mining models depend strongly on the quality of the data, so a strategy for handling missing values is needed. There are several alternatives: filtering out cases, replacing missing values (median, average, k-nearest neighbours), selecting a modelling technique able to deal with missing values, or estimating the values from information about the process. In general filtering cases is a good option, but here it was not possible because of the high number of missing values. The strategy adopted by the research team was to replace the missing values by the average and to add, for each variable with missing values, a new binary variable marking the positions with missing data with a 1 and all other positions with a 0. In this way the data mining technique can handle missing data by considering two possible cases: the variable is known, or its value is missing.

In order to extract information about the process, the variables were discretized using equi-depth histograms, and the probability of getting into a liquidity crisis was calculated for each interval of the histogram.
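The two pre-processing steps described above can be illustrated with a short sketch. It is not the authors' implementation: the function names, the use of NaN to encode missing entries, and the default number of bins are assumptions made only for this example.

```python
import numpy as np

def impute_with_indicator(X):
    """Replace NaNs by the column mean and append one 0/1 flag column for
    every column that had missing values (1 = value was missing).
    Assumes missing entries are encoded as NaN (an assumption of this sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)          # mean of the known values only
    X_filled = np.where(missing, col_means, X)
    has_missing = missing.any(axis=0)
    flags = missing[:, has_missing].astype(float)
    return np.hstack([X_filled, flags])

def equidepth_crisis_rate(x, y, n_bins=4):
    """Split x into bins holding roughly the same number of samples and
    return the bin edges and the fraction of positive labels per bin.
    A bin left empty by heavy ties would yield NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Only the interior edges are used, so every value maps to 0..n_bins-1.
    bin_idx = np.searchsorted(edges[1:-1], x, side="right")
    rates = np.array([y[bin_idx == b].mean() for b in range(n_bins)])
    return edges, rates
```

Calling `equidepth_crisis_rate` on one variable and the crisis label gives the kind of per-interval probability that the histograms summarise; the number of bins is a free choice here.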
Figure 1 shows two examples of the graphs obtained by equi-depth discretization for two input variables: Var1 (left) and Var22 (right). Negative values of Var1 have a greater probability of leading to a liquidity crisis than positive values, while the magnitude of the variable is not relevant. In the case of Var22 the magnitude is relevant: the risk decreases as the magnitude of Var22 increases.

Figure 1: Discretization of the input variables and analysis of the probability to get into a liquidity crisis (panels: VAR1, VAR22; vertical axis: Probability).

2. DESCRIPTION OF THE TECHNIQUE SELECTED

The problem under study is a binary classification problem: predicting whether a promoter will get into a liquidity crisis. We are also interested, however, in predicting the risk that a promoter gets into a liquidity crisis. Besides, we have to take into account that the quality of the data is poor, with a high proportion of missing values. For these reasons we selected ApiMARS as the data mining technique, a modification by the research team of the standard MARS algorithm [Friedman, 1991] that provides a measure of the liquidity crisis risk and is robust to noisy data.

The MARS procedure builds flexible regression models by fitting separate splines (or basis functions) to distinct intervals of the predictor variables. Both the variables to use and the end points of the intervals for each variable, referred to as knots, are found via an exhaustive search procedure in a two-phase process. In the first phase, a model is grown by adding basis functions (new main effects, knots, or interactions) up to a maximum number predetermined by the user. In the second phase, basis functions are deleted in order of least contribution to the model until an optimal balance of bias and variance is found. MARS has been used successfully for multidimensional fitting in several applications across different fields, in some cases outperforming neural networks [Rodriguez, 2003].

The ApiMARS procedure is a modification of the basic MARS algorithm that automatically determines the number of basis functions from the data, changing the forward/backward procedures so that different data are presented at every step. It inherits from MARS the capability to selectively blank out some regions of a variable in order to focus on the most promising zones, which makes it a good tool for finding interactions between variables and complex data structures. In addition, the algorithm can treat missing data, using basis functions that blank out a variable for the cases in which it contains missing values.

3. SELECTION OF RELEVANT VARIABLES

Although the number of variables considered in this problem is relatively low (26 variables), it is convenient to reduce the dimensionality of the input space and, if possible, to decide whether some attributes are more important than others, weighting the attributes accordingly.
The selection of variables and the weighting of their relative importance followed an iterative pruning strategy built on the ApiMARS algorithm. The strategy starts by considering all the variables selected as candidates during the pre-processing phase and trains different ApiMARS models on a great number of randomly selected sets of patterns (80% training, 20% test). The models are analysed by weighting the variables according to two parameters: the importance of the variable within the model and the importance of the model. The importance of the variable within the model is calculated by a sensitivity analysis of the loss of fit when the variable is removed, and the importance of the model is calculated from the mean squared error on the test patterns. Irrelevant or barely relevant variables are removed, and the process is repeated iteratively until all the non-relevant (or barely relevant) variables have been removed.

Applied to the problem under study, the strategy stopped after 3 iterations, identifying 6 of the 26 variables as not relevant. Figure 2 shows the variables ordered by their relative importance. Five variables (Var_23, Var_24, Var_3, Var_11 and Var_15) are the most important, so the reliability of the application depends on the quality of these variables.
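A minimal sketch of the iterative pruning loop described above is given here. ApiMARS is not publicly specified in this paper, so an ordinary least-squares linear model is swapped in purely as a stand-in; the function names, the number of random splits, the way the two importance terms are combined, and the 1% cut-off are all assumptions of this example, not the authors' settings.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in model (NOT ApiMARS): least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of the stand-in model on the given patterns."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))

def variable_importance(X, y, n_rounds=20, seed=0):
    """Average, over random 80/20 splits, of the loss of fit on the test
    patterns when each variable is removed, scaled by the test error of
    the full model so that better models weigh more (assumed weighting)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(p)
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        cut = int(0.8 * n)
        tr, te = perm[:cut], perm[cut:]
        full_err = mse(fit_linear(X[tr], y[tr]), X[te], y[te])
        for j in range(p):
            keep = [c for c in range(p) if c != j]
            err = mse(fit_linear(X[tr][:, keep], y[tr]), X[te][:, keep], y[te])
            scores[j] += max(err - full_err, 0.0) / (full_err + 1e-12)
    return scores / n_rounds

def iterative_pruning(X, y, min_share=0.01, max_iter=10):
    """Repeatedly drop variables whose share of the total importance is
    below min_share (hypothetical threshold) until none is left to drop."""
    active = list(range(X.shape[1]))
    for _ in range(max_iter):
        imp = variable_importance(X[:, active], y)
        share = imp / imp.sum() if imp.sum() > 0 else imp
        weak = [a for a, s in zip(active, share) if s < min_share]
        if not weak or len(weak) == len(active):
            break
        active = [a for a in active if a not in weak]
    return active
```

`iterative_pruning` returns the column indices that survive; sorting the final importance scores would give a ranking comparable in spirit to the one reported for the retained variables.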

It is therefore important to make an effort to collect these variables with the highest possible quality.

Figure 2: Relative importance of variables (%).

4. TRAINING AND TESTING

The parameters of the data mining technique must be calibrated in order to avoid under-training or over-fitting. Figure 3 shows the usual behaviour of model performance versus complexity for the training and test samples. The error evaluated on the training sample decreases with the complexity of the model, but the error on the test sample can increase because the model loses generality. To optimise the results, a balance between the training and test errors must be found.

Figure 3: Performance of the model versus complexity.

To evaluate the fitness of the models, the data were divided by means of 5-fold cross validation. That is, the data set was divided into 5 subsets, using 4 subsets to train the model and the remaining subset to test it, and repeating the process 5 times. Figure 4 shows the success rate of the model, evaluated on the test sets, as a function of the estimated risk of a liquidity crisis. It shows that the best threshold for separating the promoters that get into a liquidity crisis lies in the interval (0.45, 0.55), with more than 92% of promoters correctly classified.

Figure 4: Results of the model for the test sets (vertical axis: success rate; horizontal axis: probability estimated by the Uniovi model).
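The 5-fold procedure can be sketched as follows. The model is passed in as two callables, `fit` and `predict_risk`, which are hypothetical placeholders for whatever technique is used; the shuffling seed and the default threshold of 0.5 (a value inside the interval reported above) are likewise assumptions of this example.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs: the shuffled samples are split
    into k subsets and each subset is used once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validated_success_rate(X, y, fit, predict_risk, k=5, threshold=0.5):
    """Average fraction of correctly classified test samples over k folds.
    fit(X, y) returns a model; predict_risk(model, X) returns a risk in
    [0, 1] per sample (both are placeholders for the real technique)."""
    rates = []
    for train_idx, test_idx in kfold_indices(len(y), k):
        model = fit(X[train_idx], y[train_idx])
        risk = predict_risk(model, X[test_idx])
        predicted = (risk >= threshold).astype(int)
        rates.append(np.mean(predicted == y[test_idx]))
    return float(np.mean(rates))
```

Sweeping `threshold` over a grid and plotting the returned rate would produce a curve of the same kind as the one in Figure 4.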

5. CONCLUSIONS

This paper describes a new method based on data mining techniques to assess the economic stability of project investors. A data set of 1, samples with 26 attributes was used to create the model, with 10% of positive cases. The data set is characterised by a great number of missing values (24% per feature on average). To deal with this problem, a new binary variable was associated with each variable with missing values. The new variable marks the missing cases with a 1 and the valid cases with a 0, and the missing values in the original variable were replaced by the average of the variable.

The ApiMARS algorithm was selected as the data mining technique for both the selection of relevant variables and the modelling. For this type of problem, collecting reliable data is one of the greatest difficulties. ApiMARS allows the most relevant variables to be weighted, so that efforts can be focused on capturing this information correctly. The parameters of the model were calibrated by means of 5-fold cross validation in order to avoid under-fitting and over-fitting. The results obtained are very promising, with 92% of the promoters getting into a liquidity crisis successfully identified.

6. REFERENCES

[1] Friedman, J. H. Multivariate adaptive regression splines. The Annals of Statistics, Vol. 19, No. 1, 1-141, 1991.
[2] Rodriguez Montequín, M. T. Modelado evolutivo mediante técnicas adaptativas aplicado al control de inclusiones en bobinas laminadas en caliente. Doctoral thesis, 2003.
[3] Rodríguez, M. T.; Ortega, F.; Rendueles, J. L.; Menéndez, C. Combination of Multivariate Adaptive Techniques and Neural Networks for Prediction and Control of Internal Cleanliness in Steel Strips. Proceedings of EUNITE
