Topic 16: Solving classification problems with neural networks and ways to improve neural network accuracy

Outline:
1. Solving a classification problem with a neural network
2. Computing the loss function and gradient descent in a neural network
3. Improving model accuracy for classification

NEURAL NETWORK CLASSIFICATION

Introduction

Artificial neural networks are relatively crude electronic networks of neurons based on the neural structure of the brain. They process records one at a time, and learn by comparing their classification of the record (which, at first, is largely arbitrary) with the known actual classification of the record. The errors from the initial classification of the first record are fed back into the network and used to modify the network's algorithm for further iterations.

A neuron in an artificial neural network consists of:

1. A set of input values (xi) and associated weights (wi).
2. A function (g) that sums the weighted inputs and maps the result to an output (y).
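The neuron described above can be sketched in a few lines. This is only an illustration: the sigmoid used for g, the optional bias term and the sample numbers are assumptions, since the text does not fix a particular activation function.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Sum the weighted inputs, then map the sum to an output with g (here: sigmoid)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs xi and weights wi; the weighted sum is 0.5 - 0.5 + 0.0 = 0.0
y = neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.0])
print(y)
```

With a weighted sum of zero, the sigmoid returns 0.5, the midpoint of its output range.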

Neurons are organized into layers: input, hidden and output. The input layer is composed not of full neurons, but rather consists simply of the record's values that are inputs to the next layer of neurons. The next layer is the hidden layer. Several hidden layers can exist in one neural network. The final layer is the output layer, where there is one node for each class. A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to the class node with the highest value.

Training an Artificial Neural Network

In the training phase, the correct class for each record is known (termed supervised training), and the output nodes can be assigned correct values: 1 for the node corresponding to the correct class, and 0 for the others. (In practice, better results have been found using values of 0.9 and 0.1, respectively.) It is thus possible to compare the network's calculated values for the output nodes to these correct values, and calculate an error term for each node (the Delta rule). These error terms are then used to adjust the weights in the hidden layers so that, hopefully, during the next iteration the output values will be closer to the correct values.

Ensemble Methods

XLMiner V2015 offers two powerful ensemble methods for use with Neural Networks: bagging (bootstrap aggregating) and boosting. The Neural Network algorithm on its own can be used to find one model that results in good classifications of the new data. We can view the statistics and confusion matrices of the current classifier to see if our model is a good fit to the data, but how would we know if there is a better classifier just waiting to be found? The answer is that we do not know if a better classifier exists. However, ensemble methods allow us to combine multiple weak neural network classification models which, when taken together, form a new, more accurate strong classification model. These methods work by creating multiple diverse classification models, by taking different samples of the original data set, and then combining their outputs. (Outputs may be combined by several techniques, for example, majority vote for classification and averaging for regression.) This combination of models effectively reduces the variance in the strong model.
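The two steps mentioned above, taking different samples of the data and combining the models' outputs, can be sketched as follows. This is a minimal illustration, not XLMiner's implementation: sampling with replacement is assumed as the sampling scheme, and the votes and values are made up.

```python
import random
from collections import Counter

def sample_with_replacement(records, rng):
    """Draw a sample of the same size as the data set, with replacement."""
    return [rng.choice(records) for _ in records]

def majority_vote(predictions):
    """Combine classifier outputs: the most frequent class label wins."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine regression outputs by averaging them."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)
data = list(range(10))
sample = sample_with_replacement(data, rng)

# Hypothetical outputs of three weak models for one record
winner = majority_vote(["setosa", "versicolor", "setosa"])
mean_value = average([2.0, 4.0, 6.0])
print(len(sample), winner, mean_value)
```

Each weak model would be trained on its own sample; only the combination step is shown here because the training itself depends on the model used.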
The two types of ensemble methods offered in XLMiner (bagging and boosting) differ on three items: 1) the selection of training data for each classifier or weak model; 2) how the weak models are generated; and 3) how the outputs are combined. In both methods, each weak model is trained on the entire Training Set to become proficient in some portion of the data set.

Bagging (bootstrap aggregating) was one of the first ensemble algorithms ever to be written. It is a simple algorithm, yet very effective. Bagging generates several Training Sets by using random sampling with replacement (bootstrap sampling), applies the classification algorithm to each data set, then takes the majority vote among the models to determine the classification of the new data. The biggest advantage of bagging is the relative ease with which the algorithm can be parallelized, which makes it a better selection for very large data sets.

Boosting builds a strong model by successively training models to concentrate on the records misclassified by previous models. Once completed, all classifiers are combined by a weighted majority vote. XLMiner offers three different variations of boosting as implemented by the AdaBoost algorithm (one of the most popular ensemble algorithms in use today): M1 (Freund), M1 (Breiman), and SAMME (Stagewise Additive Modeling using a Multi-class Exponential).

AdaBoost.M1 first assigns a weight $w_b(i)$ to each record or observation. This weight is originally set to $1/n$ and is updated on each iteration of the algorithm. An original classification model is created using this first training set ($T_b$), and an error is calculated as:

$$e_b = \sum_{i=1}^{n} w_b(i)\, I\big(C_b(x_i) \neq y_i\big)$$

where $C_b(x_i)$ denotes the class that the $b$th model predicts for record $i$, $y_i$ is its actual class, and the $I()$ function returns 1 if true, and 0 if not.

The error of the classification model in the $b$th iteration is used to calculate the constant $\alpha_b$. This constant is used to update the weight $w_b(i)$.
In AdaBoost.M1 (Freund), the constant is calculated as:

$$\alpha_b = \ln\left(\frac{1 - e_b}{e_b}\right)$$

In AdaBoost.M1 (Breiman), the constant is calculated as:

$$\alpha_b = \frac{1}{2}\ln\left(\frac{1 - e_b}{e_b}\right)$$

In SAMME, the constant is calculated as:

$$\alpha_b = \frac{1}{2}\ln\left(\frac{1 - e_b}{e_b}\right) + \ln(k - 1)$$

where $k$ is the number of classes. When the number of categories is equal to 2, SAMME behaves the same as AdaBoost.M1 (Breiman).

In any of the three implementations (Freund, Breiman, or SAMME), the new weight for the $(b + 1)$th iteration will be

$$w_{b+1}(i) = w_b(i)\,\exp\big(\alpha_b\, I(C_b(x_i) \neq y_i)\big)$$

where $I(C_b(x_i) \neq y_i)$ is 1 when the $b$th model misclassifies record $i$ and 0 otherwise. Afterwards, the weights are all readjusted so that they sum to 1. As a result, the weights assigned to the observations that were classified incorrectly are increased, and the weights assigned to the observations that were classified correctly are decreased. This adjustment forces the next classification model to put more emphasis on the records that were misclassified. (The $\alpha_b$ constant is also used in the final calculation, which gives the classification model with the lowest error more influence.) This process repeats until $b$ equals the number of weak learners. The algorithm then computes the weighted sum of votes for each class and assigns the winning classification to the record.

Boosting generally yields better models than bagging; however, it has the disadvantage that it is not parallelizable. As a result, if the number of weak learners is large, boosting would not be suitable.

Neural network ensemble methods are very powerful, and typically result in better performance than a single neural network. The ensemble methods in XLMiner V2015 give users more accurate classification models and should be considered over a single network.

Neural Network

Now that we have a proper data set that is ready to use, let's continue with the neural network. The following is the simple neural network, with two hidden layers, that we're going to use for this classification problem.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NeuralNet(nn.Module):
        def __init__(self, in_features=4, out_features=3):
            super().__init__()
            # Two fully connected hidden layers followed by the output layer
            self.fc1 = nn.Linear(in_features=in_features, out_features=120)
            self.fc2 = nn.Linear(in_features=120, out_features=84)
            self.fc3 = nn.Linear(in_features=84, out_features=out_features)

        def forward(self, X):
            X = F.relu(self.fc1(X))
            X = F.relu(self.fc2(X))
            return self.fc3(X)  # raw scores (logits), one per class

Create an instance of this model to train.

    model = NeuralNet()

Training

As the data set is ready for training, let's continue with training the model. Since there are not many data points to train with, I'm going to feed the entire training data set to the network as a single batch in each of 50 epochs (one epoch = one full pass over the training data). I also track the loss throughout the epochs, so let's see how the loss reaches a minimum during training.

The loss/cost function I'm going to use here is Cross Entropy Loss, because this is a classification problem, and the optimizer will be Adam.

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

Let's go ahead with the training:

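    epochs = 50
    losses = []
    for i in range(1, epochs + 1):
        y_pred = model(X_train)              # forward pass on the full training set
        loss = criterion(y_pred, y_train)
        losses.append(loss.item())           # store a plain float for plotting
        if i % 10 == 0:
            print(f'epoch: {i} -> loss: {loss.item()}')
        optimizer.zero_grad()                # clear gradients from the previous step
        loss.backward()                      # backpropagate
        optimizer.step()                     # update weights and biases

After 50 epochs, the plot of losses against epochs looks like this: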

Figure: Loss against epochs (Adam optimizer)

Looking at the plot, we can see that 50 epochs are enough to reach the minimum loss; in fact, the loss reaches its minimum around the 30th epoch. After reaching the minimum, training further would risk overfitting.

Validation

Although we only have 20% of the original data set for validation, it is entirely new data that the neural network has never seen. Therefore, using this data set, we should be able to tell whether there is overfitting or underfitting.

Validation should be done inside a torch.no_grad() block, because we don't want to track gradients or disrupt the network's trained weights and biases during evaluation.

    with torch.no_grad():
        y_pred = model(X_test)
        preds = torch.max(y_pred, dim=1)[1]   # index of the largest score in each row
        correct = (preds == y_test).sum()
        print(f'{correct} out of {y_test.shape[0]} is correct : {correct.item() / y_test.shape[0] * 100}%')

We take the max values of the y_pred tensor along dimension 1, because the values along dimension 1 score how likely the sample is to belong to each species. The class with the maximum score is the one the network considers most probable, so it is taken as the prediction.

The above snippet prints "30 out of 30 is correct : 100.0%", which means all the items are predicted correctly by the network.

Instead of passing the entire test batch at once, we can also validate the predictions one by one.

    with torch.no_grad():
        correct = 0
        for i, X in enumerate(X_test):
            y_pred = model(X)
            if y_pred.argmax().item() == y_test[i]:
                correct += 1
        print(f'{correct} out of {y_test.shape[0]} is correct : {correct / y_test.shape[0] * 100}%')

Apply Unknown Data and Get Results

Using entirely unknown, new or made-up data, we can now predict which species a data point belongs to. This function allows us to do it dynamically.

    @torch.no_grad()
    def predict_unknown(X_unknown):
        y_pred = model(X_unknown)
        # labels is assumed to be the list of species names defined earlier
        return labels[y_pred.argmax().item()]

Now, if we have the following data point, we can try predicting its species.

    unknown_iris = torch.tensor([5.6, 3.7, 2.2, 0.5])

Invoking predict_unknown(unknown_iris) gives us the result Iris setosa. We can't check whether this prediction is correct, because we don't have a validation entry for this data point. But we can get an idea by looking at the scatter plot we used earlier, so we could try mapping this data point onto that scatter plot.

Looking at the scatter plot, we can confirm that the prediction of the neural network is correct.

Conclusion

This neural network and training data set are very simple, so the network is easy to train and gives correct results. Real-world data usually cannot be fed to a neural network without proper preprocessing. Experimenting with the network configuration, such as the number of neurons per hidden layer, the number of hidden layers and the activation functions, is very much required.

As I move towards more and more complicated scenarios in this field, I hope to publish more interesting articles. Expert opinions and comments are appreciated.

Working on the data side

Various methods can be applied to the data to increase the accuracy of the model. Some of the methods that can be applied on the data side are as follows:

Method 1: Acquire more data

Classification modelling always benefits from letting the data describe the problem as fully as possible. We may run into problems when the amount of data is very small. In such a scenario we need techniques or sources that can provide us with more data.

However, if we are practising modelling or participating in competitions, getting more data is difficult; in that case we can try duplicating existing records in the training set to boost the accuracy. If we are working on a company project, we can ask the source for more data where possible. This is one of the basic methods that can lead us to higher accuracy than before. Let's move to our second method.

Method 2: Missing value treatment

There are various reasons why missing values arise in data. Here we are not concerned with how missing values are generated but with how they are treated. One thing is very clear: missing values in the data can lead the modelling procedure to disaster. A biased model and inaccurate predictions can be the results of modelling with missing values in the data. Take a look at the table below.

Age   Name      Gender      Working status
18    Rahul     M           N
20    Archana   F           Y
29    Ashok     M           Y
18    Anamika   (missing)   N
27    Ankit     M           Y

Let's take a look at the table below, which shows the percentage of working people by gender.

Gender   Percentage
Female   100%
Male     66.66%

Here we can see that there is a missing value, and the records show that 100% of females are working. Now, if we fill the missing value with F, the results will be as follows.

Gender   Percentage
Female   50%
Male     66.66%

Here we can see the effect of missing values on the final prediction.
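The percentages in the two tables can be reproduced with a short calculation. The sketch below uses the five records from the example; the dictionary layout and the function name working_rate are choices made for illustration only.

```python
records = [
    {"name": "Rahul", "gender": "M", "working": False},
    {"name": "Archana", "gender": "F", "working": True},
    {"name": "Ashok", "gender": "M", "working": True},
    {"name": "Anamika", "gender": None, "working": False},  # missing gender
    {"name": "Ankit", "gender": "M", "working": True},
]

def working_rate(rows, gender):
    """Percentage of people of the given gender who are working."""
    group = [r for r in rows if r["gender"] == gender]
    return 100.0 * sum(r["working"] for r in group) / len(group)

# Before imputation the record with the missing gender is silently ignored
female_before = working_rate(records, "F")

# Fill the missing gender with "F", as in the example
filled = [dict(r, gender=r["gender"] or "F") for r in records]
female_after = working_rate(filled, "F")
male_rate = working_rate(filled, "M")

print(female_before, female_after, round(male_rate, 2))
```

A single imputed value halves the female percentage, which is why the treatment of missing values deserves care on small data sets.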

There are various ways to deal with missing values. Some of them are as follows:

- Mean, median, and mode: This method deals with missing values in continuous data (age, in our case). We can fill missing values using the mean, median, or mode of the continuous variable.
- Separation of class: This method separates out the data points that have missing values; it can be used for categorical data.
- Missing value prediction: This method uses another model to predict the missing values. After filling the missing values using that model, we can continue our work on the main modelling.
- KNN imputation: This method finds the data points whose attributes are similar to those of the record with the missing value and fills the missing value with the information available in those similar data points.
- Filling closest value: This method fills in the nearest value in place of the missing value. It can work with continuous data but is best suited to time series data.

Method 3: Outlier treatment

Fitting a classification model can also be thought of as fitting a line or area to the data points. If the data points lie close to each other, fitting a model can give us better results because the prediction area is dense. If some data points are sparse, the model can become inaccurate and biased toward the sparse data points. To increase the accuracy of the classification model, we need ways to treat outliers. Some of the ways of outlier treatment are as follows:

- Deleting outliers: By plotting the data points, we can detect the values that are far from the dense area, and we can delete these sparse data points if we have a very large amount of data. With a small amount of data, this is not a good way to treat outliers.
- Data transformation: Data transformation can help us reduce the influence of outlier data points in modelling. One way is to take the log of the data points to reduce the sparsity of the data. Binning is also a way to transform data, and many algorithms, such as decision trees, help us deal with outlier data points by binning the data.
- Mean, median, mode: This method is similar to the one we discussed for missing values. Before using it, we need to determine whether the outlier we have detected is natural or artificial. If the value is artificial, we can use the mean, median, or mode of the other data points in place of the outlier.
- Separation: If the number of outliers is unusually high, we can separate them from the main data and fit a model on them separately.

Method 4: Feature engineering

Feature engineering is one of the best ways to increase the accuracy of the classification model, since it lets the model work with only those variables that are highly correlated with the target variable. We can also think of this method as forming hypotheses about what drives the accuracy of the model. We can perform feature engineering in three steps:

- Transformation of features: At its core, transformation of the features includes two main processes that can be applied one after the other; in some cases only one of them is required.
  - Scaling data: In this process, we normalize the data so that it scales between 0 and 1. For example, we may have three variables measured in grams, kilograms, and tons. With such data, the model should be fitted after normalization to improve its accuracy.
  - Removing skewness: Many models give higher accuracy when the data is close to normally distributed. Before fitting a classification model, we should remove the skewness of the data as much as possible.
- Creation of features: This process creates new features from the existing features of the data. It can help us understand the data and generate new insights from it. For example, in website traffic data the hour of a visit may show no relationship with the traffic while the minute does. Such information can help improve the accuracy of the model.
- Feature selection: This process tells us about the relation of each feature with the target variable. Using it, we generally reduce the number of features that are going to be modelled. Since only the best features are fed into the model, it helps improve the results. Various ways help in selecting the best features:
  - Knowledge: Based on knowledge of the domain, we can say which variables are most important to model. For example, for the daily sales of a shop, the day and the amount of material are important but the name of the customer is not.
  - Parameters: Some parameters, such as the p-value, help determine the relation between variables. Using these parameters, we can differentiate between important and unimportant features of the data for modelling.
  - Dimension reduction: Various dimension reduction algorithms help us project the data into a lower dimension and also help in understanding inherent relationships in the data.

Working on the model side

In the above section, the methods we discussed apply changes and techniques to the data. After making the data good for modelling, we may need to make some changes on the model side to improve the accuracy of the classification modelling process. Some of the techniques that can be followed to increase the accuracy of classification modelling are as follows:

Method 1: Hyperparameter tuning

At the core of every model are settings that control how it learns and produces its final results. Settings that are chosen before training, rather than learned from the data, are called hyperparameters. For example, a decision tree uses a split criterion such as Gini impurity to split the data into branches, and limits such as the maximum depth control how far it keeps splitting.

The way a decision tree splits the data has a large impact on its accuracy, so to split better we need to find good values for the settings that control the splits. Finding optimal values for these settings is known as hyperparameter tuning.

Hyperparameter tuning has a large impact on the performance and accuracy of the model, and various packages help with it, some of them in an automated way.
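A simple form of hyperparameter tuning is a grid search: try each candidate value, score the model on held-out data, and keep the best. The sketch below is only an illustration under stated assumptions: it tunes k for a tiny one-dimensional k-nearest-neighbours classifier rather than a decision tree, and the training and validation points are made up.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, valid, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)

# Hypothetical data; (1.5, "b") is a noisy point inside the "a" cluster
train = [(1.0, "a"), (1.2, "a"), (1.4, "a"),
         (2.9, "b"), (3.0, "b"), (3.2, "b"), (1.5, "b")]
valid = [(1.1, "a"), (1.48, "a"), (3.1, "b"), (2.8, "b")]

# Grid search over the hyperparameter k
scores = {k: accuracy(train, valid, k) for k in (1, 3, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here k=1 is misled by the noisy point and k=7 always predicts the majority class, so the intermediate value wins. Real tuning packages follow the same loop with more sophisticated search strategies.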

Method 2: Applying different models

In many cases, a single model will not reach the required accuracy even after extensive hyperparameter tuning. In such a scenario, applying different classification models to the data can help increase the accuracy of the procedure. Hyperparameter tuning can then be applied after finalizing a model for better accuracy.

Various packages help us find an optimal model and the hyperparameters of that model.

Method 3: Ensembling methods

In machine learning, ensemble methods differ from ordinary modelling in that they train several weak models on the data and combine their results. They are more accurate because the results are combined. We can categorize ensembling methods into two categories:

- Averaging methods: These methods combine the results of different models as an average. They are generally better than applying a single model to the data. Examples are bagging meta-estimators and forests of randomized trees.
- Boosting methods: These methods reduce the bias of the combined estimator by applying the base models sequentially. They are very powerful in terms of performance and accuracy. Examples are AdaBoost and gradient tree boosting.

Method 4: Cross-validation

Although this method relates to both the data side and the model side, we apply models several times in this technique, so we can consider cross-validation a method of working on the modelling side. Also, a model that performs accurately on its training data is not necessarily accurate on new data. Cross-validation is a way to verify the accuracy of the model.

Cross-validation works by applying the trained model to data on which it was not trained. We can do this by dividing the whole data set into groups of similar size and changing the held-out group at each round of training. We can then infer whether the data and model lead to accurate predictions on unseen records.
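The splitting step of K-fold cross-validation can be sketched as follows. This is a minimal illustration that uses contiguous folds over record indices; in practice the records are usually shuffled first, and the function name k_fold_indices is an assumption made for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, holding out one fold per round."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each record is used for testing exactly once, and the model would be retrained on the remaining records in every round; the average test score over the rounds is the cross-validated estimate.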

Cross-validation is mainly useful when overfitting is a problem in the modelling. There are various techniques of cross-validation, such as K-fold, leave-one-group-out, and leave-P-groups-out.

Reducing Loss: Gradient Descent

The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.

Suppose we had the time and the computing resources to calculate the loss for all possible values of $w_i$. For the kind of regression problems we've been examining, the resulting plot of loss vs. $w_i$ will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:

Figure 2. Regression problems yield convex loss vs. weight plots. (A U-shaped curve, with the vertical axis labeled 'loss' and the horizontal axis labeled 'value of weight $w_i$'.)

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.

Calculating the loss function for every conceivable value of $w_i$ over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism, very popular in machine learning, called gradient descent.

The first stage in gradient descent is to pick a starting value (a starting point) for $w_i$. The starting point doesn't matter much; therefore, many algorithms simply set $w_i$ to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:

Figure 3. A starting point for gradient descent. (A U-shaped curve; a point about halfway up the left side of the curve is labeled 'Starting Point'.)

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.

Note that a gradient is a vector, so it has both of the following characteristics:

- a direction
- a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

Figure 4. Gradient descent relies on negative gradients. (A U-shaped curve; a point on the left side is labeled 'Starting Point', and an arrow labeled 'negative gradient' points from this point to the right.)

To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point as shown in the following figure:

Figure 5. A gradient step moves us to the next point on the loss curve. (The same U-shaped curve; a second arrow points from the tip of the 'negative gradient' arrow down to a second point on the curve, labeled 'next point'.)

The gradient descent then repeats this process, edging ever closer to the minimum.
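The procedure above can be sketched for a single weight. This is a minimal illustration under stated assumptions: the loss is the made-up convex function (w - 3)^2, and the "fraction of the gradient" is a fixed factor named learning_rate here, a name not used in the text above.

```python
def gradient_descent(grad, w_start, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient, recording every visited point."""
    w = w_start
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # move in the negative gradient direction
        history.append(w)
    return w, history

# Hypothetical convex loss: loss(w) = (w - 3)^2, so its gradient is 2 * (w - 3)
w_final, history = gradient_descent(lambda w: 2.0 * (w - 3.0), w_start=0.0)
print(history[:3], w_final)
```

Starting from 0, the first step moves to about 0.6 and each later step shrinks the remaining distance to the minimum at w = 3 by the same factor, which is the "edging ever closer" behaviour described above.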

Multi-Class Neural Networks: One vs. All

One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers, one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question. For example, given a picture of a dog, five different recognizers might be trained, four seeing the image as a negative example (not a dog) and one seeing the image as a positive example (a dog). That is:

1. Is this image an apple? No.
2. Is this image a bear? No.
3. Is this image candy? No.
4. Is this image a dog? Yes.
5. Is this image an egg? No.

This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
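The way one-vs.-all turns a single multi-class label into N binary questions can be sketched as follows. The function name one_vs_all_targets and the three example labels are assumptions made for illustration.

```python
def one_vs_all_targets(labels, classes):
    """For each class, build binary targets: 1 if the example is that class, else 0."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

classes = ["apple", "bear", "candy", "dog", "egg"]
# Hypothetical labels of three training images
targets = one_vs_all_targets(["dog", "apple", "dog"], classes)

for c in classes:
    print(c, targets[c])
```

Each of the five target lists would be used to train its own binary classifier, which is why the cost grows with the number of classes.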

We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach:

Figure 1. A one-vs.-all neural network. (A neural network with five hidden layers and five output layers.)

Recall that logistic regression produces a decimal between 0 and 1.0. For example, a logistic regression output of 0.8 from an email classifier suggests an 80% chance of an email being spam and a 20% chance of it being not spam. Clearly, the sum of the probabilities of an email being either spam or not spam is 1.0.

Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.

For example, returning to the image analysis we saw in Figure 1, Softmax might produce the following likelihoods of an image belonging to a particular class:

Class    Probability
apple    0.001
bear     0.04
candy    0.008
dog      0.95
egg      0.001

Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.

Figure 2. A Softmax layer within a neural network. (A deep neural net with an input layer, two nondescript hidden layers, then a Softmax layer, and finally an output layer with the same number of nodes as the Softmax layer.)
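In its standard form, Softmax maps raw scores $z_1, \dots, z_K$ to probabilities $p_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$. The sketch below applies that formula to five made-up scores, one per class from the example; the scores are assumptions chosen only so that "dog" wins.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that add up to 1.0."""
    shift = max(scores)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["apple", "bear", "candy", "dog", "egg"]
probs = softmax([0.1, 3.8, 2.2, 7.0, 0.1])  # hypothetical raw scores
predicted = classes[probs.index(max(probs))]

for c, p in zip(classes, probs):
    print(f"{c}: {p:.3f}")
print("predicted:", predicted)
```

Because the exponential preserves ordering, the class with the largest raw score always receives the largest probability.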

Softmax Options

Consider the following variants of Softmax:

Full Softmax is the Softmax we've been discussing; that is, Softmax calculates a probability for every possible class.

Candidate sampling means that Softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.

Full Softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.

One Label vs. Many Labels

Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:

- You may not use Softmax.
- You must rely on multiple logistic regressions.

For example, suppose your examples are images containing exactly one item, a piece of fruit. Softmax can determine the likelihood of that one item being a pear, an orange, an apple, and so on. If your examples are images containing all sorts of things, such as bowls of different kinds of fruit, then you'll have to use multiple logistic regressions instead.
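The multi-label case can be sketched with one independent logistic (sigmoid) output per class. This is only an illustration: the raw scores and the 0.5 decision threshold are assumptions, and the function name multi_label_predict is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_predict(scores, labels, threshold=0.5):
    """Score each class independently; keep every class at or above the threshold."""
    return [lab for lab, z in zip(labels, scores) if sigmoid(z) >= threshold]

labels = ["pear", "orange", "apple"]
# Hypothetical raw scores for one image of a fruit bowl
present = multi_label_predict([2.0, -1.5, 0.7], labels)
print(present)
```

Unlike Softmax, the per-class outputs here do not have to add up to 1.0, so several classes can be reported for the same image.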