• A Responsible Softmax Layer in Deep Learning

      Rychlik, Marek; Coatney, Ryan Dean; Maier, Robert S.; Glickenstein, David A.; Morrison, Clayton T. (The University of Arizona, 2020)
      Clustering algorithms are an important part of modern data analysis. The K-means and EM clustering algorithms both use an iterative process to find latent (or hidden) variables in a mixture distribution. These hidden variables may be interpreted as class labels for the data points of a sample. In connection with these algorithms, I consider a family of nonlinear mappings called responsibility maps. The responsibility map is obtained as a gradient of the log likelihood of N independent samples drawn from a mixture of K distributions. I study the discrete dynamics of this family of maps and prove that iterating the responsibility map converges to an estimate of the mixing coefficients. I also show that the convergence is consistent in the sense that the fixed point acts as a maximizer of the log likelihood. I call the process of determining class weights by iteration dynamic responsibility and show that it converges to a unique set of weights under mild assumptions. Dynamic responsibility (DR) is inspired by the expectation step of the expectation maximization (EM) algorithm and has a useful association with Bayesian methods. Like EM, dynamic responsibility is an iterative algorithm, but DR converges to a unique maximum under reasonable conditions. The weights determined by DR can also be found using gradient descent, but DR guarantees non-negative weights and gradient descent does not. I present a new algorithm, which I call responsible softmax, for classification with neural networks. This algorithm is intended to handle imbalanced training sets, which it accomplishes via multiplication by per-class weights. These weights may be interpreted as class probabilities for a generalized mixture model, and are determined through DR rather than by empirically observing the training set and heuristically selecting the underlying probability distributions. I compare the performance of responsible softmax with other standard techniques, including standard softmax and softmax weighted by empirical class probabilities. I use generated Gaussian mixture model data and the MNIST data set for proof of concept. I show that, in general, responsible softmax produces more useful classifiers than softmax when presented with imbalanced training data. I also show that responsible softmax approximates the performance of empirically weighted softmax, and in some cases may do better.
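
      The abstract summarizes the method rather than stating it precisely, so the following is only a minimal sketch of the two core ideas. It assumes the standard mixture-model responsibilities gamma(n, k) = pi_k p_k(x_n) / sum_j pi_j p_j(x_n) for the dynamic responsibility iteration, and it assumes that "multiplication by per-class weights" means scaling the softmax exponentials by class weights; the names dynamic_responsibility and responsible_softmax are hypothetical labels used for illustration, not identifiers from the dissertation.

import numpy as np

def dynamic_responsibility(likelihoods, n_iter=1000, tol=1e-10):
    # likelihoods: (N, K) array with likelihoods[n, k] = p_k(x_n), the density
    # of sample n under mixture component k (assumed to be given).
    # Returns a length-K vector of non-negative mixing weights that sum to one.
    K = likelihoods.shape[1]
    pi = np.full(K, 1.0 / K)                      # start from uniform weights
    for _ in range(n_iter):
        weighted = likelihoods * pi               # pi_k * p_k(x_n)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)   # responsibilities
        pi_new = gamma.mean(axis=0)               # average responsibility per class
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

def responsible_softmax(logits, class_weights):
    # Softmax whose exponentials are scaled by per-class weights, e.g. the
    # weights returned by dynamic_responsibility (or empirical class
    # frequencies, for the baseline comparison mentioned in the abstract).
    z = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    scaled = class_weights * np.exp(z)
    return scaled / scaled.sum(axis=1, keepdims=True)

# Toy check on an imbalanced two-component 1-D Gaussian mixture (90% / 10%).
def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 900), rng.normal(2.0, 1.0, 100)])
lik = np.stack([gauss(x, -2.0), gauss(x, 2.0)], axis=1)
print(dynamic_responsibility(lik))                # expected to be near [0.9, 0.1]

      For experiments at MNIST scale the weighting would presumably be applied inside the network's final layer rather than as a NumPy post-processing step; the sketch above is meant only to make the iteration and the weighting concrete.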