Understanding Black-box Predictions via Influence Functions

Pang Wei Koh and Percy Liang. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 1885-1894.

How can we explain the predictions of a black-box model? In this paper, we use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. We show that even on non-convex and non-differentiable models, where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks.

With the rapid adoption of machine learning systems in sensitive applications, there is an increasing need to make black-box models explainable. The paper's roadmap is to revive an "old technique" from robust statistics, the influence function, and to apply it to neural networks by taking advantage of the accessibility of their gradients -- even though their losses are non-convex and non-differentiable. See more on this video at https://www.microsoft.com/en-us/research/video/understanding-black-box-predictions-via-influence-functions/

Setup

We are given training points z_1, ..., z_n, where z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}. For a point z and parameters \theta \in \Theta, let L(z, \theta) be the loss, and let \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) be the empirical risk, minimized by \hat{\theta}. The idea is to compute the parameter change if z were upweighted by some small \epsilon, giving us new parameters

\hat{\theta}_{\epsilon, z} \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta).

A classical result gives this change without retraining: the influence of upweighting z on the parameters is -H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}), where H_{\hat{\theta}} is the Hessian of the empirical risk, and the influence on the loss at a test point z_test is

\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}).

Since there are n samples, removing one training point can be interpreted as upweighting it by \epsilon = -1/n, so influence functions approximate leave-one-out retraining at a fraction of its cost. The full derivation is given further below.
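Before scaling anything up, the definitions can be checked end-to-end on a model small enough to form H explicitly. Below is a minimal sketch (not the authors' code; the data, constants, and function names are all illustrative) for L2-regularized linear regression, where the Hessian is available in closed form, comparing the influence prediction against actual leave-one-out retraining:

```python
# Minimal sketch: exact influence for L2-regularized linear regression,
# checked against leave-one-out retraining. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 1e-2

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def fit(X, y):
    # theta = argmin (1/n) sum_i (x_i^T theta - y_i)^2 / 2 + lam/2 ||theta||^2
    H = X.T @ X / len(y) + lam * np.eye(d)  # Hessian of the regularized risk
    return np.linalg.solve(H, X.T @ y / len(y)), H

theta, H = fit(X, y)

def grad_loss(x, y_, theta):
    # gradient of the per-example loss (x^T theta - y)^2 / 2
    return (x @ theta - y_) * x

# I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z)
x_test, y_test = rng.normal(size=d), 0.0
g_test = grad_loss(x_test, y_test, theta)
i = 0
infl = -g_test @ np.linalg.solve(H, grad_loss(X[i], y[i], theta))

# Removing z_i corresponds to epsilon = -1/n, so the predicted change in
# test loss is -infl / n; compare with actually retraining without z_i.
theta_loo, _ = fit(np.delete(X, i, 0), np.delete(y, i, 0))
def test_loss(t): return 0.5 * (x_test @ t - y_test) ** 2
print("influence estimate:", -infl / n)
print("actual LOO change: ", test_loss(theta_loo) - test_loss(theta))
```

The two printed numbers should agree to first order, which is exactly the approximation influence functions make.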
Measuring influence

We have two ways of measuring influence. Our first option is to delete the instance from the training data, retrain the model on the reduced training dataset, and observe the difference in the model parameters or predictions (either individually or over the complete dataset). This leave-one-out retraining is exact but requires one full training run per training point. The second option is the influence function above, which estimates the same quantity from a single trained model.

Comparing models through this lens is instructive. On an image classification task, an Inception-V3 network behaves very differently from an RBF-kernel SVM (using a SmoothHinge loss): the Inception network (a DNN) picked up on the distinctive characteristics of the fish class itself, while the SVM is driven by training images that are close in pixel space. Likewise, the training images identified as helpful for getting the correct test outcome on a "ship" image look very different for the two models.

The obstacle to applying the formula directly is H^{-1}: explicitly forming and inverting the Hessian is infeasible for modern models. The implementation therefore requires only oracle access to gradients and Hessian-vector products (HVPs).
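Such an HVP oracle is cheap to build with double backward (the Pearlmutter trick). Here is a minimal sketch in PyTorch; the function name and toy model are illustrative, not the paper's code:

```python
# Minimal sketch: a Hessian-vector product via double backward, the only
# second-order oracle that influence functions need.
import torch

def hvp(loss, params, v):
    # H v = grad of (grad(loss) . v); create_graph keeps the first
    # gradient differentiable so we can differentiate it again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    inner = sum((g * u).sum() for g, u in zip(grads, v))
    return torch.autograd.grad(inner, params)

# Tiny usage example on a linear model with squared error.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x, y = torch.randn(10, 3), torch.randn(10)
loss = 0.5 * ((x @ w - y) ** 2).mean()
v = (torch.randn(3),)
print(hvp(loss, (w,), v))  # tuple with one tensor of shape (3,)
```

Each HVP costs about as much as two backward passes, so H^{-1}v can be approximated iteratively without ever materializing H.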
Implementation: a PyTorch package

A PyTorch implementation of influence functions accompanies the paper's ideas. Contents: Requirements · Installation · Usage · Background and Documentation · Config · Misc parameters.

Dependencies: Numpy/Scipy/Scikit-learn/Pandas (further requirements apply for running the tests). You can either install this package directly through pip, or import it as a package once it is on your Python path. There is also a reproducible, executable, and Dockerized version of these scripts on Codalab.

The package offers two modes of computation for calculating the influence of the individual samples of your training dataset on the prediction outcomes of the processed test samples:

1. On the fly: when testing for a single test image, the values s_test and grad_z for each training image are computed on the fly when calculating the influence of that single image.
2. Precomputed: grad_z values are saved to disk first, and the influence is then calculated by reading both values from disk and combining them. Keeping the grad_z values only makes sense if they can be loaded faster than they can be recalculated, and they can take significant amounts of disk space (100s of GBs). This mode pays off if you have a fast SSD, lots of free storage space, and want to calculate the influences on the prediction outcomes of an entire dataset or even more than 1000 test samples.

Most importantly, however, s_test depends only on the test sample: it is computed once per test prediction and then reused across all training points, while one grad_z is used to estimate the first approximation in s_test and once more to combine with the s_test vector into the influence value. The precision of the output can be adjusted by using more iterations and/or recursions when approximating s_test, which can of course be changed in the config. The package automatically creates the outdir folder to prevent a runtime error, and in the result visualizations the numbers above the images show the actual influence value which was calculated.

If the influence function is calculated for multiple test samples, you can also use the scores to compress your dataset slightly to the most influential images; that can increase prediction accuracy and reduce training cost. For details and examples, look here.

Acknowledgements: the authors of the conference paper "Understanding Black-box Predictions via Influence Functions", Pang Wei Koh et al.
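Background: s_test is defined as s_test := H^{-1} \nabla_\theta L(z_test, \hat{\theta}), so the influence of every training point on one prediction reduces to a dot product with s_test. The paper approximates it with a stochastic, LiSSA-style truncated Neumann recursion. Below is a minimal, self-contained sketch; the signature and the damping/scaling constants are illustrative assumptions, not the package's exact API:

```python
# Minimal sketch of the s_test estimator (LiSSA-style recursion):
# s_test ~= H^{-1} v_test, using only Hessian-vector products.
import torch

def hvp(loss, params, v):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    inner = sum((g * u).sum() for g, u in zip(grads, v))
    return torch.autograd.grad(inner, params)

def s_test(v_test, params, batches, calc_loss, damp=0.01, scale=25.0, steps=100):
    # Recursion: h <- v + (1 - damp) h - (H h) / scale. Its fixed point is
    # h = scale * (H + damp * scale * I)^{-1} v, so h / scale ~= H^{-1} v
    # for small damping (damping keeps things stable when H is non-convex).
    h = [u.clone() for u in v_test]
    for _, batch in zip(range(steps), batches):
        loss = calc_loss(batch, params)
        hv = hvp(loss, params, h)
        h = [u + (1 - damp) * h_ - hv_ / scale
             for u, h_, hv_ in zip(v_test, h, hv)]
    return [h_ / scale for h_ in h]

# Toy check on a quadratic loss, where H = A and s_test should match A^{-1} v.
torch.manual_seed(0)
A = torch.tensor([[3.0, 0.5], [0.5, 2.0]])
theta = torch.zeros(2, requires_grad=True)
loss_fn = lambda batch, params: 0.5 * params[0] @ A @ params[0]
v = (torch.tensor([1.0, 0.0]),)
est = s_test(v, (theta,), iter([None] * 300), loss_fn,
             damp=0.0, scale=10.0, steps=300)
print(est[0], torch.linalg.solve(A, v[0]))  # should approximately agree
```

In practice each step draws a fresh mini-batch for the HVP, and the recursion is repeated and averaged; more iterations and recursions buy more precision, matching the config knobs described above.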
Course notes: neural net training dynamics

This paper also appears as a reading in a graduate course on neural net training dynamics (CSC2541); the notes below are from that course. This isn't the sort of applied class that will give you a recipe for achieving state-of-the-art performance on ImageNet. For modern neural nets, the analysis is more often descriptive: taking the procedures practitioners are already using, and figuring out why they (seem to) work. After all, the optimization landscape is non-convex, highly nonlinear, and high-dimensional, so why are we able to train these networks? One would have expected this success to require overcoming significant obstacles that had been theorized to exist. Why neural nets generalize despite their enormous capacity is intimately tied to the dynamics of training. Despite its simplicity, linear regression provides a surprising amount of insight into neural net training, and highly overparameterized models can behave very differently from more traditional underparameterized ones.

Topics covered include:

- Momentum: the heavy ball method, and why the Nesterov Accelerated Gradient can further speed up convergence.
- Adaptive Gradient Methods, Normalization, and Weight Decay [Slides]: we look at three algorithmic features which have become staples of neural net training.
- Stochastic Optimization and Scaling [Slides]: which optimization techniques are useful at which batch sizes? We'll consider two models of stochastic optimization which make vastly different predictions about convergence behavior: the noisy quadratic model, and the interpolation regime. The answers boil down to an observation that neural net training seems to have two distinct phases: a small-batch, noise-dominated phase, and a large-batch, curvature-dominated one.
- Second-order optimization: we motivate second-order optimization of neural nets from several perspectives -- minimizing second-order Taylor approximations, preconditioning, invariance, and proximal optimization -- and see how to approximate the second-order updates using conjugate gradient or Kronecker-factored approximations (see the sketch after this list). Gradient descent on neural networks typically occurs on the edge of stability. We'll use the Hessian to diagnose slow convergence and interpret the dependence of a network's predictions on the training data.
- Metrics and function space: metrics give a local notion of distance on a manifold. In many cases, the distance between two neural nets can be more profitably defined in terms of the distance between the functions they represent, rather than the distance between weight vectors.
- Infinite width: the more recent Neural Tangent Kernel gives an elegant way to understand gradient descent dynamics in function space. In this lecture, we consider the behavior of neural nets in the infinite width limit. Time permitting, we'll also consider the limit of infinite depth.
- Games: up to now, we've assumed networks were trained to minimize a single cost function. Things get more complicated when there are multiple networks being trained simultaneously to different cost functions; we'll mostly focus on minimax optimization, or zero-sum games.
- Bilevel optimization: bilevel optimization refers to optimization problems where the cost function is defined in terms of the optimal solution to another optimization problem. Besides just getting your networks to train better, another important reason to study neural net training dynamics is that many of our modern architectures are themselves powerful enough to do optimization; the meta-optimizer has to confront many of the same challenges we've been dealing with in this course, so we can apply the insights to reverse engineer the solutions it picks.

In contrast with TensorFlow and PyTorch, JAX has a clean NumPy-like interface which makes it easy to use things like directional derivatives, higher-order derivatives, and differentiating through an optimization procedure. Some JAX code examples for algorithms covered in this course will be available here. Check out "CSC2541 for the Busy" for a compressed overview.
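Conjugate gradient, mentioned in the second-order bullet above, is also the paper's alternative to the LiSSA recursion for computing H^{-1}g in influence functions. A minimal sketch, assuming hvp_fn is any callable returning H @ v for a positive-definite H (the toy matrix below is illustrative):

```python
# Minimal sketch: solve H x = g with conjugate gradient, using only
# Hessian-vector products and never materializing H.
import numpy as np

def conjugate_gradient(hvp_fn, g, tol=1e-8, max_iter=100):
    x = np.zeros_like(g)
    r = g - hvp_fn(x)   # residual
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp_fn(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage on an explicit positive-definite H.
H = np.array([[3.0, 0.5], [0.5, 2.0]])
g = np.array([1.0, -1.0])
print(conjugate_gradient(lambda v: H @ v, g), np.linalg.solve(H, g))
```

For non-convex losses a damping term is added to H so that it is positive definite, which is also how the paper applies the theory to deep networks.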
Course logistics

Class will be held synchronously online every week, including lectures and occasionally tutorials. Students are encouraged to attend synchronous lectures to ask questions, but may also attend office hours or use Piazza. All information about attending virtual lectures, tutorials, and office hours will be sent to enrolled students through Quercus. Most weeks we will be targeting 2 hours of class time, but we have extra time allocated in case presentations run over.

The marking scheme is as follows. 25%: Colab notebook and paper presentation. It is individual work: your job will be to read and understand the paper, and then to produce a Colab notebook which demonstrates one of the key ideas from the paper. The problem set will give you a chance to practice the content of the first three lectures, and will be due on Feb 10. The project proposal is due on Feb 17, and is primarily a way for us to give you feedback on your project idea; for the final project, you will carry out a small research project relating to the course content. The details of the assignment are here.

Group influence

Influence scores rank training points from most helpful to most harmful for a given prediction, but often we want to identify an influential group of training samples in a particular test prediction for a given machine learning model, not just single points. "On the accuracy of influence functions for measuring group effects" (Koh et al., NeurIPS 2019) studies exactly this question.
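A minimal sketch of the first-order group heuristic -- scoring a group by the sum of its members' individual influences, in the spirit of the group-effects follow-up -- on a synthetic ridge-regression setup (all data and names are illustrative):

```python
# Minimal sketch: group influence as the sum of individual influences,
# reusing a single s_test = H^{-1} grad L(z_test) for all training points.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1e-1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = X.T @ X / n + lam * np.eye(d)        # Hessian of the regularized risk
theta = np.linalg.solve(H, X.T @ y / n)  # ridge solution

x_t, y_t = rng.normal(size=d), 0.0
g_test = (x_t @ theta - y_t) * x_t
s_test = np.linalg.solve(H, g_test)      # computed once, reused below

grads = (X @ theta - y)[:, None] * X     # row i = grad L(z_i, theta)
infl = -grads @ s_test                   # I_up,loss(z_i, z_test) for all i

# Most positive influence = most harmful (upweighting raises test loss);
# most negative = most helpful. Score a group by summing its members.
group = np.argsort(infl)[-10:]           # the 10 most harmful points
print("group influence:", infl[group].sum())
```

The follow-up paper shows this additive approximation is informative but can be biased for large, correlated groups, which is why it is worth validating against retraining on small problems.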
Derivation and analysis

This ICML 2017 best paper is from Stanford, by Pang Wei Koh and Percy Liang; a NeurIPS 2019 follow-up studies how accurately influence functions measure group effects. The derivation below follows the paper.

Upweighting a training point z by a small \epsilon gives the perturbed parameters

\hat{\theta}_{\epsilon, z} \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta).

The classical influence-function result gives the effect on the parameters without retraining,

\mathcal{I}_{\text{up,params}}(z) \stackrel{\text{def}}{=} \left.\frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}),

and, applying the chain rule, the effect on the loss at a test point:

\begin{aligned} \mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) &\stackrel{\text{def}}{=} \left.\frac{d L(z_{\text{test}}, \hat{\theta}_{\epsilon, z})}{d \epsilon}\right|_{\epsilon=0} \\ &= \left.\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} \frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon}\right|_{\epsilon=0} \\ &= -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}). \end{aligned}

Setting \epsilon = -1/n corresponds to removing z from the training set.

Perturbing an input. For z = (x, y), define z_{\delta} \stackrel{\text{def}}{=} (x + \delta, y) and

\hat{\theta}_{\epsilon, z_{\delta}, -z} \stackrel{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z_{\delta}, \theta) - \epsilon L(z, \theta).

Then

\begin{aligned} \left.\frac{d \hat{\theta}_{\epsilon, z_{\delta}, -z}}{d \epsilon}\right|_{\epsilon=0} &= \mathcal{I}_{\text{up,params}}(z_{\delta}) - \mathcal{I}_{\text{up,params}}(z) \\ &= -H_{\hat{\theta}}^{-1}\left(\nabla_{\theta} L(z_{\delta}, \hat{\theta}) - \nabla_{\theta} L(z, \hat{\theta})\right). \end{aligned}

Since \epsilon and \delta are small, expanding the gradient difference to first order in \delta gives

\left.\frac{d \hat{\theta}_{\epsilon, z_{\delta}, -z}}{d \epsilon}\right|_{\epsilon=0} \approx -H_{\hat{\theta}}^{-1}\left[\nabla_{x} \nabla_{\theta} L(z, \hat{\theta})\right] \delta, \qquad \hat{\theta}_{z_{\delta}, -z} - \hat{\theta} \approx -\frac{1}{n} H_{\hat{\theta}}^{-1}\left[\nabla_{x} \nabla_{\theta} L(z, \hat{\theta})\right] \delta,

so the influence of an input perturbation on the test loss is

\begin{aligned} \mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})^{\top} &\stackrel{\text{def}}{=} \left.\nabla_{\delta} L(z_{\text{test}}, \hat{\theta}_{z_{\delta}, -z})^{\top}\right|_{\delta=0} \\ &= -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{x} \nabla_{\theta} L(z, \hat{\theta}). \end{aligned}

Case study: logistic regression. With p(y \mid x) = \sigma(y \theta^{\top} x) for y \in \{-1, 1\}, the influence takes the closed form

\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -y_{\text{test}}\, y \cdot \sigma(-y_{\text{test}} \theta^{\top} x_{\text{test}}) \cdot \sigma(-y \theta^{\top} x) \cdot x_{\text{test}}^{\top} H_{\hat{\theta}}^{-1} x.

Because \mathcal{I}_{\text{up,loss}} measures how strongly each training point moves the test loss, it can be used to debug training data: the training points with the largest-magnitude influence on a prediction are the ones to inspect first. Computing it requires the Hessian H_{\hat{\theta}} of the training loss. Forming and inverting H explicitly is prohibitive for deep networks, but with stochastic estimation based on Hessian-vector products the cost drops to roughly O(np), for n samples and p parameters; when the loss is non-convex at \hat{\theta}, a damped quadratic approximation keeps H positive definite so the same formulas apply.

Experiments and use cases. On an ImageNet dog-vs-fish task (900 training images per class), the paper compares Inception v3 against an SVM with RBF kernel. Validated against actual leave-one-out retraining, influence estimates agree closely when the loss is smooth; with the non-smooth hinge loss the reported correlation drops to about 0.86, and smoothing the hinge (SmoothHinge) restores it to about 0.95. Using \mathcal{I}_{\text{pert,loss}}, the authors craft visually indistinguishable training-set (poisoning) attacks that flip a large fraction of the 591 test predictions, with reported success rates in the 57%-77% range depending on the setting; the related work connects this to training-set attacks and adversarial examples. Finally, self-influence \mathcal{I}_{\text{up,loss}}(z_i, z_i) flags likely mislabeled examples: after randomly flipping 10% of the labels, ranking training points by self-influence surfaces the flipped examples much faster than ranking by training loss or inspecting points at random. The approach is straightforward, and a worthy best paper; a follow-up applies the same tool to data subsampling ("Less is better: Unweighted data subsampling via influence function").
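A minimal sketch of the mislabel-detection use case (not the paper's code; the synthetic data, constants, and names are illustrative), for L2-regularized logistic regression with labels in {-1, +1}:

```python
# Minimal sketch: flag likely mislabeled points by self-influence.
# Here self_infl = g^T H^{-1} g = -I_up,loss(z_i, z_i); larger means the
# point influences its own loss more, a signature of label noise.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))
flip = rng.choice(n, n // 10, replace=False)  # mislabel 10% of the points
y[flip] *= -1

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

def risk(theta):
    m = y * (X @ theta)
    return np.mean(np.logaddexp(0, -m)) + 0.5 * lam * theta @ theta

def risk_grad(theta):
    m = y * (X @ theta)
    return -(sigma(-m) * y) @ X / n + lam * theta

theta = minimize(risk, np.zeros(d), jac=risk_grad, method="L-BFGS-B").x

# Hessian of the regularized empirical risk (sigma'(t) is even in t).
p = sigma(X @ theta)
H = (X * (p * (1 - p))[:, None]).T @ X / n + lam * np.eye(d)

grads = -(sigma(-y * (X @ theta)) * y)[:, None] * X  # row i = grad L(z_i)
self_infl = np.einsum("id,id->i", grads @ np.linalg.inv(H), grads)

# Flipped labels should concentrate among the highest self-influence scores.
top = np.argsort(self_infl)[-len(flip):]
print("flipped points in top slice:", len(set(top) & set(flip)), "/", len(flip))
```

Ranking by training loss or picking points at random typically needs far more inspections to recover the same number of flips, which is the paper's point.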
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. 2016.
Amari, S. Natural gradient works efficiently in learning. Neural Computation, 1998.
A spherical analysis of Adam with batch normalization. arXiv preprint, 2020.
Bae, J. and Grosse, R. Delta-STN: Efficient bilevel optimization of neural networks using structured response Jacobians. NeurIPS, 2020.
Balduzzi, D., Racanière, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. ICML, 2018.
Benjamin, A. S., Rolnick, D., and Kording, K. P. Measuring and regularizing networks in function space. ICLR, 2019.
Biggio, B., Nelson, B., and Laskov, P. Support vector machines under adversarial label noise. ACML, 2011.
Cadamuro, G., Gilad-Bachrach, R., and Zhu, X. Debugging machine learning models. ICML Workshop on Reliable Machine Learning in the Wild, 2016.
Christmann, A. and Steinwart, I. On robustness properties of convex risk minimization methods for pattern recognition. JMLR, 2004.
Cohen, J., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability. ICLR, 2021.
Cook, R. D. Detection of influential observation in linear regression. Technometrics, 1977.
Cook, R. D. and Weisberg, S. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 1980.
Debruyne, M., Hubert, M., and Suykens, J. Model selection in kernel based regression using the influence function. JMLR, 2008.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ICLR, 2015.
Goodman, B. and Flaxman, S. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813, 2016.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1885-1894, 2017. arXiv:1703.04730.
Koh, P. W.*, Ang, K. S.*, Teo, H.*, and Liang, P. On the accuracy of influence functions for measuring group effects. NeurIPS, 2019.
Krause, J., Perer, A., and Ng, K. Interacting with predictions: Visual inspection of black-box machine learning models. CHI, 2016.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.
Less is better: Unweighted data subsampling via influence function. AAAI, 2020.
Liu, Y., Jiang, S., and Liao, S. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. ICML, 2014.
Lucas, J., Sun, S., Zemel, R., and Grosse, R. Aggregated momentum: Stability through passive damping. ICLR, 2019.
Ma, J., Cui, P., Kuang, K., Wang, X., and Zhu, W. Disentangled graph convolutional networks. ICML, 2019.
MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., and Grosse, R. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. ICLR, 2019.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. ICML, 2015.
Maddison, C., Paulin, D., Teh, Y.-W., O'Donoghue, B., and Doucet, A. Hamiltonian descent methods. arXiv preprint, 2018.
Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. JMLR, 2017.
Mei, S. and Zhu, X. The security of latent Dirichlet allocation. AISTATS, 2015.
Mei, S. and Zhu, X. Using machine teaching to identify optimal training-set attacks on machine learners. AAAI, 2015.
Metsis, V., Androutsopoulos, I., and Paliouras, G. Spam filtering with naive Bayes -- which naive Bayes? CEAS, 2006.
Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap framework: Good online learners are good offline generalizers. ICLR, 2021.
Ollivier, Y. Riemannian metrics for neural networks I: Feed-forward networks. Information and Inference, 2015.
Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. KDD, 2016.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. JMLR, 2019.
Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv preprint, 2016.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint, 2013.
Smith, S. L., Dherin, B., Barrett, D., and De, S. On the origin of implicit regularization in stochastic gradient descent. ICLR, 2021.
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., and Clore, J. N. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International, 2014.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. ICML, 2013.