This is a list of papers that use the Neural Tangent Kernel (NTK). In each category, papers are sorted chronologically. Some of these papers were presented in the NTK reading group held during the summer of 2019 at the University of Oxford.
We used hypothes.is to some extent; see this for an example. There are notes for a few of the papers, linked below the relevant entries.
- 2/08/2019 [notes] Neural Tangent Kernel: Convergence and Generalization in Neural Networks.
- 9/08/2019 [notes] Gradient Descent Finds Global Minima of Deep Neural Networks.
- 16/08/2019 Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks + insights from Gradient Descent Provably Optimizes Over-parameterized Neural Networks.
- 23/08/2019 On Lazy Training in Differentiable Programming
- 13/09/2019 Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
- 18/10/2019 [notes] Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
- Recent Developments in Over-parametrized Neural Networks, Part II
- An interesting, nice overview of a few things, mostly related to optimization and the NTK.
- YouTube; Simons Institute workshop.
- Part I is also interesting, but note that it covers other optimization topics for NNs, not the NTK.
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks -- link
- Notes
- 06/2018
- Original NTK paper.
- Introduces the NTK for the first time, although the proof that the limiting kernel is deterministic takes the widths of the layers to infinity sequentially, one layer at a time.
- It also proves positive definiteness of the kernel in certain regimes, which implies that gradient descent reaches a global minimum at a linear rate (a minimal sketch of the empirical NTK follows this entry).
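For concreteness, here is a minimal sketch of the *empirical* NTK, $\Theta(x, x') = \langle \partial_\theta f(x), \partial_\theta f(x') \rangle$, for a toy fully connected ReLU network in the NTK parameterization. This is an illustration in JAX, not the paper's code; all names are ours.

```python
import jax
import jax.numpy as jnp

def init_params(key, widths):
    # widths = [d_in, h_1, ..., d_out]; weights are i.i.d. N(0, 1) and the
    # 1/sqrt(fan_in) factor is applied in the forward pass (NTK parameterization).
    keys = jax.random.split(key, len(widths) - 1)
    return [jax.random.normal(k, (m, n))
            for k, m, n in zip(keys, widths[:-1], widths[1:])]

def f(params, x):
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(h @ W / jnp.sqrt(W.shape[0]))
    return (h @ params[-1] / jnp.sqrt(params[-1].shape[0])).squeeze(-1)

def empirical_ntk(params, x1, x2):
    # Jacobians of the scalar outputs w.r.t. every weight matrix, contracted
    # over all parameter axes; the result has shape (len(x1), len(x2)).
    j1 = jax.jacobian(f)(params, x1)
    j2 = jax.jacobian(f)(params, x2)
    return sum(jnp.tensordot(a, b, axes=(list(range(1, a.ndim)),
                                         list(range(1, b.ndim))))
               for a, b in zip(j1, j2))

params = init_params(jax.random.PRNGKey(0), [3, 512, 512, 1])
x = jax.random.normal(jax.random.PRNGKey(1), (5, 3))
print(empirical_ntk(params, x, x))  # 5x5 kernel matrix, PSD by construction
```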
- Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent -- link
- 02/2019
- They apparently prove that a finite learning rate (not just an infinitesimal one) is enough for the model to follow the NTK dynamics in the infinite-width limit.
- Includes experiments comparing wide networks to their linearizations (a sketch of the linearization follows this entry).
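A minimal sketch of the linearization the title refers to, $f_{lin}(\theta, x) = f(\theta_0, x) + \langle \nabla_\theta f(\theta_0, x), \theta - \theta_0 \rangle$, reusing the toy `f` from the sketch above (illustrative code, not the paper's):

```python
import jax

def linearize(f, params0):
    # First-order Taylor expansion of f around params0.
    def f_lin(params, x):
        dparams = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        y0, dy = jax.jvp(lambda p: f(p, x), (params0,), (dparams,))
        return y0 + dy
    return f_lin

# Usage with the toy model above (hypothetical names):
#   f_lin = linearize(f, params)
# Training f_lin with gradient descent is kernel regression with the empirical
# NTK; the paper shows wide networks stay close to this linear model.
```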
- On Exact Computation with an Infinitely Wide Neural Net -- link
- 04/2019
- Shows that NTKs perform somewhat worse than NNs, but not as much worse as previous work suggested.
- Claims a proof that sounds similar to those of Allen-Zhu, Du, etc., but it is not clear to us what the difference is.
- Gradient Descent Provably Optimizes Over-parameterized Neural Networks -- link
- 04/10/2018
- A preliminary version of the result in Gradient Descent Finds Global Minima of Deep Neural Networks (below), but only for two-layer neural networks.
- On the Convergence Rate of Training Recurrent Neural Networks -- link
- 29/10/2018
- See below
- A Convergence Theory for Deep Learning via Over-Parameterization -- link
- 9/11/2018
- Simplification of On the Convergence Rate of Training Recurrent Neural Networks.
- Convergence to global optima with high probability (whp) for GD and SGD.
- Works for the $\ell_2$, cross-entropy, and other losses.
- Works for fully connected networks, ResNets, ConvNets (and RNNs, in the paper above).
- Gradient Descent Finds Global Minima of Deep Neural Networks -- link
- Notes
- 9/11/2018
- Du et al.
- Convergence to global optima whp for GD with the $\ell_2$ loss.
- Width exponential in the depth is needed for fully connected networks; polynomial width suffices for ResNets.
- Width Provably Matters in Optimization for Deep Linear Neural Networks -- link
- 01/2019
- Du et al.
- Deep linear neural networks.
- Convergence to global minima assuming only a low polynomial width.
- Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? -- link
- 25/11/2018
- Results for one-hidden-layer NNs, generalized linear models, and low-rank matrix regression.
- Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel -- link
- 06/2019
- Analyzes SGD from the point of view of stochastic differential equations.
- On Lazy Training in Differentiable Programming -- link
- 12/2018
- They show that the NTK (lazy) regime can be reached by rescaling the model, and show experimentally that neural nets as used in practice perform better than their lazy-regime counterparts.
- This rescaling is independent of the width, so scaling the model seems a much easier way to reach lazy training than the infinite-width + infinitesimal-learning-rate route (a minimal sketch of the rescaling follows this entry).
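A minimal sketch of the rescaling, as we understand it (illustrative code, not the paper's): train $g_\alpha(\theta, x) = \alpha\,(f(\theta, x) - f(\theta_0, x))$ on the loss scaled by $1/\alpha^2$; as $\alpha$ grows, training stays in the lazy/linearized regime even at finite width.

```python
def lazy_rescale(f, params0, alpha):
    # Subtracting the output at initialization keeps g(theta_0, .) = 0,
    # so a large alpha does not blow up the initial loss.
    def g(params, x):
        return alpha * (f(params, x) - f(params0, x))
    return g
```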
- Disentangling feature and lazy learning in deep neural networks: an empirical study -- link
- 06/2019
- Similar to above (Chizat et al.), but more experimental.
- Kernel and deep regimes in overparametrized models -- link
- 06/2019
- Large initialization leads to the kernel/lazy regime.
- Small initialization leads to the deep/active/adaptive regime, which can sometimes lead to better generalization. They claim this is the regime that allows one to "exploit the power of depth", and is thus key to understanding deep learning.
- The systems they analyze in detail are rather simple (like matrix completion) or artificial (like a very ad hoc type of neural network).
- Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers -- link
- 11/2018
- The theorems are not based on NTKs, but the paper has experiments showing that 3-layer NNs generalize better than their corresponding NTK.
- Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks -- link
- 01/2019
- Arora et al.
- "Our work is related to kernel methods, especially recent discoveries of the connection between deep learning and kernels (Jacot et al., 2018; Chizat & Bach, 2018b;...) Our analysis utilized several properties of a related kernel from the ReLU activation."
- Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks -- link
- 02/2019
- See below
- Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks -- link
- 05/2019
- Seems very similar to the one above; is the main difference just that this paper analyzes SGD rather than GD?
- Improves on the generalization bounds for the NTK in the Arora et al. (2019) paper.
- I'd be interested in understanding the connection of their bound to classical margin and PAC-Bayes bounds for kernel regression.
- They don't show any plots demonstrating how tight their bounds are, which probably means they are vacuous...
- What Can ResNet Learn Efficiently, Going Beyond Kernels? -- link
- 05/2019
- Shows, in the PAC setting, that there are ("simple") functions that ResNets learn efficiently but on which any kernel method (in particular the NTK) incurs much greater test error at the same sample complexity.
- Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm -- link
- 05/2019
- I think that obtaining learning curves for neural nets is a very interesting challenge.
- Here they do it for kernels, but insofar as an NN behaves like a kernel, it is relevant.
- Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian -- link
- 06/2019
- Notes
- Mainly uses the NTK and splits its eigenspace in two based on a cutoff value of the eigenvalues: the projection of the residual onto the top eigenspace trains very fast, while the rest may not train at all and its loss can even increase. The trade-off is governed by the cutoff value (see the sketch after this entry).
- Two layers.
- $\ell_2$ loss.
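A minimal sketch (illustrative names, not the paper's code) of the eigenspace split: project the residual $f(X) - y$ onto the top eigenspace of the empirical kernel $J J^\top$ (eigenvalues above the cutoff) and onto its complement.

```python
import jax.numpy as jnp

def split_residual(jac, residual, cutoff):
    # jac: (n_samples, n_params) Jacobian of the network outputs w.r.t. the parameters.
    K = jac @ jac.T                          # empirical (finite-width) kernel
    eigvals, eigvecs = jnp.linalg.eigh(K)    # eigenvalues in ascending order
    top = eigvecs[:, eigvals >= cutoff]      # well-conditioned directions: fit fast
    rest = eigvecs[:, eigvals < cutoff]      # ill-conditioned directions: barely move
    return top @ (top.T @ residual), rest @ (rest.T @ residual)
```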
- Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation -- link
- 02/2019
- This paper is really cool in that it shows that most kinds of neural networks become GPs when infinitely wide. With respect to the NTK, though, it mainly gives a proof in which the layer widths can go to infinity simultaneously and generalizes it to more architectures, so it does not necessarily feel like much new insight.
- On the Inductive Bias of Neural Tangent Kernels -- link
- 05/2019
- This is just about properties of NTK (so not studying NNs directly).
- They find that the NTK model has a different type of stability to deformations of the input than other NNGPs, and better approximation properties (whatever that means).
- Approximate Inference Turns Deep Networks into Gaussian Processes -- link
- 06/2019
- Shows that Bayesian NNs of any width (under approximate inference) are equivalent to GPs, surprisingly with a kernel given by the NTK.
- Spectral Analysis of Kernel and Neural Embeddings: Optimization and Generalization -- link
- 05/2019
- They just study what happens when you use a neural-network or a kernel representation of the data (fed as input to an NN, I guess).
- Mean Field Analysis of Neural Networks: A Central Limit Theorem -- link
- 08/2018
- They only look at one hidden layer and the squared-error loss, so I'm not convinced of the novelty of the results.
- Provably Efficient $Q$-learning with Function Approximation via Distribution Shift Error Checking Oracle -- link
- 06/2019
- Not about NTK, but authors suggest it could be extended to use NTK to analyze NN-based function approximation.
- Enhanced Convolutional Neural Tangent Kernels -- link
- 11/2019
- Enhances the convolutional NTK of "On Exact Computation..." by adding some implicit data augmentation to the kernel, encoding a kind of local translation invariance and horizontal flipping (a toy sketch of the flip-symmetrization idea follows this entry).
- Their experiments show good empirical performance; in particular, they get 89% accuracy on CIFAR-10, matching AlexNet. This is the first time a kernel has reached this result.
- NTK depends on initialization.
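A toy sketch of the flip-symmetrization idea behind the implicit data augmentation, averaging a base kernel over horizontal flips, $K_{aug}(x, x') = \tfrac{1}{4}\sum_{g, g'} K(g x, g' x')$. The actual enhanced-CNTK construction (including the local translations) is more involved; `kernel_fn` is a placeholder for any base kernel on NHWC image batches.

```python
def flip_symmetrized_kernel(kernel_fn, x1, x2):
    # Identity and horizontal flip of the width axis for (N, H, W, C) batches.
    flips = [lambda x: x, lambda x: x[:, :, ::-1, :]]
    return sum(kernel_fn(g(x1), h(x2)) for g in flips for h in flips) / 4.0
```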