Reading Abstracts from NIPS/NeurIPS 2018! Here is What I Learned

Prakash Kagitha
Towards Data Science
11 min read · Dec 2, 2018


Photo by chuttersnap on Unsplash

I decided to read all the abstracts from NIPS/NeurIPS 2018. But it turned out to be implausible, both physically and mentally, in the time frame I wanted. There are 1,011 accepted papers at this year's conference, including 30 orals, 168 spotlights, and 813 posters, out of 4,856 submissions, for an acceptance rate of 20.8%. (source)

I wanted to read all the abstracts in the 24 waking hours I could get before the conference started. That gave me 1,440 minutes for 1,011 abstracts, an average of about 1.42 minutes each. Somewhat foolishly, I also wanted to condense each abstract into a mini-abstract, so that it would be easy to revisit later or to share.

I started with the first 20 abstracts from the conference's first poster session, ‘Tue Poster Session A’ (which has 168 papers). It took me a little over 210 minutes to read and summarize them (extractively, lifting pieces of each abstract), an average of 10.5 minutes per paper. Picking up the pace and worrying less about summarizing, I finished the next 20 in about 150 minutes, an average of 7.5 minutes each. The next 20 took about 90 minutes, the 20 after that about 70–80 minutes, and the next 20 about 60–70 minutes. After 140 papers I gave up on the time limit and took a break.

Nonetheless, something wonderful happened as I finished each group of 20 and moved on to the next. It is really intimidating and overwhelming to read even one dense abstract of solid research, and I had to read 20 of them and keep going. In the first 20 papers, any theory I didn't know, or topic I wasn't well versed in, would stop me from understanding what the authors were solving or the value of their solution.

But eventually I became less intimidated by the theories they used, or by the particular novelty behind their solutions, and started seeing them as the kind of inspirations or insights one finds to address a specific limitation or to extend the versatility of existing work. Reading an abstract, I found it easier to attend to the problem being solved and to the novelty, validity, and impact of the solution on the field.

Overall, I am really happy that I made myself read such an unusual number of abstracts, even though it seemed fatal in many ways!! I still want to read all the abstracts from the conference, but it could take maybe a week. I will keep you posted.

These are the must-reads from the papers I have gone through (18 papers short of the entire ‘Tue Poster Session A’), along with their abstracts. The sort-of-tags are not very good at representing these papers; they are just a human's latent perceptive overhead, sometimes better read as feelings.

Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions

FUNDAMENTALS

A novel framework for embeddings that are numerically flexible and that extend point embeddings: elliptical embeddings in the Wasserstein space. Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. The paper demonstrates the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.
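
For intuition, the metric these embeddings build on has a closed form between Gaussians. Below is a minimal numpy sketch of the 2-Wasserstein (Bures) distance between two elliptical (Gaussian) embeddings; the function and the toy embeddings are my own illustration, not code from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_distance(m1, C1, m2, C2):
    """2-Wasserstein (Bures) distance between N(m1, C1) and N(m2, C2)."""
    mean_term = np.sum((m1 - m2) ** 2)
    C1_sqrt = sqrtm(C1)
    bures = np.trace(C1 + C2 - 2 * sqrtm(C1_sqrt @ C2 @ C1_sqrt))
    return np.sqrt(max(mean_term + np.real(bures), 0.0))

# Two toy 2-D elliptical "word" embeddings (mean + covariance)
m_a, C_a = np.zeros(2), np.eye(2)
m_b, C_b = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(gaussian_w2_distance(m_a, C_a, m_b, C_b))
```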

Are GANs Created Equal? A Large-Scale Study

SYSTEMATIC EVALUATION, REALLY KNOWING

Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the art models and evaluation measures. We find that most models can reach similar scores with enough hyper-parameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in Goodfellow et al. (2014).

FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction

FUNDAMENTALS, AT THE CORE

The basic principles in designing convolutional neural network (CNN) structures for predicting objects on different levels, e.g., image-level, region-level, and pixel-level, are diverging. Generally, network structures designed specifically for image classification are directly used as default backbone structure for other tasks including detection and segmentation, but there is seldom backbone structure designed under the consideration of unifying the advantages of networks designed for pixel-level or region-level predicting tasks, which may require very deep features with high resolution. Towards this goal, we design a fish-like network, called FishNet. In FishNet, the information of all resolutions is preserved and refined for the final task. Besides, we observe that existing works still cannot \emph{directly} propagate the gradient information from deep layers to shallow layers. Our design can better handle this problem. Extensive experiments have been conducted to demonstrate the remarkable performance of the FishNet. In particular, on ImageNet-1k, the accuracy of FishNet is able to surpass the performance of DenseNet and ResNet with fewer parameters. FishNet was applied as one of the modules in the winning entry of the COCO Detection 2018 challenge. The code is available at https://github.com/kevin-ssy/FishNet.

Glow: Generative Flow with Invertible 1x1 Convolutions

PRACTICAL MAGIC, ELEGANT

A flow-based generative model with invertible 1x1 convolutions, which demonstrates a significant improvement in log-likelihood and in sample quality. Perhaps most strikingly, it shows that a generative model optimized toward the plain log-likelihood objective is capable of efficient synthesis of large and subjectively realistic-looking images.
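
The central trick is easy to sketch: a 1x1 convolution is just a per-pixel matrix multiply over channels, so its inverse and its log-determinant contribution to the likelihood are cheap to compute. A rough numpy illustration (not Glow's actual TensorFlow implementation):

```python
import numpy as np

def invertible_1x1_conv(x, W):
    """Apply a channel-mixing 1x1 convolution and return its log-det term.

    x: activations of shape (H, W, C); W: invertible (C, C) matrix.
    """
    h, w, c = x.shape
    z = x @ W.T                                 # per-pixel matrix multiply over channels
    log_det = h * w * np.linalg.slogdet(W)[1]   # each pixel contributes log|det W|
    return z, log_det

def inverse_1x1_conv(z, W):
    return z @ np.linalg.inv(W).T

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
W = np.linalg.qr(rng.normal(size=(3, 3)))[0]    # orthogonal initialization
z, log_det = invertible_1x1_conv(x, W)
assert np.allclose(inverse_1x1_conv(z, W), x)   # exactly invertible
```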

An intriguing failing of convolutional neural networks and the CoordConv solution

INTERESTING, ABOUT TIME

We have shown the curious inability of CNNs to model the coordinate transform task, shown a simple fix in the form of the CoordConv layer, and given results that suggest including these layers can boost performance in a wide range of applications. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST detection showed 24% better IOU when using CoordConv, and in the Reinforcement Learning (RL) domain agents playing Atari games benefit significantly from the use of CoordConv layers.
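
The fix itself is tiny: concatenate normalized coordinate channels to the input of a convolution. A minimal numpy sketch of that step (my own illustration, not the authors' code):

```python
import numpy as np

def add_coord_channels(x):
    """Append normalized i/j coordinate channels to a batch of feature maps.

    x: array of shape (N, H, W, C); returns (N, H, W, C + 2).
    """
    n, h, w, _ = x.shape
    ii, jj = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    coords = np.stack([ii, jj], axis=-1)              # (H, W, 2) coordinate grid
    coords = np.broadcast_to(coords, (n, h, w, 2))
    return np.concatenate([x, coords], axis=-1)

x = np.zeros((4, 32, 32, 8))
print(add_coord_channels(x).shape)   # (4, 32, 32, 10)
```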

Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

FUNDAMENTALS, UNDERSTANDING

We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.
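
The constant itself is trivial to compute for a given architecture; a small worked example, assuming the hypothetical layer widths below:

```python
# beta is the sum of reciprocals of the hidden layer widths; the variance of
# squared input-output Jacobian entries grows exponentially in beta.
beta = lambda widths: sum(1.0 / n for n in widths)

hidden_widths_wide = [512, 512, 512, 512]   # a few wide layers
hidden_widths_narrow = [32] * 64            # many narrow layers

print(beta(hidden_widths_wide))    # ~0.008 -> well-behaved gradients at init
print(beta(hidden_widths_narrow))  # 2.0    -> gradients fluctuate wildly at init
```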

A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication

PRACTICAL

The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that O(1/√MK) convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the O(1/√MK) convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%−5% communication data size.
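
To make the two communication-reducing strategies concrete, here is a toy numpy sketch of top-k sparsification followed by uniform quantization of a gradient vector. It illustrates the kind of compression analyzed in the paper, not the PQASGD algorithm itself:

```python
import numpy as np

def sparsify_and_quantize(grad, k, num_levels=16):
    """Keep the k largest-magnitude entries and quantize them to a few levels."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]            # indices of top-k entries
    values = flat[idx]
    scale = max(np.abs(values).max(), 1e-12) / (num_levels - 1)
    quantized = np.round(values / scale) * scale             # uniform quantization
    sparse = np.zeros_like(flat)
    sparse[idx] = quantized
    return sparse.reshape(grad.shape)

g = np.random.default_rng(1).normal(size=(1000,))
g_hat = sparsify_and_quantize(g, k=50)
print(np.count_nonzero(g_hat), g_hat.shape)   # only ~50 values need to be communicated
```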

Regularizing by the Variance of the Activations’ Sample-Variances

FUNDAMENTALS, NORMALIZATION

Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks. While conventional methods explicitly normalize the activations, we suggest to add a loss term instead. This new loss term encourages the variance of the activations to be stable and not vary from one random mini-batch to the next. Finally, we are able to link the new regularization term to the batchnorm method, which provides it with a regularization perspective. Our experiments demonstrate an improvement in accuracy over the batchnorm technique for both CNNs and fully connected networks.
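
A rough sketch of the flavor of such a loss term, assuming we simply penalize the gap between per-unit activation variances measured on two random mini-batches (the paper's exact formulation differs):

```python
import numpy as np

def variance_constancy_penalty(acts_batch_a, acts_batch_b):
    """Penalize differences in the activations' sample variance across mini-batches."""
    var_a = acts_batch_a.var(axis=0)   # per-unit variance over mini-batch a
    var_b = acts_batch_b.var(axis=0)   # per-unit variance over mini-batch b
    return np.mean((var_a - var_b) ** 2)

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 128))         # (batch, units) activations
b = rng.normal(size=(64, 128))
print(variance_constancy_penalty(a, b))   # added to the task loss as a regularizer
```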

Synaptic Strength For Convolutional Neural Network

SYNAPTIC PRUNING, NEUROSCIENCE

Convolutional Neural Networks (CNNs) are both computation and memory intensive, which hindered their deployment in mobile devices. Inspired by the relevant concept in neural science literature, we propose Synaptic Pruning: a data-driven method to prune connections between input and output feature maps with a newly proposed class of parameters called Synaptic Strength. Synaptic Strength is designed to capture the importance of a connection based on the amount of information it transports. Experiment results show the effectiveness of our approach. On CIFAR-10, we prune connections for various CNN models with up to 96%, which results in significant size reduction and computation saving.
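
As a toy illustration of what pruning 96% of input-output connections looks like, here is a magnitude-based sketch over a matrix of hypothetical per-connection importance scores; it is not the paper's Synaptic Strength parameterization or training procedure:

```python
import numpy as np

def prune_by_strength(strength, keep_ratio=0.04):
    """Zero out all but the strongest connections, given per-connection scores."""
    k = max(1, int(round(keep_ratio * strength.size)))
    threshold = np.sort(np.abs(strength).ravel())[-k]        # k-th largest magnitude
    mask = (np.abs(strength) >= threshold).astype(strength.dtype)
    return mask   # multiplied into the corresponding conv connections

strength = np.random.default_rng(0).normal(size=(64, 128))   # (out_ch, in_ch) scores
mask = prune_by_strength(strength, keep_ratio=0.04)           # keep ~4% of connections
print(1 - mask.mean())                                        # fraction pruned (~0.96)
```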

DropMax: Adaptive Variational Softmax

CLEAN

We propose DropMax, a stochastic version of softmax classifier which at each iteration drops non-target classes according to dropout probabilities adaptively decided for each instance. Specifically, we overlay binary masking variables over class output probabilities, which are input-adaptively learned via variational inference. This stochastic regularization has an effect of building an ensemble classifier out of exponentially many classifiers with different decision boundaries. Moreover, the learning of dropout rates for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains significantly improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout probabilities shows that our model indeed selects confusing classes more often when it performs classification.
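
A bare-bones forward pass in the spirit of DropMax, with hand-set keep probabilities standing in for the input-adaptive ones the paper learns via variational inference:

```python
import numpy as np

def dropmax_probs(logits, target, keep_prob, rng):
    """Softmax over a random subset of classes; the target class is never dropped."""
    mask = (rng.random(logits.shape) < keep_prob).astype(float)
    mask[target] = 1.0                        # always keep the true class
    exp = np.exp(logits - logits.max()) * mask
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])
keep_prob = np.array([1.0, 0.9, 0.2, 0.2])    # confusing class 1 kept often, easy ones dropped
print(dropmax_probs(logits, target=0, keep_prob=keep_prob, rng=rng))
```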

Relational recurrent neural networks

REVOLUTIONARY

Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected — i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module — a Relational Memory Core (RMC) — which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (BoxWorld & Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
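
At its core the RMC lets a fixed set of memory slots attend to one another. A single-head, numpy-only sketch of that interaction (the real module is multi-head, gated, and conditioned on the new input):

```python
import numpy as np

def memory_self_attention(memory, Wq, Wk, Wv):
    """One dot-product attention step in which memory slots attend to each other.

    memory: (num_slots, d); Wq, Wk, Wv: (d, d) projection matrices.
    """
    q, k, v = memory @ Wq, memory @ Wk, memory @ Wv
    scores = q @ k.T / np.sqrt(memory.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over slots
    return memory + weights @ v                          # residual memory update

rng = np.random.default_rng(0)
d, slots = 16, 8
mem = rng.normal(size=(slots, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
print(memory_self_attention(mem, Wq, Wk, Wv).shape)      # (8, 16)
```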

Embedding Logical Queries on Knowledge Graphs

MINORITY APPROACH

Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict “what drugs are likely to target proteins involved with both diseases X and Y?” — a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries — a flexible but tractable subset of first-order logic — on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.
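
A toy sketch of the geometric idea, with made-up relation matrices, projection as a matrix multiply, and an elementwise minimum standing in for the learned intersection operator; the paper's operators and training objective are more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
entity_emb = rng.normal(size=(1000, d))         # toy knowledge-graph entity embeddings
R_targeted_by_x = rng.normal(size=(d, d))       # one (hypothetical) matrix per relation
R_targeted_by_y = rng.normal(size=(d, d))

def project(query_emb, relation_matrix):
    """Follow a relation in embedding space (a geometric operation)."""
    return query_emb @ relation_matrix

def intersect(*branches):
    """Combine branches of a conjunctive query (crude stand-in for the learned operator)."""
    return np.minimum.reduce(branches)

# "Entities related to disease X AND disease Y", folded into a single query embedding
disease_x, disease_y = entity_emb[0], entity_emb[1]
q = intersect(project(disease_x, R_targeted_by_x),
              project(disease_y, R_targeted_by_y))
scores = entity_emb @ q                          # rank all entities against the query
print(np.argsort(-scores)[:5])                   # top-5 candidate answers
```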

Multi-Task Learning as Multi-Objective Optimization

BIG PROBLEMS

In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem because different tasks may conflict, necessitating a trade-off. A common compromise is to optimize a proxy objective that minimizes a weighted linear combination of per-task losses. However, this workaround is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multi-objective optimization literature. These algorithms are not directly applicable to large-scale learning problems since they scale poorly with the dimensionality of the gradients and the number of tasks. We therefore propose an upper bound for the multi-objective loss and show that it can be optimized efficiently. We further prove that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multi-label classification. Our method produces higher-performing models than recent multi-task learning formulations or per-task training.
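
The building block the method rests on is the min-norm combination of task gradients, which has a closed form for two tasks. A small sketch of that two-task case (the paper's contribution is an upper bound that scales this to many tasks and high-dimensional gradients):

```python
import numpy as np

def two_task_min_norm_weight(g1, g2):
    """Closed-form alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:
        return 0.5                      # gradients already agree
    alpha = ((g2 - g1) @ g2) / denom
    return float(np.clip(alpha, 0.0, 1.0))

g_task1 = np.array([1.0, 0.0])
g_task2 = np.array([0.5, 0.8])
a = two_task_min_norm_weight(g_task1, g_task2)
combined = a * g_task1 + (1 - a) * g_task2       # a descent direction for both tasks
print(a, combined)
```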

Mesh-TensorFlow: Deep Learning for Supercomputers

THE SOLUTION

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the “batch” dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing SOTA results on WMT’14 English-to-French translation task and the one-billion-word Language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh
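
A purely conceptual numpy illustration, not the Mesh-TensorFlow API, of the difference between splitting the batch dimension and splitting a model dimension across a small mesh of processors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 1024))        # (batch, d_model) activations
w = rng.normal(size=(1024, 4096))     # a large weight matrix

# Data parallelism: every processor holds all of w, but only a slice of the batch.
batch_shards = np.array_split(x, 2, axis=0)

# Model parallelism on a 2x2 mesh: split the batch along one mesh axis and the
# weight's output dimension along the other, so no processor holds all of w.
weight_shards = np.array_split(w, 2, axis=1)
partial_outputs = [[xs @ ws for ws in weight_shards] for xs in batch_shards]
print(partial_outputs[0][0].shape)    # (4, 2048): one shard per (row, col) of the mesh
```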

I couldn't stop thinking about NeurIPS!! Or writing about it, either.

Edit: I published a selection of papers from the first two poster sessions (330+ papers).
