Honglei Xie

Takeaways from the NeurIPS 2021 Conference

March 08, 2022 | 14 Minute Read

My experience at the NeurIPS 2021 conference was phenomenal, even though I had to join the Zoom sessions from a shitty quarantine hotel in Shanghai, as if I were starting a prison sentence. Anyway, below are the main takeaways from some of the sessions I attended.

Pay Attention to MLPs

Paper

They propose a simple network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in image classification (ViT) and masked language modeling (BERT). The main takeaway from their study is that self-attention is not the key to model scalability. The high-level idea is to replace self-attention with a simple gating mechanism. On ImageNet they showed that gMLP achieves performance similar to ViT in terms of parameter-accuracy efficiency, and they verified that the same conclusion holds when the classifier is trained on ImageNet-21K and fine-tuned on ImageNet-1K. In masked language modeling they showed that gMLP scales as well as Transformers, especially as the models get deeper. For NLP fine-tuning, they found that the two model families may have different strengths depending on the downstream task, but both scale equally well as the number of parameters increases. What I found interesting is their visualization of the learned “inductive biases” such as locality and spatial invariance.
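To make the gating idea concrete, here is a minimal PyTorch sketch of a gMLP block with its Spatial Gating Unit, written from my reading of the paper; the module names, dimensions, and initialization details are my own fill-ins rather than the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        # Linear projection across the *token* (spatial) dimension stands in for self-attention.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)  # near-identity init keeps early training stable
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                  # x: (batch, seq_len, dim)
        u, v = x.chunk(2, dim=-1)          # split channels into two halves
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                       # element-wise gating

class gMLPBlock(nn.Module):
    def __init__(self, dim, dim_ff, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim_ff)
        self.sgu = SpatialGatingUnit(dim_ff, seq_len)
        self.proj_out = nn.Linear(dim_ff // 2, dim)

    def forward(self, x):
        shortcut = x
        x = F.gelu(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        return shortcut + self.proj_out(x)
```

The spatial projection mixes information across tokens rather than channels, which is the part of the block that replaces self-attention.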

Out-of-Distribution Generalization in Kernel Regression

paper

Overall it’s a very theory-heavy paper. They tackle the theoretical challenge of understanding how and whether machine learning models generalize under distributional shifts, and provide a theory of out-of-distribution (OOD) generalization for kernel regression. Although kernel regression is not commonly applied in practice, it relates to deep neural networks in a certain limit (a recent body of work shows that, under gradient descent, a certain class of neural networks with a large number of hidden units approximately converges to kernel regression with the neural tangent kernel), making their results relevant to understanding OOD generalization in deep neural networks. They obtained an analytical expression for generalization in kernel regression when test data is drawn from a distribution different from the training distribution, and showed examples of how mismatched training and test distributions affect generalization. Interestingly, they demonstrated that the mismatch can sometimes improve test performance.
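Their results are analytical, but the setting is easy to poke at numerically. Below is a small toy experiment of my own (not from the paper) that trains kernel ridge regression on one input distribution and measures error on progressively shifted test distributions; the target function, kernel width, and shift schedule are arbitrary choices.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                      # ground-truth target (my choice)

x_train = rng.normal(0.0, 1.0, size=(200, 1))    # training distribution p(x)
y_train = f(x_train).ravel() + 0.1 * rng.normal(size=200)

model = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(x_train, y_train)

for shift in [0.0, 0.5, 1.0, 2.0]:               # increasingly mismatched test distribution
    x_test = rng.normal(shift, 1.0, size=(2000, 1))
    mse = np.mean((model.predict(x_test) - f(x_test).ravel()) ** 2)
    print(f"test-distribution mean shift {shift:.1f}: OOD MSE = {mse:.4f}")
```

How the error behaves depends on how much of the shifted test distribution falls where training data was dense, which is the kind of effect their theory characterizes analytically.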

Learning Debiased Representation via Disentangled Feature Augmentation

paper

This is one of my favorite oral sessions among the ones I attended, because it solves a very common yet challenging problem in the computer vision domain in a novel way. To begin with, what is dataset bias and why does it matter? When a classifier is trained on a biased dataset, it often relies heavily on the bias attributes. For example, imagine training a bird classifier on images of birds. Assume that most of the training examples show a bird flying in a blue sky, while only a few show birds in other locations such as on the ground. After training, we evaluate the classifier on a test set with a bird distribution similar to the training set, and the classifier achieves good accuracy. However, it critically fails on images where the birds are not flying in a blue sky. This happens because the blue sky acts as a bias attribute during training: since the blue sky is highly correlated with birds, the classifier learns that attribute as a shortcut instead of learning the innate features of birds, such as the shape of the wings. The challenge is that samples that do not contain the bias attribute, termed bias-conflicting samples, are scarce. Furthermore, they found that a classifier trained with an extremely scarce number of bias-conflicting samples of limited diversity may still overfit instead of learning the truly intrinsic attributes of the target class. The paper presents an empirical analysis revealing that the diversity of bias-conflicting samples is crucial for debiasing as well as for generalization! To solve the problem, they aim to “diversify” bias-conflicting samples by augmenting them at the feature level.

In detail, they set up two encoders \(E_i\) and \(E_b\) that embed an image \(x\) into an intrinsic feature vector \(z_i = E_i(x)\) and a bias vector \(z_b = E_b(x)\). They then concatenate these latent vectors into \(z = [z_i; z_b]\). Linear classifiers \(C_i\) and \(C_b\) predict the target label \(y\) from \(z\). Because the generalized cross entropy (GCE) loss pushes a classifier to capture easy-to-learn attributes, training the encoder \(E_b\) and the linear classifier \(C_b\) with the GCE loss drives them to become fully biased.

In contrast, \(E_i\) and \(C_i\) are trained with the standard cross-entropy loss. However, this scheme still relies on emphasizing the scarce bias-conflicting samples during training, so their goal in the paper is to diversify bias-conflicting samples by synthesizing new ones that carry the diverse intrinsic attributes of bias-aligned samples, obtained by swapping the latent features. Their method achieves state-of-the-art performance on every dataset against the baselines. Notably, it does not require prior knowledge of the bias types, yet it obtains competitive results against baselines that assume a certain type of bias.
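To make the idea more concrete, here is a rough PyTorch-style sketch of one training step as I understood it from the talk: train \(C_b\) with GCE so it latches onto the bias, train \(C_i\) with cross entropy, and synthesize bias-conflicting features by permuting the bias vectors across the batch. The loss weighting, the detaching choices, and the GCE hyperparameter \(q\) are my assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def generalized_ce(logits, targets, q=0.7):
    """Generalized cross entropy (1 - p_y^q) / q; amplifies easy samples so (E_b, C_b) become biased."""
    p = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p.clamp_min(1e-8) ** q) / q).mean()

def debias_step(x, y, E_i, E_b, C_i, C_b):
    z_i, z_b = E_i(x), E_b(x)                        # intrinsic and bias features
    z = torch.cat([z_i, z_b], dim=1)

    # original features: (E_i, C_i) trained with CE, (E_b, C_b) with GCE
    loss = F.cross_entropy(C_i(z), y) + generalized_ce(C_b(z), y)

    # feature swapping: pair each intrinsic vector with a permuted bias vector,
    # creating synthetic bias-conflicting samples at the feature level
    perm = torch.randperm(x.size(0), device=x.device)
    z_swap = torch.cat([z_i, z_b[perm].detach()], dim=1)
    loss = loss + F.cross_entropy(C_i(z_swap), y)
    return loss
```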

Benign Overfitting

paper1 paper2

Math-intensive alert! I also attended an interesting lecture given by Peter Bartlett on the topic of “benign overfitting”. The high-level takeaway is that in deep neural networks, prediction accuracy improves with overparameterization. Deep neural networks can be trained to zero training error (for regression loss) with near state-of-the-art performance, even on noisy problems. Therefore there is no tradeoff between fit to the training data and complexity! This may sound counter-intuitive because statistical wisdom says a prediction rule should not fit the training data too well, yet deep neural networks are trained to fit noisy data perfectly and still predict well. Intuitively, a benign-overfitting prediction rule \(\hat{f}\) decomposes as \(\hat{f} = \hat{f_0} + \Delta\), where \(\hat{f_0}\) is a simple component useful for prediction and \(\Delta\) is a spiky component that interpolates the noise but contributes little to prediction. Classical statistical learning theory applies to \(\hat{f_0}\); \(\Delta\), although useless for prediction, is benign. In the talk, benign overfitting was also characterized formally in a few settings such as linear regression, ridge regression, and adversarial examples. I am skipping the technical details here.
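As a toy illustration of the phenomenon (mine, not from the talk), the snippet below fits a heavily overparameterized random-features model to noisy labels using the minimum-norm least-squares solution: the training error is essentially zero even though the labels are noisy, and the test error against the noiseless signal is what the benign-overfitting theory characterizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100, 10, 2000                     # n samples, d inputs, p >> n random features
w_true = rng.normal(size=d)

X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy training labels

W = rng.normal(size=(d, p)) / np.sqrt(d)    # random feature map phi(x) = relu(x W)
Phi = np.maximum(X @ W, 0)
theta = np.linalg.pinv(Phi) @ y             # minimum-norm interpolating solution

print("train MSE:", np.mean((Phi @ theta - y) ** 2))            # ~0: fits the noise exactly

X_test = rng.normal(size=(5000, d))
Phi_test = np.maximum(X_test @ W, 0)
print("test MSE :", np.mean((Phi_test @ theta - X_test @ w_true) ** 2))  # error vs. the clean signal
```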

Data Centric AI workshop

Good data, bad data: lessons from a decade of crowdsourcing research

It’s clear that we need quality data more than quantity. One graph in Andrew’s talk showed that, compared to clean data, you need roughly three times as much noisy data to get the same performance out of the algorithm. A typical source of frustration in getting clean data is the wrong assumption that data annotation is a hermetically sealed, automatic process, in other words, an API. What they learned instead is that you often get higher-quality labels, at the same cost and with a more humane outcome, if you hire, train, and work with a small number of annotators rather than crowdsource. The second lesson they learned is that if the results look noisy, start debugging from the assumption that the source is your own description of the task. This yields less friction and better results.

In summary, they provided three effective approaches to get the clean data:

  • focus on training and growing a set of people
  • debug from your POV that you are the one being unclear
  • pay fairly and be a good boss

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

paper

As the name suggests, they found that some of the most common benchmark test sets in the field of machine learning contain quite a lot of label errors, and they studied the implications for machine learning. They developed algorithms for finding label errors, which were open-sourced as Cleanlab. So how exactly did they find the label errors? Step 1 is to use Cleanlab to automatically find label errors via confident learning algorithms. A few more notes on confident learning: it is a theoretically grounded algorithm that works with any model, dataset, and modality because it only takes predicted probabilities as input, making it model-agnostic. CL finds “systematic errors” rather than random label flips; for example, a fox is much more likely to be mislabeled as a dog than as a cow. Step 2 is to verify the putative label errors with humans via crowdsourcing. In the end they found label errors in ten of the most commonly benchmarked test sets across computer vision, natural language, and audio, which can be viewed here. The implication is that people traditionally choose which model to deploy based on test accuracy, but these findings suggest we need to benchmark on a corrected test set. Surprisingly, they found that lower-capacity models may be practically more useful than higher-capacity models on real-world datasets with high proportions of erroneously labeled data. They also estimated at what level of label-noise prevalence the model rankings start to change across a few popular benchmark datasets; for instance, on ImageNet, model rankings can change with just a 6% increase in noise prevalence, and our own datasets may have much higher noise prevalence than these highly curated benchmarks.
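For reference, step 1 looks roughly like the following with the open-source cleanlab package (API as of cleanlab 2.x; the scikit-learn model, the cross-validation setup, and the `X`/`labels` placeholders are mine).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# X: (n, d) feature matrix, labels: (n,) noisy integer labels -- placeholders for your data.
# Out-of-sample predicted probabilities for every example; any model works, since
# confident learning only needs pred_probs (this is what makes it model-agnostic).
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Indices of likely label errors, ranked by how confidently the given label looks wrong.
issue_idx = find_label_issues(
    labels=labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"{len(issue_idx)} putative label errors to send for human verification")
```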

Data Centric AI Competition

The Data-Centric AI competition inverts the traditional format of an AI competition and challenges participants to improve a dataset given a fixed model. For context, the fixed model in this case was a cut-off ResNet-50, and the dataset, inspired by MNIST, included 3,000 examples of handwritten Roman numerals from 1 to 10. Participants could fix incorrect labels, augment examples, and so on, as long as they did not exceed the 10K-image cap on the dataset. Just by improving the training dataset alone, participants were able to improve accuracy by over 20%.

Summarizing some techniques by winner Shashank Deshpande

Data cleaning
  • removing noisy data points
  • correcting mislabelled data points
  • clarifying ambiguous data points

This resulted in a 22% reduction in dataset size and a 9% boost over baseline model performance.

Balancing dataset distribution with clustering

They observed an imbalance between the upper- and lowercase distributions within each class; when augmented blindly, these imbalances get amplified. What they did was subdivide each case into clusters (K-means clustering on the PCA-reduced ResNet-50 embeddings) and balance each sub-cluster by augmenting images with translation, scaling, and rotation, so that each sub-case has 500 images, i.e. 1,000 per class. This resulted in a 21.4% boost over baseline model performance.
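A rough reconstruction of that balancing recipe is sketched below; the number of PCA components, the number of sub-clusters, the per-cluster target size, and the `random_affine` augmentation helper are all my placeholders rather than the winner's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def balance_case(embeddings, images, n_clusters=2, target_per_cluster=500):
    """embeddings: (n, d) ResNet-50 features for one case (upper or lower) of one class;
    images: the corresponding images."""
    reduced = PCA(n_components=50).fit_transform(embeddings)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)

    balanced = []
    for c in range(n_clusters):
        members = [img for img, cid in zip(images, cluster_ids) if cid == c]
        if not members:
            continue
        balanced.extend(members)
        # top up the cluster with random translate/scale/rotate augmentations
        for i in range(max(target_per_cluster - len(members), 0)):
            balanced.append(random_affine(members[i % len(members)]))  # hypothetical helper
    return balanced
```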

Summarizing takeaways from winner Johnson Kuan

Below are the high-level steps of Johnson Kuan’s approach, called Data Boosting:

  1. Generate a very large set of randomly augmented images from the training data (treat these as “candidates” to source from).
  2. Train an initial model and predict on the validation set.
  3. Use another pre-trained model to extract features (aka embeddings) from the validation images and the augmented images.
  4. For each misclassified validation image, retrieve the nearest neighbors (based on cosine similarity) from the set of augmented images, using the extracted features (see the sketch after this list).
  5. Add these nearest-neighbor augmented images to the training set.
  6. Retrain the model with the added augmented images and predict on the validation set.
  7. Repeat steps 4–6 until we’ve reached the limit of 10K images.
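The retrieval step at the heart of Data Boosting is essentially a nearest-neighbor lookup in embedding space. A compact sketch is below; the variable names and the choice of embedding model are placeholders, not Johnson Kuan's actual code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_augmented_neighbors(misclassified_emb, candidate_emb, k=5):
    """For each misclassified validation embedding, return the indices of the k
    most cosine-similar augmented candidates to add back into the training set."""
    sims = cosine_similarity(misclassified_emb, candidate_emb)  # (n_miss, n_cand)
    return np.argsort(-sims, axis=1)[:, :k]

# one boosting round (embeddings come from any pre-trained feature extractor):
# new_idx = set(retrieve_augmented_neighbors(val_error_emb, aug_pool_emb).ravel().tolist())
# train_images.extend(aug_pool[i] for i in new_idx)
```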

An open-source data labelling tool worth looking at: Label Studio

A Data-Centric Approach for Training Deep Neural Networks with Less Data

paper

They proposed a data optimization pipeline that investigates a dataset from various aspects, identifies invalid or mislabeled instances, and makes suggestions for addressing them. They demonstrated that the output dataset generated by their pipeline improves the accuracy by 5%, while being significantly (1.54x) smaller than the baseline. They also designed a GAN that uses feedback from the classifier as well as the discriminator to synthesize new samples which are hard to classify. Using samples generated by this GAN further improved the accuracy by 1%.
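I did not catch the exact formulation of their GAN objective, but one plausible sketch (purely a guess on my part, not the paper's method) is to reward generated samples that the discriminator accepts as real while the task classifier is maximally uncertain about them.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_images, discriminator, classifier, lam=0.5):
    """Hypothetical generator objective: fool the discriminator AND maximize the
    task classifier's predictive entropy so the synthesized samples are hard."""
    # assumes the discriminator returns (batch, 1) real/fake logits
    adv = F.binary_cross_entropy_with_logits(
        discriminator(fake_images),
        torch.ones(fake_images.size(0), 1, device=fake_images.device),
    )
    probs = F.softmax(classifier(fake_images), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    return adv - lam * entropy  # lam is an arbitrary weighting, my assumption
```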

DataPerf

Researchers from Coactive.AI, ETH Zurich, Google, Harvard University, Landing.AI, Meta, Stanford University, and TU Eindhoven proposed DataPerf, a benchmark suite for ML datasets and for the algorithms that construct or optimize such datasets (such as core-set selection or labeling-error debugging), in order to catalyze increased research focus on data quality and foster data excellence. Although models receive billions of dollars’ worth of research effort, data is treated as an afterthought, which is dangerous because bad data leads to bad models.

DataPerf has three aims:

  • Learn how to design and build better training datasets
  • Create continuously improving test sets
  • Build algorithms that automate data workflows

YMIR: A Rapid Data Development Platform for Long-tailed Vision Applications

YMIR denotes “You Mine In Recursion”. Code is available at: https://github.com/industryessentials/ymir

YMIR is a data-centric ML development platform that aims to support developing vision applications at scale. It uses data labeling, model training, and active-learning-based data mining as primitive operations for building datasets and models. It embeds version control into the iterative process to track and manage data annotations and models, much like code version control in software engineering. The platform also introduces a project concept to isolate different jobs and enable parallel development of datasets for multiple applications. Specifically, YMIR treats datasets and models as the key inputs and outputs of the primitive operations and stores the specification of each operation as metadata, so that existing tools can be quickly integrated into the platform.

Data Centric AI with Programmatic Supervision

They presented a data-centric commercial AI development platform called “Snorkel Flow”, whose goal is to programmatically label, build, and manage training data. It provides labeling functions (rules, heuristics, and other custom complex operators) via a push-button UI or a Python SDK.
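Snorkel Flow itself is commercial, but the open-source snorkel library illustrates what programmatic labeling functions look like; the toy spam example below (the label values, dataframe columns, and heuristics) is entirely my own.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # heuristic: messages with URLs are probably spam
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # heuristic: very short messages are probably fine
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df = pd.DataFrame({"text": [
    "click http://spam.example now",
    "ok see you tomorrow",
    "free $$$ at http://x.example",
    "thanks, got it",
]})

L = PandasLFApplier([lf_contains_link, lf_short_message]).apply(df)

# combine the noisy labeling-function votes into a single weak label per example
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=200)
df["weak_label"] = label_model.predict(L)
```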

Sim2Real Docs: Domain Randomization for Documents in Natural Scenes using Ray-traced Rendering

Code

Today, document images are often captured as non-professional photos under natural scene conditions, e.g. with a cell phone camera. As a result, all processing of document content (recognition, detection, or classification) needs to be invariant to natural-scene properties of image capture such as lighting variation and heterogeneous document positioning and orientation. Sim2Real Docs is a Python library that uses Blender (an open-source tool for 3D modeling and ray-traced rendering) as a backend. It synthesizes fit-for-purpose datasets and performs domain randomization of documents in natural scenes, so these simulated datasets can be used to build robust machine learning models for documents.

Example

The original W4 is on the top left; the rest are synthetic images with varying lighting, background, and camera conditions.

Typical use cases

  • Document classification
  • Logo detection
  • Text detection and recognition (OCR)
  • NLP analyses of document content

Contrastive Learning

Have I mentioned that contrastive learning is getting tons of popularity and attention? I personally have had little to no exposure to it, but I’m keen on diving deep into the technical details!