Research Program

My research interests are broadly in the intersection of optimization and machine learning. Specifically, I'm very interested in making machine learning efficient and robust. I'm interested in efficiency on multiple fronts: label efficiency (how can we learn with less labeled data), model efficiency (reducing model complexity for resource-constrained environments), and time and resource efficiency (how do we reduce end to end running time of training, and train models on resource-constrained environments). I am also interested in building intelligent systems that organize, analyze, and summarize massive amounts of data, and also automatically learn from this. Below are some of the concrete applications I'm currently very interested in working on.

  • Data Subset Selection/Coresets for efficient learning: How do we select the right sets of data making training/inference/hyperparameter tuning etc. more efficient, particularly for deep learning? I'm interested in speeding up deep learning by an order of magnitude (e.g. 5x to 10x speedup) by training on much smaller subsets without significant loss in accuracy or other evaluation metrics.

  • Active Learning for Deep Models: How do we tradeoff between uncertainty and diversity in a principled manner for active learning (i.e. iteratively selecting labeled data points) in deep learning? This is particularly important since labeled data is very time-consuming and expensive to obtain for real-world problems. We study techniques that can achieve 2x - 5x labeling cost reductions for a wide range of applications.

  • Data Programming and Weak Supervision: Using Weak Supervision to automatically create noisy labeled data for reducing labeling costs

  • Robust Learning: How do we learn machine learning models in a robust manner in the presence of noisy labels, outliers, distribution shift, and imbalance.

  • Feature selection: What are principled ways of selecting the right sets of features and how to do these in model-dependent or model-independent ways? How do we do these when eliciting features have a cost associated (e.g. in medical domains, each additional medical test might have a cost) and in an online manner.

  • Neural Network Compression and Architecture Search in Resource Constraints: How do we compress neural networks (top-down) or search for resource-constrained architectures (bottom-up) in an efficient manner?

  • Data Summarization: What makes a good summary of data and how do we consume these summaries

  • Data Partitioning: Efficient partitioning of data for clustering and distributed training

Motivating Applications

Below is more details of the applications listed above. I study each of the applications below in a broad range of domains including computer vision, video analytics, speech recognition, and natural language processing/text classification.

Data Subset Selection/Coresets for Efficient Learning and Active Learning

Selecting the right dataset for training is a critical problem today given massive datasets – both from training efficiency and labeling cost. This could be unsupervised, where we don’t have labels (select a subset of unlabeled data points for labeling) or supervised, where we have labels (for faster training or hyper-parameter tuning). In either case, we are interested in obtaining a representative subset of instances for training machine learning models. We show that the problem of selecting a subset of data with maximum likelihood on the training set is a submodular optimization problem, for several classifiers. We show that by learning of the right data subsets, we can achieve significant speedups (between 5x - 10x) with minimal loss in accuracy. We can similarly reduce the labeling costs by selecting (in an active learning manner) the right subset/batch of examples to label. We applied data selection and active learning to a number of application domains including computer vision, text classification, and speech recognition.

Data Programming

Getting high quality labelled data is very expensive, and machine learning models require massive amounts of labelled data. I am studying approaches of weak supervision for effectively learning machine learning models with very few labelled instances and a large number of unlabelled instances using noisy labels from multiple sources (semi-supervised data programming). I'm also interested in subset selection problems in this space (e.g. how do we select a subset of labeling functions for robustness, and selecting a subset of labeled instances to complement

  • Ayush Maheshwari, Oishik Chatterjee, KrishnaTeja Killamsetty, Ganesh Ramakrishnan, and Rishabh Iyer, Data Programming using Semi-Supervision and Subset Selection, To Appear in Findings of ACL, 2021 (Long Paper)

  • Atul Sahay, Anshul Nasery, Ayush Maheshwari, Ganesh Ramakrishnan, and Rishabh Iyer, Rule Augmented Unsupervised Constituency Parsing, To Appear in Findings of ACL, 2021 (Short Paper)

Robust Learning

Can we make machine learning algorithms robust to noisy labels, out of distribution samples, distribution shift and imbalance? We study this problem in various settings (supervised, semi-supervised, and few shot learning) and also study the impact of robustness in these settings. We pose this problem as a bi-level optimization, and study algorithms for solving this.

Data Summarization

I am interested in several applications of data summarization including video summarization, image collection summarization, document/text summarization and summarization of topic hierarchies. We study questions like what are natural models for summarization, how do we choose the right models for different problems/domains and how do we learn the right combinations of functions for various tasks. A lot of effort is also spent on interpretability of models, evaluation and loss functions for summarization, and at the core of it, understanding what makes a good summary for the problem at hand. We have also created new datasets for domain specific video summarization and image collection summarization. We recently released a dataset called VISIOCITY with large videos for video summarization and video understanding.

Data Partitioning

We seek to intelligently partition data for large scale distributed training, so that we can achieve superior results compared to simple random partitioning and other baselines. We demonstrate that diversified partitioning via submodular functions can achieve significant improvements on several distributed deep learning and general machine learning tasks.

Feature Selection

Feature Selection is a very important pre-processing step for machine learning and data science applications, and is used to mostly reduce prediction time and memory, feature acquisition cost, and remove noisy and irrelevant features. We study a parameterized feature selection framework using submodular functions, and particularly using a family of mutual information based models. We show how this framework can be extended to cost-aware feature elicitation.

Theoretical Advances

To solve the motivating applications listed in Thread 1, below are some of the theoretical directions I'm pursuing.


Submodular Optimization is a rich and expressive class of non-linear discrete optimization problems which generalize important combinatorial functions like set cover, facility location, log-determinants, etc. A number of applications such as data subset selection, data summarization, data partitioning, and active learning naturally involving flavors of submodular optimization. In this thread, we develop fast and scalable algorithms for a number of problems which occur in practice. Examples include submodular minimization, submodular maximization, difference of submodular optimization, submodular optimization subject to submodular constraints and ratio of submodular optimization. This framework of algorithms achieved (near) optimal approximation guarantees, while being easy to implement and scaling to massive datasets. Empirically, we demonstrated orders of magnitude speedups and our algorithms have been used in several real world applications. Our algorithms have been for several real world problems including cooperative cuts for image segmentation and cooperative matching, diffusion aware optimization, path planning, mobile crowd-sensing, trajectory optimization for aerial 3D scanning, sensor placement under cooperative costs, limited vocabulary speech data selection etc. Some relevant publications are:


While submodular optimization occurs in inference, a critical component of fitting submodular functions to machine learning applications is learning the right submodular functions. In this thread, we study a rich class of models associated with submodular functions and the associated learning problems.


This thread studies the intersection between submodular/combinatorial optimization and information theory via the study of submodular information measures. We study properties, modeling and representational power, instantiations, and applications of such measures. Examples of this includes submodular mutual information, submodular distance metrics, divergences, and multi-set submodular information measures.

  • Rishabh Iyer, Ninad Khargonkar, Jeff Bilmes, and Himanshu Asnani, Submodular Combinatorial Information Measures with Applications in Machine Learning, The 32nd International Conference on Algorithmic Learning Theory, ALT 2021.

  • Jennifer A Gillenwater, Rishabh K Iyer, Bethany Lusch, Rahul Kidambi, Jeff A Bilmes, Submodular hamming metrics, In Advances in Neural Information Processing Systems 2015

  • Vishal Kaushal, Suraj Kothawade, Ganesh Ramakrishnan, Jeff Bilmes, Himanshu Asnani, and Rishabh Iyer, A Unified Framework for Generic, Query-Focused, Privacy Preserving and Update Summarization using Submodular Information Measures, arXiv:2010.05631


A growing number of applications for efficient and robust learning involve bi-level optimization. In this thread, we study approaches for solving such bilevel optimization problems in an efficient manner, particularly for deep learning models. I'm particularly interested in bi-level optimization problems that have a discrete component in them (i.e. a mixed discrete/continuous bi-level optimization problem).