Andres Munoz Medina
Authored Publications
Abstract
We present a novel way of training models in the weakly supervised setup of learning from bags of examples with only aggregate label information. Unlike previous work, we focus on learning a classifier that can produce accurate predictions at the individual instance level, as opposed to simply matching a bag's label proportion. The main strength of our algorithm lies in the simplicity of its implementation. We demonstrate that a simple modification to the loss function can yield accurate event-level estimates. Moreover, we show that the sample complexity of doing empirical risk minimization or stochastic gradient descent with label proportions increases only by a factor of $\sqrt{k}$ compared to traditional supervised learning scenarios. Finally, we validate our theoretical results on multiple datasets and demonstrate that our proposed algorithm provides state-of-the-art performance for learning with label proportions.
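The "simple modification to the loss function" can be illustrated by moving the supervised loss from the instance level to the bag level. The sketch below uses a squared-error proportion-matching loss as one plausible choice of aggregate loss; it is not the paper's exact formulation, and all names are illustrative.

```python
import numpy as np

def bag_proportion_loss(scores, bag_ids, bag_proportions):
    """Squared error between each bag's mean predicted probability
    and its observed label proportion. One illustrative aggregate
    loss for learning from label proportions, not the paper's exact
    objective."""
    probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid of model scores
    loss = 0.0
    for bag, proportion in bag_proportions.items():
        mask = bag_ids == bag
        loss += (probs[mask].mean() - proportion) ** 2
    return loss / len(bag_proportions)
```

Because only bag-level aggregates enter the loss, this objective can be dropped into standard ERM or SGD training loops unchanged.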
Abstract
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements.
Our space is near optimal up to poly-logarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting.
Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
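The counting primitive that the reductions above target is typically realized with the classical binary-tree mechanism for continual counting. The sketch below is that standard building block, not the paper's algorithm; the noise calibration is approximate and all names are illustrative.

```python
import numpy as np

def private_prefix_counts(stream, epsilon, rng=None):
    """Binary-tree mechanism for differentially private continual
    counting: release a noisy running count after every item. Each
    item falls in at most one dyadic interval per level, so Laplace
    noise of scale (levels + 1) / epsilon per interval suffices
    (a standard, slightly loose calibration)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(stream)
    levels = max(1, int(np.ceil(np.log2(T + 1))))
    scale = (levels + 1) / epsilon
    noisy = {}  # cached noisy sums for dyadic intervals (start, size)
    out = []
    for t in range(1, T + 1):
        # Greedily decompose [0, t) into aligned dyadic intervals.
        total, start = 0.0, 0
        for bit in range(levels, -1, -1):
            size = 1 << bit
            if t - start >= size:
                key = (start, size)
                if key not in noisy:
                    noisy[key] = sum(stream[start:start + size]) \
                        + rng.laplace(0.0, scale)
                total += noisy[key]
                start += size
        out.append(total)
    return out
```

Reusing the cached interval sums is what keeps the additive error polylogarithmic in $T$ rather than linear.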
Scalable Differentially Private Clustering via Hierarchically Separated Trees
Chris Schwiegelshohn
David Saulpic
2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)
Abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear $\tilde{O}(nkd)$ time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
A Joint Exponential Mechanism for Differentially Private Top-k
Jenny Gillenwater
Monica Ribero Diaz
International Conference on Machine Learning (ICML) 2022
Abstract
We present a differentially private algorithm for releasing the sequence of $k$ elements with the highest counts from a data domain of $d$ elements. The algorithm is a ``joint'' instance of the exponential mechanism, and its output space consists of all $O(d^k)$ length-$k$ sequences. Our main contribution is a method to sample this exponential mechanism in time $O(dk\log(k) + d\log(d))$ and space $O(dk)$. Experiments show that this approach outperforms existing pure differentially private methods and often improves upon even approximate differentially private methods for moderate $k$.
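The basic primitive behind this construction is the exponential mechanism for private selection. The sketch below shows only the one-shot (top-1) version with sensitivity-1 counts; the paper's contribution is sampling the joint mechanism over all $O(d^k)$ length-$k$ sequences efficiently, which this snippet does not attempt.

```python
import numpy as np

def exponential_mechanism_top1(counts, epsilon, rng=None):
    """Select one item with probability proportional to
    exp(epsilon * count / 2), the exponential mechanism for a
    count-valued score of sensitivity 1. Illustrative primitive
    only, not the paper's joint length-k sampler."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    logits = epsilon * counts / 2.0
    logits -= logits.max()        # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(counts), p=probs)
```

Naively repeating this top-1 step $k$ times ("peeling") splits the privacy budget across rounds; sampling the whole sequence jointly is what avoids that composition cost.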
Abstract
Differentially private learning algorithms protect individual participants in the training dataset by guaranteeing that their presence does not significantly change the resulting model. In order to make this promise, such algorithms need to know the maximum contribution that can be made by a single user: the more data an individual can contribute, the more noise will need to be added to protect them. While most existing analyses assume that the maximum contribution is known and fixed in advance (indeed, it is often assumed that each user contributes only a single example), we argue that in practice there is a meaningful choice to be made. On the one hand, if we allow users to contribute large amounts of data, we may end up adding excessive noise to protect a few outliers, even when the majority contribute only modestly. On the other hand, limiting users to small contributions keeps noise levels low at the cost of potentially discarding significant amounts of excess data, thus introducing bias. Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a “sweet spot” that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data.
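The noise-versus-bias trade-off described above can be made concrete on a toy task: estimating a mean while capping each user's contribution. The helper below is hypothetical and deliberately simple (it even normalizes by a data-dependent count, which a careful DP analysis would avoid); it only illustrates how the noise scale grows with the cap while a small cap discards data.

```python
import numpy as np

def private_mean_with_contribution_cap(user_values, cap, epsilon,
                                       clip=1.0, rng=None):
    """Keep at most `cap` examples per user (discarding the rest,
    which can bias the estimate), clip values to [0, clip], then add
    Laplace noise calibrated to the per-user sensitivity cap * clip.
    Hypothetical illustration of the trade-off, not the paper's
    method."""
    rng = np.random.default_rng() if rng is None else rng
    kept = [np.clip(np.asarray(v)[:cap], 0.0, clip)
            for v in user_values]
    flat = np.concatenate(kept)
    # One user can change the sum by at most cap * clip.
    noisy_sum = flat.sum() + rng.laplace(0.0, cap * clip / epsilon)
    return noisy_sum / len(flat)
```

Raising `cap` keeps more data (less bias) but inflates the Laplace scale (more noise); the paper characterizes where the sweet spot between these effects lies.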
Adaptation Based on Generalized Discrepancy
Journal of Machine Learning Research, vol. 20 (2019), pp. 1-30
Abstract
We present a new algorithm for domain adaptation, improving upon the discrepancy minimization (DM) algorithm previously shown to outperform a number of algorithms for this problem. Unlike many previously proposed solutions for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the reweighting depends on the hypothesis sought. The algorithm is derived from a less conservative notion of discrepancy than the one used by the DM algorithm, called generalized discrepancy. We present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also give a detailed theoretical analysis of its learning guarantees, which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon discrepancy minimization.
Testing Incentive Compatibility in Display Ad Auctions
Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018
Abstract
Consider a buyer participating in a repeated auction, such as those prevalent in display advertising. How would she test whether the auction is incentive compatible? To bid effectively, she is interested in whether the auction is single-shot incentive compatible (a pure second-price auction, with fixed reserve price) and also dynamically incentive compatible (her bids are not used to set future reserve prices). In this work we develop tests based on simple bid perturbations that a buyer can use to answer these questions, with a focus on dynamic incentive compatibility.

There are many potential A/B testing setups that one could use, but we find that many natural experimental designs are, in fact, flawed. For instance, we show that additive perturbations can lead to paradoxical results, where higher bids lead to lower optimal reserve prices. We precisely characterize this phenomenon and show that reserve prices are only guaranteed to be monotone for distributions satisfying the Monotone Hazard Rate (MHR) property. The experimenter must also decide how to split traffic to apply systematic perturbations. It is tempting to have this split be randomized, but we demonstrate empirically that unless the perturbations are aligned with the partitions used by the seller to compute reserve prices, the results are guaranteed to be inconclusive.

We validate our results with experiments on real display auction data and show that a buyer can quantify both single-shot and dynamic incentive compatibility even under realistic conditions where only the cost of the impression is observed (as opposed to the exact reserve price). We analyze the cost of running such experiments, exposing trade-offs between test accuracy, cost, and underlying market dynamics.
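The single-shot flavor of these perturbation tests can be sketched as follows: in a truthful (second-price) auction, small multiplicative deviations from a truthful bid should never raise average utility. This is a heavily simplified illustration of the idea, with all names and the data layout assumed for the example.

```python
import numpy as np

def ic_envy_estimate(value, bids, competing, delta=0.05):
    """Estimate how much a buyer could gain by shading or boosting
    her bids by a factor (1 +/- delta). `competing` holds the highest
    competing bid (including the reserve) per auction; winners pay it
    under the second-price rule. A nonpositive estimate is consistent
    with single-shot incentive compatibility. Simplified sketch, not
    the paper's full test."""
    def avg_utility(mult):
        b = bids * mult
        win = b > competing
        return np.mean(np.where(win, value - competing, 0.0))
    base = avg_utility(1.0)
    return max(avg_utility(1.0 - delta) - base,
               avg_utility(1.0 + delta) - base)
```

A markedly positive estimate would flag a non-truthful pricing rule; the dynamic version additionally checks whether today's perturbations move tomorrow's reserve prices.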
Abstract
A significant part of optimizing revenue for ad auctions is setting a good reserve (or minimum) price. Set it too low, and the impression may yield little revenue; set it too high, and there may not be anyone willing to buy the item. Previous work has looked at predicting this value directly; however, the strongly non-convex objective function makes this a challenging proposition. In contrast, motivated by the fact that computing an optimal reserve price for a set of bids is easy, we propose a clustering approach: first finding a good partition of the data, and then finding an optimal reserve price for each partition. In this work, we take a major step in this direction: we derive the specific objective function that corresponds to revenue optimization in auctions, and give algorithms that optimize it.
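The observation that an optimal reserve for a fixed set of bids is easy to compute can be seen directly: under the simple "sell at price $r$ to every bid at least $r$" revenue model, revenue is maximized at one of the bids themselves, so a scan suffices. This is a sketch of that observation, not the paper's clustering algorithm.

```python
def optimal_reserve(bids):
    """Best reserve price for a fixed set of bids under the revenue
    model r * #{b >= r}: the optimum is always attained at one of
    the bid values, so checking each candidate is enough."""
    best_r, best_rev = 0.0, 0.0
    for r in sorted(set(bids)):
        rev = r * sum(1 for b in bids if b >= r)
        if rev > best_rev:
            best_r, best_rev = r, rev
    return best_r, best_rev
```

The clustering approach then partitions auctions so that each cluster's bids are homogeneous enough for this per-cluster optimum to be near optimal overall.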
Learning mobile phone battery consumptions
Ashish Sharma
Paul Eastham
Workshop on On Device Intelligence (2016)
Abstract
We introduce a novel, data-driven way of predicting the battery consumption of apps. The state-of-the-art models used to blame battery consumption on apps are based on micro-benchmark experiments. These experiments are carried out in controlled setups where one can measure how much battery is consumed by each internal resource (CPU, Bluetooth, WiFi, ...). The battery blame allocated to an app is simply the sum of the blames of the resources consumed by the app. We argue that this type of model does not capture the way phones work "in the wild" and propose instead to train a regression model using data collected from logs. We show that this type of learning is correct in the sense that, under some assumptions, we can recover the true battery discharge rate of each component. We present experimental results where we consistently make better predictions than a model trained on micro-benchmarks.
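The recovery claim above corresponds to a linear model: if each log interval records per-resource usage and the observed battery drop, the per-component discharge rates are the coefficients of a least-squares fit. A minimal stand-in, with the data layout assumed for illustration:

```python
import numpy as np

def fit_discharge_rates(usage, drain):
    """Least-squares recovery of per-component discharge rates from
    logs. Rows of `usage` are per-interval resource counters (e.g.
    CPU seconds, WiFi seconds); `drain` is the observed battery drop
    per interval. Illustrative sketch of the regression described
    above."""
    rates, *_ = np.linalg.lstsq(usage, drain, rcond=None)
    return rates
```

When the drain really is (approximately) linear in usage and the usage matrix has full column rank, the fitted coefficients are the true per-component rates, which is the identifiability condition the abstract alludes to.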
Abstract
We analyze the problem of linear bandits under heavy tailed noise and provide favorable regret guarantees.
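A standard tool for mean estimation under heavy-tailed noise, and a natural ingredient in this setting, is the median-of-means estimator; the sketch below shows that generic estimator only, not the paper's bandit algorithm.

```python
import numpy as np

def median_of_means(samples, k=8):
    """Median-of-means: split the samples into k blocks, average each
    block, and return the median of the block means. Robust to heavy
    tails where the plain sample mean is not."""
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(samples, k)
    return float(np.median([b.mean() for b in blocks]))
```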
Adaptation algorithm and theory based on generalized discrepancy
Proceedings of the 21st ACM Conference on Knowledge Discovery and Data Mining (KDD 2015)
Abstract
We present a new algorithm for domain adaptation, improving upon a discrepancy minimization algorithm (DM) previously shown to outperform a number of algorithms for this problem. Unlike many previously proposed solutions for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the reweighting depends on the hypothesis sought. The algorithm is derived from a less conservative notion of discrepancy than the DM algorithm; we call this quantity generalized discrepancy. We present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also give a detailed theoretical analysis of its learning guarantees which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon discrepancy minimization in several tasks.