Silvio Lattanzi


Silvio received his bachelor's (2005), master's (2007), and PhD (2011) degrees from the Computer Science department of Sapienza University of Rome, under the supervision of Alessandro Panconesi. He joined Google Research in the New York office in January 2011, and in April 2017 he moved to Google Research Zurich.
Authored Publications
    Sublinear Time Oracles for Hierarchical Clustering
    Aidasadat Mousavifar
    Akash Kumar
    Michael Kapralov
    SODA 2023 (to appear)
    Abstract: Learning graph cluster structure using few queries is a classical question in property testing, with the fundamental special case, namely expansion testing, considered in the seminal work of Goldreich and Ron [STOC'96]. The most recent result in this line of work, due to Gluch et al. [SODA'21], designs {\em clustering oracles} for $(k, \epsilon)$-clusterable graphs. These oracles, given a graph whose vertex set can be partitioned into a disjoint union of $k$ clusters (i.e., good expanders) with outer conductances bounded by $\epsilon \ll 1$, provide query access to an $O(\epsilon \log k)$-approximation to this ground truth clustering in time $\approx 2^{\text{poly}(k/\epsilon)} n^{1/2+O(\epsilon)}$ per query. In this paper we show that it is possible to learn the {\em hierarchical} cluster structure of $(k, \epsilon)$-clusterable graphs in sublinear time. First, we show how to simulate the hierarchical clustering algorithm of Charikar and Chatziafratis [SODA'17] to approximate the Dasgupta cost of a $k$-clusterable graph to within a factor of $O(\sqrt{\log k})$ in $\approx \text{poly}(k)\cdot n^{1/2+O(\epsilon)}$ time, assuming oracle access to the clustering. Second, we introduce a natural hierarchical model of clusterable graphs, and give a bona fide clustering oracle for this model, i.e. a small space data structure that can answer hierarchical clustering queries in $\approx\text{poly}(k) \cdot n^{1/2+O(\epsilon)}$ time per query. Notably, in both cases the query time depends polynomially on the number $k$ of clusters. The second result is the main technical contribution of the paper, and relies on several structural properties of hierarchically clusterable graphs that we hope will be of independent interest in sublinear time spectral graph algorithms.
    Abstract: In the correlation clustering problem the input is a signed graph where the sign indicates whether each pair of points should be placed in the same cluster or not. The goal of the problem is to compute a clustering which minimizes the number of disagreements with these recommendations. Thanks to its many practical applications, correlation clustering is a fundamental unsupervised learning problem and has been extensively studied in many different settings. In this paper we study the problem in the classic online setting with recourse: the vertices of the graph arrive in an online manner and the goal is to maintain an approximate clustering while minimizing the number of times each vertex changes cluster. Our main contribution is an algorithm that achieves logarithmic recourse per vertex in the worst case. We also complement this result with a tight lower bound. Finally we show experimentally that our algorithm achieves better performance than state-of-the-art algorithms on real-world data.
    Abstract: We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
    Abstract: Maximizing a monotone submodular function is a fundamental task in machine learning. In this paper we study the deletion robust version of the problem under the classic matroid constraint. Here the goal is to extract a small size summary of the dataset that contains a high value independent set even after an adversary deletes some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid and the number $d$ of deleted elements. In the centralized setting we present a $(3.582+O(\epsilon))$-approximation algorithm with summary size $O(k + \frac{d \log k}{\epsilon^2})$. In the streaming setting we provide a $(5.582+O(\epsilon))$-approximation algorithm with summary size and memory $O(k + \frac{d \log k}{\epsilon^2})$. We complement our theoretical results with an in-depth experimental analysis showing the effectiveness of our algorithms on real-world datasets.
    Efficient and Stable Fully Dynamic Facility Location
    Nikos Parotsidis
    Sayan Bhattacharya
    NeurIPS (2022) (to appear)
    Abstract: We consider the classic facility location problem in fully dynamic data streams, where elements can be both inserted and deleted. In this problem, one is interested in maintaining a stable and high quality solution throughout the data stream while using only a small amount of time per update (insertion or deletion). We study the problem and provide the first algorithm that simultaneously maintains a constant approximation, uses near-linear time per update, and incurs polylogarithmic recourse per update. We complement our theoretical results with an experimental analysis showing the practical efficiency of our method.
    Active Learning of Classifiers with Label and Seed Queries
    Andrea Paudice
    Marco Bressan
    Maximilian Thiessen
    Nicolo Cesa-Bianchi
    NeurIPS 2022 (to appear)
    Abstract: We study exact active learning of binary and multiclass classifiers with margin. Given an $n$-point set $X \subset \mathbb{R}^m$, we want to learn any unknown classifier on $X$ whose classes have finite \emph{strong convex hull margin}, a new notion extending the SVM margin. Using the powerful \emph{seed} queries (a variant of equivalence queries), the target classifier could be learned in $O(m \log n)$ queries via Littlestone's Halving algorithm; however, Halving is computationally inefficient. In this work we show that, by carefully combining the two types of queries, a binary classifier can be learned in time $\mathrm{poly}(n+m)$ using only $O(m^2 \log n)$ label queries and $O\big(m \log \frac{m}{\gamma}\big)$ seed queries; the result extends to $k$-class classifiers at the price of a $k!k^2$ multiplicative overhead. Similar results hold when the input points have bounded bit complexity, or when only one class has strong convex hull margin against the rest. We complement these upper bounds by showing that in the worst case any algorithm needs $\Omega\big(\frac{k m \log (1/\gamma)}{\log m}\big)$ seed and label queries to learn a $k$-class classifier with strong convex hull margin $\gamma$.
    The Gibbs--Rand Model
    Flavio Chierichetti
    PODS 2022 (to appear)
    Abstract: Thanks to its many applications, the clustering ensemble problem has been extensively studied in the past. In this problem we are given in input $m$ clusterings and the objective is to output a clustering that ``well-represents'' all the input clusterings. In this paper, we propose, to the best of our knowledge, the first generative model for the problem. Our model is parameterized by a ``center'' clustering and a scale; the probability of a particular clustering is an exponential function of its Rand distance to the center, scaled. For this new model, we show: (i) a sampling algorithm that runs in polynomial time when the center has a constant number of clusters and (ii) a simple polynomial time reconstruction algorithm when the scale is small.
    Correlation Clustering in Constant Many Parallel Rounds
    Ashkan Norouzi Fard
    Jakub Tarnawski
    Nikos Parotsidis
    Slobodan Mitrović
    ICML (2022) (to appear)
    Abstract: Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental evaluation of our techniques.
    Near-Optimal Correlation Clustering with Privacy
    Ashkan Norouzi Fard
    Chenglin Fan
    Jakub Tarnawski
    Nikos Parotsidis
    Slobodan Mitrović
    NeurIPS 2022 (to appear)
    Abstract: Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labeling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our additive error is stronger than the one shown in prior work and is optimal up to polylogarithmic factors for fixed privacy parameters.
    Abstract: Designing algorithms for machine learning problems beyond worst-case analysis and, in particular, analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic $k$-means clustering problem in the presence of noisy labels. In this problem we receive as input a set of points and a set of clustering labels generated by either an adversarial or random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the $k$-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor $1+o(1)$ of optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that side-information with adversarial perturbations is as hard as the original problem even if only a small fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb a fraction of each cluster.
    Secretaries with Advice
    Paul Duetting
    Proceedings of the 22nd ACM Conference on Economics and Computation (EC'21) (2021), pp. 409-429
    Abstract: The secretary problem is probably the purest model of decision making under uncertainty. In this paper we ask: what advice can we give the algorithm to improve its success probability? We propose a general model that unifies a broad range of problems: from the classic secretary problem with no advice, to the variant where the quality of a secretary is drawn from a known distribution and the algorithm learns each candidate's quality quantile on arrival, to more modern ML-based versions of advice where a binary classifier gives us noisy advice about whether or not the current secretary is the best on the market. Our main technique is a factor revealing LP that captures all of the problems above. We use this LP formulation to gain structural insight into the optimal policy and present two case studies: a re-derivation of the classic known distributions result with tools from linear programming and a tight analysis of the noisy binary advice model.
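    For intuition, here is a minimal simulation of the no-advice baseline that the LP framework generalizes: the classic 1/e observation rule. The function name and constants are illustrative, not taken from the paper.

    import random

    def classic_secretary(values, seed=None):
        # Classic 1/e rule: observe the first ~n/e candidates without hiring,
        # then hire the first candidate who beats everything seen so far.
        rng = random.Random(seed)
        candidates = values[:]
        rng.shuffle(candidates)            # uniformly random arrival order
        n = len(candidates)
        cutoff = int(n / 2.718281828)      # observation phase of length ~n/e
        best_seen = max(candidates[:cutoff], default=float("-inf"))
        for v in candidates[cutoff:]:
            if v > best_seen:
                return v                   # hire the first record-breaker
        return candidates[-1]              # forced to hire the last candidate

    # The rule hires the best candidate with probability ~1/e (about 0.368).
    values = list(range(100))
    wins = sum(classic_secretary(values, seed=s) == 99 for s in range(10000))
    print(wins / 10000)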
    Online Facility Location with Multiple Advice
    Alessandro Panconesi
    Flavio Chierichetti
    Giuseppe Re
    Matteo Almanza
    NeurIPS 2021 (to appear)
    Abstract: Clustering is a central topic in unsupervised learning and its online formulation has received a lot of attention in recent years. In this paper, we study the classic facility location problem in the presence of multiple machine-learned advice. We design an algorithm with provable performance guarantees such that, if the advice is good, it outperforms the best known online algorithms for the problem, and if it is bad it still matches their performance. We complement our theoretical analysis with an in-depth study of the performance of our algorithm, showing its effectiveness on synthetic and real-world data sets.
    Twin Peaks, a Stochastic Model for Recurring Cascades
    Alessandro Panconesi
    Giuseppe Re
    Matteo Almanza
    WWW 2021 (to appear)
    Abstract: Understanding information dynamics and their resulting cascades is a central topic in social network analysis. In a recent seminal work, Cheng et al. analyzed multiple cascades on Facebook over several months, and noticed that many of them exhibit a recurring behaviour: they tend to have multiple peaks of popularity, with periods of quiescence in between. In this paper, we propose the first mathematical model that provably explains this interesting phenomenon. Our model is simple and shows that it is enough to have a good clustering structure to observe this interesting recurring behaviour with a standard information diffusion model. Furthermore, we complement our theoretical analysis with an experimental evaluation where we show that our model is able to reproduce the observed phenomenon on several social networks.
    Spectral Clustering Oracles in Sublinear Time
    Aidasadat Mousavifar
    Christian Alexander Sohler
    Grzegorz Gluch
    Michael Kapralov
    SODA 2021 (to appear)
    Abstract: Given a graph G that can be partitioned into k clusters, can we efficiently construct a small space data structure that allows quickly classifying vertices of G according to the cluster they belong to? Formally, if G can be partitioned into k disjoint expanders with outer conductance upper bounded by ε ∈ (0, 1), how efficient can such a data structure be? In this paper we show that surprisingly efficient and robust data structures exist. In particular we prove that storing approximations to a small number of random walks from a few random nodes in the graph G allows one to classify vertices according to the cluster structure of G using only poly(k, log n) · n^(1/2+O(ε)) time per vertex, i.e. in sublinear time! This runtime is optimal for small constant values of ε and k [Chiplunkar et al., FOCS'18]. To the best of our knowledge, our result is the first spectral clustering algorithm that allows classification in sublinear time even when the outer conductance of the partitions is only a small constant (as opposed to vanishingly small).
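    A toy sketch of the walk-based classification primitive the abstract describes, assuming one known seed node per cluster and `adj` mapping node -> list of neighbors; the actual oracle stores compressed approximations of these walk distributions and uses calibrated thresholds, so this is only an illustration.

    import random

    def walk_distribution(adj, start, t, n_walks, rng):
        # Empirical distribution of t-step random walk endpoints from `start`;
        # a stand-in for the approximate walk statistics the oracle stores.
        counts = {}
        for _ in range(n_walks):
            v = start
            for _ in range(t):
                v = rng.choice(adj[v])
            counts[v] = counts.get(v, 0) + 1
        return {v: c / n_walks for v, c in counts.items()}

    def classify(adj, query, seeds, t=10, n_walks=200, seed=0):
        # Assign `query` to the cluster of the seed whose endpoint
        # distribution has the largest inner product with the query's.
        rng = random.Random(seed)
        q = walk_distribution(adj, query, t, n_walks, rng)
        scores = {}
        for label, s in seeds.items():
            d = walk_distribution(adj, s, t, n_walks, rng)
            scores[label] = sum(p * d.get(v, 0.0) for v, p in q.items())
        return max(scores, key=scores.get)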
    Consistent k-Clustering for General Metrics
    Ashkan Norouzi Fard
    Hendrik Fichtenberger
    Ola Svensson
    SODA 2021 (to appear)
    Abstract: Given a stream of points in a metric space, is it possible to maintain a constant approximate clustering by changing the cluster centers only a small number of times during the entire execution of the algorithm? This question received attention in recent years in the machine learning literature and, before our work, the best known algorithm performs $\widetilde{O}(k^2)$ center swaps (the $\widetilde{O}(\cdot)$ notation hides polylogarithmic factors in the number of points $n$ and the aspect ratio $\Delta$ of the input instance). This is a quadratic increase compared to the offline case --- the whole stream is known in advance and one is interested in keeping a constant approximation at any point in time --- for which $\widetilde{O}(k)$ swaps are known to be sufficient and simple examples show that $\Omega(k \log(n \Delta))$ swaps are necessary. We close this gap by developing an algorithm that, perhaps surprisingly, matches the guarantees in the offline setting. Specifically, we show how to maintain a constant-factor approximation for the $k$-median problem by performing an optimal (up to polylogarithmic factors) number $\widetilde{O}(k)$ of center swaps. To obtain our result we leverage new structural properties of $k$-median clustering that may be of independent interest.
    Abstract: As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, standard metric formulations as hierarchical $k$-center, $k$-means, and $k$-median received a lot of attention and the problems have been studied extensively in different computation models. Despite all this interest, not many efficient parallel algorithms are known for these problems. In this paper we introduce a new parallel algorithm for the Euclidean hierarchical $k$-median problem that, using machines with memory $s$ (for $s\in \Omega(\log^2 (n+\Delta+d))$), runs in $O\left(\log_{s} (nd) \right)$ rounds, where $d$ is the dimension of the data set and $\Delta$ is a polynomial upper-bound on the ratio between the maximum and minimum distance of two points in the input dataset. To the best of our knowledge, this is the first \emph{parallel} algorithm for the hierarchical $k$-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.
    Exact Recovery of Clusters in Finite Metric Spaces using Oracle Queries
    Andrea Paudice
    Marco Bressan
    Nicolo Cesa-Bianchi
    COLT 2021 (to appear)
    Abstract: We investigate the problem of exact cluster recovery using oracle queries. In Euclidean spaces, clusters that are convex and separated with a margin can be reconstructed exactly using only $O(\log n)$ same-cluster queries, where $n$ is the number of input points. In this work, we study the problem in the more challenging non-convex setting. In particular, we introduce a structural characterization of clusters, called $(\beta,\gamma)$-density, that can be applied to any finite set of points equipped with a metric (or even a semimetric, as the triangle inequality is not needed). Using $(\beta,\gamma)$-density, we can translate natural density properties of clusters (which include, for instance, clusters that are strongly non-convex in $\mathbb{R}^d$) into a graph-theoretic notion of convexity. By exploiting this convexity notion, we design a deterministic algorithm that recovers $(\beta,\gamma)$-dense clusters using roughly $O(k^2 \log n + k^2 p^*)$ same-cluster queries, where $k$ is the number of clusters and $p^*$ is an upper bound on the packing number of the (semi)metric. We show that the dependence on the packing number is necessary, and we also show that, if we are allowed to make $O(k^2)$ additional queries to a ``cluster separation'' oracle, then we can recover clusters that have different and arbitrary density scales, even when the scale of each cluster is unknown.
    Fully Dynamic Algorithm for Constrained Submodular Optimization
    Ashkan Norouzi Fard
    Jakub Tarnawski
    Slobodan Mitrović
    NeurIPS 2020 (to appear)
    Abstract: The task of maximizing a monotone submodular function under a cardinality constraint is at the core of many machine learning and data mining applications, including data summarization, sparse regression and coverage problems. We study this problem in the context of fully dynamic streams, where elements can be both inserted and removed. Our main result is a randomized algorithm that maintains an efficient data structure with a poly-logarithmic amortized update time and returns a $(1/2 - \epsilon)$-approximate solution. We complement our theoretical analysis with an empirical study of the performance of our algorithm.
    Abstract: In this paper, we provide efficient approximation algorithms for finding the maximum a posteriori (MAP) configuration of size $k$ for Determinantal Point Processes (DPP) in the online and streaming settings where the data points arrive in an arbitrary order. More specifically, in the online setting, where the algorithm cannot discard the selected elements from its local memory, our Online-DPP algorithm achieves an $O(k^{O(k)})$ multiplicative approximation with $\eta$ additive error, using memory independent of the number of points. In the streaming setting, where the algorithm is allowed to discard the previously selected elements, our Stream-DPP algorithm achieves an $O(k^{O(k)})$ multiplicative approximation (and no additive error), with similar memory bounds. We note that the exponential dependence on $k$ in the approximation factor is unavoidable even in the offline setting.
    Abstract: The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
    Online Scheduling via Learned Weights
    Benjamin Moseley
    Thomas Lavastida
    SODA 2020 (to appear)
    Abstract: The use of machine learning and predictive models has produced a revolution in science and engineering. Online optimization problems are a natural setting where predictions can be used to manage uncertainty and improve performance. This paper studies how predictive techniques can be used to break through worst case barriers in online scheduling. The makespan minimization problem on unrelated machines and its special case, restricted assignment, are classic problems in online scheduling theory. Worst case analysis of these problems yields Ω(log m) lower bounds on the competitive ratio in the online setting. In this paper we construct non-trivial predictions for these problems and design algorithms that utilize these predictions to compute solutions online. Our predictions are compact in size, having dimension linear in the number of machines. The performance guarantees of our algorithms depend on the accuracy of the predictions, and moderately accurate predictions allow our techniques to beat the worst case lower bounds. More precisely, the predictions can be used to construct O(log η)-competitive fractional assignments. We then give an online algorithm that is O(poly(log log(m)))-competitive to round these fractional assignments.
    Abstract: We present the first distributed approximation algorithm for the Euclidean $k$-median problem with an optimal trade-off between memory usage and the number of parallel rounds. Our algorithm works even in the setting where each machine has very limited memory $s\in \Omega(\log n)$, and it is work-efficient. In the future, it would be interesting to obtain similar results for other clustering problems and to improve the approximation factor of our algorithm.
    Abstract: The desire to use machine learning to assist in human decision making has spawned a large area of research in understanding the impact of such systems not only on society as a whole, but also on specific subpopulations. Recent work has shown that while there are several natural ways to quantify the fairness of a particular system, none of them are universal, and except for trivial cases, satisfying one means violating another (Kleinberg et al.; Goel et al.). In parallel to understanding the interplay between different definitions of fairness, researchers have looked to develop algorithms for finding fair solutions to specific classes of problems. For example, there is recent work on fair regression, fair clustering, fair bandit algorithms and many others. In this work we tackle another large class of problems that form a useful primitive in machine learning and optimization, that of optimizing a set function subject to a set of matroid constraints. A matroid is a combinatorial object that generalizes linear independence between vectors and is general enough to encode cardinality constraints (e.g. selecting at most $k$ elements), connectivity constraints (e.g. selecting a spanning tree of a graph), or matching constraints (ensuring a subgraph has a perfect matching).
    Abstract: We propose online algorithms for Column Subset Selection (CSS) and Principal Component Analysis (PCA), two methods that are widely employed for data analysis, summarization, and visualization. Given a data matrix A that is revealed one column at a time, the online CSS problem asks to keep a small set of columns, S, that best approximates the space spanned by the columns of A. As each column arrives, the algorithm must irrevocably decide whether to add it to S, or to ignore it. In the online PCA problem, the goal is to output a projection of each column to a low dimensional subspace. In other words, the algorithm must provide an embedding for each column as it arrives, which cannot be changed as new columns arrive. While both of these problems have been studied in the online setting, only additive approximations were known prior to our work. The core of our approach is an adaptive sampling technique that gives a practical and efficient algorithm for both of these problems. We prove that by sampling columns using their “residual norm” (i.e. their norm orthogonal to directions sampled so far), we end up with a significantly better dependence between the number of columns sampled, and the desired error in the approximation. We further show how to combine our algorithm “in series” with prior algorithms. In particular, using the results of Boutsidis et al. [4] and Frieze et al. [14] that have additive guarantees, we show how to improve the bounds on the error of our algorithm.
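    A compact sketch of the residual-norm sampling idea, not the paper's exact procedure; the probability rule and the parameter `eps` below are our own simplifications for illustration.

    import numpy as np

    def online_css(columns, eps=0.25, seed=0):
        # Keep each arriving column with probability proportional to the
        # squared norm of its component orthogonal to the span kept so far.
        rng = np.random.default_rng(seed)
        Q = None                    # orthonormal basis of the selected columns
        selected = []
        for j, a in enumerate(columns):
            residual = a if Q is None else a - Q @ (Q.T @ a)
            r2 = float(residual @ residual)
            a2 = float(a @ a)
            p = min(1.0, r2 / (eps * a2)) if a2 > 0 else 0.0
            if rng.random() < p:    # irrevocable decision, as the model demands
                selected.append(j)
                q = residual / np.sqrt(r2)
                Q = q[:, None] if Q is None else np.column_stack([Q, q])
        return selected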
    Abstract: Modern online learning algorithms achieve low (sublinear) regret in a variety of diverse settings. These algorithms, however, update their solution at every time step. While these updates are computationally efficient, the very requirement of frequent updates makes the algorithms untenable in some practical applications. In this work we look for online learning algorithms that update a sublinear number of times. We give a meta algorithm based on non-homogeneous Poisson Processes that gives a smooth trade-off between final regret and frequency of updates. Empirically, we show that in many cases we can significantly reduce the number of updates at a minimal increase in regret.
    Abstract: Clustering large-scale networks is a central topic in unsupervised learning with many applications in machine learning and data mining. A classic approach to cluster a network is to identify regions of high edge density. In the literature this approach is captured by two fundamental problems: the densest subgraph and the k-core decomposition problems. We design massively parallel computation (MPC) algorithms for these problems that are considerably faster than prior work. In the case of k-core decomposition, our work improves exponentially on the algorithm provided by Esfandiari et al. (ICML’18). Compared to the prior work on densest subgraph presented by Bahmani et al. (VLDB’12, ’14), our result requires quadratically fewer MPC rounds. We complement our analysis with an experimental scalability analysis of our techniques.
    Abstract: In this paper, we develop a new variant of $k$-means++ seeding that in expectation achieves a constant approximation guarantee. We obtain this result by a simple combination of $k$-means++ sampling with a local search strategy. We evaluate our algorithm empirically and show that it also improves the quality of a solution in practice.
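    A minimal sketch of the combination the abstract describes: k-means++ D²-sampling for seeding, followed by a few local-search swaps whose candidates are also D²-sampled. The swap budget and stopping rule below are illustrative, not the paper's.

    import numpy as np

    def kmeanspp_local_search(X, k, swaps=25, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)

        def d2(centers):
            # squared distance of every point to its closest center
            diff = X[:, None, :] - X[centers][None, :, :]
            return np.min((diff ** 2).sum(-1), axis=1)

        # k-means++ seeding: each new center sampled proportionally to D^2
        centers = [int(rng.integers(n))]
        for _ in range(k - 1):
            w = d2(centers)
            centers.append(int(rng.choice(n, p=w / w.sum())))

        # local search: sample a candidate by D^2, swap it in if the cost drops
        cost = d2(centers).sum()
        for _ in range(swaps):
            w = d2(centers)
            cand = int(rng.choice(n, p=w / w.sum()))
            for i in range(k):
                trial = centers[:i] + [cand] + centers[i + 1:]
                trial_cost = d2(trial).sum()
                if trial_cost < cost:
                    centers, cost = trial, trial_cost
                    break
        return X[centers], cost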
    Abstract: The Massive Parallel Computing (MPC) model gained popularity during the last decade and it is now seen as the standard model for processing large scale data. One significant shortcoming of the model is that it assumes static datasets while, in practice, real world datasets evolve continuously. To overcome this issue, in this paper we initiate the study of dynamic algorithms in the MPC model. We first discuss the main requirements for a dynamic parallel model and we show how to adapt the classic MPC model to capture them. Then we analyze the connection between classic dynamic algorithms and dynamic algorithms in the MPC model. Finally, we provide new efficient dynamic MPC algorithms for a variety of fundamental graph problems, including connectivity, minimum spanning tree and matching.
    Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives
    Michele Borassi
    Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)
    Abstract: The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic space. In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties. Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. Then we instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the-art results obtained using problem-specific algorithms.
    Abstract: Any streaming algorithm is commonly judged by the quality of its solution, the memory footprint, and its required amount of computations. In this paper, we study the problem of maximizing a monotone submodular function subject to a cardinality constraint k. We propose Sieve-Streaming++, a streaming algorithm that with one pass over the data, while keeping only O(k) elements, achieves the tight 1/2 approximation guarantee. The best previously known streaming algorithms either achieve a suboptimal approximation ratio 1/4 with Θ(k) memory or the tight approximation ratio 1/2 with O(k log k) memory. We then show that by buffering a small fraction of the data stream, and through a careful filtering procedure, one can heavily reduce the rounds of adaptive computations, and thus lower the computational complexity of Sieve-Streaming++ substantially. We then generalize our results to the more challenging multi-source streaming setting. We show how one can achieve the tight 1/2 approximation guarantee with O(k) shared memory, while minimizing not only the rounds of computations but also the total number of communicated bits. Finally, we demonstrate the efficiency of our algorithms on real-world applications including data summarization for multi-source streams of tweets and of YouTube videos.
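    For context, here is a sketch of the classic threshold sieve (in the style of Badanidiyuru et al.) that Sieve-Streaming++ refines: it keeps one partial solution per geometric guess of OPT, which is exactly the O(k log k) memory the paper reduces to O(k). Details such as eps are illustrative.

    import math

    def sieve_streaming(stream, f, k, eps=0.1):
        # One pass; one candidate solution per geometric guess of OPT.
        solutions = {}      # threshold guess -> partial solution
        m = 0.0             # best singleton value seen so far
        for e in stream:
            m = max(m, f([e]))
            if m == 0:
                continue
            lo = math.ceil(math.log(m, 1 + eps))
            hi = math.floor(math.log(2 * k * m, 1 + eps))
            guesses = [(1 + eps) ** i for i in range(lo, hi + 1)]
            solutions = {v: solutions.get(v, []) for v in guesses}
            for v, S in solutions.items():
                gain = f(S + [e]) - f(S)
                # keep e if its gain clears the per-slot quota for guess v
                if len(S) < k and gain >= (v / 2 - f(S)) / (k - len(S)):
                    S.append(e)
        return max(solutions.values(), key=f, default=[])

    # Example with a monotone submodular coverage function over a stream of sets:
    f = lambda S: len(set().union(*S)) if S else 0
    print(sieve_streaming([{1, 2}, {2, 3}, {4}, {1, 2, 3, 4}], f, k=2))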
    A Framework for Parallelizing Hierarchical Clustering Methods
    Benjamin Moseley
    Kefu Lu
    Thomas Lavastida
    ECML PKDD 2019
    Abstract: Hierarchical clustering is a widely used tool in machine learning; several sequential hierarchical clustering algorithms are known and well-studied. These algorithms include top down divisive approaches such as bisecting $k$-means, $k$-median or $k$-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. As data sets are becoming larger, these sequential methods are becoming ineffective. In this paper we develop distributed algorithms for bisecting $k$-means, $k$-median and $k$-center as well as centroid-linkage. We first establish strong worst case bounds on the running time of the methods. Then we show experimentally that the algorithms also have good performance in practice.
    Mallows Models for Top-k Lists
    Flavio Chierichetti
    Anirban Dasgupta
    Shahrzad Haddadan
    NeurIPS (2018)
    Abstract: The classic Mallows model is a widely-used tool to realize distributions on permutations. Motivated by common practical situations, in this paper, we generalize Mallows to model distributions on top-k lists by using a suitable distance measure between top-k lists. Unlike many earlier works, our model is both analytically tractable and computationally efficient. We demonstrate this by studying two basic problems in this model, namely, sampling and reconstruction, from both algorithmic and experimental points of view.
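    For the classic model being generalized, sampling can be done with the standard repeated-insertion method; a minimal sketch is below. The paper's top-k variant uses a different distance, so treat this only as the starting point.

    import random

    def sample_mallows(center, phi, rng=random):
        # Sample from the classic Mallows model around `center` with dispersion
        # phi in (0, 1], via repeated insertion: inserting the i-th item at
        # index j creates i - j inversions, so that position gets weight phi^(i-j).
        perm = []
        for i, item in enumerate(center):
            weights = [phi ** (i - j) for j in range(i + 1)]
            j = rng.choices(range(i + 1), weights=weights)[0]
            perm.insert(j, item)
        return perm

    # phi -> 0 concentrates on the center; phi = 1 is uniform over permutations.
    print(sample_mallows(list(range(10)), phi=0.3))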
    Abstract: We consider the problems of sparse regression and column subset selection under $\ell_1$ error. For both problems, we show that in the non-negative setting it is possible to obtain tight and efficient approximations, without any additional structural assumptions (such as restricted isometry, incoherence, expansion, etc.). For sparse regression, given $A, b$ with non-negative entries, we give an efficient algorithm to output a vector $x$ of sparsity $O(k)$, for which $\|Ax - b\|_1$ is comparable to the smallest error possible using a non-negative $k$-sparse $x$. We then use this technique to obtain our main result: an efficient algorithm for column subset selection under $\ell_1$ error for non-negative matrices.
    Abstract: The k-core decomposition is a fundamental primitive in many machine learning and data mining applications. We present the first distributed and the first streaming algorithms to compute and maintain an approximate k-core decomposition with provable guarantees. Our algorithms achieve rigorous bounds on space complexity while bounding the number of passes or number of rounds of computation. We do so by presenting a new powerful sketching technique for k-core decomposition, and then by showing it can be computed efficiently in both streaming and MapReduce models. Finally, we confirm the effectiveness of our sketching technique empirically on a number of publicly available graphs.
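    The sequential primitive being approximated is the classic peeling algorithm; a small reference implementation for intuition (quadratic-time, not the paper's distributed or streaming version):

    from collections import defaultdict

    def core_decomposition(edges):
        # Peeling: repeatedly remove a minimum-degree vertex; its degree at
        # removal time (made monotone) is its core number.
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        deg = {v: len(nbrs) for v, nbrs in adj.items()}
        core = {}
        current = 0
        while deg:
            v = min(deg, key=deg.get)          # minimum-degree vertex
            current = max(current, deg[v])     # core numbers never decrease
            core[v] = current
            for u in adj[v]:
                if u in deg:
                    deg[u] -= 1
            del deg[v]
        return core

    print(core_decomposition([(1, 2), (2, 3), (1, 3), (3, 4)]))  # triangle is the 2-core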
    One-Shot Coresets: The Case of k-Clustering
    International Conference on Artificial Intelligence and Statistics (2018)
    Abstract: Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed. While these techniques provide provably good and small summaries, they are inherently problem dependent --- the practitioner has to commit to a fixed clustering objective before even exploring the data. However, can one construct small data summaries for a wide range of clustering problems simultaneously? In this work, we affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees.
    Abstract: We introduce the {\em consistent $k$-clustering} problem, which asks to minimize the number of re-clusterings necessary to maintain a constant approximate solution as points arrive in an online manner. We prove lower bounds of $\Omega(k \log n)$ on the number of clustering changes necessary, and give an algorithm that needs only $O(k^3 \log^5 n)$ changes to maintain a constant competitive solution. This is an exponential improvement on the naive solution of reclustering at every time step. Finally, we show experimentally that our approach performs much better than the theoretical bound, with the number of changes growing approximately as $O(\log n)$.
    Abstract: We consider the problem of approximating a given matrix by a low-rank matrix so as to minimize the entrywise $\ell_p$-approximation error, for any $p \geq 1$; the case $p=2$ is the classical SVD problem. We obtain the first provably good approximation algorithms for this version of low-rank approximation that work for every value of $p \geq 1$, including $p=\infty$. Our algorithms are simple, easy to implement, work well in practice, and illustrate interesting tradeoffs between the approximation quality, the running time, and the rank of the approximating matrix.
    Abstract: We study the question of fair clustering under the disparate impact doctrine, where each protected class must have approximately equal representation in every cluster. We formulate the fair clustering problem under both the k-center and the k-median objectives, and show that even with two protected classes the problem is challenging, as the optimum solution can violate common conventions—for instance a point may no longer be assigned to its nearest cluster center! En route we introduce the concept of fairlets, which are minimal sets that satisfy fair representation while approximately preserving the clustering objective. We show that any fair clustering problem can be decomposed into first finding good fairlets, and then using existing machinery for traditional clustering algorithms. While finding good fairlets can be NP-hard, we proceed to obtain efficient approximation algorithms based on minimum cost flow. We empirically quantify the value of fair clustering on real-world datasets with sensitive attributes.
    Abstract: We propose a new framework called Ego-splitting for detecting clusters in complex networks which leverages the local structures known as ego-nets (i.e. the subgraph induced by the neighborhood of each node) to de-couple overlapping clusters. Ego-splitting is a highly scalable and flexible framework, with provable theoretical guarantees, that reduces the complex overlapping clustering problem to a simpler and more amenable non-overlapping (partitioning) problem. We can solve community detection in graphs with tens of billions of edges and outperform previous solutions based on ego-nets analysis. More precisely, our framework works in two steps: a local ego-net analysis, and a global graph partitioning. In the local step, we first partition the nodes' ego-nets using non-overlapping clustering. We then use these clusters to split each node of the graph into its persona nodes that represent the instantiation of the node in its communities. Then, in the global step, we partition these new persona nodes to obtain an overlapping clustering of the original graph.
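    A minimal sketch of the two-step framework, with the local and global partitioners left as pluggable hooks (`local_cluster`, `global_cluster` must return partitions covering all input nodes); per the paper, the local step can be as simple as connected components. `adj` maps node -> set of neighbors.

    from collections import defaultdict

    def ego_splitting(adj, local_cluster, global_cluster):
        persona_of = {}              # (node, local cluster id) -> persona id
        persona_graph = defaultdict(set)

        def persona(u, cid):
            return persona_of.setdefault((u, cid), len(persona_of))

        # Local step: partition every ego-net (neighbors of u, without u).
        side = {}                    # (u, v) -> local cluster of v in u's ego-net
        for u in adj:
            ego = {v: adj[v] & adj[u] for v in adj[u]}
            for cid, cluster in enumerate(local_cluster(ego)):
                for v in cluster:
                    side[(u, v)] = cid

        # Split nodes into personas: edge (u, v) links the matching personas.
        for u in adj:
            for v in adj[u]:
                pu, pv = persona(u, side[(u, v)]), persona(v, side[(v, u)])
                persona_graph[pu].add(pv)
                persona_graph[pv].add(pu)

        # Global step: a partition of personas is an overlapping node clustering.
        node_of = {p: uv[0] for uv, p in persona_of.items()}
        return [{node_of[p] for p in part} for part in global_cluster(persona_graph)]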
    Abstract: Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last $W$ items. In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular we give a $1/3 - \epsilon$ approximation algorithm that uses space polylogarithmic in the spread of the values of the elements and linear in the solution size $k$ for any constant $\epsilon > 0$. At the same time, processing each element only requires a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $1/2 - \epsilon$ approximation, and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem. We demonstrate the efficacy of the algorithms on a number of real world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a negligible fraction of them.
    Abstract: We consider the reachability indexing problem for private-public directed graphs. In these graphs nodes come in three flavors: public—nodes visible to all users, private—nodes visible to a specific set of users, and protected—nodes visible to any user who can see at least one of the node’s parents. We are interested in computing the set of nodes visible to a specific user online. There are two obvious algorithms: precompute the result for every user, or run a reachability algorithm at query time. This paper explores the trade-off between these two strategies. Our approach is to identify a set of additional visible seed nodes for each user. The online reachability algorithm explores the graph starting at these nodes. We first formulate the problem as asymmetric k-center with outliers, and then give an efficient and practical algorithm. We prove new theoretical guarantees for this problem and show empirically that it performs very well in practice.
    Affinity Clustering: Hierarchical Clustering at Scale
    Soheil Behnezhad
    Mahsa Derakhshan
    MohammadTaghi Hajiaghayi
    Raimondas Kiveris
    NIPS 2017, pp. 6867-6877
    Abstract: Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying good hierarchical clustering structure is at the same time a fundamental and challenging problem for several applications. In many applications, the amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data. In this paper, we propose algorithms to address this problem. First, we analyze minimum spanning tree-based clustering algorithms and their corresponding hierarchical clusterings. In particular we consider classic single-linkage clustering based on Kruskal's algorithm and a variation of Boruvka's algorithm that we call affinity clustering, and prove interesting new properties of these clusterings via the concept of certificates. Then we present new algorithms in the MapReduce model and their efficient real world implementations via Distributed Hash Tables (DHTs). Our MapReduce algorithms indeed improve upon the previous MapReduce algorithms for finding a minimum spanning tree in graphs as well. Finally we show experimentally that our algorithms are scalable for huge data and competitive with state-of-the-art algorithms. In particular we show that affinity clustering is in practice superior to several state-of-the-art clustering algorithms.
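    A sequential sketch of the Boruvka-style rounds behind affinity clustering, where every current cluster merges along its minimum-weight outgoing edge; the paper implements these rounds in MapReduce with DHTs, so this is only the single-machine analogue, and the round budget is illustrative.

    def affinity_rounds(n, edges, rounds=32):
        # `edges` is a list of (weight, u, v); returns a cluster label per node.
        label = list(range(n))                 # doubles as a union-find parent array

        def find(x):                           # union-find with path compression
            while label[x] != x:
                label[x] = label[label[x]]
                x = label[x]
            return x

        for _ in range(rounds):                # each iteration mirrors one MPC round
            best = {}                          # cluster -> lightest outgoing edge
            for w, u, v in edges:
                cu, cv = find(u), find(v)
                if cu != cv:
                    for c in (cu, cv):
                        if c not in best or w < best[c][0]:
                            best[c] = (w, u, v)
            if not best:                       # no outgoing edges: converged
                break
            for w, u, v in best.values():      # merge along the chosen edges
                cu, cv = find(u), find(v)
                if cu != cv:
                    label[cu] = cv
            label = [find(x) for x in range(n)]  # flatten to representatives
        return label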
    Expanders via Local Edge Flips
    Zeyuan Allen-Zhu
    Aditya Bhaskara
    Lorenzo Orecchia
    SODA (2016)
    Abstract: Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an n-node d-regular network for d = Ω(log n), we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random "flip" transformation, where in each time step, a random pair of vertices that have an edge decide to 'swap a neighbor'. They conjectured that performing O(nd) such flips at random would convert any connected d-regular graph into a d-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly O(n^17 d^23), obtained via a delicate Markov chain comparison argument. Our main result is to prove that a natural instantiation of the random flip produces an expander in at most O(n^2 d^2 √(log n)) steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the 'random switch', and show that it produces an expander in O(nd) steps with high probability.
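    A small simulator of the flip process for experimentation; this is our own instantiation of the swap rule (the paper analyzes a specific natural instantiation), with a simplicity check, assuming d ≥ 2 and `adj` mapping node -> set of neighbors.

    import random

    def random_flips(adj, steps, seed=0):
        rng = random.Random(seed)
        nodes = list(adj)
        for _ in range(steps):
            # pick a random vertex u and a random incident edge (u, v)
            u = rng.choice(nodes)
            v = rng.choice(list(adj[u]))
            x = rng.choice(list(adj[u] - {v}))   # another neighbor of u
            y = rng.choice(list(adj[v] - {u}))   # another neighbor of v
            # swap: replace (u, x), (v, y) with (u, y), (v, x), keeping the
            # graph simple; every vertex degree is preserved.
            if x != y and y not in adj[u] and x not in adj[v]:
                adj[u].remove(x); adj[x].remove(u)
                adj[v].remove(y); adj[y].remove(v)
                adj[u].add(y);    adj[y].add(u)
                adj[v].add(x);    adj[x].add(v)
        return adj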
    On Sampling Nodes in a Network
    Flavio Chierichetti
    Anirban Dasgupta
    Ravi Kumar
    WWW (2016) (to appear)
    Abstract: In this paper, we present a study of the community structure of ego-networks—the graphs representing the connections among the neighbors of a node—for several online social networks. Toward this goal, we design a new technique to efficiently build and cluster all the ego-nets of a graph in parallel (note that even just building the ego-nets efficiently is challenging on large networks). Our experimental findings are quite compelling: at a microscopic level it is easy to detect high quality communities. Leveraging this fact, we then develop new features for friend suggestion based on co-occurrences of two nodes in different ego-nets’ communities. Our new features can be computed efficiently on very large scale graphs by just analyzing the neighborhood of each node. Furthermore, we prove formally on a stylized model, and show by experimental analysis, that this new similarity measure outperforms the classic local features employed for friend suggestions.
    Abstract: We introduce the public-private model of graphs. In this model, we have a public graph and each node in the public graph has an associated private graph. The motivation for studying this model stems from social networks, where the nodes are the users, the public graph is visible to everyone, and the private graph at each node is visible only to the user at the node. From each node's viewpoint, the graph is just a union of its private graph and the public graph. We consider the problem of efficiently computing various properties of the graphs from each node's point of view, with a minimal amount of recomputation on the public graph. To illustrate the richness of our model, we explore two powerful computational paradigms for studying large graphs, namely, sketching and sampling, and focus on some key problems in social networks and show efficient algorithms in the public-private graph model. In the sketching model, we show how to efficiently approximate the neighborhood function, which in turn can be used to approximate various notions of centrality. In the sampling model, we focus on all-pair shortest path distances, node similarities, and correlation clustering.
    Abstract: As a fundamental tool in modeling and analyzing social and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible via developing distributed algorithms under distributed graph mining frameworks such as MapReduce, Pregel, Giraph, and the like. For these distributed algorithms to work well in practice, we need to take into account several metrics such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease-of-use of the MapReduce framework, developing practical algorithms with good theoretical guarantees for basic graph algorithms is a problem of great importance. In this tutorial, we first discuss how to design and implement algorithms based on traditional MapReduce architecture. In this regard, we discuss various basic graph theoretic problems such as computing connected components, maximum matching, MST, triangle counting, and overlapping or balanced clustering. We discuss a computation model for MapReduce and describe the sampling, filtering, local random walk, and core-set techniques to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting MapReduce with a distributed hash table (DHT) service and also discuss the use of a new graph processing framework called ASYMP based on the asynchronous message-passing method. In particular, we show that using ASYMP, one can improve the CPU usage and achieve significantly improved running time.
    Abstract: Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of a current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a dense subgraph every time the graph changes requires Ω(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
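    The static baseline here is Charikar's greedy peeling 2-approximation; a compact reference version (recomputed from scratch, i.e. exactly the Ω(m)-per-update approach the paper avoids):

    from collections import defaultdict

    def densest_subgraph_peel(edges):
        # Repeatedly delete a minimum-degree vertex; return the densest
        # intermediate subgraph (density = edges / vertices).
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        alive = set(adj)
        m = len(edges)
        best_density, best_size = 0.0, len(alive)
        order = []
        while alive:
            density = m / len(alive)
            if density > best_density:
                best_density, best_size = density, len(alive)
            v = min(alive, key=lambda x: len(adj[x] & alive))
            m -= len(adj[v] & alive)
            alive.remove(v)
            order.append(v)
        # the chosen subgraph is the last best_size vertices in peel order
        return best_density, order[len(order) - best_size:]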
    On Learning Mixture Models for Permutations
    Flavio Chierichetti
    Anirban Dasgupta
    Ravi Kumar
    ITCS (2015)
    Abstract: In this paper we consider the problem of learning a mixture of permutations, where each component of the mixture is generated by a stochastic process. Learning permutation mixtures arises in practical settings when a set of items is ranked by different sub-populations and the rankings of users in a sub-population tend to agree with each other. While there is some applied work on learning such mixtures, they have been mostly heuristic in nature. We study the problem where the permutations in a mixture component are generated by the classical Mallows process in which each component is associated with a center and a scalar parameter. We show that even when the centers are arbitrarily separated, with exponentially many samples one can learn the mixture, provided the parameters are all the same and known; we also show that the latter two assumptions are information-theoretically inevitable. We then focus on polynomial-time learnability and show bounds on the performance of two simple algorithms for the case when the centers are well separated. Conceptually, our work suggests that while permutations may not enjoy as nice mathematical properties as Gaussians, certain structural aspects can still be exploited towards analyzing the corresponding mixture learning problem.
    Abstract: We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including the number of common neighbors and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
    Preview abstract Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on the traditional MapReduce architecture for large-scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees, and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges. View details
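    The primitive these algorithms accelerate is easiest to see in its naive form: min-label propagation, where in each round every node adopts the smallest label in its neighborhood. The single-machine Python sketch below (illustrative, not the paper's algorithm) needs a number of passes proportional to the graph diameter; the point of the paper's MapReduce and DHT variants is to drive that round count down.

```python
def connected_components(edges):
    """Naive min-label propagation: each pass of the while-loop
    corresponds to one distributed round.  Node ids must be comparable;
    every node's final label is the smallest id in its component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    labels = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        for v, nbrs in adj.items():
            best = min([labels[v]] + [labels[w] for w in nbrs])
            if best < labels[v]:
                labels[v] = best
                changed = True
    return labels
```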
    Distributed Balanced Clustering via Mapping Coresets
    Aditya Bhaskara
    NIPS, Neural Information Processing Systems Foundation (2014)
    Preview abstract Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster, which leads to the problem of clustering under capacity constraints, or the “balanced clustering” problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound distributed algorithm remains an open problem. In the present paper we develop a general framework based on “mapping coresets” to tackle this issue. For a wide range of clustering objective functions such as k-center, k-median, and k-means, our techniques give distributed algorithms for balanced clustering that match the best known single machine approximation ratios. View details
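    To make the coreset pattern concrete, here is a deliberately simplified Python sketch: each shard is summarized by a small set of weighted centers, and a single machine then clusters the union of the summaries. The paper's mapping coresets additionally maintain a mapping from original points to coreset points so that the size constraints of balanced clustering can be enforced; that mapping is omitted here, and all names are ours.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_kmeans(points, weights, k, iters=25):
    """Plain Lloyd iterations on weighted points; returns k centers."""
    centers = random.sample(points, k)
    for _ in range(iters):
        sums, totals = [None] * k, [0.0] * k
        for p, w in zip(points, weights):
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            sums[i] = ([w * x for x in p] if sums[i] is None
                       else [s + w * x for s, x in zip(sums[i], p)])
            totals[i] += w
        centers = [[s / t for s in sm] if t > 0 else centers[i]
                   for i, (sm, t) in enumerate(zip(sums, totals))]
    return centers

def distributed_cluster(shards, k, summary_size):
    """Composable summaries: cluster each shard locally (one task per
    shard in a real system), weight each local center by the number of
    points it represents, then cluster the weighted union."""
    pts, wts = [], []
    for shard in shards:
        centers = weighted_kmeans(shard, [1.0] * len(shard), summary_size)
        counts = [0.0] * summary_size
        for p in shard:
            counts[min(range(summary_size),
                       key=lambda c: dist2(p, centers[c]))] += 1
        pts.extend(centers)
        wts.extend(counts)
    return weighted_kmeans(pts, wts, k)
```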
    On Reconstructing a Hidden Permutation
    Flavio Chierichetti
    Anirban Dasgupta
    Ravi Kumar
    RANDOM (2014)
    Learning Entangled Single-Sample Gaussians
    Flavio Chierichetti
    Anirban Dasgupta
    Ravi Kumar
    Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014
    Communities, Random Walks, and Social Sybil Defense
    Lorenzo Alvisi
    Allen Clement
    Alessandro Panconesi
    Internet Mathematics (2014)
    Arrival and Departure in Social Networks
    Shaomei Wu
    Atish Das Sarma
    Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013
    SoK: The Evolution of Sybil Defense via Social Networks
    Lorenzo Alvisi
    Allen Clement
    Alessandro Panconesi
    2013 IEEE Symposium on Security and Privacy, SP 2013
    Preview abstract Sybil attacks, in which an adversary forges a potentially unbounded number of identities, are a danger to distributed systems and online social networks. The goal of sybil defense is to accurately identify sybil identities. This paper surveys the evolution of sybil defense protocols that leverage the structural properties of the social graph underlying a distributed system to identify sybil identities. We make two main contributions. First, we clarify the deep connection between sybil defense and the theory of random walks. This leads us to identify a community detection algorithm that, for the first time, offers provable guarantees in the context of sybil defense. Second, we advocate a new direction for sybil defense that targets the more limited, but practically useful, goal of securely white-listing a local region of the graph. View details
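    The random-walk connection at the heart of the survey can be illustrated in a few lines: short walks started at a trusted seed mix within the honest region before they are likely to cross the sparse cut into the sybil region, so degree-normalized landing frequencies separate the two sides. The Python below is a generic illustration of that principle, not any particular protocol covered in the paper; all parameters and names are ours.

```python
import random
from collections import Counter

def whitelist_by_short_walks(adj, seed, walk_len, num_walks, cutoff):
    """Run many short random walks from a trusted seed and score nodes
    by landing frequency divided by degree (the walk's stationary
    distribution is proportional to degree).  Nodes scoring above the
    cutoff are white-listed.  Assumes every node has a neighbor."""
    hits = Counter()
    for _ in range(num_walks):
        v = seed
        for _ in range(walk_len):
            v = random.choice(adj[v])
        hits[v] += 1
    scores = {v: hits[v] / len(adj[v]) for v in hits}
    return {v for v, score in scores.items() if score >= cutoff}
```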
    A Local Algorithm for Finding Well-Connected Clusters
    Zeyuan Allen Zhu
    The 30th International Conference on Machine Learning, ICML 2013
    Preview abstract Motivated by applications of large-scale graph clustering, we study random-walk-based LOCAL algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. In particular, we develop a method with better theoretical guarantees than all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform an empirical evaluation to support our theory on both synthetic and real data. More specifically, our method outperforms prior work when the cluster is WELL-CONNECTED. In fact, the better connected the cluster is internally, the more significant the improvement we obtain. Our results shed light on why in practice some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research about local clustering. View details
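    A minimal example of the kind of LOCAL procedure studied here, sketched in Python under our own naming: a truncated lazy random walk from a seed whose small probabilities are discarded (so the work tracks the output size rather than the graph size), followed by a sweep cut that keeps the prefix of smallest conductance. The paper's actual algorithm and guarantees differ.

```python
def local_cluster(adj, seed, steps=30, kappa=1e-4):
    """Truncated lazy random walk + sweep cut.  Conductance is computed
    as cut / volume, which assumes the cluster holds less than half of
    the total edge volume."""
    p = {seed: 1.0}
    for _ in range(steps):
        nxt = {}
        for v, mass in p.items():
            nxt[v] = nxt.get(v, 0.0) + mass / 2            # lazy step
            share = mass / (2 * len(adj[v]))
            for w in adj[v]:
                nxt[w] = nxt.get(w, 0.0) + share
        p = {v: m for v, m in nxt.items() if m >= kappa}   # truncation
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    cur, vol, cut = set(), 0, 0
    best_set, best_phi = {seed}, float("inf")
    for v in order:
        cur.add(v)
        vol += len(adj[v])
        cut += sum(-1 if w in cur else 1 for w in adj[v])
        phi = cut / max(vol, 1)
        if phi < best_phi:
            best_phi, best_set = phi, set(cur)
    return best_set, best_phi
```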
    Milgram-routing in social networks
    Alessandro Panconesi
    D. Sivakumar
    Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 725-734
    Preview abstract We demonstrate how a recent model of social networks (“Affiliation Networks”) offers powerful cues in local routing within social networks, a theme made famous by sociologist Milgram’s “six degrees of separation” experiments. This model posits the existence of an “interest space” that underlies a social network; we prove that in networks produced by this model, not only do short paths exist among all pairs of nodes but natural local routing algorithms can discover them effectively. Specifically, we show that local routing can discover paths of length O(log^2 n) to targets chosen uniformly at random, and paths of length O(1) to targets chosen with probability proportional to their degrees. Experiments on the co-authorship graph derived from DBLP data confirm our theoretical results, and shed light on the power of one step of lookahead in routing algorithms for social networks. View details
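    The local routing scheme analyzed above is, at its core, greedy forwarding in the latent interest space; the following illustrative Python (our names, with Euclidean interest vectors for concreteness) shows the rule: each node hands the message to whichever neighbor looks closest to the target.

```python
def greedy_route(adj, interests, source, target, max_steps=100):
    """Forward greedily toward the neighbor whose interest-space point
    is closest to the target's; only local information (a node's own
    neighbor list) is consulted at each step."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(interests[u], interests[v]))
    path, current = [source], source
    for _ in range(max_steps):
        if current == target:
            return path
        current = min(adj[current], key=lambda w: dist(w, target))
        path.append(current)
    return None  # step budget exhausted (greedy routing can stall)
```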
    Filtering: a method for solving graph problems in MapReduce
    Benjamin Moseley
    Siddharth Suri
    SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 85-94
    Preview abstract The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte-scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we will parameterize our algorithms by the amount of memory available on the machines, allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version. View details
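    The maximal matching algorithm mentioned at the end is a good showcase of filtering; below is a simplified sequential simulation in Python (our names, and without the memory and round accounting of the paper): sample edges down to what one machine can hold, match greedily there, filter out covered edges, and repeat.

```python
import random

def filtered_maximal_matching(edges, memory_limit):
    """Each loop iteration stands in for a constant number of MapReduce
    rounds: greedily match a memory-sized sample on "one machine", then
    filter out every edge touching a matched vertex."""
    matching, matched = [], set()
    remaining = list(edges)
    while remaining:
        sample = (remaining if len(remaining) <= memory_limit
                  else random.sample(remaining, memory_limit))
        for u, v in sample:                 # sequential greedy matching
            if u not in matched and v not in matched:
                matching.append((u, v))
                matched.update((u, v))
        remaining = [(u, v) for (u, v) in remaining
                     if u not in matched and v not in matched]
    return matching
```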
    Hiring a secretary from a poset
    Ravi Kumar
    Andrea Vattani
    Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 39-48
    Preview abstract The secretary problem lies at the core of mechanism design for online auctions. In this work we study the generalization of the classical secretary problem to a setting where there is only a partial order between the elements and the goal of the algorithm is to return one of the maximal elements of the poset. This is equivalent to the setting where the seller has a multidimensional objective function with only a partial order among the outcomes. We obtain an algorithm that succeeds with probability at least $k^{-k/(k-1)}\bigl(\bigl(1+\log^{-1/(k-1)} k\bigr)^k - 1\bigr)$, where $k$ is the number of maximal elements in the poset and is the only information about the poset that is known to the algorithm. On the other hand, we prove an almost matching upper bound of $k^{-1/(k-1)}$ on the success probability of any algorithm for this problem; this upper bound holds even if the algorithm knows the complete structure of the poset. View details
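    The algorithmic template is a natural generalization of the classical observe-then-accept rule, sketched below in Python with our own names: reject a prefix whose length depends on k, then accept the first element that no previously seen element dominates. The sampling fraction used here, k^(-1/(k-1)) (which degenerates to the classical 1/e as k tends to 1), is our illustrative choice echoing the upper bound above; the paper's exact algorithm and constants differ.

```python
import math

def poset_secretary(stream, dominates, k):
    """Observe-then-accept for a poset with k maximal elements: pass on
    an initial sample, then take the first element that is undominated
    by everything seen so far.  `dominates(y, x)` must return True iff
    y > x in the partial order."""
    n = len(stream)
    sample_frac = 1 / math.e if k == 1 else k ** (-1 / (k - 1))
    cutoff = int(sample_frac * n)
    seen = []
    for i, x in enumerate(stream):
        undominated = all(not dominates(y, x) for y in seen)
        seen.append(x)
        if i >= cutoff and undominated:
            return x            # accepted: maximal among elements seen
    return None                 # no acceptance; the algorithm fails
```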
    Affiliation Networks
    D. Sivakumar
    Proceedings of the 41st Annual ACM Symposium on Theory of Computing, ACM (2009), pp. 427-434
    Preview abstract In the last decade, structural properties of several naturally arising networks (the Internet, social networks, the web graph, etc.) have been studied intensively with a view to understanding their evolution. In recent empirical work, Leskovec, Kleinberg, and Faloutsos identify two new and surprising properties of the evolution of many real-world networks: densification (the ratio of edges to vertices grows over time), and shrinking diameter (the diameter reduces over time to a constant). These properties run counter to conventional wisdom, and are certainly inconsistent with graph models prior to their work. In this paper, we present the first simple, realistic, and mathematically tractable generative model that intrinsically explains all the well-known properties of social networks, as well as densification and shrinking diameter. Our model is based on ideas studied empirically in the social sciences, primarily on the groundbreaking work of Breiger (1973) on bipartite models of social networks that capture the affiliation of agents to societies. We also present algorithms that harness the structural consequences of our model. Specifically, we show how to overcome the bottleneck of densification in computing shortest paths between vertices by producing sparse subgraphs that preserve or approximate shortest distances to all or a distinguished subset of vertices. This is a rare example of an algorithmic benefit derived from a realistic graph model. Finally, our work also presents a modular approach to connecting random graph paradigms (preferential attachment, edge-copying, etc.) to structural consequences (heavy-tailed degree distributions, shrinking diameter, etc.). View details
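    To give a flavor of the generative mechanism (in a deliberately toy form; the paper's model is richer and its parameters differ), the Python below grows a bipartite agent-society graph by copying a prototype's affiliations, an edge-copying step that produces preferential-attachment-like behavior, and then folds it into a social network by linking agents that share a society.

```python
import random

def affiliation_network(n_agents, n_societies, copy_prob):
    """Toy affiliation-network generator: each new agent copies a random
    subset of a random prototype's societies and joins one fresh society;
    the returned social graph connects agents sharing a society."""
    member_of = {0: {0}}                      # agent -> set of societies
    for agent in range(1, n_agents):
        prototype = random.randrange(agent)
        inherited = {s for s in member_of[prototype]
                     if random.random() < copy_prob}
        member_of[agent] = inherited | {random.randrange(n_societies)}
    by_society = {}
    for agent, societies in member_of.items():
        for s in societies:
            by_society.setdefault(s, []).append(agent)
    return {frozenset((u, v))
            for members in by_society.values()
            for i, u in enumerate(members)
            for v in members[i + 1:]}
```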
    Efficient Computation of Weighted Clustering Coefficient
    Stefano Leonardi
    WAW (2014)
    Models for the Compressible Web
    Flavio Chierichetti
    Ravi Kumar
    Alessandro Panconesi
    SIAM Journal on Computing (2013)
    An algorithmic treatment of strong queries
    Ravi Kumar
    WSDM (2011), pp. 775-784
    Almost tight bounds for rumor spreading with conductance
    Flavio Chierichetti
    Alessandro Panconesi
    STOC (2010)
    Rumour spreading and graph conductance
    Flavio Chierichetti
    Alessandro Panconesi
    SODA (2010)
    Rumor Spreading in Social Networks
    Flavio Chierichetti
    Alessandro Panconesi
    TCS (2010)
    Models for the Compressible Web
    Flavio Chierichetti
    Ravi Kumar
    Alessandro Panconesi
    FOCS (2009), pp. 331-340
    On compressing social networks
    Flavio Chierichetti
    Ravi Kumar
    Michael Mitzenmacher
    Alessandro Panconesi
    KDD (2009), pp. 219-228
    Rumor Spreading in Social Networks.
    Flavio Chierichetti
    Alessandro Panconesi
    ICALP (2009)
    On placing skips optimally in expectation
    Flavio Chierichetti
    Federico Mari
    Alessandro Panconesi
    WSDM (2008)
    Gossiping (via mobile?) in social networks
    Flavio Chierichetti
    Alessandro Panconesi
    DIALM-POMC (2008)