My Research at the Yale Institute of Network Science

I worked as a research assistant under Professor Sekhar Tatikonda in Summer 2015. The amount of data being collected has been growing at an enormous rate, and this data explosion has fueled a giant wave of machine learning research in both academia and industry. It also means that data is increasingly stored across a distributed system, with no central core that holds all of it. This naturally raises a question: how do we do machine learning over a distributed system?

Naturally, we started with the simplest machine learning algorithm: linear regression. The exact problem we were trying to solve is as follows: given multivariate data points (x_1, x_2, ..., x_p, y) split over k cores, devise a scheme for doing linear regression that allows an efficient trade-off between communication cost and accuracy. In the paper, we propose a solution that makes this trade-off possible. In addition, we run the algorithm on the Department of Transportation's Airline on-time dataset and find that the trade-off is approximately linear.
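
To make the problem setup concrete, here is a minimal sketch of two textbook baselines for linear regression over data split across k cores. It is not the method from the paper, and it does not reproduce the paper's trade-off. The function names, the synthetic data, and the choice of k = 3 shards are all assumptions made for illustration, and the model has no intercept term. In the first baseline, each core sends its sufficient statistics (X^T X and X^T y, roughly p^2 + p numbers) and the coordinator recovers the same answer as a centralized fit. In the second, each core sends only its p local coefficients and the coordinator averages them, which is cheaper to communicate but only approximate.

```python
import numpy as np

def local_statistics(X, y):
    """Sufficient statistics computed on one core: X^T X (p x p) and X^T y (p,)."""
    return X.T @ X, X.T @ y

def combine_exact(shards):
    """Sum the per-core statistics and solve the normal equations once."""
    stats = [local_statistics(X, y) for X, y in shards]
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    return np.linalg.solve(xtx, xty)

def combine_averaged(shards):
    """Cheaper baseline: fit on each core, then average the coefficient vectors."""
    local = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in shards]
    return np.mean(local, axis=0)

# Synthetic data (hypothetical): n points, p features, split over k cores.
rng = np.random.default_rng(0)
n, p, k = 6000, 4, 3
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)
shards = list(zip(np.array_split(X, k), np.array_split(y, k)))

beta_central = np.linalg.lstsq(X, y, rcond=None)[0]  # all data in one place
beta_exact = combine_exact(shards)                   # p^2 + p numbers per core
beta_avg = combine_averaged(shards)                  # p numbers per core

print("centralized:", beta_central)
print("exact merge:", beta_exact)
print("averaged:   ", beta_avg)
```

These two baselines sit at opposite ends of the communication axis. A scheme like the one described above would aim to interpolate between such extremes, giving up a controlled amount of accuracy for each reduction in communication.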