Spark Workshop
Workshop on Distributed Data Analysis with Apache Spark
Scientific data sets produced by today's state-of-the-art experiments are rapidly outgrowing what can be processed by simple means on a laptop or even a capable workstation. Machines with very large amounts of on-board memory (~1-3 TB) are available today, but they are expensive, and tying an analysis to a single computer means the solution will not scale once the data sets grow further. Many scientists therefore still struggle to make sense of such vast quantities of data, since doing so often requires knowledge of complex parallel programming tools. In addition, an analysis of data sets this large may take a very long time to run, inhibiting data exploration with fast turnaround.
In recent years a new distributed data analysis framework named Spark has emerged, which provides both robust scalability to many hundreds of machines and ease of use through a high-level programming interface. Importantly, Spark also allows users to perform interactive analysis, which makes it very attractive for scientific applications, where the goals of an analysis can often only be established through efficient data exploration.
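To give a flavor of this high-level interface, the following is a minimal sketch of an interactive session in the pyspark shell, which provides a ready-made SparkContext named sc; the input path and the data set are purely illustrative.

    # Interactive PySpark session sketch; the path is hypothetical.
    lines = sc.textFile("hdfs:///data/example/measurements.txt")  # distributed collection of text lines
    errors = lines.filter(lambda l: "ERROR" in l)                 # lazy transformation, nothing runs yet
    print(errors.count())                                         # action: triggers the distributed job

Because transformations are lazy, the user can build up a chain of operations interactively and pay the cost of distributed execution only when an action such as count() is called.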
Over the past year, Scientific IT Services (SIS) has been exploring the usability of Spark on the existing central ETH computing infrastructure, including a prototype stand-alone Hadoop cluster and the Euler batch queue. We have developed a short workshop to introduce the ETH scientific community to this novel data analysis framework. The workshop gives scientists an understanding of the Spark programming model and the tools necessary to immediately run their own analyses on the ETH infrastructure.
The first two sessions of the workshop were held during the first two weeks of September and were attended by 25 scientists from D-GESS, D-BSSE and D-BIOL. Each session ran over three days: the first day introduced the challenges of distributed data analysis and the relevant programming concepts through exercises, and during the following two days the participants carried out a mini-project whose goal was to develop a pipeline to analyze the text of the entire Project Gutenberg book corpus (a sketch of such a pipeline is shown below). Approximately 75% of the workshop was spent in hands-on sessions, during which the participants were able to develop a proper intuition for both the Spark framework and the available infrastructure. Given the strong interest in the workshop and the positive feedback from the first two groups, we hope to extend it to other departments during the Fall semester.
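For illustration, a word-count pipeline of the kind that could serve as a starting point for the mini-project might look as follows in PySpark; the corpus path and application name are assumptions, and the actual mini-project analyses went beyond this basic example.

    from pyspark import SparkContext

    # Sketch of a simple text-analysis pipeline over a book corpus;
    # the HDFS path and application name are hypothetical.
    sc = SparkContext(appName="gutenberg-wordcount")

    # Read all books as a single distributed collection of text lines.
    books = sc.textFile("hdfs:///data/gutenberg/*.txt")

    # Classic word count: split lines into words, normalize case,
    # and sum the counts per word across the cluster.
    counts = (books.flatMap(lambda line: line.split())
                   .map(lambda word: (word.lower(), 1))
                   .reduceByKey(lambda a, b: a + b))

    # Bring the ten most frequent words back to the driver.
    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)

    sc.stop()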
Text & Contact
Rok Roskar, Research Informatics, ITS Scientific IT Services (ITS SIS)