The following invited tutorials will be held as part of IEEE CloudCom 2013.
Tutorial 1: “Hadoop Services at Yahoo”
The Hadoop project is an integral part of the Yahoo! cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. It is becoming the de facto industry framework for big data processing, and our engineers are leading participants in the Hadoop community. This tutorial offers an overview of Yahoo!'s Hadoop data processing platform, including the technologies we are focusing our efforts on, such as MapReduce, Pig, Hive, HCatalog, HBase and Storm. We will talk about some of the Big Data problems at Yahoo! that these technologies are being used to solve, and we will discuss some of the operational aspects. We will also focus on HCatalog, which abstracts away file locations and the underlying storage format of data from users and offers several other advantages, such as sharing data among MapReduce, Pig, and Hive. We will discuss how the ODBC/JDBC interface of HiveServer2 addresses the use case of getting data out of the Grid clusters. Similarly, we will highlight HBase, an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. Yahoo! has long used HBase in isolated, one-off deployments. A multi-tenant platform makes it possible for all our customers to take advantage of HBase capabilities. We will provide a brief overview of HBase and how it works, and then spend some time talking about multi-tenancy with HBase.
About the Speaker for Tutorial 1: Viraj Bhat, Yahoo!
Viraj is a Principal Engineer at Yahoo! in Sunnyvale, where he works on building, porting and parallelizing data-intensive applications on Yahoo! Grids based on Hadoop (the MapReduce programming paradigm). He built Hadoop Vaidya, a performance diagnostic tool for Hadoop MapReduce jobs. He is a contributor to various open source projects such as Pig, HCatalog and Hive. He received the Yahoo! award in 2008 for evangelizing Grid technologies and for profiling and optimizing MapReduce applications, and the 2012 excellence award for “Gridifying” the Genome project. Viraj Bhat graduated with a Ph.D. degree from Rutgers University. He was previously a visiting researcher at the Princeton Plasma Physics Laboratory (PPPL) in Princeton, New Jersey. Viraj has been involved in several research projects and technical publications at Rutgers, PPPL, Lawrence Berkeley National Laboratory (LBNL) and the Oak Ridge National Laboratory (ORNL). These included Autonomic Data Management, AutoMate/Accord, DISCOVER and CORBACoG at CAIP (Rutgers University), and Development of Systems for Distributed Scientific Data Management using Automated Workflows for Plasma Physics Simulations (PPPL, LBNL and ORNL).
Tutorial 2: Stream Mining: A Big Data Perspective
Data streams are continuous flows of data. Examples of data streams include network traffic, sensor data and call center records. Their sheer volume and velocity make mining them a great challenge for the big data mining community. Big data streams demonstrate several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data. Concept-drift occurs when the underlying concept of the data changes over time. Concept-evolution occurs when new classes emerge in the stream. Feature-evolution occurs when the feature set varies with time. Data streams also suffer from scarcity of labeled data, since it is not possible to manually label all the data points in the stream. All of these properties together conform to the characteristics of big data (i.e., volume, velocity, variety and veracity) and add a challenge to data stream mining. This tutorial first presents a number of big data infrastructures for processing unbounded continuous streams of data (e.g., S4, Storm). Next, the tutorial presents an organized picture of how to apply various data mining techniques to data streams: in particular, how to handle classification and clustering in evolving data streams by addressing these challenges. The tutorial also presents open source tools for big data mining over streams, such as Mahout and SAMOA, and a number of applications, including adaptive malicious code detection, on-line malicious URL detection, evolving insider threat detection and textual stream classification.
Content of tutorial:
Multi-step methodologies and techniques, and multi-scan algorithms, suitable for knowledge discovery and data mining, cannot be readily applied to data streams. This is due to well-known properties of stream data such as infinite length, high-speed data arrival, online/timely data processing, changing characteristics of the data and the need for one-pass techniques (i.e., raw data is discarded once processed). In particular, data streams are infinite, so efficient storage and incremental learning are required. Changes in the underlying concept over time are known as concept drift, so the learner should adapt to such changes and be ready for veracity and variety. The emergence of new classes in the stream is known as concept evolution, which makes classification difficult. New features may also evolve in the stream, as happens in text streams.
All of these properties conform to the characteristics of big data. For example, the infinite length of stream data corresponds to the large “Volume” of big data. High-speed data arrival and on-line processing correspond to the “Velocity” of big data. The characteristics of stream data can change over time: new patterns may emerge in an evolving stream, and old patterns may become outdated. These properties support the notions of “Variety” and “Veracity” of big data. Therefore, stream data possesses the characteristics of big data. In spite of the success of stream mining techniques and the extensive studies of them, there is no single tutorial dedicated to a unified study of the new challenges introduced by big data and evolving stream data.
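As a rough illustration of the one-pass constraint and of concept drift described above, the following minimal Python sketch keeps running statistics without storing raw data and signals drift when a recent window departs from the long-run mean. It is not taken from the tutorial material: the names RunningStats and find_drift_points, the window size and the threshold are all hypothetical, and mean-shift detection on a numeric stream is only one simple way to notice drift.

```python
from collections import deque


class RunningStats:
    """One-pass mean and variance (Welford's method); raw items are never stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


def find_drift_points(stream, window=20, threshold=1.0):
    """Return the indices at which a mean shift (assumed drift signal) is detected.

    Hypothetical rule: drift is signalled when the mean of the last `window`
    items differs from the long-run mean by more than `threshold`.
    """
    reference = RunningStats()      # summary of the current concept
    recent = deque(maxlen=window)   # bounded memory: only the last `window` items
    drift_points = []
    for i, x in enumerate(stream):
        reference.update(x)
        recent.append(x)
        # Compare only once both summaries rest on enough data.
        if len(recent) == window and reference.n >= 2 * window:
            recent_mean = sum(recent) / window
            if abs(recent_mean - reference.mean) > threshold:
                drift_points.append(i)
                # Forget the old concept and start learning the new one.
                reference = RunningStats()
                recent.clear()
    return drift_points


if __name__ == "__main__":
    # Synthetic stream whose mean jumps from 0 to 5 at index 100.
    stream = [0.0] * 100 + [5.0] * 100
    print(find_drift_points(stream))  # prints [104]
```

Memory use is bounded by the window size regardless of stream length, which is the point of the one-pass requirement. Resetting the summary after a detection is a design choice of this sketch; an approach that weights recent data more heavily would be an equally reasonable alternative.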
The big data community has adopted big data infrastructures for processing unbounded continuous streams of data (e.g., S4, Storm). However, their data mining support is rudimentary. A number of open source tools for big data mining have been released to support traditional data mining, but only a few support even very basic stream mining. Research challenges such as change detection, novelty detection, and feature evolution over evolving streams have recently been studied in traditional stream mining but not in big data. This tutorial will present a unified picture of how these challenges and the proposed solutions can be incorporated into these open source tools. In addition to presenting solutions to stream mining challenges, we will share experience with real applications of these techniques to data mining and security. We believe that the tutorial will greatly benefit both academics and practitioners, and motivate them to address these new challenges and applications in order to develop effective stream mining techniques.
With streaming products (e.g., Apache S4, IBM InfoSphere Streams, etc.) going live and becoming more popular in real-time data management areas such as telecommunications, banking, transportation and smart grid, research in stream knowledge discovery (KD) and data mining (DM) is reaching more and more applications. With the right platform, it is important and useful to share the various results in stream mining with the community and to outline new challenges as they are identified. Specifically, it will be interesting for the big data community to get a flavor of the cutting-edge research in data streams and big data, its usefulness in real-life applications, and the new challenges that are yet to be met. It will be a wonderful opportunity for the tutors to share their knowledge and experience with data mining, as well as for the big data community to get in touch with this rapidly growing field of stream mining research.
Speaker for Tutorial 2: Latifur Khan (University of Texas at Dallas)
Latifur R. Khan is currently a full professor (tenured) in the Computer Science department at the University of Texas at Dallas (UTD), where he has been teaching and conducting research since September 2000. He received his Ph.D. degree in Computer Science from the University of Southern California (USC), USA, in August 2000. Dr. Khan's research areas cover big data management and analytics, data mining, multimedia information management, and the semantic web. He has published more than 170 papers, including more than 40 journal papers. He is an ACM Distinguished Scientist and a Senior Member of IEEE. He has chaired several conferences and serves (or has served) as associate editor on multiple editorial boards, including that of the IEEE Transactions on Knowledge and Data Engineering (TKDE). He has conducted tutorial sessions at prominent conferences such as ACM WWW 2005, MIS2005, DASFAA 2007, WI 2008 and PAKDD 2011.