Data Lake 3.0

Data Lake 3.0: What does a self-driving car have to do with Data Lake 3.0?

This blog has contributions from: Vinod Vavilapalli, Wangda Tan, Gour Saha, Priyanka Nagwekar, Sunil Govindan

You have very likely wondered what makes a self-driving car intelligent enough to process live camera feeds, navigate busy streets, and distinguish objects on those streets, such as cars, trucks, traffic lights, and pedestrians. A self-driving car is a perfect example of a modern data application that combines big data with clever algorithms. To understand the underpinnings of such a modern data app, we will begin with a recap of our blog series titled “Data Lake 3.0” ( pt1 , pt2 , pt3 , pt4 , pt5 , pt6 ) and then conclude with the key takeaways from the keynote demo at DataWorks Summit, San Jose, 2017.

MODERN DATA APPLICATION

We are observing the emergence of modern data applications that exploit big data; are architected to be microservice-based and containerized; are compute/GPU intensive; and are deployed on commodity infrastructure. Our Data Lake 3.0 architecture sits at the intersection of all these major trends, and we want to walk you through a simplified example of a self-driving car. If you want to familiarize yourself with what a Data Lake 3.0 is, you might want to refer to pt1 of our blog series.

EXAMPLE: SELF-DRIVING CARS

A self-driving car generates massive amounts of video that need to be captured and stored in a centralized active archive for access by data scientists and analysts. This requires a storage layer that can scale to billions of files and exabytes of data, remain accessible to end users, and stay friendly on Total Cost of Ownership (TCO). The Hadoop storage layer (Apache HDFS) in Hadoop 3.0 provides erasure coding to store the data at half the cost of the traditional 3-replica scheme, while permitting linear scale and a unified namespace with NameNode Federation and ViewFileSystem. It has device-behavior analytics built in, so that a slow commodity server or a slow commodity network switch will not interrupt a latency-sensitive operation.
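As a back-of-the-envelope sketch of the storage claim, the raw-disk overhead of the RS(6,3) Reed-Solomon erasure coding policy (one of the policies Hadoop 3.0 ships) works out to exactly half that of 3x replication:

```python
# Effective storage overhead: raw bytes consumed per byte of user data.

def replication_overhead(replicas: int) -> float:
    """N-way replication stores every block N times."""
    return float(replicas)

def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Reed-Solomon RS(d, p) stores d data + p parity units per d data units."""
    return (data_units + parity_units) / data_units

replication = replication_overhead(3)    # classic HDFS default: 3.0x raw storage
rs_6_3 = erasure_coding_overhead(6, 3)   # RS(6,3) policy: 1.5x raw storage

print(f"3x replication:          {replication:.1f}x")
print(f"RS(6,3) erasure coding:  {rs_6_3:.1f}x")
print(f"raw-storage savings:     {replication / rs_6_3:.0f}x")
```

Note that erasure coding trades this storage saving for extra encode/decode work on write and recovery, which is why it is typically applied to colder archive data like these video sets.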

Now, a data scientist needs to train distributed deep learning models (using frameworks like TensorFlow) that will process natural signals such as video before the model gets deployed in the car, and this is an ongoing task: the more it trains, the better the self-driving car gets. Training is a very compute-intensive process. This is where Apache Hadoop YARN comes in: to pool the compute and memory across the cluster of commodity servers and train those models. YARN in Hadoop 3.0 can pool expensive GPUs (graphics processing units) and isolate the GPU devices between multiple users (YARN-6223 captures the first-class GPU support on YARN). For many models, one can see up to a 50-100x reduction in compute processing time for the data-intensive video files.
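As a rough sketch of what enabling GPU scheduling looks like, the YARN-6223 line of work declares GPUs as a first-class resource type via configuration. The property names below reflect how this shipped in later Hadoop 3.x releases, so verify them against the documentation for your version:

```xml
<!-- resource-types.xml: declare GPUs as a schedulable resource type -->
<property>
  <name>yarn.resource-types</name>
  <value>yarn.io/gpu</value>
</property>

<!-- yarn-site.xml: enable the GPU resource plugin on each NodeManager -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>
```

With this in place, applications can request `yarn.io/gpu` alongside memory and vcores, and YARN isolates the devices between containers belonging to different users.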

That brings us to a key aspect of our Data Lake 3.0 story: the concept of an Assembly. We do not expect every analyst to understand the infrastructure complexity in order to run their modern data applications. Instead, we want them to go to an application store, similar to the iPhone or Android app store, download an application created and published by data scientists, and just run it. This is where our Assembly store helps. An analyst can now deploy a modern data application (in this case, a self-driving-car assembly, which is a templated application), assign the required GPU/memory, and run it.
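For illustration, a templated assembly is conceptually similar to a YARN service specification: a named set of containerized components with resource asks. The sketch below is hypothetical; the component name, Docker image, and launch command are made up, and the exact field names (especially for requesting GPUs) vary by release:

```json
{
  "name": "self-driving-car-assembly",
  "version": "1.0",
  "components": [
    {
      "name": "tensorflow-annotator",
      "number_of_containers": 4,
      "artifact": { "id": "repo/tf-annotator:1.0", "type": "DOCKER" },
      "launch_command": "python annotate.py",
      "resource": { "cpus": 2, "memory": "8192" }
    }
  ]
}
```

The analyst only edits the resource numbers in the template; the platform handles container placement, GPU isolation, and data access on their behalf.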

The top video above is the raw input video (source: Udacity data sets). Inside the Self-Driving Car assembly, the video is broken into 15 frames per second, and then the TensorFlow-based deep learning model annotates the objects in the frames (car vs. traffic light vs. pedestrian vs. truck, etc.) and creates a set of output frames, which are stitched back into an output video, shown at the bottom above. An analyst does not have to understand the complexity of TensorFlow, GPUs, or the Hadoop infrastructure, and can just concentrate on his/her job with access to the input and output videos. Our Data Lake 3.0 enables this entire life cycle, with up to 100x faster analyst productivity at a lower TCO (2x storage reduction, sharing of expensive GPU resources across analysts).
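The pipeline inside the assembly can be sketched as a split-annotate-stitch loop. The minimal Python sketch below is hypothetical: `detect_objects` stands in for the TensorFlow model, and a "frame" is simply a list of labels rather than pixel data:

```python
# Split -> annotate -> stitch: a toy version of the assembly's video pipeline.

FRAMES_PER_SECOND = 15  # the assembly samples the input at 15 fps

def detect_objects(frame):
    """Hypothetical model stub: return the recognized objects in one frame."""
    known = {"car", "truck", "pedestrian", "traffic light"}
    return [obj for obj in frame if obj in known]

def annotate(frames):
    """Run the detector over every frame, keeping the frame index."""
    return [{"frame": i, "labels": detect_objects(f)} for i, f in enumerate(frames)]

def stitch(annotated):
    """A real pipeline would re-encode frames into a video container."""
    return annotated

# Two toy frames; "sky" is not a class the detector knows.
frames = [["car", "traffic light"], ["pedestrian", "car", "sky"]]
output = stitch(annotate(frames))
print(output)
```

In the real assembly the split and stitch steps use a video codec and the annotation step runs the trained model on GPU-backed YARN containers, but the data flow is the same.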

KEY TAKEAWAYS

A self-driving car is one of many modern data applications that exemplify our Data Lake 3.0 use cases. We are working with our partners to bring more real-world modern data applications, such as IBM Data Science Experience, to our Data Lake 3.0 assembly store.

Please stay tuned for future Data Lake 3.0 updates from us. We are planning an early access program in the future so you can build a Data Lake 3.0 architecture and provide us feedback. If you want to participate, please reach out to me or your account management team.
