If you look around the meetups, talks about adopting Spark as a real-time computation engine and using MLlib for large-scale machine learning are popping up in almost every tech hub, whether San Francisco, London, or Beijing. The Spark community has grown very fast this year, and everyone seems to agree that Spark will replace MapReduce in the Hadoop stack in many time-critical scenarios through Spark Streaming, so that we can finally move from big data to big real-time data. Some companies have already adopted Spark, including Taobao and Baidu in China.
There are several reasons why Spark appeals to me.

1. Real-time performance with Spark Streaming

In smart transportation or smart health, real-time or near-real-time communication has become available through protocols like MQTT, but high latency in the analytics layer has been a headache: few users are interested in an analysis of data that is already a day old. The reason is simply that batch processing in MapReduce usually takes tens of minutes to hours on a big dataset. Thanks to Spark's streaming and its in-cluster, in-memory computation, latency can be brought down to somewhere between milliseconds and seconds. (For in-memory computation, it is also worth looking at other DRAM-based solutions such as VOLUME.) That is undoubtedly a huge improvement. It can help us aggregate big datasets or run predictive analysis in near real time, and ultimately react faster to what is going on in the traffic. In our computer vision projects, faster image processing, feature extraction, and classification could help us learn scenes, and the activities in them, more quickly.

Behind the scenes, Spark rests on two basic concepts: RDDs and streaming.

1.1 RDD (Resilient Distributed Datasets) 

Basically, you can think of an RDD as a data structure designed to be distributed and fault-tolerant across the cluster. For data scientists, manipulating data in an RDD feels much like working on a single PC. Moreover, an RDD offers a set of common operators to apply to the distributed data: transformations (for example, map and filter) and actions (for example, reduce, collect, and count). Note in particular that RDDs are lazily evaluated, which means Spark executes the preceding transformations only when it encounters an action. This mechanism makes data processing more efficient. The recorded chain of transformations is called the lineage: a directed acyclic graph that stores the transformation metadata, in which each child RDD depends on its parent RDD. When the lineage grows too long, a checkpoint is needed to avoid a long rollback. Finally, an intermediate RDD can also be cached in memory, which makes it reusable and faster to operate on.
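To make the lazy evaluation and lineage ideas concrete, here is a minimal sketch in plain Python. It is not the Spark API: ToyRDD and everything in it are hypothetical names I made up for illustration, and it runs on a single machine with no distribution or fault tolerance. It only shows that transformations are recorded and nothing is computed until an action is called.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API).

    Transformations are only recorded in a lineage; actions replay it.
    """

    def __init__(self, source, lineage=()):
        self._source = source    # the original, re-iterable input data
        self._lineage = lineage  # tuple of (kind, function), oldest first

    # Transformations: return a new ToyRDD with a longer lineage, compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded lineage over the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: these are the only methods that trigger computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return _reduce(fn, self._compute())


calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)        # 0: nothing has run yet

result = rdd.collect()                  # first action replays the lineage
total = rdd.reduce(lambda a, b: a + b)  # second action replays it again
```

In this toy, the second action recomputes the whole lineage from the source, so square runs ten times in total for five inputs. That recomputation is the cost that caching an intermediate RDD is meant to avoid.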

More advanced RDD features include broadcast variables, joins, accumulators, and controllable partitioning. These are still to be discussed with real examples.

1.2 Spark Streaming

It works by splitting the streaming data into small discretized batches, with a batch interval as low as 0.5 to 2 seconds. Each batch is then turned into an RDD, which can be manipulated with the usual Spark operators. On AWS, Spark is said to scale to 100 nodes with 4 cores each, processing 6 GB/s with a latency of seconds. Spark Streaming is often compared to other streaming systems, for example Twitter Storm. It is said that the minimum batch size in Storm is 100 ms, but that Spark outperforms Storm in scalability and processing capability. Hopefully we can see that in our own tests.
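The discretization step above can be sketched in a few lines of plain Python. Again, this is a toy and not the Spark Streaming API: the discretize function, the 1-second batch interval, and the (timestamp, speed) traffic readings are all assumptions made up for illustration. Each resulting batch plays the role of one small RDD that is then processed with ordinary operations.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-interval batches.

    Hypothetical helper, NOT part of Spark. Returns one list per interval,
    from the first non-empty interval to the last, including empty ones.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


def average(values):
    """Per-batch computation: mean of the batch, or None if it is empty."""
    return sum(values) / len(values) if values else None


# Made-up sensor readings: (arrival time in seconds, vehicle speed in km/h).
events = [(0.2, 40), (0.9, 50), (1.1, 30), (3.5, 60)]

batches = discretize(events, batch_interval=1.0)
averages = [average(batch) for batch in batches]
```

With a 1-second interval the four readings fall into intervals 0, 0, 1, and 3, so the third batch is empty. A shorter interval lowers latency but produces more, smaller batches to schedule, which is the trade-off behind the minimum batch sizes quoted above.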

2. Integration with existing Hadoop systems

Spark solves the problem of real-time, large-scale computation on a cluster, but it does not reinvent every wheel. It can run on cluster managers such as Mesos and YARN, or in standalone mode. Its integration with Hadoop-based file systems lets it work with storage systems and formats such as HDFS, S3, and SequenceFile.

3. Large-scale machine learning

Weka is an easy machine learning platform in Java that appeared during my master's studies; I don't know how it is doing now. Although Weka is easy to use and to prototype with, it cannot solve the problem of large-scale machine learning on big data. Mahout is good on Hadoop, but MLlib seems to me a better choice. So far MLlib provides linear regression, logistic regression, k-means clustering, Naive Bayes, SVM, and more. I look forward to MLlib introducing softmax and neural networks.

4. Support for Python and R

Although Spark is written in Scala, Python and R are the most common languages among data scientists. I am a Python guy, so PySpark is what I am looking forward to programming with, combined with scikit-learn's preprocessing utilities, perhaps pylearn2's deep learning models and GPU integration, and Python OpenCV.

Spark also provides Spark SQL and GraphX. I still need to understand better how GraphX works.

In the next two months, the junior team is going to experiment with Spark and hopefully put it on our stack for real-time analytics in our latest smart city project. I am looking forward to their results.