Category: big data

To work on SPARK

If you look around in the meetups, the talks about adopting the Spark as real time computation engine and using Millb for large scale machine learning is popping up almost every Teck Hub, in San Francisco or London or Beijing. The Spark community is thriving so fast in this year, everyone seems to agree that Spark will be replacing Mapreduce from Hadoop stack in many time-critical scenarios with the use of Spark streaming, and just now we can move from big data to big real time data. So far, some of companies are already adopting Spark, in China like Taobao and Baidu.
What lures me into the Spark has many reasons.

1. Real time performance with Spark streaming

In smart transportation or smart health, although real/ near real time communication becomes available through protocols like MQTT,  high time latency on the analytic layer  was a headache that not so many public user are interested in old data analytic from the past day. Simply because batching processing in MapReduce is based on  usually takes tens of minutes to hours on big dataset. Thanks to the Spark ‘s streaming and in cluster & memory computation so that latency can be minimized from mini-seconds to seconds.(As for the in memory computation, it is also good to see other DRAM based solutions like VOLUME) That is undoubtedly a huge improvements that can aid us to aggregate the big datasets or make predictive analysis in near real time and ultimately acts faster to understand what is going on in the traffic. In our computer vision projects,  make image processing, feature extraction and do classification could be faster to learn the scenes or activities on the scenes.

Behind the scene of Spark, there are two basic concepts RDD and streaming.

1.1 RDD (Resilient Distributed Datasets) 

Basically, you could consider it as a data structure which is designed to be distributed and fault-tolerant on the cluster. For data scientists, it is  easy to manipulate data with RDD like what we do on the single PC; Moreover RDD offers data scientists a set of common operators like transformation operators (for example , map and filter )and action operators (for example , reduce, collect, count)  to apply on the distributed data. Special to note, RDD is  designed to be lazy evaluation, which means Spark executes all previous  transformation operators only when it encounters the action operators,  this mechanism makes data processing more effective. The previous transformation operators in Spark is called as Lineage – a directed acyclic graph structure stores the transformation metadata, children RDD depends on the parent RDD. When the lineage is too long, checkpoint is needed to avoid long rollback.  At last,   intermediate RDD is also cache-able in the memory  thereby operating on it more quickly, making it reusable.

More advanced RDD includes Broadcast, join, accumulators, controllable partitioning. Still to be discussed in real examples.

1.2 Spark streaming

It works by splitting the streaming data into small discretized stream in a batch size (minimal 0.5 – 2 seconds), each discretized stream is then transformed into RDD and later manipulate the RDD with various Spark operators, on AWS Spark is said to be scaled to 100 nodes, each node with 4 cores, processing 6GB/s in seconds latency.  Spark ‘s streaming is often compared to other streaming systems, for example, Twitter Storm. It is said in Storm the minimum batch size is 100ms, but Spark outperforms Strom in scalability and processing capabilities .  Hopefully we can see that in our tests.

2.   Integration with existing HADOOP system

Actually, Spark solves the problem of real time large scale computation in cluster, however it doesn’t invent all the wheels,  it could work with cluster like Mesos, YARN, standalone. Being integrated with Hadoop based file system allows it to work with file format like HDFS, S3 and sequenceFile.

3. Large scalable machine learning

Weka is a easy machine learning platform in Java which appeared in my master time, don’t know how it goes now. Although Weka is so easy use and prototype but can solve the problem of large scalable machine learning on big data. Mahout is good in Hadoop, butMLlib seems to me a better choice. So far MLlib provides linear regression, logistic regression, Kmeans clustering, Native bayes, SVM … Look forwards to Mllib to introduce Softmax, neural networks.

4. Support Python and R

Although made in Scala, for data scientists, Python and R are the most common languages, I am Python guy so PYSPARK is what I am looking forward to program with, combined with scikit-learn ‘s preprocessing utility / perhaps pylearn2 ‘s deep learning model and GPU integration / python opencv

In fact, Spark has provided SparkSQL, GraphX, still needs to understand better how GraphX works.

In the next two month, the junior team is going to experiment on the Spark and hopefully put it on our stack for real time analytic on the latest smart city project, looking forwards to their results.


By Brett Sheppard Feb. 2, 2011, 2:00pm PT

During the 2011 NFL playoff TV broadcasts, amid commercials featuring Anheuser-Busch Clydesdales and auto-racing driver Danica Patrick, one ad features an IBM researcher talking about data analytics. While NFL TV broadcasts may seem an unusual forum for a discussion of data analytics, information management and analysis play an important role in professional sports. Or to put it plainly, big data has officially gone global.

With data volumes moving past terabytes to tens of petabytes and more, business and IT leaders across the board face significant opportunities and challenges from big data. For a large company, big data may be in the petabytes or more; for a small or mid-size enterprise, data volumes that grow into tens of terabytes become “big data.”

Better service through big data
For health care provider Kaiser Permanente and its more than 8 million members, big data is about improving the quality of care and reducing costs. Using electronic healthcare records and decision-support software, Kaiser doctors and nurses can view the patient’s complete history including lab test results, prescriptions, diagnosis, treatment, demographics, medical plan and payment records. Further, patients can avoid unnecessary trips to the hospital through a personalized online “My Health Manager” that allows them to securely email doctors and request mail-order prescription refills.

In the financial services industry, Fidelity National Information Services (FIS), which sells risk management and fraud detection services to credit card issuers, uses big data analytics to better detect credit card fraud. The nature of what FIS analysts do is highly ad hoc and interactive. They frequently run complex queries correlating multiple activities in different data sets, to stay one step ahead of credit card thieves. As they detect new methods of fraud, those methods become encoded into their company’s search algorithms and the operational systems that accept or decline a credit card transaction in real-time.

Visualize big data

Source: LinkedIn

As Stacey discusses, one of the biggest challenges is making data intelligible and accessible. Visualizations help business users identify patterns and take actionable steps. For example, LinkedIn Maps (above) enables users to map professional networks and understand relationships among connections. Your map is color-coded to represent different affiliations or groups from your professional career, such as your previous employer, college classmates or industries you’ve worked in. When you click on a contact within a circle, you’ll see their profile pop up on the right, as well as lines highlighting how they’re connected to your connections. To empower large-scale data computations of more than 100 billion relationships a day and low-latency site serving, LinkedIn uses a combination of Hadoop to process massive batch workloads, Project Voldemort for a NoSQL key/value storage engine and the Azkaban open-source workflow system to control ETL jobs.

While it’s exciting to see NFL playoff commercials and other organizations touting advanced analytics, the space is not without its limitations. It takes time to adopt best practices for these new technologies, and, more fundamentally, business processes often adapt slow to them. Data continues to remain silo’ed, and let’s not forget the numerous privacy issues that continue to crop up. Nonetheless, 2011 is shaping up to be a big year for big data.

To read more about the opportunities, players, business models and challenges in the space, check out our 2011 Big Data Preview at GigaOM Pro(subscription required). For more insights from the big data landscape, come to GigaOM’s Structure: Big Data conference on March 23 in New York City.

Earl Bellinger

Earl Bellinger

Peng's Blog

where ideas come from

ajduke's blog

technical notes on software development

Urban Armor

DIY wearable electronics for intervening in the everyday

Pod-able Life

Pod, noun: streamlined housing of some kind.

Practical Vision Science

Tom Wallis blogs about vision science, open science and data analysis

Dejan Glozic

30% Turtleneck, 70% Hoodie

Representation Learning

Course material for graduate class ift6266h13

Sina Honari's blog for Representation Learning

My blog on the Representation Learning Course

IFT6266 Project

Log of my Representation Learning course project


I am the Bad Wolf, I create myself

Machine Learning on Emotion Recognition

Research Journal of Yangyang Zhao for Machine Learning Course

IFT 6266 H13 Blog

Welcome to the machine l...

Experimenting with representation learning

My journal for IFT6266 projects

My missives

A technologist's view of things ...

Marcos Nieto's Blog

Computer vision, research and more!


Technology news, trends and analysis covering mobile, big data, cloud, science, energy and media

Tickett's Blog

Jibber Jabber!


News About Tech, Money and Innovation