It’s well known that big data is usually described in terms of the three Vs: Volume, Variety and Velocity. The three Vs aptly sum up the characteristics of big data and convey that it is heterogeneous, noisy, dynamic, inter-related and not always trustworthy. Companies now strive to convert those three Vs into a Big Promise, and Big Data’s promise can be summarized by three new descriptive terms: Veracity, Value and Victory.
1. Three Vs of Big Promise - Veracity, Value and Victory
Like the three Vs of big data, which describe its characteristics well and are themselves related (volume rests on both variety and velocity), the three Vs of the Big Promise also have an internal relationship. The Veracity mined from big data, built on volume and variety, determines the Value of big data. The Value determines the Victory when a business applies it appropriately and in a timely manner. The higher the veracity mined from raw data, the more valuable the results, the smarter the decisions a business can make, and the more successful the business will become. All of this leads to big Victory for the business.
While much around big data remains hype, many companies are in the fledgling stages of drawing value from their big data corpus, and amid an army of discussions and opinions around the topic, it is still hard to find a clear roadmap to the Big Promise.
2. The Journey from Big Data to Big Promise
Here I share my thoughts on the roadmap from a big-picture view, regardless of the type of business. Basically, there are three big steps:
Step 1: Big Data Collection – Gathering Organic Material
Regardless of where you are in the journey, it has to start with understanding the nature of big data as defined by the three Vs. There are voices that add more dimensions to big data, such as value and veracity; however, I do not think these are characteristics of raw big data. Instead, I define them as two characteristics of the big data promise.
Step 2: Big Data Analytics – Gleaning Big Insight
The core technologies are the big data platform and big data analytics. The big data platform provides the power of speedy processing, handling millions of records per second. It harnesses an integrated set of technologies for transforming organic/raw content into designed content, such as Natural Language Processing (NLP), data cleansing, transformation (ETL) and filtering methods. The goal is to transform semi-structured or unstructured data into a structured format for easier understanding, analysis and visualization.
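As a minimal sketch of that transformation step, the Python snippet below turns one raw social-media post into a structured record; the schema, field names and sample post are invented for illustration, not a reference implementation.

```python
import re

def to_structured(raw_post):
    """Turn one raw post (unstructured text) into a structured
    record with a fixed, hypothetical schema."""
    text = raw_post.strip()
    hashtags = re.findall(r"#(\w+)", text)          # simple feature extraction
    mentions = re.findall(r"@(\w+)", text)
    cleaned = re.sub(r"https?://\S+", "", text)     # data cleansing: drop URLs
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # normalize whitespace
    return {
        "text": cleaned,
        "hashtags": hashtags,
        "mentions": mentions,
        "length": len(cleaned),
    }

raw = "Loving the new espresso machine!! #coffee @BaristaBob http://example.com"
print(to_structured(raw))
```

A real pipeline would run steps like this at scale inside an ETL framework rather than record by record in plain Python.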
In the world of analytics, many different terminologies are used and referenced, such as text analytics, social media analytics, customer, social network, business or sentiment analytics. Given some deep thought, however, those terminologies basically fall into three functional categories: Descriptive Analytics, Relationship Analytics and Prescriptive Analytics. Each is explained in detail below.
1. Descriptive Analytics
Once organic data has been transformed into designed data in the data processing phase, the first analytics step is descriptive or exploratory. This phase uses simple statistics to gain a general understanding of the data: data properties such as dimensions and field types, and statistical profiles or summaries such as the number of records, missing values, field minimum/maximum/median, field value distributions, etc. The exploratory analysis provides initial knowledge about the raw content without digging deeper into internal relationships, and it can suggest the right strategies for deeper analysis. This phase can be done on a randomly sampled dataset with simple tools like an Excel sheet and visualized with basic chart types such as bar charts, pie charts and scatter plots. The characteristics of descriptive analytics are:
- Autonomy – the analysis is performed on individual fields and their values, independently of other fields, without considering connections between different fields and contents.
- Shallow and straightforward – the results are usually shallow, basic statistics, such as word-count frequencies or the number and percentage of employees earning about 5k within a certain geographic area.
- Simple and easy to understand – as the analysis method is basic statistical profiling with no extra effort involved, the results are also simple and easy to understand and visualize.
With descriptive analytics, one can reach a general understanding of what happened. It is like a doctor finding out what happened to a patient, the facts first, before digging into why the patient got the disease.
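As a minimal sketch of such profiling, the snippet below runs basic descriptive statistics over an invented toy dataset using the pandas library; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical "designed" dataset produced by the data processing phase.
df = pd.DataFrame({
    "region": ["east", "west", "east", "north", None],
    "salary": [48000, 52000, 61000, 47000, 55000],
    "tenure": [2, 7, 4, 1, 3],
})

print(df.shape)                                    # dimensions
print(df.dtypes)                                   # field types
print(df.isna().sum())                             # missing values per field
print(df["salary"].agg(["min", "max", "median"]))  # simple field summaries
print(df["region"].value_counts())                 # field value distribution
```

Output like this is exactly the kind of shallow, self-contained profile described above; no relationships between fields are examined yet.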
2. Relationship Analytics
This level of analytics aims to dig out the valuable insight embedded in big data. Compared with descriptive analytics, the analysis is deeper – succeeding at this level requires ample mining algorithms or methods such as advanced statistics, sophisticated machine learning, interdisciplinary studies, and meta or scalable algorithms; the process involved is usually also complicated and demanding in both speed and data volume.
I call the analytics at this level relationship analytics because its primary goal is to find connections among data elements. The connection may be time based, like a sequentially dependent relationship; geo-location based; functional-category based, like the relationship between products and customer purchasing patterns; or transaction based, like market basket analytics.
At this level of analysis, the methods used may be as below (a short sketch of each follows the list):
- Inferential or association analysis – draws insight from data through random processes that are studied with statistical methods. Inference depends on choosing the right population and sampling it randomly. For example, children of parents who are shorter than the average adult tend to be taller than their parents. In basket analysis, mining millions of transactions reveals that some items have a higher probability of being bought together, like coffee and its companions such as creamer. Some of these conclusions are easy to understand and match common sense; however, the high value comes from conclusions that go against people’s common sense or wrong assumptions.
- Model-based analysis – uses a model developed from known, observed data to infer or predict what will happen in the future. Two sub-categories are commonly known: classification and predictive modeling. When the target variable is categorical, the method is called classification; when it is a numerical or continuous variable, it is called a predictive method. Both need a well-labeled training dataset and a test dataset drawn from the same population as the training dataset. The analysis involves two phases: first a model is built with the training dataset, then it is evaluated with the test dataset to measure its performance. Once the model is developed, it is used to predict future events or target variables from the independent variables. For example, a linear regression model can be built from the factors that affected sales over the last three months and used to predict next month’s sales, or a decision tree model can be built to predict whether a specific Twitter message is positive or negative. Sometimes classification and predictive methods overlap, depending on the business application.
- Segmentation – dynamically groups data into different clusters based on a predefined measurement, such as a distance metric. Unlike classification or predictive methods, it needs no training or test data. For example, an algorithm can dynamically group similar Twitter messages into different clusters.
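To make the basket-analysis idea concrete, here is a minimal sketch in plain Python that counts item pairs and their support over a handful of invented transactions; a real analysis would run over millions of records and also compute measures such as confidence and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; real basket analysis mines millions of these.
transactions = [
    {"coffee", "creamer", "sugar"},
    {"coffee", "creamer"},
    {"bread", "butter"},
    {"coffee", "creamer", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / n)  # fraction of baskets with both items
```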
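For model-based analysis, the sketch below uses the scikit-learn library to walk through the two phases named above: fit a linear regression on training data, measure its error on held-out data, then forecast. The features (ad spend, discount rate) and the sales figures are invented for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical history: [ad_spend, discount_rate] -> monthly sales.
X = [[10, 0.0], [12, 0.1], [15, 0.0], [18, 0.2], [20, 0.1], [25, 0.2]]
y = [100, 125, 140, 190, 195, 250]

# Phase 1: build the model on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Phase 2: evaluate it on held-out data.
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Once validated, predict next month's sales from planned inputs.
print("forecast:", model.predict([[22, 0.1]]))
```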
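And for segmentation, a minimal clustering sketch, again with scikit-learn on a few invented messages: the texts are vectorized and grouped by similarity, with no labeled training data involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

messages = [
    "great coffee this morning",
    "coffee and creamer deal",
    "flight delayed again",
    "airport delays all day",
]

# No labels needed: vectorize the text, then cluster by distance.
X = TfidfVectorizer().fit_transform(messages)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for msg, label in zip(messages, km.labels_):
    print(label, msg)
```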
3. Prescriptive Analytics
Prescriptive analysis is essentially a business decision based on the conclusions or results drawn from relationship analysis: for a given situation, what is the best action to take so that we gain the expected result in the future? Suppose a patient goes to see a doctor. First the doctor performs descriptive analysis, the fact-finding phase, to understand what happened to the patient and related factors such as daily activities, workload and nutrition. Next the doctor performs relationship analysis to find out the possible factors that made the patient sick. Finally the doctor gives the patient a prescription, such as medicines to take, so that the patient can get well.
Step 3: Reap Big Promise
In order to fully empower a business with insight drawn from analytics, the veracity of the results has to be verified before they are deployed into a business application for generating valuable results. The main measures used to evaluate the veracity of analytics results or models include precision, recall and accuracy. We also need to consider the business cost, in dollars, of each error made.
Basically, there are three phases in evaluating performance:
1) Once the model or algorithm is developed, its performance is evaluated on a validation dataset drawn from the same population as the training dataset. If the result is not good enough, the model needs to be redeveloped by adding more data, tuning parameters or exploring other methods.
2) The model is evaluated against a test dataset drawn from a different source than the training data. This dataset is more representative of the real world at the time the model is developed, and the associated error cost should also be measured against business objectives.
3) The model is evaluated in an ongoing process. Because the world changes so fast, new data comes in that may be quite different from the dataset used to develop the model. Phase 3 should be performed on a regular schedule so that the predictions do not drift too far from expectations and cause a business crisis. Once the model is found to no longer perform well enough, the process goes back to phase 2).
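As a minimal sketch of the measures named above, the snippet below computes precision, recall and accuracy with scikit-learn on invented labels, and adds a business-weighted view with hypothetical per-error dollar costs.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Hypothetical labels from a held-out test dataset (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))

# Business view: price each error type in dollars (made-up costs).
fp_cost, fn_cost = 5.0, 20.0
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print("error cost: $", fp * fp_cost + fn * fn_cost)
```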
The value drawn from the veracity established in phase 2) also depends on how well the business takes full advantage of it: how many opportunities there are to use it to provide business intelligence to customers. Exploring the right business opportunities and defining the right objectives are the key factors in generating business value. If a company can generate higher revenue, victory will shine brightly tomorrow.
Original Source: http://smartdatacollective.com/ling-zhang/123661/journey-big-data-big-promise