The next industrial revolution will be powered by data factories, fed by the most valuable raw material of all: data. The total value at stake is $5,000,000,000,000. Five trillion dollars.
The future belongs to the companies and people that turn data into products. Data is a key raw material for a variety of socio-economic business systems. Unfortunately, the ability to maximize the use of and value from data has been underleveraged. A massive change in technology ecosystems, an explosion in sensors emitting new data, and the sprouting of new business models are revolutionizing the data landscape, forever changing the way we look at, solve, and commercialize business problems.
This is the world of Big Data, where big is merely a reference to the size of the opportunity, not the inability to address it. Examples of big data sources include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.
It is ushering us into the next industrial revolution.
The Web is full of data-driven applications. Almost any e-commerce application is a data-driven application: a database sits behind a web front end, and middleware links it to a number of other databases and data services. However, merely using data is not really what we mean by data science. A data application acquires its value from the data itself, and creates more data as a result. It is not just an application with data; it is a data product. Data science enables the creation of data products.
What differentiates data science from statistics is that data science is a holistic approach. We are increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.
One of the earlier data products on the Web was the CDDB database. Google is a master at creating data products.
One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead massively parallel software running on tens, hundreds, or even thousands of servers. The size of big data varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.
The ability to automate the data pipeline and then rapidly find information in the massive amounts of data will be critical to success. The way has already been defined by data-driven web applications. Google’s breakthrough was realizing that a search engine could use input other than the text on the page. Google’s PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient in the company’s success.
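To make the link-counting idea concrete, here is a minimal sketch in Python. It is the simplified textbook form of PageRank computed by power iteration, not Google’s production system. The toy link graph, the page names, the damping factor of 0.85, and the iteration count are all illustrative assumptions.

```python
def inbound_link_counts(links):
    """Count how many pages link to each page (the signal described above)."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank computed by power iteration."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web of four pages: page -> pages it links to.
toy_links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}
counts = inbound_link_counts(toy_links)
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {counts[page]} inbound links, rank {ranks[page]:.3f}")
```

In this toy graph the page with the most inbound links also ends up with the highest rank, which is the intuition the paragraph above describes: links from other pages act as votes that the page’s own text cannot supply.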
Spell checking is not a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They have built a dictionary of common misspellings, their corrections, and the contexts in which they occur.
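A rough sketch of how such a dictionary could be assembled from observed clicks follows. The log format (typed query, suggestion shown, whether the user clicked it), the sample entries, and the `min_clicks` threshold are hypothetical, and the sketch ignores the context in which a misspelling occurs, which the real system is said to use.

```python
from collections import Counter, defaultdict


def build_corrections(click_log, min_clicks=2):
    """Map each typed query to the suggestion users accepted most often."""
    accepted = defaultdict(Counter)
    for typed, suggestion, clicked in click_log:
        if clicked:
            # Only suggestions the user actually clicked count as evidence.
            accepted[typed][suggestion] += 1
    corrections = {}
    for typed, counter in accepted.items():
        suggestion, clicks = counter.most_common(1)[0]
        # Require a minimum number of clicks before trusting a correction.
        if clicks >= min_clicks:
            corrections[typed] = suggestion
    return corrections


# Hypothetical click log: (typed query, suggestion shown, user clicked it?)
sample_log = [
    ("recieve", "receive", True),
    ("recieve", "receive", True),
    ("recieve", "relieve", False),
    ("hadop", "hadoop", True),
    ("hadop", "hadoop", True),
    ("hadop", "hadron", True),
    ("teh", "the", True),
]
corrections = build_corrections(sample_log)
print(corrections)
```

The point of the design is that users do the labeling: every click on a suggestion is a small vote, and the dictionary improves as more searches flow through it.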
Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they’ve collected, and has been able to integrate voice search into their core search engine.
During the swine flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
Many IT leaders are attempting to manage big data challenges by focusing on the high volumes of information to the exclusion of the many other dimensions of information management, leaving massive challenges to be addressed later.
Big data is a popular term used to acknowledge the exponential growth, availability, and use of information in the data-rich landscape of tomorrow. The term big data puts an inordinate focus on the issue of information volume. The focus is in every aspect from storage through transform/transport to analysis.
A preview into the future – Data Factories
Digital data is now everywhere—in every sector, in every economy, in every organization and user of digital technology.
We are truly fortunate to be witnessing the birth of the second industrial revolution. This revolution is being fueled by the factories of the future – data factories.
Industries outside the web properties, saddled as they are with legacy technology platforms, just have not enjoyed the same level of innovation and disruption that data brings to their business models. The same technology principles that have completely obliterated web property business models (Hadoop being the major force for change) will be tailored to other industry verticals and cause similar disruption.
YouTube continues to grow, and on current Alexa figures the online video site is now ranked the number 4 website in the world, behind only Yahoo, MSN, and Google itself!
This data-powered industrial revolution will be bigger, faster, and more disruptive than the first one because data is a core asset for many industries and, when enabled, can help solve problems that could not be solved before. Everything from fraud and disease management to smart grids and dynamic traffic management is now within grasp of being solved. The ability to automate the data pipeline and then rapidly find information in the massive amounts of data will be critical to success. The path has already been laid by the web properties. The ability to manage, clean, process, and model massive amounts of data is a skill that has historically been undervalued. That is about to change. Data experts, just like algorithm experts, are a key resource for enterprises. In fact, they are more valuable than any other resource, except data itself.
Google was the first to prove that data assembly lines can be treated the same way as manufacturing. In addition, automation is the only way to scale and manage large quantities of data in an efficient manner.
Information managers may be tempted to focus on volume alone, even as they lose control of the access and qualification aspects of data at the same time. IT managers should broaden their narrow focus to address the other dimensions of big data.
Worldwide information volume is growing at a minimum rate of 59 percent annually. While volume is a significant challenge in managing big data, business and IT leaders must focus on all three dimensions: information volume, variety, and velocity.
Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data. Too much volume is a storage issue, but too much data is also a massive analysis issue.
Variety: IT leaders have always had an issue translating large volumes of transactional information into decisions. Now there are more types of information to analyze, mainly coming from social media and mobile (context-aware) sources. Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions, and more.
Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.
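To put the 59 percent annual growth rate cited above in perspective, the following minimal sketch works through the compounding arithmetic. The 100-terabyte starting volume and the five-year horizon are hypothetical values chosen only for illustration, and the calculation assumes the growth rate stays constant.

```python
import math

ANNUAL_GROWTH = 0.59  # minimum yearly growth rate quoted in the text


def projected_volume(start_volume, years, growth=ANNUAL_GROWTH):
    """Volume after compounding the yearly growth rate for `years` years."""
    return start_volume * (1.0 + growth) ** years


def doubling_time_years(growth=ANNUAL_GROWTH):
    """Years needed for volume to double at a constant growth rate."""
    return math.log(2.0) / math.log(1.0 + growth)


start_tb = 100.0  # hypothetical starting volume in terabytes
five_year_tb = projected_volume(start_tb, 5)
doubling = doubling_time_years()
print(f"After 5 years: {five_year_tb:.0f} TB")
print(f"Doubling time: {doubling:.2f} years")
```

At this rate, volume doubles in roughly a year and a half and grows about tenfold in five years, which is why storage planning alone cannot keep pace with the problem.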
While big data is a significant issue, the real issue is making sense of big data and finding patterns in it that help organizations make better business decisions. The ability to manage extreme data will be a core competency of enterprises that are increasingly using new forms of information, such as text, social, and context, to look for patterns that support business decisions in what we call Pattern-Based Strategy. Big data enables companies to create new products and services, enhance existing ones, and invent entirely new business models. Data generated from the Internet of Things will grow exponentially as the number of connected nodes increases.
Pattern-Based Strategy, as an engine of change, utilizes all the dimensions in its pattern-seeking process. It then provides the basis of the modeling for new business solutions, which allows the business to adapt. The seek-model-and-adapt cycle can then be completed in …
A great example of one such company is also a personal favorite, Zynga. Zynga is the biggest gaming company on Facebook. Zynga is also redefining gaming as we know it, taking it from the multimillion-dollar, two-year development cycles of blockbuster games and transforming it into real-time, massively multiplayer simple games with a huge social component (because games are meant to be played with friends). At the heart of it, though, Zynga, like Google, is a data and analytics company. In addition, they have designed an organizational structure that’s equally dynamic. They offer hundreds of games and have embedded data scientists with every one of them. These data scientists use Tableau to look at the data coming from the games people are playing. They have the ability in real time to change features of the game or the tools they offer in it, be it the weapons in Mafia Wars™ or fruits and vegetables
Rather than centralizing analytics, they completely democratized it. In the future, organizations will have to do the same. They will have to democratize access to data at all levels of the organization, forever changing the way we make decisions, deliver products and services, and charge for them. Welcome to the future.
The future belongs to the companies that figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their data streams and made them the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it is mining your personal biology, building maps from the shared experience of millions of travelers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data.
Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop’s contributors work for some of the world’s biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining, and understanding large-scale data in order to better comprehend the data deluge. Enterprises today collect and generate more data than ever before. Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured data and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old data and new data sets in powerful new ways.
Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.
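The MapReduce technique itself can be illustrated with the classic word-count example. The sketch below is a single-process Python simulation of the map, shuffle, and reduce steps; it does not use Hadoop’s actual API or HDFS, and the function names and sample documents are hypothetical. On a real cluster, the map and reduce calls would run in parallel on many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input records; on a cluster these would be blocks of a large file.
documents = [
    "big data big opportunity",
    "data factories turn data into products",
]
totals = word_count(documents)
print(totals)
```

Because each map call looks at only one record and each reduce call looks at only one key, the work can be split across as many machines as are available, which is what lets the approach scale to very large data sets.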
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
Originally developed and employed by dominant Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other markets with significant data. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information and questions.
Cloudera is an active contributor to the Hadoop project and provides an enterprise-ready, commercial Distribution for Hadoop. Cloudera’s Distribution bundles the innovative work of a global open-source community; this includes critical bug fixes and important new features from the public development repository and applies all this to a stable version of the source code. In short, Cloudera integrates the most popular projects related to Hadoop into a single package, which is run through a suite of rigorous tests to ensure reliability during production.