Big Data is a combination of well-known and less common technologies and is used mostly by larger organizations and innovative start-ups to solve problems or find insights.
Retail companies, for example, use big data to correlate large quantities of customer data in order to optimize their stock or delivery systems. To do this, they need systems that can gather various types of data, such as video, audio, written text and speech, and for the most part in real time.
Let’s take a look at some of the technologies.
Hadoop and more
“At the most basic level, Apache Hadoop is an open-source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space.
Hadoop delegates tasks across these servers (called “worker nodes” or “slave nodes”), essentially harnessing the power of each device and running them together simultaneously. This is what allows massive amounts of data to be analyzed: splitting the tasks across different locations in this manner allows bigger jobs to be completed faster.”
Hadoop can be thought of as an ecosystem made up of many different components that work together as a single platform. There are two key functional components within this ecosystem: the storage of data (Hadoop Distributed File System, or HDFS) and the framework for running parallel computations on this data (MapReduce).
MapReduce is the system used to efficiently process the large amounts of data Hadoop stores in HDFS. The programming model was originally created by Google to index the World Wide Web, and its strength lies in the ability to divide a single large data processing job into smaller tasks. MapReduce jobs are normally written in Java, but other languages can be used via Hadoop Streaming, a utility that comes with Hadoop.
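The idea of dividing one large job into smaller tasks can be sketched in plain Python with a word count, a common introductory example. This is a single-process illustration and not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are made up for this sketch, and on a real cluster the map and reduce tasks would run on separate worker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different worker; here we loop sequentially.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big tools", "data tools scale"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread across many machines without the tasks needing to share memory.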
Hadoop-based Big Data projects often require separate architecture, servers and network redesign. There are also newer technologies, such as Apache Spark, Storm and Kafka, that go beyond batch processing and move toward real-time streaming analytics.
Natural language processing (NLP)
Once all the data sources and data have been compiled, the next task is to retrieve structured data from unstructured data, especially text. This is where Natural Language Processing comes in.
Natural Language Processing, a branch of Artificial Intelligence, gives machines the ability to read, understand and derive meaning from the languages humans speak.
NLP is at the heart of modern software that processes and understands human language, leveraging the vast amount of language data on the web and in social media. One of the best-known examples of NLP is IBM’s Watson, which won the TV quiz show Jeopardy! by beating two of the show’s greatest champions.
As BigData-Startups.com founder Mark van Rijmenam writes, “the key stumbling block here is that computers understand ‘unambiguous and highly structured’ programming language, while human language is a minefield of nuance, emotion, and implied intent.”
Geoffrey Pullum, a professor of general linguistics at the University of Edinburgh, outlines three prerequisites for computers to master human language: “First, enough syntax to uniquely identify the sentence; second, enough semantics to extract its literal meaning; and third, enough pragmatics to infer the intent behind the utterance, and thus discerning what should be done or assumed given that it was uttered.”
Named entity recognition (NER)
Named entity recognition (NER) is an NLP task that consists of tagging groups of words that correspond to “predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages.”
NER finds mentions of specified things in text, which helps in drawing insights from the resulting structured data.
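A toy version of this tagging step might look like the following. It is a rule-based sketch for illustration only: production NER systems generally rely on statistical or neural models trained on annotated text, and the organization list, the patterns and the function name `tag_entities` are all assumptions made up for this example.

```python
import re

# Hypothetical lookup list; real NER systems learn such names from annotated text.
KNOWN_ORGANIZATIONS = ["IBM", "Google", "Forbes"]

# Hand-written patterns for two of the categories named above.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def tag_entities(text):
    """Return a list of (entity_text, category, start_offset), sorted by position."""
    entities = []
    for name in KNOWN_ORGANIZATIONS:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.group(), "ORGANIZATION", match.start()))
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), category, match.start()))
    return sorted(entities, key=lambda entity: entity[2])


if __name__ == "__main__":
    sentence = "IBM said revenue rose 4.5% to $1,200 after a deal with Google."
    for entity in tag_entities(sentence):
        print(entity)
```

Each tuple is a small piece of structured data extracted from free text: it can be counted, filtered or joined with other records in a way the original sentence cannot.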
Natural Language Generation (NLG)
Natural Language Generation (NLG) produces written language from data. A recent Forbes article, “Why Big Data Needs Natural Language Generation to Work,” explains why this matters for big data:
“But the bigger game of NLG is not about the language but about handling the growing number of insights that are being produced by big data through automated forms of analysis. If your idea of big data is that you have a data scientist doing some sort of analysis and then presenting it through a dashboard, you are thinking far too small. The fact of the matter is that big data really can’t be understood without machine learning and advanced statistical algorithms. While it takes skill and expertise to apply these methods, once you have them running, they continue to pump out the insights.”
“The data, once extracted, would then be sent to the semantic engine which would first determine what was true and then determine which of those signals are important and impactful to various audiences.”
“What is true is determined through the application of techniques that would be familiar to any data scientist: time series and regression analysis, histogramming, ranking, etc. The semantic engine then decides what’s important based on an understanding of what’s normal for the whole population of the data.”
“The second type of analysis the semantic engine does is to determine what is interesting or impactful to a particular audience. A retail representative at a bank may be interested in a whole different set of signals than someone who is originating mortgages.”
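The two kinds of analysis described in these passages can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine the article describes: it uses a simple z-score against the whole population to decide what is unusual, a set of metric names to stand in for audience interest, and a fixed sentence template for the generated text. All function names, the threshold and the sentence wording are hypothetical.

```python
from statistics import mean, pstdev


def find_signals(values, threshold=2.0):
    """Flag entries whose value is unusual relative to the whole population.

    `values` maps a label to a number; an entry counts as a signal when the
    absolute value of its z-score exceeds `threshold`.
    """
    numbers = list(values.values())
    average = mean(numbers)
    spread = pstdev(numbers)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    signals = []
    for label, value in values.items():
        z_score = (value - average) / spread
        if abs(z_score) > threshold:
            signals.append((label, value, z_score))
    return signals


def describe(signal, metric):
    """Turn one signal into a sentence using a fixed template."""
    label, value, z_score = signal
    direction = "above" if z_score > 0 else "below"
    return f"{label} reported {metric} of {value}, well {direction} the average."


def report_for(audience_interests, metrics):
    """Keep only the metrics this audience cares about and describe their signals."""
    sentences = []
    for metric, values in metrics.items():
        if metric not in audience_interests:
            continue
        for signal in find_signals(values):
            sentences.append(describe(signal, metric))
    return sentences


if __name__ == "__main__":
    # Made-up figures for illustration only.
    metrics = {
        "new mortgages": {
            "Branch A": 100, "Branch B": 102, "Branch C": 98,
            "Branch D": 101, "Branch E": 99, "Branch F": 300,
        },
        "retail deposits": {"Branch A": 50, "Branch B": 52, "Branch C": 48},
    }
    for sentence in report_for({"new mortgages"}, metrics):
        print(sentence)
```

Calling `report_for` with a different set of interests over the same data yields a different report, which is the audience-specific filtering the quoted passage has in mind.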
You need a systematic approach to a semantic model. That’s the secret to having a big impact from big data.
reportbrain takes Big Data and applies Natural Language Processing (NLP), including techniques such as Named Entity Recognition (NER), on its platform. Try it out and see the difference in your search results.
What is reportbrain?
reportbrain is an integrative start-to-finish intelligence platform that helps you process information for insightful business decisions.
reportbrain allows you to:
- Search Global News: find all you need from across the web and from thousands of sources across the globe.
- Analyze In-Depth: be the first to learn every facet of your market by understanding the relationships between key developments.
- Report in Style: drag-and-drop articles and add your own editorials to communicate information effectively.