Talking about machine learning today is hardly possible without talking about distributed, fault-tolerant data storage and query systems.
Hadoop, an open-source Apache project that emerged from Yahoo! in 2006 and is based on papers Google published from 2003 onward, is one such widely used system. Even though it is basically ‘mostly just’ a filesystem, its three biggest advantages are
At the core of Hadoop is the filesystem HDFS (Hadoop Distributed File System), which stores its data in blocks across all DataNode machines. Each block is usually replicated on one (or more) machines in the same rack, as well as on a machine in another rack. Clients connecting to Hadoop to read or write first ask the NameNode, which tells them which DataNodes they can access.
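The lookup described above can be sketched as a toy model in Python. This is not the HDFS API: the class ToyNameNode, the rack and DataNode names, the path and the tiny block size are all made up for illustration, and the placement rule is a simplification of what the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it holds metadata only, never file contents."""
    racks: dict                                    # rack name -> list of DataNode names
    block_map: dict = field(default_factory=dict)  # (path, block index) -> DataNode names

    def place_block(self, path, index):
        # Simplified placement: two replicas in one rack, one replica in another rack.
        rack_names = sorted(self.racks)
        local = rack_names[index % len(rack_names)]
        remote = rack_names[(index + 1) % len(rack_names)]
        replicas = self.racks[local][:2] + self.racks[remote][:1]
        self.block_map[(path, index)] = replicas
        return replicas

    def locate(self, path):
        # What a client asks first: which DataNodes hold each block of this file?
        indices = sorted(i for (p, i) in self.block_map if p == path)
        return [self.block_map[(path, i)] for i in indices]


def split_into_blocks(data: bytes, block_size: int):
    # Real HDFS blocks are far larger; a tiny size keeps the example readable.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


namenode = ToyNameNode(racks={"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]})
blocks = split_into_blocks(b"x" * 10, block_size=4)
for i, _ in enumerate(blocks):
    namenode.place_block("/logs/day1", i)

locations = namenode.locate("/logs/day1")
for i, nodes in enumerate(locations):
    print(f"block {i}: {nodes}")
```

The point of the sketch is the division of labour: the NameNode only answers "where", and the client would then fetch the actual bytes from one of the listed DataNodes.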
The current Hadoop 2.x versions rely on YARN (Yet Another Resource Negotiator) for resource management, and with data and server replication there is no longer a single point of failure. On top of that, Hadoop uses MapReduce, a programming model that processes data as key/value pairs. Therefore Hadoop is great for retrieving and querying large amounts of data.
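To make the key/value idea behind MapReduce more concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop at all; the function names map_phase, shuffle and reduce_phase and the sample lines are assumptions chosen for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (key, value) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, the step a framework performs between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on the machines that hold the data; the sketch only shows the shape of the computation.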
Its drawbacks are: the data in HDFS is not editable, only appendable; it takes a lot of configuration work; and, without any additions, it is not suited for real-time queries.
As for machine learning, the more data the merrier, so many experts, e.g. from Spring, believe that Hadoop is a great solution for data volumes of 10 terabytes and more.
SAS for Hadoop / Hortonworks
There are countless derivatives and distributions of Hadoop. For BI use cases, besides IBM there is also the Hortonworks Data Platform, which utilises SAS In-Memory Statistics for Hadoop and SAS Visual Statistics for machine learning.
STATISTICA by StatSoft (Dell)
StatSoft’s Statistica, owned by Dell since 2014, is one of the best-known statistics suites. It offers Statistica HP (High Performance) for massively parallel in-memory processing. The software works with a wide range of systems, including Hadoop, SAP HANA, Windows HPC Server / PolyBase / Parallel Data Warehouse, Oracle Exadata, Teradata Aster, Pig, Hive, Sqoop and IBM’s Netezza.
It is widely used worldwide across all industries for data analytics as well as predictive analytics.
IBM’s Predictive Analytics
Besides offering IT consulting in the field of predictive data analytics, IBM also offers a wide variety of its own tools.
Since 2009 this includes SPSS Statistics (Statistical Package for the Social Sciences), the standard software for statistical analysis in the social and other sciences.
IBM’s Hadoop offering and Watson Analytics will be discussed in this blog as well.
The German SAP HANA system is also one of the leading software products for business data and analytics. Using Hadoop as its filesystem, HANA can perform predictive data analytics with its in-memory engine.
More about SAP HANA and Hadoop
Read more about Hadoop, IBM’s Hadoop and SAS for BI in the following articles.
Online Classes for Big Data (Predictive) Data Analytics and similar topics
Introduction to Big Data with Apache Spark : Learn how to apply data science techniques using parallel programming in Apache Spark to explore big (and small) data. by UC Berkeley @ edX
Data, Analytics and Learning : An introduction to the logic and methods of analysis of data to improve teaching and learning. by UT Arlington @ edX
Big Data and Social Physics : Understanding big data, how to use it to improve companies, cities, and government, and best practice for privacy. by MIT @ edX
Knowledge Management and Big Data in Business : Learn why and how knowledge management and Big Data are vital to the new business era. by Hong Kong Polytechnic University
Foundations of Data Analysis : This is a hands-on course with a data lab to teach fundamental statistical topics such as descriptive statistics, inferential testing, and modeling. by UT Austin
The Analytics Edge : Through inspiring examples and stories, discover the power of data and use analytics to provide an edge to your career and your life. by MIT @ edX
And many more at edX
Or check out the Data Analyst Nanodegree @ Udacity