Big Data was a buzzword, and a mostly undefined fuzzword, of the last few years. Now we have “Data Science”.
Whereas Big Data combines the 4 Vs (Volume, Variety, Velocity, Value = Analytics), Data Science is the science of working with big data. Here are some definitions:
Data Scientists are either statisticians with outstanding programming skills or software developers with outstanding knowledge of statistics.
while others say
Data Science is just a fancy word for advanced linear regression
and a widely acknowledged definition is:
Data Scientists know the Big Data tools as well as statistical analytics, combined with domain knowledge of the field they work in.
IDC calculated that we created 2.8 zettabytes of data in 2012; by 2020 it is supposed to be 40 zettabytes already (tera, peta, exa, zetta, yotta … that is 40 * 10^21 bytes). Eureka! It’s the new, more eco-friendly crude oil 🙂
Carsten Bange from BARC said: Big Data is not only about big amounts of data. It’s also about the processes and methods for scalable retrieval and analysis of information that is present in diverse, often unpredictable structures.
Times have changed and earlier database paradigms have become outdated. For example, the rule that there should be no redundant data in SQL databases has become somewhat obsolete: in Hadoop, redundancy is required at its core, and because raw data is saved, it usually comes with redundant data.
The NoSQL area comes with a wide variety of techniques. With BASE (Basically Available, Soft State, Eventually Consistent), the days of ACID (Atomic, Consistent, Isolated, Durable) as the de-facto standard are over. Some systems use key-value, graph or document-oriented databases (like InfiniteGraph, Neo4j, CouchDB, MongoDB), others column-oriented tables (Amazon SimpleDB, Hadoop, SAP HANA).
Considering the CAP theorem, you can’t have it all: of Consistency, Availability and Partition Tolerance you have to pick two.
Processes of Data Seas
Big Data also means partly unstructured or incomplete data. It is therefore not like common databases (such as SQL tables), where a schema of the data is always defined beforehand. Instead it has to be more like a sea of data into which all data flows, whatever shape it has. Formatting, preparation and transformation are then done afterwards, right before the processing.
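To make the "shape it afterwards" idea a bit more concrete, here is a minimal sketch in Python. The raw lines, the field names (user, name, amount, ts) and the function read_with_schema are all made up for illustration; the point is only that the schema is applied when reading, not when storing.

```python
import json

# Hypothetical raw records as they might land in the "sea": no fixed schema.
RAW_LINES = [
    '{"user": "anna", "amount": "12.50", "ts": "2015-03-01"}',
    '{"user": "ben", "amount": 7}',
    'not even json',
    '{"name": "carl", "amount": "3.2", "extra": {"k": 1}}',
]

def read_with_schema(lines):
    """Apply a schema at read time: parse, normalize, skip unusable rows."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # broken line: the raw data stays stored, we just skip it here
        if not isinstance(record, dict):
            continue
        user = record.get("user") or record.get("name")  # two spellings of one field
        amount = record.get("amount")
        if user is None or amount is None:
            continue  # incomplete record
        yield {"user": user, "amount": float(amount), "ts": record.get("ts")}

rows = list(read_with_schema(RAW_LINES))
for row in rows:
    print(row)
```

The raw lines are never changed; a different analysis could read the same data with a different schema.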
Interview Extract (tbc)
Hadoop, Apache Projects & other Tools
At its core, Hadoop implements HDFS, the Hadoop Distributed File System. This is a distributed filesystem which is extremely scalable, stores data redundantly, knows the network topology for improved error handling, and much more. Most of the applications within the Hadoop ecosystem are able to access HDFS.
Yet Another Resource Negotiator (YARN) handles distributed jobs and tasks. Instead of transferring data to the machines that process it, the jobs/tasks are sent to the machines where the data already is, or close to it. Usually a job is sent as a single Java JAR file, and YARN takes care of distribution as well as error handling. Most but not all tools on top of Hadoop use YARN.
MapReduce is Hadoop’s programming model for processing large quantities of distributed data. Within the Map phase, the tasks are sent to the nodes, where they are computed locally and key-value result sets are created. Then, within the Reduce phase, these intermediate results are grouped by key across the machines and condensed into the final result.
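The two phases can be sketched on a single machine in plain Python. This is not Hadoop code, only the classic word count used to illustrate the model; the function names map_phase, shuffle and reduce_phase and the sample chunks are assumptions for this example. In a real cluster each chunk would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: runs locally on one chunk and emits (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: condense all values of one key into a single result."""
    return key, sum(values)

# Three chunks stand in for data blocks stored on three different nodes.
chunks = ["big data big tools", "data science tools", "big science"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```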
MapReduce was a big success factor for Hadoop at the beginning. As it relies heavily on writing results to disk, it is not suitable for real-time computing. This ‘batch technology’ is therefore used less and less by other tools within the Hadoop world. Several tools implement their own processing model, using either disk-based or in-memory approaches.
Apache Spark: In-memory data processing engine
Apache Storm: Stream processing for Input & Output
Apache Hive: Data warehouse providing data summarization, query, and analysis
Apache Hive Stream: the Storm alternative for Hive
Apache Flink: Extra-fast system developed at TU Berlin
Apache Drill: SQL for ad-hoc reading from Hadoop, based on Google Dremel
Apache Pig: Script-based, SQL-like Hadoop task creator
Apache ZooKeeper: Coordination and configuration service for the Hadoop world
Apache Hue: Web Administration of Hadoop ecosystem
Cloudera Impala: In-memory alternative to Hadoop, but can also be used on top of Hadoop’s HDFS
The Business Intelligence solutions: BI software vendors like JasperSoft, Tableau, Pentaho and Qlik, as well as the giants Oracle, Microsoft, SAP and SAS, support Hadoop.
SQL engines reading Hadoop HDFS: Cloudera Impala and IBM BigSQL / InfoSphere BigInsights
Queries combining data from RDBMS and Hadoop: Microsoft Analytics Platform System (previously Parallel Data Warehouse) and Oracle (Big Data SQL)
In-Memory Spark (tbc)
In-Memory SAP HANA
SAP HANA is an extremely fast in-memory SQL database which is organized by columns, not rows. The system is as fast as promised and fairly similar to SQL or T-SQL databases. It not only supports Hadoop’s HDFS but also offers simple integration with SAP’s systems, such as R/3. Data in HANA is compressed, which is advertised to fit 7 times the amount of (CSV) data per gigabyte. Real-world examples are even better and compress to one fifteenth of the size. HANA SQL is pretty similar to T-SQL with some different syntax (found here as PDF).
What took hours on regular databases now runs in minutes. Just let HANA optimize itself (no manual indices needed) and, if needed, give a command a limited amount of RAM (possible with the new version) in case there is an error in the code.
A nice example for HANA is process mining. Instead of relying on people’s subjective opinions and answers, the real-world processes are drawn by data-driven analytics. Mostly by checking timestamps and connecting tables of log files and changes in the database, you can find out what percentage of processes run in an unusual order. You get objective information so you can improve the processes.
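The basic idea fits into a few lines of Python. This is only a sketch with an invented event log: the case ids, activity names, timestamps and the expected order are assumptions, and a real setup would run such a query inside the database on far larger tables.

```python
from collections import defaultdict

# Hypothetical event log: (case id, activity, ISO timestamp).
EVENTS = [
    ("order-1", "created",  "2015-01-05T09:00"),
    ("order-1", "approved", "2015-01-05T10:00"),
    ("order-1", "shipped",  "2015-01-06T08:00"),
    ("order-2", "created",  "2015-01-05T09:30"),
    ("order-2", "shipped",  "2015-01-05T11:00"),  # shipped before approval
    ("order-2", "approved", "2015-01-07T12:00"),
    ("order-3", "approved", "2015-01-08T10:00"),  # rows may arrive unsorted
    ("order-3", "created",  "2015-01-08T09:00"),
    ("order-3", "shipped",  "2015-01-09T09:00"),
    ("order-4", "created",  "2015-01-09T09:00"),
    ("order-4", "approved", "2015-01-09T09:30"),
    ("order-4", "shipped",  "2015-01-10T09:00"),
]

# Assumed "normal" order of the process steps.
EXPECTED = ("created", "approved", "shipped")

def traces(events):
    """Order the activities of every case by timestamp."""
    by_case = defaultdict(list)
    for case, activity, timestamp in events:
        by_case[case].append((timestamp, activity))
    # ISO timestamps in one format sort correctly as plain strings.
    return {case: tuple(a for _, a in sorted(steps)) for case, steps in by_case.items()}

def unusual_share(events, expected):
    """Return the share of cases that deviate from the expected order, and their ids."""
    all_traces = traces(events)
    unusual = sorted(c for c, trace in all_traces.items() if trace != expected)
    return len(unusual) / len(all_traces), unusual

share, cases = unusual_share(EVENTS, EXPECTED)
print(f"{share:.0%} of the processes run in an unusual order: {cases}")
```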
So the conversion from traditional SQL databases to SAP HANA is happily unspectacular, as Sebastian Walters states in the German source Big Data iX by heise.de.
Some say predictive Data Science is a fancier linear regression. Of course it’s not that simple, but statistics, and mostly regression, is at its core.
Linear Regression and other statistical methods
Most common in linear regression is the method of least squares by Carl Friedrich Gauß from Göttingen. Here, discrete and incomplete data points are given and a function is sought which has the smallest distance to all points on average. It shouldn’t matter whether the distances are positive or negative, therefore the squares of the distances are calculated. To create such a function one can choose how complicated the function can be (linear, any polynomial, sin, cos, log, …) and how many parameters it should have (e.g. y = ax^2 + bx + c). Finally, a continuous function is built from discrete values.
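For the simplest case, a straight line y = slope * x + intercept, the least-squares solution has a closed form. The following is a minimal sketch in plain Python; the function name fit_line and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations in x and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical discrete measurements that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """The continuous function built from the discrete points."""
    return slope * x + intercept

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 2.5:", round(predict(2.5), 2))
```

The fitted line can be evaluated at any x, including values between or beyond the measured points.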
With modern computers it’s easier to create predictions with non-linear methods, such as the Gauß-Newton method.
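As a rough sketch of how Gauß-Newton works, the following Python code fits the non-linear model y = a * exp(b * x). The model, the start values and the data are assumptions chosen for this example. In every iteration the model is linearized around the current parameters and a small linear least-squares problem is solved for the update. The step halving is an extra safeguard added here and is not part of the textbook method.

```python
import math

def model(x, a, b):
    return a * math.exp(b * x)

def squared_error(xs, ys, a, b):
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def gauss_newton(xs, ys, a, b, iterations=50, tol=1e-12):
    """Fit y = a * exp(b * x) by Gauss-Newton iterations with step halving."""
    for _ in range(iterations):
        # Build the 2x2 normal equations (J^T J) * delta = J^T r.
        s11 = s12 = s22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e            # residual of this point
            j1, j2 = e, a * x * e    # derivatives of the model w.r.t. a and b
            s11 += j1 * j1
            s12 += j1 * j2
            s22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        det = s11 * s22 - s12 * s12
        if det == 0.0:
            break  # singular system, give up
        da = (g1 * s22 - s12 * g2) / det
        db = (s11 * g2 - s12 * g1) / det
        # Safeguard (an addition for this sketch): halve the step while it worsens the fit.
        step, current = 1.0, squared_error(xs, ys, a, b)
        while step > 1e-6 and squared_error(xs, ys, a + step * da, b + step * db) > current:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return a, b

# Hypothetical noise-free data generated from a = 2, b = 0.5.
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a, b = gauss_newton(xs, ys, a=1.5, b=0.4)
print(f"a = {a:.4f}, b = {b:.4f}")
```

As with most non-linear methods, the result depends on reasonable start values; a poor start can end in a different local minimum.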
Another simple statistical method is maximum likelihood together with the confidence interval, where the average is compared with the average of a training sample set.
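One possible reading of this, sketched in Python: assume the values are normally distributed, so that the sample average is the maximum-likelihood estimate of the true mean, build a confidence interval around the average of the training sample, and check whether a new average falls inside it. The data, the 95% level and the function name are assumptions for illustration, and for such a small sample a t-distribution would be the more careful choice.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    m = mean(sample)  # maximum-likelihood estimate of the mean under a normal model
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * standard_error, m + z * standard_error

# Hypothetical training sample.
training = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
low, high = mean_confidence_interval(training)

# Hypothetical average of newly observed data.
new_average = 10.4
inside = low <= new_average <= high
print(f"interval: [{low:.3f}, {high:.3f}], new average inside: {inside}")
```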