Discover Hidden Opportunities

Spark

What is Spark?

Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.

Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Yahoo, Baidu, and Tencent have eagerly deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. The Spark project has quickly become the largest open-source community in big data, with over 500 contributors from 200+ organizations.

What is Spark used for?

Spark is a general-purpose engine used for many types of data processing. Spark comes packaged with support for ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and streaming over large datasets. For loading and storing data, Spark integrates with many storage systems (e.g. HDFS, Cassandra, HBase, S3). Spark is also pluggable, with dozens of third-party libraries and storage integrations.
Additionally, Spark supports a variety of popular development languages, including Java, Scala, Python, R, and SQL.

What are the benefits of Spark?

Spark was initially designed for interactive queries and iterative algorithms, as these were two major use cases not well served by batch frameworks like MapReduce. Consequently, Spark excels in scenarios that require fast performance, such as iterative processing, interactive querying, large-scale batch computations, streaming, and graph computations.

Developers and enterprises typically deploy Spark because of its inherent benefits:

Simple

Easy-to-use APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
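
To give a feel for the operator style, here is a minimal sketch in plain Python (not the Spark API) of the flatMap, map, and reduce-by-key pattern that a word count typically chains together. The helper names flat_map, reduce_by_key, and word_count and the sample lines are hypothetical, and everything runs on a local list rather than a distributed dataset.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting iterables."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Combine the values of each key with the binary function func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    """Count words by chaining flat_map, a per-word mapping, and reduce_by_key."""
    words = flat_map(str.split, lines)             # one line -> many words
    pairs = [(word.lower(), 1) for word in words]  # word -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)


if __name__ == "__main__":
    print(word_count(["Spark is fast", "Spark is simple"]))
    # prints {'spark': 2, 'is': 2, 'fast': 1, 'simple': 1}
```

In Spark itself the same shape is usually written with the flatMap, map, and reduceByKey operators on a distributed dataset, so the work is spread across the cluster instead of a single process.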

Fast

Engineered from the ground up for performance, Spark can run up to 100x faster than Hadoop MapReduce on some workloads by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk: it set a world record for large-scale on-disk sorting in the 2014 Daytona GraySort benchmark.

Unified Engine

Packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Broadly Compatible

Built-in support for many data sources, such as HDFS, RDBMS, S3, Cassandra, and MongoDB.

 


 

Spark Ecosystem

 

[Figure: Spark ecosystem chart]

General Execution: Spark Core

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

Structured Data: Spark SQL

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).

Streaming Analytics: Spark Streaming

Many applications need the ability to process and analyze not only batch data, but also streams of new data in real time. Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault-tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
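
Spark Streaming handles a stream as a series of small batches. The sketch below illustrates that idea, and the mix of historical and streaming data, in plain Python: a running count per event type is seeded from stored events and then updated as each micro-batch arrives. It is not the Spark Streaming API; the names update_counts and process_stream and the sample events are hypothetical.

```python
from collections import Counter


def update_counts(state, batch):
    """Return a new running count after folding one micro-batch into state."""
    updated = Counter(state)
    updated.update(batch)
    return updated


def process_stream(historical_events, batches):
    """Yield the running counts after each micro-batch.

    historical_events: events already stored before the stream began.
    batches: an iterable of lists, each holding one batch of new events.
    """
    state = Counter(historical_events)
    for batch in batches:
        state = update_counts(state, batch)
        yield dict(state)


if __name__ == "__main__":
    history = ["login", "login", "purchase"]
    stream = [["login"], ["purchase", "purchase"], []]
    for snapshot in process_stream(history, stream):
        print(snapshot)
```

Each snapshot reflects everything seen so far, so a dashboard or alerting rule reading it does not need to distinguish old data from new.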

Machine Learning: MLlib

Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights. Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce). The library is usable in Java, Scala, and Python as part of Spark applications, so that you can include it in complete workflows.
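
Many machine learning algorithms are iterative: they pass over the same dataset many times, refining a model on each pass, which is why keeping data in memory pays off. As a rough illustration only (plain Python, not MLlib), the sketch below fits a straight line with batch gradient descent. The function name fit_line, the default learning rate and iteration count, and the sample points are all assumptions made for this example.

```python
def fit_line(points, learning_rate=0.05, iterations=2000):
    """Fit y = w * x + b to (x, y) points with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Every iteration makes a full pass over the same dataset.
        grad_w = sum((w * x + b - y) * x for x, y in points) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in points) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Points that lie exactly on y = 2x + 1.
    data = [(0, 1), (1, 3), (2, 5), (3, 7)]
    slope, intercept = fit_line(data)
    print(round(slope, 3), round(intercept, 3))
```

With a batch framework that rereads its input from disk on every pass, the loop body would be dominated by I/O; with the data cached in memory, each pass costs only the arithmetic.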

Graph Computation: GraphX

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph structured data at scale. It comes complete with a library of common algorithms.
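
PageRank is one of the common algorithms that GraphX ships with. To show what such a graph computation involves, here is a small textbook-style PageRank in plain Python. It is a sketch of the general technique, not GraphX's implementation; the function name pagerank, the damping factor, the iteration count, and the sample edges are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph given as (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    sample_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    for node, score in sorted(pagerank(sample_edges).items()):
        print(node, round(score, 4))
```

Each iteration only needs a node's current rank and its outgoing links, which is what makes this kind of algorithm a natural fit for a distributed graph engine.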

SparkR

SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. As of Spark 1.4.0, SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr), but on large datasets.

 

Other Projects

One of the benefits of Spark’s vibrant open-source community is continued innovation that extends Spark’s capabilities. Many of these projects originated in UC Berkeley’s AMPLab. Here is a sample of ongoing community projects that are still in alpha:

BlinkDB: An approximate query engine for interactive SQL queries in Shark that allows users to trade off query accuracy for response time. This enables interactive queries over massive data by using data samples and presenting results annotated with meaningful error bars.
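
The trade-off between accuracy and response time can be illustrated with a small sampling sketch in plain Python. This is not BlinkDB's code or API; it only shows the general idea of answering an aggregate query from a random sample and attaching an error bar. The function name approximate_mean, its parameters, and the sample data are hypothetical, and the margin uses a simple normal approximation.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0, z=1.96):
    """Estimate the mean of values from a random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval when z is 1.96.
    sample_size must be at least 2 and at most len(values).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


if __name__ == "__main__":
    data = list(range(100000))
    for size in (100, 10000):
        estimate, margin = approximate_mean(data, size)
        print(f"sample={size}: {estimate:.1f} +/- {margin:.1f}")
```

A larger sample takes longer to scan but shrinks the margin, which is the knob the user turns when choosing between a faster answer and a tighter one.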

 


 

SERVICES


Architecture Services

  • Big Data project strategy and roadmap definition
  • Big Data project architecture and component design artifact definition
  • Real-time data integration strategy definition

Implementation Services

  • Predictive analytics model development
  • Spark component development
  • Spark Streaming development
  • MLlib development
  • Real-time data visualization and reporting component development
  • Solution deployment strategy definition and implementation

Integration Services

  • Runtime analytics integration with CEP
  • Real-time custom connector development

Infrastructure Services

  • Spark on-premises cluster configuration and setup
  • Spark cloud cluster configuration and setup
  • Kafka cluster setup
  • Spark cluster performance tuning

SOLUTIONS


    Real-Time Stream Analytics

    A cost-effective real-time data processing solution that helps you discover insights from your data and make the right business decisions on time. Scenarios for real-time streaming analytics can be found across all industries:
    • real-time stock-trading analysis and alerts offered by financial services companies;
    • real-time fraud detection;
    • data and identity protection services;
    • reliable ingestion and analysis of data generated by sensors and actuators embedded in physical objects (Internet of Things, or IoT);
    • web clickstream analytics;
    • customer relationship management (CRM) applications issuing alerts when customer experience within a time frame is degraded.
    The solution continuously analyzes data as it is captured, in real time, and takes immediate action.
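
As a hedged illustration of the alerting scenarios above, the plain-Python sketch below raises an alert when the average of the readings inside a sliding time window exceeds a threshold, for example when page response times degrade within the last minute. The class name WindowedAlert, the window length, the threshold, and the sample readings are all hypothetical, and a production system would run this logic on a streaming engine rather than in a single process.

```python
from collections import deque


class WindowedAlert:
    """Signal an alert when the average of recent readings exceeds a threshold."""

    def __init__(self, window_seconds, threshold):
        # window_seconds is assumed to be greater than zero.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Record a reading and return True if the window average is too high."""
        self.readings.append((timestamp, value))
        # Drop readings that have fallen out of the time window.
        while self.readings[0][0] <= timestamp - self.window_seconds:
            self.readings.popleft()
        average = sum(v for _, v in self.readings) / len(self.readings)
        return average > self.threshold


if __name__ == "__main__":
    # Response times in milliseconds, keyed by arrival time in seconds.
    alert = WindowedAlert(window_seconds=60, threshold=500)
    for ts, response_ms in [(0, 200), (30, 300), (50, 1200), (100, 100), (120, 100)]:
        print(ts, alert.add(ts, response_ms))
```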

    Business Intelligence Analytics

    Supports:
    • high volumes and high velocity of data in different data formats
    • a single source of truth
    • answers to questions based on information acquired from the past up to the present
    • capture of more valuable insights from your big data
    • translation of large amounts of data into valuable, actionable insights


    About

    Nebulaware specializes in real-time BI and data warehousing, so it is no surprise that Spark caught our attention at the very beginning of its existence.

    Nebulaware belongs to the short list of companies that have a dedicated Spark solutions team including Spark solutions architects, Spark developers and consultants.

    Over the last few years we have accumulated valuable experience in implementing Spark solutions for different business domains. Our Agile-style implementation methodology helps customers maximize ROI on their Spark projects.

    Get in touch