Page 1

Corso di Sistemi e Architetture per Big Data, A.A. 2018/19
Valeria Cardellini
Laurea Magistrale in Ingegneria Informatica

Data Acquisition

Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica

Page 2

The reference Big Data stack

[Figure: the layered reference Big Data stack — Resource Management, Data Storage, Data Processing, High-level Interfaces, with Support/Integration alongside]

Page 3

Data acquisition and ingestion

• How to collect data from various data sources and ingest it into a system where it can be stored and later analyzed using batch processing?
  – Distributed file systems (e.g., HDFS), NoSQL data stores (e.g., HBase), …
• How to connect data sources to stream or in-memory processing systems for immediate use?
• How to also perform some preprocessing, such as data transformation or conversion?

Page 4

Driving factors

• Source type
  – Batch data sources: files, logs, RDBMS, …
  – Real-time data sources: sensors, IoT systems, social media feeds, stock market feeds, …
• Velocity
  – How fast is data generated?
  – How frequently does data vary?
  – Real-time or streaming data require low latency and low overhead
• Ingestion mechanism
  – Depends on data consumers
  – Pull: pub/sub, message queue
  – Push: framework pushes data to sinks

Page 5

Common requirements

• Ingestion
  – Batch data, streaming data
  – Easy writing to HDFS
• Decoupling
  – Data sources should not be directly coupled to the analytical backends
• High availability
  – Data ingestion should be available 24x7
  – Data should be buffered (persisted) in case the analytical backend is not available
• Scalability
  – The amount of data and the number of analytical applications will increase
• Security
  – Authentication and encryption of data in motion

Page 6

A unifying view: Lambda architecture

Page 7

Data acquisition layer

• Allows collecting, aggregating and moving data
• From various sources (server logs, social media, streaming sensor data, …)
• To a data store (distributed file system, NoSQL data store, messaging system)
• We analyze
  – Apache Flume for stream data
  – Apache Sqoop for batch data

Page 8

Apache Flume

• Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of stream data
• Robust and fault tolerant, with tunable reliability mechanisms and failover and recovery mechanisms
  – Tunable reliability levels
    • Best effort: “fast and loose”
    • Guaranteed delivery: “deliver no matter what”
• Suitable for online analytics

Page 9

Flume architecture

Page 10

Flume architecture

• Agent: JVM process running Flume
  – One per machine
  – Can run many sources, sinks and channels
• Event
  – Basic unit of data moved by Flume (e.g., Avro event)
  – Normally ~4 KB
• Source
  – Produces data in the form of events
• Channel
  – Connects sources to sinks (like a queue)
  – Implements the reliability semantics
• Sink
  – Removes an event from a channel and forwards it either to a destination (e.g., HDFS) or to another agent
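To make the wiring concrete, here is a minimal single-agent configuration sketch in Flume's properties format (modeled on the introductory example in the Flume user guide; the component names a1, r1, c1, k1 and the port are arbitrary):

# Name the components of agent a1
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listen for text lines on a TCP port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events (could be replaced by an HDFS sink)
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1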

Page 11

Flume data flows

• Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination
• Supports multiplexing the event flow to one or more destinations
• Multiple built-in sources and sinks (e.g., Avro)

Page 12

Flume reliability

• Events are staged in a channel on each agent
• Events are then delivered to the next agent or to the final repository (e.g., HDFS) in the flow
• Events are removed from a channel only after they are stored in the channel of the next agent or in the final repository
• Transactional approach to guarantee reliable delivery of events
  – Sources and sinks encapsulate the storage/retrieval of events in transactions provided by the channel

Page 13

Apache Sqoop

• A commonly used tool for SQL data transfer to Hadoop
  – SQL to Hadoop = SQOOP
• Imports bulk data from structured data stores such as RDBMSs into HDFS, HBase or Hive
• Also exports data from HDFS to an RDBMS
• Supports a variety of file formats, such as Avro
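For example, an import command along these lines (a sketch: the JDBC URL, credentials, table name and target directory are placeholders) copies a MySQL table into HDFS as Avro data files:

> sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username dbuser --password ****** \
    --table orders \
    --target-dir /data/orders \
    --as-avrodatafile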

Page 14

Data serialization formats for Big Data

• Serialization is the process of converting structured data into a raw form (e.g., a byte stream) that can be stored or transmitted
• Some serialization formats you already know
  – JSON
  – XML
• Other serialization formats
  – Protocol Buffers
  – Thrift
  – Apache Avro

Page 15

Apache Avro

• Data serialization format developed within the Apache Software Foundation
• Key features
  – Compact, fast, binary data format
  – Supports a number of data structures for serialization
  – Programming language neutral
  – Code generation is optional: data can be read, written, or used in RPCs without having to generate classes or code
  – JSON-based schema kept separate from the data (see the example below)
• Data is always accompanied by a schema that permits full processing of that data
• Performance comparison: https://bit.ly/2qrMnOz
  – Avro should not be used for small objects (large serialization and deserialization times)
  – Interesting for very big objects
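As an illustration of the JSON-based schema, the following sketch uses the third-party fastavro Python library; the Tweet record and its fields are invented for the example. The Avro container file written here embeds the schema, so the data always travels with it:

import io
import fastavro

# Example Avro schema, defined in JSON (field names are illustrative)
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Tweet",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
})

# Serialize one record to a binary buffer, then read it back
buf = io.BytesIO()
fastavro.writer(buf, schema, [{"user": "alice", "text": "hello", "timestamp": 1558000000}])
buf.seek(0)
for record in fastavro.reader(buf):
    print(record)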

Page 16

Apache NiFi


• Powerful and reliable system to process and distribute data across several resources
• Mainly used for data routing and transformation
• Highly configurable
  – Flow-specific QoS (loss tolerant vs. guaranteed delivery, low latency vs. high throughput)
  – Prioritized queueing and flow-specific QoS
  – Flows can be modified at runtime
• Useful for data pre-processing
  – Back pressure
• Ease of use: visual command and control
  – UI-based platform in which to define the sources from which to collect data, the processors for data conversion, and the destinations to store the data
• Multiple NiFi servers can be clustered for scalability

Page 17

Apache NiFi: use case

• Use NiFi to fetch tweets by means of NiFi’s ‘GetTwitter’ processor
  – It uses the Twitter Streaming API to retrieve tweets
• Move the data stream to Apache Kafka using NiFi’s ‘PublishKafka’ processor

Page 18

Messaging layer: Architecture choices

• Message queue (MQ)
  – ActiveMQ
  – RabbitMQ
  – ZeroMQ
  – Amazon SQS
• Publish/subscribe (pub/sub)
  – Kafka
  – NATS http://www.nats.io
  – Apache Pulsar
    • Geo-replication of stored messages
  – Redis

Page 19

Messaging layer: use cases

• Mainly used in data processing pipelines for data ingestion or aggregation
• Envisioned mainly for use at the beginning or end of a data processing pipeline
• Example
  – Incoming data from various sensors: ingest this data into a streaming system for real-time analytics or into a distributed file system for batch analytics

Page 20

Message queue pattern

• Allows for persistent asynchronous communication
  – How can a service and its consumers accommodate isolated failures and avoid unnecessarily locking resources?
• Principles
  – Loose coupling
  – Service statelessness
    • Services minimize resource consumption by deferring the management of state information when necessary

Page 21

Message queue API

• Basic calls:
  – put: non-blocking send
    • Append a message to a specified queue
  – get: blocking receive
    • Block until the specified queue is nonempty and remove the first message
    • Variations: allow searching for a specific message in the queue, e.g., using a matching pattern
  – poll: non-blocking receive
    • Check a specified queue for messages and remove the first
    • Never blocks
  – notify: non-blocking receive
    • Install a handler (callback function) to be automatically called when a message is put into the specified queue
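A rough in-process illustration of these primitives, using Python's standard queue module (not a real distributed message queue; the notify helper is invented for the example):

import queue
import threading
import time

q = queue.Queue()

# put: non-blocking send
q.put("job-1")

# get: blocking receive (waits until the queue is nonempty)
msg = q.get()

# poll: non-blocking receive (raises queue.Empty instead of blocking)
try:
    msg = q.get_nowait()
except queue.Empty:
    msg = None

# notify: install a callback invoked whenever a message arrives
def notify(q, callback):
    def worker():
        while True:
            callback(q.get())   # blocks in a background thread
    threading.Thread(target=worker, daemon=True).start()

notify(q, lambda m: print("received", m))
q.put("job-2")
time.sleep(0.1)   # give the callback thread time to run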

Page 22

Message queue systems

• Can be used for push-pull messaging
  – Producers push data to the queue
  – Consumers pull data from the queue
• Message queue systems based on protocols
  – RabbitMQ https://www.rabbitmq.com
    • Implements AMQP and relies on a broker-based architecture
  – ZeroMQ http://zeromq.org
    • High-throughput and lightweight messaging library
    • No persistence
  – Amazon SQS
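As an example of the producer (push) side, a minimal RabbitMQ publisher sketch with the pika Python client (assumes a broker running on localhost; the queue name and message body are placeholders):

import pika

# Connect to a RabbitMQ broker running on localhost
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare the queue (idempotent) and publish one message to it
channel.queue_declare(queue="sensor-data")
channel.basic_publish(exchange="", routing_key="sensor-data", body="temperature=21.5")

connection.close()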

Page 23

Publish/subscribe pattern


• Application components can publish asynchronous messages (e.g., event notifications), and/or declare their interest in message topics by issuing a subscription

Page 24

Publish/subscribe pattern

• Multiple consumers can subscribe to topics, with or without filters
• Subscriptions are collected by an event dispatcher component, responsible for routing events to all matching subscribers
  – For scalability reasons, its implementation is usually distributed
• High degree of decoupling
  – Easy to add and remove components
  – Appropriate for dynamic environments

Page 25

Publish/subscribe API

• Basic calls:
  – publish(event): publish an event
    • Events can be of any data type supported by the given implementation language and may also contain meta-data
  – subscribe(filter_expr, notify_cb, expiry) → sub_handle: subscribe to events
    • Takes a filter expression, a reference to a notify callback for event delivery, and an expiry time for the subscription registration
    • Returns a subscription handle
  – unsubscribe(sub_handle)
  – notify_cb(sub_handle, event): called by the pub/sub system to deliver a matching event
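To make the API concrete, a toy in-memory dispatcher sketch (real systems distribute this component and add persistence and expiry handling, both omitted here):

import itertools

class Dispatcher:
    """Toy in-memory event dispatcher implementing the pub/sub calls above."""

    def __init__(self):
        self._subs = {}                   # sub_handle -> (filter_expr, notify_cb)
        self._handles = itertools.count()

    def subscribe(self, filter_expr, notify_cb):
        sub_handle = next(self._handles)
        self._subs[sub_handle] = (filter_expr, notify_cb)
        return sub_handle

    def unsubscribe(self, sub_handle):
        self._subs.pop(sub_handle, None)

    def publish(self, event):
        # Route the event to every subscriber whose filter matches
        for sub_handle, (filter_expr, notify_cb) in self._subs.items():
            if filter_expr(event):
                notify_cb(sub_handle, event)

d = Dispatcher()
h = d.subscribe(lambda e: e["topic"] == "temperature",
                lambda handle, e: print("got", e))
d.publish({"topic": "temperature", "value": 21.5})   # delivered
d.publish({"topic": "humidity", "value": 40})        # filtered out
d.unsubscribe(h)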

Page 26

Pub/sub vs. message queue

• A sibling of the message queue pattern, but it further generalizes it by delivering a message to multiple consumers
  – Message queue: delivers each message to only one receiver, i.e., one-to-one communication
  – Pub/sub: delivers each message to multiple receivers, i.e., one-to-many communication
• Some frameworks (e.g., RabbitMQ, Kafka, NATS) support both patterns

Page 27

Apache Kafka

• General-purpose, distributed pub/sub system
• Originally developed in 2010 at LinkedIn
• Written in Scala
• Horizontal scalability
• High throughput
  – Thousands of messages per second
• Fault tolerant
• Delivery guarantees
  – At least once: guarantees no loss, but messages may be duplicated and possibly out of order
  – Exactly once: guarantees no loss and no duplicates, but requires expensive end-to-end 2PC

Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, 2011

Page 28

Kafka at a glance

• Kafka maintains feeds of messages in categories called topics
• Producers publish messages to a Kafka topic
• Consumers subscribe to topics and process the feed of published messages
• Kafka cluster: distributed log of data over servers known as brokers
  – Brokers rely on Apache ZooKeeper for coordination

Page 29

Kafka: topics

• Topic: category to which messages are published
• For each topic, the Kafka cluster maintains a partitioned log
  – Log (the data structure!): append-only, totally-ordered sequence of records, ordered by time
• Topics are split into a pre-defined number of partitions
  – Partition: unit of parallelism of the topic
• Each partition is replicated across multiple brokers with a configurable replication factor

• CLI command to create a topic with a single partition and one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Page 30

Kafka: partitions

• Producers publish their records to the partitions of a topic (round-robin or partitioned by key), and consumers consume the published records of that topic
• Each partition is an ordered, numbered, immutable sequence of records that is continually appended to
  – Like a commit log
• Each record is associated with a monotonically increasing sequence number, called its offset

Page 31

Kafka: partitions

• Partitions are distributed across brokers for scalability
• Each partition is replicated for fault tolerance across a configurable number of brokers
• Each partition has one leader broker and zero or more followers
• The leader handles all read and write requests
  – Reads go to the leader
  – Writes go to the leader
• A follower replicates the leader and acts as a backup
• Each broker is a leader for some of its partitions and a follower for others, to balance load
• ZooKeeper is used to keep the brokers consistent

Page 32

Kafka: partitions

Page 33

Kafka: producers

• Publish data to topics of their choice
• Also responsible for choosing which record to assign to which partition within the topic
  – Round-robin or partitioned by key
• Producers = data sources

• To run the console producer:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
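A programmatic equivalent of the console producer, sketched with the third-party kafka-python client (broker address and topic match the console example; the key only illustrates key-based partitioning):

from kafka import KafkaProducer

# Connect to a local broker
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Records with the same key always land in the same partition of the topic
producer.send("test", key=b"sensor-42", value=b"This is a message")
producer.send("test", key=b"sensor-42", value=b"This is another message")

producer.flush()   # block until all buffered records are acknowledged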

Page 34

Kafka: consumers

• Consumer Group: set of consumers sharing a common group ID
  – A Consumer Group maps to a logical subscriber
  – Each group consists of multiple consumers for scalability and fault tolerance
• Consumers use the offset to track which messages have been consumed
  – Messages can be replayed using the offset

• To run the console consumer:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
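A matching consumer sketch with kafka-python, showing how the group ID and the starting offset are specified (the group name is an arbitrary example):

from kafka import KafkaConsumer

# Consumers with the same group_id share the partitions of the topic
consumer = KafkaConsumer(
    "test",
    bootstrap_servers="localhost:9092",
    group_id="analytics-group",
    auto_offset_reset="earliest",   # like --from-beginning for a new group
)

for record in consumer:
    print(record.partition, record.offset, record.value)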

Page 35

Kafka: design choice for consumers

• Push vs. pull model for consumers

• Push model
  – Challenging for the broker to deal with different consumers, as it controls the rate at which data is transferred
  – Needs to decide whether to send a message immediately or accumulate more data and then send

• Pull model
  – Pros: scalability, flexibility (different consumers can have diverse needs and capabilities)
  – Cons: if the broker has no data, consumers may end up busy-waiting for data to arrive

Page 36

Kafka: ordering guarantees

• Messages sent by a producer to a particular topic partition are appended in the order they are sent
• A consumer sees records in the order they are stored in the log
• Strong ordering guarantees only within a partition
  – Total order over messages within a partition, but Kafka cannot preserve order between different partitions of a topic
• Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications

Page 37

Kafka: ZooKeeper

• Kafka uses ZooKeeper to coordinate among producers, consumers and brokers
• ZooKeeper stores metadata
  – List of brokers
  – List of consumers and their offsets
  – List of producers
• ZooKeeper runs several algorithms for coordination between consumers and brokers
  – Consumer registration algorithm
  – Consumer rebalancing algorithm
    • Allows all the consumers in a group to reach consensus on which consumer is consuming which partitions

Page 38

Kafka: fault tolerance

• Replicates partitions for fault tolerance
• Kafka makes a message available for consumption only after all the followers acknowledge to the leader a successful write
  – Implies that a message may not be immediately available for consumption
• Kafka retains messages for a configured period of time
  – Messages can be “replayed” in case a consumer fails

Page 39

Kafka APIs

• Four core APIs
• Producer API: allows an application to publish streams of records to one or more Kafka topics
• Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them
• Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems, so as to move large collections of data into and out of Kafka

Page 40

Kafka APIs

• Streams API: allows an application to act as a stream processor, transforming an input stream from one or more topics into an output stream to one or more output topics
• Kafka Streams can be used to process data in pipelines consisting of multiple stages

Page 41

Kafka clients

• JVM internal client
• Plus a rich ecosystem of clients, among which:
  – Sarama: Go library, https://shopify.github.io/sarama/
  – Confluent’s Python library, https://github.com/confluentinc/confluent-kafka-python/
  – NodeJS client, https://github.com/Blizzard/node-rdkafka

Page 42

Performance comparison: Kafka versus RabbitMQ

• Both guarantee millisecond-level low latency
  – At-least-once delivery is more expensive on Kafka (latency almost doubles)
• Replication has a drastic impact on the performance of both
  – Performance reduced by 50% (RabbitMQ) and 75% (Kafka)
• Kafka is best suited as a scalable ingestion system
• The two systems can be chained

Dobbelaere and Esmaili, “Kafka versus RabbitMQ”, ACM DEBS 2017

Page 43

Kafka: limitations

• No complete set of monitoring and management tools

• No support for wildcard topic selection

• No geo-replication

Page 44

Kafka: evolution

• Kafka as a streaming platform
• Covered in the upcoming hands-on lesson

Page 45

Kafka @ CINI Smart City Challenge ’17

By M. Adriani, D. Magnanimi, M. Ponza, F. Rossi

Page 46

Kafka @ Netflix

• Netflix uses Kafka for data collection and buffering so that it can be used by downstream systems

http://techblog.netflix.com/2016/04/kafka-inside-keystone-pipeline.html

Page 47

Kafka @ Uber

• Uber uses Kafka for real-time, business-driven decisions

https://eng.uber.com/ureplicator/

Page 48

Kafka @ Audi

• Audi uses Kafka for real-time data processing

https://www.youtube.com/watch?v=yGLKi3TMJv8

Page 49

Cloud services for IoT data ingestion and analysis

• Let’s consider AWS cloud services devoted to Internet of Things data ingestion and analysis
  – AWS IoT Events
  – AWS IoT Core
  – AWS IoT Analytics

Page 50

AWS IoT Events

• IoT service to detect and respond to events from IoT sensors and applications
  – Select the data sources to ingest, define the logic for each event using if-then-else statements, and select the alert or custom action to trigger when an event occurs
  – Integrated with other services, such as AWS IoT Core and AWS IoT Analytics, to enable detection of and insight into events

Page 51

AWS IoT Core


• Managed cloud service that lets connected devices interact with cloud applications and other devices

Page 52

AWS IoT Analytics

• Fully managed cloud service to run analytics on massive volumes of IoT data

• Filters, transforms, and enriches IoT data before storing it in a time-series data store for analysis

Page 53

References

• Apache Flume documentation, http://bit.ly/2qE5QK7

• Apache NiFi documentation, https://nifi.apache.org/docs.html

• Kreps et al., “Kafka: A Distributed Messaging System for Log Processing”, NetDB 2011. http://bit.ly/2oxpael

• Apache Kafka documentation, http://bit.ly/2ozEY0m
