
Apache Storm Architecture Tutorial Flowchart

There are two main reasons why Apache Storm is popular. First, it can connect to many data sources. Second, it is scalable. Another advantage is fault tolerance, which means that data processing is guaranteed.


Apache Storm topologies

In Hadoop, MapReduce jobs process the data. In Storm, the topology is the actual data processor.
ZooKeeper handles the coordination between Nimbus and the Supervisors.

Apache Storm

  1. A topology is similar to a job in Hadoop. Hadoop jobs run according to a defined schedule.
  2. In Storm, a topology runs forever, until you kill it.
  3. A topology consists of many worker processes spread across many machines.
  4. A topology is a predefined design for producing an end result from your data.
  5. A topology comprises two parts: spouts and bolts.
  6. The spout is the funnel through which data enters the topology.
Storm Topology
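
The spout-and-bolt idea can be sketched in a few lines of plain Python. This is only an illustration and does not use the Storm API: the function names (sentence_spout, split_bolt, count_bolt, run_topology) and the word-count task are assumptions made for the example. A real topology reads an unbounded stream and runs forever, while this sketch stops when its small input list ends.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout (hypothetical): the funnel that feeds raw data into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt (hypothetical): splits each sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Counter:
    """Bolt (hypothetical): counts how often each word was seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences: Iterable[str]) -> Counter:
    """Wire the parts together: spout -> split bolt -> count bolt."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


if __name__ == "__main__":
    result = run_topology(["Storm runs forever", "Storm scales"])
    print(result["storm"])  # prints 2
```

Each stage only sees what the previous stage emits, which is the same shape a Storm topology has: the spout brings data in, and the bolts transform it step by step.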

Two nodes in Storm

  1. Master node: similar to the Hadoop JobTracker. It runs a daemon called Nimbus.
  2. Worker node: It runs a daemon called Supervisor. The Supervisor listens for work assigned to its machine.

Master Node

  • Nimbus distributes the code around the cluster.
  • It monitors for failures.
  • It assigns tasks to each machine.

Worker Node

  • It listens for work assigned by Nimbus.
  • Its worker processes each execute a subset of the topology.
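
The division of labour between the master node and the worker nodes can be sketched as follows. This is a simplified model, not Storm's actual scheduler: the round-robin policy, the function names (assign_tasks, reassign_on_failure) and the task and supervisor names are all assumptions made for illustration.

```python
from itertools import cycle
from typing import Dict, List


def assign_tasks(tasks: List[str], supervisors: List[str]) -> Dict[str, List[str]]:
    """Hypothetical Nimbus step: hand out tasks to supervisors in round-robin order."""
    if not supervisors:
        raise ValueError("at least one supervisor is required")
    assignment: Dict[str, List[str]] = {name: [] for name in supervisors}
    for task, name in zip(tasks, cycle(supervisors)):
        assignment[name].append(task)
    return assignment


def reassign_on_failure(
    assignment: Dict[str, List[str]], failed: str
) -> Dict[str, List[str]]:
    """Hypothetical failure handling: move the failed supervisor's tasks to survivors."""
    survivors = [name for name in assignment if name != failed]
    if not survivors:
        raise ValueError("no surviving supervisor to take over the tasks")
    updated = {name: list(assignment[name]) for name in survivors}
    for task, name in zip(assignment.get(failed, []), cycle(survivors)):
        updated[name].append(task)
    return updated


if __name__ == "__main__":
    plan = assign_tasks(["spout-1", "bolt-1", "bolt-2"], ["sup-a", "sup-b"])
    print(plan)
    print(reassign_on_failure(plan, "sup-b"))
```

Each supervisor ends up with only part of the task list, which mirrors the point above that a worker node handles a subset of the topology.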
