Posts

Showing posts with the label Apache spark

Featured Post

How to Build CI/CD Pipeline: GitHub to AWS

Image
 Creating a CI/CD pipeline to deploy a project from GitHub to AWS can be done using various AWS services like AWS CodePipeline, AWS CodeBuild, and optionally AWS CodeDeploy or Amazon ECS for application deployment. Below is a high-level guide on how to set up a basic GitHub to AWS pipeline: Prerequisites AWS Account : Ensure access to the AWS account with the necessary permissions. GitHub Repository : Have your application code hosted on GitHub. IAM Roles : Create necessary IAM roles with permissions to interact with AWS services (e.g., CodePipeline, CodeBuild, S3, ECS, etc.). AWS CLI : Install and configure the AWS CLI for easier management of services. Step 1: Create an S3 Bucket for Artifacts AWS CodePipeline requires an S3 bucket to store artifacts (builds, deployments, etc.). Go to the S3 service in the AWS Management Console. Create a new bucket, ensuring it has a unique name. Note the bucket name for later use. Step 2: Set Up AWS CodeBuild CodeBuild will handle the build proces

3 best Self Study Materials on Spark Mlib

Image
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. An execution graph describes the possible states of execution and the states between them. Spark also supports a set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. #Spark   Review of Spark Machine Language Library (MLlib): MLlib is Spark's machine learning library, focusing on learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives. Why MLlib? It is built on Apache Spark, which is a fast and general engine for large scale processing. Supposedly, running times or up to 100x faster than Hadoop MapReduce, or 10x faster on disk. Supports writing applications