Featured Post

8 Ways to Optimize AWS Glue Jobs in a Nutshell

Image
  Improving the performance of AWS Glue jobs involves several strategies that target different aspects of the ETL (Extract, Transform, Load) process. Here are some key practices. 1. Optimize Job Scripts Partitioning : Ensure your data is properly partitioned. Partitioning divides your data into manageable chunks, allowing parallel processing and reducing the amount of data scanned. Filtering : Apply pushdown predicates to filter data early in the ETL process, reducing the amount of data processed downstream. Compression : Use compressed file formats (e.g., Parquet, ORC) for your data sources and sinks. These formats not only reduce storage costs but also improve I/O performance. Optimize Transformations : Minimize the number of transformations and actions in your script. Combine transformations where possible and use DataFrame APIs which are optimized for performance. 2. Use Appropriate Data Formats Parquet and ORC : These columnar formats are efficient for storage and querying, signif

How to Write ETL Logic in Python: Sample Code to Practice

Here's an example Python code that uses the mysql-connector library to connect to a MySQL database, extract data from a table, transform it, and load it as a JSON file. Here's an example:







Python ETL Sample Code


import mysql.connector

import json


# Connect to the MySQL database

cnx = mysql.connector.connect(user='username', password='password',

                              host='localhost',

                              database='database_name')


# Define a cursor to execute SQL queries

cursor = cnx.cursor()


# Define the SQL query to extract data

query = ("SELECT column1, column2, column3 FROM table_name")


# Execute the SQL query

cursor.execute(query)


# Fetch all rows from the result set

rows = cursor.fetchall()


# Transform the rows into a list of dictionaries

result = []

for row in rows:

    result.append({'column1': row[0], 'column2': row[1], 'column3': row[2]})


# Save the result as a JSON file

with open('output.json', 'w') as outfile:

    json.dump(result, outfile)


# Close the cursor and database connection

cursor.close()

cnx.close()

In this example, you will need to replace username, password, localhost, database_name, table_name, column1, column2, and column3 with the appropriate values for your MySQL database and table. 


The code will extract the data from the specified table, transform it into a list of dictionaries, and save it as a JSON file named output.json.

Comments

Popular posts from this blog

How to Fix datetime Import Error in Python Quickly

How to Check Kafka Available Brokers

SQL Query: 3 Methods for Calculating Cumulative SUM