
5 Python Pandas Tricky Examples for Data Analysis

Here are five tricky Python Pandas examples that show how to work with datetime data, text data, large datasets, reshaping, and memory usage.



#1 Dealing with datetime data (parse_dates pandas example)


import pandas as pd

# The snippets below assume `data` is an existing DataFrame with the named columns

# Convert a column to datetime format

data['date_column'] = pd.to_datetime(data['date_column'])


# Extract components from datetime (e.g., year, month, day)

data['year'] = data['date_column'].dt.year

data['month'] = data['date_column'].dt.month


# Calculate the time difference between two datetime columns

data['time_diff'] = data['end_time'] - data['start_time']
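The heading mentions parse_dates, so here is a minimal sketch of parsing dates while reading a CSV rather than converting afterwards. The inline CSV text and the `hours` column are made up for illustration; only the `start_time` and `end_time` column names come from the snippet above.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data standing in for a real CSV file
csv_text = StringIO(
    "start_time,end_time\n"
    "2024-01-01 08:00:00,2024-01-01 09:30:00\n"
    "2024-02-15 22:00:00,2024-02-16 01:00:00\n"
)

# parse_dates converts the listed columns to datetime while reading
data = pd.read_csv(csv_text, parse_dates=["start_time", "end_time"])

# Subtracting two datetime columns yields a timedelta column
data["time_diff"] = data["end_time"] - data["start_time"]

# Convert the timedelta to a plain number of hours (illustrative extra column)
data["hours"] = data["time_diff"].dt.total_seconds() / 3600

print(data[["time_diff", "hours"]])
```

Parsing at read time avoids a separate pd.to_datetime pass, which can be convenient when the date columns are known in advance.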


#2 Working with text data

 

# Convert text to lowercase

data['text_column'] = data['text_column'].str.lower()


# Count the occurrences of specific words in a text column

data['word_count'] = data['text_column'].str.count('word')


# Extract information using regular expressions

data['extracted_info'] = data['text_column'].str.extract(r'(\d+)')
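To see what these string methods return, here is a small self-contained sketch. The sample sentences and the search word 'order' are invented for illustration. Note that str.count treats its argument as a regular expression, and that str.extract keeps only the first match per row.

```python
import pandas as pd

# Hypothetical sample text; any string column would work the same way
data = pd.DataFrame(
    {"text_column": ["Order 42 shipped", "No number here", "order 7 and order 8"]}
)

# Lowercase first so that counting is case-insensitive
data["text_column"] = data["text_column"].str.lower()

# Count occurrences of the pattern 'order' in each row
data["word_count"] = data["text_column"].str.count("order")

# expand=False returns a Series; rows without a match become NaN
data["extracted_info"] = data["text_column"].str.extract(r"(\d+)", expand=False)

print(data)
```

The extracted values are strings, so convert them with pd.to_numeric if you need numbers.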


#3 Handling large datasets efficiently


# Read a large dataset in chunks

chunk_size = 100000

data_chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)

# Process data in chunks

for chunk in data_chunks:

    # Perform calculations or manipulations on each chunk

    print(chunk.shape)  # placeholder: replace with your own processing


# Append data from multiple files

file_list = ['file1.csv', 'file2.csv', 'file3.csv']

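# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's index

combined_data = pd.concat([pd.read_csv(file) for file in file_list], ignore_index=True)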
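As a sketch of what "processing in chunks" can look like in practice, the example below computes a mean without ever holding the full dataset in memory. The `amount` column, the in-memory CSV, and the tiny chunk size are assumptions chosen so the example runs on its own.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for a large file: one 'amount' column with values 1..10
csv_text = StringIO("amount\n" + "\n".join(str(i) for i in range(1, 11)))

total = 0
row_count = 0

# Each iteration yields a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(csv_text, chunksize=4):
    total += chunk["amount"].sum()
    row_count += len(chunk)

# Combine the per-chunk partial results into the overall mean
mean_amount = total / row_count
print(mean_amount)
```

This pattern works for any statistic that can be built from running partial results, such as sums, counts, minimums, and maximums.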


#4 Pivot tables and reshaping data


# Create a pivot table

pivot_table = data.pivot_table(values='column2', index='column1', columns='column3', aggfunc='mean')


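# Flatten the pivot table back to long format
# (unstack() on a frame with a single-level index returns a Series indexed by column3 and column1)

unstacked_data = pivot_table.unstack().reset_index()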


# Melt a DataFrame from wide to long format

melted_data = pd.melt(data, id_vars=['id'], value_vars=['var1', 'var2'], var_name='variable', value_name='value')
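The round trip between long and wide formats is easier to follow with concrete values. This is a minimal sketch with a made-up four-row DataFrame that reuses the column names from the pivot example above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (column1, column3) pair
data = pd.DataFrame(
    {
        "column1": ["a", "a", "b", "b"],
        "column3": ["x", "y", "x", "y"],
        "column2": [1.0, 2.0, 3.0, 4.0],
    }
)

# Long to wide: one row per column1 value, one column per column3 value
pivot_table = data.pivot_table(
    values="column2", index="column1", columns="column3", aggfunc="mean"
)
print(pivot_table)

# Wide back to long: melt the 'x' and 'y' columns into rows again
long_again = pivot_table.reset_index().melt(
    id_vars="column1", var_name="column3", value_name="column2"
)
print(long_again)
```

Because each (column1, column3) pair appears once here, the mean aggregation leaves the values unchanged; with duplicates, pivot_table would average them.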


#5 Efficient memory usage


# Optimize memory usage of DataFrame columns: downcast integers to the smallest
# dtype that fits, and store repeated strings as categories

data['numeric_column'] = pd.to_numeric(data['numeric_column'], downcast='integer')

data['category_column'] = data['category_column'].astype('category')


# Load a subset of columns from a large dataset

selected_columns = ['column1', 'column2', 'column3']

data_subset = pd.read_csv('large_data.csv', usecols=selected_columns)
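To check whether these conversions actually help, you can compare memory_usage before and after. The sketch below uses an invented 1,000-row DataFrame; the savings on real data depend on the value ranges and on how often the strings repeat.

```python
import pandas as pd

# Hypothetical data: small integers and a few heavily repeated strings
data = pd.DataFrame(
    {
        "numeric_column": list(range(1000)),
        "category_column": ["red", "green", "blue", "green"] * 250,
    }
)

# deep=True also counts the memory held by the string objects themselves
before = data.memory_usage(deep=True).sum()

# Values 0..999 fit in a 16-bit integer, so the column is downcast from int64
data["numeric_column"] = pd.to_numeric(data["numeric_column"], downcast="integer")

# Repeated strings are stored once, with small integer codes per row
data["category_column"] = data["category_column"].astype("category")

after = data.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

Categories pay off when a column has few distinct values relative to its length; for mostly unique strings the conversion can use more memory, not less.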


These examples demonstrate more advanced techniques for handling datetime data, text data, large datasets, reshaping data, and optimizing memory usage. They highlight some of the powerful features that pandas provides for complex data analysis tasks.

