Step-by-Step Guide to Reading a CSV File from Amazon S3 in Python

If you’re working with cloud data, especially on AWS, chances are you’ll encounter data stored in CSV files inside an Amazon S3 bucket. Whether you're building a data pipeline or a quick analysis tool, reading data directly from S3 in Python is a fast, reliable, and scalable way to get started.
In this blog post, we’ll walk through:

- Setting up access to S3
- Reading a CSV file using Python and Boto3
- Displaying headers and rows
- Tips to handle larger datasets

Let’s jump in!
Before you begin, make sure you have:

- An AWS account
- An S3 bucket with a CSV file uploaded
- AWS credentials (access key and secret key)
- Python 3.x installed
- The `boto3` and `pandas` libraries installed (you can install them via pip):

```bash
pip install boto3 pandas
```
Let’s say your S3 bucket is named `my-data-bucket`, and your CSV file is `sample-data/employees.csv`.
```python
import boto3
import pandas as pd
from io import StringIO
```

- `boto3` is the AWS SDK for Python.
- `pandas` helps load and process the CSV.
- `StringIO` is used to handle the in-memory string as a file-like object.
Next, authenticate with your AWS credentials. You can configure them as environment variables, or pass them directly in the code for testing (not recommended in production).
```python
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY'
)
```
You can also omit the keys above if your environment is already configured using:

```bash
aws configure
```
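As a minimal sketch (not part of the original walkthrough), if your credentials are already set up via `aws configure` or environment variables, you can let boto3 resolve them from its default credential chain instead of hardcoding keys; the profile name below is just an illustration:

```python
import boto3

# Uses the default credential chain: env vars, ~/.aws/credentials, IAM role, etc.
s3 = boto3.client('s3')

# Or pick a specific named profile from ~/.aws/credentials (profile name is an example)
session = boto3.Session(profile_name='my-dev-profile')
s3 = session.client('s3')
```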
```python
bucket_name = 'my-data-bucket'
file_key = 'sample-data/employees.csv'

response = s3.get_object(Bucket=bucket_name, Key=file_key)
csv_data = response['Body'].read().decode('utf-8')
```

- `get_object()` fetches the file.
- We decode the binary response to a UTF-8 string.
```python
df = pd.read_csv(StringIO(csv_data))
```
At this point, your CSV is loaded into a pandas DataFrame.
```python
# Print column headers
print("Column Headers:")
print(df.columns.tolist())

# Print first 5 rows
print("\nFirst 5 Rows:")
print(df.head())
```
This will output:

```
Column Headers:
['employee_id', 'name', 'department', 'salary']

First 5 Rows:
   employee_id     name department  salary
0            1    Alice         HR   60000
1            2      Bob      Sales   72000
2            3  Charlie    Finance   85000
3            4    Diana      Sales   69000
4            5   Edward         HR   62000
```
Here’s the complete script:

```python
import boto3
import pandas as pd
from io import StringIO

# S3 config
bucket_name = 'my-data-bucket'
file_key = 'sample-data/employees.csv'

# Connect to S3
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY'
)

# Read file from S3
response = s3.get_object(Bucket=bucket_name, Key=file_key)
csv_data = response['Body'].read().decode('utf-8')

# Convert to DataFrame
df = pd.read_csv(StringIO(csv_data))

# Display headers and rows
print("Column Headers:")
print(df.columns.tolist())
print("\nFirst 5 Rows:")
print(df.head())
```
Common errors and how to fix them:

| Error | Fix |
|---|---|
| `botocore.exceptions.NoCredentialsError` | Make sure your credentials are set using `aws configure` or passed into `boto3.client()` |
| `UnicodeDecodeError` | Try changing `.decode('utf-8')` to `.decode('ISO-8859-1')` or another appropriate encoding |
| File not found | Double-check the `file_key` path in your S3 bucket |
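Here’s a minimal sketch of catching these errors in code, assuming the same bucket and key names as above:

```python
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

s3 = boto3.client('s3')

try:
    response = s3.get_object(Bucket='my-data-bucket', Key='sample-data/employees.csv')
    raw_bytes = response['Body'].read()
    try:
        csv_data = raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to another encoding if the file isn't UTF-8
        csv_data = raw_bytes.decode('ISO-8859-1')
except NoCredentialsError:
    print("No AWS credentials found - run 'aws configure' or set environment variables.")
except ClientError as e:
    # 'NoSuchKey' usually means the file_key path is wrong; 'NoSuchBucket' means the bucket name is wrong
    print(f"S3 request failed: {e.response['Error']['Code']}")
```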
If your CSV file is too large to load all at once, you can use `pandas.read_csv()` with the `chunksize` parameter.
```python
chunksize = 1000  # rows per chunk

for chunk in pd.read_csv(StringIO(csv_data), chunksize=chunksize):
    print(chunk.head())  # process or print each chunk
```
This is useful for keeping memory usage low and processing the data incrementally.
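Note that in the snippet above the whole CSV string has already been downloaded into `csv_data`. As an alternative sketch (an assumption beyond the original post, but relying only on the fact that the response body exposes a `read()` method), you can pass the S3 response body to pandas directly and skip the intermediate string:

```python
import boto3
import pandas as pd

s3 = boto3.client('s3')
response = s3.get_object(Bucket='my-data-bucket', Key='sample-data/employees.csv')

# response['Body'] is a file-like StreamingBody, so pandas can read from it
# in chunks without building the full string in memory first.
for chunk in pd.read_csv(response['Body'], chunksize=1000):
    print(chunk.head())
```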
If you're running this code inside an EC2 instance, Lambda function, or Glue job, it’s best to avoid hardcoding credentials. Use an IAM role with permission to access S3 instead.
Example policy statement:

```json
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-data-bucket/sample-data/*"
}
```
Here’s what we’ve done in this guide:

- Connected Python to AWS S3 using boto3
- Retrieved and read a CSV file
- Displayed headers and data using pandas
- Covered tips for large files and secure access
Working with CSV files from S3 is a great way to build flexible, cloud-powered data pipelines. Whether you're a data engineer, analyst, or Python enthusiast, this pattern is essential in cloud-native projects.