Posts

Showing posts with the label Pandas

Featured Post

8 Ways to Optimize AWS Glue Jobs in a Nutshell

Improving the performance of AWS Glue jobs involves several strategies that target different aspects of the ETL (Extract, Transform, Load) process. Here are some key practices.

1. Optimize Job Scripts

- Partitioning: Ensure your data is properly partitioned. Partitioning divides your data into manageable chunks, allowing parallel processing and reducing the amount of data scanned.
- Filtering: Apply pushdown predicates to filter data early in the ETL process, reducing the amount of data processed downstream.
- Compression: Use compressed file formats (e.g., Parquet, ORC) for your data sources and sinks. These formats not only reduce storage costs but also improve I/O performance.
- Optimize Transformations: Minimize the number of transformations and actions in your script. Combine transformations where possible and use DataFrame APIs, which are optimized for performance.

2. Use Appropriate Data Formats

- Parquet and ORC: These columnar formats are efficient for storage and querying, signif…
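To make the pushdown-predicate and compression points concrete, here is a minimal PySpark sketch of a Glue job. The catalog database, table name, partition values, filter column, and S3 output path are placeholders, not values from the post.

```
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Pushdown predicate: only the matching partitions are read from the
# catalog table, so less data is scanned before any transformation runs.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",          # placeholder catalog database
    table_name="events",       # placeholder partitioned table
    push_down_predicate="year='2024' and month='01'",
)

# Transform with the DataFrame API, then write back as compressed Parquet,
# partitioned so downstream jobs can also prune partitions.
df = dyf.toDF().filter("status = 'active'")
df.write.mode("overwrite").partitionBy("year", "month").parquet(
    "s3://my-bucket/optimized/events/"   # placeholder output location
)

job.commit()
```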

How to Deal With Missing Data: Pandas Fillna() and Dropna()

Here are the best examples of the Pandas fillna(), dropna(), and sum() methods. We have explained the process in two steps: counting and replacing the null values.

Count Nulls

```
# count null values column-wise
null_counts = df.isnull().sum()
print(null_counts)
```

Output:

```
Column1    1
Column2    1
Column3    5
dtype: int64
```

In the above code, we first create a sample Pandas DataFrame `df` with some null values. Then, we use the `isnull()` function to create a DataFrame of the same shape as `df`, where each element is a boolean value indicating whether that element is null or not. Finally, we use the `sum()` function to count the number of null values in each column of the resulting DataFrame. The output shows the count of null values column-wise.

Code snippet to count null values column-wise:

```
df.isnull().sum()
```

Code snippet to count null values row-wise:

```
df.isnull().sum(axis=1)
```

In the above code, `df` is the Pandas DataFrame for which you want to count the null values. The `isnu…
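For completeness, here is a small end-to-end sketch that builds a sample DataFrame, counts the nulls, and then replaces or drops them with fillna() and dropna(). The column names and fill values are illustrative, not the exact data used in the post.

```
import pandas as pd
import numpy as np

# Sample DataFrame with missing values (illustrative column names)
df = pd.DataFrame({
    "Column1": [1, 2, np.nan, 4],
    "Column2": [np.nan, "x", "y", "z"],
    "Column3": [np.nan, np.nan, np.nan, 10],
})

# Count nulls column-wise and row-wise
print(df.isnull().sum())
print(df.isnull().sum(axis=1))

# Replace nulls: fill the numeric column with 0 and the text column with a label
filled = df.fillna({"Column1": 0, "Column2": "missing"})
print(filled)

# Or drop every row that still contains any null value
cleaned = df.dropna()
print(cleaned)
```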

A Beginner's Guide to Pandas Project for Immediate Practice

Pandas is a powerful data manipulation and analysis library in Python that provides a wide range of functions and tools to work with structured data. Whether you are a data scientist, analyst, or just a curious learner, Pandas can help you efficiently handle and analyze data. In this blog post, we will walk through a step-by-step guide on how to start a Pandas project from scratch. By following these steps, you will be able to import data, explore and manipulate it, perform calculations and transformations, and save the results for further analysis. So let's dive into the world of Pandas and get started with your own project!

Simple Pandas project

Import the necessary libraries:

```
import pandas as pd
import numpy as np
```

Read data from a file into a Pandas DataFrame:

```
df = pd.read_csv('/path/to/file.csv')
```

Explore and manipulate the data. View the first few rows of the DataFrame:

```
print(df.head())
```

Access specific columns or rows in the DataFrame:

```
print(df['column_name'])
```
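Putting the steps together, here is a minimal starter script that also performs a simple calculation and saves the result, as the guide promises. The file path and the 'price' and 'category' columns are placeholders for your own dataset.

```
import pandas as pd

# Placeholder path - replace with your own dataset
df = pd.read_csv("/path/to/file.csv")

# Explore the data
print(df.head())
print(df.describe())

# A simple transformation: add a derived column (assumes a numeric 'price' column)
df["price_with_tax"] = df["price"] * 1.1

# A simple aggregation: average price per group (assumes a 'category' column)
summary = df.groupby("category")["price_with_tax"].mean()

# Save the results for further analysis
summary.to_csv("summary.csv")
```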

How to Fill Nulls in Pandas: bfill and ffill

In Pandas, bfill and ffill are two important methods for filling missing values in a DataFrame or Series by propagating the next (backward fill) or the previous (forward fill) valid value, respectively. These methods are particularly useful when dealing with time series data or other ordered data where missing values need to be filled based on the available adjacent values.

ffill (forward fill): When you use the ffill method on a DataFrame or Series, it fills missing values with the previous non-null value in the same column. It propagates the last known value forward. This method is often used to carry forward the last observed value for a specific column, making it a good choice for time series data when the assumption is that the value doesn't change abruptly.

Example:

```
import pandas as pd

data = {'A': [1, 2, None, 4, None, 6],
        'B': [None, 'X', 'Y', None, 'Z', 'W']}
df = pd.DataFrame(data)
print(df)
# Output:
#      A     B
```
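Continuing the example, this short sketch shows ffill and bfill applied to the same sample data; the comments describe the values pandas produces for this input.

```
import pandas as pd

data = {'A': [1, 2, None, 4, None, 6],
        'B': [None, 'X', 'Y', None, 'Z', 'W']}
df = pd.DataFrame(data)

# Forward fill: each missing value takes the last valid value above it.
# Column A becomes 1, 2, 2, 4, 4, 6; the leading None in B stays,
# because there is nothing before it to propagate.
print(df.ffill())

# Backward fill: each missing value takes the next valid value below it.
# Column A becomes 1, 2, 4, 4, 6, 6; the leading None in B becomes 'X'.
print(df.bfill())
```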

How to Convert Dictionary to Dataframe: Pandas from_dict

Pandas is a Python data analysis library. This example shows you how to convert a dictionary to a DataFrame. The point to note here is that a DataFrame takes only 2D data, so you need to supply 2D data.

Pandas Dictionary to DataFrame

```
import pandas as pd
import numpy as np

data_dict = {'item1': np.random.randn(4), 'item2': np.random.randn(4)}
df3 = pd.DataFrame.from_dict(data_dict, orient='index')
print(df3)
```

Output:

```
              0         1         2         3
item1 -0.109300 -0.483624  0.375838  1.248651
item2 -0.274944 -0.857318 -1.203718 -0.061941
```

Explanation: Using the NumPy package, we created a dictionary with random values. There are two items, item1 and item2. The data_dict is the input to the DataFrame. The from_dict call takes two arguments here: the data dictionary and orient='index', which turns the dictionary keys into row labels. Here's the syntax you can refer to quickly.

Related: Hands-on Data Analysis Using Pandas | How to create a 3D data frame in Pandas
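As a quick follow-up, here is a small sketch contrasting orient='index' with the default orient='columns'; it uses fixed numbers instead of random values so the resulting shapes are easy to verify.

```
import pandas as pd

data_dict = {'item1': [1, 2, 3, 4], 'item2': [5, 6, 7, 8]}

# orient='index': dictionary keys become row labels (2 rows x 4 columns)
by_index = pd.DataFrame.from_dict(data_dict, orient='index')

# orient='columns' (the default): keys become column labels (4 rows x 2 columns)
by_columns = pd.DataFrame.from_dict(data_dict, orient='columns')

print(by_index.shape)    # (2, 4)
print(by_columns.shape)  # (4, 2)
```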

5 Python Pandas Tricky Examples for Data Analysis

Here are five tricky Python Pandas examples. These provide detailed insights into working with Pandas in Python.

#1 Dealing with datetime data (parse_dates pandas example)

```
import pandas as pd

# Convert a column to datetime format
data['date_column'] = pd.to_datetime(data['date_column'])

# Extract components from datetime (e.g., year, month, day)
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month

# Calculate the time difference between two datetime columns
data['time_diff'] = data['end_time'] - data['start_time']
```

#2 Working with text data

```
# Convert text to lowercase
data['text_column'] = data['text_column'].str.lower()

# Count the occurrences of specific words in a text column
data['word_count'] = data['text_column'].str.count('word')

# Extract information using regular expressions
data['extracted_info'] = data['text_column'].
```
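The snippets above assume a DataFrame named data with those columns already exists. Here is a minimal self-contained sketch under that assumption; the sample rows are illustrative only.

```
import pandas as pd

# Small illustrative dataset with the columns the snippets above assume
data = pd.DataFrame({
    'date_column': ['2024-01-05', '2024-02-10'],
    'start_time': ['2024-01-05 08:00', '2024-02-10 09:30'],
    'end_time': ['2024-01-05 10:15', '2024-02-10 11:00'],
    'text_column': ['Hello World', 'word word count'],
})

# Datetime handling
data['date_column'] = pd.to_datetime(data['date_column'])
data['start_time'] = pd.to_datetime(data['start_time'])
data['end_time'] = pd.to_datetime(data['end_time'])
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
data['time_diff'] = data['end_time'] - data['start_time']

# Text handling
data['text_column'] = data['text_column'].str.lower()
data['word_count'] = data['text_column'].str.count('word')

print(data)
```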