Top Key Architecture Components in Hive

Hadoop Hive has five architectural components:
  • Shell: allows interactive queries, like a MySQL shell connected to a database
– Also supports web and JDBC clients
  • Driver: session handles, fetch, execute
  • Compiler: parse, plan, optimize
  • Execution engine: DAG of stages (M/R, HDFS, or metadata)
  • Metastore: schema, location in HDFS, SerDe
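
The execution engine's "DAG of stages" can be pictured with a small sketch. This is only a toy model in Python, not Hive code: the stage names and the dependency graph are hypothetical, and the ordering uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph for one query.
# Each key maps a stage to the set of stages it depends on.
stages = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def run_order(stage_deps):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(stage_deps).static_order())

order = run_order(stages)
print(order)
```

Whatever order is printed, both scans come before the join, and the result is written last. A real engine could also run independent stages (the two scans here) in parallel.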

Data Model of Hive:
  • Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)
  • Partitions
– e.g., to range-partition tables by date
  • Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
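
To make partitions and buckets concrete, here is a minimal Python sketch of how rows might be grouped: first by a partition value (a date), then by a hash of a key within that partition. The column names, the bucket count, and the use of CRC32 are assumptions for illustration; Hive uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a stable hash (CRC32 here, not Hive's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Hypothetical rows: order_date is the partition column, user_id the bucketing column.
rows = [
    {"user_id": "alice", "order_date": "2024-01-01", "amount": 10.0},
    {"user_id": "bob", "order_date": "2024-01-01", "amount": 7.5},
    {"user_id": "alice", "order_date": "2024-01-02", "amount": 3.0},
]

# Group rows by (partition value, bucket number).
layout = {}
for row in rows:
    key = (row["order_date"], bucket_for(row["user_id"]))
    layout.setdefault(key, []).append(row)

for (partition, bucket), members in sorted(layout.items()):
    print(partition, bucket, [m["user_id"] for m in members])
```

Because the same key always lands in the same bucket number, reading a single bucket gives a repeatable sample, and two tables bucketed the same way on the join key can be joined bucket by bucket.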

Hive Metastore
  • Database: namespace containing a set of tables
  • Holds table definitions (column types, physical layout)
  • Partition data 
  • Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases
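
The kind of information listed above can be sketched as a tiny relational store. The example below uses Python's built-in `sqlite3` with an in-memory database; the table and column names are made up for illustration and are not the real metastore schema.

```python
import sqlite3

# Toy schema for illustration only; the real metastore schema is different.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_tables (
    tbl_id INTEGER PRIMARY KEY, db_name TEXT, tbl_name TEXT, location TEXT
);
CREATE TABLE toy_columns (
    tbl_id INTEGER, position INTEGER, col_name TEXT, col_type TEXT
);
CREATE TABLE toy_partitions (
    tbl_id INTEGER, part_spec TEXT, location TEXT
);
""")

# Register one hypothetical table with two typed columns and one partition.
conn.execute(
    "INSERT INTO toy_tables VALUES (1, 'sales', 'orders', '/home/hive/warehouse/orders')"
)
conn.executemany(
    "INSERT INTO toy_columns VALUES (1, ?, ?, ?)",
    [(0, "order_id", "int"), (1, "amount", "float")],
)
conn.execute(
    "INSERT INTO toy_partitions VALUES "
    "(1, 'ds=2024-01-01', '/home/hive/warehouse/orders/ds=2024-01-01')"
)

def describe(db_name, tbl_name):
    """Return (column name, column type) pairs for a table, in column order."""
    return conn.execute(
        """SELECT c.col_name, c.col_type
           FROM toy_columns c JOIN toy_tables t ON c.tbl_id = t.tbl_id
           WHERE t.db_name = ? AND t.tbl_name = ?
           ORDER BY c.position""",
        (db_name, tbl_name),
    ).fetchall()

print(describe("sales", "orders"))  # [('order_id', 'int'), ('amount', 'float')]
```

The point of the sketch is the separation of concerns: the metadata lives in a relational database, while the data itself stays in files on HDFS.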

Physical Layout of Hive
  • Warehouse directory in HDFS
– e.g., /home/hive/warehouse
  • Tables stored in subdirectories of warehouse
– Partitions, buckets form subdirectories of tables
  • Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
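
As a rough sketch of this layout, the Python below builds the directory for one partition of a table and splits one line of control-character-delimited text. The helper names, the table name `orders`, and the partition column `ds` are hypothetical. The `column=value` directory naming and the Ctrl-A (`\x01`) field delimiter are assumed here to follow Hive's usual defaults.

```python
from pathlib import PurePosixPath

WAREHOUSE = PurePosixPath("/home/hive/warehouse")  # example path from the text

def partition_dir(table, partition_col, partition_value):
    """Build the directory for one partition: <warehouse>/<table>/<col>=<value>."""
    return WAREHOUSE / table / f"{partition_col}={partition_value}"

def parse_line(line, delimiter="\x01"):
    """Split one text row on the field delimiter (Ctrl-A assumed as the default)."""
    return line.rstrip("\n").split(delimiter)

print(partition_dir("orders", "ds", "2024-01-01"))
# /home/hive/warehouse/orders/ds=2024-01-01

print(parse_line("1\x01alice\x013.5\n"))
# ['1', 'alice', '3.5']
```

Every field comes back as a string; turning `'3.5'` into a float according to the table's column types is the job a SerDe does when the data is read.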
