Posts

Showing posts with the label ETL questions

Featured Post

8 Ways to Optimize AWS Glue Jobs in a Nutshell

Improving the performance of AWS Glue jobs involves several strategies that target different aspects of the ETL (Extract, Transform, Load) process. Here are some key practices.

1. Optimize Job Scripts

Partitioning: Ensure your data is properly partitioned. Partitioning divides your data into manageable chunks, allowing parallel processing and reducing the amount of data scanned.

Filtering: Apply pushdown predicates to filter data early in the ETL process, reducing the amount of data processed downstream.

Compression: Use compressed file formats (e.g., Parquet, ORC) for your data sources and sinks. These formats not only reduce storage costs but also improve I/O performance.

Optimize Transformations: Minimize the number of transformations and actions in your script. Combine transformations where possible and use DataFrame APIs, which are optimized for performance.

2. Use Appropriate Data Formats

Parquet and ORC: These columnar formats are efficient for storage and querying, signif

19 Top Unix File Scenario Commands

An ETL developer's main task before testing starts is to browse various flat files. File browsing in UNIX is tricky, but knowing the right command can save a lot of time. These 19 UNIX file commands are useful in your projects. In UNIX, a file can normally have a Header, Detail records, and a Trailer. There are scenarios where you need only the details without the header and trailer, where you need only the most recent record, or where you need to skip some records from the input file. For all of these file-based scenarios, I have given useful UNIX commands.

1). How to print/display the first line of a file?

There are many ways to do this. However, the easiest way to display the first line of a file is the [head] command.

$> head -1 file.txt

If you specify [head -2], it prints the first 2 records of the file.

Another way is to use the [sed] command. [sed] is a very powerful stream editor that can be used for various text manipulation purposes like this.

$> sed '2,$ d
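
The header, detail, and trailer scenarios described above can be sketched on a small throwaway file. This is only an illustration under assumptions: the file contents, the pipe-delimited layout, and the variable names (sample_file, first_line, details, latest) are hypothetical and do not come from the original post.

```shell
# Hypothetical sample file: one header, three detail records, one trailer.
sample_file="$(mktemp)"
printf '%s\n' 'HDR|20240101' 'D|1|alpha' 'D|2|beta' 'D|3|gamma' 'TRL|3' > "$sample_file"

# First line only (the header record).
first_line="$(head -n 1 "$sample_file")"

# Detail records only: delete line 1 (header) and the last line (trailer).
details="$(sed '1d;$d' "$sample_file")"

# Most recent detail record: the last line among the details.
latest="$(printf '%s\n' "$details" | tail -n 1)"

printf 'header: %s\nlatest detail: %s\n' "$first_line" "$latest"
```

Here [head -n 1] is the portable spelling of [head -1], and the [sed] expression relies on two standard addresses, 1 for the first line and $ for the last.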