Editor's Note: MapR products referenced are now part of the HPE Ezmeral Data Fabric.

With Apache Spark 2.0 and later versions, big improvements were implemented to enable Spark to execute faster, making many earlier tips and best practices obsolete. This post first gives a quick overview of what changed and then offers some tips for taking advantage of these changes.

Tungsten

Tungsten is the code name for the Spark project that changed Apache Spark's execution engine, focusing on improvements to the efficiency of memory and CPU usage. Tungsten builds upon ideas from modern compilers and massively parallel processing (MPP) technologies, such as Apache Drill, Presto, and Apache Arrow:

- To reduce JVM object memory size, creation, and garbage collection processing, Spark explicitly manages memory and converts most operations to operate directly against binary data.
- A columnar layout for in-memory data avoids unnecessary I/O and accelerates analytical processing performance on modern CPUs and GPUs.
- Vectorization allows the CPU to operate on vectors, which are arrays of column values from multiple records. This takes advantage of modern CPU designs by keeping all pipelines full to achieve efficiency.
- To improve the speed of data processing through more effective use of the L1/L2/L3 CPU caches, Spark algorithms and data structures exploit the memory hierarchy with cache-aware computation.
- Spark SQL "Whole-Stage Java Code Generation" optimizes CPU usage by generating a single optimized function in bytecode for the set of operators in a SQL query (when possible), instead of generating iterator code for each operator.

Catalyst Optimizer

Spark SQL's Catalyst Optimizer underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Structured Streaming. The Catalyst optimizer handles analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. Catalyst now supports both rule-based and cost-based optimization.

Tips for Taking Advantage of Spark 2.x Improvements

Use Datasets, DataFrames, and Spark SQL

In order to take advantage of Spark 2.x, you should use Datasets, DataFrames, and Spark SQL instead of RDDs. Datasets, DataFrames, and Spark SQL benefit from Catalyst query optimization, whole-stage code generation, and reduced garbage collection processing overhead. Datasets additionally provide the advantage of compile-time type safety over DataFrames. However, Dataset functional transformations (like map) will not take advantage of query optimization, whole-stage code generation, and reduced GC. When possible, use built-in Spark SQL functions, for example to_date() and hour(), instead of custom UDFs in order to benefit from these advantages.

Use the Best Data Store for Your Use Case

Spark supports several data formats, including CSV, JSON, ORC, and Parquet, and several data sources or connectors, including popular NoSQL databases and distributed messaging stores. But just because Spark supports a given data store or format doesn't mean you'll get the same performance with all of them. Typically, data pipelines involve multiple data sources and sinks and multiple formats to support different use cases and different read/write latency requirements.

File data stores are good for write once (append only), read many use cases. CSV and JSON data formats give excellent write-path performance but are slower for reading; these formats are good candidates for collecting raw data, for example logs, which require high-throughput writes. Parquet is slower for writing but gives the best performance for reading; this format is good for BI and analytics, which require low-latency reads.

Apache HBase and MapR Database (now part of HPE Ezmeral Data Fabric) are good for random read/write use cases. MapR Database supports consistent, predictable, high-throughput fast reads and writes with efficient updates, automatic partitioning, and sorting.

Apache Kafka and MapR Event Store for Kafka (also now part of HPE Ezmeral Data Fabric) are good for scalable reading and writing of real-time streaming data.
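To make the "prefer built-in functions over UDFs" tip concrete, here is a minimal Scala sketch. It assumes a local Spark 2.x installation; the data, column names, and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{to_date, hour, col}

object BuiltInFunctionsDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("builtin-functions-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical raw events: an id and a timestamp string.
    val events = Seq(
      ("e1", "2019-03-04 10:15:00"),
      ("e2", "2019-03-04 23:45:00")
    ).toDF("id", "ts")

    // Built-in functions such as to_date() and hour() stay inside
    // Catalyst, so the plan is optimized and code-generated; an
    // equivalent custom UDF would be a black box to the optimizer.
    val withParts = events
      .withColumn("date", to_date(col("ts")))
      .withColumn("hour", hour(col("ts")))

    withParts.show()
    spark.stop()
  }
}
```

The same columns could be derived with a UDF that parses the string by hand, but Catalyst cannot look inside a UDF, so those stages would fall out of whole-stage code generation.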
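One way to observe whole-stage code generation is explain(): in the physical plan, operators prefixed with an asterisk (*) have been fused into a single generated function. A small sketch, assuming an active SparkSession named spark:

```scala
import spark.implicits._

// Toy DataFrame; the column name is arbitrary.
val df = Seq(1, 2, 3, 4).toDF("n")

// In the printed physical plan, stages marked with '*' were
// collapsed by whole-stage code generation into one compiled function.
df.filter($"n" > 1).groupBy().count().explain()
```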
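The file-format trade-offs above can be sketched with Spark's read/write API. The paths and the rawEvents DataFrame here are hypothetical: JSON captures a high-throughput raw feed, while a Parquet copy serves low-latency analytical reads.

```scala
// Assumes an active SparkSession `spark` and a DataFrame `rawEvents`.
import org.apache.spark.sql.functions.col

// Fast write path: JSON suits high-throughput raw collection (e.g., logs).
rawEvents.write.mode("overwrite").json("/tmp/pipeline/raw_json")

// Read-optimized copy: Parquet's columnar layout enables column pruning
// and predicate pushdown, which BI and analytics queries benefit from.
rawEvents.write.mode("overwrite").parquet("/tmp/pipeline/events_parquet")

val analytics = spark.read.parquet("/tmp/pipeline/events_parquet")
analytics.select("id", "ts").where(col("ts").isNotNull).show()
```

Keeping both copies is a common pipeline pattern: the write-friendly format is the landing zone, and a scheduled job rewrites it into the read-friendly format.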