

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries on data that is stored in Amazon Simple Storage Service (Amazon S3). With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored natively in Amazon Redshift.

Amazon Redshift Spectrum offers several capabilities that widen your possible implementation strategies. For example, it expands the data size accessible to Amazon Redshift and enables you to separate compute from storage to enhance processing for mixed-workload use cases.

Amazon Redshift Spectrum also increases the interoperability of your data, because you can access the same S3 object from multiple compute platforms beyond Amazon Redshift. Such platforms include Amazon Athena, Amazon EMR with Apache Spark, Amazon EMR with Apache Hive, Presto, and any other compute platform that can access Amazon S3. You can query vast amounts of data in your Amazon S3 data lake without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. You can also join external Amazon S3 tables with tables that reside on the cluster’s local disk. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.

In this post, we collect important best practices for Amazon Redshift Spectrum and group them into several different functional groups. We base these guidelines on many interactions and considerable direct project work with Amazon Redshift customers.

Before you get started, there are a few setup steps. For more information about prerequisites to get started in Amazon Redshift Spectrum, see Getting started with Amazon Redshift Spectrum. To perform tests to validate the best practices we outline in this post, you can use any dataset.

Amazon Redshift Spectrum supports many common data formats: text, Parquet, ORC, JSON, Avro, and more. You can query data in its original format or convert it to a more efficient one based on your data access pattern, storage requirements, and so on. For example, if you often access a subset of columns, a columnar format such as Parquet or ORC can greatly reduce I/O by reading only the needed columns. How to convert from one file format to another is beyond the scope of this post. For more information on how this can be done, see resources such as the Amazon Redshift data lake export feature.
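The point that a columnar format such as Parquet or ORC reduces I/O when you read only a subset of columns can be illustrated with a toy model. The following Python sketch is not how Parquet or ORC are implemented; it simply stores the same hypothetical rows once record by record and once column by column, then compares how many bytes must be read to total a single column. The column names and sample data are invented for illustration.

```python
import json

# Hypothetical sample data; the column names are illustrative only.
ROWS = [
    {
        "order_id": i,
        "customer": f"customer-{i:04d}",
        "comment": "x" * 200,  # a wide column that the query does not need
        "amount": i * 1.5,
    }
    for i in range(1000)
]


def row_layout(rows):
    """Row-oriented storage: one serialized record per row."""
    return [json.dumps(row).encode() for row in rows]


def column_layout(rows):
    """Column-oriented storage: one serialized chunk per column."""
    return {
        name: json.dumps([row[name] for row in rows]).encode()
        for name in rows[0]
    }


def sum_amount_from_rows(layout):
    """Every record must be read in full to extract one field."""
    bytes_read = sum(len(record) for record in layout)
    total = sum(json.loads(record)["amount"] for record in layout)
    return total, bytes_read


def sum_amount_from_columns(layout):
    """Only the chunk for the needed column is read."""
    chunk = layout["amount"]
    return sum(json.loads(chunk)), len(chunk)


row_total, row_bytes = sum_amount_from_rows(row_layout(ROWS))
col_total, col_bytes = sum_amount_from_columns(column_layout(ROWS))
```

Both paths compute the same total, but the column-oriented path touches only the `amount` chunk, so `col_bytes` is a small fraction of `row_bytes`. Real columnar formats add compression and encoding on top of this layout, which this sketch does not model.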
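To make the idea of querying S3 data with Redshift SQL and joining it to a local table more concrete, here is a minimal sketch that assembles the kind of statements involved. It only builds SQL text and does not connect to a cluster. The schema, database, table, and column names, the IAM role ARN, and the S3 path are all hypothetical placeholders, and the local `public.customers` table is assumed to exist already. The statements follow the general shape of Redshift's `CREATE EXTERNAL SCHEMA` and `CREATE EXTERNAL TABLE` commands, so verify them against the Amazon Redshift documentation before use.

```python
def build_spectrum_statements(schema, catalog_db, iam_role_arn, s3_location):
    """Return SQL text for an external schema, an external table, and a join.

    All identifiers passed in are caller-supplied placeholders; nothing here
    is validated or executed.
    """
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )

    # External table over Parquet files in S3 (hypothetical columns).
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.sales (\n"
        f"    sale_id BIGINT,\n"
        f"    customer_id BIGINT,\n"
        f"    amount DECIMAL(12,2)\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )

    # Join the external S3 table with a table stored locally on the cluster.
    join_query = (
        f"SELECT c.customer_name, SUM(s.amount) AS total_amount\n"
        f"FROM {schema}.sales AS s\n"
        f"JOIN public.customers AS c ON c.customer_id = s.customer_id\n"
        f"GROUP BY c.customer_name;"
    )

    return [create_schema, create_table, join_query]


STATEMENTS = build_spectrum_statements(
    schema="spectrum_demo",
    catalog_db="spectrum_demo_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    s3_location="s3://example-bucket/sales/",
)
```

The join selects only the columns it needs from the external table, which pairs well with a columnar format such as Parquet.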
