How to use S3 data source in Quicksight? approaches and strategies

Using S3 data source in Quicksight can be difficult most of the time as it holds unstructured data and the files can have different data formats, however, these limitations entirely depend on the requirements, data format, and volume of data. Following are some of the few approaches to connecting S3 data sources from Quicksight.

Connecting S3 data source directly from Quicksight

Connecting directly to S3 buckets and files from Quicksight requires a manifest file that can either be in the form of a URL  or available locally to upload. S3 manifest files are of JSON format that contains the file location, URI,  prefix, and Globaluploadsettings of S3 files. Quicksight processes the JSON manifest file and imports all the associated files into spice for further analysis. If the concerned file mentioned in the manifest file or uploaded file is not found, Quicksight ends the process and returns an error.

Manifest File  Format:

The following JSON code is the format for the manifest files while performing the S3 imports in Quicksight. 
{
"fileLocations": [
{
"URIs": [
"uri1",
"uri2",
"uri3"
]
},
{
"URIPrefixes": [
"prefix1",
"prefix2",
"prefix3"
]
}
],
"globalUploadSettings": {
"format": "CSV",
"delimiter": ",",
"textqualifier": "'",
"containsHeader": "true"
}
}

fileLocations: Use the fileLocations element to indicate the import file locations. To do this, you can utilise one or both of the URIs and URIPrefixes arrays. At least one value must be specified in either one of them.

  • URIs: URI’s are the list of file locations of the files to be imported into Quicksight. URI format example: https://s3.amazonaws.com/<bucket name>/<file name>.
  • URIPrefixes: These are the prefixes of s3 buckets and folders and all the files in the respective folder and buckets are imported into Quicksight. Recursively retrieving files from subfolders is a feature of Amazon QuickSight. URIPrefixes example: https://s3.amazonaws.com/<bucket name>.

Globalupdatesetting: You can provide field delimiters and other import options for the Amazon S3 files using this element. Amazon QuickSight utilises the default values for the fields in this section if this element is not supplied.

  • Format: Set the file formats that will be imported. CSV, TSV, CLF, ELF, and JSON are all acceptable formats. CSV is the default selection.
  • Delimiter: Set the delimiter for the file field. The format field’s file type specification must be mapped to this. Commas (,) and tabs (t) are acceptable formats for .csv and .tsv files, respectively. The comma is the default value (,).

textqualifier: The file text qualifier must be specified. Single quotes (‘) and double quotes are acceptable formats ” (\”). JSON requires the leading backslash as an escape character before a double quotation.

Containsheader: Indicate whether a header row exists in the file. True or false are acceptable formats. True is the default setting.

However, direct query or custom SQL queries from Quicksight are not supported on s3 folders and files, hence, the complete file is imported into spice after Quicksight processes the JSON manifest file and the analyses are performed thereafter in the dashboard. Additionally, having the entire data in spice certainly costs much, especially if the data tend to grow on a continuous phase. It is a major constraint in terms of connecting the S3 data source directly from Quicksight.

Another limitation with this approach is that certain file formats are not supported for instance Avro, parquet, and orc

Connecting S3 data source through AWS Glue and AWS Athena:

In this scenario, AWS Glue works in tandem with Athena for the creation of a data catalog out of the data it receives from the S3 bucket. The data catalog contains the schema for the source data such as column names, data type, and source name.

Aws Glue set up a crawler in its environment to crawl the data regularly from the data source on schedule or demand. The glue classifier identifies the data format and assigns the appropriate data type for all the columns which are also mutable, later, the metadata is stored in the Glue catalog.

This schema information also appears in Athena and allows us to perform custom queries on the data that is physically present in the S3 bucket. Therefore, connecting the Athena data source from Quicksight allows us to query the data present in S3 regardless of the format. It eventually eliminates the risk of having the entire S3 data imported into spice, instead, the direct query is applied to the S3 data via Athena catalog to retrieve the preferred columns with the required data transformation applied.

Connecting S3 data source through AWS Athena

In this approach, based on the information in the S3 data source, tables and schema information are added manually in Athena. Since AWS glue is not involved here to automatically identify the metadata information of S3 files, the cost of running the glue is eliminated. Moreover, unlike the AWS glue, the configurations are performed manually but are not limited to anything that glue does automatically.

The columns can be added either individually or in bulk after creating the database and tables in Athena. Similarly, the data format is also selected manually as what is there in the S3 data source.

In addition, Partitioning is also supported in Athena for the efficient querying of data. Like the previous approach, both the direct query and spice modes are enabled for the Athena data source from Quicksight.

In conclusion, a direct connection to the S3 data source always imports the data into spice and the direct query isn’t applicable. When connecting to Athena in QuickSight you have the option of using Direct Query or importing into SPICE. You also get the ability to use Custom SQL with Athena, so can do some data transformations too. Lastly, Athena supports querying Parquet, Avro, etc whereas you cannot import Parquet files using the direct S3 connection.

 

2 thoughts on “How to use S3 data source in Quicksight? approaches and strategies”

Leave a Comment

Exit mobile version