Redshift and ELT

AWS Redshift is Amazon’s managed data warehouse service, which we make heavy use of at Bamboo. A common process when using a data warehouse is Extract, Transform, Load (ETL). In short, ETL is the process of extracting data from a source system/database, transforming it into a representation suitable for use in a (relational) data warehouse and then loading the transformed data into the warehouse.

A common theme when using Redshift is to flip the order of the Transform and Load steps, and instead load raw data extracted from a source system directly into Redshift, and then use Redshift’s compute power to perform any transformations. This is known as ELT. ELT is beneficial because it often removes the need for a separate transformation tool, reducing effort and cost to make data available in Redshift.

An example of Redshift’s support for ELT is the SUPER column type, which allows the storage of structured (JSON) data directly in Redshift relations. Recently, AWS have improved their support for transforming such structured data with the new UNPIVOT keyword to destructure JSON objects.

In this post we’ll demonstrate UNPIVOT and how it enhances Redshift’s ELT abilities.

Example: inventory tracking

Consider an imaginary inventory tracking system that tracks the inventory of several shops, where each shop has an inventory of arbitrary items; assume that the shop’s source systems store the inventory as JSON objects. We can store this structured data by parsing JSON into the SUPER column type using json_parse. (For this post, we will use a temporary table, but the queries would also work with a non-temporary table.)

> CREATE TEMPORARY TABLE example_data AS (
    SELECT 10 AS shop_id,
           json_parse('{"apple_count": 2, "orange_count": 6}') AS inventory
    UNION ALL
    SELECT 20, json_parse('{"pear_count": 10, "other_data": 42}')
    UNION ALL
    SELECT 30, json_parse('{"apple_count": 3, "lemon_count": 5}')
  )

> SELECT * FROM example_data
+-----------+------------------------------------+
| shop_id   | inventory                          |
|-----------+------------------------------------|
| 10        | {"apple_count":2,"orange_count":6} |
| 20        | {"pear_count":10,"other_data":42}  |
| 30        | {"apple_count":3,"lemon_count":5}  |
+-----------+------------------------------------+

We can extract a particular item’s count from each inventory:

> SELECT shop_id, 'apple' AS item_name, inventory.apple_count AS count
  FROM example_data
  WHERE count IS NOT NULL
+-----------+-------------+---------+
| shop_id   | item_name   | count   |
|-----------+-------------+---------|
| 10        | apple       | 2       |
| 30        | apple       | 3       |
+-----------+-------------+---------+

and could count the total item count across all shops:

> SELECT 'apple' AS item_name, sum(inventory.apple_count) AS total
  FROM example_data
+-------------+---------+
| item_name   | total   |
|-------------+---------|
| apple       | 5       |
+-------------+---------+

However, it would be inconvenient to try and do the same count for all item types (especially if we don’t know when writing the query all the types that exist). We could use something like:

> SELECT *
  FROM (
    SELECT shop_id, 'apple' AS item_name, inventory.apple_count AS count
    FROM example_data
    WHERE count IS NOT NULL
    UNION ALL
    SELECT shop_id, 'orange' AS item_name, inventory.orange_count AS count
    FROM example_data
    WHERE count IS NOT NULL
    UNION ALL
    SELECT shop_id, 'pear' AS item_name, inventory.pear_count AS count
    FROM example_data
    WHERE count IS NOT NULL
    UNION ALL
    SELECT shop_id, 'lemon' AS item_name, inventory.lemon_count AS count
    FROM example_data
    WHERE count IS NOT NULL
  ) d
  ORDER BY shop_id, item_name
+-----------+-------------+---------+
| shop_id   | item_name   | count   |
|-----------+-------------+---------|
| 10        | apple       | 2       |
| 10        | orange      | 6       |
| 20        | pear        | 10      |
| 30        | apple       | 3       |
| 30        | lemon       | 5       |
+-----------+-------------+---------+

but that does not seem easy to read, or maintain, due to the level of duplication.

Luckily, using UNPIVOT we do not need to hard-code item names, and can extract each key-value mapping from the inventory object as a separate row:

> SELECT shop_id, key, value
  FROM example_data ed, UNPIVOT ed.inventory AS value AT key
  ORDER BY shop_id, key
+-----------+--------------+---------+
| shop_id   | key          | value   |
|-----------+--------------+---------|
| 10        | apple_count  | 2       |
| 10        | orange_count | 6       |
| 20        | other_data   | 42      |
| 20        | pear_count   | 10      |
| 30        | apple_count  | 3       |
| 30        | lemon_count  | 5       |
+-----------+--------------+---------+

N.b. that the comma in FROM example_data ed, UNPIVOT is intentional, and represents a CROSS JOIN in SQL-89 syntax. We could equivalently write FROM example_data ed CROSS JOIN UNPIVOT ed.inventory AS value AT key, but use the comma syntax since it matches the Redshift documentation.

Now we can filter-out the non-count rows based on their key, remove the ‘_count’ suffix and rename the columns to obtain the same result as our verbose query above:

> SELECT shop_id, regexp_replace(key, '_count$', '') AS item_name, value AS count
  FROM example_data ed, UNPIVOT ed.inventory AS value AT key
  WHERE key LIKE '%_count'
  ORDER BY shop_id, item_name
+-----------+-------------+---------+
| shop_id   | item_name   | count   |
|-----------+-------------+---------|
| 10        | apple       | 2       |
| 10        | orange      | 6       |
| 20        | pear        | 10      |
| 30        | apple       | 3       |
| 30        | lemon       | 5       |
+-----------+-------------+---------+

The result is exactly the same as for our verbose query, but we haven’t had to write a duplication-heavy query, or explicitly name each item that we are interested in; this is a great improvement!

Combining UNPIVOT and unnesting

We can take our example one step further, and nest the shop_id and associated inventory into a single object, and then store those objects in an array. That is, as JSON:

[
  {"shop_id": 10, "inventory": {"apple_count": 2, "orange_count": 6}},
  {"shop_id": 20, "inventory": {"pear_count": 10, "other_data": 42}},
  {"shop_id": 30, "inventory": {"apple_count": 3, "lemon_count": 5}}
]

We construct another temporary table for our example:

> CREATE TEMPORARY TABLE nested_example_data AS (
  SELECT json_parse('
    [
      {"shop_id": 10, "inventory": {"apple_count": 2, "orange_count": 6}},
      {"shop_id": 20, "inventory": {"pear_count": 10, "other_data": 42}},
      {"shop_id": 30, "inventory": {"apple_count": 3, "lemon_count": 5}}
    ]
  ') AS shop_inventories
  )

and can then combine the existing support for unnesting JSON arrays and new support for unnesting objects:

> SELECT shop_index, shop.shop_id, key, value
  FROM nested_example_data AS ned,
  ned.shop_inventories AS shop AT shop_index,
  UNPIVOT shop.inventory AS value AT KEY
+--------------+-----------+--------------+---------+
| shop_index   | shop_id   | key          | value   |
|--------------+-----------+--------------+---------|
| 0            | 10        | apple_count  | 2       |
| 0            | 10        | orange_count | 6       |
| 1            | 20        | pear_count   | 10      |
| 1            | 20        | other_data   | 42      |
| 2            | 30        | apple_count  | 3       |
| 2            | 30        | lemon_count  | 5       |
+--------------+-----------+--------------+---------+

notice that we unnest the array, obtaining each object as shop (and capturing the 0-based array index in shop_index) and then unnest shop as before, using UNNEST.

Conclusion

In this post we have shown an example where the new UNPIVOT keyword in Redshift helps us write powerful transformation queries directly in Redshift, greatly enhancing our ability to write ELT-style transformations.

Unnesting JSON objects in Redshift

Using UNPIVOT to iterate over objects

Redshift and ELT

Example: inventory tracking

Combining UNPIVOT and unnesting

Conclusion