

Better Lookup Table Integration — using dbt seeds

Are your data transformation jobs configurable?

At Earnest, we often use configurations as a way to integrate domain expertise into our transformations. They give analysts a simple interface to enrich datasets without having to understand the inner workings of the pipeline.

We use CSV as our configuration “language”: most datasets we work with are tabular by nature, and tabular configurations are easy to “JOIN” to those datasets.

In this post I describe how Earnest supported configurations for data engineering before adopting dbt, and how adopting dbt has simplified our setup. I also share three specific use cases for configurations/seeds, with concrete examples.

If you are not already familiar with dbt, this “What is dbt?” page is all you need to understand this introductory blog post.

Before dbt at Earnest

Our workflow for configurations outside of dbt is as follows: analysts make changes to the CSVs in a version-controlled repository; the changes are then validated by Continuous Integration checks and reviewed by engineers (and other analysts).

Validating a change is quite complicated:

  • Schema integrity checks need to be in place, for example: “this column should not be null” or “this column should have unique values”. 
  • The CSVs need to be valid and a change should not “break the schema”: i.e. column values should not contain the delimiter and the number of columns should be consistent.
  • The change should not have unexpected consequences: analysts should be able to test how a contribution would impact the output.

These requirements led to significant engineering investment in a QC system, and a Web application to simulate the impact of changes.

When we started looking into dbt several months back, we found that it treats lookup tables as “first-class citizens”. It has a dedicated concept called “seeds”, with features that make it easy to integrate these CSVs into the Data Warehouse.

Why we ❤️ dbt seeds

A seed can be validated like other tables (“models” and “sources”) using the same tools: “Schema tests” and “Data tests”. So we do not need a separate QC system.
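
The built-in schema tests (for example unique and not_null) cover the most common checks. As a sketch of a custom “data test” on a seed, a SQL file that fails when it returns rows could enforce uniqueness of city_code in a hypothetical us_cities seed:

-- tests/us_cities_city_code_is_unique.sql
-- the test fails if any city_code appears more than once
select
    city_code,
    count(*) as occurrences
from {{ ref('us_cities') }}
group by city_code
having count(*) > 1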

Checking the impact of changes in lookup tables on the transformations is also much simpler:

  • Using Snowflake/BigQuery, we do not have to provide complex tooling to provision clusters like we did with Spark.
  • Relying on dbt to dynamically change schema and database names, we can easily create isolated environments.
  • The dbt command line makes it easy to run portions of a pipeline, and we can target our runs to all the models that would be impacted by a seed change.
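
As an example, after editing a seed, two commands are usually enough to reload it and rebuild everything downstream of it (the seed name us_cities is illustrative, and flag spellings vary slightly between dbt versions):

dbt seed --select us_cities
dbt run --models us_cities+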

Once analysts learned to use dbt, they could prototype and test changes autonomously, so we no longer need separate applications dedicated to simulating the impact of configuration changes.

With this new workflow, our usage of lookup tables grew, and in our team project we now have ~60 seeds for ~150 models. As a team, we learned to leverage seeds for specific use cases, and structure them differently depending on their use.


3 Kinds Of Lookups

Before we get into examples, let's talk about identifiers and foreign keys. When it comes to lookup tables, there are many options for assigning a unique key to an entity:

  • Using an incremental numeric identifier.
  • Using a UUID.
  • Using a hash of one or multiple natural keys.
  • Using the natural keys directly.

We find the first three options tedious to maintain, as seeds are meant to be manually edited and reviewed. When seeds are linked to each other, we constantly need to refer to the “joined” seed in order to understand what a certain UUID/numeric identifier/hash refers to.

We prefer assigning a plain-English snake_case identifier to each entity. By convention, we suffix these columns with “_code”.

Eliminating “CASE WHEN”s

We have a rule of thumb that if a “case when” has 4 or more “whens”, it should probably live in a seed.

These expressions often look like mappings, which are better expressed as data than logic:

  • They are verbose in a SQL statement.
  • They are often susceptible to change.
  • They can sometimes be reused in different parts of the pipeline.
  • They can include additional columns for presentation (BI).

Example: 

Pretend names of cities are unique in the U.S. and we are trying to group them by state. In one pipeline, we turn the raw city name (e.g. “Aliso Viejo”) coming from our source systems into a snake_case version (e.g. “aliso_viejo”) which is used as the JOIN key to a lookup table.

When we need to apply the same transformation in a different pipeline, maybe the raw city name comes in a different format (e.g. “AlisoViejo”), so we apply a different transformation to get to `snake_case` and we can re-use the same lookup table.

We get three benefits:

  1. We have avoided redundant logic.
  2. We can JOIN the resulting tables and easily aggregate them on the new common columns: “city_code” or “state_code” (conformed dimensions).
  3. We can add additional columns to the CSV for presentation/BI.

Without a seed:

with
 
us_cities as (
   select * from values (1, 'Geneva'), (2, 'Alameda') as t(id, city_name)
)
 
select
   -- group cities into states
   city_name,
   case city_name
       -- bad :(
       when 'Geneva' then 'NY'
       when 'Buffalo' then 'NY'
        when 'Alameda' then 'CA'
        when 'Aliso Viejo' then 'CA'
       -- etc ...
   end as state_code
from us_cities

With a seed:

us_cities.csv:

city_code,city_name,state_code
geneva,Geneva,NY
buffalo,Buffalo,NY
alameda,Alameda,CA
aliso_viejo,Aliso Viejo,CA

SQL:

with
 
us_cities as (
  select * from values
      (1, 'Geneva'),
      (2, 'Alameda'),
      (3, 'Aliso Viejo') as t(id, city_name)
),
 
-- simulate seed
us_cities_lookup as (
  select * from values
      ('geneva', 'Geneva', 'NY'),
      ('buffalo', 'Buffalo', 'NY'),
      ('alameda', 'Alameda', 'CA'),
      ('aliso_viejo', 'Aliso Viejo', 'CA') as t(city_code, city_name, state_code)
)
 
select
  us_cities_lookup.*
from us_cities
join us_cities_lookup
  -- in dbt you could have a macro to turn city name into "snake_case"
  on replace(lower(trim(us_cities.city_name)), ' ', '_') = us_cities_lookup.city_code
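
The comment in the join above refers to a macro; a minimal sketch of such a helper (the name to_snake_case is hypothetical) could be:

-- macros/to_snake_case.sql
{% macro to_snake_case(column) %}
    replace(lower(trim({{ column }})), ' ', '_')
{% endmacro %}

The join condition then becomes: on {{ to_snake_case('us_cities.city_name') }} = us_cities_lookup.city_code.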

Lookups to group other lookups

Often, we have lookup tables that we need to create groups for. For example, we might have “US cities” and want to give analysts the ability to roll up to the “US state” level.

If “US cities” is also a seed you manage, you could represent the “US states” information inside the “US cities” lookup by repeating the “state” specific columns inside the “cities” seed. 

city_code,city_name,state_code,state_name
geneva,Geneva,NY,New York
buffalo,Buffalo,NY,New York
alameda,Alameda,CA,California
aliso_viejo,Aliso Viejo,CA,California

This could become hard to maintain as you add new columns, since data is repeated and changes need to be consistent across all rows.

However, if you don’t foresee many changes (US states are a good example), or the seed is very small (~10 rows), this can be the simplest solution, as there is no need for a separate seed and join.

The alternative is to separate “US states” into its own seed, keeping only the foreign key “state_code” inside the “us_cities” seed.

us_cities.csv:

city_code,city_name,state_code
geneva,Geneva,NY
buffalo,Buffalo,NY
alameda,Alameda,CA
aliso_viejo,Aliso Viejo,CA

us_states.csv:

state_code,state_name,area_km2
NY,New York,141297
CA,California,423967
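
With the two seeds in place, a dbt model can join them back together whenever the denormalized view is needed. A minimal sketch, referring to the seeds above:

-- models/us_cities_enriched.sql
select
    us_cities.city_code,
    us_cities.city_name,
    us_states.state_code,
    us_states.state_name,
    us_states.area_km2
from {{ ref('us_cities') }} as us_cities
join {{ ref('us_states') }} as us_states
    on us_cities.state_code = us_states.state_code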

However, sometimes the lookup table you are trying to create groups for isn’t a seed. It might be a dbt model, or a dataset extracted from an external source.

For these cases, you have to create an additional mapping seed:

us_cities_to_states.csv:

city_code,state_code
geneva,NY
buffalo,NY
alameda,CA
aliso_viejo,CA

When the source has stable and ordered IDs we use ranges as mappings to avoid repetition.

us_cities_to_states.csv:

city_id_start,city_id_end,state_code
1,20,NY
21,28,CA
29,40,NY
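
A sketch of how such a range mapping could be joined to the external dataset (the source name external_data is hypothetical):

-- models/us_cities_with_states.sql
select
    cities.id,
    cities.city_name,
    ranges.state_code
from {{ source('external_data', 'us_cities') }} as cities
join {{ ref('us_cities_to_states') }} as ranges
    on cities.id between ranges.city_id_start and ranges.city_id_end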

The Staging/Production switch 

In our dbt + Snowflake/BigQuery setup, we have three isolated environments:

  • Development: database `my_database_dev`; each user has their own schemas, prefixed by their username.
  • Staging: database `my_database_staging`; it follows the same structure as Production and contains the same volume of data, but it is not exposed to clients and is not updated as often as Production. It is used to test the impact of new features we would like to release to our clients.
  • Production: database `my_database`; the production data from which the client products are created.

To control what is enabled in each environment, we use “production status” CSVs. These CSVs contain at least one column identifying the “group” we want to enable, plus a column indicating which environment it is enabled in.

These groups could be logical categories. For example, if you do analytics on pet food, you could enable only “dog food” in production and keep “cat food” in staging until your “cat food” analytics are rock solid and ready to be delivered.

We apply the same pattern to other use cases, such as conditionally enabling metrics in certain environments, or enabling only certain group + metric combinations in specific products.

These lookup tables are used to create filters enabled by templating in our dbt models.

Putting it all together:

pet_food_production_status.csv:

pet_food_animal_category,metric,product,status
cat,sales_per_breed,pet_food_sales_dashboard,staging
dog,sales,pet_food_sales_dashboard,production

SQL:

with
pet_food_production_status as (
  select * from values ('cat', 'sales_per_breed', 'pet_food_sales_dashboard', 'staging'),
                       ('dog', 'sales' , 'pet_food_sales_dashboard', 'production')
                           as t(pet_food_animal_category, metric, product, status)
),
 
pet_food_metrics as (
  select * from values ('cat', 'sales_per_breed', 'persian', 100),
                       ('cat', 'sales_per_breed', 'siamese', 200),
                       ('dog', 'sales', null, 1000)
                           as t(pet_food_animal_category, metric, breed, metric_value)
)
select
   pet_food_metrics.*
from pet_food_metrics
left join pet_food_production_status
   on pet_food_metrics.pet_food_animal_category = pet_food_production_status.pet_food_animal_category
       and pet_food_metrics.metric = pet_food_production_status.metric
{% if is_staging() %}
where pet_food_production_status.status = 'staging'
{% elif is_production() %}
where pet_food_production_status.status = 'production'
{% endif %}
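
The is_staging() and is_production() macros above are our own. A minimal sketch of how such macros could be implemented, assuming each environment is identified by the dbt target name:

-- macros/environment.sql
{% macro is_staging() %}
    {{ return(target.name == 'staging') }}
{% endmacro %}

{% macro is_production() %}
    {{ return(target.name == 'production') }}
{% endmacro %}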

Warning against using seeds for large data ⚠️

This would not be an introduction to dbt seeds if I did not warn you against using them to load large amounts of data!

From the dbt documentation:

“Seeds should not be used to load raw data (for example, large CSV exports from a production database).

Since seeds are version controlled, they are best suited to files that contain business-specific logic, for example a list of country codes or user IDs of employees.

Loading CSVs using dbt’s seed functionality is not performant for large files. Consider using a different tool to load these CSVs into your data warehouse.”

For big lookup tables we use Google Sheets instead: it gives a better user experience for viewing and maintaining the lookups, and it avoids slowing down our dbt project.
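
In BigQuery, for example, one option is to expose such a sheet as an external table so it stays queryable alongside the rest of the warehouse (a sketch; the dataset, table, columns and spreadsheet URL are placeholders):

create or replace external table analytics.big_lookup (
    city_code string,
    city_name string
)
options (
    format = 'GOOGLE_SHEETS',
    uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],
    skip_leading_rows = 1
);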

Beyond the seeds  🌱

We found that seeds are a good way to surface which dimensions are most important to your data pipeline. For example, you might end up with seeds representing pet food brands, mappings from brands to distributors, pet breeds, and food categories.

Representing these concepts as seeds allows you to iterate quickly on their structure, but the number of seeds can multiply fast, and adding new “pet food brands” can become a tedious and error-prone process of updating many CSVs.

In our team, we expect our seeds to eventually reach “maturity”, with their structure changing only slowly. When that happens, we can use the data model of our seeds to inform the design of an internal app for a more refined user experience.

In Conclusion

Managing data transformation “configurations” is well solved by lookup tables checked into version control and imported into the Data Warehouse through dbt seeds.

These lookups can be used to (among other things):

  • Refactor logic contained in CASE WHEN statements.
  • Create groups of entities, and groups of groups.
  • Filter data/metrics to specific environments.

dbt seeds should not be used to load raw data into a Data Warehouse!

A mature dbt pipeline can require changes to many CSVs/seeds as analytical use cases grow and requirements for more dimensions arise. If the process becomes tedious and needs to scale, you can use your seeds to inform the design of specialized internal apps.

Thanks for reading 🌲
