Transformation, Part 2

Timestamping when data is created or updated is an everyday need in the developer world, and if you're working with Postgres, the definition is simple:

create table products(
  --...
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
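
One thing worth noting: a default only fires on insert, so updated_at won't keep itself current after that. A common fix is a trigger. Here's a minimal sketch; the function and trigger names are mine, not part of any schema we've built:

create or replace function touch_updated_at()
returns trigger as $$
begin
  -- stamp the row with the current time on every update
  new.updated_at = now();
  return new;
end;
$$ language plpgsql;

create trigger products_touch_updated_at
before update on products
for each row execute function touch_updated_at();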

When preparing data for analysis, it's common to denormalize it in order to speed things up, pre-calculating values and storing them, especially with historical data like ours. I'm doing exactly that by creating date and year fields.

Postgres makes this kind of thing simple by offering generated columns:

create table inms(
  id serial primary key,
  created_at timestamp not null,
  date date not null generated always as (created_at::date) stored,
  year int not null generated always as (date_part('year', created_at)) stored,
  --...
);

Whenever the created_at value changes, date and year change with it. These are generated columns: Postgres computes them from the expression and stores the result, and you can't write to them directly.
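
To see it in action, here's a quick sanity check against a hypothetical row with id 1 (this assumes data is already loaded):

-- changing created_at recomputes date and year on the spot
update inms set created_at = '2008-10-09 19:06:40' where id = 1;
select created_at, date, year from inms where id = 1;

-- writing to a generated column directly is rejected
update inms set year = 2008 where id = 1;
-- ERROR:  column "year" can only be updated to DEFAULT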

Here's the entire table definition:

drop schema if exists enceladus cascade;
create schema enceladus;
set search_path='enceladus';

create table inms(
  id serial primary key,
  created_at timestamp not null,
  date date not null generated always as (created_at::date) stored,
  year int not null generated always as (date_part('year', created_at)) stored,
  flyby_id int references flybys(id),
  altitude numeric(9,2) not null check(altitude > 0),
  source text not null check(source in('osi','csn','osnb','osnt')),
  mass numeric(6,3) not null check(mass >=0.125 and mass < 100),
  high_sensitivity_count int not null check(high_sensitivity_count > 0),
  low_sensitivity_count int not null check(low_sensitivity_count > 0),
  imported_at timestamptz not null default now()
);
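
One dependency to keep in mind: the flyby_id foreign key assumes the flybys table from earlier in the course already exists in this schema. If you're running this in isolation, Postgres will complain that relation "flybys" does not exist; a minimal stand-in created beforehand will satisfy the reference (the real table has more to it):

create table flybys(
  id serial primary key
  -- the real flybys table carries the name, date, altitude, and so on
);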

To insert data into this table, we need to execute an insert statement (which is going to fail, but that's OK; it's kind of in the plan). Notice that I don't need to insert all of the fields: the generated columns take care of themselves, as does the imported_at timestamp thanks to its default:

insert into inms(
  created_at, 
  altitude, 
  source, 
  mass, 
  high_sensitivity_count,
  low_sensitivity_count
)
select
  sclk::timestamp,
  alt_t::numeric(9,2),
  source,
  mass_per_charge::numeric(6,3),
  c1counts::int,
  c2counts::int
from csvs.inms
where target='ENCELADUS';
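
When this fails, the error will point at whichever cast or check constraint blew up first. Before chasing it down, it helps to scope the problem by counting the suspect rows in the raw import. Here's a rough sketch, assuming the csvs.inms columns are all text (which is how CSV imports usually land):

select
  count(*) filter (where alt_t is null or alt_t = '') as missing_altitude,
  count(*) filter (where mass_per_charge is null or mass_per_charge = '') as missing_mass,
  count(*) filter (where c1counts is null or c1counts = '') as missing_high_counts,
  count(*) filter (where c2counts is null or c2counts = '') as missing_low_counts
from csvs.inms
where target = 'ENCELADUS';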