Joins & Bias: How Your SQL Queries Secretly Shape Your Data Story

This is a premium course, which you can purchase below.

Bias can creep in anywhere during the analysis process - even in the way you structure your query and the joins you use.

The word bias might sound a bit strong when describing the use of joins. It's important to note, however, that bias can take two forms: implicit and explicit. I dislike brussel sprouts and will never eat them (unless they're wrapped in bacon and then they're tolerable). This is an explicit bias and I know it!

Implicit bias can take many forms that we're simply not aware of, including a lack of knowledge or experience. It also takes the form of unseen preferences which, in the programming world can come down to the languages you use, the editors you write in and even the music you listen to while writing that code in that editor!

We're going to see how implicit bias can change the story our data tells. It will seem subtle - but that's exactly the point...

Left, Right, Full Outer Joins

In this example we work with a many to many join. One person can have many jobs, and one job can have many people. Notice the composite primary key on assignments. This is a common primary key for joining tables. The primary key is defined by two columns instead of one, the combination of each must be unique:

create table people(
  id serial primary key,
  name text not null
);

create table jobs(
  id serial primary key,
  name text not null
);

create table assignments(
  people_id int not null references people(id),
  job_id int not null references jobs(id),
  primary key(people_id, job_id) -- composite primary key
);

insert into people(name)
values('Darth Vader'),('Apple Dumpling'),('Duke Leto'),('Totoro');

insert into jobs(name)
values('Control the Galaxy'),('Bake Apple Pies'),
      ('Rule Arrakis'),('Play shakuhatchi in tree');

insert into assignments(people_id, job_id)
values
    (1,1),(1,2),
    (2,2),(2,4),
    (4,4),(4,1);

Querying this data involves joining the tables together. We can do this exclusively, meaning only data that fits our join condition (people_id = people.id for instance) will be included:

select * from people -- from left...
inner join assignments on people_id = people.id
inner join jobs on jobs.id = assignments.job_id; -- to right

We can loosen things up a bit using a left or left outer join (same thing). Show me all data to the left of the join. In this query, we'll see all people but null for jobs without people:

select * from people --all data coming from here
left outer assignments on people_id = people.id -- and here...
left outer jobs on jobs.id = assignments.job_id; -- only matching from here

This can work in the other direction, to the right of the join as well. Here we'll see all the jobs but null for people without jobs:

select * from people -- only matching from here
right outer assignments on people_id = people.id -- and here...
right outer jobs on jobs.id = assignments.job_id; --all data coming from here

If we want to see absolutely everything, we use a full outer join:

select * from people --all data coming from here
full outer assignments on people_id = people.id -- and here...
full outer jobs on jobs.id = assignments.job_id; --all data coming from here

This shows us everything, including jobs without people and people without jobs.

We can also do a many to many join with two tables, one of them referencing itself. This looks weird, but it implies a completely different type of relationship (bidirectional):

-- drop our existing tables
drop table if exists people cascade;
drop table if exists jobs cascade;
drop table if exists assignments cascade;

create table people(
  id serial primary key,
  name text not null
);

create table friends(
  people_id int not null,
  friend_id int not null references people(id),
  primary key (people_id, friend_id) -- a composite primary key
);

insert into people(name)
values
    ('Darth Vader'),
    ('Apple Dumpling'),
    ('Duke Leto'),
    ('Totoro');
insert into friends(people_id, friend_id)
values
    (1,3),(1,4), -- Vader/Leto, Vader/Totoro
    (2,4),(3,4); -- Apple Dumpling/Totoro, Duke Leto/Totoro

Querying this gets interesting - what is it we hope to understand? The obvious answer is "who's friends with whom", but what will the result set look like? Also, do we care about people without friends?

If we're only interested in showing friends, we can query it like this:

select people.name, friendos.name from people
inner join friends on people.id = friends.people_id
inner join people as friendos on friends.friend_id = friendos.id; -- alias the people table to avoid naming collisions

This exclusive result set is what we're looking for as showing a more detailed result set (using left, right or full) would cause duplication. In other words, we'd see Vader/Leto and Leto/Vader. Friendship is kind of a mutual thing and all we need to see is a single result to define the idea.

The Bias Discussion

In that last paragraph I just defined what friendship means and created a table design and query based on my thoughts and beliefs about friendship. Do you trust me to accurately do this? Do you trust yourself? What kinds of questions might arise from this data set in the future?

The answer is awareness. It's yet another reason we write down everything in our README file so people understand what decisions we've made and why. If you have a team to discuss things with, do!