Concept: Joins

2 hours of SQL video using NASA's Cassini data. See why developers become data pros.
The word bias might sound a bit strong when describing the use of joins. It's important to note, however, that bias can take two forms: implicit and explicit. I dislike brussel sprouts and will never eat them (unless they're wrapped in bacon and then they're tolerable). This is an explicit bias and I know it!
Implicit bias can take many forms that we're simply not aware of, including a lack of knowledge or experience. It also takes the form of unseen preferences which, in the programming world can come down to the languages you use, the editors you write in and even the music you listen to while writing that code in that editor!
We're going to see how implicit bias can change the story our data tells. It will seem subtle - but that's exactly the point...
Left, Right, Full Outer Joins
In this example we work with a many to many join. One person can have many jobs, and one job can have many people. Notice the composite primary key on assignments. This is a common primary key for joining tables. The primary key is defined by two columns instead of one, the combination of each must be unique:
<span class="hljs-keyword">create table</span> people(
id serial <span class="hljs-keyword">primary key</span>,
name text <span class="hljs-keyword">not null</span>
);
<span class="hljs-keyword">create table</span> jobs(
id serial <span class="hljs-keyword">primary key</span>,
name text <span class="hljs-keyword">not null</span>
);
<span class="hljs-keyword">create table</span> assignments(
people_id <span class="hljs-type">int</span> <span class="hljs-keyword">not null</span> <span class="hljs-keyword">references</span> people(id),
job_id <span class="hljs-type">int</span> <span class="hljs-keyword">not null</span> <span class="hljs-keyword">references</span> jobs(id),
<span class="hljs-keyword">primary key</span>(people_id, job_id) <span class="hljs-comment">-- composite primary key</span>
);
<span class="hljs-keyword">insert into</span> people(name)
<span class="hljs-keyword">values</span>(<span class="hljs-string">'Darth Vader'</span>),(<span class="hljs-string">'Apple Dumpling'</span>),(<span class="hljs-string">'Duke Leto'</span>),(<span class="hljs-string">'Totoro'</span>);
<span class="hljs-keyword">insert into</span> jobs(name)
<span class="hljs-keyword">values</span>(<span class="hljs-string">'Control the Galaxy'</span>),(<span class="hljs-string">'Bake Apple Pies'</span>),
(<span class="hljs-string">'Rule Arrakis'</span>),(<span class="hljs-string">'Play shakuhatchi in tree'</span>);
<span class="hljs-keyword">insert into</span> assignments(people_id, job_id)
<span class="hljs-keyword">values</span>
(<span class="hljs-number">1</span>,<span class="hljs-number">1</span>),(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>),
(<span class="hljs-number">2</span>,<span class="hljs-number">2</span>),(<span class="hljs-number">2</span>,<span class="hljs-number">4</span>),
(<span class="hljs-number">4</span>,<span class="hljs-number">4</span>),(<span class="hljs-number">4</span>,<span class="hljs-number">1</span>);
Querying this data involves joining the tables together. We can do this exclusively, meaning only data that fits our join condition (people_id = people.id for instance) will be included:
<span class="hljs-keyword">select</span> <span class="hljs-operator">*</span> <span class="hljs-keyword">from</span> people <span class="hljs-comment">-- from left...</span>
<span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> assignments <span class="hljs-keyword">on</span> people_id <span class="hljs-operator">=</span> people.id
<span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> jobs <span class="hljs-keyword">on</span> jobs.id <span class="hljs-operator">=</span> assignments.job_id; <span class="hljs-comment">-- to right</span>
We can loosen things up a bit using a left or left outer join (same thing). Show me all data to the left of the join. In this query, we'll see all people but null for jobs without people:
<span class="hljs-keyword">select</span> <span class="hljs-operator">*</span> <span class="hljs-keyword">from</span> people <span class="hljs-comment">--all data coming from here</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> assignments <span class="hljs-keyword">on</span> people_id <span class="hljs-operator">=</span> people.id <span class="hljs-comment">-- and here...</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> jobs <span class="hljs-keyword">on</span> jobs.id <span class="hljs-operator">=</span> assignments.job_id; <span class="hljs-comment">-- only matching from here</span>
This can work in the other direction, to the right of the join as well. Here we'll see all the jobs but null for people without jobs:
<span class="hljs-keyword">select</span> <span class="hljs-operator">*</span> <span class="hljs-keyword">from</span> people <span class="hljs-comment">-- only matching from here</span>
<span class="hljs-keyword">right</span> <span class="hljs-keyword">outer</span> assignments <span class="hljs-keyword">on</span> people_id <span class="hljs-operator">=</span> people.id <span class="hljs-comment">-- and here...</span>
<span class="hljs-keyword">right</span> <span class="hljs-keyword">outer</span> jobs <span class="hljs-keyword">on</span> jobs.id <span class="hljs-operator">=</span> assignments.job_id; <span class="hljs-comment">--all data coming from here</span>
If we want to see absolutely everything, we use a full outer join:
<span class="hljs-keyword">select</span> <span class="hljs-operator">*</span> <span class="hljs-keyword">from</span> people <span class="hljs-comment">--all data coming from here</span>
<span class="hljs-keyword">full</span> <span class="hljs-keyword">outer</span> assignments <span class="hljs-keyword">on</span> people_id <span class="hljs-operator">=</span> people.id <span class="hljs-comment">-- and here...</span>
<span class="hljs-keyword">full</span> <span class="hljs-keyword">outer</span> jobs <span class="hljs-keyword">on</span> jobs.id <span class="hljs-operator">=</span> assignments.job_id; <span class="hljs-comment">--all data coming from here</span>
This shows us everything, including jobs without people and people without jobs.
We can also do a many to many join with two tables, one of them referencing itself. This looks weird, but it implies a completely different type of relationship (bidirectional):
<span class="hljs-comment">-- drop our existing tables</span>
<span class="hljs-keyword">drop</span> <span class="hljs-keyword">table</span> if <span class="hljs-keyword">exists</span> people cascade;
<span class="hljs-keyword">drop</span> <span class="hljs-keyword">table</span> if <span class="hljs-keyword">exists</span> jobs cascade;
<span class="hljs-keyword">drop</span> <span class="hljs-keyword">table</span> if <span class="hljs-keyword">exists</span> assignments cascade;
<span class="hljs-keyword">create table</span> people(
id serial <span class="hljs-keyword">primary key</span>,
name text <span class="hljs-keyword">not null</span>
);
<span class="hljs-keyword">create table</span> friends(
people_id <span class="hljs-type">int</span> <span class="hljs-keyword">not null</span>,
friend_id <span class="hljs-type">int</span> <span class="hljs-keyword">not null</span> <span class="hljs-keyword">references</span> people(id),
<span class="hljs-keyword">primary key</span> (people_id, friend_id) <span class="hljs-comment">-- a composite primary key</span>
);
<span class="hljs-keyword">insert into</span> people(name)
<span class="hljs-keyword">values</span>
(<span class="hljs-string">'Darth Vader'</span>),
(<span class="hljs-string">'Apple Dumpling'</span>),
(<span class="hljs-string">'Duke Leto'</span>),
(<span class="hljs-string">'Totoro'</span>);
<span class="hljs-keyword">insert into</span> friends(people_id, friend_id)
<span class="hljs-keyword">values</span>
(<span class="hljs-number">1</span>,<span class="hljs-number">3</span>),(<span class="hljs-number">1</span>,<span class="hljs-number">4</span>), <span class="hljs-comment">-- Vader/Leto, Vader/Totoro</span>
(<span class="hljs-number">2</span>,<span class="hljs-number">4</span>),(<span class="hljs-number">3</span>,<span class="hljs-number">4</span>); <span class="hljs-comment">-- Apple Dumpling/Totoro, Duke Leto/Totoro</span>
Querying this gets interesting - what is it we hope to understand? The obvious answer is "who's friends with whom", but what will the result set look like? Also, do we care about people without friends?
If we're only interested in showing friends, we can query it like this:
<span class="hljs-keyword">select</span> people.name, friendos.name <span class="hljs-keyword">from</span> people
<span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> friends <span class="hljs-keyword">on</span> people.id <span class="hljs-operator">=</span> friends.people_id
<span class="hljs-keyword">inner</span> <span class="hljs-keyword">join</span> people <span class="hljs-keyword">as</span> friendos <span class="hljs-keyword">on</span> friends.friend_id <span class="hljs-operator">=</span> friendos.id; <span class="hljs-comment">-- alias the people table to avoid naming collisions</span>
This exclusive result set is what we're looking for as showing a more detailed result set (using left, right or full) would cause duplication. In other words, we'd see Vader/Leto and Leto/Vader. Friendship is kind of a mutual thing and all we need to see is a single result to define the idea.
The Bias Discussion
In that last paragraph I just defined what friendship means and created a table design and query based on my thoughts and beliefs about friendship. Do you trust me to accurately do this? Do you trust yourself? What kinds of questions might arise from this data set in the future?
The answer is awareness. It's yet another reason we write down everything in our README file so people understand what decisions we've made and why. If you have a team to discuss things with, do!
2 hours of SQL video using NASA's Cassini data. See why developers become data pros.