Cross joins for fun and profit

August 29, 2011 Pivotal Labs

In my experience cross joins aren’t very common, but very powerful and performant when used correctly. In this post I’ll describe one example of where cross joins, also known as cartesian joins, can improve performance of your
app while also improving maintainability.

Imagine that you just landed a contract with the Department of Education. They want you to build an app that
will track which singers can sing on which songs for performances in Glee clubs nationwide. The project manager, Will Schuester, tells you:

  • Sometimes everyone sings on a song
  • Sometimes a song has restrictions (i.e. only boys can sing on “it’s my life”)
  • Sometimes everyone sings except for one or two singers (i.e. everyone can sing on “don’t stop believing” except rachel)
  • This will be a nationwide platform, with millions of Glee club members tracked

You know you’ll track Singers and Songs, so create those and add some data:

Singers

create table singers (id serial, name varchar(20));

insert into singers (name) values ('finn'), ('rachel'), ('mercedes');

select * from singers;
+----+----------+
| id | name     |
+----+----------+
|  1 | finn     |
|  2 | rachel   |
|  3 | mercedes |
+----+----------+

Songs

create table songs (id serial, name varchar(255), has_restrictions boolean);

insert into songs (name, limited) values ('it's my life', true), ('don't stop believing', false), ('loser like me', false);

select * from songs;
+----+----------------------+------------------+
| id | name                 | has_restrictions |
+----+----------------------+------------------+
|  1 | it's my life         |                1 |
|  2 | don't stop believing |                0 |
|  3 | loser like me        |                0 |
+----+----------------------+------------------+

Now you have to figure out how to store which user sings on which song. Here are two possible solutions:

Solution 1: Store each Singer / Song combination in a table

You might start with a structure like this:

create table song_performers( id serial, singer_id int(11), song_id int(11) );
CREATE UNIQUE INDEX index_song_performers ON song_performers (singer_id, song_id);

The idea here is that you insert one row singer per song that they can sing on. This is a simple solution that might work well for your app, but it has a few downsides. For example, when a singer is added to the Glee club, something will have to insert that singer into every song that they can sing on. That table might also grow to be millions of rows long very quickly, and require archiving as time goes on.

Solution 2: Store rules about which singer can sing in which songs

You can get the same result as above by using a table list just the rules about who can sing in which songs, and using a cross join and some conditions to limit the result set.
Here’s what the tables might look like:

create table performer_rules( id serial, singer_id int(11), song_id int(11), rule_type varchar(20) );
create unique index index_performer_rules on performer_rules (singer_id, song_id);
insert into performer_rules (singer_id, song_id, rule_type) values (1, 1, 'included');
insert into performer_rules (singer_id, song_id, rule_type) values (2, 2, 'excluded');

Notice the unique index singer_id and song_id – that’s a crucial step for the queries below to return the correct data without having to use distinct.

To ensure the query is returning the correct results, add some data that might throw the results off. In this case, it should
suffice to exclude Mercedes from “it’s my life” (she should already be excluded), and include her in “loser like me” (she should already be included):

insert into performer_rules (singer_id, song_id, rule_type) values (3, 1, 'excluded');
insert into performer_rules (singer_id, song_id, rule_type) values (3, 2, 'included');

select songs.name as song_name, singers.name as singer_name, has_restrictions, rule_type
from performer_rules
inner join songs on performer_rules.song_id = songs.id
inner join singers on performer_rules.singer_id = singers.id
order by songs.name, singers.name;

+----------------------+-------------+------------------+-----------+
| song_name            | singer_name | has_restrictions | rule_type |
+----------------------+-------------+------------------+-----------+
| don't stop believing | mercedes    |                0 | included  |
| don't stop believing | rachel      |                0 | excluded  |
| it's my life         | finn        |                1 | included  |
| it's my life         | mercedes    |                1 | excluded  |
+----------------------+-------------+------------------+-----------+

Now that table of rules is there, you can get the correct result set with the following query:

select songs.name as song_name, singers.name as singer_name
from singers
cross join songs
left join performer_rules on performer_rules.singer_id = singers.id and performer_rules.song_id = songs.id
where (songs.has_restrictions = false and rule_type is null) or rule_type = 'included'
order by songs.name, singers.name;

+----------------------+-------------+
| song_name            | singer_name |
+----------------------+-------------+
| don't stop believing | finn        |
| don't stop believing | mercedes    |
| it's my life         | finn        |
| loser like me        | finn        |
| loser like me        | mercedes    |
| loser like me        | rachel      |
+----------------------+-------------+

Let’s break that down a bit. First, take a look at the result of the cross join: it contains a row for every singer
for every song.

select songs.name as song_name, singers.name as singer_name
from singers
cross join songs
order by song_name, singer_name;

+----------------------+-------------+
| song_name            | singer_name |
+----------------------+-------------+
| don't stop believing | finn        |
| don't stop believing | mercedes    |
| don't stop believing | rachel      |
| it's my life         | finn        |
| it's my life         | mercedes    |
| it's my life         | rachel      |
| loser like me        | finn        |
| loser like me        | mercedes    |
| loser like me        | rachel      |
+----------------------+-------------+

Next, add the performance_rules, matching on both the song and the singer:

select songs.name as song_name, singers.name as singer_name, has_restrictions, rule_type
from singers
cross join songs
left join performer_rules on performer_rules.singer_id = singers.id and performer_rules.song_id = songs.id
order by song_name, singer_name;

+----------------------+-------------+------------------+-----------+
| song_name            | singer_name | has_restrictions | rule_type |
+----------------------+-------------+------------------+-----------+
| don't stop believing | finn        |                0 | NULL      |
| don't stop believing | mercedes    |                0 | included  |
| don't stop believing | rachel      |                0 | excluded  |
| it's my life         | finn        |                1 | included  |
| it's my life         | mercedes    |                1 | excluded  |
| it's my life         | rachel      |                1 | NULL      |
| loser like me        | finn        |                0 | NULL      |
| loser like me        | mercedes    |                0 | NULL      |
| loser like me        | rachel      |                0 | NULL      |
+----------------------+-------------+------------------+-----------+

Notice how we still have every row of every singer and every song (because of the left join). We also see that we have
6 possible combinations of has_restrictions and rule_type. We can figure out whether they should be included with a simple logical statement:

(songs.has_restrictions = false and rule_type is null) or rule_type = 'included'

Here’s what that might look like as a select:

select songs.name as song_name, singers.name as singer_name, has_restrictions, rule_type, coalesce((songs.has_restrictions = false and rule_type is null) or rule_type = 'included', false) as 'include?'
from singers
cross join songs
left join performer_rules on performer_rules.singer_id = singers.id and performer_rules.song_id = songs.id
order by song_name, singer_name;

+----------------------+-------------+------------------+-----------+----------+
| song_name            | singer_name | has_restrictions | rule_type | include? |
+----------------------+-------------+------------------+-----------+----------+
| don't stop believing | finn        |                0 | NULL      |        1 |
| don't stop believing | mercedes    |                0 | included  |        1 |
| don't stop believing | rachel      |                0 | excluded  |        0 |
| it's my life         | finn        |                1 | included  |        1 |
| it's my life         | mercedes    |                1 | excluded  |        0 |
| it's my life         | rachel      |                1 | NULL      |        0 |
| loser like me        | finn        |                0 | NULL      |        1 |
| loser like me        | mercedes    |                0 | NULL      |        1 |
| loser like me        | rachel      |                0 | NULL      |        1 |
+----------------------+-------------+------------------+-----------+----------+

The final step is to add that logical statement to the where clause:

select songs.name as song_name, singers.name as singer_name
from singers
cross join songs
left join performer_rules on performer_rules.singer_id = singers.id and performer_rules.song_id = songs.id
where (songs.has_restrictions = false and rule_type is null) or rule_type = 'included'
order by song_name, singer_name;

+----------------------+-------------+
| song_name            | singer_name |
+----------------------+-------------+
| don't stop believing | finn        |
| don't stop believing | mercedes    |
| it's my life         | finn        |
| loser like me        | finn        |
| loser like me        | mercedes    |
| loser like me        | rachel      |
+----------------------+-------------+

Summary

This type of database structure works well for access lists and permissions apps, and in my experience these kinds of simple cross joins perform reasonably well as long as there are proper indexes on the tables, and the datasets are not too massive.

About the Author

Biography

Previous
Ruby + Salesforce = Happy Developers
Ruby + Salesforce = Happy Developers

Pivotal is proud to announce the release of the databasedotcom gem. Developed in partnership with Heroku a...

Next
Jasmine Workshop 9/8 & 9/9
Jasmine Workshop 9/8 & 9/9

Cory Flanigan and Justin Searls are hosting a two day Jasmine workshop very soon in Denver on 9/8 & 9/9 at ...