Review: Big-O Notation
Buy or Subscribe
You can access this course in just a minute, and support my efforts to rid the world of crappy online courses!
No video for this one - just a quick review of Big-O.
When I started writing The Imposterâs Handbook, this was the question that was in my head from the start: what the f*** is Big O and why should I care?I remember giving myself a few weeks to jump in and figure it out but, fortunately, I found that it was pretty straightforward after putting a few smaller concepts together.
Big O is conceptual. Many people want to qualify the efficiency of an algorithm based on the number of inputs. A common thought is if I have a list with 1 item it canât be O(n) because thereâs only 1 item so itâs O(1). This is an understandable approach, but Big O is a technical adjective, itâs not a benchmarking system. Itâs simply using math to describe the efficiency of what youâve created.
Big O is worst-case, always. That means that even if you think youâre looking for is the very first thing in the set, Big O doesnât care, a loop-based find is still considered O(n). Thatâs because Big O is just a descriptive way of thinking about the code youâve written, not the inputs expected.
An Amazingly Useful Concept
Iâm glad I took the time to learn Big O because I find myself thinking about it Â often. If youâve always wondered about Big O but found the descriptions a bit too academic, Iâve put together a bit of a Common Personâs Big O Cheat Sheet, along with how you might use Big O in your every day work.
Rather than base this on arrays and simplified nonsense, Iâll share with you a situation that I was in just a year ago: choosing the right data structure in Redis. If youâve never used Redis before, itâs a very basic key-value store that works in-memory and can optionally persist your data to disk.
When you work in a relational database like PostgreSQL, MySQL, or SQL Server you get a single data structure: the table. Yes, there are data types, sure, but your data is stored in a row separated by columns, which is a data structure.
Redis gives you a bit more flexibility. You get to choose the data structure that fits your programming need the best. There are a bunch of them, but the ones I find myself using most often are:
- String. With this structure you store a string value (which could be JSON) with a single key.
- Set. A Set in Redis is a bunch of unordered, unique string values.
- Sorted Set. Just like a Set, but sorted.
- List. Non-unique string values sorted by order of insertion. These things operate like both stacks and queues.
- Hash. A set of string values identified by âsub keysâ. You can think of this as a JSON object with values being only strings.
Why are we talking about Redis when this post is about Big O? Because Redis and Big O go hand in hand. To choose the right data structure for your needs, you need to dig you some Big O (whether you know itâs Big O or not).
FINDING SOMETHING IN A SHOPPING CART
Letâs say youâre tasked with storing Shopping Cart data in Redis. Your team has decided that an in-memory system would work well because itâs fast and it doesnât matter if cart data is lost if the server blows up.
The question is: how do you store this information? Hereâs whatâs required:
- Finding the cart quickly by key
- CRUD operations on each item within the cart
- Finding an item in the cart quickly
- Iterating over each item in the cart
Believe it or not, youâre thinking in Big O right now and you might not even know it. I used the words âquicklyâ and âiterateâ above, which may or may not mean something to you in a technical sense. The thing I was trying to convey by using the word âquicklyâ is that I want to get to the cart (or an item within it) directly, without having to jump through a lot of hoops.
Even that description is really arm-wavy, isnât it? We can dispose of the arm-waving by thinking about things in terms of operations per input. How many operations does my code need to perform to get to a single cart from the set of all carts in Redis?
ONLY ONE OPERATION: O(1)
The cool thing about Redis is that itâs a key-value store. To find something, you just need to know its key. You donât have to run a loop or do some complex find routine â itâs just right there for you.
When something requires only one operation we can say that directly: my code for finding a shopping cart is on the order of 1 operation. If we want to be Big O about it, we can say itâs order 1, or âO(1)â. It doesnât matter how many carts are in our Redis database either! We have a key and we can go right to it.
A more precise way to think about this is to use the term âconstant timeâ. It doesnât matter how many rows of data are in our database (or, correspondingly, how many inputs to our algorithm) â the algorithm will run in constant time which doesnât change.
What about the items in the cart itself?
LOOPING OVER A SET: O(N)
We know that our cart will need to store 0 to n items. Iâm using n here because I donât know how many items that will be â it varies per customer.
I can use any of Redisâs data structures for this:
- I can store a JSON blob in a String, identified by a unique cart key
- I can store items in a Set or Sorted Set, with each item being a bit of JSON that represents a
CartItem
- I can store things in a List in the same way as a set
- I can store things in a Hash, with each item having a unique sub key
When it comes to items in the cart, we need to be able to do CRUD stuff but we also need to be able to find an item in the cart âas quickly as possibleâ. If we use a String (serializing it into JSON first), a Set or a List weâll need to loop over every item in a cart in order to find the one weâre looking for.
Rather than saying âneed to loop over every itemâ, we can think about things in terms of operations again: if I use a Set or a List or a String Iâll need to have one operation for every n items in my cart. We can also say that this is âorder nâ, or just O(n).
You can spot O(n) operations easily by simply looking for loops in your code. This is my rule of thumb: âif thereâs a loop, itâs O(n)â.
LOOPING WITHIN A LOOP: O(N^2)
Letâs say we decided to keep things simple and deal with problems as they arise so we chose a Set, allowing us to dump unique blobs of JSON data that we can loop over if we need to. Unfortunately for us, this caused some issues:
By changing the quantity
in our CartItem
we have made our JSON string unique, causing duplication. We need to remove these duplications now, otherwise our customers wonât be happy.
Simple enough to do: we just loop over the items within a cart, and then loop over the items one more time (skipping the current loop index) to see if thereâs a match on sku
. This is a classic brute force algorithm for deduping an array. Thatâs a lot of words to describe this nested loop algorithm and we can do better if we use Big O.
Thinking in terms of operations, we have n operations per n items in our cart. Thatâs n * n
operations, which we can shorthand to âorder n squaredâ or O(n^2). Put another way: deduping an array is an O(n^2) operation, which isnât terribly efficient.
As I said before, I like to think of these things in terms of loops. My rule of thumb here is that if I have to use a loop within a loop, thatâs O(n^2). Another rule of thumb is that the term âbrute forceâ almost always denotes an O(n^2) algorithm that uses some kind of nested loop.
INDEXING A DATABASE TABLE AND O(LOG N).
If youâve ever worked on a larger project with a DBA, youâve probably been barked at for querying a table without utilizing an index (a âfuzzyâ search, for instance). Have you ever wondered what the deal is? I have. I was that DBA doing the barking!
Hereâs the thing: tables tend to grow over time. Letâs say that our commerce site is selling independent digital films and our catalog is constantly growing. We might have a table called film
filled with ridiculous test data that we want to query based on title
. Unfortunately, we donât have an index just yet and our query is beginning to slow down. We decide to ask PostgreSQL whatâs going on using EXPLAIN
and ANALYZE
:
Our database is doing whatâs called a âSequential Scanâ. In SQL Server land this is called a âFull Table Scanâ and it basically means that Postgres has to loop over every row, comparing the title
to our query argument.
In other words: a Sequential Scan is a loop over every item which means itâs O(n), where n represents the number of rows in our table. As our table grows, the efficiency of this algorithm goes down linearly.
Itâs easy to improve the performance here by adding an index:
Now weâre using an Index Scan, which is, I suppose, much faster. But how much? And how does it work?
Under the covers, most databases use a version of an algorithm called binary search â I made a video about this and other algorithms which you can watch right here if you want. For binary search to work properly, you have to sort the list of things youâre working with. Thatâs exactly what Postgres does when you first create the index:
Now that the index is sorted, Postgres can find the title
weâre looking for by systematically splitting this list in half until thereâs only one row left, which will be the one we want.
This is much better than looping over every row (which we know is O(n)), but how many operations do we have here? For this we can use logarithms:
Weâre continually splitting things in half in a sorted set until we arrive at the thing we want. We can describe this with an inverted binary tree, as you see above. We start with 8 values, split, and are left with 4, which we split again to get 2, then finally 1.
This is an inverse squaring operation as weâre going from 2^3 (8) down to 2^2 (4) down to 2^1 (2) and finally 2^0 (1). Inverse squaring operations are called logarithms. That means that we can now describe the operations of our database index as âbeing logarithmicâ. We should also specify âlogarithmic of whatâ to which we can answer âwe donât know, so weâll say itâs nâ, also known as O(log n).
This kind of algorithm is called âdivide and conquerâ and when you see those words, you know immediately that youâre talking about a log n algorithm.
âŚ AND SO WHAT?
Hereâs why you care about turning something thatâs O(n) into O(log n) and the best part is that itâs not really arguable because itâs math (I was told that means youâre always right :trollface:).
Letâs say we have 1000 records in our film
table. To find âAcademy Dinosaurâ our database will need to do 1000 operations (comparing the title
in each row). But how many will it do if we use an index? Letâs use a calculator and find out, shall we? I need to find the log base 2 (because of the binary split) of 1000:
Only Ten! Ten splits of 1000 records to find what we want in our database. Thatâs a performance gain of a few orders of magnitude, and itâs a lot more convincing to tell someone that as opposed to âitâs a lot fasterâ.
The best part here is that we can keep using this calculator to find out how many operations will be needed if we have a million records (itâs 20) or a billion (itâs 30). That kind of scaling as our inputs goes up is the stuff of DBA dreams.
BONUS QUESTION: WHATâS THE BIG O OF A PRIMARY KEY LOOKUP?
Itâs tempting to think that if I have a primary key and I know the value of that key that I should be able to simply go right to it. Is that what happens? Think about it for a second and while youâre thinking letâs talk about Redis a bit more.
A major selling point of Redis (or any key-value system really) is that you can do a lot of stuff with O(1) time complexity. Thatâs what weâre measuring when we talk about Big O â time complexity, or long something takes given the inputs to an algorithm youâre working with. Thereâs also space complexity which has to do with the resources your algorithm needs, but Iâll save that for another post.
Redis is a key-value store, which means that if you have a key, you have an O(1) operation. For our Shopping Cart above, if I use a Hash Iâll have a key for the cart as well as a key for each item in the cart, which is huge in terms of performance â or I should say âoptimal time complexityâ. We can access any item in a cart without a single loop, which makes things fast. Super fast.
OK, back to the question regarding primary key queries: are they O(1)? Nope:
Indexes in most database systems tend to use a variation of binary search, and primary key indexes are no different. That said, there are plenty of optimizations that databases use under the covers to make these queries extremely fast.
I should also note that some databases, like Postgres, offer you different types of indexes. For instance you can use a Hash Index with Postgres that will give you O(1) access to a given record. There is a lot going on behind the scenes, however, to the point where thereâs a pretty good debate about whether theyâre actually faster. Iâll side step this discussion and you can go read more for yourself.
THERE YOU HAVE IT
I find myself thinking about things in terms of Big O a lot. The cart example, above, happened to me just over a month ago and I needed to make sure that I was flexing the power of Redis as much as possible.
I donât want to turn this into a Redis commercial, but I will say that it (and systems like it) have a lot to offer when you start thinking about things in terms of time complexity, which you should! Itâs not premature optimization to think about Big O upfront, itâs programming and I donât mean to sound snotty about that! If you can clip an O(n) operation down to O(log n) then you should, donât you think?
So, quick review:
- Plucking an item from a list using an index or a key: O(1)
- Looping over a set of n items: O(n)
- A nested loop over n items: O(n^2)
- A divide and conquer algorithm: O(log n)