I recently completed two courses on algorithms at Coursera: Roughgarden’s “Algorithms: Design & Analysis” and Sedgewick’s “Introduction to Algorithms.” Both covered very similar material—a “greatest hits” of classic algorithms (quicksort, graphing—e.g., Djikstra’s shortest paths) and a few of the underlying data types that facilitate efficient runtime (e.g., heaps, search trees, hashing).
Although they were very similar in content, the two were very different in presentation. Sedgewick’s course assignments are java-based, with very strict APIs on memory management, code quality, etc. The video lectures then walked through the code required to implement the algorithms—with visualizations to give an intuitive sense of runtime. Roughgarden’s course is much more theoretical—for example, explaining how to calculate topological ordering or strongly connected components through depth- or breadth-first searches. It also explains the differences between different “O” notations, and asks students to think about work / computation and the associated number of operations required for a given algorithm.
##Was it worth it?
With any type of online course, the hope is that the material will be targeted learning that has direct daily applications. My friends who have CS degrees hear “introductory algorithms course” and either think of it as a rite of passage a long, long time ago or an intellectual exercise. The expectation is that this course, in practice, would have little utility.
I didn’t find this to be the case. Having now studied different NoSQL database types and taken a few certifications on data science, I found the courses to be incredibly useful—perhaps because I can now refer back to times that I “brute forced” a fuzzy join (using double for loops) that would have been ~1000x faster with a hash. A few key lessons:
- When runtime can be managed. Knuth, Sedgewick, Tarjan—taking this course, it feels like the 1960s and 1970s were full of very smart guys who were very good at discrete mathematics. As a result, we are able to put all sorts of use cases into production that wouldn’t be possible without clever workarounds. It’s really useful to have an intuition for the difference between quadratic, logarithmic, linearithmic, and constant time in the code.
- The relationship between data structures and algorithms. The projects I’ve worked with, to date, have been files with the data in a table or an array. I did not have exposure to heaps or search trees. It’s amazing the performance improvements one can create in basic searching by changing the underlying structure of the data. One quote I enjoyed in Sedgewick’s class was from Linus Torvalds, regarding the difference between a bad programmer and a good one: “Bad programmers worry about the code, good programmers worry about data structures, and their relationships.” I have a much better intuition for why this is true, and how to think about product performance at scale.
- A new way of thinking about NoSQL databases. When I started at SVDS, I took some time to study different NoSQL database types. I ended up with a catalogue of different structures and associated use cases (e.g., When do you use Redis? When do you use Cassandra?). However, I did not have an intuition for why read-write performance might be different between these database structures and the challenges / tension that this may have with being able to query the data. After this course, it is much more clear how those use cases emerge and how architects make their infrastructure decisions.
##What does this lead to next?
These courses were an inspiration for me to push further on application development and to think about production-level code. I’m putting together a project to set up a weather sensor at my home in San Francisco and create some visualizations / analytics against a streaming “weather in state capitols” dataset and the WeatherUnderground API.
A few additional topics I’d like to pull the thread on:
- Distributed Systems. How do algorithms and data management change when data is stored in different clusters?
- Model Scoring and Management. Java seems like a default language for production-level code because it is widely used and extensible. However, many models are generated in statistical packages—e.g., in R, scikit-learn in Python, or SAS. How are these models put into production effectively, and how is runtime managed for models that are complex by nature (e.g., the blended models in a Random Forest)?
- All non-database related factors that have an effect on user runtime. For example, what is the effect of networking on runtime? I currently don’t have an intuition for all of the different things that affect the user experience and the different tools a product designer might have to manage an excellent experience—I am excited to learn more, though.