
Every week, I spend ~5 hours reading my favorite newsletters on data science (e.g., Data Science Roundup, Data Elixir, Jack Clark, Daniel Meissner). My most recent project kept me from these articles for a few months, so I took a few days to catch up and synthesize what I had been seeing. That summary is below.

Key takeaways

Most of the interesting articles were saying the same things:

  1. Role definition is still happening, but there is alignment that hybrid models are useful in business
  2. Businesses are continuing to invest in these initiatives, with a growing understanding that they are not “magic”
  3. Platforms and cloud tooling continue to reinforce the imperative for data scientists to own models “end to end” - the DS needs to write clean code and launch to production, which can happen on platforms built by SWEs and DEs

I didn’t see much that was interesting on actual techniques (besides a very interesting post on transfer learning as deep feature extraction), but I have shared some below. Lots of talk on ethics, some talk on RL, and all of the NLP stuff (BERT & ELMo, etc. – great summary of this as an ImageNet moment here).
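
To make the deep-feature-extraction idea concrete, here is a minimal sketch (my own illustration, not code from the post): a pretrained ResNet is frozen and used only as a featurizer, with a simple classifier trained on top. It assumes a recent torchvision; the data here is a random placeholder.

```python
# Transfer learning as deep feature extraction: frozen pretrained backbone,
# lightweight classifier on top. Model choice and data are placeholders.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pretrained backbone with the classification head removed.
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # output is now the 512-dim pooled feature
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: float tensor of shape (N, 3, 224, 224), ImageNet-normalized."""
    return backbone(images).numpy()

# Hypothetical data: swap in your own preprocessed tensors and labels.
train_images = torch.randn(64, 3, 224, 224)
train_labels = torch.randint(0, 2, (64,)).numpy()

clf = LogisticRegression(max_iter=1000)
clf.fit(extract_features(train_images), train_labels)
```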

Role of a Data Engineer

Do I need a data engineer? Here is a second article, distinguishing between a DE and an “Analytics Engineer.” It suggests using Stitch, Fivetran, and dbt as data engineering tools in lieu of Airflow. The natural migration (in startups at least) is to do PoCs and early builds on Airflow and then move to a more resilient tool like those listed. It is a super critical role (and covers the primary responsibilities of our principal data engineers on a project); the DE should be responsible for:

  • Managing and optimizing core data infrastructure,
  • Building and maintaining custom ingestion pipelines,
  • Supporting data team resources with design and performance optimization (think 1 DE for 3 DS) and
  • Building non-SQL transformation pipelines (maybe PySpark ETL, geo enrichment)

I thought the idea of removing Airflow for SQL transformations was an interesting trend; I haven’t ever used the three “pipeline-as-a-service” products. For my own projects, the responsibilities above were good to have in our primary data engineer, with a separate, proper software developer as the code master.
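
For context, the sketch below shows the kind of hand-rolled, scheduled SQL transformation under discussion, assuming Airflow 2.x with the Postgres provider; the DAG, table, and connection names are all hypothetical. Tools like dbt replace DAGs like this with version-controlled, tested SQL models, while Stitch/Fivetran cover the ingestion side.

```python
# A daily SQL rollup as a single Airflow task (hypothetical names throughout).
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="daily_orders_rollup",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PostgresOperator(
        task_id="rollup_orders",
        postgres_conn_id="warehouse",  # hypothetical warehouse connection
        sql="""
            INSERT INTO analytics.daily_orders
            SELECT order_date, COUNT(*) AS n_orders
            FROM raw.orders
            WHERE order_date = '{{ ds }}'   -- Airflow's execution date
            GROUP BY order_date;
        """,
    )
```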

More on role definition. “The Kinds of a Data Scientist.”

  • The VP of Data Science at Instacart split the key types of work into “Decision Science” vs. “Data Products” to identify the skills required. I thought the Decision Science example was pretty interesting:

*At LinkedIn, the executive team used decision science to make a critical business decision about the visibility of member profiles in search results. Historically, only paid users could see full profiles for everyone in their extended (third-degree) network. The visibility rules were complex, and LinkedIn wanted to simplify them — but not in a way that would undermine its revenue. The stakes were enormous.*

*The proposed visibility model was a monthly use limit for unpaid users, with a cut-off based on usage. LinkedIn’s decision scientists simulated the effects of this change, using historical behavior to predict the impact on revenue and engagement. The analysis had to extrapolate past behavior on one model to forecast behavior on a radically different one. Nonetheless, the analysis was sufficient to move forward.*
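
As a toy illustration of that kind of simulation (entirely my own, with invented numbers), you can replay historical usage against a proposed cap to estimate how much activity, and which users, a new limit would touch:

```python
# Toy simulation of a usage cap against historical behavior (my own
# illustration with synthetic numbers, not LinkedIn's actual analysis).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical data: monthly profile views per unpaid user.
monthly_views = rng.lognormal(mean=1.5, sigma=1.0, size=100_000).astype(int)

CAP = 10  # proposed monthly use limit for unpaid users

capped_views = np.minimum(monthly_views, CAP)
share_of_usage_blocked = 1 - capped_views.sum() / monthly_views.sum()
share_of_users_hitting_cap = (monthly_views > CAP).mean()

print(f"usage blocked by the cap: {share_of_usage_blocked:.1%}")
print(f"users hitting the cap:    {share_of_users_hitting_cap:.1%}")
# Users who hit the cap are the candidates a paid tier must convert; a real
# analysis would extrapolate revenue and engagement from segments like these.
```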

Project management.

  • Doing Agile in Data Science. Google. Particular focus on taking the scientific method and breaking it down into “less rigorous hypotheses” that (1) can be time-bound and (2) better inform the long-term hypotheses you want to prove or disprove in an experiment.
  • Divergent and convergent thinking in data analysis. An okay article on a great concept. Taken with the article above: how do we coach projects to (1) generate good hypotheses with the associated code “spikes” and (2) then fit that work into design patterns, etc., that allow us to grow code safely? This can also be very useful for identifying product/project/pilot requirements with business users. I’ve received feedback on projects that “divergent thinking is very uncomfortable,” but the decisions it can lead to are almost always of higher value than what a convergent path would provide.
  • Design for Continuous Experimentation. Etsy. How Etsy learned to build stage gates into their product development process and use A/B tests to enable stage-gate decisions (!!!). I liked how the Principal Engineer framed the way they carried learnings from things that did not go well into future projects.
  • How much should managers code? Wrong question. Where to write code? Coursera. Emphasizes being invested in small bug fixes, code reviews, and JIRA to develop a manager’s empathy for the team’s work and foster better outcomes.
  • The State of Data Product Management Roles. Insight Data Science. Highlights 5 domain areas: Infrastructure, Analytics, Applied ML/AI, Platforms, Standardization & Discovery. Note that the Analytics and Applied ML/AI roles map well to the two different “kinds of data scientists” outlined above, and the other 3 are DevOps-y in nature.

Experimentation.

  • Guidelines for AB Testing at Etsy. Advocates for frequentist methods and for establishing measurement up front. Great anecdote on how more up-front measurement changed their delivery process for new features. Many, many links for follow-on reads.
  • Is Bayesian Testing Immune to Peeking? Not Exactly. Stack Overflow (the company). Less about the Bayesian approach itself; I thought this was a great way of showing how to use simulation to trust the data behind your decisions (and how p-values can change over time). I fell into a small rabbit hole reading about AB testing from these articles; a theme for me was many different teams using simulation to “shore up” questions around experiment design (see the first sketch after this list).
  • Analyzing Experiment Outcomes: Beyond ATEs. Uber. Walks through a metric called the Quantile Treatment Effect (QTE) that allows for understanding heterogeneity in treatment effects (see the second sketch after this list). Cool.
  • Experimentation & Measurement for SEO. Airbnb. Useful case study of when to use difference-in-differences and some of the idiosyncrasies to account for.
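
Here is a minimal sketch of the simulation idea referenced above (my own reconstruction, not Stack Overflow’s code): run many A/A tests, recompute the p-value as each batch of data arrives, and compare the false-positive rate of stopping at the first “significant” look against a single fixed-horizon test.

```python
# Simulating how peeking at p-values inflates false positives in A/A tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_EXPERIMENTS, N_BATCHES, BATCH = 1_000, 20, 100

peeking_hits = fixed_hits = 0
for _ in range(N_EXPERIMENTS):
    # A/A test: both groups come from the same distribution, so any
    # "significant" result is a false positive by construction.
    a = rng.normal(size=N_BATCHES * BATCH)
    b = rng.normal(size=N_BATCHES * BATCH)
    pvals = [stats.ttest_ind(a[: k * BATCH], b[: k * BATCH]).pvalue
             for k in range(1, N_BATCHES + 1)]
    peeking_hits += any(p < 0.05 for p in pvals)  # stop at first "significant" look
    fixed_hits += pvals[-1] < 0.05                # look once, at the end

print(f"false-positive rate with peeking:     {peeking_hits / N_EXPERIMENTS:.1%}")
print(f"false-positive rate at fixed horizon: {fixed_hits / N_EXPERIMENTS:.1%}")
# Expect ~5% at the fixed horizon and a much higher rate under peeking.
```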
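
And a minimal sketch of the QTE idea (my own, on synthetic data): compare treatment and control at each decile rather than only on the mean, which surfaces effects concentrated in one part of the distribution.

```python
# Quantile treatment effects: compare outcome quantiles, not just means.
import numpy as np

rng = np.random.default_rng(7)

control = rng.exponential(scale=10.0, size=50_000)
# Synthetic treatment whose lift scales with usage: heavy users gain the most.
treatment = rng.exponential(scale=10.0, size=50_000) * 1.1

deciles = np.arange(10, 100, 10)
qte = np.percentile(treatment, deciles) - np.percentile(control, deciles)
ate = treatment.mean() - control.mean()

print(f"ATE: {ate:+.2f}")
for d, q in zip(deciles, qte):
    print(f"QTE at p{d}: {q:+.2f}")
# The ATE is a single number; the QTEs reveal the effect is far larger in the
# upper quantiles, i.e., the treatment effect is heterogeneous.
```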

Data / infrastructure engineering.

Techniques, Tools, & Interesting reads.