When I think about the technical steps of an ML project, I have the following in mind:
- data extraction and transformation (the E and T of ETL)
- model training
- model deployment
You learn to appreciate data extraction and transformation once you discover, outside of tutorial heaven, that data are not born clean and ordered.
Model training is discussed so much that I don’t think it deserves more space in this post.
Model deployment, thanks to MLOps, is getting its fair share of attention. Actually, I believe it’s so fundamental that I made this meme:
As someone on BlueSky noticed, deployed models can be useless too.
In the last few weeks I have been reminded of an activity that deserves its own bullet point between the first and second steps: exploratory data analysis or, more pragmatically, “plot the freaking data”.
It’s not that I ignored it, but sometimes that activity gets a little overlooked between `read_parquet` and `model.fit`.
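In pipeline terms, I mean something like this minimal sketch (the file name and column names are made up, just to make the gap concrete; the point is the plotting step in the middle):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Hypothetical file and column names, purely for illustration.
df = pd.read_parquet("telemetry.parquet")

# The easy-to-skip step: actually look at the data before fitting anything.
print(df.describe())       # quick summary statistics
df["sensor_value"].plot()  # eyeball the raw signal
plt.show()

model = LogisticRegression()
model.fit(df[["sensor_value"]], df["label"])
```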
Let me tell you how I got reminded of its importance.
An interview and a project
I was interviewing a new grad.
After some chatting to learn more about his background and to explain our team’s responsibilities inside the company, I started asking questions like “Can you explain logistic regression to me in simple words?”. Since the answers were good, I decided to throw at him a problem with no single right solution: “What are you going to do if I give you this set of telemetry data?”.
He gave an answer that I really appreciated (I won’t share it, in the unlikely scenario that I ask this same question in an interview with someone reading this). He also said that, as the first step of the analysis, he would plot the data to see what it looks like.
Shortly after, I experienced the importance of plotting in a project, while preparing the dataset for training an ML model following the technical specs of a domain expert. And since we are already talking about stuff that tends to be taken for granted:
It's not enough to throw some data at a model and expect ML magic to happen
Since for once I had detailed explanations, I jumped straight into writing the code to discard and transform the data according to the rules provided.
When I thought the data was ready, I discovered it was all messed up. So I went back to the domain expert and, going through the data together, we figured out that there were a bunch of days in which it was basically all garbage. Those values could not be easily detected with thresholds, since they sat in an acceptable range; the only way to find them was to look at the plots of the data.
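To give a feel for it, here is a self-contained sketch with made-up synthetic telemetry (not the real project data): in-range garbage slips straight past a threshold check but jumps out of a plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic telemetry: one reading per minute, a daily cycle plus noise.
idx = pd.date_range("2024-01-01", periods=7 * 24 * 60, freq="min")
values = 10 + 3 * np.sin(2 * np.pi * (idx.hour * 60 + idx.minute) / (24 * 60))
values = values + rng.normal(0, 0.3, len(idx))
df = pd.DataFrame({"value": values}, index=idx)

# Corrupt two whole days: the values stay inside the normal range,
# but the daily pattern is gone.
bad = df.index.normalize().isin(pd.to_datetime(["2024-01-03", "2024-01-05"]))
df.loc[bad, "value"] = rng.uniform(8, 12, bad.sum())

# A threshold check happily passes: nothing is out of range...
print(df["value"].between(5, 15).all())  # True, on good and bad days alike

# ...but a plot makes the broken days jump out immediately.
df["value"].plot(figsize=(10, 3))
plt.show()
```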
Anscombe’s quartet
Actually, this lesson was formalized more than 50 years ago, and has been presented in millions of stats 101 courses since, by the statistician Francis Anscombe with his paper Graphs in Statistical Analysis and his famous quartet.
Anscombe’s quartet is a set of four datasets of $(x,y)$ points with basically the same summary statistics (means, variances, correlation) and even the same linear regression line fitting the data (and the same $R^2$!). But look what happens when you plot the data:
Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17–21.
Each tells a completely different story, one you would never discover without plotting. Therefore, do not forget to plot the data!
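If you want to see it for yourself, seaborn ships a tidy copy of the quartet, so reproducing both the statistics and the plots takes a handful of lines (a minimal sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# seaborn's built-in copy of the quartet: columns dataset ("I".."IV"), x, y.
df = sns.load_dataset("anscombe")

# Near-identical summary statistics for all four datasets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
print(df.groupby("dataset")[["x", "y"]].corr())

# ...and four completely different pictures once plotted,
# each with the same fitted regression line.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()
```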
By the way, he joined the team! I mean the new grad, not Anscombe.