Developing a machine learning application is a sizable undertaking. You’re dealing with large amounts of data and you have no idea what the solution is going to look like.
Maybe you’ve done some early data mining beforehand and that’s what triggered the project, or maybe you haven’t. Either way, if the goal of your project is “build a machine learning application that predicts X”, then you have yourself a daunting brief.
I’m sharing how we approached the Nudgr proof of concept and I hope you find it useful. (Note that in reality we worked on and improved this framework as we went, but I’ve left that out to keep the message clear. Also note that we’re a start-up that hadn’t worked on machine learning until we embarked on Nudgr.)
Before I dive in, here is some context on the project: We were building a proof of concept application that predicted if a user completing an online form or checkout was going to stop before finishing. It was a supervised classification problem. We had plenty of data with each sample classified as converted or not. We needed a model that could predict an abandon in real time.
These are the 4 key pillars of our approach.
1: Performance Score
We built a bespoke scoring function that we passed our model and test-data through. The key aspect is that it simulated the application. We defined metrics based on the different outcomes of this simulation. These metrics became the way we determined the performance of the application – our performance score.
For you this may be as simple as passing the test-data to the predict method and comparing the predictions to truth. For us, the simulation involved rebuilding form-fills event by event (click by click, keypress by keypress) to see how the application would treat the user as they filled in the form.
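The simulation-based scoring described above can be sketched in Python. Everything here is a hypothetical reconstruction, not Nudgr’s actual code: the function names (`simulate`, `performance_score`), the event/form-fill structure, and the choice of precision as the single score are all my own assumptions.

```python
def simulate(model, form_fill):
    """Replay one form-fill event by event and record the outcome.

    `form_fill` is assumed to be a dict with an ordered list of
    events and a ground-truth `converted` flag (an assumption).
    """
    for i in range(len(form_fill["events"])):
        prefix = form_fill["events"][: i + 1]
        if model.predict(prefix):  # model flags a likely abandon mid-fill
            return {"flagged": True, "converted": form_fill["converted"]}
    return {"flagged": False, "converted": form_fill["converted"]}


def performance_score(model, test_data):
    """Aggregate simulated outcomes into one number.

    Here the score is simply precision on abandon predictions;
    the real bespoke metric would have been richer.
    """
    outcomes = [simulate(model, f) for f in test_data]
    flagged = [o for o in outcomes if o["flagged"]]
    if not flagged:
        return 0.0
    true_abandons = sum(1 for o in flagged if not o["converted"])
    return true_abandons / len(flagged)
```

The key design point the post makes survives even in this toy version: the score is computed from how the application would have behaved during the fill, not from a one-shot prediction on the finished sample.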
Note that it was completely reasonable for us to iterate on the simulator – after all, how a prediction is utilised is part of your application. But we aimed to minimise changes to the measurement aspect (mostly by trying to get it very right very early on). The less you need to change your measurement function the better, because every change makes it harder to track improvement over time.
2: Key Datasets
We had a huge amount of data – we could sample it a thousand times without overlap – but if you’re trying to measure incremental increases in performance you need to be consistent in what you measure and how. So we created 9 key datasets.
As with the measurement function – the less you need to change your datasets the better. However, we had to replace 3 of our datasets due to anomalies that made them unrepresentative, and we made several smaller changes as we improved the initial filtering.
We ensured that we covered the important spectra within the data. For us this was mainly dataset size, industry and a couple of other features learned from Formisimo. We made sure to spread our 9 datasets out over these spectra.
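One simple way to spread a fixed number of evaluation datasets over several spectra is to allow at most one pick per combination of the features you care about. This sketch is illustrative only: the field names (`size_band`, `industry`) and the one-per-cell selection rule are my assumptions, not how the 9 Nudgr datasets were actually chosen.

```python
import random


def pick_key_datasets(candidates, n=9, seed=0):
    """Pick up to n datasets, at most one per (size_band, industry)
    cell, so the picks are spread over both spectra."""
    pool = list(candidates)
    random.Random(seed).shuffle(pool)
    chosen, seen = [], set()
    for ds in pool:
        cell = (ds["size_band"], ds["industry"])
        if cell not in seen:  # skip duplicates of an already-covered cell
            seen.add(cell)
            chosen.append(ds)
        if len(chosen) == n:
            break
    return chosen
```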
3: Performance Log
We maintained a performance log that recorded the version of the software against 16 metrics of the application applied to each key dataset. Each full measurement therefore produced 144 metrics (16 × 9) – but the performance score was king.
Why so many? Because they were useful: they outlined features of the application’s behaviour, which informed us as we developed.
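A performance log like the one above can be as plain as an append-only CSV with one row per (version, dataset) measurement. This is a minimal sketch under assumptions of mine – the function name, field names and CSV format are illustrative, not the post’s actual tooling.

```python
import csv
import datetime


def log_measurement(path, version, dataset_name, metrics):
    """Append one (version, dataset) row to the performance log.

    `metrics` is a dict of the per-dataset metrics, with the
    headline performance score among them.
    """
    row = {
        "timestamp": datetime.datetime.now().isoformat(),
        "version": version,
        "dataset": dataset_name,
        **metrics,
    }
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if f.tell() == 0:  # new file: write the header first
            writer.writeheader()
        writer.writerow(row)
```

With 16 metrics and 9 key datasets, a full measurement is 9 calls producing the 144 values the post mentions.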
We incremented our software version using a semantic-versioning-style schema, which we found communicated very well with the rest of the company. A challenge in any project is how to share progress with your colleagues and stakeholders, and we found this especially true of a machine learning project. Here’s how we approached incrementing the version number:
- major: paradigm shift in the functioning of the application/model (only 2 of these occurred)
- minor: performance score improvement (many)
- patch: all the rest (countless)
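The bump rule in the list above is mechanical enough to sketch as code. The function name, the tuple representation and the change labels are mine; only the three-way rule itself comes from the post.

```python
def bump(version, change):
    """Increment a (major, minor, patch) version tuple.

    `change` is one of 'paradigm_shift', 'score_improvement'
    or 'other' (labels are assumptions, not the team's terms).
    """
    major, minor, patch = version
    if change == "paradigm_shift":      # rare: whole-model rethink
        return (major + 1, 0, 0)
    if change == "score_improvement":   # the performance score went up
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)    # everything else
```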
4: Sprint Framework
Our sprint framework was a 2 week cycle. In our first sprint we built the simulator and the datasets. Then for the remaining sprints we followed this structure:
- At the start: pick the best ideas for improving the performance score and task them up
- At the end: share with the company the version and score we’ve got to.
We posted the version and score on a wall in the office after every sprint so everyone could see and refer to it (with a very brief summary). Note that this was the only time we shared a performance score with the wider company. Not sharing the score early wasn’t about reducing transparency; as you work through a project like this you’ll have countless false measurements, and sharing those early results doesn’t help the wider team.
Remember that discovery is part of the work, so discovering that an idea doesn’t work is still progress. For example we regularly built visualisations of our raw data to give us insight. Not all of these led to performance improvements, but all were considered valuable work.
Finally we kept a journal of experimental conclusions as we knew we were going to recycle ideas.
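A journal of experimental conclusions can be as lightweight as a structured record per experiment. The fields below are my guess at what makes entries findable when you come back to recycle an idea; the post doesn’t describe the journal’s actual shape.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentEntry:
    idea: str        # what was tried
    version: str     # software version it was tried against
    conclusion: str  # what was learned, even if "no improvement"
    tags: list = field(default_factory=list)  # for later retrieval
```

Recording the negative results matters most here – they are exactly the entries that stop an idea being re-run from scratch six months later.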
I found that the framework above brought two significant benefits, or rather, it fixed two big problems we were having.
- Despite the level of uncertainty – we successfully communicated progress to the rest of the company, and they in turn felt well informed.
- Despite the level of uncertainty – we remained laser-focussed on the goal and never got lost in the complexity, detail or sheer scale that comes with ML, big data and infinite possibilities.
Hear more about our general development stack in the video that I made below: