Discovering MLOps and its application to autonomous vehicles


My name is David Aliaga, an engineer from the Software Development Team at Sensetime Japan.

Early last month, I attended a GTC session describing the application of MLOps to the development and deployment of AV systems[^1]. This was the first time I had encountered the term MLOps, and it caught my attention, so I decided to look into it. This article is a summary of what I discovered about the field.

So, what is MLOps anyway?

Wikipedia (and the source its definition comes from) defines MLOps as

a set of practices that combines ML, DevOps, and Data Engineering, and that aims to deploy and maintain machine learning models in production reliably and efficiently

I was eager to learn more, so I opened the book Introducing MLOps: How to Scale ML in the Enterprise[^2]. In it, MLOps is defined as

The standardization and streamlining of machine learning life cycle management.

The book goes further and explains the necessity of MLOps in an enterprise setting: many people with completely different skill sets work with different tools, which is a hurdle to effective communication between the Data Scientists building models and the Engineers putting those models into production. This all made me think about how useful it would be to apply these practices in my everyday work.

The objective of MLOps is, after all, to increase automation and improve the quality of production models, and that hits close to home for me as an engineer on the Software Development Team.

What does MLOps involve then?

There seem to be several things we have to address in order to start practicing MLOps[^3]:

  • Hybrid Teams
  • ML Pipelines
  • Model and Data Versioning
  • Model Validation
  • Data Validation
  • Monitoring

Hybrid Teams

Since putting ML models into production requires a varied set of skills, it is rare for a single person to possess all of them; therefore, a hybrid team is valuable. Exactly which roles this involves varies depending on who defines it, but according to the book Machine Learning Design Patterns[^4] we have:

  • Data Scientists: who collect, interpret, analyze, and process datasets.
  • Data Engineers: who implement the pipelines around data.
  • ML Engineers: similar to Data Engineers, but for ML models; they manage the infrastructure around training and deploying models.

Other roles are:

  • Research Scientists: who are focused on research.
  • Developers and SW Engineers: who use the model-serving infrastructure built by the ML Engineers to build applications and user interfaces that show predictions.
  • DevOps: sometimes included separately[^2], they build and run operational systems and tests, and manage the CI/CD pipeline.

ML Pipelines

Usually, two pipelines are required: one for training and one for serving (or inference), because their characteristics are not the same. Training usually involves batch processing, while inference can be batch or real time. Both pipelines should be version-tracked.
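
To make the separation concrete, here is a minimal sketch I put together (my own illustration, not from the cited sources), using scikit-learn and joblib; the function names and artifact path are assumptions for the example:

```python
# Minimal sketch: a batch training pipeline separated from a serving path.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def train_pipeline(X_train, y_train, model_path="model-v1.joblib"):
    """Batch training: fit preprocessing + model, then persist the artifact."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ])
    pipeline.fit(X_train, y_train)
    joblib.dump(pipeline, model_path)  # versioned artifact on disk
    return model_path

def serve(X, model_path="model-v1.joblib"):
    """Serving: load the versioned artifact and predict (batch or online)."""
    pipeline = joblib.load(model_path)
    return pipeline.predict(X)
```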

Model and Data Versioning

Unlike traditional SW systems, ML systems require versioning not only for code but also for data. The model lifecycle and the code lifecycle should be independent of each other (but linked, so that we can identify which model was used with which code).
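
As a toy illustration of that link, the sketch below ties a model artifact to the code commit and to a content hash of the training data; the metadata layout and function names are my own assumptions, and it presumes training runs inside a git checkout:

```python
# Illustrative sketch: record which code commit and which data produced a model.
import hashlib
import json
import subprocess
from pathlib import Path

def data_fingerprint(path: str) -> str:
    """Content hash of a dataset file; changes whenever the data changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def current_commit() -> str:
    """Git commit of the training code at training time."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def write_model_card(model_path: str, data_path: str) -> None:
    """Store versioning metadata next to the model artifact."""
    meta = {
        "model": model_path,
        "code_commit": current_commit(),
        "data_hash": data_fingerprint(data_path),
    }
    Path(model_path + ".meta.json").write_text(json.dumps(meta, indent=2))
```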

Model Validation

In traditional SW systems, we have unit and integration tests, which are usually binary: either pass or fail. For ML systems, a more statistical kind of test is usually needed to decide when a model is good enough for deployment, which in turn requires choosing adequate metrics.
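
For instance, a validation gate could look like the following sketch, where a candidate model must beat the production model by some margin on a held-out set; the metric (macro F1) and the margin are arbitrary choices for the example:

```python
# Illustrative sketch of a statistical validation gate before deployment.
from sklearn.metrics import f1_score

def validate_candidate(candidate, production, X_val, y_val, margin=0.01):
    """Return True if the candidate model is good enough to deploy."""
    cand_f1 = f1_score(y_val, candidate.predict(X_val), average="macro")
    prod_f1 = f1_score(y_val, production.predict(X_val), average="macro")
    print(f"candidate={cand_f1:.4f} production={prod_f1:.4f}")
    return cand_f1 >= prod_f1 + margin
```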

Data Validation

It is recommended to validate the input data: checking for null or empty values, format, size, etc. The challenge for us is how this can be accomplished for AV systems.
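
As a starting point, a sketch like the one below could validate incoming camera frames before they reach the model; the NHWC layout and the expected resolution are assumptions for illustration:

```python
# Illustrative sketch: sanity-check a batch of camera frames (assumed NHWC).
import numpy as np

def validate_batch(batch: np.ndarray, expected_hw=(720, 1280)) -> None:
    """Raise on malformed input instead of silently feeding it to the model."""
    if batch.size == 0:
        raise ValueError("empty batch")
    if batch.ndim != 4 or batch.shape[1:3] != expected_hw:
        raise ValueError(f"unexpected shape {batch.shape}")
    if np.issubdtype(batch.dtype, np.floating) and np.isnan(batch).any():
        raise ValueError("batch contains NaN values")
    if batch.max() == batch.min():
        raise ValueError("constant frames: possible dead or occluded sensor")
```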

Monitoring

Finally, to evaluate how well our system is performing and whether our models are losing their effectiveness on new data, we have to monitor their results.
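
One simple way to do this is to compare the model's recent outputs against a baseline captured at deployment time, as in the sketch below; the window size, tolerance, and the idea of tracking mean prediction confidence are all illustrative choices of mine:

```python
# Illustrative sketch: flag drift in the model's output distribution.
from collections import deque
import numpy as np

class ConfidenceMonitor:
    def __init__(self, baseline_mean: float, window: int = 1000, tol: float = 0.1):
        self.baseline = baseline_mean          # mean confidence at deployment
        self.scores = deque(maxlen=window)     # sliding window of recent scores
        self.tol = tol

    def record(self, confidence: float) -> bool:
        """Record one prediction confidence; return True if drift is suspected."""
        self.scores.append(confidence)
        if len(self.scores) < self.scores.maxlen:
            return False                       # not enough data yet
        return abs(np.mean(self.scores) - self.baseline) > self.tol
```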

Five key components of MLOps

There are five key components of MLOps:

  • Model Development
  • Deployment
  • Monitoring
  • Iteration
  • Governance

Model Development involves searching for suitable input data (which is not a simple task), exploratory data analysis (either visual or statistical), feature engineering, and training and evaluation.
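
As a compressed, purely illustrative pass over these steps (the CSV file, columns, and label are all hypothetical), it could look like this:

```python
# Illustrative sketch: EDA, a simple engineered feature, then train and evaluate.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sensor_log.csv")                 # hypothetical input data
print(df.describe())                               # statistical EDA
df["speed_delta"] = df["speed"].diff().fillna(0)   # feature engineering

X = df[["speed", "speed_delta"]]
y = df["braking_event"]                            # hypothetical label column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier().fit(X_tr, y_tr)   # training
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))  # evaluation
```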

Model Deployment involves deciding the type and content of the deployment; one approach here is to export the model to a portable format such as ONNX. Containers are also a solution to the problem of matching the exact requirements of each model.
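
Here is a minimal sketch of the ONNX route, exporting a stand-in PyTorch model and running it with ONNX Runtime; the tiny model and the tensor names are placeholders for the example:

```python
# Illustrative sketch: export a model to ONNX and run it with ONNX Runtime.
import torch
import onnxruntime as ort

model = torch.nn.Sequential(torch.nn.Linear(4, 2))  # stand-in for a real model
model.eval()
dummy = torch.randn(1, 4)

torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# The exported artifact runs anywhere ONNX Runtime does, decoupled from PyTorch.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)
```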

Monitoring includes addressing the concerns of DevOps (is the model running quickly enough, with reasonable resources?), Data Scientists (is the model degrading?), and the Business (is the model beneficial to the enterprise?).
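
For the DevOps concern in particular, even a sketch as simple as the following can track per-request latency against a budget; the budget value is an invented example, not a real requirement:

```python
# Illustrative sketch: watch inference latency against a budget.
import time

LATENCY_BUDGET_MS = 50.0               # example budget, not a real requirement

def timed_predict(model, x):
    """Run one prediction and warn if it exceeds the latency budget."""
    start = time.perf_counter()
    y = model.predict(x)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > LATENCY_BUDGET_MS:
        print(f"WARN: inference took {elapsed_ms:.1f} ms (> {LATENCY_BUDGET_MS} ms)")
    return y
```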

Iteration involves the capacity to re-train the model with new data. Sometimes this means a great number of devices running the models, each one sending feedback to a central point (as in Tesla's Autopilot system).
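
A toy version of that feedback loop might look like the sketch below, where retraining is triggered once enough new samples have accumulated; the directory layout, file format, and threshold are my own assumptions:

```python
# Illustrative sketch: trigger retraining once enough field feedback arrives.
from pathlib import Path

FEEDBACK_DIR = Path("feedback/")       # hypothetical drop point for field data
RETRAIN_THRESHOLD = 10_000             # arbitrary example value

def maybe_retrain(train_fn) -> bool:
    """Kick off retraining once enough new feedback samples have arrived."""
    new_samples = list(FEEDBACK_DIR.glob("*.npz"))
    if len(new_samples) < RETRAIN_THRESHOLD:
        return False
    train_fn(new_samples)              # re-enter the training pipeline
    return True
```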

As for Governance, MLOps would also be beneficial when government regulations change and impose new constraints on our systems.

MLOps and its significance to us

One of the main reasons why many ML systems that work well in the lab don't get into production is that ML systems involve not only code but also data. Data affects the behavior of the models in production.

Now, people in the MLOps community point out that data comes from the real world. This is true for all kinds of problems ML systems are applied to, such as medical image analysis and spam detection, but even more so in the field that we at Sensetime Japan work in.
The data provided by cameras installed in vehicles is bound to be incredibly varied and ever-changing, and we cannot control how it changes. This is one more incentive for thinking about MLOps in our field.

Where to go from here?

I have only just reached the shores of the ocean that is MLOps. The next step in my journey will be to implement simple examples of MLOps in action and see how they can benefit my everyday work. I am interested in learning how to use Kubeflow, for example.

Going back to where my interest started, I am also reading about Maglev, which is NVIDIA's internal MLOps platform. I would like to understand its functionality so that we can draw lessons to apply to our work here at Sensetime Japan as well.

NVIDIA reports that Maglev supports all of the data processing necessary to train and validate industry-grade AI systems: testing, data management and labeling, data selection, traceability, and end-to-end workflow automation.

This last feature should enable engineers to seamlessly deploy models to the car and run them periodically. I think all of us who work on autonomous and semi-autonomous vehicle systems share this as a common goal.

[^1]: Scaling ML Ops for AV Development [SE31473], GTC 2021.
[^2]: Introducing MLOps: How to Scale ML in the Enterprise, by Mark Treveil.
[^3]: ML Ops: Machine Learning as an Engineering Discipline, by Cristiano Breuel.
[^4]: Machine Learning Design Patterns, by Valliappa Lakshmanan, Sara Robinson & Michael Munn.

Author profile

aliaga
Engineer on the Software Development Team. I have been in Kyoto since I was a postgraduate student at the Graduate School of Informatics. Interested in AI, Robotics, and Software; early this year I finished the Self-Driving Car Engineer Nanodegree Program, and I am always eager to learn new fields.