Ever since the inception of computers, there has been a need for developers. Eventually, the need for
seamless collaboration between development and operational teams brought about the infrastructure
that went on to be termed 'DevOps.'
The term DevOps, as the name suggests, combines “Development” & “Operations”. In the data science
domain, DevOps is the coming together of data scientists who are responsible for “development” and
IT professionals who execute “operations” - a co-ordinated workflow to drive an organization’s
analytical quest. There is a certain amount of overlap between the work of data scientists and DevOps teams,
specifically while deploying data models to the production platform. This is also why you'll find job
descriptions for data scientists highlighting EC2, Docker, and Kubernetes, among other tools.
Through the course of this article we address the nature of DevOps in the world of software development,
the step-by-step cyclic operation that is characteristic of a fundamental DevOps process, its use
in data science, and the future technologies that are slated to change DevOps as we now know it.
The DevOps process can be summarized as a cyclic step-by-step operation that requires
continuous development, testing, integration, deployment and monitoring.
How Do DevOps Requirements for Data Science Vary from Software Application Development?
While DevOps deals with software application development, the DevOps engineer has a broader
responsibility to ensure continuous, efficient software functionality, especially in terms of
development, deployment and operational support. Here are a few ways in which DevOps requirements
for data science vary from software application development:
- While a range of software applications utilize a uniform computing stack such as Python,
developing and testing data models requires more than simple Python scripts and involves
Spark and other big data platforms.
- Computing requirements of a data scientist can be extensive and vary across data models.
For instance, a model that must be tested in more than five variations against a large dataset
will require more computing power and storage than a model which has to be tested against a smaller sample.
- Maintenance of data models that are already commissioned to production entails more tasks
than just altering the underlying code. These tasks range from feeding models with updated data
and reconfiguring their operating parameters to updating the operating infrastructure, which
generally leads to a new deployment process altogether.
- A software application development process is usually compact, with a limited number of process-heads
that developers collaborate with. On the contrary, a data scientist has to cross-collaborate with
clients, IT operators, and business analysts to execute a successful design, deploy data models,
and establish proof of value. In most cases, data analytics teams do not report into the IT
department, which makes it even more difficult to advise on standard rules and governance to follow.
Why do you need DevOps in Data Science?
Deployment of a data analytics solution entails more tasks than a software deployment, which
typically involves testing, integrating, and deploying code. To ensure a seamless end-user
experience, data models must work cohesively with upstream and downstream applications, and most
organizations tend to have multiple development teams focused on the deployment process. However,
the lack of co-ordination between these development teams can trigger multiple challenges -
customer grievances during deployment and multiple builds for a single data model application, to
name a few. DevOps tools integrate and streamline the operations of multiple teams so that they can
collaborate harmoniously, ensuring successful deployment of data models and a seamless experience
for end-users.
- Bridging the knowledge gap in the deployment process: DevOps experts help choose and
configure the infrastructure that forms the podium for seamless deployment of data models. This
task entails close collaboration with data scientists to observe and replicate the
configurations required for the infrastructure ecosystem. DevOps engineers must have a
thorough know-how of the code repositories used by data scientists and the process to commit
code. In most cases, despite using code repositories, data scientists lack the know-how to
automate integrations. This knowledge gap can create loopholes in the deployment process for
data models. DevOps teams effectively fill this gap by assisting data scientists with
continuous integrated deployment. Standard processes that previously operated with a manual
workflow to test new algorithms can be efficiently automated with the help of DevOps, as sketched below.
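To make this concrete, here is a minimal, illustrative sketch of how a manual algorithm-testing step can be automated in CI: a pytest-style check that fails the build if a committed model regresses below an acceptance threshold. It assumes a scikit-learn-style model saved with pickle; the file paths, target column and threshold are hypothetical placeholders, not the author's actual setup.

# test_model_quality.py - a minimal CI check (hypothetical paths and threshold).
# Run automatically on every commit, e.g. via "pytest" in the CI pipeline.
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score

MODEL_PATH = "artifacts/model.pkl"          # assumed location of the trained model
VALIDATION_PATH = "data/validation.csv"     # assumed held-out validation set
MIN_ACCURACY = 0.80                         # illustrative acceptance threshold


def test_model_meets_accuracy_threshold():
    # Load the committed model artifact and the validation data.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    data = pd.read_csv(VALIDATION_PATH)
    X, y = data.drop(columns=["target"]), data["target"]

    # Fail the build if the new algorithm regresses below the threshold.
    accuracy = accuracy_score(y, model.predict(X))
    assert accuracy >= MIN_ACCURACY, f"accuracy {accuracy:.3f} below {MIN_ACCURACY}"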
- Infrastructure Provisioning: Machine Learning setups are founded on the basis of different
technological frameworks that aid the intricate computation process. To manage the clusters,
DevOps engineers create scripts that enable the automated launch and termination of the various
instances run in the ML training process. Constant management of code and configuration ensures
that the processes remain up to date, and scripting the setup of ML processes saves DevOps
engineers the time otherwise spent on manual configuration. A minimal sketch of such a script follows.
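As an illustration of such provisioning scripts, the sketch below uses boto3 to launch and later terminate tagged training instances on AWS. It assumes AWS credentials are already configured; the region, AMI ID, instance type and tag are placeholders, not prescriptions.

# provision_training.py - a minimal sketch of automated instance lifecycle management.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")


def launch_training_instances(count=2):
    # Spin up GPU instances for an ML training run (AMI ID is a placeholder).
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="p3.2xlarge",
        MinCount=count,
        MaxCount=count,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "ml-training"}],
        }],
    )
    return [i.id for i in instances]


def terminate_training_instances():
    # Tear the tagged instances down once training completes, so nothing is left running.
    filters = [{"Name": "tag:purpose", "Values": ["ml-training"]}]
    ids = [i.id for i in ec2.instances.filter(Filters=filters)]
    if ids:
        ec2.instances.filter(InstanceIds=ids).terminate()
    return ids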
- Iterative Developments: To ensure that deployed models can easily be aligned to newer
software updates, continuous integration (CI) and continuous delivery (CD) practices are
followed. For ML models to constantly evolve, iterative development environments are set up,
given the different tools employed for automation and consistent machine training and
learning, including Python, R, Juno, PyCharm, etc. Iterative developments using complex CI
and CD pipelines help identify and fix bugs swiftly, enhance developer productivity,
automate the software release process and deliver updates quickly; one such pipeline stage is sketched below.
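The gated promotion step below is a minimal sketch of what one stage of such a CI/CD pipeline might look like: retrain on the latest data, compare against the current baseline metric, and promote only on improvement. The dataset path, target column, model choice and artifact locations are all hypothetical assumptions.

# promote_model.py - a minimal sketch of a CD gate for an evolving ML model.
import json
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def retrain_and_promote(data_path="data/latest.csv", baseline_path="artifacts/baseline.json"):
    data = pd.read_csv(data_path)
    X, y = data.drop(columns=["target"]), data["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Retrain on the newest data and score on the held-out split.
    model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    new_score = accuracy_score(y_test, model.predict(X_test))

    with open(baseline_path) as f:
        baseline_score = json.load(f)["accuracy"]

    if new_score <= baseline_score:
        print(f"Not promoted: {new_score:.3f} <= baseline {baseline_score:.3f}")
        return False

    # Promote: overwrite the production artifact and update the baseline metric.
    with open("artifacts/model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(baseline_path, "w") as f:
        json.dump({"accuracy": new_score}, f)
    print(f"Promoted new model with accuracy {new_score:.3f}")
    return True


if __name__ == "__main__":
    retrain_and_promote()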
- Scalability: Deployment and development processes need to be operated at scale so that
organizations can expand DevOps efforts and increase implementations. Evolving systems can be
efficiently managed with consistency and automation, which in turn fuel scalable development.
Normalization and standardization processes for the same need to be started at junctures that
are already functioning with agility and are the starting points of DevOps processes.
- Configuration Management: Through DevOps, infrastructure can be developed at scale and managed
via programming instead of manual effort. By leveraging configuration management, this system
can be updated and standardized. It is often considered the start and finish line for DevOps.
It helps maintain repositories of source code, operational artefacts, and the scripts used for
testing, building and deploying. The Configuration Management Database is also leveraged to
manage all repositories.
- Monitoring: To assess the performance of deployed systems and analyze their outputs,
monitoring ML models is vital. DevOps engineers enable real-time analytical insights by
proactively monitoring and sifting through the data provided by the systems and ML models.
Any changes or issues are then identified and duly acted upon; a minimal monitoring sketch follows.
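The following is a minimal monitoring sketch: it compares the distribution of recent prediction scores against a stored baseline with a two-sample Kolmogorov-Smirnov test and raises an alert on drift. The file paths, significance threshold and alerting behaviour are illustrative assumptions rather than a prescribed setup.

# monitor_model.py - a minimal drift-monitoring sketch (hypothetical paths and threshold).
import json

import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # illustrative significance threshold


def check_prediction_drift(recent_path="logs/recent_scores.npy",
                           baseline_path="artifacts/baseline_scores.npy"):
    recent = np.load(recent_path)
    baseline = np.load(baseline_path)

    # Two-sample Kolmogorov-Smirnov test between baseline and live score distributions.
    statistic, p_value = ks_2samp(baseline, recent)
    drifted = p_value < DRIFT_P_VALUE

    report = {"ks_statistic": float(statistic), "p_value": float(p_value), "drift": drifted}
    print(json.dumps(report))

    if drifted:
        # In a real setup this would page the on-call engineer or trigger retraining.
        raise RuntimeError(f"Score drift detected (p={p_value:.4g})")
    return report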
- Containerization: A majority of ML applications have elements that are written in different
programming languages, such as R and Python, that are generally not in perfect sync.
Apprehending a negative impact owing to the lack of synchronization among languages, ML
applications are often ported into production-friendly languages such as C++ or Java. However,
these languages are more complicated, and this takes a toll on the speed and accuracy of the
original ML model. DevOps engineers therefore prefer containerization technologies, such as Docker,
that are functional in addressing challenges stemming from the use of multiple languages, as sketched below.
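As a small illustration, the sketch below uses the Docker SDK for Python to build a model-serving image and run it as a container. It assumes Docker is installed, the docker Python package is available, and a Dockerfile already exists in the project root; the image tag and port mapping are placeholders.

# containerize_model.py - a minimal sketch using the Docker SDK for Python (docker-py).
import docker

client = docker.from_env()


def build_and_run(tag="model-service:latest"):
    # Build the image from the Dockerfile in the current directory.
    image, build_logs = client.images.build(path=".", tag=tag)
    for entry in build_logs:
        if "stream" in entry:
            print(entry["stream"], end="")

    # Run the container in the background, exposing the serving port.
    container = client.containers.run(tag, detach=True, ports={"8080/tcp": 8080})
    return container.id


if __name__ == "__main__":
    print(build_and_run())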
Inadequacies in deployment procedures have often derailed data analytics projects. Different types of
data models require dedicated production infrastructure that can support the operation of individual
models. Such niche requirements create confusion and ultimately trigger major hindrances during
implementation. For instance, training and using neural nets for inference demands huge computing
power and adds layers of complexity during deployment. These cases are not uncommon in the data
analytics field, which reinforces the need for the expertise of DevOps engineers to fill the gap
between the building and deployment of data models.
However, in the same way that segregation between software engineering and DevOps engineering hinders
smooth workflows, the absence of a cohesive collaboration between data scientists and DevOps engineers
deters smooth operational processes as well. And the functions of the DevOps engineers can quite easily
be embraced by data science teams.
Being a part of the DevOps process does not mean that one needs to be a DevOps engineer. It simply means
that when working on DevOps:
All Python model code needs to be committed to a repository, and any changes to existing code need to
be managed through that repository.
Code needs to be integrated with Azure ML via Software Development Kits so that all changes and feature
alterations can be logged and tracked for later referencing, as the sketch after these steps illustrates.
Given that the DevOps process is automated to build and create code and artefacts, ensure that you do
not manually release or build your code or artefacts to any location other than your experimentation environment.
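The following sketch shows what such SDK-based tracking could look like with the Azure ML Python SDK (azureml-core, v1-style API): a run is created under an experiment, metrics and metadata are logged, and the model artifact is attached. The experiment name, metric values and file paths are illustrative assumptions.

# track_run.py - a minimal sketch of logging a training run with the Azure ML SDK.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                      # loads the workspace from config.json
experiment = Experiment(workspace=ws, name="churn-model")   # hypothetical experiment name

run = experiment.start_logging()
run.log("feature_set_version", "v3")              # track which features were used
run.log("accuracy", 0.87)                         # illustrative metric value
run.upload_file("outputs/model.pkl", "outputs/model.pkl")   # attach the model artifact
run.complete()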
With these few, simple steps, data science teams and DevOps teams can easily collaborate with one
another. Some data scientists might not be well-versed with versioning tools such as Git and may
take time to implement continuous delivery and deployment setups. As noted in an article featured on
DZone, "Most data scientists spend the majority of their time getting access to data or trying to get
their algorithms deployed. With better tooling and a DevOps point of view, this process can be streamlined.
When DevOps and data scientists collaborate earlier in the process, it's possible to ensure data
pipelines get the same respect as a consumer-facing website."
Is DataOps the DevOps of the Future? How does MLOps feature in this landscape?
DevOps signaled a sea change with its inception, and a truly efficient DevOps process can
reduce delivery time from months to mere days. However, many believe that another upcoming
technology has the potential to be the next big thing – DataOps. In 2018, about 73% of
companies were reportedly investing in DataOps.
Some business leaders refer to DataOps as 'DevOps with data analytics'. However, an article
featured on Inside Big Data calls DataOps the "close cousin of DevOps" and argues that, "DataOps
isn't just DevOps applied to data analytics. While the two methodologies have a common theme of
establishing new, streamlined collaboration, DevOps responds to organizational challenges in
developing and continuously deploying applications. DataOps, on the other hand, responds to
similar challenges but around the collaborative development of data flows, and the continuous
use of data across the organization."
Simply put, we could consider that Agile + DevOps + lean manufacturing = DataOps. Once the
development phase of DataOps is completed, CI can be set up to maintain the quality of the code
on the master branch. At the end of each sprint, developers merge all their changes into
the master branch, where all the test cases are run before the branch is accepted. This step is
then followed by identifying the CD pipeline that can be run to generate artifacts of the build,
which can then be stored on the cloud. As part of the deployment process, at the end of each
sprint, dockerizing, i.e., converting a software application to run within a specified container,
is carried out; a minimal sketch of the artifact-upload step follows.
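A minimal sketch of that artifact step is shown below: the build output is packaged into an archive and uploaded to cloud object storage with boto3. The bucket name, sprint label and paths are hypothetical, and configured AWS credentials are assumed.

# publish_artifacts.py - a minimal sketch of a CD step that stores build artifacts on the cloud.
import shutil

import boto3


def publish(build_dir="build", bucket="my-dataops-artifacts", sprint="sprint-42"):
    # Package everything produced by the build into a single archive.
    archive = shutil.make_archive("model_artifacts", "zip", root_dir=build_dir)

    # Upload the archive so downstream environments can pull a fixed, versioned build.
    key = f"{sprint}/model_artifacts.zip"
    boto3.client("s3").upload_file(archive, bucket, key)
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    print(publish())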
Beyond the DataOps framework is the MLOps framework. It is also considered similar to DevOps in
the sense that they both deal with software development and deployment and emphasize the need
for collaboration; given that ML deals with data science, it requires a relatively larger amount of
experimentation and multiple iterations to align with changing data structures. A
straightforward definition of MLOps would be that it is a duet between Machine Learning and
Operations and is used to understand KPIs, acquire data, and then develop, deploy and monitor
models. It helps incorporate data science into business applications.
Each of the three aforementioned technologies has its own characteristic features. Elaborating on
these characteristics, we can observe that DevOps can function within set guidelines, but the same
cannot be said for DataOps, because changes to data processes are inevitable, thereby necessitating
constant inspection and updating. Augmented Data Preparation could provide some respite in this
situation. It enables testing of various setups and configurations by creating access to meaningful
data, doing away with the need for assistance from data scientists or IT personnel. "It allows users
access to crucial data and Information and allows them to connect to various data sources (personal,
external, Cloud, and IT provisioned). Users can mash-up and integrate data in a single, uniform,
interactive view and leverage auto-suggested relationships, JOINs, type casts, hierarchies and clean,
reduce and clarify data so that it is easier to use and interpret, using integrated statistical
algorithms like clustering and regression for noise reduction and identification of trends and
patterns," notes a Dataversity article.
MLOps, on the other hand, requires Continuous Integration and Continuous Deployment of code, and
also deals with the deployment of data and ML models. Continuous training and model monitoring
also ensure a great end-user experience when accessing AI-powered apps, because the models are
constantly training and retraining as required. Also, software such as Git can be used for version
control, so that we can go back to a stable version of the model as and when required. MLOps
enables users to access the initial phases, where stakeholders wish to understand the KPIs of the
business and figure out the way to acquire data and place it accordingly. Once the data is in place
and the model is deployed, its behaviour can be further studied by leveraging a monitoring tool.
While DevOps teams are more technically driven, with the aim of delivering an efficient and
effective product, DataOps brings in the collaboration of technical and business teams and is
primarily data-driven. For the data science & analytics functions and the deployment and governance
functions to thrive, a more collaborative set-up is required. And this collaboration has to be
inclusive of the workings of the ML models as well. Regulating the functioning of ML models,
standardizing the lifecycle of ML management, and constantly aligning to the latest data inputs and
changes in data sources reduces the possibility of false data insights. Also, MLOps practices need
to collaborate seamlessly with the existing DevOps practices to ensure that all the automated
operations run smoothly.
The ML lifecycle has always been pitched as an end-to-end cycle, but many businesses are yet to
successfully manage the process in its intended format at an enterprise-level scale. This is
because ML inclusion in the enterprise is a gradual process: it is challenging to scale, automation
is limited, collaboration is difficult, and very few of the operationalized models succeed in
delivering business value. Even when ML is at a mature stage, deployment and business impact are
the real areas warranting improvement.
This year has shed light on the growing need for data scientists and engineers to communicate
and collaborate effectively when automating and productizing machine learning algorithms. Much
like modern application development, machine learning development is also iterative. New
datasets need to be made available as they help in the training and evolution of models.
Open source frameworks like MLflow, Kubeflow and DVC are competing to become the market standard
in the open-source landscape. Simultaneously, upcoming startups are adding UIs to these open-source
solutions to introduce "proprietary" MLOps products to the market. While these platforms may
help manage the ML lifecycle and enable experimentation, reproducibility and deployment,
versioning still remains a challenge, and leveraging options like DVC & Delta Lake might help
resolve these challenges; a minimal sketch of data versioning with DVC follows.
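As one illustration of such versioning, the sketch below uses the DVC Python API to read a dataset exactly as it existed at a given tag, so experiments can be reproduced against a fixed data version. The repository URL, file path and tag are hypothetical placeholders.

# fetch_versioned_data.py - a minimal sketch of reading a tagged dataset version via DVC.
import pandas as pd
import dvc.api


def load_dataset(rev="v1.2"):
    # Open the dataset as it existed at the given Git tag/commit, pulling the bytes
    # from whichever DVC remote the repository is configured with.
    with dvc.api.open(
        "data/training.csv",
        repo="https://github.com/example-org/ml-project",
        rev=rev,
    ) as f:
        return pd.read_csv(f)


if __name__ == "__main__":
    df = load_dataset()
    print(df.shape)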
Overall, while some believe that merging the two practices of DevOps and DataOps would qualify as a
'match made in heaven', others are still skeptical about the coupling and believe that the union
is incomplete without the inclusion of MLOps. Some articles point out that there are 'too many
Ops'. Perhaps that is because DevOps setups have been around for more than a decade, while DataOps
is still in the nascent stages of application; MLOps, too, is in a sense in its infancy. Businesses
might hesitate to adopt MLOps as there are no universal guiding principles yet, but what we need
is a leap of faith to ensure that we get started with the implementation process in order to
stay ahead of the curve.
DevOps has been a buzzword in the data science and data engineering world for about a decade, and
in the last ten years this field has undergone quite the change. As noted in an article by
Beacon, DevOps was originally created to "de-silo dev and ops to overcome the bottlenecks in the
software development and deployment process, mostly on the ops side." Now DevOps forges ahead,
leveraging continuous development and deployment of software and ML models, and DataOps and
MLOps promise an analytically sound future as the three 'Ops' come together.
Furthermore, many firms are also looking to transition from a DevOps setup to a DevSecOps setup,
i.e., a setup that highlights the need for accountable development and operational processes and
emphasizes the need for swift decisions governing security. This ensures that security measures
are built into the tools and models while they are being developed, rather than being added as
the final step. When security is woven into every step of software and ML model development
and deployment, models can be deployed swiftly, with reduced compliance costs.
Regardless of the numerous 'Ops' and changes that lie in store, at the heart of it, the
fundamentals are likely to remain the same – collaboration & cooperation across teams and
verticals for developmental and analytical efficiency.
Author - Sameena Shaik, Senior Associate, TheMathCompany
Sameena has 5 years of experience in the field of DevOps, and has been architecting effective and
optimal solutions according to requirements while striving towards zero downtime. She has worked
on a wide range of technologies, which include Docker, Kubernetes and Jenkins. She also has hands-on
experience in major cloud platforms - AWS, GCP and Azure. Outside of work, Sameena loves to take
time out to read and take up fabric painting.