Data Mesh and Quick-Response Ad Hoc Queries

Do distributed data products work for low-latency searches or lazy reporting?

Lately, Data Mesh has gained prominence as the pain of ingesting all data into one data lake, with the data engineering team controlling everything, gave way to distributed teams managing their own data domains.

Data Mesh is viewed as a solution to many data-dependent use cases such as reporting, analytics, and queries. Data lake architecture also solves these use cases, and in some cases it performs better. However, Data Mesh enables unblocked parallel development.

Having said this, I struggled to understand how the distributed data domain concept solves low-latency ad hoc queries or lazy reporting use cases, especially when the end user is on mobile or a web app and wants a quick answer. This is where solutions like Elasticsearch or read replicas and partitioned, indexed databases work pretty well.

Let's understand in brief what Data Mesh is and the use cases and advantages where it helps.

Data Mesh – Brief Introduction

Data Mesh, in simpler terms, decentralizes data lakes and data warehouses, similar to how microservices break down a monolithic application.
It hands over the responsibility to ingest, curate, cleanse, store, and serve “data as a product” to the teams who understand that data best, i.e., the data domain teams. Obviously, each team has to follow the data governance principles and should leverage the core data engineering solutions developed by the data platform team as much as possible.

The data domain team is responsible for sharing the data with consumer teams/applications through standard, well-defined interfaces. All this happens under the oversight of a central data engineering or platform team, which manages data governance and promotes reusable data pipeline solutions and architecture. Each data product should be registered with a catalog so that consumers, and everyone else, can discover the product and its metadata (schema, etc.).

Although you can have centralized data pipeline or template infrastructure, it should be more of a reference implementation. The data product and consumer teams have to be given enough flexibility.

The key characteristics of a good Data Mesh implementation are:

  • Very clear data boundaries and well-defined data products without a lot of cross-cutting, similar to the domain concept in DDD.
  • A well-maintained data catalog with schema, owners, and access patterns for the data products. Key element definitions and names should remain the same across products so it is easy to cross-reference or join them (a sketch of such a catalog entry follows this list).
  • Each data product should offer a standard way to consume the data; there should not be a lot of customized consumption patterns.
  • Each one should follow standard governance practices so that we can trust the data.
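
To make the catalog point concrete, here is a minimal sketch (in Python, purely illustrative) of what a data product registration entry might carry. The field names and the payments product itself are assumptions, not the schema of any particular catalog tool:

# Hypothetical catalog entry for a "payments transactions" data product.
# Field names are illustrative; real catalog tools define their own schemas.
payments_transactions_product = {
    "name": "payments.transactions",
    "owner": "payments-domain-team",
    "schema": {
        "transaction_id": "string",
        "customer_id": "string",  # same name and definition as in the customer master product
        "amount": "decimal(18,2)",
        "booked_at": "timestamp",
    },
    "access_patterns": ["batch export", "REST API", "event stream"],
    "governance": {"classification": "confidential", "retention": "7 years"},
}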

Cloud services make it easy to implement the Data Mesh architecture, and we can't discount this point. Imagine how difficult it would be for each team to provision its own hardware and toolset.

So far so good. However, because the data is distributed, low-latency ad hoc query use cases become complex.

Why supporting quick-response searches is a problem

On-demand reporting, or any use case that requires quick queries across data from different domains, needs the data prepared in a common place so it can be cross-referenced and joined. The data should also be stored in a database that is optimal for the consumer, e.g., Elasticsearch for text search, a geospatial-capable DB, or a graph DB. Data Mesh, on the other hand, prescribes that data stays distributed with its owners and is consumed on demand, with no duplicate storage; Data Mesh is a network of data domains. If you need to optimize for low latency, this distributed data will not work, and I think these are two opposite thoughts.

For example, say you have a data product representing payment transactions and another domain for customer master data. Any report will have to stitch these two data products together. These are not just OLTP queries that can be easily managed: users want to check historic data, such as a payment made to Party A three months back, and understand the charges incurred. These are quick-search use cases, not monthly MI reports or a daily live dashboard that works on a limited data set.

What are the options

One line of thought is to create complex caching layers, data virtualization solutions, or cubes. It's like solving a problem that should never have existed in the first place.

The practical way to solve this is to go with the likes of CQRS on Data Mesh (the way microservices solve such problems): query from read replica databases and horses-for-courses data stores like Elasticsearch, document DBs, indexed and partitioned RDBMSs, or, in some scenarios, caches.

This works even in the Data Mesh paradigm. Data Mesh actually allows data domains to be created from the consumption perspective as well. This means an application responding to quick queries can prepare its own consumer data domain.

Data Mesh doesn't restrict you from including other domains' data (payment transactions and customer data here), provided the data is transformed (joined in this case) and persisted in a data store relevant to the final use case. The consumer team can take ownership of this new transformed data product; this is what is known as a consumer-aligned data domain. Another advantage is that you can distribute the read workloads, which may lead to cost savings as well.
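
Here is a minimal sketch of such a consumer-aligned read model, assuming the payments and customer data products expose batch extracts. SQLite stands in for whatever read-optimized store the use case actually needs (Elasticsearch, a document DB, a partitioned RDBMS), and all table and field names are illustrative:

import sqlite3

# In-memory SQLite stands in for the consumer-aligned read store.
conn = sqlite3.connect(":memory:")

# Extracts from the two upstream data products (illustrative rows).
payments = [
    ("t1", "c1", 120.50, "2024-01-15"),
    ("t2", "c1", 75.00, "2024-03-02"),
]
customers = [("c1", "Party A", "party-a@example.com")]

conn.execute("CREATE TABLE payments (txn_id TEXT, customer_id TEXT, amount REAL, booked_at TEXT)")
conn.execute("CREATE TABLE customers (customer_id TEXT, name TEXT, email TEXT)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", payments)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)

# Denormalized, query-optimized read model owned by the consumer team.
conn.execute("""
    CREATE TABLE payment_search AS
    SELECT p.txn_id, p.amount, p.booked_at, c.name AS counterparty
    FROM payments p JOIN customers c ON p.customer_id = c.customer_id
""")
conn.execute("CREATE INDEX idx_party_date ON payment_search (counterparty, booked_at)")

# A quick ad hoc query: payments made to Party A, ordered by date.
rows = conn.execute(
    "SELECT txn_id, amount, booked_at FROM payment_search "
    "WHERE counterparty = ? ORDER BY booked_at",
    ("Party A",),
).fetchall()
print(rows)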

Can this be a data product? It depends on the consumer team. If they think it is relevant for others, then it should be registered in the catalog and proper lineage should be made available.

Long story short, it is OK to prepare the data beforehand and give consumers enough room to operate. This is a win-win situation, as everyone can participate in evolving the data platform without compromising on critical business non-functional requirements.

Simplifying Continuous Deployment: Exploring Popular CD Tools and Practical Applications

Ansible, Chef, Puppet, Terraform, Cloud Vendor Templates, Pulumi: what works for you?

With the advent of cloud computing and DevOps CI/CD concepts, the whole ecosystem of application deployment has changed. The deployment space now involves clearly identifiable tasks: packaging the code (building the binaries), managing environment configurations (both infrastructure and application), building the infrastructure, building application images, and then the final deployment onto clusters.

Infrastructure has moved from finely curated servers to on-demand cloud VMs and serverless compute. With that, the deployment process has also shifted: from providing very customized deployment instructions to using repeatable, automated deployment tools and code (yes, I am referring to IaC).

There are a lot of mature products addressing this space, and each of them is capable of managing the end-to-end deployment process on its own. The problem is figuring out what works best for your use case and your application. You will find a lot of information comparing tools: Ansible to Chef to Puppet, Terraform, ARM templates, and Pulumi, to name a few. On top of that, every organization, project, and even architect has an affinity for particular products. The more I read, the more confused I was about picking what works for the use case. This blog tries to simplify, for a given scenario, which set of tools is best placed to solve your problem statement.

Questions to guide the decision-making

When we analyze the whole build-and-deploy space, we actually look to resolve the following questions:

  • How do I package the application so that it doesn't need to be rebuilt for environment configuration or infrastructure changes?
  • How do I provision the platform, VMs, or cloud services so that they are replicable in each environment?
  • How do I decouple application deployment from infrastructure provisioning, so that each has its own lifecycle and they are not interdependent?

Tool categories to address the questions

Let’s check what deployment tools we have.

  • Provisioners: products/tools responsible for creating infrastructure from scratch. Examples: Terraform, AWS CloudFormation.
  • Configurers or App Managers: products that help you manage the created infrastructure and support application configuration. Examples: Ansible, Puppet, Chef.
  • Packagers: language-specific tools that bundle code into a manageable deployment unit; containerization solutions also fall into this category. Examples: Docker, Kubernetes, and language-specific package managers.

Based on the nature of your application, a combination of tools from the above categories should come together to manage the deployment automation needs. When we have to create new infrastructure or a platform for the first time, or replicate it in all environments, we should use something from the provisioners. We should lean towards the configurers and app managers when we want the codebase deployed with a few configurations tweaked per environment, or when the app is deployed incrementally. When you want to package the code so that it is agnostic to the underlying hardware, pick something from the packagers.

CD Pipeline: Tool Suitability and Usage

Ideally, the pipeline should have the following steps: package and build the code, provision the infrastructure, configure the infrastructure, and then deploy and configure the application.

This article focuses on the provisioning and configuration steps, as the boundaries are blurry here and the available options mostly overlap, which makes a practical choice hard. For brevity's sake, it does not focus much on the other steps.

The “provisioners” are best suited to creating repeatable infrastructure. They are mostly declarative, i.e., you only describe the desired state and leave it to the tool to figure out how to attain it. The key point is that we want to provision the infrastructure in an immutable way: if we need to change something, we don't script the mutation ourselves; the tool figures out what needs to be done to reach the changed state. This keeps the IaC code simple, as we don't have to manage changes based on the current state in the code.

Here is a simple Terraform script managing a Spark cluster (GCP Dataproc); the same script can be edited to change the base image without worrying about the current state:

provider "google" {
  project = "<projectId>"
  region  = "<Var1>"
}

resource "google_dataproc_cluster" "<Var2>" {
  name   = "<Var2>"
  region = "<Var3>"

  cluster_config {
    # Master node configuration
    master_config {
      num_instances = 1
      machine_type  = "n1-standard-4"
    }

    # Worker node configuration
    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-4"
    }

    # Base image for the cluster; edit and re-apply to change it
    software_config {
      image_version = "<configure image name>"
    }
  }
}

Also, note that these are pretty standard tasks: creating a database, a Databricks cluster, networking, or a Kubernetes cluster, or deploying an image on a VM. The nuances are mostly driven by the cloud provider, and the application team does little more than put the right values into the creation templates. It makes sense to let the IaC tool manage the state and determine how to reach the desired state. Tools like Terraform and Pulumi, or public cloud ones like ARM templates and CloudFormation, work well here.

Now let's discuss the next step, infrastructure configuration. This means customizing the provisioned infrastructure to your needs, on top of providing customized values in the provisioning templates. A few examples are installing certain dependencies on the VMs, allowing only certain ports, setting up keys and certificates, and pushing init scripts for cluster startup. Sometimes this can be achieved by building it into the container image itself, though that should be reserved more for application configuration. This step is very specific to the organization and the project: some projects use the same cluster provisioning with different init scripts. It will also undergo a lot of updates and incremental development, and it will involve a lot of scheduled patching work, such as changing the security scan script.

Here the goal is to maintain the desired state of the platform, given that the base state is known; hence the “maintainers”. You need to say exactly how it needs to be done, e.g., fetch a password from a vault and install it in a certain directory, or connect to a Nexus location and copy initial dependencies. This needs scripting support. The tools that best support this step are Ansible, Puppet, Chef, or even PowerShell scripts. We won't go into the nitty-gritty of how to pick among them (agentless or not, DSL vs. YAML vs. a simple extension of shell scripts); the point is to pick a tool where you can state exactly how things need to be done and you have a lot of control over the code. If we pick any of the provisioners here, we will end up creating a lot of complex scripts, or we will try to invoke Puppet modules or shell scripts from the provisioner (e.g., Terraform). We should avoid such interlinking and let the CI pipeline, GitLab or Jenkins, manage it. Yes, this creates one problem: how to link the output state of the provisioners and channel it as input to the configuration scripts in an automated way (maybe another blog for that).

Here is an example of a Puppet manifest to configure init script execution on a VM. The same can be done from Terraform, but the whole flow of downloading the script from Nexus and pushing it here is easier this way:


# Init script location (placeholder path)
$init_script_url = '<path>/<initVm>.sh'

# Download the initialization script to a temporary location
exec { 'download_init_script':
  command => "/usr/bin/wget -O /tmp/initVm.sh ${init_script_url}",
  path    => '/usr/bin',
  creates => '/tmp/initVm.sh',
}

# Execute the initialization script once it has been downloaded
exec { 'execute_init_script':
  command => '/bin/sh /tmp/initVm.sh',
  path    => '/bin:/usr/bin:/usr/local/bin',
  require => Exec['download_init_script'],
}

After bringing the infrastructure to a ready state, let's deploy and configure the application. From a deployment perspective, our focus should be on managing environment-related properties; other types of configuration should be handled in the build or containerization phase. A few examples are notification email settings, timeouts, and DB URLs, and all of these should come from cloud vaults, secret managers, or ConfigMaps. Solutions like Helm charts apply here. There is not much difference between the application and infrastructure configuration processes, only the level at which they apply, so the same set of processes and IaC tools (Ansible, Puppet, PowerShell scripts, cloud-specific configuration managers, Chef, etc.) works best here. Rather than picking a tool-specific example, the sketch below illustrates the principle.
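
As a minimal, tool-agnostic sketch of that principle, assume the environment-specific values are injected as environment variables (for example by a ConfigMap or a secret manager); the variable names below are made up for illustration:

import os

# Environment-specific values injected at deploy time (e.g. via a ConfigMap,
# secret manager, or vault). Variable names are illustrative, not a standard.
DB_URL = os.environ.get("APP_DB_URL", "jdbc:postgresql://localhost:5432/app")
REQUEST_TIMEOUT_SECONDS = int(os.environ.get("APP_REQUEST_TIMEOUT", "30"))
NOTIFICATION_EMAIL = os.environ.get("APP_NOTIFICATION_EMAIL", "noreply@example.com")

# The same build artifact runs unchanged in every environment;
# only the injected values above differ.
print(f"Connecting to {DB_URL} with a {REQUEST_TIMEOUT_SECONDS}s timeout")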

For more clarity, let's go through some practical scenarios:

  1. We have to deploy a Java application on a cloud VM. The VM must have the organization-approved OS image, and then a few utilities and a certificate installed.
    • Here you can have provisioner IaC code to create the VM with the image, and then use a configurer to deploy the certificate file. To link the VM name created by the provisioner to the configurer, we can use a host/infra file.
  2. You have to provision a Kubernetes cluster and create a namespace. A Helm chart is provided to you. You need to configure the pods and the gateway and then deploy the Helm chart. A few secrets need to be pushed to the cloud vault.
    • Again, create the cluster and namespace and configure the initializer script using provisioners; for the remaining tasks use Puppet/Ansible.
  3. You need to create a Dataproc, Databricks, or Spark cluster, predefine some dependencies, and then deploy your Spark jobs.
    • You can use scripts to integrate with the cloud service REST APIs after provisioning the cluster (see the sketch after this list).
  4. Database or SQL warehouse deployments with DDL scripts and user permission settings.
    • Here we can see a clear demarcation: treat the DDL scripts as code and deploy them separately; only the DB configuration should be managed with Terraform, etc.
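
As an illustration of scenario 3, here is a sketch of a post-provisioning step that calls the cluster's REST API to submit a Spark job. The endpoint, payload fields, and token handling are placeholders standing in for the specific cloud service's API (Dataproc, Databricks, etc.), not an exact reference:

import os

import requests

# Placeholders: the real endpoint, payload, and auth scheme depend on the cloud
# service and on outputs from the provisioning step.
CLUSTER_JOBS_ENDPOINT = os.environ["CLUSTER_JOBS_ENDPOINT"]  # e.g. exported by the CI pipeline
ACCESS_TOKEN = os.environ["CLOUD_ACCESS_TOKEN"]

job_payload = {
    "job_name": "daily-aggregation",
    "main_jar": "gs://<bucket>/jobs/daily-aggregation.jar",  # illustrative artifact location
    "args": ["--date", "2024-03-01"],
}

response = requests.post(
    CLUSTER_JOBS_ENDPOINT,
    json=job_payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=60,
)
response.raise_for_status()
print("Job submitted:", response.json())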

Concluding Remarks

It's fair to say that with the gradual shift from heavily curated hardware to easily replaceable infrastructure, and with different levels of application packaging (JARs and ZIPs to serverless modules), we need a combination of deployment tools working in tandem. It's not going to be a single tool but a combination of them, aligned to each step involved in the deployment (infra deployment, app deployment, environment configuration, etc.), and there is very little dividend in dwelling deep on comparisons among the tools that address each step. You will be fine with whatever your organization mandates in each category.