Part 1 focuses on setting the context. This article explains the code and developer tasks in detail.
Introduction
Hopefully we now have a good understanding of the key GenAI and LLM concepts. We also discussed various GenAI LLM models and the key questions for app developers to ponder. Let’s start discussing those questions.
Development framework options
All major GenAI or LLM API providers, such as Azure, Google, and OpenAI, offer libraries that allow developers to connect to their model APIs, configure them, provide context, and more, across various programming languages. This is similar to how NoSQL vendors provide client drivers to perform operations. However, as a developer, you need a framework that handles much of the boilerplate code for you, manages configurations easily, and integrates seamlessly into your larger application development ecosystem.
LangChain is a widely used framework that supports various programming languages for LLM-based application development. Another one that I prefer is Spring AI. If you are a Java developer, you’ll enjoy the familiar ecosystem that Spring AI offers, as it follows the same conventions as Spring Data.
While there are other options available, it’s best to choose a framework that easily integrates with your existing application ecosystem.
Once the framework question is settled, let’s delve deeper into the app development.
Building the application: key steps
Just a teaser: when you develop your own application using model APIs and integrate them with other application services, it can eventually lead to the creation of automated agents. These agents can chain multiple calls to LLM models or trigger workflows based on model responses. I believe that’s where the real fun begins, with truly integrated GenAI agents, and where the true potential will be realized.
Coming back to the key steps, with Spring AI as the implementation framework:
- Create and configure a client to connect to Model APIs.
We need a client which, based on the configuration properties, connects to the model APIs. This client becomes the one-stop interface in the application for interacting with the model. A good analogy would be SparkSession from the big data world.
Spring AI provides ChatClient with a fluent API to create prompts (the input for the model), advisors, functions, and response handlers. As mentioned, this becomes the one-stop interface. It uses a ChatModel which is either auto-configured or created programmatically based on the model properties defined in the YAML file.
Some code samples to explain it better:
## There can be many more configuration options
spring:
  main:
    web-application-type: none
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
// Auto-configured ChatClient built from the application context
ChatClient.Builder builder = context.getBean(ChatClient.Builder.class);
ChatClient chatClient = builder.build();

// Send a simple prompt via the fluent API and read the response content as a String
String response = chatClient.prompt().user("Can you respond as a test ping").call().content();
- The prompt creation — user query to the model
Prompts are the user queries or inputs for the AI model to process and generate the desired output. LLM AI models work with string input and output data types. It is up to the programmer to represent the input and output as structured objects, but for the AI model it is strings in and strings out. Prompt engineering is a very important discipline in the GenAI space. A well-designed prompt can make the difference between an irrelevant response and well-authored, context-aware content.
The request JSON schema for a GPT chat model looks something like this:
{
  "model": "string",
  "messages": [
    {
      "role": "string",
      "content": "string"
    }
  ],
  "temperature": "number",
  "max_tokens": "number",
  "top_p": "number",
  "frequency_penalty": "number",
  "presence_penalty": "number",
  "stop": ["string"]
}
Spring AI provides the Prompt class, which offers a convenient way to create JSON with the above schema. A Prompt’s message content can be created from a template string; programmers will be able to relate this to parameterized queries. The Prompt and ChatClient fluent APIs make it readable to set up all these input arguments. Most of the parameters are easy to understand. “temperature” controls the degree of randomness: the lower the value, the more deterministic the output. “role” defines who is sending the message, e.g. user, system, or assistant. “top_p” limits sampling to the most probable tokens, which affects the diversity of the response.
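As a minimal sketch, here is how a templated prompt and a couple of these options could be set through the ChatClient fluent API. The template text and parameter names are made up for illustration, and the exact option-builder method names vary slightly between Spring AI versions:

// A parameterized prompt template, filled in at call time (illustrative example)
String answer = chatClient.prompt()
        .user(u -> u.text("Explain {topic} in the context of {domain}")
                    .param("topic", "LLM models")
                    .param("domain", "software engineering"))
        .options(OpenAiChatOptions.builder()
                    .temperature(0.2)   // lower value -> more deterministic output
                    .build())
        .call()
        .content();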
- The prompt and context fine tuning — ability to advise the Model
So far we have discussed patterns where the GenAI model is more or less treated as a black box: the ChatClient calls the APIs with prompts and a few response-tuning parameters and accepts whatever the model produces. As developers, we want the ability to fine-tune or influence the generated output as per our needs. A few limitations of LLM models make this necessary: LLMs have limited support when it comes to managing long-running content and previous history, hard facts, data freshness, and context awareness. Let’s discuss the various options available to advise the model and influence the response.
How to fine tune the model
The simplest option is passing the “context” with the prompt. The Spring AI ChatClient provides a fluent API to set the context, which is useful for providing background information or an initial narrative. It’s like giving a hint when you execute the query. For example, for a prompt like “Explain LLM model,” you can set the context as “LLM models are useful in summarizing text. Researchers struggle to summarize research papers.” This maps to a system-role message in the GPT request schema.
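As a small sketch, the context can be supplied as a system message through the ChatClient fluent API (the message wording here is illustrative):

// Background context goes in as a system message; the question stays in the user message
String answer = chatClient.prompt()
        .system("LLM models are useful in summarizing text. Researchers struggle to summarize research papers.")
        .user("Explain LLM model")
        .call()
        .content();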
Contexts are good for setting up the baseline or ground level. For the next level of customization, if you want the response to be in a specific tone or to include additional content, you can use the Spring AI “Advisors” API, which enriches the user input and is the basis of the RAG technique. This maps to the “messages” section of the GPT JSON request. For example, you can add advice like “generate the response in FAQ form” as one type of customization.
To further customize the model, you can incorporate information from your database or documents that the LLM model hasn’t been trained on. LLM models are typically trained on data up to a certain cutoff date, so it’s sometimes necessary to pass on the latest information. RAG (Retrieval-Augmented Generation) is the technique of “stuffing the prompt” with contextual, relevant, unseen, and specific instructions to generate the desired response, working around the LLM’s token-size and context-history limitations. Spring AI provides various Advisor implementations for the RAG technique and also supports vector database implementations as the data store for RAG. You can read all your documents, split and embed them, and store them in the vector store. This can then be queried to pass on the most relevant information while invoking a prompt execution.
For example, if you have a document of internal assessments of various clients, you can embed it and store it in the vector store. When creating an advisor to be attached to the prompt, you can query the vector store to find the 2024 ratings and pass the result along as “advice.”
This adds a lot of value, as you can simply augment the model’s training data with your own information and have responses customized to your context.
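As a rough sketch, assuming a VectorStore bean has already been configured (the document text and question below are made up), loading documents and attaching Spring AI’s QuestionAnswerAdvisor could look like this:

// Load (already chunked) documents into the vector store once, e.g. at startup
vectorStore.add(List.of(
        new Document("Client Acme internal assessment 2024: rating A, low risk")));

// At query time, the advisor retrieves the most relevant chunks and stuffs them into the prompt
String answer = chatClient.prompt()
        .advisors(new QuestionAnswerAdvisor(vectorStore))
        .user("What was Client Acme's 2024 rating?")
        .call()
        .content();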
Sometimes, prompt stuffing via RAG may not be sufficient. For example, if you want to know the weather conditions at your location, the AI model needs to know your current location, its temperature, and the current time to respond well. One way to achieve this is to prepare all this data yourself by calling various functions, building a context, and passing it along with the prompt. However, this involves a lot of boilerplate and plumbing code.
Spring AI and the model APIs provide the ability to register functions with the prompt call, which the GenAI model invokes as and when required. It’s like registering callbacks and letting the framework stitch everything together for you. The Spring AI “Tool Calling” APIs support this, which corresponds to declaring a function such as getDate with its parameters in the tools/functions section of the GPT JSON input.
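A minimal sketch of tool calling, assuming Spring AI 1.x’s @Tool annotation (the tool class and method below are hypothetical; older versions registered functions as beans instead):

// A hypothetical tool the model can call when it needs the current date/time
class DateTimeTools {

    @Tool(description = "Returns the current date and time in ISO-8601 format")
    String getCurrentDateTime() {
        return java.time.OffsetDateTime.now().toString();
    }
}

// The framework advertises the tool to the model and invokes it only if the model asks for it
String answer = chatClient.prompt()
        .tools(new DateTimeTools())
        .user("What will the date be three days from now?")
        .call()
        .content();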
I want to emphasize that Spring AI and equivalent frameworks make the code easy to read and manage. Without these frameworks, someone would have to write all the code to prepare a fully functional GPT prompt request and handle the response.
One last option is on the model side: what if you could fine-tune the model itself? You can take the provided base model and tune it with customized training data that applies across all your use cases. All cloud AI services provide the capability to fine-tune their models. This makes sense when you want to handle certain customizations across the board for everyone.
- Final steps — testing and deployment
We are very familiar with writing unit, integration, and end-to-end tests. The problem is how to test the model output. Each run of the model will produce a slightly different response, even with the same input and parameters. The irony is that all such responses can be quite accurate and acceptable, so a simple expected-versus-actual data comparison may not work.
But we can test the following things in the response:
- Relevancy, tone, and context alignment: The suggested way to do this is to run the same input through another model and then compare both responses on various parameters, using a prompt that asks the model to compare the two responses. This is using AI to evaluate AI. For example, you can create a prompt such as “Please compare both responses on how relevant they are to the software engineering context and share the result as JSON,” or “Is this text in a friendly tone? Please respond with yes or no.”
- Fact checking: Sometimes the response is factual, like “India won the finals.” It becomes easier to test when there are verifiable facts in the response.
Spring AI provides Evaluators to support the model evaluation process.
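As a small sketch of what such a test might look like, using Spring AI’s RelevancyEvaluator (the question, the answer variable, and the assertion framework here are assumed for illustration):

// Uses an LLM to judge whether the answer under test is relevant to the question
var evaluator = new RelevancyEvaluator(chatClientBuilder);

EvaluationRequest request = new EvaluationRequest(
        "Explain LLM model",   // original user question
        List.of(),             // retrieved context documents, if any
        answer);               // the model response being tested

EvaluationResponse result = evaluator.evaluate(request);
assertTrue(result.isPass(), "Response was not judged relevant to the question");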
It’s easy to notice that these are very much standard applications, and you can deploy them just like any other application. But you do have a decision to make about where the model itself should be deployed and running. Model execution requires high-end GPUs or a generous amount of RAM, and most models need a good amount of disk space to store the model weights. As mentioned earlier, models are available as cloud services, but there is also an option to deploy the models on-prem.
Ollama is an open-source tool to deploy LLM models locally. This option works well when you are concerned about data privacy and security, and it also gives a container-like feel for trying out the various models that have Ollama integration.
By using Ollama for model deployment, you achieve the same benefits that Docker provides for application deployment: consistency, portability, isolation, ease of deployment, and scalability.
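As a minimal sketch, pointing Spring AI at a locally running Ollama instance is mostly a configuration change (the model name below is just an example):

spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3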
Conclusion
It may be challenging to grasp the internal workings and training of GenAI models. But as a developer, you can build applications using their services in a fairly standard way. I believe the natural progression is the building of agents which seamlessly integrate GenAI with application services.
