Neon Redwood


Case Study: LLM Application for Hierarchical Insights in GLP-1

Introduction

Companies are generating, sourcing and analyzing an ever-growing amount of information to improve decision making. From meeting transcripts to corporate decks and research reports, the challenge lies not just in collecting and managing this data, but also in distilling it into actionable insights. 

Consider a specific example: biotech analysts tracking the growing market of GLP-1 drugs for treating type 2 diabetes and obesity. This market is expanding rapidly, driven by consumer demand, and analysts want to stay at the forefront of knowledge about companies in the space, focusing on capacity building, pricing, and clinical updates. Analyzing all of this information, sourced from press releases, earnings calls and reports, and other detailed technical documents, is a critical task. Furthermore, analysts want to understand this content at different levels of a hierarchy: from key high-level insights and developments, to summaries with more nuanced company and product updates, all the way down to detailed notes that contain specific scientific facts, Q&A content, and other relevant data from the original source document.

We have built a Large Language Model (LLM) application to accomplish this type of industry-specific, hierarchical information extraction. While the guiding use case is analyzing content related to the GLP-1 market, all of the design choices we made along the way make this application generalizable to other industries. Our application has three primary goals: 

  1. Generate detailed notes covering the entirety of the content.

  2. Produce impactful summaries.

  3. Extract key insights.

In the rest of this blog post, we'll dive into the specific technical challenges of building an LLM application to achieve these goals, discuss our approach to each challenge, and demonstrate the application's effectiveness on an example use case: analyzing a call transcript for a company in the GLP-1 market.

Brief Background on LLMs

The best proprietary Large Language Models (LLMs), including OpenAI’s GPT-4 Omni, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini Ultra, are all excellent tools for certain types of information extraction. Unfortunately, even these state-of-the-art models cannot perform every type of analysis that users need out of the box, particularly when it comes to domain-specific or very detailed analyses. Often these models produce outputs that are too general, or they require significant manual user interaction to extract critical details. Users also often care about hierarchical information extraction, where they want varying levels of detail available for different use cases.

Fortunately, there is a growing ecosystem of products and tools for building custom LLM applications across the entire “LLMOps” stack. We have already mentioned some of the key proprietary LLM providers, but there are many other good proprietary and open-source providers in the space, including Hugging Face, Cohere, Meta, and Databricks. In addition to the models themselves, vector databases such as Pinecone, Chroma, and Milvus have taken off because they are critical for building Retrieval-Augmented Generation (RAG) systems. For building entire applications there are a number of tools such as LangChain, LlamaIndex, Hugging Face, and an ever-growing number of proprietary products. Every major cloud provider, including AWS, GCP, and Azure, is integrating LLMs into existing services and workflows, allowing developers to build their own apps via that pathway.

Technical Challenges

Long-form Output

LLMs are being widely adopted in production applications for summarization and information extraction. However, even the best performing LLMs today have limitations. The overall context size of many open source and proprietary LLMs has reached over 100,000 tokens (roughly 80,000 words). This means that the total amount of text the LLM can handle in a single operation, consisting of a prompt, relevant contextual information, and the answer that the LLM provides, is up to 100,000 tokens. However, the distribution of these tokens across the different text components is important. Most models are constrained by a maximum output token limit, often 4,096 tokens. Additionally, these models have been trained and optimized to return short answers in the majority of cases. If your use case requires more detailed or longer-form answers, such as generating detailed notes intended to cover the entirety of a transcript or report, then simply passing the document to the LLM along with specialized instructions will not succeed. Furthermore, although LLMs can handle large context sizes, they do not always perform well at complicated tasks over the entire context¹⁻².

Summarization

Focusing on the summarization of a single document with an LLM, there are three general approaches available. First, you can “stuff” the entire content into the prompt and make a single pass over all of the data. Second, you can split the input document into chunks, perform operations on each chunk, and then combine the results from the distinct chunks into a single output (“map-reduce”). Third, you can take an incremental approach where you still split the input document into chunks, but you build the summary as you go and pass the running summary as context into the LLM (“refinement”). LangChain³ has a helpful write-up on this and provides a lot of boilerplate code for performing summarization on text.

Each of these approaches has its own strengths and weaknesses. “Stuffing” works well for short documents and for content that is not overly specific. The other two approaches perform better in most cases because they let the LLM process a smaller amount of information at a time. “Refinement” often produces better summaries than “stuffing”; however, it introduces the challenge of integrating each new chunk into the ongoing summary, and it is an inherently sequential algorithm that cannot be parallelized. “Map-reduce” shares some of the benefits of “refinement”: the LLM sees a smaller amount of context at a time, and it is more efficient because each step in the mapping process can be performed independently. However, “map-reduce” shares the challenge of integrating information from multiple chunks, and it faces an additional issue that “refinement” does not: the potential for missing context at the start and end of each chunk.
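For reference, LangChain ships boilerplate chains for all three strategies. Below is a minimal sketch, assuming an OpenAI API key is configured in the environment; exact imports and chain APIs vary across LangChain versions.

```python
# A minimal sketch of the three summarization strategies using LangChain's
# built-in chains; not the application's actual code.
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(model="gpt-4o", temperature=0)
docs = PyPDFLoader("transcript.pdf").load()

# "stuff": one pass over the full text; only works if it fits in the context window.
stuff_chain = load_summarize_chain(llm, chain_type="stuff")

# "map-reduce": summarize each chunk independently, then combine the pieces.
map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce")

# "refine": walk through the chunks sequentially, updating a running summary.
refine_chain = load_summarize_chain(llm, chain_type="refine")

summary = map_reduce_chain.invoke({"input_documents": docs})["output_text"]
```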

Industry-Specific Knowledge and Tasks

Many of the largest, state-of-the-art LLMs are very good at performing text analysis over a wide range of industries or domains. That being said, a simple summarization prompt or other request may not lead to output that contains the right content or depth of information for a specific use case. 

There are several options for improving LLM outputs. Two of the most well-known approaches are prompt engineering and model fine-tuning. Prompt engineering involves giving the model specific instructions or information to help it perform better at a specific task or request. Model fine-tuning involves taking a base LLM, providing additional representative training data for desired tasks, and retraining that base model with the hopes of creating a better model for performing those tasks.

Although fine-tuning can lead to better task-specific models, generating clean training data is time-consuming, and fine-tuning is often quite expensive, particularly for the best-performing proprietary models. Industry best practice is to start with prompt engineering to improve model performance on specific tasks, and to avoid fine-tuning unless necessary. OpenAI⁴ provides a helpful introductory guide to prompt engineering along with several examples.

Goals and Technical Challenges

Let’s revisit our goals for this application. 

  1. Generate detailed notes covering the entirety of the content.

  2. Produce impactful summaries.

  3. Extract key insights.

Taking these all into consideration, we made the following application design decisions. To generate detailed notes on the entirety of a document (of any length), we split the text so the LLM can focus carefully on a smaller context and not omit any details. We employ a “map-reduce” approach where the output from the detailed notes step feeds the reduction phase for summarization. For domain-specific knowledge, we developed custom, detailed prompts for every LLM call to ensure that the output focuses on the most relevant and important content.

Technical Approach

Below is a graphical representation of the application. A critical step in generating the lowest level of hierarchical information is breaking up the original document into semantically coherent chunks and then running targeted prompts on those distinct chunks. There is also an important downstream reduction step where the summary is generated from the output of the detailed notes mapping step. The highest-level information, the key insights, is best generated from the original content without any processing, except perhaps splitting the original document if it doesn’t fit into the LLM context window. Finally, all of the content is merged into a single, final output.

In the rest of this section, we’ll provide some pseudocode and commentary outlining different key steps and what we experienced along the way. The diagram above and the pseudocode here highlight the flow of data through the application.

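A simplified pseudocode sketch of that flow is below; the helper names (load_pdf_text, semantic_chunks, run_llm, and the *_PROMPT constants) are hypothetical stand-ins rather than the actual implementation.

```python
# Simplified pseudocode of the application's data flow. All helper names are
# hypothetical stand-ins for the real loading, chunking, and prompting code.
def analyze_document(pdf_path: str) -> str:
    text = load_pdf_text(pdf_path)

    # Map step: split into semantically coherent chunks and take detailed
    # notes on each chunk with targeted prompts.
    chunks = semantic_chunks(text)
    detailed_notes = [run_llm(NOTES_PROMPT, chunk) for chunk in chunks]

    # Second mapping pass that cleans up note artifacts.
    cleaned_notes = [run_llm(CLEANUP_PROMPT, notes) for notes in detailed_notes]

    # Reduce step: generate the summary from the combined detailed notes.
    summary = run_llm(SUMMARY_PROMPT, "\n\n".join(cleaned_notes))

    # Key insights are generated directly from the original content.
    key_insights = run_llm(INSIGHTS_PROMPT, text)

    # Merge all hierarchical levels into a single final output.
    return "\n\n".join([key_insights, summary, *cleaned_notes])
```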

Development Details

We developed the application in Python, primarily using LangChain and OpenAI, with GPT-4o as the model. Specifically, we leveraged the document loaders and experimental semantic text splitter from LangChain. To generate the final output, we converted the application’s Markdown-formatted text to HTML and then to a PDF using the markdown2 and xhtml2pdf Python libraries.
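A minimal sketch of that final conversion step, assuming the merged application output lives in a string named final_markdown:

```python
# Convert the application's Markdown output to HTML, then render the HTML to a
# PDF. Assumes `final_markdown` holds the merged application output.
import markdown2
from xhtml2pdf import pisa

html = markdown2.markdown(final_markdown)

with open("glp1_analysis.pdf", "wb") as pdf_file:
    status = pisa.CreatePDF(html, dest=pdf_file)

if status.err:
    raise RuntimeError("PDF generation failed")
```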

Prompt Engineering

We designed detailed prompts for each LLM process in the application to produce the desired outputs. At the top level, we fed the LLM a system prompt with domain-specific knowledge so that we could apply that knowledge throughout the entire process. Then we defined clear instructions for each specific task. The remainder of this section covers the types of content included in the prompts and peppers in some details about our approach to prompt engineering.

System Prompts

A natural place to include detailed, industry-specific information for LLM applications is in a system prompt. This is a piece of context that users may pass to an LLM at the start of each request they make. The system prompt doesn’t need to be the same for every process in our application, and in fact, we vary it depending on the task we’re trying to accomplish. At a high level, the system prompts for this application attempt to do two things:

  • Define a role for the process.

  • Define industry-specific concepts to focus on.

For example, in the detailed notes system prompt, we instruct the model to assume the role of an expert analyst and notetaker. We also provide a list of industry-specific concepts to focus on when reviewing content. Some examples of concepts for our use case included detailed financial metrics, market size, organizations and people, and clinical trials.
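For illustration, a system prompt along these lines captures both elements; the wording below is a paraphrase of the role and concept list described above, not the exact prompt used in the application.

```python
# Illustrative system prompt for the detailed notes step; a paraphrase, not
# the production prompt.
NOTES_SYSTEM_PROMPT = """\
You are an expert biotech analyst and meticulous notetaker reviewing
documents related to the GLP-1 market for type 2 diabetes and obesity.

When reviewing content, focus on:
- Detailed financial metrics (revenue, guidance, margins)
- Market size and demand
- Organizations and people mentioned
- Manufacturing capacity and pricing
- Clinical trials: products, phases, indications, and readouts
"""
```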

User Prompts

In addition to system prompts, we define user prompts with additional context for each LLM process. Our user prompts typically do the following:

  • Provide specific instructions related to the task at hand.

  • Define the structure and content of the output response.

  • Pass document content to the LLM.

Sticking with the example of the detailed notes process, the user prompt tells the LLM to generate detailed notes based on the included text. Then it provides details on which types of content to include or ignore and provides some instructions related to the output format. Lastly, we inject the actual document content at the end of the prompt so that the LLM will have that as context to perform the operation.
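Again for illustration only, a user prompt template for this step might look like the following, with the chunk text injected at the end.

```python
# Illustrative user prompt template for the detailed notes step; the document
# chunk is injected at the end so the LLM has it as context.
NOTES_USER_PROMPT = """\
Generate detailed notes covering the entirety of the text below.

Instructions:
- Capture all facts, metrics, speakers, and Q&A content.
- Ignore legal disclaimers and operator boilerplate.
- Format the output as Markdown bullets grouped under topic headings.

Text:
{chunk_text}
"""
```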

Document Input and Semantic Splitting

We used LangChain for reading in the PDF content and performing the document splitting. The standard PyPDFLoader is great right out of the box for extracting text from PDFs, and there are many other loaders available if you want to explore extracting content from different types of files, or want to extract non-text content from PDFs.

Splitting the original long-form transcript into smaller chunks is necessary for retaining key details and generating detailed notes for the entire content. There are several options for splitting documents; LangChain⁵, along with a helpful Python notebook by Greg Kamradt⁶, provides a great write-up on the semantic chunking that we decided to employ here.
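A minimal sketch of the loading and splitting steps, assuming recent langchain-community, langchain-experimental, and langchain-openai packages:

```python
# Load the PDF and split it into semantically coherent chunks using
# embedding-based semantic chunking.
from langchain_community.document_loaders import PyPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

pages = PyPDFLoader("earnings_call_transcript.pdf").load()
full_text = "\n".join(page.page_content for page in pages)

splitter = SemanticChunker(OpenAIEmbeddings())
chunks = splitter.create_documents([full_text])
print(f"Created {len(chunks)} semantic chunks")
```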

We found that the semantic splitting proposed by Greg Kamradt is really helpful for generating good summaries and detailed notes downstream. The common alternative is splitting on a roughly fixed unit of length, which can produce chunk boundaries that interrupt a continuous section of related content.

However, semantic splitting by itself didn’t work well for our use case. Using the LangChain-provided "SemanticChunker" with the default parameters created many small chunks that didn’t pair well with highly specific prompts focused on extracting detailed information.

We got around the small-chunk problem by taking a bottom-up approach: we let the chunker create a large number of chunks, many of them quite small, and then apply a post-processing step that combines neighboring chunks until a size limit is reached. This ultimately leads to a collection of chunks that are similar in length, each with semantically coherent content. We provide some code below to illustrate this process.

In general, we found that for this use case, chunks between 2,000 and 10,000 characters in length work well. Chunks below 1,000 characters often produced poorer results with the detailed notes prompt and could even lead to model hallucinations in some circumstances.

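Here is a sketch of that post-processing step, assuming chunks is the list of Document objects produced by the splitter:

```python
# Merge small neighboring semantic chunks until a target size limit is reached.
from langchain_core.documents import Document

def merge_small_chunks(chunks: list[Document], max_chars: int = 10_000) -> list[Document]:
    merged, buffer = [], ""
    for chunk in chunks:
        if len(buffer) + len(chunk.page_content) <= max_chars:
            # Keep growing the current merged chunk while it fits under the limit.
            buffer += ("\n" if buffer else "") + chunk.page_content
        else:
            merged.append(Document(page_content=buffer))
            buffer = chunk.page_content
    if buffer:
        merged.append(Document(page_content=buffer))
    return merged

merged_chunks = merge_small_chunks(chunks)
```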

Side notes on the “SemanticChunker”: 

The “SemanticChunker” doesn’t include an option for overlap to preserve context at chunk boundaries. In our implementation we added a "number_of_sentences" overlap parameter that augments a chunk with a set number of sentences at its start or end. This parameter improves the handling of content at chunk boundaries.

Additionally, there are many parameters that affect the output of the “SemanticChunker”. You can choose between different breakpoint threshold types and values, and you can even specify a target number of chunks. If the default parameters aren't working for you, try changing them to see if your output improves.
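For example, with recent versions of langchain-experimental the splitter exposes parameters along these lines:

```python
# Two ways to steer the SemanticChunker away from its defaults: adjust the
# breakpoint threshold, or target an approximate number of chunks.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=90.0,
)

# Alternatively, aim for a rough target number of chunks.
splitter = SemanticChunker(OpenAIEmbeddings(), number_of_chunks=25)
```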

Detailed Notes Generation 

Once we have the collection of semantically coherent chunks, we can apply the first step in our “map-reduce” process to create detailed notes for each chunk. The code for doing this is straightforward, as we apply our prompt to each chunk and then merge all of the outputs into a single set of notes that feeds the summarization process.

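A sketch of that mapping step is below, reusing the illustrative prompts and the merged_chunks list from the earlier sketches, with a synchronous client for clarity:

```python
# Map step: generate detailed notes for each chunk, then merge the outputs
# into a single set of notes that feeds the summarization process.
# Assumes NOTES_SYSTEM_PROMPT, NOTES_USER_PROMPT, and merged_chunks from the
# earlier sketches; reads the API key from OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def notes_for_chunk(chunk_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": NOTES_SYSTEM_PROMPT},
            {"role": "user", "content": NOTES_USER_PROMPT.format(chunk_text=chunk_text)},
        ],
    )
    return response.choices[0].message.content

detailed_notes = "\n\n".join(
    notes_for_chunk(chunk.page_content) for chunk in merged_chunks
)
```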

Key Insights and Summarization

Each of these processes takes a single text input and makes a single LLM call to generate a single text output. Most of the interesting content is embedded in the actual system and user prompts for each step. If the total length of the text input is close to, or longer than, the model context window size, you’ll have to take additional steps: either reduce the input text to better fit the context window, or break it up into chunks, apply the key insights and summarization prompts to the chunks, and then run a downstream reduction process. We did not handle this situation for the purposes of this application.

Cleaned Notes Generation

This is a second mapping step that is performed on the output of the detailed notes generation. The prompt instructs the model to remove various undesirable outputs, such as titles that aren't based on document content (but rather prompt instructions), details that are irrelevant, and other types of model response and output artifacts. While this step isn't strictly necessary, we found that it improved the final output.

Additional Development Considerations 

Asynchronous Requests

As mentioned earlier, one of the reasons we opted for the “map-reduce” approach in this application is performance. At the end of 2023, OpenAI released a version of their Python client that no longer relies internally on the requests package and that includes an asynchronous client. Because requests is a synchronous HTTP library, it used to be difficult to build performant production LLM applications directly on the OpenAI Python package. Now that the package natively supports asynchronous requests, it is much easier, which is perfect for the mapping steps of our application. Below is a very simple example of how to leverage this capability in your own Python applications.

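A minimal sketch of concurrent per-chunk requests with the asynchronous client (prompt details trimmed for brevity):

```python
# Run per-chunk LLM calls concurrently with OpenAI's asynchronous client.
# Reads the API key from the OPENAI_API_KEY environment variable.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def notes_for_chunk(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Take detailed notes on:\n\n{text}"}],
    )
    return response.choices[0].message.content

async def map_chunks(chunks: list[str]) -> list[str]:
    # asyncio.gather issues all requests concurrently and preserves input order.
    return await asyncio.gather(*(notes_for_chunk(c) for c in chunks))

results = asyncio.run(map_chunks(["chunk one ...", "chunk two ..."]))
```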

If you plan on making a large number of concurrent requests in your application, you may want to configure the underlying async HTTP client when you instantiate your AsyncOpenAI class.
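For example, connection limits and timeouts can be tuned by passing a configured httpx client:

```python
# Pass a tuned httpx.AsyncClient to AsyncOpenAI to control connection pooling
# and timeouts when issuing many concurrent requests.
import httpx
from openai import AsyncOpenAI

client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(max_connections=50, max_keepalive_connections=20),
        timeout=httpx.Timeout(60.0),
    ),
)
```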

Evaluating Accuracy

We applied a few techniques for evaluating the quality and accuracy of our application.

Recall-Oriented Understudy for Gisting Evaluation, or ROUGE, is a well-known automated approach for reviewing output quality in summarization applications. Traditionally, ROUGE compares a human-generated summary against a model-generated summary. However, we can use it to compare the detailed notes for each chunk of the transcript against the actual transcript chunk to get some insight into the accuracy of the application. We ran ROUGE-1, -2, and -L over the produced detailed notes and evaluated the results. What we were looking for was a high ROUGE-1 precision score, which indicates that a high percentage of the words in the detailed notes also appear in the source chunk. We expect the recall and F1 scores to be lower since we are comparing a summary against a transcript. Across a representative sample of chunks, we saw an average ROUGE-1 precision above 80%, which indicates that the application is mostly extracting content directly from the transcript, as desired.
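A sketch of that per-chunk comparison with the rouge-score package, assuming transcript_chunk and notes_chunk hold the normalized texts:

```python
# Score the detailed notes for one chunk against the source transcript chunk.
# High ROUGE-1 precision means most words in the notes also appear in the source.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(target=transcript_chunk, prediction=notes_chunk)

print(f"ROUGE-1 precision: {scores['rouge1'].precision:.2f}")
print(f"ROUGE-1 recall:    {scores['rouge1'].recall:.2f}")
```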

We employed one other manual, high-level approach to evaluate the accuracy of the application: Named Entity Recognition (NER). The goal of NER here was to help identify potential model hallucinations. We took a straightforward approach: generate a list of entities for each transcript chunk and its associated detailed notes chunk, then output the entities that were present in the detailed notes but not in the transcript. All of the entities found were either slightly different textual representations of real entities shared between the notes and the transcript, or headings and labels that the detailed notes generated for organizational clarity.
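A sketch of that entity comparison with spaCy, again assuming transcript_chunk and notes_chunk hold normalized text and that the en_core_web_sm model is installed:

```python
# Flag entities that appear in the detailed notes but not in the source chunk
# as candidate hallucinations for manual review.
import spacy

nlp = spacy.load("en_core_web_sm")

transcript_entities = {ent.text.lower() for ent in nlp(transcript_chunk).ents}
notes_entities = {ent.text.lower() for ent in nlp(notes_chunk).ents}

unmatched = notes_entities - transcript_entities
print(f"Entities in notes but not in transcript: {sorted(unmatched)}")
```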

As mentioned earlier in the approach section, the detailed notes prompt applied to small chunks at the start of the transcript could sometimes lead to model output hallucinations. These situations are detectable in both the ROUGE-1 and NER outputs in the form of much lower precision scores and new entities appearing.

There are a variety of other approaches out there; evaluating the accuracy of LLM-generated output is an active field of research⁷.

For our evaluations, we used the following Python packages: rouge-score, nltk, and spacy. Before running either of the two text-based evaluation methods listed here, it is important to normalize the text inputs to ensure each tool is comparing similarly formatted text.

Application and Results

We want to extract and present actionable information for our GLP-1 market case study. Let’s take a detailed look at the performance of the application on a real-world example: the Eli Lilly earnings call transcript from Q1 2024. At the end of this section, we provide the full output from running the application on that transcript.

Key Insights

The most important information for an analyst is the set of top-level insights they can glean from a document. These are the most critical pieces of information and highlight things like key financial and scientific updates, along with new strategic initiatives or anything else that impacts the direction of the company’s financial performance now and going forward. You will typically find this content at the top of articles covering the results of earnings calls or reports. Below we present the key insights identified by the LLM application.

[Application output: key insights extracted from the Eli Lilly Q1 2024 earnings call transcript; see the original post]

Each of the insights identified here is clear and concise. The top two points relate to Eli Lilly’s GLP-1 products: strong sales have driven the company to raise its full-year guidance, and the company is aggressively expanding manufacturing capacity to keep up with growing demand. The third insight is an effective summary of the most important clinical updates for the not-yet-approved pipeline products. The model clearly identifies each product, clinical trial phase, and general indication.

Impactful Summaries

We designed the summary prompt to produce specific product updates along with a collection of high-level financial metrics. Let's focus on the summary details provided for the GLP-1 products mentioned: Mounjaro and Zepbound.

[Application output: summary of Mounjaro and Zepbound product updates; see the original post]

Each piece of information provided is helpful in understanding the performance of these two products in their respective markets, and the output also includes helpful context around the sales numbers, showing the major growth in sales for Mounjaro, which has been on the market since mid-2022.

Although we have highlighted the output for the primary GLP-1 products, the application also provides high-level insights for other products, both those already on the market and those still in the pipeline. For example, one of the pipeline products Lilly is seeking approval for in the moderate-to-severe obstructive sleep apnea market is tirzepatide, which is the same active ingredient as in Mounjaro and Zepbound.

[Application output: summary of the tirzepatide obstructive sleep apnea pipeline update; see the original post]

We see the most important information right at the top; the LLM has extracted the positive clinical trial readout news and surfaced it for us. These types of details are useful for analysts looking at this information from both a financial perspective and a scientific perspective. Even more detailed information about these studies is presented in the detailed notes section. The model also produces updates on the other products mentioned in the transcript.

Beyond product-specific updates, the summary also provides company-level key financial metrics shown below:

[Application output: company-level key financial metrics; see the original post]

Detailed Notes

At the base of our information hierarchy is the detailed notes section. The intent of this section is to pull out all relevant facts, metrics, and useful context around them. For the Eli Lilly earnings call, we expect to see detailed notes covering the entirety of the call, from the company presentation portion to the question and answer section.

Let’s take a look at some of the produced content from the company presentation portion of the call. We’ll continue focusing on the GLP-1 products.

Mounjaro

  • Q1 sales were $1.8 billion globally and $1.5 billion in the U.S., up from $568 million and $536 million in Q1 2023, respectively.
  • Sequential quarter-over-quarter revenue in the U.S. was impacted by a one-time benefit from changes in estimates for rebates and discounts in Q4 2023, as well as lower inventory in the channel in Q1 2024 amidst strong demand.
  • Access levels across commercial end parties were consistent with high levels and near parity with established injectable incretin medicines.
  • Demand for tirzepatide is very strong, with hundreds of thousands of people filling scripts each week.
  • Production of salable doses of incretin medicine in the second half of 2024 will be at least 1.5 times the salable doses in the second half of 2023.
  • Multi-dose KwikPen delivery device for Mounjaro was recently approved in the EU, adding to the UK approval earlier this year. This approval applies to both the type 2 diabetes and weight management indications.

Zepbound

  • U.S. launch progress was exceptionally strong with over $0.5 billion in sales in Q1.
  • Approximately 67% access in the commercial segment as of April 1.

Compared to the mentions of Mounjaro in the insights and summary sections, we see added detail here including specific information about quarterly revenues, capacity, and demand. Although we don’t see new information about Zepbound, this is expected since there are no additional details about Zepbound presented at this point in the transcript. 

Here is a representative output for a question and answer exchange between an analyst and the company relating to Lilly’s pipeline product Tirzepatide for moderate-to-severe obstructive sleep apnea.

[Application output: detailed notes for the tirzepatide sleep apnea Q&A exchange; see the original post]

The application has nicely organized this content into a format that is easily digestible, including a topic heading along with the key question and answer facts. As mentioned earlier, we also prompted the application to capture speakers since that is of value in some circumstances. We can see that the analyst is trying to understand the future potential reimbursement dynamics for this drug product.

Conclusion

We’ve provided an in-depth overview of an application we built that leverages LLMs to generate hierarchical levels of detail for industry-specific documents. We demonstrated its effectiveness by analyzing an earnings call transcript from Eli Lilly, a key player in the GLP-1 market. The application successfully generated detailed notes, an impactful summary, and key insights, effectively showcasing its ability to transform long and complex documents into hierarchical levels of detail suitable for different audiences and analytical needs.

The approach we described is not limited to content relating to the GLP-1 market or call transcripts, but is generalizable to other domains where detailed, hierarchical information extraction is required.

References

  1. Li, T. et al (2024). LongICLBench: Long-context LLMs Struggle with Long In-context Learning, arXiv

  2. Liu, N. et al (2023). Lost in the Middle: How Language Models Use Long Contexts, arXiv

  3. LangChain: Summarize Text

  4. OpenAI Docs: Prompt engineering

  5. LangChain: How to split text based on semantic similarity

  6. Kamradt, G. (2024). 5 Levels of Text Splitting, GitHub

  7. Huang, J. (2024). Evaluating Large Language Model (LLM) Systems: Metrics, challenges, and best practices, Medium
