{"id":3089280,"date":"2024-01-29T13:34:21","date_gmt":"2024-01-29T18:34:21","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/benchmark-and-optimize-endpoint-deployment-in-amazon-sagemaker-jumpstart-amazon-web-services\/"},"modified":"2024-01-29T13:34:21","modified_gmt":"2024-01-29T18:34:21","slug":"benchmark-and-optimize-endpoint-deployment-in-amazon-sagemaker-jumpstart-amazon-web-services","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/benchmark-and-optimize-endpoint-deployment-in-amazon-sagemaker-jumpstart-amazon-web-services\/","title":{"rendered":"Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart\u00a0 | Amazon Web Services"},"content":{"rendered":"

When deploying a large language model (LLM), machine learning (ML) practitioners typically care about two measurements for model serving performance: latency, defined by the time it takes to generate a single token, and throughput, defined by the number of tokens generated per second. Although a single request to the deployed endpoint exhibits a throughput approximately equal to the inverse of model latency, this is not necessarily the case when multiple concurrent requests are sent to the endpoint simultaneously. Due to model serving techniques, such as server-side continuous batching of concurrent requests, latency and throughput have a complex relationship that varies significantly with model architecture, serving configuration, instance type hardware, number of concurrent requests, and variations in input payloads such as the number of input and output tokens.

This post explores these relationships via a comprehensive benchmarking of LLMs available in Amazon SageMaker JumpStart, including Llama 2, Falcon, and Mistral variants. With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models to deploy to dedicated Amazon SageMaker instances within a network-isolated environment. We provide theoretical principles on how accelerator specifications impact LLM benchmarking. We also demonstrate the impact of deploying multiple instances behind a single endpoint. Finally, we provide practical recommendations for tailoring the SageMaker JumpStart deployment process to align with your requirements on latency, throughput, cost, and constraints on available instance types. All the benchmarking results and recommendations are based on a versatile notebook that you can adapt to your use case.
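
As a point of reference for the rest of the post, the following is a minimal sketch of deploying one of these models with its default SageMaker JumpStart configuration using the SageMaker Python SDK. The model ID and instance type shown are illustrative choices, not requirements of the benchmark.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative model ID and instance type; adjust to the model and hardware you want to benchmark.
model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    instance_type="ml.g5.12xlarge",
)

# Deploy with the default JumpStart serving configuration for this model/instance pair.
# Llama 2 models require explicitly accepting the EULA at deployment time.
predictor = model.deploy(accept_eula=True)
```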

## Deployed endpoint benchmarking

The following figure shows the lowest latency (left) and highest throughput (right) values for deployment configurations across a variety of model types and instance types. Importantly, each of these model deployments uses the default configuration provided by SageMaker JumpStart given the desired model ID and instance type for deployment.

[Figure: lowest latency and highest throughput values by model and instance type]

These latency and throughput values correspond to payloads with 256 input tokens and 256 output tokens. The lowest latency configuration limits model serving to a single concurrent request, and the highest throughput configuration maximizes the possible number of concurrent requests. As we can see in our benchmarking, increasing concurrent requests monotonically increases throughput, with diminishing improvement at large numbers of concurrent requests. Additionally, models are fully sharded on the supported instance. For example, because the ml.g5.48xlarge instance has 8 GPUs, all SageMaker JumpStart models using this instance are sharded using tensor parallelism on all eight available accelerators.
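
For reference, a single benchmark request with this payload shape might look like the following sketch. The `inputs`/`parameters` schema shown matches the text generation containers used by these JumpStart models, but you should confirm the exact payload format for your chosen model, and the prompt here is only nominally sized rather than tokenized to exactly 256 input tokens.

```python
import time

# A prompt nominally sized to approximate 256 input tokens (illustrative only); the actual
# benchmark constructs prompts that hit the target input token count exactly.
payload = {
    "inputs": "Tell me a long, detailed story about a research team benchmarking "
              "large language model endpoints in the cloud. " * 8,
    "parameters": {"max_new_tokens": 256},
}

start = time.perf_counter()
response = predictor.predict(payload)
elapsed = time.perf_counter() - start

# With 256 output tokens, per-token latency and single-request throughput follow directly:
latency_ms_per_token = 1000 * elapsed / 256
throughput_tokens_per_sec = 256 / elapsed
print(latency_ms_per_token, throughput_tokens_per_sec)
```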

We can note a few takeaways from this figure. First, not all models are supported on all instances; some smaller models, such as Falcon 7B, don't support model sharding, whereas larger models have higher compute resource requirements. Second, as the degree of sharding increases, performance typically improves, but may not necessarily improve for small models. This is because small models such as 7B and 13B incur substantial communication overhead when sharded across too many accelerators. We discuss this in more depth later. Finally, ml.p4d.24xlarge instances tend to have significantly better throughput due to the memory bandwidth improvements of A100 over A10G GPUs. As we discuss later, the decision to use a particular instance type depends on your deployment requirements, including latency, throughput, and cost constraints.

How can you obtain these lowest latency and highest throughput configuration values? Let's start by plotting latency vs. throughput for a Llama 2 7B endpoint on an ml.g5.12xlarge instance for a payload with 256 input tokens and 256 output tokens, as seen in the following curve. A similar curve exists for every deployed LLM endpoint.

[Figure: latency vs. throughput for Llama 2 7B on ml.g5.12xlarge with 256 input and 256 output tokens]
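
If you run the linked notebook, a curve like this can be reproduced from the raw measurements with a short matplotlib sketch such as the one below. The hard-coded values are taken from the ml.g5.12xlarge row of the benchmarking table later in this post; substitute your own load test measurements.

```python
import matplotlib.pyplot as plt

# (concurrent requests, latency in ms/token, throughput in tokens/sec) for Llama 2 7B on
# ml.g5.12xlarge with 256 input and 256 output tokens, taken from the table later in this post.
results = [
    (1, 17, 59), (2, 17, 117), (4, 18, 223), (8, 20, 406),
    (16, 27, 616), (32, 38, 866), (64, 60, 1098), (128, 112, 1214),
]

throughputs = [r[2] for r in results]
latencies = [r[1] for r in results]

plt.plot(throughputs, latencies, marker="o")
for concurrency, latency, throughput in results:
    plt.annotate(str(concurrency), (throughput, latency))  # label each point with its concurrency
plt.xlabel("Throughput (tokens/sec)")
plt.ylabel("Latency (ms/token)")
plt.title("Llama 2 7B on ml.g5.12xlarge (256 input / 256 output tokens)")
plt.show()
```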

As concurrency increases, throughput and latency both monotonically increase. Therefore, the lowest latency point occurs at a concurrent request value of 1, and you can cost-effectively increase system throughput by increasing concurrent requests. There exists a distinct "knee" in this curve, beyond which the throughput gains associated with additional concurrency don't outweigh the associated increase in latency. The exact location of this knee is use case-specific; some practitioners may define the knee at the point where a pre-specified latency requirement is exceeded (for example, 100 ms/token), others may use load test benchmarks and queueing theory methods like the half-latency rule, and others may use theoretical accelerator specifications.
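
As one concrete illustration of the first approach (a pre-specified latency requirement), the following sketch picks an operating point from the `results` measurements defined in the plotting sketch above. The 100 ms/token budget is simply the example threshold mentioned above, not a recommendation.

```python
LATENCY_BUDGET_MS_PER_TOKEN = 100  # example latency requirement from the discussion above

# Keep only operating points that satisfy the latency budget, then take the one with the
# highest throughput; its concurrency level is a reasonable place to cap concurrent requests.
feasible = [r for r in results if r[1] <= LATENCY_BUDGET_MS_PER_TOKEN]
best_concurrency, best_latency, best_throughput = max(feasible, key=lambda r: r[2])
print(f"Cap at {best_concurrency} concurrent requests: "
      f"{best_throughput} tokens/sec at {best_latency} ms/token")
```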

We also note that the maximum number of concurrent requests is limited. In the preceding figure, the line trace ends with 192 concurrent requests. The source of this limitation is the SageMaker invocation timeout limit: SageMaker endpoints time out an invocation response after 60 seconds. This setting is account-specific and not configurable for an individual endpoint. For LLMs, generating a large number of output tokens can take seconds or even minutes, so large input or output payloads can cause invocation requests to fail. Furthermore, if the number of concurrent requests is very large, many requests will experience long queue times, pushing them past this 60-second timeout limit. For the purpose of this study, we use the timeout limit to define the maximum throughput possible for a model deployment. Importantly, although a SageMaker endpoint may handle a large number of concurrent requests without observing an invocation response timeout, you may want to define maximum concurrent requests with respect to the knee in the latency-throughput curve. This is likely the point at which you start to consider horizontal scaling, where a single endpoint provisions multiple instances with model replicas and load balances incoming requests between the replicas, to support more concurrent requests.
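
A minimal sketch of how such a concurrent load test can be driven from the client side with a thread pool follows. It reuses the `predictor` and `payload` objects from the earlier sketches, assumes 256 output tokens per request, and is deliberately simpler than the analysis implemented in the linked notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_invoke(request_payload):
    """Send one request and return its end-to-end latency in seconds (None on timeout/error)."""
    start = time.perf_counter()
    try:
        predictor.predict(request_payload)
    except Exception:
        return None  # for example, a request that hits the 60-second invocation timeout
    return time.perf_counter() - start

def load_test(request_payload, concurrent_requests, output_tokens=256):
    """Fire `concurrent_requests` simultaneous invocations and summarize latency and throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        latencies = list(pool.map(timed_invoke, [request_payload] * concurrent_requests))
    wall_time = time.perf_counter() - start

    completed = [latency for latency in latencies if latency is not None]
    throughput = len(completed) * output_tokens / wall_time                # aggregate tokens/sec
    latency_ms = 1000 * sum(completed) / (len(completed) * output_tokens)  # mean ms/token
    return throughput, latency_ms

for concurrency in [1, 2, 4, 8, 16, 32, 64]:
    print(concurrency, load_test(payload, concurrency))
```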

Taking this one step further, the following table contains benchmarking results for different configurations of the Llama 2 7B model, including different numbers of input and output tokens, instance types, and numbers of concurrent requests. Note that the preceding figure only plots a single row of this table.
**Number of total tokens: 512, number of output tokens: 256**

Throughput (tokens/sec):

| Concurrent requests | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|
| ml.g5.2xlarge | 30 | 54 | 115 | 208 | 343 | 475 | 486 | — | — | — |
| ml.g5.12xlarge | 59 | 117 | 223 | 406 | 616 | 866 | 1098 | 1214 | — | — |
| ml.g5.48xlarge | 56 | 108 | 202 | 366 | 522 | 660 | 707 | 804 | — | — |
| ml.p4d.24xlarge | 49 | 85 | 178 | 353 | 654 | 1079 | 1544 | 2312 | 2905 | 2944 |

Latency (ms/token):

| Concurrent requests | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|
| ml.g5.2xlarge | 33 | 33 | 35 | 39 | 48 | 97 | 159 | — | — | — |
| ml.g5.12xlarge | 17 | 17 | 18 | 20 | 27 | 38 | 60 | 112 | — | — |
| ml.g5.48xlarge | 18 | 18 | 19 | 22 | 32 | 50 | 101 | 171 | — | — |
| ml.p4d.24xlarge | 21 | 23 | 22 | 23 | 26 | 31 | 44 | 58 | 92 | 165 |

**Number of total tokens: 4096, number of output tokens: 256**

Throughput (tokens/sec):

| Concurrent requests | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|
| ml.g5.2xlarge | 20 | 36 | 48 | 49 | — | — | — | — | — | — |
| ml.g5.12xlarge | 33 | 58 | 90 | 123 | 142 | — | — | — | — | — |
| ml.g5.48xlarge | 31 | 48 | 66 | 82 | — | — | — | — | — | — |
| ml.p4d.24xlarge | 39 | 73 | 124 | 202 | 278 | 290 | — | — | — | — |

Latency (ms/token):

| Concurrent requests | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|
| ml.g5.2xlarge | 48 | 57 | 104 | 170 | — | — | — | — | — | — |
| ml.g5.12xlarge | 31 | 34 | 48 | 73 | 132 | — | — | — | — | — |
| ml.g5.48xlarge | 31 | 43 | 68 | 120 | — | — | — | — | — | — |
| ml.p4d.24xlarge | 26 | 27 | 33 | 43 | 66 | 107 | — | — | — | — |

(A dash indicates that no result was obtained at that concurrency level.)

We observe some additional patterns in this data. When increasing context size, latency increases and throughput decreases. For instance, on ml.g5.2xlarge with a concurrency of 1, throughput is 30 tokens/sec when the number of total tokens is 512, vs. 20 tokens/sec when the number of total tokens is 4,096. This is because it takes more time to process the larger input. We can also see that increasing GPU capability and sharding impacts the maximum throughput and maximum supported concurrent requests. The table shows that Llama 2 7B has notably different maximum throughput values for different instance types, and these maximum throughput values occur at different numbers of concurrent requests. These characteristics would drive an ML practitioner to justify the cost of one instance over another. For example, given a low latency requirement, the practitioner might select an ml.g5.12xlarge instance (4 A10G GPUs) over an ml.g5.2xlarge instance (1 A10G GPU). Given a high throughput requirement, the use of an ml.p4d.24xlarge instance (8 A100 GPUs) with full sharding would only be justified under high concurrency. Note, however, that it's often beneficial to instead load multiple inference components of a 7B model on a single ml.p4d.24xlarge instance; such multi-model support is discussed later in this post.
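
One way to frame the cost justification above is cost per generated token at each instance's peak operating point, as in the following sketch. The peak throughput values come from the 512-total-token rows of the preceding table, while the hourly prices are placeholders that you should replace with current SageMaker on-demand pricing for your Region.

```python
# Peak throughput (tokens/sec) from the preceding table (512 total tokens, 256 output tokens),
# paired with PLACEHOLDER hourly prices; substitute real SageMaker pricing for your Region.
instances = {
    "ml.g5.2xlarge":   {"peak_throughput": 486,  "price_per_hour": 1.50},
    "ml.g5.12xlarge":  {"peak_throughput": 1214, "price_per_hour": 7.00},
    "ml.p4d.24xlarge": {"peak_throughput": 2944, "price_per_hour": 38.00},
}

for name, spec in instances.items():
    tokens_per_hour = spec["peak_throughput"] * 3600
    cost_per_million_tokens = 1_000_000 * spec["price_per_hour"] / tokens_per_hour
    print(f"{name}: ${cost_per_million_tokens:.2f} per million output tokens at peak concurrency")
```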

The preceding observations were made for the Llama 2 7B model. However, similar patterns remain true for other models as well. A primary takeaway is that latency and throughput performance numbers are dependent on payload, instance type, and number of concurrent requests, so you will need to find the ideal configuration for your specific application. To generate the preceding numbers for your use case, you can run the linked notebook, where you can configure this load test analysis for your model, instance type, and payload.

## Making sense of accelerator specifications

Selecting suitable hardware for LLM inference relies heavily on specific use cases, user experience goals, and the chosen LLM. This section attempts to build an understanding of the knee in the latency-throughput curve with respect to high-level principles based on accelerator specifications. These principles alone don't suffice to make a decision: real benchmarks are necessary. The term device is used here to encompass all ML hardware accelerators. We assert that the knee in the latency-throughput curve is driven by one of two factors: