I am trying to estimate the cost of hosting a fine-tuned large language model for real-time inference. Hundreds of users will query the endpoint concurrently for multiple use cases such as classification and text generation. How do I determine the AWS EC2 instance type, the memory requirements (both GPU memory and RAM), and any other considerations? Also, what kind of latency and throughput should I expect?

Is my assumption correct that latency is measured as the time to generate a single token, so that throughput has to be calculated by first deciding the maximum number of tokens to support and the average token size?
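To make the question concrete, here is the back-of-envelope math I have in mind (a rough sketch only; the 7B parameter count, fp16 precision, 20% overhead factor, per-token latency, and user count below are all placeholder assumptions, not measured numbers):

```python
# Rough capacity-planning sketch. Every number here is an illustrative
# assumption, not a benchmark for any specific model or EC2 instance.

params_billion = 7          # assumed model size (e.g., a 7B model)
bytes_per_param = 2         # fp16/bf16 weights
overhead = 1.2              # assumed ~20% extra for activations / KV cache

gpu_mem_gb = params_billion * bytes_per_param * overhead
print(f"Estimated GPU memory for weights + overhead: {gpu_mem_gb:.0f} GB")

# Latency/throughput: is this the right way to think about it?
per_token_latency_s = 0.03  # assumed ~30 ms per generated token
max_output_tokens = 256     # assumed cap on generation length

request_latency_s = max_output_tokens * per_token_latency_s
tokens_per_sec_per_stream = 1 / per_token_latency_s
print(f"Latency per request at the cap: {request_latency_s:.1f} s")
print(f"Throughput per stream: {tokens_per_sec_per_stream:.0f} tokens/s")

# With N concurrent users, does required capacity really scale as
# N * tokens_per_sec_per_stream, or does batching change the math?
concurrent_users = 300
aggregate_tokens_per_sec = concurrent_users * tokens_per_sec_per_stream
print(f"Naive aggregate demand: {aggregate_tokens_per_sec:.0f} tokens/s")
```

Is this framing correct, or am I missing factors that would change the sizing?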
I have searched the internet and blog posts. I was expecting to find a guide that breaks this process down and calculates the infrastructure requirements based on model size, token size, concurrent users, latency requirements, etc.