I am trying to estimate the cost of hosting a fine-tuned large language model for real-time inference. Hundreds of users will query the endpoint concurrently for multiple use cases such as classification and text generation. How do I determine the AWS EC2 instance type, the memory requirements (both GPU memory and RAM), and any other considerations? Also, what kind of latency and throughput should I expect?

Is my assumption correct that latency is measured as the time to generate a single token, so that throughput has to be calculated by first deciding the maximum number of tokens to support and the average token size?
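To make the question concrete, here is the back-of-envelope math I have in mind (a rough sketch only; the 7B parameter count, fp16 precision, 20% overhead factor, per-token latency, and user count below are all placeholder assumptions, not measured numbers):

```python
# Rough capacity-planning sketch. Every number here is an illustrative
# assumption, not a benchmark for any specific model or EC2 instance.

params_billion = 7          # assumed model size (e.g., a 7B model)
bytes_per_param = 2         # fp16/bf16 weights
overhead = 1.2              # assumed ~20% extra for activations / KV cache

gpu_mem_gb = params_billion * bytes_per_param * overhead
print(f"Estimated GPU memory for weights + overhead: {gpu_mem_gb:.0f} GB")

# Latency/throughput: is this the right way to think about it?
per_token_latency_s = 0.03  # assumed ~30 ms per generated token
max_output_tokens = 256     # assumed cap on generation length

request_latency_s = max_output_tokens * per_token_latency_s
tokens_per_sec_per_stream = 1 / per_token_latency_s
print(f"Latency per request at the cap: {request_latency_s:.1f} s")
print(f"Throughput per stream: {tokens_per_sec_per_stream:.0f} tokens/s")

# With N concurrent users, does required capacity really scale as
# N * tokens_per_sec_per_stream, or does batching change the math?
concurrent_users = 300
aggregate_tokens_per_sec = concurrent_users * tokens_per_sec_per_stream
print(f"Naive aggregate demand: {aggregate_tokens_per_sec:.0f} tokens/s")
```

Is this framing correct, or am I missing factors that would change the sizing?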
I have searched the internet and blog posts. I was expecting to find a guide that breaks this process down and calculates the infrastructure requirements based on model size, token size, concurrent users, latency requirements, etc.