Why Does ChatGPT Keep Going Down?

OpenAI’s ChatGPT is the most popular Large Language Model (LLM) service right now, and it has taken the world by storm. Although it has been some time since ChatGPT was first released in November 2022 [1], the demand the service has seen has been huge, probably more than anyone could have predicted. So why does ChatGPT keep going down and crashing so often?

Big Hype, Huge Demand

The technology behind ChatGPT is relatively new, and when OpenAI announced ChatGPT it sent the internet into a frenzy, building up huge hype that translated into massive demand when the service was released [2]. Everyone wanted to try out this new and exciting technology. OpenAI probably anticipated this and proactively implemented a waiting list that anyone could sign up for. Over time, though, the number of users inevitably grew larger and larger, which put an incredible strain on the servers [3].

New Technology, New Challenges

Traditionally, services hosted on the internet are served by one or more servers in dedicated locations (data centers). These servers can usually handle hundreds or even thousands of requests concurrently. The technology behind typical web services has been refined and optimized over the years to the point where most physical resources (e.g., memory and network connections) are shared between requests.
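
As a rough illustration of that traditional model, here is a minimal sketch in Python using the standard library’s asyncio. The handler and port are made up for illustration; the point is that a single process can multiplex thousands of connections over one shared pool of CPU and memory.

```python
import asyncio

# Minimal sketch of a traditional web service: one process, one pool
# of shared resources, many concurrent requests. The handler logic is
# hypothetical; a real service would hit caches or databases through
# similarly shared connections.

async def handle_request(reader, writer):
    await reader.read(1024)  # read (and ignore) the request
    body = b"hello"
    header = f"HTTP/1.1 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    writer.write(header.encode() + body)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # One event loop can juggle thousands of sockets because each
    # request only borrows CPU and memory for the moments it needs them.
    server = await asyncio.start_server(handle_request, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```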

However, Deep Learning (DL) models like ChatGPT’s run on specialized hardware [4] that is often challenging to share between requests. Most DL models nowadays run on Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). This specialized hardware is essential for both training and serving most DL models because of the models’ massive size and the processing power they require.

Often, when using GPUs or TPUs to serve requests in reasonable time (near real-time in ChatGPT’s case), a portion of the hardware is dedicated exclusively to any given request. Increasing the number of requests served in parallel on the same hardware slows all of them down, and that’s if sharing is even possible given how much memory a model of ChatGPT’s size probably requires [5].
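
To see why memory alone makes sharing hard, here is a hedged back-of-the-envelope calculation. ChatGPT’s actual parameter count is not public, so the figures below assume a GPT-3-sized model (175 billion parameters) and 80 GB of VRAM per GPU purely for illustration.

```python
# Back-of-the-envelope VRAM estimate for serving a large LLM.
# The parameter count is an assumption (GPT-3's published 175B);
# ChatGPT's actual size has not been disclosed by OpenAI.
params = 175e9          # model parameters (assumed)
bytes_per_param = 2     # 16-bit (fp16/bf16) weights

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~350 GB

# No single GPU holds that much, so the weights must be sharded across
# several GPUs, and every concurrent request needs additional memory
# on top for its activations and attention cache.
gpu_vram_gb = 80        # illustrative per-GPU memory
print(f"GPUs needed just for the weights: {weights_gb / gpu_vram_gb:.1f}")  # ~4.4
```

Under those assumptions, a single copy of the model occupies a small cluster of accelerators before it has served a single request, which is why spare capacity can’t simply be borrowed from idle hardware.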

[Image: ChatGPT status message]

Hardware Availability, High Costs

More and more public clouds are offering the GPU hardware that ChatGPT relies on. However, this offering is relatively new compared to traditional CPU-based web hosting, and servers with GPUs cost much more to run than CPU-only servers. This likely factors into why OpenAI offers ChatGPT in two tiers: Free and Plus. The free tier offers an older GPT model with no guarantees of speed or availability, while Plus offers the latest and greatest model with faster response times (which may translate to less contention, or even exclusive use of the hardware that serves requests) [6].

[Image: ChatGPT plans]

Scaling Challenges, Spiky Traffic

Most normal web applications have some kind of established strategy for scaling and handling spiky traffic (that is, traffic that may suddenly ramp up and down). This can include techniques such as rate limiting, caching, and queuing (a minimal rate-limiting sketch follows below). Services also often take advantage of cloud “auto scaling” offerings, which dynamically add servers when traffic increases and remove them when it decreases. This can be done quite quickly, to the point where we now have concepts like serverless computing, which can add or remove traditional computing power in seconds.
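
As one concrete example of those strategies, here is a minimal token-bucket rate limiter sketch in Python. The capacity and refill rate are arbitrary illustrative values, not anything OpenAI has published.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: absorbs spiky traffic by
    allowing bursts up to `capacity` while enforcing an average rate
    of `refill_rate` requests per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or queue the request

# Illustrative values: bursts of up to 10 requests, 5/second sustained.
limiter = TokenBucket(capacity=10, refill_rate=5)
print(limiter.allow())  # True while tokens remain
```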

Scaling a DL model as huge as ChatGPT’s LLM is a different challenge, and it can’t be done as easily as with traditional applications. Scaling up DL models that rely on specialized hardware like GPUs takes time. A new server has to be provisioned and initialized, and then it has to download the DL model to disk if the model isn’t baked into the server’s image. Next, the service that runs the DL model has to start up, initializing the drivers and libraries the model depends on (such as Python’s PyTorch). Finally, the model has to be loaded into the hardware’s memory (e.g., VRAM for GPUs) before it is ready to start serving traffic.
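
To make that startup sequence concrete, here is a hedged sketch of what a GPU model server has to do before it can serve its first request. The checkpoint path and the use of a single GPU are placeholders; a ChatGPT-scale deployment would shard a far larger checkpoint across many GPUs.

```python
import time
import torch  # importing the DL framework alone can take seconds

def timed(label, fn):
    """Run fn and print how long it took; each step adds to the cold start."""
    start = time.monotonic()
    result = fn()
    print(f"{label}: {time.monotonic() - start:.1f}s")
    return result

# 1. Read the checkpoint from disk into host RAM. "model.pt" is a
#    placeholder path; ChatGPT-scale checkpoints run to hundreds of
#    gigabytes and may first need to be downloaded to the new server.
model = timed("load from disk", lambda: torch.load("model.pt", map_location="cpu"))

# 2. Initialize the GPU driver/runtime and copy the weights into VRAM.
model = timed("move to GPU", lambda: model.to("cuda"))
model.eval()

# Only now can the server join the pool and accept inference traffic,
# which is why scaling out GPU servers is measured in minutes.
```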

[Image: ChatGPT scaling message]

As you can probably guess, scaling GPU servers like this takes time measured in minutes, as opposed to traditional computing, which can scale in mere seconds.

You might wonder, then: what about pre-warming or predictive scaling? The issue with pre-warming in this case is that once a GPU loses power (e.g., when its server shuts down), it loses all the data in its memory, which has to be loaded all over again at startup.

Predictive scaling might be possible, but spiky traffic is difficult to predict effectively. We don’t know what traffic patterns ChatGPT receives, but any spike has the potential to affect all traffic until scaling has a chance to kick in.

[Image: ChatGPT status history]

New Territory

The fact is that hosting LLMs at the scale ChatGPT is running is still relatively new in the tech world. There are no doubt many engineering challenges that the OpenAI teams face quite often, learning and fixing as they go. The ChatGPT service these days is no doubt in much better shape when it comes to stability, reliability, and response times than it was closer to its initial release. That would have required a team effort from the developers, data scientists, and DevOps/SRE engineers.

Takeaway

ChatGPT is a relatively new service running bleeding-edge technology on nontraditional, specialized hardware. No one knows for certain why the service seems to face issues so often that render it temporarily unavailable. It faces many technical challenges around reliability, stability, and scaling that most web services never have to worry about. As the OpenAI teams learn more, the service will no doubt improve over time.

References

  1. Introducing ChatGPT. (n.d.). OpenAI. https://openai.com/blog/chatgpt
  2. Jones, T. (2022, December 12). ChatGPT is being used by so many people its servers are struggling to keep up. SmartCompany. https://www.smartcompany.com.au/technology/chatgpt-servers-struggling-to-keep-up
  3. OpenAI status. (n.d.). https://status.openai.com/
  4. Horsey, J. (2023, August 4). The insane hardware powering ChatGPT artificial intelligence. Geeky Gadgets. https://www.geeky-gadgets.com/hardware-powering-chatgpt/
  5. TitanML. (2023, November 15). Harmonizing Multi-GPUs: Efficient scaling of LLM inference. Medium. https://medium.com/@TitanML/harmonizing-multi-gpus-efficient-scaling-of-llm-inference-2e79b2b9d8cc
  6. Pricing. (n.d.). OpenAI. https://openai.com/pricing