Understanding LLM APIs as a Frontend Engineer

Tokens, context windows, function calling, structured output. The parts of the model API that shape your UI decisions.

You're not going to be training models. You probably won't be running them either. But you'll be calling them every day, and the parts of the API that bleed into your UI are not the parts most tutorials focus on. Tutorials teach you how to make a request. I'm going to teach you what the response actually means and how it constrains your design choices.

Spend twenty minutes here and you'll have the vocabulary to push back on backend decisions, design better UIs around model capabilities, and stop pretending you understand what people mean when they say "we're at the context limit."

01. Tokens, and why they cost what they cost

A token is the unit of LLM input and output. Roughly four characters of English text — though it varies wildly. "Hello" is one token. "ChatGPT" is one or two depending on the tokenizer. Chinese characters often take one or more tokens each. Code with lots of symbols can be way more tokens than the same byte length of English prose.

Why you care: everything is priced in tokens. Context windows, rate limits, your monthly bill — all denominated in tokens, not characters or words. The backend usually returns token counts alongside responses; show them in cost dashboards and "you are approaching your usage" UIs.

Practical tip: don't try to count tokens client-side. Different models use different tokenizers and your numbers will drift. Trust the count the API returns.
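
A minimal sketch of trusting the server-side count, assuming a hypothetical backend proxy at /api/chat that forwards the provider's usage object unchanged (the route and response shape here are illustrative, not any provider's actual API):

// Hypothetical proxy response shape; your backend forwards the provider's
// `usage` field as-is.
interface ChatResponse {
  text: string;
  usage: { input_tokens: number; output_tokens: number };
}

async function sendMessage(prompt: string): Promise<ChatResponse> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const data: ChatResponse = await res.json();
  // Surface the counts the API returned; never estimate tokens client-side.
  console.log(`${data.usage.input_tokens} in, ${data.usage.output_tokens} out`);
  return data;
}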

02. Context window — the wall you keep hitting

The context window is the maximum total tokens — system prompt + tool definitions + conversation history + retrieved documents + user input + planned output — that fit in one model call. Today's numbers: Claude Sonnet 4.5 is 200K, Gemini 1.5 Pro is 1M+, GPT-4o is 128K. When you exceed it, the call fails.

This is the constraint you'll bump into the most. UI implications:

  • Warn users when they're approaching the limit. Don't wait for the call to fail.
  • Truncate or summarize old messages. Most products drop the oldest messages first (sketched after this list); smarter products run a "compact older history" step.
  • Stop showing every tool result in full. Long tool outputs (web pages, long docs) burn context. Truncate the rendered version, keep the full version available on click.
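
A naive version of that first strategy, as a sketch: drop the oldest messages until the rest fit a budget. The shapes and names are illustrative, and the per-message token counts should come from the API's reported usage, not a client-side tokenizer:

interface Message {
  role: 'user' | 'assistant';
  content: string;
  tokens: number; // from the API's reported usage, not estimated client-side
}

// Keep the newest messages that fit under the budget; drop the oldest first.
function trimToBudget(history: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let total = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (total + history[i].tokens > budget) break;
    total += history[i].tokens;
    kept.unshift(history[i]);
  }
  return kept;
}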

If your product has long conversations or large attached documents, context management is the silent feature your competitors are quietly investing in.

03. Streaming vs non-streaming, when to use which

Streaming responses arrive incrementally — first token in 200–600ms, full response in 4–8 seconds. Non-streaming responses block until done, then arrive all at once. Total time is often identical. Perceived time is dramatically different.

Default to streaming for any user-facing chat or completion. Use non-streaming when you need the full response before doing anything (parsing structured output for downstream UI, batch processing, anything where partial doesn't make sense).

ℹ A quick framing

If someone asks "should we stream this response?" the answer is almost always yes for human consumption and almost always no for machine consumption. Humans have a 200ms patience budget. Machines don't.
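
To make the streaming path concrete, here's a minimal sketch of consuming a streamed response in the browser. The endpoint and plain-text framing are illustrative; real provider streams are server-sent events carrying JSON deltas, which your backend can relay or simplify:

async function streamCompletion(
  prompt: string,
  onChunk: (text: string) => void,
): Promise<void> {
  const res = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Append each chunk to the UI as it arrives: first paint in ~200-600ms
    // instead of a 4-8 second blank wait.
    onChunk(decoder.decode(value, { stream: true }));
  }
}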

04. Function calling: the model emits structured intent

Sometimes you want the model to do something instead of just talking. Modern LLMs support function calling (or "tool use" in Anthropic's terms): you tell the model what functions are available, with typed input schemas; the model returns a structured payload — function name plus arguments — when it judges that a tool would help.

Your job on the frontend: render the tool call (often as a card in an agent trace), execute it via your backend, feed the result back into the conversation. The model then continues with the result in mind.

// Tool definition, Anthropic-style (OpenAI's equivalent nests the same
// JSON Schema under `parameters` instead of `input_schema`).
const tools = [{
  name: 'get_weather',
  description: 'Get current weather for a city',
  input_schema: {
    type: 'object',
    properties: { city: { type: 'string' } },
    required: ['city'],
  },
}];

// Model response might then include a tool-use block:
// { type: 'tool_use', name: 'get_weather', input: { city: 'Bangalore' } }

This is the foundation of every "agent" you've ever seen. The agent is just an LLM in a loop, calling tools and reading the results.
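
That loop, as a sketch. callModel and executeTool are hypothetical stand-ins for your backend calls, and the message shapes are simplified:

type ToolCall = { name: string; input: Record<string, unknown> };
type ModelTurn = { text?: string; toolCall?: ToolCall };

async function runAgent(
  messages: unknown[],
  callModel: (msgs: unknown[]) => Promise<ModelTurn>,
  executeTool: (call: ToolCall) => Promise<string>,
): Promise<string> {
  while (true) {
    const turn = await callModel(messages);
    // No tool requested: the model is done talking.
    if (!turn.toolCall) return turn.text ?? '';
    messages.push({ role: 'assistant', toolCall: turn.toolCall });
    // Execute via your backend; render the call and result as trace cards.
    const result = await executeTool(turn.toolCall);
    messages.push({ role: 'tool', content: result });
  }
}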

05. Structured output (JSON mode) — stop parsing free text

If you've ever tried to extract JSON from a free-text LLM response with a regex, you know the pain. The model puts a stray comma in. It adds a trailing explanation. It wraps the JSON in markdown fences sometimes. Your parser breaks 5% of the time.

Modern APIs solve this with structured output: you provide a JSON schema, the model is forced to emit valid JSON matching it. OpenAI calls it response_format: { type: 'json_schema' }. Anthropic uses tool-use coercion. Either way, your parser never breaks.

If your feature has the LLM populating a UI element (a card, a form, a list filter), structured output is the right answer. The added latency is negligible; the reliability gain is enormous.
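
As a sketch, an OpenAI-style request body with a schema for a card the UI will render (field names follow OpenAI's chat completions API; the schema itself is illustrative):

// `strict: true` forces output that validates against the schema.
const body = {
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Extract the event details.' }],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'event_card',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          title: { type: 'string' },
          date: { type: 'string' },
        },
        required: ['title', 'date'],
        additionalProperties: false,
      },
    },
  },
};
// JSON.parse on the returned message content now always succeeds.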

06. Prompt caching — the lever everyone underuses

Anthropic and OpenAI both let you cache stable prefixes of your prompts (system message, tool definitions, large reference documents). Subsequent requests with the same prefix skip the recomputation, dropping latency by 30–80% and cost by up to 90% on the cached portion (exact discounts vary by provider).

The frontend lever: keep the early parts of your prompt stable across calls. Don't put dynamic data first. Put system prompts, tool defs, and any RAG-retrieved documents at the top — those should change infrequently. Put the user's specific message last. The model will cache the prefix and your next call will fly.
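
With Anthropic's API, for instance, you mark where the stable prefix ends with a cache_control breakpoint. A sketch, with placeholder content:

declare const LONG_SYSTEM_PROMPT_AND_DOCS: string; // stable across requests
declare const userMessage: string;                 // changes every request

const body = {
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  // Everything up to the cache_control marker is the cacheable prefix.
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_PROMPT_AND_DOCS,
      cache_control: { type: 'ephemeral' },
    },
  ],
  // Only the user message varies, so the prefix cache keeps hitting.
  messages: [{ role: 'user', content: userMessage }],
};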

A 5-minute change that often turns a sluggish product into a snappy one. Worth knowing.

Key Takeaways

  • 01. Tokens are the universal unit — pricing, limits, context windows. Trust API counts; do not roll your own tokenizer client-side.
  • 02. Context window management (truncation, summarization, document chunking in the prompt) is silent infrastructure your competitors are investing in.
  • 03. Stream for humans; non-stream for machines.
  • 04. Function calling is how the model returns structured intent; it's the foundation of every agent.
  • 05. Structured output (JSON schema mode) eliminates the entire class of parse errors that plague free-text extraction.
  • 06. Prompt caching is a 5-minute change that often delivers 30–80% latency wins. Keep prompt prefixes stable.