How to Integrate Chinese LLM APIs into Your Global Application
You have heard that Chinese LLM APIs offer significant cost savings. Now you want to actually wire them into your application. This guide walks through the entire process — from getting your API key to production-ready code — with concrete examples in curl, Python, and Node.js.
The good news: if your application already works with the OpenAI or Anthropic API, the integration is a configuration change, not a rewrite.
Overview: Accessing Chinese LLMs from Outside China
For international users, access to Chinese LLM APIs typically runs through aggregation platforms that sit between you and the upstream Chinese model providers. These platforms expose standard API protocols on globally accessible endpoints, so you do not need a VPN, a Chinese phone number, or any special network configuration.
The typical architecture looks like this:
Your Application → Aggregation Platform (global endpoint) → Chinese LLM Provider
The aggregation platform handles authentication with the upstream provider, request routing, load balancing, and billing. From your application's perspective, it looks and behaves exactly like calling OpenAI or Anthropic directly.
Supported protocols:
- OpenAI-compatible: /v1/chat/completions — works with any OpenAI SDK client
- Claude-native: /v1/messages — works with the Anthropic SDK
- Responses API: /v1/responses — supports the newer OpenAI agent-style format
Base URL: https://gpt-agent.cc/v1
All examples in this guide use this endpoint. Replace it with your platform's URL if different.
Step 1: Get an API Key
The purchase flow is straightforward:
- Go to the aggregation platform's website (e.g., https://gpt-agent.cc).
- Register an account with your email address.
- Navigate to the billing or token purchase page.
- Select a token package. Start small ($10-$20) for testing.
- Pay using an international credit card, PayPal, or USDT.
- Copy your API key from the dashboard. It is available immediately after payment.
Your API key works across all supported models. There is no need to get separate keys for different providers.
Step 2: Configure Your Client
Claude Code
If you are using Claude Code as your development assistant, set the API endpoint in your configuration:
{
  "apiBaseUrl": "https://gpt-agent.cc",
  "apiKey": "your-api-key"
}
Claude Code will route all requests through the aggregation platform, giving you access to Claude models at reduced token rates.
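If you prefer environment variables over a config file, the Anthropic tooling conventionally reads the variables below; confirm the exact names against your Claude Code version's documentation before relying on them:

```shell
# Alternative: point Claude Code at the platform via environment variables
# (variable names per Anthropic's tooling; verify against your version's docs).
export ANTHROPIC_BASE_URL="https://gpt-agent.cc"
export ANTHROPIC_AUTH_TOKEN="your-api-key"
```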
Cursor
In Cursor's settings, navigate to the AI configuration section and set:
- API Base URL: https://gpt-agent.cc/v1
- API Key: your key from the dashboard
VS Code (with Continue or similar extensions)
Most OpenAI-compatible VS Code extensions allow you to set a custom base URL. Update the extension settings:
{
  "openai.baseUrl": "https://gpt-agent.cc/v1",
  "openai.apiKey": "your-api-key"
}
Custom Applications
For your own applications, the configuration depends on which SDK you use. See the code examples below.
Step 3: Code Examples
curl
The simplest way to test your connection:
curl https://gpt-agent.cc/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": "Explain quantum computing in one paragraph."}
]
}'
To use a Chinese model instead, change the model parameter:
curl https://gpt-agent.cc/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "deepseek-r1",
"messages": [
{"role": "user", "content": "Solve this step by step: What is 23! / 20!?"}
]
}'
Python (OpenAI SDK)
Install the OpenAI Python package if you have not already:
pip install openai
Basic completion:
from openai import OpenAI

client = OpenAI(
    base_url="https://gpt-agent.cc/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="qwen-max",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the main exports of Vietnam?"},
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)
Python (Anthropic SDK — Claude-native protocol)
pip install anthropic
import anthropic

client = anthropic.Anthropic(
    base_url="https://gpt-agent.cc",
    api_key="your-api-key",
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
)

print(message.content[0].text)
Node.js (OpenAI SDK)
npm install openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://gpt-agent.cc/v1",
  apiKey: "your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "deepseek-v3",
    messages: [
      { role: "user", content: "Explain the difference between REST and GraphQL." },
    ],
    temperature: 0.7,
  });
  console.log(response.choices[0].message.content);
}

main();
Streaming Support
All endpoints support streaming responses, which is critical for user-facing applications where perceived latency matters.
Python streaming example:
from openai import OpenAI

client = OpenAI(
    base_url="https://gpt-agent.cc/v1",
    api_key="your-api-key",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Node.js streaming example:
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a short story about a robot." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  process.stdout.write(content);
}
Streaming works identically to the official OpenAI and Anthropic APIs. No special configuration is needed.
Error Handling and Troubleshooting
Common issues and how to resolve them:
401 Unauthorized: Your API key is invalid or expired. Double-check the key in your dashboard. Ensure there are no trailing spaces or newline characters.
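A quick way to rule out whitespace problems is to load the key from an environment variable and strip it before use. A minimal sketch (the variable name here is illustrative):

```python
import os

def load_api_key(env_var: str = "GPT_AGENT_API_KEY") -> str:
    # Strip stray spaces and newlines picked up during copy-paste,
    # a frequent cause of spurious 401 responses.
    return os.environ.get(env_var, "").strip()
```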
402 Payment Required / Insufficient Balance: Your prepaid token balance is depleted. Top up your account through the platform dashboard.
429 Too Many Requests: You have hit the rate limit. Most aggregation platforms allow higher concurrency than direct provider APIs, but limits still exist. Implement exponential backoff in your client:
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://gpt-agent.cc/v1", api_key="your-api-key")

def call_with_retry(messages, model="gpt-4o", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
500 / 502 / 503 Server Errors: Temporary upstream issues. The aggregation platform usually recovers automatically. Retry after a brief delay. If errors persist for more than a few minutes, check the platform's status page.
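For automated retries on 5xx responses, a capped exponential backoff with jitter avoids hammering a recovering upstream. A stdlib-only sketch (defaults are illustrative):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    # Yield capped exponential delays with jitter (roughly 1s, 2s, 4s, ...),
    # so concurrent clients do not retry in lockstep.
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
```

Sleep for each yielded delay between attempts, and give up once the generator is exhausted.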
Timeout errors: For long-running completions (large max_tokens or complex reasoning models like DeepSeek-R1), increase your client timeout:
client = OpenAI(
    base_url="https://gpt-agent.cc/v1",
    api_key="your-api-key",
    timeout=120.0,  # seconds
)
Performance Tips
Choose the right model for the task. Do not use GPT-4o for simple classification tasks where GLM-4-Flash or Qwen-Turbo would suffice at 1/50th the cost. Match model capability to task complexity.
Leverage caching. If your application sends similar prompts repeatedly (e.g., customer service templates), the platform's cache-hit mechanism will automatically reduce your costs. Structure your prompts with a stable system message and variable user input to maximize cache hits.
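To make the stable-prefix idea concrete, here is a minimal sketch; the system prompt and its wording are placeholders:

```python
# Hypothetical stable prefix. Keeping it byte-identical across requests
# is what makes it cacheable.
SYSTEM_PROMPT = (
    "You are a customer-service assistant for Acme Corp. "
    "Answer concisely and cite the relevant policy section."
)

def build_messages(user_input: str) -> list:
    # Stable system message first, variable user content last.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]
```

Pass the result as the messages parameter of any chat completion call; only the final user entry changes between requests.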
Use streaming for user-facing applications. Streaming reduces perceived latency significantly. The first token arrives much faster than waiting for the complete response.
Optimize prompt length. Input tokens cost money too. Keep system prompts concise. Avoid stuffing unnecessary context into every request.
Batch non-urgent requests. If you have workloads that are not time-sensitive (e.g., nightly data processing), batch them during off-peak hours when the platform may have lower latency.
Supported Models and Their Strengths
Here is a quick reference for choosing the right model:
| Model | Best For | Context Window | Relative Cost |
|---|---|---|---|
| GPT-4o | General purpose, complex reasoning | 128K | Medium-High |
| Claude 3.5 Sonnet | Coding, analysis, long documents | 200K | Medium-High |
| DeepSeek-R1 | Math, logic, step-by-step reasoning | 64K | Medium |
| DeepSeek-V3 | General purpose, good value | 128K | Low-Medium |
| Qwen-Max | Multilingual, coding, reasoning | 128K | Medium |
| Qwen-Plus | Balanced performance and cost | 128K | Low-Medium |
| Qwen-Turbo | Speed-critical, simple tasks | 128K | Low |
| Kimi (Moonshot) | Very long documents, research | 200K | Medium |
| GLM-4 | Bilingual tasks, general purpose | 128K | Low-Medium |
| GLM-4-Flash | High-volume, cost-sensitive | 128K | Very Low |
| MiniMax | Conversational AI, chatbots | 64K | Low |
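One way to apply this reference in code is a small routing map. The tiers and model IDs below are illustrative; verify the exact identifiers against your platform's model catalog:

```python
# Illustrative routing map; model IDs must match your platform's catalog.
MODEL_BY_TIER = {
    "simple": "glm-4-flash",     # high-volume, cost-sensitive tasks
    "standard": "qwen-plus",     # balanced performance and cost
    "reasoning": "deepseek-r1",  # math, logic, step-by-step problems
    "long_context": "kimi",      # very long documents and research
}

def pick_model(tier: str) -> str:
    # Fall back to a good-value general-purpose model for unknown tiers.
    return MODEL_BY_TIER.get(tier, "deepseek-v3")
```

Because every model is reachable with the same key and endpoint, switching tiers is just a string change in the request.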
Conclusion
Chinese LLM API integration is not complicated. If you can call the OpenAI API, you can call a Chinese LLM through an aggregation platform. The protocol is the same, the SDKs are the same, and the code changes are minimal — typically just a base URL and API key swap.
The real advantage is access: one API key gives you global access to dozens of models spanning both Western and Chinese providers, all at prices significantly lower than going direct. For teams in Southeast Asia, Europe, or anywhere else looking to optimize their AI spend, this is the most practical path available today.
Start with a small test balance, verify the models meet your quality requirements, and scale from there.