The Azure AI model catalog offers a large selection of Azure AI Foundry Models from a wide range of providers, and you have several options for deploying them. This article lists inference examples for serverless API deployments. Some models, such as Nixtla’s TimeGEN-1 and Cohere rerank, require you to use custom APIs from the model providers to perform inferencing. Others support inferencing through the Model Inference API. You can find more details about individual models by reviewing their model cards in the model catalog in the Azure AI Foundry portal.

Cohere

The Cohere family of models includes various models optimized for different use cases, including rerank, chat completions, and embeddings models.

Inference examples: Cohere command and embed

The following table provides links to examples of how to use Cohere models.
| Description | Language | Sample |
| --- | --- | --- |
| Web requests | Bash | Command-R, Command-R+, cohere-embed.ipynb |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for Python | Python | Link |
| OpenAI SDK (experimental) | Python | Link |
| LangChain | Python | Link |
| Cohere SDK | Python | Command, Embed |
| LiteLLM SDK | Python | Link |

Retrieval Augmented Generation (RAG) and tool use samples: Cohere command and embed

| Description | Packages | Sample |
| --- | --- | --- |
| Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - LangChain | langchain, langchain_cohere | cohere_faiss_langchain_embed.ipynb |
| Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - LangChain | langchain, langchain_cohere | command_faiss_langchain.ipynb |
| Use Cohere Command R/R+ to answer questions from data in AI search vector index - LangChain | langchain, langchain_cohere | cohere-aisearch-langchain-rag.ipynb |
| Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK | cohere, azure_search_documents | cohere-aisearch-rag.ipynb |
| Command R+ tool/function calling, using LangChain | cohere, langchain, langchain_cohere | command_tools-langchain.ipynb |

Cohere rerank

To perform inferencing with Cohere rerank models, you’re required to use Cohere’s custom rerank APIs. For more information on the Cohere rerank model and its capabilities, see Cohere rerank.

Pricing for Cohere rerank models

Queries, not to be confused with a user’s query, is a pricing meter that refers to the cost associated with the tokens used as input for inference of a Cohere Rerank model. Cohere counts a single search unit as a query with up to 100 documents to be ranked. A document is split into multiple chunks when its length, including the length of the search query, exceeds 500 tokens (for Cohere-rerank-v3.5) or 4096 tokens (for Cohere-rerank-v3-English and Cohere-rerank-v3-multilingual). Each chunk counts as a single document. See the Cohere model collection in Azure AI Foundry portal.
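The counting rule above can be turned into a rough estimator. The following is a minimal sketch, not Cohere's billing logic: the function names are hypothetical, and it rests on two assumptions that the text does not confirm. The first is that every chunk carries the query, so each chunk holds at most the limit minus the query length in document tokens. The second is that ranked documents beyond 100 spill into additional search units.

```python
import math

# Per-document token limits quoted in the pricing description.
TOKEN_LIMITS = {
    "Cohere-rerank-v3.5": 500,
    "Cohere-rerank-v3-English": 4096,
    "Cohere-rerank-v3-multilingual": 4096,
}
DOCS_PER_SEARCH_UNIT = 100


def count_chunks(doc_tokens: int, query_tokens: int, limit: int) -> int:
    """Estimate how many billable documents a single document becomes."""
    if doc_tokens + query_tokens <= limit:
        return 1
    # Assumption: each chunk includes the query, so only the remainder
    # of the limit is available for document text.
    room = limit - query_tokens
    if room <= 0:
        raise ValueError("query alone meets or exceeds the token limit")
    return math.ceil(doc_tokens / room)


def estimate_search_units(doc_token_counts, query_tokens: int, model: str) -> int:
    """Estimate search units for one query against a list of documents."""
    limit = TOKEN_LIMITS[model]
    chunks = sum(count_chunks(n, query_tokens, limit) for n in doc_token_counts)
    # Assumption: every started group of 100 chunks is one search unit.
    return math.ceil(chunks / DOCS_PER_SEARCH_UNIT)
```

Under these assumptions, fifty 1,000-token documents with a 50-token query become 150 chunks (two search units) on Cohere-rerank-v3.5, but stay at 50 documents (one search unit) on Cohere-rerank-v3-English.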

Core42

The following table provides links to examples of how to use Jais models.
| Description | Language | Sample |
| --- | --- | --- |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for Python | Python | Link |

DeepSeek

The DeepSeek family of models includes DeepSeek-R1, which uses a step-by-step training process to excel at reasoning tasks such as language, scientific reasoning, and coding; DeepSeek-V3-0324, a Mixture-of-Experts (MoE) language model; and more. The following table provides links to examples of how to use DeepSeek models.
| Description | Language | Sample |
| --- | --- | --- |
| Azure AI Inference package for Python | Python | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for Java | Java | Link |

Meta

Meta Llama models and tools are a collection of pretrained and fine-tuned generative AI text and image reasoning models. Meta models range in scale to include:
  • Small language models (SLMs) like 1B and 3B Base and Instruct models for on-device and edge inferencing
  • Mid-size large language models (LLMs) like 7B, 8B, and 70B Base and Instruct models
  • High-performance models like Meta Llama 3.1-405B Instruct for synthetic data generation and distillation use cases
  • High-performance, natively multimodal models, Llama 4 Scout and Llama 4 Maverick, which leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding
The following table provides links to examples of how to use Meta Llama models.
| Description | Language | Sample |
| --- | --- | --- |
| CURL request | Bash | Link |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for Python | Python | Link |
| Python web requests | Python | Link |
| OpenAI SDK (experimental) | Python | Link |
| LangChain | Python | Link |
| LiteLLM | Python | Link |

Microsoft

Microsoft models include various model groups such as MAI models, Phi models, healthcare AI models, and more. To see all the available Microsoft models, view the Microsoft model collection in Azure AI Foundry portal. The following table provides links to examples of how to use Microsoft models.
| Description | Language | Sample |
| --- | --- | --- |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for Python | Python | Link |
| LangChain | Python | Link |
| Llama-Index | Python | Link |
See the Microsoft model collection in Azure AI Foundry portal.

Mistral AI

Mistral AI offers two categories of models:
  • Premium models: These include Mistral Large, Mistral Small, Mistral-OCR-2503, Mistral Medium 3 (25.05), and Ministral 3B models, and are available as serverless APIs with pay-as-you-go token-based billing.
  • Open models: These include Mistral-small-2503, Codestral, and Mistral Nemo, which are available as serverless APIs with pay-as-you-go token-based billing, and Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01, which are available to download and run on self-hosted managed endpoints.
The following table provides links to examples of how to use Mistral models.
| Description | Language | Sample |
| --- | --- | --- |
| CURL request | Bash | Link |
| Azure AI Inference package for C# | C# | Link |
| Azure AI Inference package for JavaScript | JavaScript | Link |
| Azure AI Inference package for Python | Python | Link |
| Python web requests | Python | Link |
| OpenAI SDK (experimental) | Python | Mistral - OpenAI SDK sample |
| LangChain | Python | Mistral - LangChain sample |
| Mistral AI | Python | Mistral - Mistral AI sample |
| LiteLLM | Python | Mistral - LiteLLM sample |

Nixtla

Nixtla’s TimeGEN-1 is a generative pre-trained forecasting and anomaly detection model for time series data. TimeGEN-1 can produce accurate forecasts for new time series without training, using only historical values and exogenous covariates as inputs. To perform inferencing, TimeGEN-1 requires you to use Nixtla’s custom inference API. For more information on the TimeGEN-1 model and its capabilities, see Nixtla.

Estimate the number of tokens needed

Before you create a TimeGEN-1 deployment, it’s useful to estimate the number of tokens that you plan to consume and be billed for. One token corresponds to one data point in your input dataset or output dataset. Suppose you have the following input time series dataset:
| Unique_id | Timestamp | Target Variable | Exogenous Variable 1 | Exogenous Variable 2 |
| --- | --- | --- | --- | --- |
| BE | 2016-10-22 00:00:00 | 70.00 | 49593.0 | 57253.0 |
| BE | 2016-10-22 01:00:00 | 37.10 | 46073.0 | 51887.0 |
To determine the number of tokens, multiply the number of rows (in this example, two) by the number of columns used for forecasting, not counting the unique_id and timestamp columns (in this example, three), to get a total of six tokens. Given the following output dataset:
| Unique_id | Timestamp | Forecasted Target Variable |
| --- | --- | --- |
| BE | 2016-10-22 02:00:00 | 46.57 |
| BE | 2016-10-22 03:00:00 | 48.57 |
You can also determine the number of tokens by counting the number of data points returned after data forecasting. In this example, the number of tokens is two.
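The same arithmetic can be sketched in a few lines of Python. This is an illustration of the counting rule only, not part of Nixtla's API. The function name count_tokens and the dictionary keys (target, exog_1, exog_2, forecast) are hypothetical stand-ins for the columns in the example tables.

```python
# Columns that identify a row and therefore do not count as data points.
ID_COLUMNS = {"unique_id", "timestamp"}


def count_tokens(rows):
    """Count one token per data point, ignoring the identifier columns."""
    return sum(1 for row in rows for name in row if name not in ID_COLUMNS)


# The input dataset from the example (hypothetical column keys).
input_rows = [
    {"unique_id": "BE", "timestamp": "2016-10-22 00:00:00",
     "target": 70.00, "exog_1": 49593.0, "exog_2": 57253.0},
    {"unique_id": "BE", "timestamp": "2016-10-22 01:00:00",
     "target": 37.10, "exog_1": 46073.0, "exog_2": 51887.0},
]

# The output dataset from the example.
output_rows = [
    {"unique_id": "BE", "timestamp": "2016-10-22 02:00:00", "forecast": 46.57},
    {"unique_id": "BE", "timestamp": "2016-10-22 03:00:00", "forecast": 48.57},
]

print(count_tokens(input_rows))   # 2 rows x 3 columns = 6
print(count_tokens(output_rows))  # 2 rows x 1 column = 2
```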

Estimate pricing based on tokens

There are four pricing meters that determine the price you pay. These meters are as follows:
| Pricing Meter | Description |
| --- | --- |
| paygo-inference-input-tokens | Costs associated with the tokens used as input for inference when finetune_steps = 0 |
| paygo-inference-output-tokens | Costs associated with the tokens used as output for inference when finetune_steps = 0 |
| paygo-finetuned-model-inference-input-tokens | Costs associated with the tokens used as input for inference when finetune_steps > 0 |
| paygo-finetuned-model-inference-output-tokens | Costs associated with the tokens used as output for inference when finetune_steps > 0 |
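As a sketch of how the meters apply, the helper below maps a request's token counts to the meters that the table associates with its finetune_steps value. The function name usage_by_meter is hypothetical. The meter names come from the table, and no prices are included because this article does not list them.

```python
def usage_by_meter(finetune_steps: int, input_tokens: int, output_tokens: int) -> dict[str, int]:
    """Attribute token counts to the meters that apply for finetune_steps."""
    if finetune_steps < 0:
        raise ValueError("finetune_steps must be zero or greater")
    if finetune_steps == 0:
        # Plain inference: the two base meters apply.
        return {
            "paygo-inference-input-tokens": input_tokens,
            "paygo-inference-output-tokens": output_tokens,
        }
    # Any positive finetune_steps value switches to the fine-tuned meters.
    return {
        "paygo-finetuned-model-inference-input-tokens": input_tokens,
        "paygo-finetuned-model-inference-output-tokens": output_tokens,
    }
```

For the example datasets earlier in this section, usage_by_meter(0, 6, 2) attributes six tokens to paygo-inference-input-tokens and two tokens to paygo-inference-output-tokens.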
See the Nixtla model collection in Azure AI Foundry portal.

Stability AI

Stability AI models deployed via serverless API deployment implement the Model Inference API on the route /image/generations. For examples of how to use Stability AI models, see the following: