The Right Stack
  • Cloud, Developer, AI, and Security Survey Collection
  • Blog
The Right Stack

Vendor research collection

Linkedin

Threads

RSS Feed

Comparison of AI Models across Quality, Performance, Price | Artificial Analysis

Vendor Sponsor
Artificial Analysis
Research Published
February 19, 2024
Link to research
https://artificialanalysis.ai/models
Description

Comparison and analysis of AI models across key metrics including quality, price, performance and speed (throughput & latency), context window & others.

Demographic or Methodology comments

Topic Tags
AILarge Language ModelsEvergreen ResourceBenchmarksGenerative AIAI adoption
Demographics
Created time
Feb 25, 2024 11:17 PM
Directory name

The Rightstack Research DB

image

Comparison and analysis of AI models across key metrics including quality, price, performance and speed (throughput & latency), context window & others. Click on any model to see detailed metrics. For more details including relating to our methodology, see our FAQs.

Models compared: OpenAI: GPT-3.5 Turbo, GPT-3.5 Turbo (0125), GPT-3.5 Turbo (1106), GPT-3.5 Turbo Instruct, GPT-4, GPT-4 Turbo, GPT-4 Turbo (0125), and GPT-4 Vision, Google: Gemini Pro and Gemma 7B, Meta: Code Llama (70B), Llama 2 Chat (13B), Llama 2 Chat (70B), and Llama 2 Chat (7B), Mistral: Mistral 7B, Mistral Medium, and Mixtral 8x7B, Anthropic: Claude 2.0, Claude 2.1, and Claude Instant, Cohere: Command and Command Light, Perplexity: PPLX-70B-online and PPLX-7B-Online, and OpenChat: OpenChat 3.5.

Highlights

Quality

Quality Index; Higher is better

100100776762625837GPT-4GPT-4 TurboMistral MediumClaude 2.0GPT-3.5 TurboMixtral 8x7BGemini ProLlama 2 Chat(70B)

Speed

Throughput in Tokens per Second; Higher is better

10789786929242018Mixtral 8x7BGemini ProGPT-3.5 TurboLlama 2 Chat(70B)Claude 2.0GPT-4GPT-4 TurboMistral Medium

Price

USD per 1M Tokens; Lower is better

37.515124.11.310.80.5GPT-4GPT-4 TurboClaude 2.0Mistral MediumGemini ProLlama 2 Chat(70B)GPT-3.5 TurboMixtral 8x7B37.515124.11.310.80.5

Parallel Queries:

SingleMultiple

Prompt Length:

Short

Quality comparison by ability

Varied metrics by ability categorization; Higher is better

General Ability (Chatbot Arena)

124312431150113111211117111110771023GPT-4GPT-4 TurboMistral MediumClaude 2.0Mixtral 8x7BGPT-3.5 TurboGemini ProLlama 2 Chat(70B)Mistral 7B124312431150113111211117111110771023

Reasoning & Knowledge (MMLU)

86%79%75%72%71%70%63%55%GPT-4Claude 2.0Mistral MediumGemini ProMixtral 8x7BGPT-3.5 TurboLlama 2 Chat(70B)Mistral 7BGPT-4 Turbo86%79%75%72%71%70%63%55%N/A

Reasoning & Knowledge (MT Bench)

9.38.68.48.38.16.96.8GPT-4 TurboMistral MediumGPT-3.5 TurboMixtral 8x7BClaude 2.0Llama 2 Chat(70B)Mistral 7BGPT-4Gemini Pro9.38.68.48.38.16.96.8N/AN/A

Coding (HumanEval)

88.485.473.271.263.4GPT-4GPT-4 TurboGPT-3.5 TurboClaude 2.0Gemini ProLlama 2 Chat(70B)Mistral 7BMixtral 8x7BMistral Medium88.485.473.271.263.4N/AN/AN/AN/A

OpenAI's GPT-4 is the clear quality leader across quality metrics. However, models including Gemini Pro and Mixtral 8x7B have reached GPT-3.5 performance in some measures.

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Quality vs. Price

Quality: General reasoning index, Price: USD per 1M Tokens

Most attractive quadrant

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mixtral 8x7B

Mistral Medium

Claude 2.0

ArtificialAnalysis.ai$0.00$5.00$10.00$15.00$20.00$25.00$30.00$35.00$40.00$45.00Price (USD per M Tokens)0102030405060708090100110120Quality (General ability index)Mistral 7BMistral 7BLlama 2 Chat (70B)Llama 2 Chat (70B)Gemini ProGemini ProGPT-3.5 TurboGPT-3.5 TurboMixtral 8x7BMixtral 8x7BClaude 2.0Claude 2.0Mistral MediumMistral MediumGPT-4 TurboGPT-4 TurboGPT-4GPT-4

Higher quality models are typically more expensive. However, model quality varies significantly and some open source models now achieve very high quality.

Quality: Index represents normalized average relative performance across Chatbot arena, MMLU & MT-Bench.

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Pricing: Input and Output prices

USD per 1M Tokens

Input price

Output price

ArtificialAnalysis.ai301082.7510.80.50.40.156030248.24211.50.60.2GPT-4GPT-4 TurboClaude 2.0Mistral MediumGemini ProLlama 2 Chat(70B)GPT-3.5 TurboMixtral 8x7BMistral 7B301082.7510.80.50.40.156030248.24211.50.60.2

Prices vary considerably, including between input and output token price. GPT-4 stands out as orders of magnitude higher priced than most other models.

Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Quality vs. Throughput, Price

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens

Most attractive quadrant

Size represents Price (USD per M Tokens)

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mistral 7B

Mixtral 8x7B

Mistral Medium

Gemini Pro

Claude 2.0

ArtificialAnalysis.ai0102030405060708090100110120Throughput (Tokens per Second)0102030405060708090100110120Quality (General ability index)Claude 2.0Claude 2.0Llama 2 Chat (70B)Llama 2 Chat (70B)Mistral MediumMistral MediumMistral 7BMistral 7BGPT-3.5 TurboGPT-3.5 TurboGPT-4 TurboGPT-4 TurboGPT-4GPT-4Gemini ProGemini ProMixtral 8x7BMixtral 8x7B

There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.

Quality: Index represents normalized average relative performance across Chatbot arena, MMLU & MT-Bench.

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Total Response Time

Seconds to Output 100 Tokens; Lower is better

ArtificialAnalysis.ai65.55.542.91.81.81.61.2Mistral MediumGPT-4GPT-4 TurboClaude 2.0Gemini ProGPT-3.5 TurboLlama 2 Chat(70B)Mistral 7BMixtral 8x7B

The speed difference between the fastest and slowest models is >3X. There is not always a correlation between parameter size and speed, or between price and speed.

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Total Response Time, Over Time

Seconds to Output 100 Tokens; Lower is better

GPT-3.5 Turbo

GPT-4

Mistral 7B Instruct

Mistral Medium

Claude 2.0

ArtificialAnalysis.aiDec 17Dec 18Dec 19Dec 20Dec 21Dec 22Dec 23Dec 24Dec 25Dec 26Dec 27Dec 28Dec 29Dec 30Dec 31Jan 01Jan 02Jan 03Jan 04Jan 05Jan 06Jan 07Jan 08Jan 09Jan 10Jan 11Jan 12Jan 13Jan 14Jan 15Jan 16Jan 17Jan 18Jan 19Jan 20Jan 21Jan 22Jan 23Jan 24Jan 25Jan 26Jan 27Jan 28Jan 29Jan 30Jan 31Feb 01Feb 02Feb 03Feb 04Feb 05Feb 06Feb 07Feb 08Feb 09Feb 10Feb 11Feb 12Feb 13Feb 14Feb 15Feb 16Feb 17Feb 18Feb 19Feb 20Feb 21Feb 22Feb 23Feb 240s2s4s6s8s10s12s14s16s18scount

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Quality vs. Context window, Input token price

Quality: General reasoning index, Context window: Tokens limit, Input Price: USD per 1M Tokens

Most attractive quadrant

Size represents Input Price (USD per M Tokens)

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mistral 7B

Mixtral 8x7B

Mistral Medium

Gemini Pro

Claude 2.0

ArtificialAnalysis.ai0.020k40k60k80k100k120k140kContext window (Tokens limit)0102030405060708090100110120Quality (General ability index)Llama 2 Chat (70B)Llama 2 Chat (70B)Mistral 7BMistral 7BGPT-4GPT-4GPT-3.5 TurboGPT-3.5 TurboGemini ProGemini ProMixtral 8x7BMixtral 8x7BMistral MediumMistral MediumClaude 2.0Claude 2.0GPT-4 TurboGPT-4 Turbo

Open AI's GPT-4 Turbo and Anthropic's Claude models stand out as leaders in offering large context windows. A trade off of quality and context window size exists between GPT-4 Turbo and Claude 2.1, Claude 2.1 is also marginally cheaper.

Quality: Index represents normalized average relative performance across Chatbot arena, MMLU & MT-Bench.

Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).

Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Context window

Context window: Tokens limit; Higher is better

ArtificialAnalysis.ai128k100k32.8k32.8k32.8k16.4k8.19k8.19k4.10kGPT-4 TurboClaude 2.0Mixtral 8x7BMistral MediumGemini ProGPT-3.5 TurboGPT-4Mistral 7BLlama 2 Chat(70B)128k100k32.8k32.8k32.8k16.4k8.19k8.19k4.10k

Open AI's GPT-4 Turbo and Anthropic's Claude models, particuarly Claude 2.1, stand out as leaders in offering large context windows.

Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).

Detailed performance metrics

Latency vs. Throughput

Latency: Seconds to First Tokens Chunk Received, Throughput: Tokens per Second

Most attractive quadrant

Size represents Price (USD per M Tokens)

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mistral 7B

Mixtral 8x7B

Mistral Medium

Gemini Pro

Claude 2.0

ArtificialAnalysis.ai0102030405060708090100110120Throughput (Tokens per Second)0s0.2s0.4s0.6s0.8s1s1.2s1.4s1.6s1.8s2s2.2s2.4sLatency (Time to First Chunk)Mistral MediumMistral MediumGPT-4 TurboGPT-4 TurboGPT-4GPT-4Claude 2.0Claude 2.0Llama 2 Chat (70B)Llama 2 Chat (70B)Mistral 7BMistral 7BGPT-3.5 TurboGPT-3.5 TurboGemini ProGemini ProMixtral 8x7BMixtral 8x7B

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Latency vs. Throughput: Host & Model combinations

Latency: Seconds to First Tokens Chunk Received, Throughput: Tokens per Second

Most attractive quadrant

Size represents Price (USD per M Tokens)

GPT-3.5 Turbo (OpenAI)

GPT-3.5 Turbo (Azure)

GPT-4 Turbo (OpenAI)

GPT-4 Turbo (Azure)

ArtificialAnalysis.ai0102030405060708090100Throughput (Tokens per Second)0s0.1s0.2s0.3s0.4s0.5s0.6s0.7s0.8sLatency (Time to First Chunk)GPT-4 Turbo (Azure)GPT-4 Turbo (Azure)GPT-4 Turbo (OpenAI)GPT-4 Turbo (OpenAI)GPT-3.5 Turbo (OpenAI)GPT-3.5 Turbo (OpenAI)GPT-3.5 Turbo (Azure)GPT-3.5 Turbo (Azure)

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Quality vs. Throughput: Host & Model combinations

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens

Most attractive quadrant

Size represents Price (USD per M Tokens)

GPT-3.5 Turbo (OpenAI)

GPT-3.5 Turbo (Azure)

GPT-4 Turbo (OpenAI)

GPT-4 Turbo (Azure)

ArtificialAnalysis.ai0102030405060708090100Throughput (Tokens per Second)0102030405060708090100110120Quality (General ability index)GPT-3.5 Turbo (OpenAI)GPT-3.5 Turbo (OpenAI)GPT-4 Turbo (Azure)GPT-4 Turbo (Azure)GPT-4 Turbo (OpenAI)GPT-4 Turbo (OpenAI)GPT-3.5 Turbo (Azure)GPT-3.5 Turbo (Azure)

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Throughput

Output Tokens per Second; Higher is better

ArtificialAnalysis.ai1078978766929242018Mixtral 8x7BGemini ProGPT-3.5 TurboMistral 7BLlama 2 Chat(70B)Claude 2.0GPT-4GPT-4 TurboMistral Medium

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Throughput Variance

Output Tokens per Second; Results by percentile; Higher median is better

Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively

ArtificialAnalysis.aiGPT-3.5 TurboGPT-4GPT-4 TurboLlama 2 Chat(70B)Mistral 7BMixtral 8x7BMistral MediumGemini ProClaude 2.005010015020025030035040045050077.823.920.469.176.5106.917.68929.4

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Boxplot: Shows variance of measurements

image

Picture of the author

Throughput, Over Time

Output Tokens per Second; Higher is better

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mistral 7B Instruct

Mixtral 8x7B Instruct

Mistral Medium

Gemini Pro

Claude 2.0

ArtificialAnalysis.aiDec 17Dec 18Dec 19Dec 20Dec 21Dec 22Dec 23Dec 24Dec 25Dec 26Dec 27Dec 28Dec 29Dec 30Dec 31Jan 01Jan 02Jan 03Jan 04Jan 05Jan 06Jan 07Jan 08Jan 09Jan 10Jan 11Jan 12Jan 13Jan 14Jan 15Jan 16Jan 17Jan 18Jan 19Jan 20Jan 21Jan 22Jan 23Jan 24Jan 25Jan 26Jan 27Jan 28Jan 29Jan 30Jan 31Feb 01Feb 02Feb 03Feb 04Feb 05Feb 06Feb 07Feb 08Feb 09Feb 10Feb 11Feb 12Feb 13Feb 14Feb 15Feb 16Feb 17Feb 18Feb 19Feb 20Feb 21Feb 22Feb 23Feb 240102030405060708090100110120130count

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Latency

Seconds to First Tokens Chunk Received; Lower is better

ArtificialAnalysis.ai20.70.60.50.50.30.30.30.3Gemini ProGPT-4Claude 2.0GPT-3.5 TurboGPT-4 TurboLlama 2 Chat(70B)Mistral 7BMixtral 8x7BMistral Medium

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Latency Variance

Seconds to First Tokens Chunk Received; Results by percentile; Lower median is better

Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively

ArtificialAnalysis.aiGPT-3.5 TurboGPT-4GPT-4 TurboLlama 2 Chat(70B)Mistral 7BMixtral 8x7BMistral MediumGemini ProClaude 2.0012345678910110.490.740.540.320.280.290.271.960.56

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Boxplot: Shows variance of measurements

Latency, Over Time

Seconds to First Tokens Chunk Received; Lower median is better

GPT-3.5 Turbo

GPT-4

GPT-4 Turbo

Llama 2 Chat (70B)

Mistral 7B Instruct

Mixtral 8x7B Instruct

Mistral Medium

Gemini Pro

Claude 2.0

ArtificialAnalysis.aiDec 17Dec 18Dec 19Dec 20Dec 21Dec 22Dec 23Dec 24Dec 25Dec 26Dec 27Dec 28Dec 29Dec 30Dec 31Jan 01Jan 02Jan 03Jan 04Jan 05Jan 06Jan 07Jan 08Jan 09Jan 10Jan 11Jan 12Jan 13Jan 14Jan 15Jan 16Jan 17Jan 18Jan 19Jan 20Jan 21Jan 22Jan 23Jan 24Jan 25Jan 26Jan 27Jan 28Jan 29Jan 30Jan 31Feb 01Feb 02Feb 03Feb 04Feb 05Feb 06Feb 07Feb 08Feb 09Feb 10Feb 11Feb 12Feb 13Feb 14Feb 15Feb 16Feb 17Feb 18Feb 19Feb 20Feb 21Feb 22Feb 23Feb 240s1s2s3s4s5s6s7s8s9scount

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Median across hosts: Figures represent median (P50) across all hosts which support the model.

Further details

Model Name
Creator
License
Context Window

Highlights

Quality

Quality Index; Higher is better

Speed

Throughput in Tokens per Second; Higher is better

Price

USD per 1M Tokens; Lower is better

Parallel Queries:

SingleMultiple

Prompt Length:

ShortLong

Total Response Time vs. Price

Total Response Time: Seconds to Output 100 Tokens, Price: USD per 1M Tokens

Most attractive quadrant

Together.ai

Anyscale

Fireworks

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Median: Figures represent median (P50) measurement over the past 14 days.

Pricing: Input and Output Prices

USD per 1M Tokens; Lower is better

Input price

Output price

Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.

Total Response Time

Seconds to Output 100 Tokens; Lower is better

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Median: Figures represent median (P50) measurement over the past 14 days.

Total Response Time, Over Time

Seconds to Output 100 Tokens; Lower is better

Together.ai

Anyscale

Fireworks

Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Hosting Providers for Gemma 7B: Detailed performance Metrics

Latency vs. Throughput

Latency: Seconds to First Tokens Chunk Received, Throughput: Tokens per Second

Most attractive quadrant

Size represents Price (USD per M Tokens)

Together.ai

Anyscale

Fireworks

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median: Figures represent median (P50) measurement over the past 14 days.

Throughput vs. Price

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens

Most attractive quadrant

Together.ai

Anyscale

Fireworks

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Median: Figures represent median (P50) measurement over the past 14 days.

Throughput

Output Tokens per Second; Higher is better

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Median: Figures represent median (P50) measurement over the past 14 days.

Throughput Variance

Output Tokens per Second; Results by percentile; Higher median is better

Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Boxplot: Shows variance of measurements

Throughput, Over Time

Output Tokens per Second; Higher is better

Together.ai

Anyscale

Fireworks

Throughput: Tokens per second received while the model is generating tokens (ie. after first chunk has been received from the API).

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Latency

Seconds to First Tokens Chunk Received; Lower is better

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Median: Figures represent median (P50) measurement over the past 14 days.

Latency Variance

Seconds to First Tokens Chunk Received; Results by percentile; Lower median is better

Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Boxplot: Shows variance of measurements

Picture of the author

Latency, Over Time

Seconds to First Tokens Chunk Received; Lower is better

Together.ai

Anyscale

Fireworks

Latency: Time to first chunk of tokens received, in seconds, after API request sent.

Over time measurement: Median measurement per day, based on 8 measurements each day at different times.

Summary table of key metrics