Leaderboard

SwiftEval
 #  Model                            Score
 1  GPT-4o                           0.89
 2  GPT-4 Turbo                      0.87
 3  GPT-4o Mini                      0.86
 4  DeepSeek Coder V2 Instruct 236B  0.82
 5  GPT-4                            0.82
 6  GPT-3.5 Turbo                    0.81
 7  Codestral 22B                    0.73
 8  DeepSeek Coder V2 Instruct 16B   0.69
 9  Codestral Mamba 7B               0.57
10  CodeGeeX4 9B                     0.53
11  StarCoder 2 Instruct 15B         0.53
12  CodeLlama Instruct 70B           0.51
13  CodeQwen2.5 Instruct 7B          0.43
14  StarCoder 2 15B                  0.41
15  CodeQwen1.5 7B                   0.41
16  Nxcode-CQ 7B                     0.40
17  CodeQwen1.5 Chat 7B              0.38
18  CodeGemma 1.1 Instruct 7B        0.36
19  CodeLlama Instruct 13B           0.35
20  Yi-Coder Chat 9B                 0.35
21  CodeLlama Instruct 34B           0.34
22  DeepSeek Coder Instruct 33B      0.32
23  Granite Code Instruct 34B        0.31
24  CodeGemma Instruct 7B            0.30
25  StarCoder 2 7B                   0.30
26  CodeQwen2.5 7B                   0.29
27  OpenCodeInterpreter 33B          0.27
28  CodeGemma 7B                     0.25
29  CodeLlama Instruct 7B            0.25
30  Granite Code Instruct 8B         0.24
31  StarCoder 2 3B                   0.22
32  OpenCodeInterpreter 6.7B         0.20
33  CodeGemma 2B                     0.17
34  DeepSeek Coder Instruct 6.7B     0.17
35  Granite Code Instruct 20B        0.16
36  CodeShell 7B                     0.12
37  Granite Code Instruct 3B         0.11
38  CodeQwen2.5 1.5B                 0.11
39  Stable Code Instruct 3B          0.11
40  DeepSeek Coder Instruct 1.3B     0.10
41  Stable Code 3B                   0.09
42  CodeShell Chat 7B                0.07
43  Yi-Coder Chat 1.5B               0.04
44  CodeGeeX2 6B                     0.04

HumanEval (Python)
 #  Model                            Score
 1  GPT-4o                           0.90
 2  DeepSeek Coder V2 Instruct 236B  0.90
 3  GPT-4                            0.88
 4  GPT-4 Turbo                      0.88
 5  CodeQwen2.5 Instruct 7B          0.88
 6  GPT-4o Mini                      0.87
 7  Nxcode-CQ 7B                     0.86
 8  Yi-Coder Chat 9B                 0.85
 9  CodeQwen1.5 Chat 7B              0.83
10  CodeGeeX4 9B                     0.82
11  Codestral 22B                    0.81
12  DeepSeek Coder V2 Instruct 16B   0.81
13  DeepSeek Coder Instruct 33B      0.79
14  OpenCodeInterpreter 33B          0.79
15  DeepSeek Coder Instruct 6.7B     0.78
16  OpenCodeInterpreter 6.7B         0.77
17  Codestral Mamba 7B               0.75
18  StarCoder 2 Instruct 15B         0.72
19  GPT-3.5 Turbo                    0.68
20  CodeLlama Instruct 70B           0.67
21  Yi-Coder Chat 1.5B               0.67
22  DeepSeek Coder Instruct 1.3B     0.65
23  Granite Code Instruct 34B        0.62
24  CodeQwen2.5 7B                   0.61
25  CodeGemma 1.1 Instruct 7B        0.60
26  Granite Code Instruct 20B        0.60
27  Stable Code Instruct 3B          0.59
28  Granite Code Instruct 8B         0.57
29  CodeGemma Instruct 7B            0.56
30  CodeQwen1.5 7B                   0.51
31  Granite Code Instruct 3B         0.51
32  StarCoder 2 15B                  0.46
33  CodeGemma 7B                     0.44
34  CodeQwen2.5 1.5B                 0.43
35  CodeLlama Instruct 13B           0.42
36  CodeLlama Instruct 34B           0.41
37  StarCoder 2 7B                   0.35
38  CodeLlama Instruct 7B            0.34
39  CodeShell 7B                     0.34
40  CodeShell Chat 7B                0.34
41  CodeGeeX2 6B                     0.33
42  Stable Code 3B                   0.32
43  CodeGemma 2B                     0.31
44  StarCoder 2 3B                   0.31

Categories

Base
The main SwiftEval category, built from hand-crafted problems. It is designed specifically for the Swift programming language and exercises Swift-specific features such as generics, protocols, enumerations, and closures. The problem set is diverse, spanning practical tasks and design-pattern implementations; see the illustrative sketch after the stats below.
  • 28 Problems
  • 98 Experiments
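
For illustration, the following is a minimal sketch of the kind of Swift-specific problem Base targets. The task and all names in it (Shape, Measurable, totalArea) are invented for this example and are not taken from the benchmark.

```swift
// Hypothetical Base-style problem: combine an enumeration with associated
// values, protocol conformance via an extension, a generic constraint,
// and a closure parameter. Not an actual benchmark task.
enum Shape {
    case circle(radius: Double)
    case rectangle(width: Double, height: Double)
}

protocol Measurable {
    var area: Double { get }
}

extension Shape: Measurable {
    var area: Double {
        switch self {
        case .circle(let radius):
            return Double.pi * radius * radius
        case .rectangle(let width, let height):
            return width * height
        }
    }
}

// Generic over any Measurable type, filtered by a caller-supplied closure.
func totalArea<T: Measurable>(of items: [T], where include: (T) -> Bool) -> Double {
    items.filter(include).reduce(0) { $0 + $1.area }
}

let shapes: [Shape] = [.circle(radius: 1), .rectangle(width: 2, height: 3)]
print(totalArea(of: shapes) { $0.area > 1 })  // 9.141592653589793
```
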
HumanEval
A custom version of the HumanEval benchmark for the Swift programming language, based on the MultiPL-E (HumanEval) subset with critical errors fixed. This category is not included in the final results and exists for research purposes; see the illustrative sketch after the stats below.
  • 158 Problems
  • 72 Experiments
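
For reference, a MultiPL-E-style Swift rendering of the classic first HumanEval task (has_close_elements) might look like the sketch below; the exact signature and test harness used by this category may differ.

```swift
// Illustrative Swift translation of HumanEval/0 ("has_close_elements"),
// in the spirit of MultiPL-E; not verbatim from this category.

/// Returns true if any two numbers in the array are closer to each other
/// than the given threshold.
func hasCloseElements(_ numbers: [Double], threshold: Double) -> Bool {
    for (i, a) in numbers.enumerated() {
        for b in numbers[(i + 1)...] where abs(a - b) < threshold {
            return true
        }
    }
    return false
}

// HumanEval-style harnesses verify candidate solutions with example-based
// assertions such as these.
assert(hasCloseElements([1.0, 2.0, 3.0], threshold: 0.5) == false)
assert(hasCloseElements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], threshold: 0.3) == true)
```
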
SwiftUI
An additional SwiftEval category for evaluating SwiftUI knowledge. It contains UI problems covering layout, forms, lists, and more. No experiments have been run yet, but we will populate this category in the future; see the illustrative sketch after the stats below.
  • 4 Problems
  • 0 Experiments
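
To give a flavor of what such problems could look like, here is a minimal hypothetical sketch touching forms, lists, and layout; SettingsView and all of its fields are invented and not part of the benchmark.

```swift
// Hypothetical SwiftUI task sketch covering forms, lists, and layout;
// illustrative only, not an actual problem from this category.
import SwiftUI

struct SettingsView: View {
    @State private var username = ""
    @State private var notificationsEnabled = true
    @State private var theme = "System"
    private let themes = ["System", "Light", "Dark"]

    var body: some View {
        Form {
            Section("Profile") {
                TextField("Username", text: $username)
            }
            Section("Preferences") {
                Toggle("Notifications", isOn: $notificationsEnabled)
                Picker("Theme", selection: $theme) {
                    ForEach(themes, id: \.self) { Text($0) }
                }
            }
        }
    }
}
```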