Comparing large language models can be confusing, so I created a benchmarking system that ranks them across 17 key categories—like search, creativity, coding, and document writing—based on the models' own assessments rather than my personal opinion. In this video, I break down the results from seven major AI models, reveal which one excels in each category, and crown the ultimate winner.
Video Timeline:
00:01 – Introduction to benchmark measurement for large language models and why it’s confusing for most users.
00:14 – Even AI experts struggle to interpret these comparisons.
00:21 – I developed a ranking system that simplifies LLM performance for everyday users.
00:35 – The categories we’re measuring: search, document writing, creativity, coding, image generation, and more.
00:48 – There are 17 comparison categories, and the rankings are based on what the LLMs themselves say.
01:02 – The models rank each other, not just my opinion—introducing the fair benchmarking system.
01:16 – Rules of the comparison: 17 categories, 10 points per category, with points split when models tie.
01:32 – The AI models being compared: ChatGPT-4, Copilot, Gemini Advanced, Claude 3.5, Perplexity, Llama 3, and Grok.
02:06 – Each AI describes itself in one word—some interesting insights here!
02:47 – First category: Which LLM offers a free trial? (All do, so points are evenly distributed.)
03:13 – If each LLM had to switch to another model, which would it pick? Most chose ChatGPT-4.
03:32 – Least costly paid plan? Llama 3 wins since it's free.
04:17 – Which LLM has the most up-to-date knowledge? Claude 3.5 lags behind.
04:55 – Do they have mobile apps? Llama 3 is the only one without one.
05:21 – Do they offer real-time voice-to-voice? Only ChatGPT and Copilot do.
05:55 – Real-time vision? ChatGPT-4 wins.
06:32 – Do they have a plan that disables training on user inputs? All do.
07:00 – How many languages can they translate? Over 100 for all, so points are shared.
07:21 – SOC 2 compliance and encryption? ChatGPT, Copilot, Gemini, and Perplexity win.
07:52 – Which is the most creative LLM? Claude 3.5 dominates.
08:37 – Best image generator? ChatGPT-4 (DALL·E 3) wins, with an honorable mention for Grok.
09:13 – Best LLM for search? Perplexity takes the lead.
09:32 – Best for coding? Copilot’s GitHub integration is the surprise winner.
09:59 – Best for document writing? ChatGPT-4, thanks to Canvas.
10:16 – Least hallucinations? Claude 3.5 ranks best, but no AI is perfect.
10:51 – API availability? All models offer it, making it a commodity.
11:03 – Final Rankings: 🥇ChatGPT-4 (Winner), 🥈Claude 3.5, 🥉Copilot, with Gemini close behind.
11:21 – This benchmark will be updated quarterly—PDF available with full details.
11:40 – Call to action: Share your thoughts, suggest new categories, and don’t forget to like & subscribe!
Link to the PDF Ranking - https://drive.google.com/file/d/19Xlg...