SmartGPT: Major Benchmark Broken – 89.0% on MMLU + Exam's Many Errors
Has GPT 4, using the SmartGPT system, broken a major benchmark, the MMLU, in more ways than one? 89.0% is an unofficial record, but do we urgently need a new, authoritative benchmark, especially in light of today's insider info that Gemini has 5x the compute of GPT 4?

Learn all about the power of exemplars and self-consistency, and how you can benefit tangibly, with real-world examples. You'll learn more about everything from cutting-edge benchmarking to AGI forecasting.

Get more from AI Explained on Patreon
Home of the AI Insiders network: plus exclusive content and more

Original SmartGPT Video: https://www.youtube.com/watch?v=wVzuvf9D9BU&list=PPSV
MMLU: https://arxiv.org/pdf/2009.03300.pdf
Gemini 5x GPT 4, Semianalysis: https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini
WizardCoder Overfitting? https://twitter.com/Shahules786/status/1695493641610133600
Let’s Do a Thought Experiment: https://arxiv.org/pdf/2306.14308.pdf
LegalBench: https://arxiv.org/pdf/2308.11462.pdf
SciBench: https://arxiv.org/pdf/2307.10635.pdf
AGIEval: https://arxiv.org/pdf/2304.06364.pdf
MMLU Grading Issues: https://huggingface.co/blog/evaluating-mmlu-leaderboard
Oxford University Press Question Example: https://global.oup.com/uk/orc/chemistry/chechik/student/mcqs/ch04/
Fall 2011 Epidemiology Example: https://www.docsity.com/en/final-exam-fall-2011-4/8308030/
HellaSwag: https://arxiv.org/pdf/1905.07830.pdf
GPT 4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
Minerva, Solving Quantitative Reasoning: https://arxiv.org/pdf/2206.14858.pdf
Original Scratchpads Paper: https://arxiv.org/pdf/2112.00114.pdf
Is ChatGPT Behaviour Changing Over Time? https://arxiv.org/pdf/2307.09009.pdf
Paul Christiano: https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
Metaculus Forecasting: https://www.metaculus.com/ai/ https://www.lesswrong.com/posts/SdkexhiynayG2sQCC/ai-forecasting-two-years-in
MIT Paper: https://twitter.com/jeremyphoward/status/1669588857149612033?lang=en-GB
Snowballing Hallucinations: https://arxiv.org/pdf/2305.13534.pdf
Self Consistency: https://arxiv.org/pdf/2203.11171.pdf
OpenLLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
NHS Question from ‘Extended Matching Questions’
Graph of Thoughts: https://arxiv.org/pdf/2308.09687.pdf
Dario Amodei Interview – Dwarkesh Patel: https://www.youtube.com/watch?v=Nlkk3glap_U

GitHub Answers: https://github.com/Joshua-Stapleton/smartgpt-answers

Joshua Stapleton is a Machine Learning Engineer who has worked in the healthcare and defence sectors. He recently pivoted into AI capabilities and safety, with a concentration on LLMs. He now works as a research engineer, consults on the applications of AI across various industries, and is pursuing his Master's in Machine Learning and Data Science at Imperial College London.
Feel free to reach out to Josh via email at joshua.stapleton.ai@gmail.com, or check out his new Patreon: https://patreon.com/JoshuaStapleton.

AI Explained Community: https://discord.gg/PEmxEhFV

https://www.patreon.com/AIExplained


