New AI benchmark test results prove rapid growth for genAI models

The latest round of AI benchmark results measuring query response times was released by MLCommons on Tuesday and, for the first time, included Meta’s large language model (LLM) Llama 2 70B as part of its testing suite. Cybernews breaks it down.

MLCommons is an AI engineering consortium tasked with developing worldwide industry-standard benchmark tests to measure and compare how well machine learning (ML) and AI models perform in various domains.

Additionally for this round, Stability AI’s Stable Diffusion XL was selected for the first time as the benchmark for text-to-image generative AI models.

Described as a collaboration between over 125 tech “startups, leading companies, academics, and non-profits from around the globe,” MLCommons states its goal is to improve “the accuracy, safety, speed, and efficiency” of all AI models, as well as to build the “open, large-scale, and diverse datasets” used to train them.

The group put out two separate blog posts on March 26th: one describing how its ‘MLPerf Inference’ benchmark suite came to incorporate the Llama 2 generative AI model into its toolset, and the other explaining the parameters used to measure its v4.0 submission round results.

Specifically, the tests are designed to measure how fast hardware systems can run AI and ML models in a variety of deployment scenarios, MLCommons states.

Typical 'question answering' benchmarks measure vision, speech, and natural language processing across different application segments when a model is prompted with a query.

The tests consider a comparison model’s specifications, general system description, type and number of CPUs, GPUs, and accelerators used, as well as how much of each is needed for the model to complete a computation.

The results are used by businesses, vendors, developers, and consumers to help them choose the best models for their needs, based on factors such as customization, costs, workloads, and more.

Sample of AI benchmark dataset test results. Image by MLCommons.

Meta’s Llama 2 is top benchmark pick

In 2023, MLCommons said it created a special task force to help develop the benchmarks for both smaller and larger LLMs.

The task force was able to successfully create a v3.1 benchmark test based on GPT-J, a smaller open-source LLM that produces human-like text from a prompt using six billion parameters. The model was released by EleutherAI in 2021.

For the smaller model v3.1 benchmark round, 11 organizations submitted 30 results to the group depicting their models' CPUs, GPUs, and custom accelerators.

However, creating the benchmark for larger LLMs presented more of a challenge, and the task force decided to postpone its call for submissions so it could carry out further research to determine the best model for the job, which turned out to be Llama 2 70B.

For perspective, the number of parameters (which determine how effectively a model can optimize its output) for larger LLMs can range anywhere from 70 billion with Meta’s Llama 2 to 175 billion with OpenAI’s GPT-3 and, reportedly, roughly 1.76 trillion for its GPT-4.

MLCommons said the Meta AI model was chosen mainly for the flexibility of its unlimited access, ease of use and deployment, community interest, and quality.

Stable Diffusion XL, which has 2.6 billion parameters, was chosen for its ability to generate a high number of images, which can help to “calculate metrics such as latency and throughput to understand overall performance.”
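As a rough illustration of the two metrics named in that quote, the sketch below derives latency and throughput from a list of per-query start and end times. The function name `summarize_timings` and the sample timings are hypothetical and are not taken from MLCommons' own tooling.

```python
from statistics import mean


def summarize_timings(timings):
    """Summarize (start, end) pairs, in seconds, for a batch of queries.

    Latency is how long one query takes; throughput is how many queries
    finish per second of wall-clock time across the whole batch.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    latencies = [end - start for start, end in timings]
    first_start = min(start for start, _ in timings)
    last_end = max(end for _, end in timings)
    wall_clock = last_end - first_start
    if wall_clock <= 0:
        raise ValueError("timings must span a positive duration")
    return {
        "mean_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
        "throughput_qps": len(timings) / wall_clock,
    }


# Made-up timings for three overlapping queries (illustration only).
sample = [(0.0, 2.0), (0.5, 3.0), (1.0, 4.0)]
stats = summarize_timings(sample)
```

Because the queries overlap in time, throughput here is computed over the wall-clock span of the batch rather than from the sum of individual latencies.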

What’s in a test?

To create the tests, the consortium said it chose the OpenOrca dataset, recognized among AI experts as one of the “highest quality and most extensive datasets available for question answering and evaluating NLP [natural language processing] capabilities.”

AI datasets, such as the open-source OpenOrca dataset, are collections of data used in AI to train models and test algorithms.

MLCommons said a subset of 24,576 samples was chosen from the OpenOrca dataset, filtered by prompt quality, maximum input length, and minimum reference response length, to create the benchmarks.
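A minimal sketch of how filtering on those three criteria might look is shown here. The field names (`quality`, `prompt`, `reference`), the thresholds, and the use of word counts as a stand-in for length are all assumptions made for illustration; the article does not describe how MLCommons scored quality or measured length.

```python
def select_samples(samples, min_quality, max_input_words, min_reference_words):
    """Keep only samples that pass all three filters.

    Length is approximated by whitespace-separated word count, which is an
    assumption of this sketch, not a detail taken from MLCommons.
    """
    kept = []
    for sample in samples:
        prompt_words = len(sample["prompt"].split())
        reference_words = len(sample["reference"].split())
        if (
            sample["quality"] >= min_quality
            and prompt_words <= max_input_words
            and reference_words >= min_reference_words
        ):
            kept.append(sample)
    return kept


# Hypothetical candidate samples; none of these come from OpenOrca.
candidates = [
    {"id": "a", "quality": 0.9, "prompt": "Explain how photosynthesis works",
     "reference": "Plants convert light into chemical energy"},
    {"id": "b", "quality": 0.4, "prompt": "Name a color",
     "reference": "Blue is a color"},             # fails the quality filter
    {"id": "c", "quality": 0.8, "prompt": "word " * 20,
     "reference": "A long enough reference answer"},  # prompt too long
    {"id": "d", "quality": 0.95, "prompt": "Define entropy",
     "reference": "Disorder"},                    # reference too short
]

selected = select_samples(
    candidates, min_quality=0.5, max_input_words=8, min_reference_words=3
)
selected_ids = [sample["id"] for sample in selected]
```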

OpenOrca dataset sample. Image by Cybernews.

The MLPerf Inference v4.0 round produced over 8500 performance results from 23 submitting organizations, according to MLCommons.

Big names that submitted in this v4.0 round include Azure, Cisco, Dell, Fujitsu, Google, Hewlett Packard, Intel, Juniper Networks, Lenovo, NVIDIA, Oracle, Qualcomm Technologies, Red Hat, and more.

The submissions themselves are divided into three categories: systems that are currently available for purchase or rent in the cloud; preview systems that will be available by the next round of benchmark tests; and systems with internal-use hardware or software that are still in the research, experimental, and/or development stage.

MLCommons said that this latest round of submission results highlights the "continued progress made in efficient AI acceleration."

The group said the participating organizations also submitted about 900 power results, including data center-focused power numbers from Dell, Fujitsu, NVIDIA, and Qualcomm, which measure how much power systems consume while the benchmarks are running.

Energy benchmarks were also used in this round to help determine how efficient a chip is by measuring the performance it delivers for the least amount of energy used.
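One common way to express that trade-off is performance per watt. The sketch below is an assumption about how such a figure could be computed, not MLCommons' published method, and the function name and input numbers are invented for illustration.

```python
def queries_per_joule(queries_completed, duration_s, energy_joules):
    """Return energy efficiency as throughput divided by average power.

    Throughput (queries/second) divided by power (watts, i.e. joules/second)
    simplifies to queries per joule of energy consumed.
    """
    if duration_s <= 0 or energy_joules <= 0:
        raise ValueError("duration and energy must be positive")
    throughput_qps = queries_completed / duration_s
    average_power_w = energy_joules / duration_s
    return throughput_qps / average_power_w


# Invented run: 1,200 queries in 60 seconds using 30,000 joules,
# which works out to 20 queries/second at an average of 500 watts.
efficiency = queries_per_joule(
    queries_completed=1200, duration_s=60.0, energy_joules=30000.0
)
```

A higher value means more work done for the same energy, so two systems with equal throughput can still differ sharply on this measure.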

The full MLPerf Inference v4.0 AI benchmark results can be read here and here.
