Wednesday, December 20, 2023

The surprising power of Small Language Models (SLMs) ... TL;DR and podcast 21Dec23

Last update: Thursday 12/21/23 
Welcome to our 21Dec23 TL;DR summary + podcast about the past week's top AI story on our "Useful AI News" page: Microsoft's newest small language model (SLM) and its benchmarks.


Click the "start" button in the audio control (below) to hear the podcast ...  If audio fails to start, or gets stuck, try reloading the page.

 
TL;DR link  HERE

A. Top story in past 2 weeks ...
As the reader will see, we only have one top story. Given that Google had rolled out tables of benchmark tests two weeks ago proclaiming the superiority of its biggest LLM cannon, Gemini Ultra, over the biggest cannon deployed by Microsoft and its partner OpenAI, GPT-4, the editor was not surprised to learn that the Microsoft "Empire" had decided to strike back by firing at the biggest cannon in the Google "Empire".

But the editor was stunned when he found Microsoft firing test results wherein its smallest cannon, Phi-2, outperformed Gemini Ultra. Phi-2 is not a large language model (LLM); Phi-2 is a small language model (SLM). Whereas Gemini Ultra contains about 1.56 trillion parameters, Phi-2 contains only 2.7 billion parameters; so Ultra is about 580 times as large as Phi-2. Wow!!! How can such a little David outperform such a big Goliath at anything? Indeed, Microsoft published the benchmarks on its blog in a note titled "The surprising power of small language models."

This story was obviously far more important than any other AI news item in the past two weeks. Indeed, Microsoft's announcement that small language models could support the emergence of the same kinds of dazzling capabilities that had emerged from large language models might be the most important generative AI news story since the release of ChatGPT running GPT-3.5 in November 2022.

Microsoft's newest small language model (SLM) and benchmarks
On December 12, 2023, staff from the Microsoft Research unit published a report on Microsoft's blog that included tables of benchmarks and other performance indicators supporting the assertion of the note's title: "The surprising power of small language models." The report's Table 2 was the most relevant table for our discussion because it compares Microsoft's Phi-2 with data that Google published about its Gemini Nano 2 SLM. A copy of the report's Table 2 appears below.

The second column shows that Nano 2 is about the same size as Phi-2, i.e., it has about the same number of parameters. The remaining four columns in the table provide four performance measures for each SLM. Phi-2 outperforms Nano 2 on all four measures, but it clobbers Nano 2 on MBPP (column 5), Phi-2's value of 59.1 being more than twice as large as Nano 2's value of 27.2.
  • What is MBPP?
    The MBPP (Mostly Basic Python Problems) dataset consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and automated test cases.
Model           Size   BBH    BoolQ   MBPP   MMLU
Gemini Nano 2   3.2B   42.4   79.3    27.2   55.8
Phi-2           2.7B   59.3   83.3    59.1   56.7
Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks
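To make the MBPP benchmark concrete, here is a hypothetical problem in its style. The task wording, function name, and tests below are illustrative inventions, not items copied from the actual dataset:

```python
# Hypothetical MBPP-style task: "Write a function to remove duplicate
# elements from a list while preserving their original order."
def remove_duplicates(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:     # keep only the first occurrence
            seen.add(item)
            result.append(item)
    return result

# MBPP problems ship with assert-style automated test cases like these;
# a model "solves" the problem when its generated code passes them.
assert remove_duplicates([1, 2, 2, 3, 1]) == [1, 2, 3]
assert remove_duplicates([]) == []
```

A model's MBPP score is simply the fraction of such problems whose generated solutions pass the tests, which is why the 59.1 vs. 27.2 gap in Table 2 is so striking.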
Now let's pause and take a couple of steps backwards. One of the most impressive emergent skills shown by ChatGPT, Bard, and other LLM chatbots is their ability to generate valid code in various languages, e.g., Python. Somehow the editor of this blog got the impression that only large language models could generate code, models that were at least 500 times as large as the pip-squeaks in Table 2:
  • ChatGPT's GPT-4 LLM has about 1.76 trillion parameters; and
  • Gemini's Ultra has about 1.56 trillion parameters. 
Nevertheless, the data in Table 2 validates the title of the Microsoft blog note: small language models have surprising power.

As for Phi-2 being more than twice as powerful as Nano 2, this result should not be taken too seriously. Phi-2 is still a research project, whereas Google has already added Nano to its Pixel smartphone production line. Indeed, Google has issued assurances that it will install Nano on all Android devices, where it will function as a stand-alone model without a connection to the Internet cloud. Although the editor could glibly wave aside this "victory" of Microsoft's SLM over Google's SLM, Microsoft's next victorious challenge cannot be ignored. 

Microsoft posed the kind of question to Phi-2 and to Gemini Ultra that students might encounter in an introductory physics course, specifically: to calculate the speed of a skier at the bottom of a hill, given the skier's mass, the gravitational constant, and the height of the hill. Phi-2 calculated the correct answer, but Ultra's answer was wrong. Adding insult to injury, Microsoft asked Phi-2 to identify the mistake in Ultra's answer ... which it did. 

If the models had been asked to select the correct answer from multiple choices, Phi-2 might have made a lucky guess. But this question, like all good questions in intro physics courses, was an open question that required students to calculate the right answer and to show how they had made their calculation. Nevertheless, the SLM got it right; the LLM got it wrong; and the SLM identified the error in the LLM's logic. 
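For readers who want to check the physics themselves: the problem is a one-line energy-conservation calculation in which, notably, the skier's mass cancels out. A minimal sketch follows; the 40 m hill height is an assumed value for illustration, not a figure from Microsoft's blog:

```python
import math

def skier_speed(height_m, g=9.8):
    """Speed at the bottom of a frictionless hill via energy conservation.

    m*g*h = 0.5*m*v^2  ->  the mass m cancels, so v = sqrt(2*g*h).
    """
    return math.sqrt(2 * g * height_m)

# With an assumed 40 m hill:
print(f"{skier_speed(40.0):.1f} m/s")  # prints "28.0 m/s"
```

The fact that the mass is given but irrelevant is exactly the kind of distractor that trips up a model reasoning by pattern rather than by physics.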

High quality data can empower SLMs
How could this happen? The answer implied by the authors of Microsoft's blog note is plausible, snarky, and initially disturbing. 
"The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection ...
... training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on 'textbook-quality' data, following upon our prior work 'Textbooks Are All You Need.'"
In other words, a substantial chunk of Phi-2's surprising power derives from its having been trained exclusively on high-quality data, which the authors describe as "textbook" quality.
  • This is a plausible converse to one of information technology's oldest maxims: garbage in, garbage out.
     
  • It's snarky because it suggests that Gemini Ultra's weaker performance might be due to its having been trained on inferior data. Indeed, if Gemini Ultra could not correctly answer a simple intro physics question, what other shortfalls stemming from inferior training data might its users encounter?

  • And it is initially disturbing because it suggests that Big Tech corporations may be currently engaged in a race to the bottom because of their de facto assumption that bigger is better. So they beg, borrow, and steal all of the data they can get their hands on, as fast as they can, from anywhere and everywhere, regardless of copyright restrictions, privacy violations, and gender/racial bias.
SLMs can empower competition
However, on second thought, initial disturbance quickly gives way to elation. If bigger were always better, generative AI would be forever dominated by Big Tech -- Apple, Amazon, Facebook/Meta, Google/Alphabet, and Microsoft -- because no other firms would be rich enough to afford this competition's ever-increasing entry fees.

But common sense suggests that if SLMs provide viable foundations for generative AI, then the cost of producing small models should be orders of magnitude lower than the cost of producing models 500 times as large. Small models should only require investments measured in hundreds of millions, or even tens of millions, of dollars, compared to the billions required to produce large models. Instead of five competitors, we might see 500. More competitors should produce more effective models that could address more important challenges much faster, at lower cost, and more safely, especially if a substantial segment of the producers worked on open-source models.
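The "orders of magnitude" claim can be backed with a back-of-envelope estimate. A widely used rule of thumb puts training compute at roughly 6 x N x D floating-point operations, where N is the parameter count and D the number of training tokens. The token count below is an assumed, illustrative figure, not one reported by Microsoft or Google:

```python
def training_flops(n_params, n_tokens):
    # Rule-of-thumb estimate: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

tokens = 1.4e12                        # assumed training-set size (tokens)
slm = training_flops(2.7e9, tokens)    # a Phi-2-sized model
llm = training_flops(1.56e12, tokens)  # an Ultra-sized model

print(f"The LLM needs ~{llm / slm:.0f}x the training compute")
# prints "The LLM needs ~578x the training compute"
```

At equal token counts the compute ratio reduces to the parameter ratio, so the same roughly 580x size gap noted earlier reappears directly in the training bill.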

This is our last TL;DR/podcast for 2023 ... See you next year ... :-)
Season's Greetings ... Merry Christmas ... Happy Chanukah 
... Happy Kwanzaa ... Happy New Year!!!

B. Top story in past 2 weeks ...
  1. Microsoft
     
    "Microsoft releases Phi-2, a small language model AI that outperforms Llama 2, Mistral 7B", Carl Franzen, VentureBeat, 12/12/23 *** 
    -- This story also covered by ZDNet, TechRepublic, Medium, Computerworld ... and 
    Microsoft

C. Microsoft Research -- a component of Microsoft
The editor of this blog asked Microsoft Copilot the following questions:

Question 1: What does the Microsoft Research unit do? 
Response 1
"Microsoft Research is a division of Microsoft Corporation that focuses on advancing science and technology to benefit humanity. The mission statement of Microsoft Research is to expand the state of the art in each of the areas in which it does research, rapidly transfer innovative technologies into Microsoft products, and ensure that Microsoft products have a future"

Question 2: Please describe Microsoft's small language models
Response 2:
"Microsoft Research has been working on developing small language models that can outperform larger models in conversational tasks. In 2023, Microsoft Research introduced Orca 2, a pair of compact language models that have proven themselves by either matching or surpassing the capabilities of much larger language models. Phi-2 is another small language model developed by Microsoft Research that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters."


D. Links to ChatGPT's responses to five basic AI questions:  
ChatGPT's answer to each question provides the framework for its response to the next question. So most readers should start with Question 1.


Your comments will be greatly appreciated ... Or just click the "Like" button above the comments section if you enjoyed this blog note.