Saturday, June 7, 2025

How and why we should build a national library of generative AI use cases and benchmark tests

Last update: Monday 6/9/25
Coding has rapidly become the most successful use case for Large Language Models. Models can now code so well that Big Tech firms and many smaller companies are making drastic reductions in the size of their coding staffs. That's bad news ... for software engineers. 

The good news ... for experts in a wide range of other fields ... is that new employment opportunities are just over the horizon for computer-savvy personnel who learn how to use generative AI to solve complex problems in their fields. So how can they learn these new skills?




A. Use cases ... "We learn by doing", John Dewey

For slower-moving technologies, taking courses based on textbooks has traditionally been the most effective way to learn new skills. But genAI is developing faster than publishers can update their textbooks, so the best way to keep up with such rapid innovation is to keep abreast of the latest use cases.
    • The good news is that the tech companies that create innovations usually include use cases for their innovations in their presentations at conferences and other assemblies.

    • The bad news is that there is no one place that anyone can go to find these use cases after the conferences and other assemblies are over, so it is difficult for a prospective user to compare similar innovations ... for example, to determine which AI tool is easier to use and/or provides more accurate summaries of PDF files.

    • Our nation needs at least one reputable organization to provide a badly needed public service for the next two or three years by acting as a curator of use cases: a librarian who maintains up-to-date online catalogs of cases, with copies of the data embodied in these cases in “standard” formats, and/or allows visitors to the catalog to run the use cases on their own data.

    • The curator/librarian should not rate the performance of an app in a use case. However, visitors to the catalog should be encouraged to provide their own assessments, perhaps using the same kind of comments plus star ratings that purchasers provide on Amazon’s online shopping sites.

    • Visitors to the catalog would be required to have appropriate subscriptions to run the cases. For example, if a visitor wanted to run an OpenAI use case that demonstrated a feature that required a ChatGPT Plus subscription, the visitor would need a Plus subscription to run the case.

    • In order to close the gap between the new opportunities that are about to appear and the limited current supply of qualified applicants as quickly as possible, the curator/librarian should invest substantial efforts to solicit use cases from pioneers in academia and non-academic training institutions who have already developed their own use cases. These early cases will enable other instructors to learn the new skills themselves and then teach their students the new skills using the same use cases.
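The Amazon-style comments-plus-stars feedback suggested above could be recorded with something as simple as the following sketch. Everything here is an illustrative assumption (the use-case IDs, function names, and a 1-to-5-star scale), not any real catalog's API:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical in-memory store: use-case ID -> list of (stars, comment) pairs.
reviews = defaultdict(list)

def add_review(use_case_id, stars, comment=""):
    """Record one visitor's star rating (1-5, Amazon-style) and optional comment."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    reviews[use_case_id].append((stars, comment))

def average_rating(use_case_id):
    """Mean star rating for a use case, or None if no one has rated it yet."""
    entries = reviews[use_case_id]
    return round(mean(s for s, _ in entries), 2) if entries else None

add_review("pdf-summary-demo", 4, "Accurate summaries")
add_review("pdf-summary-demo", 5)
print(average_rating("pdf-summary-demo"))  # 4.5
```

A real catalog would of course persist reviews and authenticate visitors, but the aggregation logic need not be more complicated than this.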


    As will be discussed in Appendix 1 to this note,  The Atlantic magazine provided a similar public service during the first year of the COVID pandemic in the U.S. 

    During COVID, Jeffrey Goldberg was its editor-in-chief, a position he continues to hold. Under his leadership, The Atlantic launched the COVID Tracking Project to collect and curate the data required to fill a critical gap in public health reporting during the first year of the pandemic, data that should have been collected by the CDC, but wasn't.

    Given its widely acclaimed success in creating and managing the COVID Tracking Project, The Atlantic would have been an excellent choice to create and manage the badly needed catalogs of use cases for generative AI.

    Unfortunately, Mr. Goldberg’s recent public assertions, assertions that triggered the “SignalGate” fiasco, might cause the Trump administration to misperceive this project as a political challenge, rather than a badly needed public service.


    B. Benchmarks

    The public librarian should also maintain catalogs of benchmark tests. 
    Benchmark tests of large language models are designed to assess the overall power of an AI model, whereas use cases, as their name implies, are designed to show how specific innovations in genAI models can be used effectively. 

    In principle a model that has greater power as measured by benchmark tests should have a wider and deeper range of use cases. For example, many businesses that operate help desks would like to replace their human staff with less costly chatbots running on small inexpensive models that have been derived from large foundation models that received high ratings on benchmark tests.
    • The good news is that many reputable institutions are already providing open-source benchmark tests that assess the overall power of Big Tech’s models. A few of these institutions are listed in Appendix 3 to this note.

    • The bad news, again, is that there is no one place that anyone can go to find out which tests exist, which models have been tested, and the results of these tests.

    • The nation needs at least one reputable organization to act as a curator of a collection of existing benchmark test results: a librarian who maintains up-to-date catalogs of all the test results of the models in the library, plus links to later test results on later versions of the models.

    C. The Librarian/Curator

    The library is a public service, so it should not be a profit-oriented operation. It can receive operating income from three sources: a large number of subscribers paying relatively small subscription fees, advertisers, and donors. Most of its funding should come from subscription fees in order to protect it from the political whims of advertisers and donors.


    D. The New York Times as the badly needed librarian/curator of use cases and benchmark tests

    Ever since OpenAI announced ChatGPT running on GPT-4 back in March 2023, the New York Times has mostly published what might charitably be called “old news”: rewrites of Big Tech press releases two to three days after these stories had been covered by TechCrunch, Wired, The Verge, and other tech publications. 

    The assignment of such a low priority to the world-shaking developments in generative AI is not consistent with the Times’ promise, in its highly successful sales pitch to prospective subscribers, that their subscription fees will help the Times cover the world’s most important issues. Appendix 2 to this note quantifies the dollar value of this sales pitch in a table that shows that the Times’ income from digital subscription fees rose from $47 million in 2011 to $979 million in 2022.


    Amazon

    The Times just signed an agreement with Amazon that enables Amazon to use a substantial portion of the millions of articles published by the Times. The terms and fees were not disclosed in the announcement of the agreement, but the price was evidently high enough for the Times to permit Amazon to train its generative AI models on the Times’ publications. 

    • This agreement means that the venerable New York Times is no longer sitting in the stands observing generative AI from a distance. The Times is definitely out of the stands and onto the playing field. 

    • That’s why the editor of this blog thinks that the Times’ well-educated subscribers have reason to expect that it might begin to cover the incredibly important generative AI revolution in a more responsible manner ASAP, not just with better articles, but by offering them more substantial benefits, e.g., access to catalogs of use cases and benchmark tests. 

    • This does not mean that the Times should make a profit from its management of genAI catalogs like it does from its Wirecutter guide to purchasing products. Subscribers should get unlimited access to its genAI catalogs for free, but non-subscribers should face a very low paywall, for example, 10 to 15 free inquiries per week.
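A weekly free-inquiry quota of the kind suggested above could be enforced with a few lines of logic. This is a minimal sketch under stated assumptions: the in-memory usage log, visitor IDs, and function names are hypothetical, and the limit is set to the low end of the 10-to-15 range mentioned in the text:

```python
from datetime import date

FREE_INQUIRIES_PER_WEEK = 10  # low end of the 10-15 range suggested above

# Hypothetical in-memory usage log: visitor ID -> (ISO week, inquiries used).
usage = {}

def allow_inquiry(visitor_id, is_subscriber, today=None):
    """Subscribers get unlimited access; non-subscribers draw down a weekly quota."""
    if is_subscriber:
        return True
    today = today or date.today()
    week = today.isocalendar()[:2]      # (year, ISO week number)
    last_week, used = usage.get(visitor_id, (week, 0))
    if last_week != week:               # quota resets at the start of each week
        used = 0
    if used >= FREE_INQUIRIES_PER_WEEK:
        return False
    usage[visitor_id] = (week, used + 1)
    return True
```

A production paywall would track usage in a database keyed to accounts or devices, but the reset-per-ISO-week bookkeeping would look much the same.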




    World shaking events

    The Times takes pride in its commitment to keeping its well-educated readers informed about the world’s most important issues, whether they involve climate change, the sudden emergence of authoritarian regimes, rising instability in trade relationships, you name it. Yes, these are very important issues that deserve extensive and intensive coverage by the Times. 


    But the world-shaking magnitude of the potential positive and negative impacts of generative AI, as it evolves into artificial general intelligence (AGI) faster than anyone had predicted just a few years ago, is at least as important as any other issue that anyone can put on the table. 


    Unique qualifications

    The New York Times is one of the few publications of any kind that can provide extensive and intensive coverage of the game-changing developments in generative AI that are popping up at a faster and faster pace. Its stories about genAI will provide valuable indicators of new use cases that should be added to its catalogs. And its substantial subscriber base will make it relatively immune to pressure from advertisers.



     ... Appendix #1 …

    The COVID Tracking Project 

    When the editor of this blog asked ChatGPT to provide the following concise overview of this endeavor, he didn't need to double-check its sources because COVID had been the primary focus of this blog from the onset of the pandemic in 2020 until the Biden administration brought the pandemic to a de facto close in late 2022/early 2023. 

    Peter Thiel's Palantir company received a federal contract to collect similar data for HHS in April 2020, but Palantir's data was not shared with the wider community until early 2021, a sharing that enabled The Atlantic to shut down its Tracking Project in March 2021.

    🧪 The COVID Tracking Project: Filling the Void Left by the CDC
    When the CDC failed to provide timely and reliable national COVID-19 data in early 2020, The Atlantic stepped in with The COVID Tracking Project, launched on March 7, 2020.

    What began as a volunteer effort quickly became the country’s most trusted and widely cited source of pandemic data, the gold standard — used by public health agencies, media outlets, researchers, and even the federal government.

    🔹 What the Project Tracked
    The team published consistent, daily updates on:
    • Testing data (total tests, test results, and positivity rates)
    • Confirmed COVID-19 cases, by state and nationally
    • Hospitalization data, including current hospitalizations and ICU usage
    • Death counts, including fatalities in long-term care facilities
    • Racial and ethnic breakdowns of cases and deaths
    • Outbreak data from nursing homes and assisted living centers

    🧑‍🔬 Guided by Experts
    The project’s reliability stemmed from its collaboration with:

    • Public health officials
    • Epidemiologists
    • Health data scientists
    • Veteran journalists and data analysts from The Atlantic

    These advisors helped shape the methodology, ensuring that the data met professional standards even as it was gathered and published by a largely volunteer force.



     ... Appendix 2 …

     NY Times Subscription Income 

    The editor of this blog provided ChatGPT on GPT-4o and Meta AI on Llama 4 with the same prompt: 

    • Provide a summary of the percentage of its income that the NY Times derived from advertising and subscriptions in each of the following years -- 2000, 2004, 2008, 2012, 2016, 2020, 2024 ... and note when the Times first posted a paywall.

    He used the response from each chatbot to check the other chatbot’s response. Their responses were more or less the same because they consulted almost the same sources. Meta AI’s response is quoted below because its response was more concise. 


    Response from Meta AI on Llama 4

    "The New York Times has undergone significant changes in its revenue model over the years. Here's the breakdown of advertising versus subscription income for the specified years:


    - 2004:

        - Advertising: $1,222 million (61% of total revenue)

        - Print Circulation: $616 million (31% of total revenue)

        - Digital Subscription: Not available (N/A)


    - 2008:

        - Advertising: $1,068 million (56% of total revenue)

        - Print Circulation: $668 million (35% of total revenue)

        - Digital Subscription: N/A


    - 2012:

        - Advertising: $712 million (45% of total revenue)

        - Digital Subscription: $114 million (7% of total revenue)

        - Print Circulation: $681 million (43% of total revenue)


    - 2016:

        - Advertising: $581 million (37% of total revenue)

        - Digital Subscription: $232 million (15% of total revenue)

        - Print Circulation: $647 million (42% of total revenue)


    - 2022 (latest available data):

        - Advertising: $523 million (23% of total revenue)

        - Digital Subscription: $979 million (42% of total revenue)

        - Print Circulation: $574 million (25% of total revenue)




    As for 2024, the data isn't available yet. However, we can see a clear trend of increasing revenue from digital subscriptions and decreasing revenue from advertising.


    The New York Times introduced a paywall on its online edition in 2011, launching its digital subscription model. This move marked a significant shift in its revenue strategy, with digital subscription revenue growing from $47 million in 2011 to $979 million in 2022."


    Primary Source = https://www.visualcapitalist.com/cp/new-york-times-revenue-chart/


    Editor's notes: 
    1. Data for 2024 was not available yet; so Meta AI provided data for 2022, the midpoint between the requested years 2020 and 2024.

    2. Nowadays most printed editions of the Times are obtained through subscriptions and delivered to "doorsteps", rather than purchased at newsstands. These subscribers may be computer savvy, but prefer paper copies because of personal preference and/or family tradition. So the total income from subscribers who responded to the Times’ online sales pitch may be worth well over one billion dollars annually and account for over 60 percent of its income.


    ... Appendix 3 …

    Benchmark Test Results

    Organization | Benchmark(s) | Primary Focus Area
    MLCommons | MLPerf (Training, Inference) | System performance: throughput and latency for cloud & edge
    Epoch AI | FrontierMath & others via AI Benchmarking Dashboard | Challenging tasks across AI capabilities
    Vellum AI | SOTA leaderboards (GPQA, reasoning, non-saturated evals) | Logic/reasoning; tracking cutting-edge LLM performance
    LiveBench | Real-time LLM evals | Reliable, contamination-aware model evaluation
    Center for AI Safety | Humanity’s Last Exam | Multimodal reasoning at the frontier of human knowledge
    Scale AI | AgentBench (via community platforms) | Agent reliability, multi-agent coordination, ML problem solving
    PerfKitBenchmarker | Big data, HPC, networking, storage benchmarks | Infrastructure metrics: CPU, I/O, network across clouds
    Other key suites | MMLU, GLUE/SuperGLUE, GSM8K, SQuAD | General language understanding & reasoning abilities
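A catalog of benchmark results like the one in the table above could be backed by a very simple record type. The sketch below is purely illustrative; the field names and lookup helper are assumptions, not any organization's actual schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkEntry:
    """One catalog row: who runs the test, what it is, and what it measures."""
    organization: str
    benchmark: str
    focus: str

CATALOG = [
    BenchmarkEntry("MLCommons", "MLPerf (Training, Inference)",
                   "System performance: throughput and latency for cloud & edge"),
    BenchmarkEntry("Epoch AI", "FrontierMath & others",
                   "Challenging tasks across AI capabilities"),
    BenchmarkEntry("LiveBench", "Real-time LLM evals",
                   "Contamination-aware model evaluation"),
]

def by_organization(name):
    """Return all catalog entries maintained by one organization."""
    return [e for e in CATALOG if e.organization == name]
```

A real catalog would add fields for model names, scores, test dates, and links to later results on later model versions, as section B proposes.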




    Your comments will be greatly appreciated ... Or just click the "Like" button above the comments section if you enjoyed this blog note.