Monday, June 13, 2022

Does the CDC have full access to its most important data?

Last update: Monday 6/13/22 

The editor of this blog has come to believe that a substantial proportion of the CDC's guidance that limited our behavior during the pandemic was not based on limits derived from an underlying "science" of the coronavirus, but on the CDC's limited access to the petabytes of data in its own repository.  




The editor believes that CDC's limited access is embodied in the searchable files in its repository, i.e., the files accessed via the National Center for Health Statistics (NCHS) page. 

From time to time throughout the pandemic, the editor browsed the descriptions of COVID-related files, downloaded the ones he found of greatest interest, then wrote "tidyverse" R scripts to tabulate their contents into more meaningful statistical summaries. He has even published a few notes on this blog that discussed these tabulations, the most recent of which concerned the effectiveness of our vaccines in preventing COVID-related deaths among our nation's oldest residents.
The editor presumes that his readers will understand his strong interest in the effectiveness of vaccines (and boosters) in protecting the lives of old people because the editor himself is an "old guy" who will turn 81 in October. However, this "old guy" is also a retired data analyst, data scientist, and policy analyst whose long career conditioned his mind to solve problems by assembling the kinds of data most closely related to the most essential features of those problems.

The three most essential and unchanging features of the virus
These three features were well documented by the end of 2020, the first year of the pandemic in the U.S.
  • The virus has an overwhelming capacity to kill old victims. 
  • The virus is also highly capable of killing diabetics, and old diabetics are doubly vulnerable.
  • Its most vulnerable victims are the immunocompromised who, by definition, have limited capacity to resist its attacks.
Indeed, more than 90 percent of COVID deaths have been victims in one or more of these three categories. Being an old analyst, the editor expected to find that most of the CDC's datasets related to COVID would involve these three most vulnerable classes of its victims ... but they don't. The status of every COVID victim with regard to each of these categories should be readily ascertainable. Unfortunately, very few CDC tables contain data about whether the persons who were sick, hospitalized, or died were diabetic or immunocompromised. Yet the editor of this blog suspects that the CDC has data about the diabetic and immunocompromised status of COVID victims in most, if not all, of the raw data it collects from its partner agencies in all 50 states.

To be sure, age is a category in many of its tables, but most tables contain only one category for older victims: age 65 and older. Many of the CDC's datasets divide childhood into multiple categories, e.g., under 1, 1 to 5, 6 to 11, 12 to 17, a reasonable breakdown because we all know that children in different age groups have physical and social differences that may cause them to have different reactions to diseases, vaccines, and treatments. Are the CDC's experts unaware of the profound differences in health conditions and social interactions between people in older age groups? Why do most of its tables only specify 65 and older? Why not have at least four categories in most tables: 50 to 64, 65 to 74, 75 to 84, and 85 and older? Again, the editor of this blog suspects that the age of COVID victims has been provided to the CDC by its state-level partner agencies.
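
A minimal sketch of the finer breakdown proposed above, assuming record-level ages are available (which is this blog's suspicion, not a confirmed fact). The ages are made up for illustration, and `age_group` is a hypothetical helper, not part of any CDC system.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the four proposed older-age categories.
LOWER_BOUNDS = [50, 65, 75, 85]
LABELS = ["under 50", "50 to 64", "65 to 74", "75 to 84", "85 and older"]


def age_group(age):
    """Return the label of the proposed category that contains `age`."""
    return LABELS[bisect_right(LOWER_BOUNDS, age)]


# Made-up ages, for illustration only; not CDC data.
sample_ages = [48, 52, 66, 71, 79, 83, 88, 91, 67]

# Fine-grained counts versus the single "65 and older" bucket.
fine = Counter(age_group(a) for a in sample_ages)
coarse = Counter("65 and older" if a >= 65 else "under 65" for a in sample_ages)

for label in LABELS:
    print(f"{label:>13}: {fine[label]}")
print(f"{'65 and older':>13}: {coarse['65 and older']}")
```

The three finer counts add up to the single "65 and older" count, but the reverse does not work: once a table is published with only the single category, no user can split it back into 65 to 74, 75 to 84, and 85 and older.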

Why doesn't the CDC produce more extensive tables that are more relevant to COVID's primary victims?
The CDC is woefully understaffed in the data analysis and data science skills required to wrangle all of the data it receives from its numerous partner agencies in fifty states into standardized formats, then organize the data into user-friendly formats for access via the Web. Fewer than 10 percent of its staff have the required technical skills, whereas the majority of its staff are domain experts in the biological and medical sciences. This imbalance was highlighted in another note on this blog.
The U.S. Census Bureau employs battalions of highly skilled professional data techs who not only enable the Bureau to produce more extensive datasets from its raw data than the CDC does, but also enable the Bureau to design and maintain a far more extensible user interface. Given search terms specified by a user, the CDC's interface can only fetch files that already exist in its repository or are subsets of existing files. In stark contrast, the Bureau's interface can also concoct files on the fly that meet a user's specifications by combining data from the underlying components of its repository into files that never existed before the user made their request.
  • For example, if the CDC's repository had an interface of comparable power, preferably in the form of a dashboard, a user would be able to request tables wherein the ages of COVID victims were divided into five-year categories, or even one-year categories.

  • A user might also be able to request the same kind of data found in an existing file that has no "diabetics" category, but with the new file restricted to diabetics aged 77 and older.
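
The two requests above can be sketched as one small function, assuming the interface can reach record-level rows that carry an age and a diabetic flag. Everything here is hypothetical: the `Record` fields, the `tabulate_deaths` function, and the sample rows are invented for illustration and do not describe the CDC's or the Census Bureau's actual systems.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    age: int
    diabetic: bool


def tabulate_deaths(records, bin_width=5, min_age=0, diabetic_only=False):
    """Count records per age bin, built on request from record-level rows."""
    counts = Counter()
    for r in records:
        # Apply the user's filters before binning.
        if r.age < min_age or (diabetic_only and not r.diabetic):
            continue
        start = (r.age // bin_width) * bin_width
        label = str(start) if bin_width == 1 else f"{start}-{start + bin_width - 1}"
        counts[label] += 1
    # Return bins in ascending age order.
    return dict(sorted(counts.items(), key=lambda kv: int(kv[0].split("-")[0])))


# Invented rows, for illustration only.
rows = [Record(72, False), Record(78, True), Record(81, True),
        Record(84, False), Record(90, True), Record(76, True)]

print(tabulate_deaths(rows))                                  # five-year categories
print(tabulate_deaths(rows, bin_width=1))                     # one-year categories
print(tabulate_deaths(rows, min_age=77, diabetic_only=True))  # diabetics aged 77+
```

Neither output table exists in advance. Each is assembled from the underlying rows when the request arrives, which is the capability the editor attributes to the Bureau's interface and finds missing from the CDC's.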
