Tuesday, April 26, 2022

Moving another CDC "component" to the Census Bureau

Last update: Friday 4/29/22 
The U.S. Census Bureau is America's official biographer, determined to get all of the facts about life in the USA, determined to get them right, and determined to get them without bias as to which facts are more important than others. The last characteristic distinguishes the Census Bureau from other data repositories, e.g., the CDC. All repositories produce tables that codify their anticipation of the tabulations that most users would like to obtain. However, the Census Bureau is unique in its extensive efforts to enable users to extract new tabulations. 

For example, suppose that both the CDC and the Census Bureau published age-related tables that lumped all older residents into a single "65 and older" category. The underlying datasets from which the published tables were derived would likely have older ages grouped into finer categories, e.g., "65 to 69", "70 to 74", "75 to 79", and "80 and older". CDC users would have to settle for the published "65 and older" table, but Census users could trigger the production of a new table that divided older ages into the smaller categories. 
In other words, the tables provided by most agencies, including the CDC, are a "closed" collection, i.e., users can only receive copies of tables and associated maps and charts that were previously prepared by the agency's staff. By contrast, the collection of census tables is "open", i.e., users can trigger the generation of new tables with new maps and charts.
  • Technical note: Any table in the CDC's closed collection can be filtered, grouped, or aggregated before downloading. This is a convenience for users who might not want to perform these operations after downloading a full table ... or an enhancement for users who don't know how to perform these operations on a downloaded table. But in either case the end result is merely a simplified version of the original table; it is not a new table.
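The difference between filtering a published table and generating a new one can be made concrete with a minimal Python sketch. Everything here is an assumption made for illustration: the counts are invented, and the function names (`published_table`, `filter_published`, `retabulate`) are hypothetical and do not correspond to any CDC or Census Bureau interface.

```python
from collections import defaultdict

# Hypothetical underlying records: (age_group, count). These numbers are
# invented purely for illustration; they are not CDC or Census figures.
UNDERLYING = [
    ("under 65", 800),
    ("65 to 69", 70),
    ("70 to 74", 55),
    ("75 to 79", 40),
    ("80 and older", 35),
]

OLDER_GROUPS = {"65 to 69", "70 to 74", "75 to 79", "80 and older"}


def published_table(records):
    """The agency's prepared table: all older ages lumped into one row."""
    table = defaultdict(int)
    for group, count in records:
        label = "65 and older" if group in OLDER_GROUPS else group
        table[label] += count
    return dict(table)


def filter_published(table, label):
    """'Closed' operation: can only select rows the published table already has."""
    return {k: v for k, v in table.items() if k == label}


def retabulate(records, groups):
    """'Open' operation: build a new table directly from the underlying data."""
    return {group: count for group, count in records if group in groups}


closed = published_table(UNDERLYING)
print(filter_published(closed, "65 and older"))  # one lumped row
print(filter_published(closed, "70 to 74"))      # empty: the detail is gone
print(retabulate(UNDERLYING, OLDER_GROUPS))      # four finer rows
```

Filtering the published table can only return the lumped "65 and older" row, whereas re-tabulating from the underlying records recovers the four finer categories. That is the sense in which a filtered table is "not a new table".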

Closed collections of tables and short staffing
A previous note on this blog called attention to an article in the NY Times that discussed data hoarding as the ultimate form of non-tabulation:
  • "The C.D.C. Isn’t Publishing Large Portions of the Covid Data It Collects", Apoorva Mandavilli, NY Times, 2/20/22
The reporter cited some important instances of the CDC's non-publication of data about boosters, hospitalizations and, until recently, wastewater analyses:
  • "For more than a year, the Centers for Disease Control and Prevention has collected data on hospitalizations for Covid-19 in the United States and broken it down by age, race and vaccination status. But it has not made most of the information public."

  • "When the C.D.C. published the first significant data on the effectiveness of boosters in adults younger than 65 two weeks ago, it left out the numbers for a huge portion of that population: 18- to 49-year-olds, the group least likely to benefit from extra shots, because the first two doses already left them well-protected." ... Link to CDC report -- 1/22/22
     
  • "The agency recently debuted a dashboard of wastewater data on its website that will be updated daily and might provide early signals of an oncoming surge of Covid cases. Some states and localities had been sharing wastewater information with the agency since the start of the pandemic, but it had never before released those findings." ... Link to CDC wastewater dashboard
Why didn't the CDC publish tabulations of these data on its Website? The most obvious explanation immediately came to the mind of the editor of this blog, namely: power ... as in, knowledge is power ... as in, those who have greater knowledge have power over those who don't ... well ... maybe. After his subsequent effort to derive strategic implications from the CDC's structure, the editor's suspicions have become less sinister, but more damning:
First, imagine a baseball team called the "CDC All Stars" that was composed of four pitchers, four catchers, and one outfielder = 9 players ... but nobody on First, Second, Third, or Shortstop. For now we will let the absurd imbalance of this hypothetical roster sit with you for a while ... :-)

Second, consider the following copy of a table created by FederalPay.org, shown in a scrollable frame, that displays the distribution of the occupational categories of the CDC's employees in FY 20.
[Embedded frame: FederalPay.org table of CDC employees by occupational category, FY 20]

From the bar chart we can estimate approximate values for the number of CDC employees in biomedical categories:
  • General Health Science = 2200
  • Public Health Program Specialist = 2000
  • Microbiology = 500
  • Total biomedical employees = 4700, i.e., medical doctors, biomedical researchers, etc.
From the chart we can also estimate approximate values for the number of CDC employees in occupations related to the management of its datasets, e.g., data scientists, data analysts, statisticians, and software engineers.
  • Information Technology Management = 500
  • Statistics = 200
  • Total data management = 700
  • The CDC's biomedical internal users of its data outnumber its data managers by almost seven to one. This imbalance has inevitable consequences.
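The arithmetic behind the "almost seven to one" figure can be checked with a short Python sketch. The counts are the approximate values estimated from the chart above, not exact headcounts, and the variable names are chosen here only for illustration.

```python
# Approximate headcounts estimated from the FederalPay.org chart (FY 20).
biomedical = {
    "General Health Science": 2200,
    "Public Health Program Specialist": 2000,
    "Microbiology": 500,
}
data_management = {
    "Information Technology Management": 500,
    "Statistics": 200,
}

total_biomedical = sum(biomedical.values())
total_data_management = sum(data_management.values())
ratio = total_biomedical / total_data_management

print(f"Biomedical employees:      {total_biomedical}")
print(f"Data management employees: {total_data_management}")
print(f"Ratio: about {ratio:.1f} to 1")
```

Because the inputs are read off a bar chart, the ratio of roughly 6.7 to 1 should be treated as an estimate rather than a precise figure.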
Third, now imagine the kinds of internal discussions that occur when the CDC gains access to new sets of data. 
  • The CDC's biomedical subject matter experts must determine the specific form of the table(s) that should be tabulated ... but how can they be sure which table(s) would be most useful to their community of users? Given the "closed" nature of the CDC's collection of tables, there is considerable pressure to identify the "best" table. By contrast, the Census Bureau's "open" collection enables it to publish Table A, knowing that if its power users frequently generated Table B, the staff could add Table B to its collection of prepared tables for the benefit of its less skilled users.

  • Who will implement the tabulation: the CDC's plentiful biomedical subject matter experts or its relatively scarce data managers? Getting the data into the "right shape", a process that data scientists and data analysts call "data wrangling", can be surprisingly time-consuming. If done by subject matter experts, the wrangling will take even longer because data wrangling is not in their core professional skill set, so they might not know the most effective state-of-the-art procedures; they will probably also underestimate the required time ... but if done by data managers, who are in short supply at the CDC, action might be delayed until they find the time to do it.
One could easily imagine that some tables go unpublished, not because of a conscious decision, but because discussions and dithering push back the intended publication dates, again and again. Indeed, the editor of this blog has often wondered why it took so long for the CDC to publish some of the tables that it eventually did publish. Delays and non-publications are the inevitable consequences of the "CDC All Stars" not having enough data managers on First, Second, Third, and Shortstop ... :-)


The Census Bureau and the CDC
Most federal departments and their component agencies act independently with regard to other federal departments and the components of other departments. This general observation fails with regard to the U.S. Census Bureau, which is a component of the U.S. Department of Commerce (DOC), and the CDC, a component of the U.S. Department of Health and Human Services (HHS). The extensive cooperation between these two federal agencies is a taxpayer's delight. 

Indeed, their extensive cooperation suggests a straightforward strategy for improving the CDC's management of its data repository. Better management would yield more timely tabulations of its data and thereby enable the CDC to produce more timely pandemic guidance based on its tabulations. Here's an important example of inter-agency cooperation between the Bureau and the CDC.
  • National Center for Health Statistics (NCHS)
    The NCHS is one of the CDC's many centers. It conducts a national survey described on its home page as follows, "The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives."

    Not stated on the NCHS home page, but fully described on the home page for the NHIS survey on the Census Bureau's Website, is the fact that the interviews for the survey are actually conducted by personnel from the Census Bureau: "NHIS data are collected through personal household interviews. For over 50 years, interviewers from the U.S. Census Bureau have visited American homes to ask about a broad range of health topics. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives."

    In short, everyone knows that the Census Bureau hires and trains battalions of survey interviewers. Rather than hire and train its own smaller battalions of interviewers, the CDC has wisely chosen to deploy some of the experienced interviewers who were hired and trained by its partner, the Census Bureau.
Assuming the correctness of our conjecture that the CDC is understaffed with regard to data managers, why can't the CDC negotiate another wise partnership with the Census Bureau, whose role as the nation's biographer requires that it hire and train battalions of data managers? Rather than hire its own smaller battalions of data managers, why can't the CDC deploy some of the experienced data managers who were hired and trained by its partner, the Census Bureau? 

Hopefully, the hordes of additional data managers from the Bureau would enable the CDC to produce more timely tabulations and would also enable the CDC to enhance the functionality of its repository by enabling users to trigger the production of new tabulations beyond the CDC's existing collection of tables that were specified by its biomedical subject matter experts.

