Innovating with Technology

Getting Started with Machine Learning and Team Science & Science of Team Science Research

Dr. Stephanie Vasko, Michigan State University

Over the past few years, initial forays into machine learning (ML) applications to the science of team science at the Science of Team Science annual conference have sparked moments of dialogue about new ways to conduct both team science and research on team science. Additionally, during 2018 alone, the National Academies of Sciences published two reports on workshops with a focus on ML, the latter of which has implications for CTSA-based team science practitioners: “Data Matters: Ethics, Data, and International Research Collaboration in a Changing World: Proceedings of a Workshop” and “Artificial Intelligence and Machine Learning to Accelerate Translational Research: Proceedings of a Workshop—in Brief” (National Academies of Sciences, 2018a and 2018b). Such forays have already found their way into the commercial sector, with Amazon applying ML to its teams working on complex problems (Machine Learning Center of Excellence) and Google partnering with universities to push advances in science through ML-focused university-company partnerships (Google AI). These initial SciTS talks, NAS reports, and industry drivers spark the need to unpack the landscape of ML applications and techniques.

In this talk, I will present expanded findings on two case studies where commercial and custom ML algorithms have been applied and explored: 1) a science of team science case where authors are interested in predictions on dialogue from research teams participating in Toolbox Dialogue Initiative workshops (Vasko et al., in progress) and 2) a team science case on agricultural disease identification using a combination of image data and ML. These examples will be used to compare and contrast the implementation of commercial, open-source, and homebuilt approaches to ML in both team science research and research on team science. I will explore the hype around ML, unpacking promises and looking at realistic applications of techniques, the types of machine learning that can be used to approach different types of questions in the science of team science field, and potential costs associated with ML for applications in this field. This talk will also explore examples of bias in ML and the implications this bias can have and has had in other research fields before turning to a) the implications of bias for the science of team science and b) the ways in which this bias can be combatted from the outset of research. I will also touch on implications of ML for team science and team science research in the areas of data management, workforce development (NSF, 2018), and career pivots for master’s and PhD students. Finally, attendees will leave this talk with a list of options (including programming languages and packages) for using each of these types of ML and a list of ways to gain skills in these areas.


Collaborative Data Design: A Tool for Multisector Alignment

Dr. Kathleen Pine, Arizona State University; Dr. Michael Shafer, Arizona State University; Dr. Margaret Hinrichs, Arizona State University; Ms. Kailey Love, Arizona State University; Dr. George Runger, Arizona State University; Dr. William Riley, Arizona State

Fragmentation and lack of coordination between systems represent a significant barrier to improving health, well-being and equity in our country. The fragmentation present in the behavioral health care delivery system leads to disruptions in continuity of care, lack of coordination, lack of systems-level solutions, and siloed datasets. The Systems for Action (S4A) project on behavioral health in metro Phoenix utilized an interdisciplinary framework to create knowledge infrastructure aimed at facilitating data-driven behavioral healthcare delivery decision making, policy development, policy implementation, and policy evaluation across multiple sectors core to the behavioral healthcare delivery system in Phoenix, Arizona.

Three sectors comprising multiple organizations are key to behavioral health care delivery in Phoenix: health (e.g. provider organizations, crisis response organizations), local government (e.g. county public health office, county manager’s office), and public safety (e.g. county sheriff’s department, correctional health department). Our interdisciplinary team utilized a participatory approach to increase alignment between these sectors. This approach consisted of multisector data sharing, complex interactive data visualizations, and a participatory design process that entailed ongoing “convening” sessions of researchers and participants from key organizations in each of the three sectors.

Through an ongoing series of meetings held at the ASU Decision Theater, participants identified datasets for integration into a comprehensive behavioral health dataset. Participants then collaborated with researchers to collect these datasets, navigating political and organizational barriers. Once datasets were integrated into a central repository, researchers developed a set of interactive data visualizations focused on the health care delivery system. Multisector participants then took part in a series of five “data design” meetings with the research team in which they collectively explored and critiqued the data visualizations, discussed behavioral health priority areas, and engaged in thought exercises about what questions could be asked of the data visualizations and how the dataset and visualizations could be mobilized to solve specific problems.

To evaluate the participatory data sharing and data design process, the research team utilized a qualitative approach including observations and semi-structured interviews. Our qualitative study revealed that five mechanisms emerged from our process to increase alignment between different sectors. These mechanisms were: 1) Data Sharing (including both the social and technical apparatus of sharing data); 2) Convening (recruiting a diversity of participants from appropriate sectors and engaging in repeated interaction); 3) Relationship Building (creating shared communication context and coming to know one another within that context); 4) Shared Information Tool (creating visualizations with effective data characteristics: actionable, poignant, efficient, and authoritative); 5) Perspective Taking (taking the perspective of different participants/sectors and developing an overview of systems problems from the view of these different entities). Our study further revealed that these five alignment mechanisms resulted in four distinct alignment outcomes: 1) improved multisector systems awareness; 2) strengthened relationships between specific sectors; 3) increased acknowledgement of the importance of data sharing and transparency; and 4) augmented decision making.


CohortGen: Promoting Collaboration Through Automatic Cohort Generation

Dr. Charisse Madlock, University of Tennessee Health Science Center; Mr. Billy Barnett, University of Tennessee Health Science Center

In recent years there has been a significant focus on increasing collaboration within the biomedical community [1]. Investigators need projects around which to collaborate, as they are unlikely to reach out to others without a need. Our system presents users with both a generated cohort from an Electronic Medical Records (EMR) system and potential collaborators at the same time. This gives users a research direction together with prospective co-investigators, so the need for collaborators may be more apparent. We use the Unified Medical Language System (UMLS) to summarize and link EMR data to a social network of researchers to facilitate dataset discovery and collaboration. Many healthcare organizations have built enterprise data warehouses (EDW) that gather electronic data from multiple sources into a single data model that can be used to facilitate analysis, reporting and strategic decision making [2]. The Cerner Healthfacts © database, to which University of Tennessee Health Science Center faculty and students have access, includes de-identified records from approximately 70 million patients and spans 2001 to 2017. This data warehouse contains a deluge of complex healthcare information, and it can be difficult for researchers to find useful information that relates to their fields of study and interest.

Our purpose is to provide a more personalized experience for viewing the contents of a large-scale EMR data warehouse. Our system generates graphical summaries and hierarchical structures that summarize the data relevant to each researcher, thus surfacing potential collaborations with others who work in similar areas.

Our presentation will detail the development pipeline and user workflow for our system, CohortGen. We first extracted International Statistical Classification of Diseases (ICD) codes from Healthfacts © and, from Scopus, articles published by UTHSC employees over the past ten years. We used the cTAKES clinical text analysis engine to tag articles with UMLS concepts. We then used those UMLS concepts and the ICD codes to create a crosswalk between the Cerner Healthfacts © database and UTHSC research articles.
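The crosswalk step can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the CohortGen implementation: the concept labels are made-up placeholders rather than real UMLS identifiers, the article IDs are invented, and the function name `build_crosswalk` is hypothetical. It assumes each ICD code has already been mapped to one concept and each article has already been tagged with a set of concepts.

```python
from collections import defaultdict

# Placeholder data: the concept labels are made up, not real UMLS identifiers.
icd_to_concept = {
    "E11": "CONCEPT_DIABETES",      # ICD-10 E11: type 2 diabetes mellitus
    "I10": "CONCEPT_HYPERTENSION",  # ICD-10 I10: essential hypertension
}
article_to_concepts = {
    "article-001": {"CONCEPT_DIABETES", "CONCEPT_OBESITY"},
    "article-002": {"CONCEPT_HYPERTENSION"},
    "article-003": {"CONCEPT_DIABETES"},
}


def build_crosswalk(icd_to_concept, article_to_concepts):
    """Map each ICD code to the sorted list of articles tagged with its concept."""
    # Invert the article tagging so we can look up articles by concept.
    concept_to_articles = defaultdict(set)
    for article, concepts in article_to_concepts.items():
        for concept in concepts:
            concept_to_articles[concept].add(article)
    # Join ICD codes to articles through the shared concept.
    return {
        icd: sorted(concept_to_articles.get(concept, set()))
        for icd, concept in icd_to_concept.items()
    }


crosswalk = build_crosswalk(icd_to_concept, article_to_concepts)
for icd, articles in crosswalk.items():
    print(icd, articles)
```

In this toy setup the shared concept acts as the join key, so a diagnosis code in the data warehouse leads directly to the local articles (and therefore authors) that discuss the same concept.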

We created a website using Oracle’s APEX system that gives users a way to explore potential datasets and potential collaborators based on UMLS concepts shared across researcher profiles generated by our system. The system produces several reports, charts, and graphs that give a researcher or data analyst quick insight into the contents of the data warehouse.
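One simple way to rank potential collaborators by shared concepts is set overlap. The sketch below uses Jaccard similarity, which is our own choice for illustration; the abstract does not say which measure CohortGen uses. The researcher names, concept labels, and the function `suggest_collaborators` are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_collaborators(profiles, researcher, top_n=3):
    """Rank other researchers by concept overlap with `researcher`.

    Researchers with no shared concepts are dropped; ties are broken by name.
    """
    mine = profiles[researcher]
    scored = [
        (other, jaccard(mine, concepts))
        for other, concepts in profiles.items()
        if other != researcher
    ]
    scored = [(other, score) for other, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Hypothetical researcher profiles: sets of placeholder concept labels.
profiles = {
    "researcher_a": {"c1", "c2", "c3"},
    "researcher_b": {"c2", "c3"},
    "researcher_c": {"c9"},
    "researcher_d": {"c1"},
}

suggestions = suggest_collaborators(profiles, "researcher_a")
print(suggestions)
```

Here `researcher_b` ranks first because two of three concepts overlap, while `researcher_c` shares nothing and is omitted. A production system would likely weight concepts by specificity rather than treat them all equally.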

Once users have completed their exploration, they can request access to the dataset that matches the ICD diagnostic codes identified by our system. We believe that our tool will not only help increase collaboration across our campus but also increase the use of a complex, available dataset by UTHSC research faculty.


1. Bennett LM, Gadlin H, Levine-Finley S. Collaboration & Team Science: A Field Guide. NIH Office of the Ombudsman, Center for Cooperative Resolution Bethesda …; 2010.

2. Shin S-Y, Kim WS, Lee J-H. Characteristics Desired in Clinical Data Warehouse for Biomedical Research. Healthc Inform Res. 2014;20(2):109. doi:10.4258/hir.2014.20.2.109

SciTS Presentation: Barnett Madlock-Brown - CohortGen promoting collaboration through automatic cohort generation.pdf


Do More, Faster: Utilizing Advanced Computational Resources in Your Research Team

Dr. Dirk Colbry, Michigan State University

The use of technology in research is becoming ubiquitous as low-cost programmable sensors and advanced computing (e.g., AI and big data) emerge in nearly every domain. Out of necessity, many scientists and domain experts have become Technology Users: individuals who need to employ advanced technology in their research, but who do not have broad expertise in topics such as engineering and programming. Modern laptop and desktop computers are extremely powerful, and most scientists can accomplish 90-95% of what they need with a computer in their office. However, much of today’s cutting-edge science needs bigger, stronger and faster computers. High Performance Computing (HPC) systems (aka supercomputers) are designed to take science to the next level. Universities and national funding agencies (NSF, NIH, DOD, DOE, etc.) provide researchers access to large-scale computing resources, often free of charge. The most common barrier for scientists who want to utilize these resources is knowledge: knowing that these tools exist and how to get started using them.

This talk will introduce the world of large-scale computing. We will discuss how to gain access to both local and national resources, the common types of problems that can be solved using these resources, and training programs specifically designed to help new researchers leverage these technologies and use them effectively in their own research.

Dr. Dirk Colbry is a Cyberinfrastructure Professional with many years of expertise in developing and facilitating both technical and professional skills training materials. With a background in Artificial Intelligence, Pattern Recognition, High Performance Computing and Computational Science, Dr. Colbry has worked with dozens of researchers at three universities (University of Michigan, Michigan State University and Arizona State University) and in many disciplines (zoology, biology, physics, chemistry, psychology, engineering, math, linguistics, health sciences, and others) on projects related to computational sciences and cyberinfrastructure. Dr. Colbry served as the Director of the MSU High Performance Computing Center and as a Computational Consultant at the MSU Institute of Cyber Enabled Research prior to assuming his current role in the Department of Computational Mathematics, Science and Engineering (CMSE) at Michigan State University.


Supplanting Institutionalized Science of Science Data Collection with Big Data Methods ‘in the Wild’

Dr. Ioannis Pavlidis, University of Houston; Mr. Mohammed Emtiaz Ahmed, University of Houston

The international research community has been increasing in size and connectivity by leaps and bounds. Accordingly, policy makers need accurate, up-to-date indicators to help them craft and adjust policies for a rapidly evolving ecosystem. The UNESCO Science Report series is the definitive tool in this direction. The conventional process of creating such a report, however, is time-consuming and requires significant manpower. To reduce the cost and speed up the production of science of science indicators for global and local policy makers, we introduce a new method based on big data and analytics. Specifically, we use the Google Scholar database (~2 million profiles) as an extensive sample of the worldwide research population. After performing quality control, we analyze these data, showing that they produce results on par with the data in the UNESCO Science Report.

METHODS: We collected all the available profiles in the Google Scholar repository as of June 2018. Each profile includes the researcher’s full name, university affiliation, total citations, research areas, and rank. In parallel, we collected a worldwide list of universities annotated with country codes. This way we were able to associate each research profile with a specific country.

RESULTS: After data cleaning, we performed exploratory data analysis and found that the United States has the highest number of researchers and that US researchers have the highest mean citation count among all countries. This is in agreement with other reports in the literature. Importantly, we found that the Google Scholar data are highly correlated with the UNESCO Science Report data.
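The country assignment and comparison described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: the profile records, university names, country codes, and reference counts are all invented placeholders, the function names are hypothetical, and Pearson correlation is assumed here because the abstract does not name the correlation measure used.

```python
from collections import Counter
from math import sqrt


def count_by_country(profiles, university_country):
    """Count profiles per country, skipping affiliations not in the university list."""
    counts = Counter()
    for profile in profiles:
        country = university_country.get(profile["affiliation"])
        if country is not None:
            counts[country] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes both series have nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented placeholder data: not real profiles, universities, or indicator values.
university_country = {"Univ A": "XX", "Univ B": "YY", "Univ C": "ZZ"}
profiles = [
    {"name": "p1", "affiliation": "Univ A"},
    {"name": "p2", "affiliation": "Univ A"},
    {"name": "p3", "affiliation": "Univ A"},
    {"name": "p4", "affiliation": "Univ B"},
    {"name": "p5", "affiliation": "Univ B"},
    {"name": "p6", "affiliation": "Univ C"},
    {"name": "p7", "affiliation": "Unlisted Institute"},  # dropped: no country match
]
reference_counts = {"XX": 30, "YY": 20, "ZZ": 10}  # stand-in for report figures

counts = count_by_country(profiles, university_country)
countries = sorted(set(counts) & set(reference_counts))
r = pearson([counts[c] for c in countries], [reference_counts[c] for c in countries])
print(dict(counts), r)
```

Affiliation strings in real profiles are free text, so exact dictionary lookup as used here would miss many matches; some normalization or fuzzy matching would presumably be needed during data cleaning.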

CONCLUSION: Conventional collection of global science of science data is part of a sophisticated UNESCO operation involving the cooperation of nearly all the governments in the world for years at a time. It appears that a science of science dataset amassed from Google Scholar in a few days via scripting provides essentially the same information and insight as the UNESCO dataset. Big data, automated analytics, and open sources promise to change the information model that feeds science policy makers.

SciTS Presentation: Supplanting Institutionalized Science of Science Data Collection with Big Data Methods 'in the wild'