FDA Medical Device Report Project Part 1

Background

Medical device reports (MDRs) are sent to the FDA from across the country. These reports are gathered and reviewed by experts at the FDA, and are also made available to the public. At the moment, there is only one easy way to retrieve MDRs from the FDA: the MAUDE database (https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm). However, this database suffers from a poor interface, few search features, and a limit of 500 reports per retrieval. The MAUDE database is therefore only suitable for analyzing small numbers of reports, and a new way of extracting and using device report data is needed.

The goal of this project is three-fold:

  1. Web Scraping to gather relevant data from the FDA website
  2. Text analysis to determine key words from report descriptions
  3. Visualizations to find interesting trends or areas needing further analysis

This post will focus on part 1 of this project.

The FDA has a second repository of medical device reports, kept as raw data in JSON format on another site (https://open.fda.gov/data/downloads/). Each JSON file sits inside its own ZIP archive and has the same name as every other file, making it a nightmare to manually download, unzip, rename, and merge the JSONs needed to build the base dataset. Furthermore, any analysis code would have to be fed a cleanly formatted file in order to be usable, putting undue burden on end users, who would have to perform all of the above steps without error. For these reasons, web scraping and automated assembly of these files according to the user’s parameters became the first major priority.

Selenium Web-Scraping

The FDA site in question was most certainly not made with web scraping in mind (or with much thought for the end-user experience, for that matter). Obstacles include:

  1. All buttons share the same name
  2. A lightbox overlay with a clickable button blocks interaction with the rest of the page
  3. The download links we need are populated by JavaScript only after clicking a “show all” button
  4. The button we need to click is absent from the page source until the user scrolls to its place on the page

The above obstacles prevent simple web scraping tools like Beautiful Soup from accessing the URLs needed to download the reports. Fortunately, we can use Selenium to perform the actions needed to populate the site's HTML without manual input. The script pulls up the FDA website, gets past the lightbox, scrolls to the button we need, and clicks it. This allows us to extract a snippet of HTML containing the list of relevant download links and bring it into our code as a string.

#The FDA website does not populate the download links unless you scroll to that
#area of the page first. The Selenium webdriver opens the site in Chrome,
#navigates to the needed area, and hits the correct buttons at the correct time.

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium import webdriver
import time

URL = "https://open.fda.gov/data/downloads/"

#Initialize the browser and open the FDA downloads page
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(URL)
driver.maximize_window()

time.sleep(1)

#Get past the lightbox overlay
button1 = driver.find_element(By.CSS_SELECTOR, ".button.bg-primary.clr-white")
button1.click()

time.sleep(1)

#Scroll to the button for medical device events
element_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="Medical Device Event"]')))

driver.execute_script("arguments[0].scrollIntoView(true)", element_link)

time.sleep(1)

#Click the medical device event button to reveal the download links
button2 = driver.find_element(By.XPATH, '//*[@id="Medical Device Event"]/section/button')
button2.click()

time.sleep(1)

#Retrieve the HTML code now that it contains the links we need
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

driver.close()
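
One note on the fixed time.sleep pauses: they are simple, but brittle, since a slow connection can outlast them. An explicit wait that blocks until the element is actually clickable is a more robust alternative. A minimal sketch using the same lightbox selector as above (the 10-second timeout is an arbitrary choice):

#Wait until the lightbox button is clickable instead of sleeping a fixed time
button1 = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".button.bg-primary.clr-white")))
button1.click()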

Working with JSON Files and ZIPs

The HTML snippet is turned into a list of URLs, each a download link to a ZIP file containing one of the JSONs we need. There is more data here than is immediately useful (there are thousands of device product codes and we only need a handful at a time), and the data goes back over 20 years. Instead of attempting to gather all of it into one very large dataset, the user can specify the date range of data that is needed and a list of device codes to examine.

The relevant download URLs are located by keeping only the URLs that include years in the user-specified range. Each of those URLs is followed, the ZIP file downloaded, and the JSON file extracted. Each extracted JSON is normalized into a dataframe, filtered down to the relevant columns and device codes, and appended to the master dataframe before the JSON and the ZIP are deleted. The end result can be saved as a CSV as a backup or for other use.
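
For reference, each downloaded JSON shares the structure sketched below. This layout is inferred from the fields the code accesses; the values shown are illustrative, not real report data:

#Simplified layout of one downloaded file (illustrative values, not real data)
example = {
    "results": [
        {
            "report_number": "1234567-2020-00001",
            "report_source_code": "Manufacturer report",
            "date_received": "20200206",
            "event_type": "Malfunction",
            "type_of_report": ["Initial submission"],
            "mdr_text": [{"mdr_text_key": "000000000", "text": "..."}],
            #One report can reference multiple devices, hence the nested list
            "device": [{"device_report_product_code": "KDK", "brand_name": "..."}],
        }
    ]
}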

import re
import os
import json
import requests
import zipfile
import pandas as pd
import lxml.html
from io import BytesIO

#Snip the HTML down to just the section listing the device event files
#(re.DOTALL lets the match span line breaks in the HTML)
pattern = '1991(.*?)<li id="Medical Device PMA">'
substring = re.search(pattern, html, re.DOTALL).group(1)

#Itemize the download links into a list
url_list = lxml.html.fromstring(substring)
url_list = url_list.xpath('//a/@href')

#Specify the date range of data to gather
start_year = 2020
end_year = 2021

year_list = list(range(start_year, end_year + 1))

#Collect the index of every URL that falls within the date range
index_to_download = []
for year in year_list:
    for url_index in range(len(url_list)):
        if str(year) in url_list[url_index]:
            index_to_download.append(url_index)

index_count = len(index_to_download)

#Specify the download path
path = 'C:/Users/Allen/Documents/FDA'

#Filter the data by device product code; use ["All"] to include all data
pcode = ["KDK","KDL","KDN","LLJ","MFU","MFV","MSB","PGR","PHO"]
#pcode = ["All"]

#Follow each link, download the ZIP, extract the JSON, and fold it into the master dataframe
loopnumber = 0
for download_index in index_to_download:
    print('Download ' + str(loopnumber + 1) + ' of ' + str(index_count) + ' started')
    url = url_list[download_index]
    req = requests.get(url)
    print('Download ' + str(loopnumber + 1) + ' completed')

    #Every archive holds a JSON with the same name, so rename it while extracting
    zf = zipfile.ZipFile(BytesIO(req.content))
    filename = "FDAdata.json"
    for f in zf.filelist:
        f.filename = filename
        zf.extract(f, path)
    print('File ' + str(loopnumber + 1) + ' extracted')

    data = json.load(open(path + "/" + filename))
    data = data["results"]

    #Flatten the nested JSON: one row per device record, with report-level fields carried along
    dfnew = pd.json_normalize(data,
              record_path = "device",
              meta = ["report_number","report_source_code","date_received","event_type","type_of_report","mdr_text"],
              record_prefix = "_",
              errors = "ignore")
    if pcode[0] != "All":
        dfnew = dfnew[dfnew._device_report_product_code.isin(pcode)]

    if loopnumber == 0:
        print('Creating Dataframe with JSON ' + str(loopnumber + 1))
        dfmain = dfnew
        print('Dataframe Created')
    else:
        print('Appending Dataframe with JSON ' + str(loopnumber + 1))
        dfmain = pd.concat([dfmain, dfnew])
        print('JSON ' + str(loopnumber + 1) + ' appended')

    #Clean up the extracted JSON before moving to the next download
    os.remove(path + "/" + filename)
    loopnumber = loopnumber + 1
print("Dataframe ready")
...
Appending Dataframe with JSON 36
JSON 36 appended
Download 37 of 38 started 
Download 37 completed 
File 37 extracted
Appending Dataframe with JSON 37
JSON 37 appended
Download 38 of 38 started 
Download 38 completed 
File 38 extracted
Appending Dataframe with JSON 38
JSON 38 appended
Dataframe ready
The first five rows of the resulting dataframe:

_device_event_key _implant_flag _date_removed_flag _device_sequence_number _date_received _brand_name _generic_name _manufacturer_d_name _manufacturer_d_address_1 _manufacturer_d_address_2 ... _openfda.device_class _expiration_date_of_device _openfda.registration_number _openfda.fei_number report_number report_source_code date_received event_type type_of_report mdr_text
98596 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 NaN NaN 3004068499-2020-00003 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '221303083', 'text_type_code...
17151 1 20200227 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 [3004068499] [3004068499] 3004068499-2020-00004 Manufacturer report 20200227 Malfunction [Initial submission] [{'mdr_text_key': '224587763', 'text_type_code...
42763 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210424 NaN NaN 3004068499-2020-00001 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '196742172', 'text_type_code...
70387 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 [3004068499] [3004068499] 3004068499-2020-00002 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '221303162', 'text_type_code...
20329 1 20200320 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20211029 [3004068499] [3004068499] 3004068499-2020-00005 Manufacturer report 20200320 Malfunction [Initial submission] [{'mdr_text_key': '222081101', 'text_type_code...

5 rows × 40 columns

The above table contains more information than necessary and requires further transformation. The most important text is also nested inside the mdr_text field, which creates additional challenges of its own. These will be explored in part 2 of this project.
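
Finally, as noted earlier, the assembled dataframe can be written out as a CSV for backup or later use. A minimal sketch (the filename here is illustrative):

#Save the master dataframe to the download folder (filename is illustrative)
dfmain.to_csv(path + "/fda_mdr_data.csv", index=False)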