FDA Medical Device Report Project Part 1

Background

Medical device reports (MDRs) are sent to the FDA from across the country. These reports are gathered and reviewed by experts at the FDA, and are also made available to the public. At the moment, there is only one easy way to retrieve MDRs from the FDA: the MAUDE database (https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm). However, this database suffers from a poor interface, few search features, and a limit of 500 reports per retrieval. The MAUDE database is therefore only suitable for analyzing small numbers of reports, and a new way of extracting and using device report data is needed.

The goal of this project is three-fold:

  1. Web Scraping to gather relevant data from the FDA website
  2. Text analysis to determine key words from report descriptions
  3. Visualizations to find interesting trends or areas needing further analysis

This post will focus on part 1 of this project.

The FDA has a second repository of medical device reports, kept as raw data in JSON format on another site (https://open.fda.gov/data/downloads/). Each JSON file sits inside its own ZIP archive and has the same name as every other file, making it a nightmare to manually download, unzip, rename, and merge the JSONs needed to build the base dataset. Furthermore, any analysis code would have to be fed a cleanly formatted file in order to be usable, putting undue burden on end users, who would have to perform all of the above steps without error. For these reasons, web scraping and automated assembly of these files according to the user’s parameters became the first major priority.

Selenium Web-Scraping

The FDA site in question was most certainly not made with web scraping in mind (or with much thought for the end-user experience, for that matter). Obstacles include:

  1. All buttons share the same name
  2. A lightbox overlay with a clickable button blocks interaction with the rest of the page
  3. The download links we need are populated by JavaScript only after clicking a “show all” button
  4. The button we need to click is absent from the page source until the user scrolls to its place on the page

The above obstacles prevent simple web scraping tools like Beautiful Soup from accessing the URLs needed to download the reports. Fortunately, we can use Selenium to perform the actions needed to populate the site's HTML without manual input. The script pulls up the FDA website, gets past the lightbox, scrolls to the button we need, and clicks it. This allows us to extract a snippet of HTML containing the list of relevant download links and bring it into our code as a string.

#The FDA website does not populate the download links unless you scroll to that
#area of the page first. The Selenium webdriver opens the site in Chrome,
#navigates to the needed area, and hits the correct buttons at the correct time.

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium import webdriver
import time

URL = "https://open.fda.gov/data/downloads/"

#Initialize the browser and open the FDA downloads page
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(URL)
driver.maximize_window()

time.sleep(1)

#Get past the lightbox overlay
button1 = driver.find_element(By.CSS_SELECTOR, ".button.bg-primary.clr-white")
button1.click()

time.sleep(1)

#Scroll to the button for medical device events
element_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="Medical Device Event"]')))

driver.execute_script("arguments[0].scrollIntoView(true)", element_link)

time.sleep(1)

#Click the medical device event button to reveal the download links
button2 = driver.find_element(By.XPATH, '//*[@id="Medical Device Event"]/section/button')
button2.click()

time.sleep(1)

#Retrieve the HTML code now that it contains the links we need
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

driver.close()
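
One note on the fixed time.sleep pauses: they are simple, but brittle, since a slow connection can outlast them. An explicit wait that blocks until the element is actually clickable is a more robust alternative. A minimal sketch using the same lightbox selector as above (the 10-second timeout is an arbitrary choice):

#Wait until the lightbox button is clickable instead of sleeping a fixed time
button1 = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".button.bg-primary.clr-white")))
button1.click()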

Working with JSON Files and ZIPs

The HTML snippet is turned into a list of URLs, each a download link to a ZIP file containing one of the JSONs we need. There is more data here than is immediately useful (there are thousands of device product codes and we only need a handful at a time), and the data goes back over 20 years. Instead of attempting to gather all of it into one very large dataset, the user can specify the date range of data that is needed and a list of device codes to examine.

The relevant download URLs are located by keeping only the URLs that include years in the user-specified range. Each of those URLs is followed, the ZIP file downloaded, and the JSON file extracted. Each extracted JSON is normalized into a dataframe, filtered down to the relevant columns and device codes, and appended to the master dataframe before the JSON and the ZIP are deleted. The end result can be saved as a CSV as a backup or for other use.
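
For reference, each downloaded JSON shares the structure sketched below. This layout is inferred from the fields the code accesses; the values shown are illustrative, not real report data:

#Simplified layout of one downloaded file (illustrative values, not real data)
example = {
    "results": [
        {
            "report_number": "1234567-2020-00001",
            "report_source_code": "Manufacturer report",
            "date_received": "20200206",
            "event_type": "Malfunction",
            "type_of_report": ["Initial submission"],
            "mdr_text": [{"mdr_text_key": "000000000", "text": "..."}],
            #One report can reference multiple devices, hence the nested list
            "device": [{"device_report_product_code": "KDK", "brand_name": "..."}],
        }
    ]
}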

import re
import os
import json
import requests
import zipfile
import pandas as pd
import lxml.html
from io import BytesIO

#Snip the HTML down to just the section listing the device event files
#(re.DOTALL lets the match span line breaks in the HTML)
pattern = '1991(.*?)<li id="Medical Device PMA">'
substring = re.search(pattern, html, re.DOTALL).group(1)

#Itemize the download links into a list
url_list = lxml.html.fromstring(substring)
url_list = url_list.xpath('//a/@href')

#Specify the date range of data to gather
start_year = 2020
end_year = 2021

year_list = list(range(start_year, end_year + 1))

#Collect the index of every URL that falls within the date range
index_to_download = []
for year in year_list:
    for url_index in range(len(url_list)):
        if str(year) in url_list[url_index]:
            index_to_download.append(url_index)

index_count = len(index_to_download)

#Specify the download path
path = 'C:/Users/Allen/Documents/FDA'

#Filter the data by device product code; use ["All"] to include all data
pcode = ["KDK","KDL","KDN","LLJ","MFU","MFV","MSB","PGR","PHO"]
#pcode = ["All"]

#Follow each link, download the ZIP, extract the JSON, and fold it into the master dataframe
loopnumber = 0
for download_index in index_to_download:
    print('Download ' + str(loopnumber + 1) + ' of ' + str(index_count) + ' started')
    url = url_list[download_index]
    req = requests.get(url)
    print('Download ' + str(loopnumber + 1) + ' completed')

    #Every archive holds a JSON with the same name, so rename it while extracting
    zf = zipfile.ZipFile(BytesIO(req.content))
    filename = "FDAdata.json"
    for f in zf.filelist:
        f.filename = filename
        zf.extract(f, path)
    print('File ' + str(loopnumber + 1) + ' extracted')

    data = json.load(open(path + "/" + filename))
    data = data["results"]

    #Flatten the nested JSON: one row per device record, with report-level fields carried along
    dfnew = pd.json_normalize(data,
              record_path = "device",
              meta = ["report_number","report_source_code","date_received","event_type","type_of_report","mdr_text"],
              record_prefix = "_",
              errors = "ignore")
    if pcode[0] != "All":
        dfnew = dfnew[dfnew._device_report_product_code.isin(pcode)]

    if loopnumber == 0:
        print('Creating Dataframe with JSON ' + str(loopnumber + 1))
        dfmain = dfnew
        print('Dataframe Created')
    else:
        print('Appending Dataframe with JSON ' + str(loopnumber + 1))
        dfmain = pd.concat([dfmain, dfnew])
        print('JSON ' + str(loopnumber + 1) + ' appended')

    #Clean up the extracted JSON before moving to the next download
    os.remove(path + "/" + filename)
    loopnumber = loopnumber + 1
print("Dataframe ready")
...
Appending Dataframe with JSON 36
JSON 36 appended
Download 37 of 38 started 
Download 37 completed 
File 37 extracted
Appending Dataframe with JSON 37
JSON 37 appended
Download 38 of 38 started 
Download 38 completed 
File 38 extracted
Appending Dataframe with JSON 38
JSON 38 appended
Dataframe ready
The first five rows of the resulting dataframe:

_device_event_key _implant_flag _date_removed_flag _device_sequence_number _date_received _brand_name _generic_name _manufacturer_d_name _manufacturer_d_address_1 _manufacturer_d_address_2 ... _openfda.device_class _expiration_date_of_device _openfda.registration_number _openfda.fei_number report_number report_source_code date_received event_type type_of_report mdr_text
98596 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 NaN NaN 3004068499-2020-00003 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '221303083', 'text_type_code...
17151 1 20200227 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 [3004068499] [3004068499] 3004068499-2020-00004 Manufacturer report 20200227 Malfunction [Initial submission] [{'mdr_text_key': '224587763', 'text_type_code...
42763 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210424 NaN NaN 3004068499-2020-00001 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '196742172', 'text_type_code...
70387 1 20200206 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20210318 [3004068499] [3004068499] 3004068499-2020-00002 Manufacturer report 20200206 Malfunction [Initial submission, Followup] [{'mdr_text_key': '221303162', 'text_type_code...
20329 1 20200320 LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. 1 PIERCE PLACE SUITE 475W ... 2 20211029 [3004068499] [3004068499] 3004068499-2020-00005 Manufacturer report 20200320 Malfunction [Initial submission] [{'mdr_text_key': '222081101', 'text_type_code...

5 rows × 40 columns

The above table contains more information than necessary and requires further transformation. The most important text is also nested inside the mdr_text field, which creates additional challenges of its own. These will be explored in part 2 of this project.
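
Finally, as noted earlier, the assembled dataframe can be written out as a CSV for backup or later use. A minimal sketch (the filename here is illustrative):

#Save the master dataframe to the download folder (filename is illustrative)
dfmain.to_csv(path + "/fda_mdr_data.csv", index=False)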