FDA Project - Part 2

Challenge

We have arrived at a workable dataset, but the most important part for text-analysis is inside the MDR_Text field. Each MDR_Text item is a list of narratives pertaining to the specific malfunction report. We can try to unnest these lists of narratives and keep each narrative separate, but that is not necessary for the scope of this project–instead, what is needed is to remove punctuation and unnecessary/redundant words, so that we have a short string of unique words in each MDR Text field. This will allow for counting of keywords and the compilation of useful visualizations.

An example of one such visualization is shown below. Created in Tableau from output from this code, it shows the progression of keywords of Heartvalve malfunction reports over the past two years.

Code

First we need to cull out unneeded columns and perform some minor data cleaning.

#Keep only relevant columns
dfmain = dfmain[["_device_report_product_code","_brand_name","_generic_name","_manufacturer_d_name","type_of_report","report_number","report_source_code",
                 "date_received","event_type","mdr_text"]]

#Rename columns
dfmain.columns = ["product_code","brand_name","generic_name","manufacturer_name","type_of_report","report_number",
                        "report_source_code","date_received","event_type","mdr_text"]

#Update date column to date format
dfmain["date_received"] = pd.to_datetime(dfmain["date_received"])

#Remove brackets from type of report column
dfmain['type_of_report'] = dfmain['type_of_report'].str.join(', ')

#Update MDR Text to only show the text narrative items--also lowercase the text
newmdr = []
for text_index in dfmain["mdr_text"]:
    newmdr.append(''.join(re.findall("'text': .+?}",str(text_index))).translate(str.maketrans('', '', string.punctuation)).replace("text"," - ")[4:])

dfmain["mdr_text"] = [x.lower() for x in newmdr]
dfmain.head()

	product_code	brand_name	generic_name	manufacturer_name	type_of_report	report_number	report_source_code	date_received	event_type	mdr_text
98596	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00003	Manufacturer report	2020-02-06	Malfunction	investigation for this complaint is currently ...
17151	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00004	Manufacturer report	2020-02-27	Malfunction	patient information does not apply the device ...
42763	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00001	Manufacturer report	2020-02-06	Malfunction	investigation for this complaint is currently ...
70387	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00002	Manufacturer report	2020-02-06	Malfunction	investigation for this complaint is currently ...
20329	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00005	Manufacturer report	2020-03-20	Malfunction	patient information does not apply the device ...

Here comes much of the NLP operations. We unnest each sentence and perform lemmatization in order to standardize suffixes. We also remove stop-words and a manually curated list of words that are very common and/or not useful for the analysis. At this point it is a manual operation to gather these words–in the future, it would be beneficial to gather them for exclusion more automatically, perhaps with IPV/SV analysis.

#Remove NA items
dfmain.fillna('', inplace=True)

#Prepare for lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()

#Create function to tag words with what part of speech they are
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

#Create function to separate strings into words, assign parts of speech, and lemmatize accordingly
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

#Perform Lemmatization - this step can take some time
dfmain['mdr_text'] = dfmain['mdr_text'].apply(lambda x: lemmatize_sentence(x))

#Remove stop words
stop_words = set(stopwords.words('english'))

#Add additional common words into stopwords
stop_words.update(["investigation","still","progress","complete","supplemental","report","filed","device","returned","reported",
                  "-","due","failure","failed","may","ensure","assures","around","met","reports","number","per","dated","patient",
                  "conclusion","performed","year","years","therefore","submitted","information","received","upon","event",
                  "events","review","required","appropriate","monitored","monitor","basis","continue","monthly","trends","completion",
                  "additional","months","yet","without","history","regarding","cause","record","established","accordingly","procedure",
                  "provided","underwent","reason","evaluation","time","related","product","severed","post","return","made","also",
                  "andor","multiple","remains","determined","replacement","definitive","issues","however","clinical","factors",
                  "effects","release","identified","cannot","reviewed","would","including","available","issued","observation",
                  "common","action","regard","make","future","severe","issue","require","receive","month","submit","occur",
                  "factor","include","provide","remain","contribute","establish","relate","effect","replace","analysis","via",
                  "perform","subject","likely"])

dfmain = dfmain[dfmain['mdr_text'].notnull()]

dfmain['mdr_text'] = dfmain['mdr_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

dfmain.head()

	product_code	brand_name	generic_name	manufacturer_name	type_of_report	report_number	report_source_code	date_received	event_type	mdr_text
0	LWR	PERCEVAL SUTURELESS AORTIC HEART VALVE	TISSUE HEART VALVE	LIVANOVA CANADA CORP.	Initial submission, Followup, Followup	1718850-2020-01047	Distributor report	2020-03-20	Injury	manufacturer notify serious adverse involve pe...
1	DYE	CARPENTIER-EDWARDS PERIMOUNT MAGNA MITRAL EASE...	REPLACEMENT HEART VALVE	EDWARDS LIFESCIENCES	Initial submission, Followup, Followup	2015691-2020-10062	Manufacturer report	2020-01-07	Injury	although bioprosthetic valve prove excellent l...
2	DYE	INSPIRIS RESILIA AORTIC VALVE	REPLACEMENT HEART VALVE	EDWARDS LIFESCIENCES	Initial submission, Followup, Followup, Follow...	2015691-2020-10677	Manufacturer report	2020-02-25	Injury	udi b4 stenosis develop progressively addition...
3	DYE	CARPENTIER-EDWARDS PERIMOUNT MAGNA EASE PERICA...	REPLACEMENT HEART VALVE	EDWARDS LIFESCIENCES	Initial submission, Followup	2015691-2020-10709	Manufacturer report	2020-02-27	Injury	stenosis regurgitation develop progressively a...
4	MWH	CONTEGRA	CONDUIT,VALVED,PULMONIC	MEDTRONIC HEART VALVE DIVISION	Initial submission, Followup	2025587-2020-00676	Manufacturer report	2020-03-05	Injury	medtronic 8 9 implant 18mm pulmonary valved co...

In order to create a bar chart race for keywords, we need to remove all duplicate words from each MDR_Text field.

no_dup = []

#Following function found from internet
def remove_duplicates(input):
 
    # split input string separated by space
    input = input.split(" ")
 
    # now create dictionary using counter method
    # which will have strings as key and their
    # frequencies as value
    UniqW = Counter(input)
 
    # joins two adjacent elements in iterable way
    s = " ".join(UniqW.keys())
    return (s)

dfmain['mdr_text_nodup'] = dfmain['mdr_text'].apply(remove_duplicates)

dfmain['mdr_text_nodup'] = dfmain['mdr_text_nodup'].map(str)

dfmain.head()

	product_code	brand_name	generic_name	manufacturer_name	type_of_report	report_number	report_source_code	date_received	event_type	mdr_text	mdr_text_nodup
98596	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00003	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...
17151	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00004	Manufacturer report	2020-02-27	Malfunction	patient information apply patient contact capa...	patient information apply contact capab4 initi...
42763	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00001	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...
70387	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00002	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...
20329	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00005	Manufacturer report	2020-03-20	Malfunction	patient information apply patient contact outc...	patient information apply contact outcomes att...

At this point we can run a basic word cloud and see, from the total dataset, what words show up most.

#Implement Word Cloud
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(str(dfmain["mdr_text_nodup"]))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('foo.png')
plt.show()

png

#Create new date field with the days removed, for visualizations

dfmain["year"] = pd.to_datetime(dfmain["date_received"], format = '%Y-%m-%d').dt.year
dfmain["month"] = pd.to_datetime(dfmain["date_received"], format = '%Y-%m-%d').dt.month

dfmain["plaindate"] = pd.to_datetime(dfmain[['year', 'month']].assign(DAY=1))

dfmain = dfmain.drop(["year","month"], 1)

dfmain.head()

	product_code	brand_name	generic_name	manufacturer_name	type_of_report	report_number	report_source_code	date_received	event_type	mdr_text	mdr_text_nodup	plaindate
98596	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00003	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...	2020-02-01
17151	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00004	Manufacturer report	2020-02-27	Malfunction	patient information apply patient contact capa...	patient information apply contact capab4 initi...	2020-02-01
42763	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00001	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...	2020-02-01
70387	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission, Followup	3004068499-2020-00002	Manufacturer report	2020-02-06	Malfunction	complaint currently underway yet completed fol...	complaint currently underway yet completed fol...	2020-02-01
20329	KDN	LIFEPORT KIDNEY TRANSPORTER SYSTEM	PERFUSION CIRCUIT	ORGAN RECOVERY SYSTEMS, INC.	Initial submission	3004068499-2020-00005	Manufacturer report	2020-03-20	Malfunction	patient information apply patient contact outc...	patient information apply contact outcomes att...	2020-03-01

#Create new dataframe--list of unique words in data, separated by month

dfmain['mdr_text_nodup'] = dfmain['mdr_text_nodup'].map(str)

dfword = pd.DataFrame()

#Create unique list of dates to be used in the for loop
datelist = dfmain["plaindate"].unique()

#Loop to create dataframe of unique words and the number of times they each appear
wordnumbers = []
for date_index in datelist:
    wordnumbers = Counter(' '.join(map(lambda l: ''.join(l), dfmain.loc[dfmain["plaindate"] == date_index]["mdr_text_nodup"])).split(" ")) 
    dftemp = pd.DataFrame.from_dict(wordnumbers, orient='index').reset_index()
    dftemp["Date"] = date_index
    dfword = pd.concat([dfword, dftemp])
    
dfword.columns = ["word","counts","date"]

#Add column for % representation of each word during their time period
def percentcalc(countcol, datecol):
    perc = 0
    #perc = countcol / 100
    perc = countcol / len(dfmain[dfmain["plaindate"] == datecol])
    return perc

dfword["percent"] = dfword.apply(lambda x: percentcalc(x['counts'],x['date']), axis = 1)

Now a new column is added to show the change from month to month for the counts of each word. This was a challenge to do in a way that was not too time-intensive. Instead of trying to search for each previous month’s value in the dataframe and then performing the subtraction calculation, we sort the dataset by word and then by date, and go through each item, checking to see if the previous line matches what we expect the word and previous date values to be. If these both match, the two counts values are subtracted, but if they do not match then the current value must be the “change” value we are looking for. This way we only go through the dataframe a single time, instead of once for each row in the dataset.

Several different searching methods were tried, but this was by far the fastest method.

#Begin by sorting the dataset and removing a leftover junk column
dfword = dfword.sort_values(["word","date"]).reset_index()
dfword.pop("index")

#Create a function to get the previous month off of a given string
def getlastmonth(dateinquestion):
    newdate = dateinquestion[5:7]
    newdate = int(newdate) - 1
    if newdate == 0:
        newdate = 12
    newdate = str(newdate)
    if len(newdate) == 1:
        newdate = "0" + newdate
    newdate = dateinquestion[0:4] + "-" + newdate + dateinquestion[7:10]
    if newdate[5:7] == "12":
        newmonth = str(int(newdate[0:4]) - 1)
        newdate = newmonth + "-" +  newdate[5:7] + dateinquestion[7:10]
    return newdate

#Create a function to check if the previous row in the dataset is the actual previous month for the word in question,
#and then perform the subtraction if necessary to arrive at the change value
def previous_values(indexitem):
    if indexitem > 0:
        if (dfword["date"][indexitem-1] == getlastmonth(dfword["date"][indexitem])) & (dfword["word"][indexitem-1] == dfword["word"][indexitem]):
            return (dfword["counts"][indexitem] - dfword["counts"][indexitem-1])
        else:
            return dfword["counts"][indexitem]
    else:
        return dfword["counts"][indexitem]

#Create the new column with the change value
dfword["change"] = dfword.apply(lambda x: previous_values(x.name), axis = 1)

	counts	date	percent	change
0	1	2011-03-01	0.006536	1
1	1	2012-03-01	0.005848	1
2	2	2012-05-01	0.012658	2
3	1	2013-05-01	0.009709	1
4	1	2013-06-01	0.007812	0

dfword.to_csv(r'word.csv', index = False)

The end result is a csv file that can be imported into Tableau and turned into the bar chart race shown at the beginning of this post.

FDA Medical Device Report Project Part 2 - Text Analysis

FDA Project - Part 2

Challenge

Code