FDA Project - Part 2

Challenge

We have arrived at a workable dataset, but the most important part for text analysis is inside the MDR_Text field. Each MDR_Text item is a list of narratives pertaining to the specific malfunction report. We could unnest these lists and keep each narrative separate, but that is not necessary for the scope of this project; instead, what is needed is to remove punctuation and unnecessary or redundant words, so that we have a short string of unique words in each MDR_Text field. This allows us to count keywords and compile useful visualizations.

An example of one such visualization is shown below. Created in Tableau from the output of this code, it shows the progression of keywords in heart valve malfunction reports over the past two years.

Code

First we need to cull out unneeded columns and perform some minor data cleaning.

#Imports used in this post
import re
import string
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

#Keep only relevant columns
dfmain = dfmain[["_device_report_product_code","_brand_name","_generic_name","_manufacturer_d_name","type_of_report","report_number","report_source_code",
                 "date_received","event_type","mdr_text"]]

#Rename columns
dfmain.columns = ["product_code","brand_name","generic_name","manufacturer_name","type_of_report","report_number",
                        "report_source_code","date_received","event_type","mdr_text"]

#Update date column to date format
dfmain["date_received"] = pd.to_datetime(dfmain["date_received"])

#Remove brackets from type of report column
dfmain['type_of_report'] = dfmain['type_of_report'].str.join(', ')

#Update MDR Text to only show the text narrative items--also lowercase the text
newmdr = []
for text_index in dfmain["mdr_text"]:
    #pull each "'text': ...}" fragment out of the stringified list of narratives,
    #strip punctuation, and mark the narrative boundaries with " - "
    narratives = ''.join(re.findall("'text': .+?}", str(text_index)))
    narratives = narratives.translate(str.maketrans('', '', string.punctuation))
    newmdr.append(narratives.replace("text", " - ")[4:])

dfmain["mdr_text"] = [x.lower() for x in newmdr]
dfmain.head()
product_code brand_name generic_name manufacturer_name type_of_report report_number report_source_code date_received event_type mdr_text
98596 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00003 Manufacturer report 2020-02-06 Malfunction investigation for this complaint is currently ...
17151 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00004 Manufacturer report 2020-02-27 Malfunction patient information does not apply the device ...
42763 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00001 Manufacturer report 2020-02-06 Malfunction investigation for this complaint is currently ...
70387 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00002 Manufacturer report 2020-02-06 Malfunction investigation for this complaint is currently ...
20329 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00005 Manufacturer report 2020-03-20 Malfunction patient information does not apply the device ...
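The loop above leans on string conversion and a regex; a minimal, more transparent sketch of the same idea, assuming each mdr_text entry is a list of dicts with a 'text' key (as in the openFDA schema; the sample values here are made up):

```python
import string

def extract_narratives(mdr_text):
    # join the 'text' value of each narrative dict, strip punctuation
    # (keeping the hyphen used as a separator), and lowercase the result
    if not isinstance(mdr_text, list):
        return ""
    joined = " - ".join(d.get("text", "") for d in mdr_text)
    keep_hyphen = string.punctuation.replace("-", "")
    return joined.translate(str.maketrans("", "", keep_hyphen)).lower()

sample = [{"text": "DEVICE ALARMED."}, {"text": "NO PATIENT HARM."}]
print(extract_narratives(sample))  # device alarmed - no patient harm
```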

Here come the bulk of the NLP operations. We tokenize each narrative and perform lemmatization in order to standardize suffixes. We also remove stop words and a manually curated list of words that are very common and/or not useful for the analysis. At this point, gathering these words is a manual operation; in the future, it would be beneficial to gather them for exclusion more automatically, perhaps with IPV/SV analysis.

#Remove NA items
dfmain.fillna('', inplace=True)

#Prepare for lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()

#Create function to tag words with what part of speech they are
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

#Create function to separate strings into words, assign parts of speech, and lemmatize accordingly
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
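Since the WordNet part-of-speech constants are just one-letter strings, the tag-mapping step can be exercised on its own; a standalone sketch in which the literal strings 'a'/'v'/'n'/'r' stand in for wordnet.ADJ/VERB/NOUN/ADV:

```python
# Standalone sketch of the Penn-Treebank-to-WordNet tag mapping;
# literal strings stand in for the wordnet constants.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'   # adjective
    elif tag.startswith('V'):
        return 'v'   # verb
    elif tag.startswith('N'):
        return 'n'   # noun
    elif tag.startswith('R'):
        return 'r'   # adverb
    return None

tags = ['NN', 'VBD', 'JJ', 'RB', 'DT']
print([penn_to_wordnet(t) for t in tags])  # ['n', 'v', 'a', 'r', None]
```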

#Perform Lemmatization - this step can take some time
dfmain['mdr_text'] = dfmain['mdr_text'].apply(lambda x: lemmatize_sentence(x))

#Remove stop words
stop_words = set(stopwords.words('english'))

#Add additional common words into stopwords
stop_words.update(["investigation","still","progress","complete","supplemental","report","filed","device","returned","reported",
                  "-","due","failure","failed","may","ensure","assures","around","met","reports","number","per","dated","patient",
                  "conclusion","performed","year","years","therefore","submitted","information","received","upon","event",
                  "events","review","required","appropriate","monitored","monitor","basis","continue","monthly","trends","completion",
                  "additional","months","yet","without","history","regarding","cause","record","established","accordingly","procedure",
                  "provided","underwent","reason","evaluation","time","related","product","severed","post","return","made","also",
                  "andor","multiple","remains","determined","replacement","definitive","issues","however","clinical","factors",
                  "effects","release","identified","cannot","reviewed","would","including","available","issued","observation",
                  "common","action","regard","make","future","severe","issue","require","receive","month","submit","occur",
                  "factor","include","provide","remain","contribute","establish","relate","effect","replace","analysis","via",
                  "perform","subject","likely"])

dfmain = dfmain[dfmain['mdr_text'].notnull()]

dfmain['mdr_text'] = dfmain['mdr_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

dfmain.head()
product_code brand_name generic_name manufacturer_name type_of_report report_number report_source_code date_received event_type mdr_text
0 LWR PERCEVAL SUTURELESS AORTIC HEART VALVE TISSUE HEART VALVE LIVANOVA CANADA CORP. Initial submission, Followup, Followup 1718850-2020-01047 Distributor report 2020-03-20 Injury manufacturer notify serious adverse involve pe...
1 DYE CARPENTIER-EDWARDS PERIMOUNT MAGNA MITRAL EASE... REPLACEMENT HEART VALVE EDWARDS LIFESCIENCES Initial submission, Followup, Followup 2015691-2020-10062 Manufacturer report 2020-01-07 Injury although bioprosthetic valve prove excellent l...
2 DYE INSPIRIS RESILIA AORTIC VALVE REPLACEMENT HEART VALVE EDWARDS LIFESCIENCES Initial submission, Followup, Followup, Follow... 2015691-2020-10677 Manufacturer report 2020-02-25 Injury udi b4 stenosis develop progressively addition...
3 DYE CARPENTIER-EDWARDS PERIMOUNT MAGNA EASE PERICA... REPLACEMENT HEART VALVE EDWARDS LIFESCIENCES Initial submission, Followup 2015691-2020-10709 Manufacturer report 2020-02-27 Injury stenosis regurgitation develop progressively a...
4 MWH CONTEGRA CONDUIT,VALVED,PULMONIC MEDTRONIC HEART VALVE DIVISION Initial submission, Followup 2025587-2020-00676 Manufacturer report 2020-03-05 Injury medtronic 8 9 implant 18mm pulmonary valved co...
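The stop-word filter itself is the one-liner near the end of the block above; a toy illustration with a hand-rolled stop set (the real code uses NLTK's English list plus the curated additions):

```python
# Toy stop set -- stands in for NLTK's stopwords plus the curated list
stop_words = {"the", "is", "a", "for", "this"}

text = "the device is returned for this complaint"
kept = " ".join(w for w in text.split() if w not in stop_words)
print(kept)  # device returned complaint
```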

In order to create a bar chart race for keywords, we need to remove all duplicate words from each MDR_Text field.

#Adapted from a function found online
def remove_duplicates(text):

    # split the input string on spaces
    words = text.split(" ")

    # Counter acts as an insertion-ordered dict with each unique word
    # as a key and its frequency as the value
    unique_words = Counter(words)

    # join the unique words (in first-seen order) back into one string
    return " ".join(unique_words.keys())

dfmain['mdr_text_nodup'] = dfmain['mdr_text'].apply(remove_duplicates)

dfmain['mdr_text_nodup'] = dfmain['mdr_text_nodup'].map(str)

dfmain.head()
product_code brand_name generic_name manufacturer_name type_of_report report_number report_source_code date_received event_type mdr_text mdr_text_nodup
98596 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00003 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol...
17151 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00004 Manufacturer report 2020-02-27 Malfunction patient information apply patient contact capa... patient information apply contact capab4 initi...
42763 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00001 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol...
70387 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00002 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol...
20329 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00005 Manufacturer report 2020-03-20 Malfunction patient information apply patient contact outc... patient information apply contact outcomes att...
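As a sanity check, this deduplication keeps the first occurrence of each word in its original order, since Counter (like any dict in Python 3.7+) preserves insertion order:

```python
from collections import Counter

def remove_dups(text):
    # Counter's keys are the unique words, in first-seen order
    return " ".join(Counter(text.split(" ")).keys())

print(remove_dups("pump alarm pump failure alarm"))  # pump alarm failure
```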

At this point we can run a basic word cloud and see, from the total dataset, what words show up most.

#Implement Word Cloud--join all of the deduplicated narratives into one string
#(passing str(Series) would only use pandas' truncated repr of the column)
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(' '.join(dfmain["mdr_text_nodup"]))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('foo.png')
plt.show()

(Word cloud of the most common words across all reports.)

#Create new date field with the days removed, for visualizations
#(date_received is already a datetime from the earlier cleaning step)

dfmain["year"] = dfmain["date_received"].dt.year
dfmain["month"] = dfmain["date_received"].dt.month

dfmain["plaindate"] = pd.to_datetime(dfmain[['year', 'month']].assign(DAY=1))

dfmain = dfmain.drop(["year", "month"], axis=1)

dfmain.head()
product_code brand_name generic_name manufacturer_name type_of_report report_number report_source_code date_received event_type mdr_text mdr_text_nodup plaindate
98596 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00003 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol... 2020-02-01
17151 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00004 Manufacturer report 2020-02-27 Malfunction patient information apply patient contact capa... patient information apply contact capab4 initi... 2020-02-01
42763 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00001 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol... 2020-02-01
70387 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission, Followup 3004068499-2020-00002 Manufacturer report 2020-02-06 Malfunction complaint currently underway yet completed fol... complaint currently underway yet completed fol... 2020-02-01
20329 KDN LIFEPORT KIDNEY TRANSPORTER SYSTEM PERFUSION CIRCUIT ORGAN RECOVERY SYSTEMS, INC. Initial submission 3004068499-2020-00005 Manufacturer report 2020-03-20 Malfunction patient information apply patient contact outc... patient information apply contact outcomes att... 2020-03-01
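For reference, pandas can also truncate dates to the first of the month in a single step via a period round-trip; a small equivalent sketch (not the code above):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2020-02-06", "2020-03-20"]))
# convert to monthly periods, then back to timestamps at the month start
plaindate = dates.dt.to_period("M").dt.to_timestamp()
print(plaindate.dt.strftime("%Y-%m-%d").tolist())  # ['2020-02-01', '2020-03-01']
```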
#Create new dataframe--list of unique words in data, separated by month

dfmain['mdr_text_nodup'] = dfmain['mdr_text_nodup'].map(str)

dfword = pd.DataFrame()

#Create unique list of dates to be used in the for loop
datelist = dfmain["plaindate"].unique()

#Loop to create dataframe of unique words and the number of times they each appear
wordnumbers = []
for date_index in datelist:
    #join all deduplicated narratives received in this month, then count each word
    month_text = ' '.join(dfmain.loc[dfmain["plaindate"] == date_index, "mdr_text_nodup"])
    wordnumbers = Counter(month_text.split(" "))
    dftemp = pd.DataFrame.from_dict(wordnumbers, orient='index').reset_index()
    dftemp["Date"] = date_index
    dfword = pd.concat([dfword, dftemp])
    
dfword.columns = ["word","counts","date"]

#Add column for the % of that month's reports that contain each word
#(since words are deduplicated per report, counts equals the number of reports)
def percentcalc(countcol, datecol):
    return countcol / len(dfmain[dfmain["plaindate"] == datecol])

dfword["percent"] = dfword.apply(lambda x: percentcalc(x['counts'],x['date']), axis = 1)
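The per-month counting above boils down to one Counter over the month's concatenated text; a toy version on made-up narratives:

```python
from collections import Counter

# three deduplicated narratives from one hypothetical month
month_texts = ["valve leak alarm", "valve stenosis", "leak valve"]
counts = Counter(" ".join(month_texts).split(" "))
print(counts["valve"], counts["leak"])  # 3 2
```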

Now a new column is added to show the change from month to month for the counts of each word. This was a challenge to do in a way that was not too time-intensive. Instead of trying to search for each previous month’s value in the dataframe and then performing the subtraction calculation, we sort the dataset by word and then by date, and go through each item, checking to see if the previous line matches what we expect the word and previous date values to be. If these both match, the two counts values are subtracted, but if they do not match then the current value must be the “change” value we are looking for. This way we only go through the dataframe a single time, instead of once for each row in the dataset.

Several different searching methods were tried, but this was by far the fastest method.
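For comparison, the same previous-month logic can be expressed vectorised with pandas groupby/shift; this is a toy sketch on made-up data, not the code used below:

```python
import pandas as pd

df = pd.DataFrame({
    "word":   ["valve", "valve", "valve", "pump"],
    "date":   pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01", "2020-02-01"]),
    "counts": [3, 5, 2, 4],
})
df = df.sort_values(["word", "date"]).reset_index(drop=True)

# previous row's values within each word group
prev_counts = df.groupby("word")["counts"].shift()
prev_date = df.groupby("word")["date"].shift()

# the previous row only qualifies if it is exactly one month earlier
is_prev_month = prev_date == (df["date"] - pd.offsets.MonthBegin(1))

df["change"] = df["counts"].where(~is_prev_month, df["counts"] - prev_counts)
print(df["change"].tolist())  # [4.0, 3.0, 2.0, 2.0]
```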

#Begin by sorting the dataset; also store the date as a "YYYY-MM-DD" string,
#which the string slicing in getlastmonth below expects
dfword["date"] = pd.to_datetime(dfword["date"]).dt.strftime('%Y-%m-%d')
dfword = dfword.sort_values(["word","date"]).reset_index(drop=True)

#Create a function to get the previous month from a given "YYYY-MM-DD" string
def getlastmonth(dateinquestion):
    #step the month number back by one, wrapping January around to December
    newmonth = int(dateinquestion[5:7]) - 1
    if newmonth == 0:
        newmonth = 12
    newdate = dateinquestion[0:4] + "-" + str(newmonth).zfill(2) + dateinquestion[7:10]
    #wrapping into December means the previous month is in the previous year
    if newdate[5:7] == "12":
        newdate = str(int(newdate[0:4]) - 1) + "-" + newdate[5:7] + dateinquestion[7:10]
    return newdate

#Create a function to check if the previous row in the dataset is the actual previous month for the word in question,
#and then perform the subtraction if necessary to arrive at the change value
def previous_values(indexitem):
    if indexitem > 0:
        if (dfword["date"][indexitem-1] == getlastmonth(dfword["date"][indexitem])) & (dfword["word"][indexitem-1] == dfword["word"][indexitem]):
            return (dfword["counts"][indexitem] - dfword["counts"][indexitem-1])
        else:
            return dfword["counts"][indexitem]
    else:
        return dfword["counts"][indexitem]

#Create the new column with the change value
dfword["change"] = dfword.apply(lambda x: previous_values(x.name), axis = 1)
word counts date percent change
0 0 1 2011-03-01 0.006536 1
1 0 1 2012-03-01 0.005848 1
2 0 2 2012-05-01 0.012658 2
3 0 1 2013-05-01 0.009709 1
4 0 1 2013-06-01 0.007812 0
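The month arithmetic is worth double-checking at the year boundary; pandas' DateOffset computes the same previous month and handles the rollover automatically (a sketch, not the post's function):

```python
import pandas as pd

def last_month(datestr):
    # step back one calendar month, keeping the day-of-month (always 01 here)
    return (pd.Timestamp(datestr) - pd.DateOffset(months=1)).strftime("%Y-%m-%d")

print(last_month("2020-03-01"))  # 2020-02-01
print(last_month("2020-01-01"))  # 2019-12-01
```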
dfword.to_csv(r'word.csv', index = False)

The end result is a CSV file that can be imported into Tableau and turned into the bar chart race shown at the beginning of this post.