Introduction to parse and the lookup_* functions

library(DOPE)
library(dplyr)
library(stringr)

This package aims to extract identifiable drug names from a corpus of text. We assume that the corpus has already been imported into R.

Data: drug_df

Throughout this vignette, we will employ a sample dataset - drug_df - that is intended to represent data collected from a clinical trial. The dataset contains 3 variables and 500 observations.

dim(drug_df)
   [1] 500   3

str(drug_df)
   tibble [500 x 3] (S3: tbl_df/tbl/data.frame)
    $ textdrug: chr [1:500] "Remeron" "Remeron" "Soma" "Ectacy" ...
    $ sex     : chr [1:500] "male" "female" "female" "female" ...
    $ race    : chr [1:500] "black" "ai/an" "ai/an" "hn/pi" ...

class(drug_df)
   [1] "tbl_df"     "tbl"        "data.frame"

Note: drug_df is a simulated dataset and does not reflect any true clinical observations.

`parse()`

The parse() function is intended to extract identifiable drug names from a corpus of text such as clinical trial data, social media posts, or survey and interview transcriptions. parse() takes one argument: the vector that contains the strings to be parsed.

Here is an example of some problematic records in the drug_df dataset that warrant the use of parse():


messy_data <- drug_df %>% 
  # select records that have problematic characters
  filter(str_detect(textdrug, ",|;|and|\\/|=|\\(")) %>% 
  distinct(textdrug)

knitr::kable(messy_data)

textdrug
Bup/Nx
Bup/Nx.
bup/nx
Percocets and Vicodin
Barbiturate (doesn’t know which)
heroin - “few days on, few days off”
heroin- "a few days on, few days off
Ambien = 2 pills
Ambien “a bunch” = 2 pills
promethazine (25mg), clonidine (0.1mg)

As you can see, these records contain many extraneous or problematic characters, multiple drugs in one record and several variations of the same drug (e.g. “bup/nx”). We assume that the user is solely interested in the drugs themselves, not in information such as dosage and units.

This messy data is exactly what parse() was designed for.

drug_names <- DOPE::parse(messy_data$textdrug)

drug_names
    [1] "bup/nx"       "bup/nx"       "bup/nx"       "percocets"    "vicodin"     
    [6] "barbiturate"  "heroin"       "ambien"       "ambien"       "promethazine"
   [11] "clonidine"   
   attr(,"na.action")
   [1] 8 9
   attr(,"class")
   [1] "omit"

Notice that parse() normalizes the capitalization and punctuation of ‘bup/nx’; it has special code to handle cases of ‘bup/nx’ and also ‘speedball’. It also recognizes that the final row, “promethazine (25mg), clonidine (0.1mg)”, names two distinct drugs and separates them. See the tidytext package.¹
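To make these cleaning steps concrete, here is a minimal base R sketch that lowercases records, drops parenthetical notes, splits multi-drug records and trims trailing punctuation. It is an approximation for illustration only, not the code DOPE::parse() runs internally; the helper name sketch_clean() and its splitting rules are assumptions.

```r
# Illustrative only: a base R approximation of the cleaning described above.
# sketch_clean() is a hypothetical helper, NOT part of DOPE.
sketch_clean <- function(x) {
  x <- tolower(x)
  # Drop parenthetical notes such as "(25mg)"
  x <- gsub("\\([^)]*\\)", "", x)
  # Split records that mention several drugs (comma, semicolon, the word "and")
  parts <- unlist(strsplit(x, ",|;|\\band\\b", perl = TRUE))
  # Trim whitespace, then strip trailing punctuation such as a final "."
  parts <- trimws(sub("[[:punct:]]+$", "", trimws(parts)))
  # Discard pieces that ended up empty
  parts[nzchar(parts)]
}

cleaned <- sketch_clean(c(
  "Percocets and Vicodin",
  "promethazine (25mg), clonidine (0.1mg)",
  "Bup/Nx."
))
print(cleaned)
```

A real corpus needs many more rules than this (quotes, dosages written without parentheses, misspellings), which is the work parse() is meant to take off your hands.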

The vector returned by parse() can then be passed on to the lookup_* functions to identify whether each input drug is a class, a category or a synonym for other drugs in the same category.
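The printed drug_names vector carries "na.action" and "class" attributes that resemble what stats::na.omit() leaves behind. That reading is an inference from the printed output, not something confirmed from the DOPE source. If you prefer a plain character vector before passing it on, a sketch like the following removes the attributes; the example vector is made up.

```r
# Simulate a result that carries an na.action attribute, as na.omit() produces.
with_attr <- stats::na.omit(c("heroin", NA, "ambien"))

# The attribute records which positions were dropped.
print(names(attributes(with_attr)))

# as.vector() drops all attributes from an atomic vector.
plain <- as.vector(with_attr)
print(plain)
```

The lookup_* functions may well accept the vector as is, so treat this step as optional tidying.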

`lookup_*`

`lookup()`

This function relies on a comprehensive lookup table lookup_df. This dataframe contains 3 variables:

class = Highest level classification e.g. “stimulant” or “narcotic (opioid)”
category = More specific level of classification e.g. “heroin”, “opium” or “marijuana”
synonym = Common name or street slang for specific drug name e.g. “china”, “dope” or “weed”

These names were based on terms used by the DEA.²

dim(lookup_df)
   [1] 4766    3

str(lookup_df)
   'data.frame':    4766 obs. of  3 variables:
    $ class   : chr  "hallucinogen" "hallucinogen" "hallucinogen" "hallucinogen" ...
    $ category: chr  "2cb" "2cb" "2cb" "2cb" ...
    $ synonym : chr  "banana split" "bdmpea" "bromo" "mft" ...

class(lookup_df)
   [1] "data.frame"

The purpose of this function is to return any possible matches in lookup_df, a comprehensive dataframe consisting of all drug classes, categories and synonyms. It serves as a source or helper function for many of the other, more specific functions in the package. The idea is to match a single word, a list of separate words or a vector passed as an argument against any of the columns. The dataframe returned consists of the matching lookup_df rows as well as the original_word that was the source of each match.
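The matching idea can be sketched in base R with a tiny hand-made table. This is a simplified illustration under stated assumptions: toy_lookup_df holds only a few rows copied from the printed output in this vignette, toy_lookup() is a hypothetical helper, and exact string equality is an assumption about how matching works rather than a description of DOPE's implementation.

```r
# A tiny stand-in for lookup_df (a few rows only, for illustration).
toy_lookup_df <- data.frame(
  class    = c("hallucinogen", "hallucinogen", "heroin", "heroin"),
  category = c("2cb", "2cb", "heroin", "heroin"),
  synonym  = c("bromo", "mft", "a-bomb", "aries"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row where the word equals the class,
# category or synonym, tagged with the word that produced the match.
toy_lookup <- function(words, table = toy_lookup_df) {
  hits <- lapply(words, function(w) {
    keep <- table$class == w | table$category == w | table$synonym == w
    if (!any(keep)) {
      # Unmatched words are kept as a row of NAs
      return(data.frame(
        original_word = w,
        class = NA_character_,
        category = NA_character_,
        synonym = NA_character_,
        stringsAsFactors = FALSE
      ))
    }
    data.frame(original_word = w, table[keep, ], stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, hits)
  rownames(out) <- NULL
  out
}

toy_results <- toy_lookup(c("bromo", "heroin", "percocets"))
print(toy_results)
```

Searching on a class or category name returns one row per synonym under it, which is why a broad term can produce a very long result.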

Here is an example of a common search done using lookup.

results <- lookup(unique(drug_names))
head(results, 15) %>%
  knitr::kable()

original_word	class	category	synonym
bup/nx	treatment drug	treatment drug	bup/nx
percocets	NA	NA	NA
vicodin	narcotic (opioid)	codeine combinations, non-injectable	vicodin
barbiturate	depressant	barbiturate	fiorina
barbiturate	depressant	barbiturate	nembutal
barbiturate	depressant	barbiturate	pentothal
barbiturate	depressant	barbiturate	seconal
heroin	heroin	heroin	a-bomb
heroin	heroin	heroin	a-bomb (mixed with marijuana)
heroin	heroin	heroin	achivia
heroin	heroin	heroin	adormidera
heroin	heroin	heroin	aip
heroin	heroin	heroin	al capone
heroin	heroin	heroin	antifreeze
heroin	heroin	heroin	aries

You can see that the dataframe returned can be very large (heroin alone returns a few hundred more matches), so the more specific functions below might be of more use depending on one’s needs.

`compress_lookup()`

This function takes in one argument: the table returned from a search using the lookup function. The purpose of this function is to narrow down the results to a more specific dataframe consisting of only relevant values, such as class and/or category depending on the user’s selection. compress_lookup returns, by default, original_word, class and category.

If a researcher wanted to determine the main classes of drugs being used by the patients of a clinical study, they might pass a large vector of substances from the study's clinical notes to the lookup function, then filter the result down to a dataframe of only the classes relevant to their needs.

Here is an example of a common search done using compress_lookup.

filtered_df <- compress_lookup(results)
head(filtered_df)
     original_word             class                             category
   1        bup/nx    treatment drug                       treatment drug
   2     percocets              <NA>                                 <NA>
   3       vicodin narcotic (opioid) codeine combinations, non-injectable
   4   barbiturate        depressant                          barbiturate
   5        heroin            heroin                               heroin
   6        ambien           Unknown                              Unknown

The resulting dataframe is a short list of only the relevant information needed.
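The narrowing step can be approximated in base R by dropping the synonym column and removing duplicate rows. The sketch below uses a made-up toy_results dataframe in the shape that lookup() returns; it illustrates the idea and is not the compress_lookup() implementation.

```r
# A made-up lookup result: one row per matched synonym.
toy_results <- data.frame(
  original_word = c("barbiturate", "barbiturate", "heroin", "heroin"),
  class         = c("depressant", "depressant", "heroin", "heroin"),
  category      = c("barbiturate", "barbiturate", "heroin", "heroin"),
  synonym       = c("nembutal", "seconal", "a-bomb", "aries"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then collapse repeated rows.
toy_compressed <- unique(toy_results[, c("original_word", "class", "category")])
rownames(toy_compressed) <- NULL
print(toy_compressed)

# Count how many distinct words fall into each class.
class_counts <- table(toy_compressed$class)
print(class_counts)
```

Counting by class in this way matches the use case of a researcher summarizing the main drug classes in a study.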

`lookup_syn()`

The purpose of this function is to find all possible synonyms of, primarily, a slang/street name of a commonly abused drug. Though searching for a drug class or category with lookup() will also return common synonyms, this function makes searching specifically for synonyms explicit by taking just one argument: drug_name. The function then determines the category of the slang term (drug_name) and returns all synonyms that share that category.

Here is an example of a common search done using lookup_syn.

results <- lookup_syn("shrooms")
head(results)
       category
   1  mushrooms
   2 psilocybin

The resulting dataframe contains a moderate list of terms that are synonyms of the drug_name given as determined by sources such as the DEA, FDA and other publicly available resources.
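The two-step logic described for this function (find the category of the term, then collect everything in that category) can be sketched as follows. The table is a small excerpt assembled for illustration, toy_lookup_syn() is a hypothetical helper, and the shape of its return value follows the prose description rather than the actual lookup_syn() output.

```r
# A small stand-in for the category and synonym columns of lookup_df.
toy_table <- data.frame(
  category = c("2cb", "2cb", "2cb", "heroin", "heroin"),
  synonym  = c("banana split", "bromo", "mft", "a-bomb", "aries"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: step 1 finds the category (or categories) that list
# drug_name as a synonym; step 2 returns every synonym in those categories.
toy_lookup_syn <- function(drug_name, table = toy_table) {
  cats <- unique(table$category[table$synonym == drug_name])
  out <- table[table$category %in% cats, c("category", "synonym")]
  rownames(out) <- NULL
  out
}

toy_synonyms <- toy_lookup_syn("bromo")
print(toy_synonyms)
```

A term that is not listed as a synonym yields an empty dataframe in this sketch.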
