FDA FAERS Database Download and Initialize

Introduction

I’ve been wanting to work on some data mining techniques for awhile. As the semester is coming to a close, I figured procrastinating on this take home exam was a good time to get started. To begin with, I thought I would use the FDA Adverse Event Reporting System (FAERS), which “contains adverse event reports, medication error reports and product quality complaints” per the FDA Website.

I have previously used the FDA API and found it to be a bit cumbersome, so I was looking for a different solution to managing the data. I’ve spent some time learning SQL this semester, so this seemed like a good time to start putting it to use. I downloaded MYSQL 8.0 and setup a local database. After some configuration issues, I had it working and just needed to get the data. The files are available from the FDA website in the following format: “https://fis.fda.gov/content/Exports/faers_ascii_yyyy_q#.zip”. For example, the data for 2012, 4th quarter can be downloaded from: https://fis.fda.gov/content/Exports/faers_ascii_2012q4.zip.

Since it is in such a consistent format, it is fairly straightforward to download all of the files. I wrote a couple of python scripts to download the files, extract all of the data files into one folder, and then generate a MYSQL shell script to load the data into the MYSQL database.

Code

The code is a pretty quick and dirty solution for now. If I end up trying to download some of the other FDA adverse event data, then I may revisit it and create a more complete solution. The github link for the files is below:

Github Repo

The current version of the code is also included below and some very brief discussion as well.

Download

This section was fairly easy to put together, as it simply requires python to download each file based on a pattern, and only retain the .txt files.

import zipfile
import urllib.request
import glob, os, shutil

## Define Year and Quarter ranges
years = list(range(2013, 2018))
quarters = list(range(1, 5))
# pre 2012 q4 should be "aers" not "faers"
base_url = "https://fis.fda.gov/content/Exports/faers_ascii_"
test_add = base_url + str(years[0]) + "q" + str(quarters[0]) + ".zip"

download_urls = []

## Generate list of URLS for download
for i in range(0, len(years)):
    for j in range(0, len(quarters)):
        temp = base_url + str(years[i]) + "q" + str(quarters[j]) + ".zip"
        download_urls.append(temp)

##
base_path = "E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/"
split_val = "https://fis.fda.gov/content/Exports/"

## Download each file, extract to txt files to DATA FOLDER, delete temporary files
for k in range(0, len(download_urls)):
    end_path = download_urls[k].split(split_val, 1)[1]
    filename = base_path + end_path
    urllib.request.urlretrieve(download_urls[k], filename)
    os.makedirs("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP")
    with zipfile.ZipFile(filename) as zip:
        zip.extractall("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP")
        files = glob.iglob(os.path.join("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP/ascii", "*.txt"))
        for file in files:
            if os.path.isfile(file):
                shutil.copy2(file, "E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/DATA")
    shutil.rmtree("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP/")

## Repeat above process for aers data

years = list(range(2004, 2013))
quarters = list(range(1, 5))
# pre 2012 q4 should ve "aers" not "faers"
base_url = "https://fis.fda.gov/content/Exports/aers_ascii_"
test_add = base_url + str(years[0]) + "q" + str(quarters[0]) + ".zip"
# print(test_add)

download_urls = []

for i in range(0, len(years)):
    for j in range(0, len(quarters)):
        temp = base_url + str(years[i]) + "q" + str(quarters[j]) + ".zip"
        download_urls.append(temp)

# print(download_urls)


base_path = "E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/"
split_val = "https://fis.fda.gov/content/Exports/"

for k in range(0, len(download_urls)):
    end_path = download_urls[k].split(split_val, 1)[1]
    filename = base_path + end_path
    urllib.request.urlretrieve(download_urls[k], filename)
    os.makedirs("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP")
    with zipfile.ZipFile(filename) as zip:
        zip.extractall("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP")
        files = glob.iglob(os.path.join("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP/ascii", "*.txt"))
        for file in files:
            if os.path.isfile(file):
                shutil.copy2(file, "E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/DATA")
    shutil.rmtree("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/TEMP/")

SQL script generation for import

When importing the data, I used a MYSQL script in the form of the code chunk below:

util.importTable("DEMO04Q1.txt", {table:"DEMO", skipRows: 1, linesTerminatedBy:"\r\n", fieldsTerminatedBy:"$"})

As there are a lot of import statements (8 for every quarter per year from 2004 to 2019), it was much easier to generate them into a file using python and then run the script from the MYSQL shell.

The script below generates the SQL file, as well as the lines creating all of the necessary tables.

## This file generates the SQL Shell script that will be run to initialize the data from the necessary files

## Setup 
base_statement = 'util.importTable("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/TEST/DATA/'
mid_statement = '.txt", {table:"'
end_statement = '", skipRows: 1, linesTerminatedBy:"\\r\\n", fieldsTerminatedBy:"$"})'

## Declare all year and quarter strings
year = ["04", "05", "06", "07", "08", "09", "10", "11",
        "12", "13", "14", "15", "16", "17", "18", "19"]

quarter = ["Q1", "Q2", "Q3", "Q4"]

## List of database names
db_names = ["DEMO", "DRUG", "INDI", "OUTC", "REAC", "RPSR", "STAT", "THER"]

## Open .sql script file
f = open("E:/SQL/FDA_FAERS/Python_DL/fda_faers_dl_extract/testfile.sql", "x")

## Prewritten string to declare all tables.

table_create = """CREATE DATABASE faers;\r\n\r\nUSE faers;\r\n\r\ncreate table demo
(\r\nprimaryid varchar(255),\r\ncaseid varchar(255),\r\ncaseversion varchar(255),\r\n
i_f_code varchar(255),\r\nevent_dt varchar(255),\r\nmfr_dt varchar(255),
\r\ninit_fda_dt varchar(255),\r\nfda_dt varchar(255),\r\nrept_cod varchar(255),
\r\nauth_num varchar(255),\r\nmfr_num varchar(255),\r\nmfr_sndr varchar(255),
\r\nlit_ref varchar(255),\r\nage varchar(255),\r\nage_cod varchar(255),
\r\nage_grp varchar(255),\r\nsex varchar(255),\r\ne_sub varchar(255),
\r\nwt varchar(255),\r\nwt_cod varchar(255),\r\nrept_dt varchar(255),
\r\nto_mfr varchar(255),\r\noccp_cod varchar(255),\r\nreporter_country varchar(255),
\r\noccr_country varchar(255));\r\n\r\ncreate table drug (\r\nprimaryid varchar(255),
\r\ncaseid varchar(255),\r\ndrug_seq varchar(255),\r\nrole_cod varchar(255),
\r\ndrugname varchar(255),\r\nprod_ai varchar(255),\r\nval_vbm varchar(255),
\r\nroute varchar(255),\r\ndose_vbm varchar(255),\r\ncum_dose_chr varchar(255),
\r\ncum_dose_unit varchar(255),\r\ndechal varchar(255),\r\nrechal varchar(255),
\r\nlot_num varchar(255),\r\nexp_dt varchar(255),\r\nnda_num varchar(255),
\r\ndose_amt varchar(255),\r\ndose_unit varchar(255),\r\ndose_form varchar(255),
\r\ndose_freq varchar(255));\r\n\r\ncreate table indi (\r\nprimaryid varchar(255),
\r\ncaseid varchar(255), \r\nindi_drug_seq varchar(255), \r\nindi_pt varchar(255));
\r\n\r\ncreate table outc ( \r\nprimaryid varchar(255),\r\ncaseid varchar(255),
\r\noutc_cod varchar(255));\r\n\r\ncreate table reac (\r\nprimaryid varchar(255),
\r\ncaseid varchar(255),\r\npt varchar(255),\r\ndrug_rec_act varchar(255));
\r\n\r\ncreate table rpsr (\r\nprimaryid varchar(255),\r\ncaseid varchar(255),
\r\nrpsr_cod varchar(255));\r\n\r\ncreate table ther (\r\nprimaryid varchar(255),
\r\ncaseid varchar(255),\r\ndsg_drug_seq varchar(255),\r\nstart_dt varchar(255),
\r\nend_dt varchar(255),\r\ndur varchar(255),\r\ndur_cod varchar(255));\r\n\r\n"""

## Write script to top of file.
f.write(table_create)
f.write("\r\n\r\n")

## Include each combination of year, quarter, and database name in list.
for i in range(0, len(db_names)):
    for j in range(0, len(year)):
        for k in range(0, len(quarter)):
            temp = base_statement + db_names[i] + year[j] + quarter[k] + mid_statement + db_names[i] + end_statement
            f.write(temp)
            f.write("\r\n")

## Close file
f.close()

After creating this file, it imported all of the data into the SQL database. Currently all of the variables are stored as varchar variables. I am planning on fine tuning the import once I examine all of the data types more carefully, but that will be a problem for when the semester is officially over.