Fun with Email Scraping and Coffee

Introduction

For the first blog post of the new year, I am going to start with something a little fun. I have a coworker who makes coffee at her desk pretty much every afternoon. Once it starts brewing, she sends out an email that brings everyone running like a dinner bell. I have been meaning to play around with email scraping, so this was a good opportunity to mess around with it.

As a quick note, I don’t ever show the data directly in this post. This was done to make sure everything stays anonymous.

Outlook Scraping with Python

To start with I had to access all of the emails. I used the win32 python library as shown in the code chunk below. Every coffee email was sent with “Yodo” in the subject line, so I first filtered all emails by the coworker in question and then stored only those emails that have were relevant to coffee. If you are trying to reproduce this yourself, the coworkername variable should be set exactly how the name appears in outlook. The data is then exported to excel. If you are looking closely you can see there is a typo in the export filename and I just did not fix it.

import win32com.client
import pandas as pd
coworkername = "Last, First" 
    
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
message = messages.GetFirst()
rec_time = message.CreationTime
body_content = message.body
subj_line = message.subject
sender = message.sender
[date, time] = str(rec_time).split(" ")
[time, trash]= time.split("+") 
YodoTemp = []
while message:
    try:
        rec_time = message.CreationTime
        body_content = message.body
        subj_line = message.subject
        sender = message.sender
        print(rec_time)
        if (("Yodo" in (str(subj_line))) & (str(sender) == coworkername)):
            [date, time] = str(rec_time).split(" ")
            [time, trash]= time.split("+")
            Temp = [date, time, str(body_content)]
            YodoTemp.append(Temp)
        message = messages.GetNext()
    except:
        message = messages.GetNext()
YodoTimes = pd.DataFrame(data=YodooTemp, columns=["Date","Time","Body"])    
YodoTimes.to_excel("YodooTimes.xlsx")

R Grouping and Visualization

In this next section, the generated excel file is read into R and visualized in a few fun ways. Instead of including all of the library declarations at the top, like I typically would, I have called them just before they are used. Hopefully this makes it easy to understand when each library is actually used.

Import and Transformations

This section is pretty straightforward. The data is read in, the date is converted to a lubridate variable, and then a mutate is used to both calculate the time of day as a double variable for easy display and the day of the weeks are determined using the weekdays function.

library(readxl)

# File Import
fileloc <- "~/Projects/YodoData/YodooTimes.xlsx"
yodo <- read_excel(fileloc)

library(tidyverse)
library(lubridate)

# Combine Date and Time fields into one field, then convert to lubridate variable
yodo_days <- yodo %>% mutate(datetime = paste(Date,Time, sep = " "))
yodo_days$datetime <- ymd_hms(yodo_days$datetime)

# Calculate hourtime from hours and minutes for easy calculations and plotting later. 
# Also create variable for day of the week.
yodo_times <- yodo_days %>% 
  mutate(hourtime = hour(datetime) + minute(datetime)/60 - 12, 
         day = weekdays(datetime))

Wordcloud

For this section, there are 4 variables which are not declared in the code shown. These variables are left out because they have personal information in them. The variables keyreplace 1 and 2 are names that are included multiple times. There are two email signatures present in the emails sent in the dataset. The beginning of these signatures are unqiue and stored in the colsplit variables.

This chunk has a few things going on. The two keyreplace variables are removed using the gsub function, along with new line and carriage return characters. The remaining variable is then split on the colsplit variables. This results in a clean string to work with, that only contains the body of the message. The next section I copied pretty much directly from this tutorial in order to prepare the text for a wordcloud.

# Replace new line whitespace and identifiers.
yodoreplaced <- yodo %>% mutate(bodysplit = gsub("\\r","",Body)) %>% 
  mutate(bodysplit = gsub("\\n","",bodysplit)) %>% 
  mutate(bodysplit = gsub(keyreplace1,"",bodysplit)) %>% 
  mutate(bodysplit = gsub(keyreplace2,"",bodysplit)) 

library(reshape2)

# Split data into body and signature based on two possible signatures
split2 <- colsplit(yodoreplaced$bodysplit, colsplit1, c("bodysplit","sig"))
split3 <- colsplit(split2$bodysplit, colsplit2, c("bodysplit","sig"))

library(tm)
library(SnowballC)


# Declare as Corpus
bodyCorpus <-Corpus(VectorSource(split3$bodysplit))
# Convert to lowercase and remove punctuation and numbers.
bodyCorpus <- tm_map(bodyCorpus, content_transformer(tolower))
bodyCorpus <- tm_map(bodyCorpus, removePunctuation)
bodyCorpus <- tm_map(bodyCorpus, removeNumbers)
bodyCorpus <- tm_map(bodyCorpus, PlainTextDocument)
bodyCorpus <- tm_map(bodyCorpus, removeWords, stopwords('english'))

The actual wordcloud is built here. I tried a few of the supplied color palettes from the RcolorBrewer package and found I liked Set2 for this cloud. The cloud is then build using the wordcloud library.

library(RColorBrewer)

# Declare color palette for printing
pal <- brewer.pal(8,"Set2")

library(wordcloud)

# Plot as wordcloud
wordcloud(bodyCorpus[1], max.words = 100, random.order = FALSE, colors = pal)

plot of chunk wordcloudplot

Summarizing the Data

This chunk is very straightforward. T/he first section groups the data by day of the week for plotting. The factor levels are set manually, so when graphed the days of the week appear in order. The average, latest, and earliest brew times are then calculated and displayed.

# Group by statement to compare by day of the week. 
yodo_group <- yodo_times %>% group_by(day)
yodo_group$day <- factor(yodo_group$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

# This calculates the number of possible weekdays between the start and end date. Subtracts 10 for number of holidays.
dayspossible <- sum(!weekdays(seq(min(yodo_group$datetime), max(yodo_group$datetime), "days")) %in% c("Saturday", "Sunday")) - 10
# Find number of emails sent and calculate brew percentage
ncoffeedays <- max(yodo_group$X__1) + 1
coffeepercent <- round(ncoffeedays / dayspossible * 100)

# Find max time and date that it occurred.
maxtime <- yodo_times %>% filter(hourtime == max(hourtime))
maxhr <- floor(mean(maxtime$hourtime))
maxmin <- round((mean(maxtime$hourtime) - maxhr)*60)

# Very quick, dirty, and bad way to format later
if(maxmin > 9){
  leadzero = ""
} else {
    leadzero = "0"
}
maxdate <- maxtime$Date[1]

# Find mean time coffee was brewed
meanhr <- floor(mean(yodo_times$hourtime))
meanmin <- round((mean(yodo_times$hourtime) - meanhr)*60)

# Find max time and date that it occurred.
mintime <- yodo_times %>% filter(hourtime == min(hourtime))
minhr <- floor(mean(mintime$hourtime))
minmin <- round((mean(mintime$hourtime) - minhr)*60)

# Very quick, dirty, and bad way to format later
if(minmin > 9){
  leadzero = ""
} else {
    leadzero = "0"
}
mindate <- mintime$Date[1]

# Output for viewing

print(paste("Coffee brewed ",coffeepercent, "% of work days"), sep="")

[1] “Coffee brewed 67 % of work days”

print(paste("Average coffee brewed at: ", meanhr, ":", meanmin, "P.M."), sep="")

[1] “Average coffee brewed at: 1 : 53 P.M.”

print(paste("Latest coffee brewed at: ", maxhr, ":", leadzero, maxmin, "P.M. on :", maxdate), sep="")

[1] “Latest coffee brewed at: 4 : 0 9 P.M. on : 2018-08-16”

print(paste("Earliest coffee brewed at: ", minhr, ":", leadzero, minmin, "P.M. on :", mindate), sep="")

[1] “Earliest coffee brewed at: 1 : 0 6 P.M. on : 2018-11-07”

Finally, the data is plotted in a boxplot by day and as a time series through the year. Turns out Thursday is very consistent and Friday is very inconsistent! I wonder why that could be.

plot of chunk visualizeplot of chunk visualize

If you have any feedback or questions feel free to connect with me via the links on the home page.

Categories: , ,

Updated: