Current biostatistics graduate student at the University of Minnesota. Previously a med device engineer. All opinions are my own. Available for engineering or statistical consulting.
For the first blog post of the new year, I am going to start with something a little fun. I have a coworker who makes coffee at her desk pretty much every afternoon. Once it starts brewing, she sends out an email that brings everyone running like a dinner bell. I have been meaning to play around with email scraping, so this was a good opportunity to mess around with it.
As a quick note, I don’t ever show the data directly in this post. This was done to make sure everything stays anonymous.
Outlook Scraping with Python
To start, I had to access all of the emails. I used the win32 Python library, as shown in the code chunk below. Every coffee email was sent with “Yodo” in the subject line, so I first filtered all emails by the coworker in question and then stored only those that were relevant to coffee. If you are trying to reproduce this yourself, the coworkername variable should be set exactly as the name appears in Outlook. The data is then exported to Excel. If you look closely, you can see there is a typo in the export filename, which I just did not fix.
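For readers who want a starting point, here is a minimal sketch of the filtering step described above. It is not the original chunk: the function names, the row layout, and the choice to read only the default inbox are my assumptions. The Outlook calls go through win32com.client from the pywin32 package, so that part only runs on Windows with Outlook installed.

```python
def is_coffee_email(sender_name, subject, coworker_name, keyword="Yodo"):
    """Return True when the email is from the coworker and the subject has the keyword."""
    return sender_name == coworker_name and keyword in subject


def fetch_coffee_emails(coworker_name, keyword="Yodo"):
    """Collect matching emails from the default Outlook inbox (Windows only)."""
    # Imported here so the pure filter above can be used without pywin32.
    import win32com.client

    outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    inbox = outlook.GetDefaultFolder(6)  # 6 is the olFolderInbox constant

    rows = []
    for item in inbox.Items:
        # Skip meeting requests and other non-mail items; 43 is olMail.
        if getattr(item, "Class", None) != 43:
            continue
        if is_coffee_email(item.SenderName, item.Subject, coworker_name, keyword):
            rows.append(
                {"sent": str(item.SentOn), "subject": item.Subject, "body": item.Body}
            )
    return rows
```

The returned rows could then be written out with something like pandas.DataFrame(rows).to_excel(...), using whatever filename you prefer.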
R Grouping and Visualization
In this next section, the generated Excel file is read into R and visualized in a few fun ways. Instead of including all of the library declarations at the top, like I typically would, I have called them just before they are used. Hopefully this makes it easy to see where each library is actually used.
Import and Transformations
This section is pretty straightforward. The data is read in, the date is converted to a lubridate date-time, and then a mutate is used both to calculate the time of day as a double for easy display and to determine the day of the week using the weekdays function.
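The two derived columns are easy to picture with a small example. The sketch below is a Python analogue of the transformation, not the R code from the post, and the function names are made up for illustration.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def time_of_day_hours(sent):
    """Time of day as a decimal number of hours, e.g. 14:30 becomes 14.5."""
    return sent.hour + sent.minute / 60 + sent.second / 3600


def weekday_name(sent):
    """Name of the weekday for a datetime (Monday is index 0)."""
    return WEEKDAY_NAMES[sent.weekday()]


# Made-up timestamp, purely for illustration.
example = datetime(2019, 1, 3, 14, 30)
print(weekday_name(example), time_of_day_hours(example))
```

Storing the time as a single decimal number is what makes it simple to put on a plot axis later.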
Wordcloud
For this section, there are four variables that are not declared in the code shown. They are left out because they contain personal information. The variables keyreplace 1 and 2 are names that appear multiple times. There are two email signatures present in the emails in the dataset. The beginnings of these signatures are unique and are stored in the colsplit variables.
This chunk has a few things going on. The two keyreplace names are removed using the gsub function, along with newline and carriage return characters. The remaining text is then split on the colsplit variables. This results in a clean string that contains only the body of the message. I copied the next section pretty much directly from this tutorial in order to prepare the text for a wordcloud.
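The cleaning steps can be sketched roughly as follows. This is a Python analogue rather than the gsub-based R code, and the function name, parameter names, and sample text are all hypothetical. One difference to be aware of: gsub treats its pattern as a regular expression by default, while this sketch removes the names as literal strings.

```python
import re


def clean_body(body, names_to_remove, signature_starts):
    """Strip names, line breaks, and everything from the signature onward."""
    text = body
    # Remove each name wherever it appears (literal match, not a regex).
    for name in names_to_remove:
        text = text.replace(name, "")
    # Replace newline and carriage return characters with spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # Keep only the part before the first signature marker that is found.
    for marker in signature_starts:
        text = text.split(marker, 1)[0]
    # Collapse leftover runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


# Made-up message, purely for illustration.
raw = "Hi Alice,\r\nCoffee is ready!\r\nBest regards,\r\nBob"
print(clean_body(raw, ["Alice", "Bob"], ["Best regards"]))
```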
The actual wordcloud is built here. I tried a few of the supplied color palettes from the RColorBrewer package and found I liked Set2 for this cloud. The cloud is then built using the wordcloud library.
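A word cloud sizes each word by how often it appears, so the core of the preparation is a frequency count. Here is a small sketch of that idea in Python. It is only an illustration: the tiny stopword list is my own assumption and is not the list used by the tutorial or the R packages.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for illustration only.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "it"}


def word_frequencies(messages, stopwords=STOPWORDS):
    """Count lowercase words across all messages, skipping stopwords."""
    counts = Counter()
    for message in messages:
        words = re.findall(r"[a-z']+", message.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


# Made-up messages, purely for illustration.
freqs = word_frequencies(["Coffee is ready", "The coffee is hot"])
print(freqs.most_common(3))
```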
Summarizing the Data
This chunk is very straightforward. The first section groups the data by day of the week for plotting. The factor levels are set manually, so the days of the week appear in order when graphed. The average, latest, and earliest brew times are then calculated and displayed.
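As a rough Python analogue of this grouping, the sketch below summarizes brew times with a manually fixed weekday order. The names and sample values are made up, and computing the average, earliest, and latest per weekday is my assumption; the same three statistics could just as well be taken over the whole dataset.

```python
from statistics import mean

# Manual ordering, playing the role of the factor levels.
WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]


def summarize_by_weekday(records):
    """records: iterable of (weekday_name, decimal_hour) pairs."""
    grouped = {}
    for day, hour in records:
        grouped.setdefault(day, []).append(hour)

    summary = []
    for day in WEEKDAY_ORDER:  # iterate in calendar order, not alphabetical
        hours = grouped.get(day)
        if not hours:
            continue
        summary.append(
            {
                "day": day,
                "count": len(hours),
                "average": mean(hours),
                "earliest": min(hours),
                "latest": max(hours),
            }
        )
    return summary


# Made-up brew times, purely for illustration.
sample = [("Thursday", 14.0), ("Monday", 13.5), ("Thursday", 15.0)]
for row in summarize_by_weekday(sample):
    print(row)
```

Without the fixed order, the days would come out alphabetically or in order of first appearance, which makes for a confusing plot axis.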
Finally, the data is plotted in a boxplot by day and as a time series through the year. Turns out Thursday is very consistent and Friday is very inconsistent! I wonder why that could be.
If you have any feedback or questions feel free to connect with me via the links on the home page.