reid/write

home // about
shitbloggin with reid scoggin

Tutorial: writing an email parser

Published: 3 Jun, 2020

This tutorial will walk you through the process I used to set up the Windows Subsystem for Linux, using a python script to download emails from a Gmail account using IMAP, and putting together basic shell commands like grep and sed to extract the data from those emails and compile them in a spreadsheet.

Introduction

If you have a lot of content that gets emailed to you but no easy way to extract the data they contain, the most common solution is called ‘email parsing’; simply put, this is telling a computer how to read the data inside emails and extract what is useful. Most of the solutions to this problem are apps and services you have to pay for, because it’s a common business requirement. These products are frankly too expensive, and I was surprised that there didn’t seem to be any easy open source ways to do this, so I tried figuring one out. I will walk you through my solutions so that you can figure out how to solve similar issues.

The issue I am trying to solve is how to extract useful data from customer survey emails forwarded to us at work by a client. I have about 18,000 of these in my inbox and get hundreds a day, but there’s no straightforward way to use them because they’re all stuck on an email server and split into individual HTML-formatted documents. I thought the simplest solution would be to use an IMAP downloader, then write a shell script to extract the data points and convert that to a spreadsheet.

The underlying code for this is relatively simple and straightforward, though the setup to run it may be a little unfamiliar. This document is written as a tutorial so that readers can understand the logic and apply similar processes to their own issues, as well as modify or adapt this script.

Configuring your PC & email

First, you will need to enable IMAP in your Gmail account, if you have one. Go to the settings gear icon at the top right, click Settings, then click Forwarding and POP/IMAP. Click ‘enable IMAP’ and hit save.

Now go to this webpage and enable the option toggle:

https://myaccount.google.com/lesssecureapps

Now you need to set up the Windows Subsystem for Linux (WSL). You can find instructions here:

https://docs.microsoft.com/en-us/windows/wsl/install-win10

Keep in mind that you must be able to run powershell as an administrator. You may need to update Windows, and it may take a while.

Once WSL is installed, follow the instructions on the previous link to install an Ubuntu instance.

Setting up on Linux

After Ubuntu is installed, a terminal window should pop up. Enter username and password credentials when it prompts you. Once you’re dropped into a command line, run this:

    sudo apt-get update

It will download a list of software. Next, run this and enter the password you just set when prompted:

    sudo apt-get install getmail

Then, run this:

    git clone https://github.com/ralbear/IMAPbackup.git

This will download software for backing up email into a new folder. Now, enter this command:

    cd IMAPbackup

This will drop you into your newly created folder. Now run:

    python2 imapgrab.py
    cp imapgrab.py ~/imap.py

If everything so far has been successful, the first command will print about 20 lines of text beginning with “IMAP Grab 0.1.4”. The second command will make a copy of the script into your home folder for easier access. If the first command ran successfully, we can proceed to connecting it to your email.

Scraping email

Run the following, substituting your email and password:

    python2 imapgrab.py -l -s imap.gmail.com -S -P 993 -u your.email@domain.com -p password

If this worked, you will see a list of all your email folders. Now we can download some of them.

Run the following command with your email address, password, and the label you want to download, spelled exactly as it was in the list of email folders:

    python2 imapgrab.py -d -v -M -f ~/email -P 993 -S -s imap.gmail.com -u your.email@domain.com -p yourpassword -m "Test"

Let’s walk through this command to understand what it’s doing.

        python2 imap2grab.py

We’re running a Python script (denoted by .py) using Python 2; if you try running this command with just ‘python’, it will default to Python 3 and not work. Now for the arguments:

        -d -v -M

These three flags are telling the script to download mailboxes into separate files and folders (-d), print verbose output in the command line (-v), and save the emails in maildir format (meaning one file per email instead of one big file with all of them) (-M).

        -f ~/email

This tells the script to download the emails to a folder called ‘email’ in your home directory.

        -P 993 -S

We’re going to connect to the email server on port 993, encrypting the connection using SSL (this is mandatory with Gmail).

        -s imap.gmail.com -u your.email@domain.com -p yourpassword

Server, username, and password.

        -m "email label"

This tells the script to download the emails with the label inside quotations marks.

Once you run this command, the imapgrab.py script will download everything under the specified label into the folder you set.

Grepping data

Now we have our emails, so we need to extract the data. This next part is not necessary to try yourself, but it’s a replication of how I experimented to solve the problem. First, I grabbed a random email from the pile and copied it over to play with.

    cp ~/email/my\ label/new\1591046073.M749748P5404Q27R4105a8a0b2df58c3.computer-name ~/test.email

Now we have test.email in our home directory. Let’s navigate back there with cd ~ and have a look. As I mentioned, the emails are HTML formatted and all had the same structure, which should make extraction easier. Let’s find some data we want and use grep to pull out the corresponding line:

    grep -o 'Order Number:</b> [^"]*' test.email

We’re telling grep to search test.email for lines that look like the above, and only display nonempty parts of the matching line (-o). The string we’re searching for is enclosed in single quotes, and has [^"]* in the part of the line that has variable content (the part unique data we’re interested in, in each email). This is a regular expression, and requires a little explaining.

Regular expressions (regex) are simple but very powerful rules that can be combined to manipulate text. In short, you feed the computer a set of rules using these symbols, point it at some data, and it will apply the rules to the data. grep is a tool that uses regex to search text files.

The * at the end of [^"]* is telling the computer to match using all of the rules before the *. The [^] is a negated set, telling the computer to ignore everything else inside of the brackets; which in this case is just a single quotation mark. So, we’re telling the computer to search for a line starting with ‘Order Number: </b>’, then looking at what follows that, ignoring quotation marks.

This prints:

    Order Number:</b> G751320146</td>

Success! Now we just chain a few of these together and extract all the lines we want. I’m going to skip ahead a bit and show you the end result:

    grep -orh 'Order Number:</b> [^"]*</td>\|Date:</b> [^"]*</td>\|Service:</b> [^"]*</td>\|Type:</b> [^"]*</td>\|Loss State:</b> [^"]*</td>\|On Time:</b> [^"]*</td>\|How Long:</b> [^"]*</td>\|Rating:</b> [^"]*</td>' ~/email/Test/new > ~/parser.temp

This might look intimidating, but let’s step through it quickly. We’re using grep with the -orh flags (-o only matching, -r recursive (all files in directory), and -h suppressing filename from the output), and pointing grep at the folder ~/email/Test/new, then saving the output to ~/parser.temp. You can see the chain of items we wanted to extract, stuck together using a few of these: \|

Converting to CSV

If we look at parser.temp after we’ve run the previous command, we see a giant list of every matching item:

    Order Number:</b> G72820146</td>
    Date:</b> 05/25/2020</td>
    Service:</b> #2</td>
    Type:</b> Auto</td>
    Loss State:</b> PA</td>

This has the data we want, but not in a useful format. We want a spreadsheet with these data points from each email in one line, with the extraneous HTML tags removed, and with column names at the top.

The simplest kind of spreadsheet is a comma-separated value (CSV) file, which is exactly what it sounds like. It will end up looking like this:

    Order Number,Date of Dispatch,Service,Type,Loss State,On Time,How Long,Rating,
    G72820146,05/25/2020,#2,Auto,PA,Yes,46-60 minutes,Good,
    G153820148,05/27/2020,#1,Auto,FL,Yes,Less then 30 minutes,Excellent,
    G603420140,05/19/2020,#3,Auto,FL,No,61+ minutes,Good,

Notice the first line is titles, and each line after is data. Let’s get started, by using the sed command to replace text.

    sed -i 's|</td>||g' ~/parser.temp

This will get rid of all the </td> tags by replacing them with nothing. The command is saying “us sed to search for </td>, replace that with [nothing], and apply to entire file”, with the sections of the argument separated by |'s. The -i flag tells sed to edit the file inline, instead of copying the output to a new file.

    sed -i 's|Order\ Number:</b>\ ||g' ~/parser.temp
    sed -i 's|Date:</b>\ ||g' ~/parser.temp
    sed -i 's|Service:</b>\ ||g' ~/parser.temp
    sed -i 's|Type:</b>\ ||g' ~/parser.temp
    sed -i 's|Loss\ State:</b>\ ||g' ~/parser.temp
    sed -i 's|Provider\ Arrived\ On\ Time:</b>\ ||g' ~/parser.temp
    sed -i 's|How\ Long\ Till\ Provider\ Arrived:</b>\ ||g' ~/parser.temp
    sed -i 's|Rating\ of\ Technician:</b>\ ||g' ~/parser.temp

These all do basically the same thing, searching for strings of text and replacing them with nothing (represented as || with nothing between them). Notice the \'s before all spaces, called ‘escaping’ them, which is to tell the computer that the argument isn’t over yet when it sees the space.

    sed -zi 's|\n|,|g' ~/parser.temp

This will search for the newline character (\n) and replace it with a comma. The -z flag tells sed to separate lines by null characters.

    sed -Ei 's/(G[0-9])/\n\0/g2' ~/parser.temp

This one is a little trickier; it uses extended regex (-E) to tell it to look for the pattern G[numbers], save that instance of the pattern, replace it with a newline, then paste the instance back in after the new line. The g2 tells it to skip the first one.

Almost done! Now, we’ll insert the column titles at the top of the file:

    sed -i '1i Order\ Number,Dispatch,Service,Type,Loss\ State,Provider\ Arrived\ On\ Time,How\ Long\ Till\ Provider\ Arrived,Rating\ of\ Technician,' ~/parser.temp

And, rename the file to CSV:

    mv ~/parser.temp ~/Parsed_Email.csv

And there we are!

Shell script

Here is our completed script (be sure to edit the first command as necessary, and edit the input folder):

    #!/bin/sh

    # download emails (edit with your email, pw, and desired label)
    # be sure to point this part at the correct location for imapgrab.py
    python2 ~/imap.py -d -v -M -f ~/email -P 993 -S -s imap.gmail.com -u EMAIL.HERE@autorescuesolutions.com -p PASSWORDHERE -m "TestLabel"

    # grep the relevant data out of all email files in that folder
    # be sure to edit the folder name at the end with your email label!
    grep -orh 'Order Number:</b> [^"]*</td>\|Date:</b> [^"]*</td>\|Service:</b> [^"]*</td>\|Type:</b> [^"]*</td>\|Loss State:</b> [^"]*</td>\|On Time:</b> [^"]*</td>\|How Long:</b> [^"]*</td>\|Rating:</b> [^"]*</td>' ~/email/TestLabel/new > ~/parser.temp

    # remove data descriptions
    sed -i 's|</td>||g' ~/parser.temp
    sed -i 's|Order\ Number:</b>\ ||g' ~/parser.temp
    sed -i 's|Date:</b>\ ||g' ~/parser.temp
    sed -i 's|Service:</b>\ ||g' ~/parser.temp
    sed -i 's|Type:</b>\ ||g' ~/parser.temp
    sed -i 's|Loss\ State:</b>\ ||g' ~/parser.temp
    sed -i 's|On\ Time:</b>\ ||g' ~/parser.temp
    sed -i 's|How\ Long:</b>\ ||g' ~/parser.temp
    sed -i 's|Rating:</b>\ ||g' ~/parser.temp

    # convert to CSV -- first replace all newlines with commas
    sed -zi 's|\n|,|g' ~/parser.temp

    # then find 'G[numbers]' pattern and insert newline
    sed -Ei 's/(G[0-9])/\n\0/g2' ~/parser.temp

    # then insert column titles
    sed -i '1i Order\ Number,Date,Service,Type,Loss\ State,On\ Time,How\ Long,Rating,Technician,' ~/parser.temp
    mv ~/parser.temp ~/Parsed_Email.csv
    echo "Parsing completed. Please see Parsed_Email.csv in your home directory."

So, we will copy and paste the above into a new file in our WSL instance, save it as parser.sh, and make it executable:

    chmod +x parser.sh

If you need to make any changes, you can edit the text file with vim or nano in your Linux terminal. Now we can run the script:

    ./parser.sh

If all goes well, you will see a bunch of emails downloading and a completion message shortly after.

I hope you find this useful! Please do not contact me for support.