Have you ever had to parse the structure of an email that contains HTML, text, and images? Easier said than done 😅
As a developer, you need to examine the email content, extract images or text, and do something with them, like upload the images to the web server or save them to a database. IMAP functions available in standard PHP libraries offer elegant and efficient processing of email messages, but if only things were simple!
Remember those wonderful times when email servers provided consistent, predictable structure? When parsing the structure of email was boring? You probably do not remember such times, not because you are too young, but because such times never existed! 😀
The reality sets in when we realize most email servers use a complex structure to represent the email message content. Developers seeking examples on how to retrieve the email content will find solutions that assume the structure is predictable and will not change. Unfortunately, the underlying structure of an email changes when it contains replies or attachments.
In this article, we introduce our approach to identifying the structure of an email and extracting the content we are interested in. We feel our approach is worth sharing because it works independently of how the mail server represents the email. As you examine the code, you may find it simple and elegant.
Email message structure
We use PHP to connect to the mail server, retrieve the latest email messages intended for use, and call PHP’s standard IMAP functions to get the structure of each email. This strategy works in principle, but we faced a major obstacle: the structure is not consistent from one message to the next. Even after extensive research on this matter, we made little progress. As a result, we built an in-house solution that recognizes the structure of an email message and processes it to collect the target elements.
Two-pronged process
Our solution uses two functions that operate on the email message structure:
- extract_body_part_path – accepts the structure of the email message and returns a hash with information about the paths within the structure where data is stored.
- extract_body_part_path_exception – looks at email structure using the structure path as returned by extract_body_part_path.
Process structure recursively
To extract the body part of an email from the available structure, we use a recursive approach. The email content is stored in the HTML element of the structure, but reaching this element is difficult because its location changes depending on server settings and response options. To solve this problem, we walk the structure recursively and build a hash. Each value of this hash is processed further to find the information stored in the “encoding” item of the structure.
After getting the path of the element and filtering on numerical values for the path, we store the path element in an array… aptly called $keys. The function returns a hash with two keys:
- Path – containing the values of $keys
- Encoding – the encoding type corresponding to current email component.
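To make the recursion concrete, here is a minimal sketch of the idea. It is not the article's extract_body_part_path: the function name find_html_part and the hand-built test structure are assumptions for illustration. It relies only on the documented shape of the object returned by imap_fetchstructure (subtype, encoding, parts), and it stops at the first HTML part it finds.

```php
<?php
// Hypothetical helper, not the article's extract_body_part_path.
// $part mimics an object returned by imap_fetchstructure():
//   ->subtype  (string, e.g. "HTML", "PLAIN", "ALTERNATIVE")
//   ->encoding (int, e.g. 3 = BASE64, 4 = QUOTED-PRINTABLE)
//   ->parts    (array of child parts, present only on multipart nodes)
function find_html_part(object $part, array $keys = []): ?array
{
    // An HTML body part ends the search.
    if (strtoupper($part->subtype ?? '') === 'HTML') {
        return [
            // A non-multipart message has no child index; its body is section 1.
            'path'     => $keys === [] ? [1] : $keys,
            'encoding' => $part->encoding ?? 0,
        ];
    }

    // Recurse into the children; IMAP section numbers are 1-based.
    foreach ($part->parts ?? [] as $index => $child) {
        $found = find_html_part($child, array_merge($keys, [$index + 1]));
        if ($found !== null) {
            return $found;
        }
    }

    return null; // no HTML part below this node
}

// Hand-built structure: multipart/mixed > multipart/alternative > (plain, html),
// followed by an image attachment.
$structure = (object) [
    'type' => 1, 'subtype' => 'MIXED',
    'parts' => [
        (object) [
            'type' => 1, 'subtype' => 'ALTERNATIVE',
            'parts' => [
                (object) ['type' => 0, 'subtype' => 'PLAIN', 'encoding' => 4],
                (object) ['type' => 0, 'subtype' => 'HTML',  'encoding' => 3],
            ],
        ],
        (object) ['type' => 5, 'subtype' => 'PNG', 'encoding' => 3],
    ],
];

$result = find_html_part($structure);
echo implode('.', $result['path']), "\n"; // prints "1.2"
```

Joining the numeric keys with dots gives the section string that imap_fetchbody expects. Nested message/rfc822 parts (forwarded emails) follow their own numbering rules, which this sketch does not try to cover.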
Process path exception
The second part of the system is the extract_body_part_path_exception function, which is responsible for going through the initial email message structure. This function uses the structure path built with extract_body_part_path and deals with unfamiliar structures. The work is detailed in the code below, but suffice it to say it provides a complete list of paths where the HTML content is stored.
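As an illustration of what "a complete list of paths" can look like, the sketch below collects every HTML part in a structure rather than only the first. The function collect_html_paths is a hypothetical stand-in written for this example; the article's extract_body_part_path_exception may work quite differently, in particular in how it uses the path from the first function.

```php
<?php
// Hypothetical collector, assumed for illustration only.
// Returns one entry per HTML part: its dotted section path and its encoding.
function collect_html_paths(object $part, array $keys = []): array
{
    $paths = [];

    if (strtoupper($part->subtype ?? '') === 'HTML') {
        $paths[] = [
            // A non-multipart message keeps its body in section "1".
            'path'     => implode('.', $keys === [] ? [1] : $keys),
            'encoding' => $part->encoding ?? 0,
        ];
    }

    // Visit every child, even after a match, so no HTML part is missed.
    foreach ($part->parts ?? [] as $index => $child) {
        $childPaths = collect_html_paths($child, array_merge($keys, [$index + 1]));
        $paths = array_merge($paths, $childPaths);
    }

    return $paths;
}

// Hand-built structure with two HTML parts at different depths.
$structure = (object) [
    'subtype' => 'MIXED',
    'parts' => [
        (object) [
            'subtype' => 'ALTERNATIVE',
            'parts' => [
                (object) ['subtype' => 'PLAIN', 'encoding' => 4],
                (object) ['subtype' => 'HTML',  'encoding' => 4],
            ],
        ],
        (object) ['subtype' => 'HTML', 'encoding' => 3],
    ],
];

$paths = collect_html_paths($structure);
foreach ($paths as $p) {
    echo "{$p['path']} (encoding {$p['encoding']})\n";
}
```

Returning dotted strings keeps each entry ready to pass to imap_fetchbody as a section argument.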
Using IMAP functions to extract email content
The process of extracting the HTML type content follows these steps:
- Fetch IMAP email message structure – use the standard imap_fetchstructure function.
- Extract structure path – use extract_body_part_path on the structure to build information about locations of the HTML type content.
- Build complete structure path – use extract_body_part_path_exception on the structure and the paths collected in the previous step.
- Extract HTML content – use imap_fetchbody to extract the specific path from the email message.
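The last step above can be sketched as follows. This is an assumption-laden example, not the article's code: decode_body_part and fetch_html_body are hypothetical names, and the path and encoding arguments stand for whatever the earlier steps produced. Only imap_fetchbody, base64_decode, and quoted_printable_decode are real PHP functions. The decoding part runs without a mail server, so it can be checked offline.

```php
<?php
// Values the IMAP extension reports in the structure's "encoding" field.
const ENC_BASE64 = 3;           // same value as the extension's ENCBASE64
const ENC_QUOTED_PRINTABLE = 4; // same value as ENCQUOTEDPRINTABLE

// Decode a raw body part according to its transfer encoding.
function decode_body_part(string $raw, int $encoding): string
{
    switch ($encoding) {
        case ENC_BASE64:
            return (string) base64_decode($raw);
        case ENC_QUOTED_PRINTABLE:
            return quoted_printable_decode($raw);
        default:
            return $raw; // 7bit, 8bit, binary: returned unchanged
    }
}

// Hypothetical wrapper for the final step. $imap is an open IMAP connection,
// $path is the list of numeric keys, $encoding the matching encoding value.
function fetch_html_body($imap, int $msgNumber, array $path, int $encoding): string
{
    $raw = imap_fetchbody($imap, $msgNumber, implode('.', $path));
    return decode_body_part((string) $raw, $encoding);
}

// Offline check of the decoding step (no mail server needed).
echo decode_body_part(base64_encode('<p>Hello</p>'), ENC_BASE64), "\n";
echo decode_body_part('Hello=20World', ENC_QUOTED_PRINTABLE), "\n";
```

Keeping the decoding separate from the fetch makes it possible to test the transfer-encoding handling on sample strings before pointing the code at a real mailbox.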
Conclusions
The structure representing an email retrieved from the mail server has several depth levels and changes based on the email message content and history. This is a problem for automatically processing its content, but we solved it by identifying all paths using a recursive approach.
The code can be found at:
A second/alternative solution where the recursive algorithm is hand coded from scratch: