My learnings from the e-mail classification use-case from the SK[AI] is the limit hackathon!

Business Understanding

This essay analyses the different aspects of a typical e-mail classification use case by breaking it down into the phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM). The goal is to work out a suitable e-mail classification approach across the phases of data understanding, data preparation, modelling, evaluation and deployment, and ultimately to enhance customer service with AI-supported e-mail classification. By analysing real customer service e-mails and choosing the right classification algorithm, time can be saved by automatically labelling each e-mail with the correct class or topic, and the classes themselves can be optimised by redefining them. Once the classification challenge is mastered, the model could also be used to suggest how-to manuals, which would shorten the response time at first contact, reduce the customer service workload and give employees more time for relevant tasks. Because we focused on the classification task, that topic is not covered here.


SK[AI] IS THE LIMIT! - HACK IT HAPPEN is an online hackathon organised by EESTEC LC Zurich, ipt AG and the Swiss Federal Department of Foreign Affairs (FDFA, in German EDA). It took place from noon on 20 November until the evening of 21 November. The topic of the hackathon was natural language processing. Our team of three competed in classifying, and even clustering, IT helpdesk e-mails and pitched the results to a jury for a chance to win one of the breathtaking prizes, such as a flight in a PC-12 or bodyflying.

Click here to see our pitch from the hackathon:

Click here to see the full hackathon: Sk[AI] is the limit - Hack it happen! - YouTube



The Swiss Federal Department of Foreign Affairs (FDFA) (Eidgenössisches Departement für auswärtige Angelegenheiten, EDA) determines and coordinates Swiss foreign policy on behalf of the Federal Council. The FDFA consists of the organizational units at head office in Bern and the network of Swiss representations, which includes embassies, consulates, cooperation offices and missions.

The FDFA internal IT Helpdesk

FDFA has its own IT department, which provides IT services. The IT Helpdesk is in charge of supporting internal users. Due to the high workload, support mails are sometimes only handled after several hours, which can result in users calling in and creating even more workload. Once a Helpdesk employee has analyzed a support request, they create an incident in the IT Service Management Tool. They assign the incident to a service and to a support group (the group that will handle the incident) and evaluate the urgency and impact of the request (which together determine the priority).

A “Mailbot” for the FDFA IT Helpdesk

To offer a better support service to the users, the FDFA IT Helpdesk plans to use a Mailbot (for German mails only). The Mailbot should analyze incoming support mails with a pretrained AI model, open an incident in the ticketing system with the correct service category and send a receipt mail to the user. This receipt mail should also contain links to manuals from a collection of how-to manuals, that the Helpdesk team maintains. Helpdesk employees will then take over and correct & finalize the created case.


A PoCathon is a proof-of-concept (PoC) in the form of a hackathon. FDFA wants to discover which approach performs best at analyzing and classifying the mails received by the internal IT Helpdesk. Every team creates its own PoC and competes against the other AI solutions.

Your Objective

Your objective is: Based on real IT helpdesk support mails, train one or multiple machine learning (ML) models that predict the following attributes as reliably as possible (measured by F1 score):

[Task 1] Predict Labels IT-Service & How-To Manuals

  1. Which service is the incident about?

  2. Which manuals of the provided how-to manual list would fit? Try to find max. 4 matches per incoming E-Mail.

[Task 2] Service-Merging Strategy

Find a solution for the evaluation of the best possible services to merge with each other into a single service. Try to come up with ready-to-use code that benchmarks different combinations of 2 or more services and show the resulting differences. Try to run automatic tests of one or more merges in parallel.

[Task 3] Bonus Challenge: Clustering to find new Manuals

If you completed the first parts of the PoC you may try to solve the following question:

Which manuals are missing on the list? Try to perform a clustering of the E-Mails, so that new manual categories can be identified and the Helpdesk team can create such how-to manuals.

An essay about my personal learnings

Data Understanding

The data is collected from existing customer service e-mails. Ideally, a data scientist prepares the data so that it is available as a CSV or Excel file. The difficulties in procuring it are the sheer amount of data, the distinction between important and unimportant information, and the merging of all data if there are several data sources. Each e-mail record contains various elements such as the ID number, incident type, service processed, subject of the e-mail, e-mail text, urgency, impact and the class, which is currently assigned manually by customer service.
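To get a first feeling for such a file, the records can be loaded and the share of each class counted. The following is only a minimal sketch using the Python standard library: the sample rows, the column names (`id`, `subject`, `text`, `urgency`, `impact`, `service`) and the helper names are assumptions made for illustration, not the real export of the helpdesk.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the exported helpdesk CSV;
# the column names are assumptions, not the real export schema.
SAMPLE_CSV = """id,subject,text,urgency,impact,service
1,Passwort vergessen,Ich kann mich nicht anmelden,2,3,Account
2,Drucker defekt,Der Drucker druckt nicht,3,3,Printing
3,Login Problem,Mein Konto ist gesperrt,2,2,Account
"""


def load_emails(handle):
    """Read e-mail records from a CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def label_distribution(rows, label_column="service"):
    """Return (label, share) pairs, most frequent class first."""
    counts = Counter(row[label_column] for row in rows)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


emails = load_emails(io.StringIO(SAMPLE_CSV))
for label, share in label_distribution(emails):
    print(f"{label}: {share:.0%}")
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle. A strongly skewed distribution is an early hint that accuracy alone will be misleading later on.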

Data Preparation

*The data preparation suggestions are taken from the winning approach presented by Ettore Galante, Mesut Ceylan and Alejandro Castañeira at the SK[AI] IS THE LIMIT hackathon 2020. Everything in bold is from their GitHub repository.*

Pre-processing is essential to better understand the raw data. With plt.pie, a pie chart can be created that gives an initial overview of the classes previously defined manually by customer service. plt.plot draws a diagram with one line per class. A similar approach is plot_label_distribution, which displays each class name next to its bar in a bar chart.

Usually we also need to identify the language in which a string or document is written. For this we use identity_language. It checks each of the supported languages for characteristic patterns, calculates a total score per language and derives from these scores the probability of each language for the analysed document (Simoes Alberto, 2020).

Then the data needs to be cleaned up. With NLP = spacy.blank("de") an empty pipeline is created for the German language (spaCy, 2020). We then iterate over all training data, a CSV file of e-mails, and clean the text of every e-mail with the help of regular expressions (Kaplun, 2020), so that afterwards it contains only text and no links or URLs (Google for Education, 2020). For this, the Python library re is imported (Friedl, 2020). To make the program run more efficiently, the regular expression is compiled once with Pattern = re.compile (Rajendra, 2020). The e-mail text is then tokenized, i.e. divided into smaller pieces of text, the tokens (Dataquest, 2020). Finally, by installing TensorFlow and downloading the multilingual Universal Sentence Encoder, we can encode all our e-mails.

Another interesting approach to extracting useful information from text would be to use Microsoft Azure's services. They can be used to extract n-gram features from text or to convert text into integer-encoded features using the Vowpal Wabbit library. You can also remove stop words to clean up the text, or use Word2Vec to convert the words into vectors (Microsoft, 2020).
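The cleaning and tokenizing steps described above can be sketched with the standard library alone. This is not the code from the winning repository: the URL pattern is a deliberately simple assumption, and the regex-based `tokenize` helper is only a simplified stand-in for the spaCy tokenizer.

```python
import re

# Compile the patterns once so they are reused for every e-mail.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
TOKEN_PATTERN = re.compile(r"\w+")


def clean_email(text):
    """Remove links and collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


def tokenize(text):
    """Split text into lowercase word tokens (stand-in for spaCy)."""
    return TOKEN_PATTERN.findall(text.lower())


raw = "Hallo, mein Drucker geht nicht. Siehe https://intranet.example/printer  Danke!"
cleaned = clean_email(raw)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

Compiling the pattern outside the loop is what makes the iteration over thousands of e-mails cheaper, since the expression is parsed only once.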

Modelling

I would not choose a non-sequential machine learning model but rather a recurrent neural network. More precisely, I would use the long short-term memory (LSTM) model, because it is a sequential model. In the hackathon we went with a GRU model, because it has a similar effect to an LSTM but was easier for us to implement. We then tried to sharpen the focus of our model with an attention layer, whose output we fed into a dense layer, and that gave us a classifier. This time, however, I would use an LSTM model.
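To illustrate what "sharpening the focus" with an attention layer means, here is a small sketch of attention pooling over the outputs of a recurrent layer. It assumes simple dot-product attention with a single query vector; the exact layer we used in the hackathon may have differed, and the numbers and function names are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by its dot product with `query`, then sum."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Three hypothetical GRU outputs (one per token), each of dimension 2.
states = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.1]]
query = [4.0, 0.0]  # in a trained model this vector would be learned
context, weights = attention_pool(states, query)
print(weights)
print(context)
```

The resulting `context` vector is what would be passed on to the dense layer: time steps that match the query dominate it, while the others contribute very little.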

I would check different possible merges of the classes using the K-Means algorithm. The more classes there are, the more difficult it is to assign the correct one, so with the K-Means clustering algorithm we can predefine K and merge the existing classes into K newly defined classes. We basically run the algorithm for a range of values of K, for example [3,6,9,12,15,18,21,24,27,30], and see which compositions of the newly built classes it suggests.
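The idea can be sketched as follows: represent every existing class by one vector (for example the mean embedding of its e-mails), cluster these vectors with K-Means and read the merge suggestions off the clusters. This is a plain standard-library K-Means written for illustration, with hypothetical two-dimensional class vectors; a real run would use the encoded e-mails and the larger values of K listed above.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(points, k, iterations=50, seed=0):
    """Plain K-Means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[c] = mean_vector(members)
    return assignment


def suggest_merges(class_vectors, k, seed=0):
    """Group class names whose vectors fall into the same cluster."""
    names = sorted(class_vectors)
    assignment = k_means([class_vectors[n] for n in names], k, seed=seed)
    groups = {}
    for name, cluster in zip(names, assignment):
        groups.setdefault(cluster, []).append(name)
    return sorted(groups.values())


# Hypothetical mean embeddings per service class (2-D for readability).
class_vectors = {
    "Account": [0.9, 0.1], "Password": [1.0, 0.0],
    "Printer": [0.0, 1.0], "Scanner": [0.1, 0.9],
}
for k in (2, 3):
    print(k, suggest_merges(class_vectors, k))
```

Each suggested merge would still have to be benchmarked by retraining the classifier on the merged labels and comparing the F1 scores, since closeness in embedding space alone does not guarantee a better model.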

The different gates of the LSTM model are what should ensure good performance. The model is structured in four stages. Stage one, the "forget gate", uses a first sigmoid layer to throw out unimportant information. Stage two, the "input gate", uses a second sigmoid layer to decide which values get updated, together with a tanh layer that creates a vector of candidate values; the two are combined and added to the state, which is how the state gets updated. Stage three, the "cell state", is the memory of our model, and stage four is the "output gate" (Rahuljha, 2020). Recurrent neural networks are known to suffer from exploding or vanishing gradients. In the vanishing case the gradient becomes so small that the parameter updates become insignificant and no learning takes place.
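The four stages can be written down for a single LSTM unit with scalar input and state. This is a minimal sketch of the standard LSTM equations, not the Keras implementation; the weights in `params` are hypothetical values that would normally be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM with scalar input and state.

    `params` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # how much old memory to keep
    i = sigmoid(pre_activation("input"))         # which values to update
    g = math.tanh(pre_activation("candidate"))   # candidate values for the state
    o = sigmoid(pre_activation("output"))        # how much of the state to expose
    c = f * c_prev + i * g                        # additive cell-state update
    h = o * math.tanh(c)                          # new hidden state
    return h, c


params = {  # hypothetical weights, normally learned during training
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is exactly the "memory" behaviour described above.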

The vanishing-gradient problem is mitigated by the structure of the LSTM model: it uses an additive gradient structure with direct access to the activations of the "forget gate", which lets the network encourage the desired behaviour of the error gradient through gate updates at every time step of the learning process (Arbel, 2020). This enables us to train our model with the forward computation, the loss calculation and the backward computation (Kristiadi, 2020), or in other words, with gradient descent.
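The additive structure can be made explicit with the standard LSTM cell-state update. The following is a simplified sketch: it follows only the direct path through the cell state and ignores the indirect paths through the hidden state; the symbols $f_t$, $i_t$, $\tilde{c}_t$, $W$ and $U$ are the usual textbook notation and not taken from the cited sources.

```latex
\begin{align}
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
      && \text{(additive cell-state update)} \\
  \frac{\partial c_t}{\partial c_{t-1}} &\approx \operatorname{diag}(f_t)
      && \text{(direct path only)} \\
  \frac{\partial c_T}{\partial c_k} &\approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
      && \text{(stays close to the identity while } f_t \approx 1\text{)} \\
  h_t &= \tanh(W h_{t-1} + U x_t)
      && \text{(plain RNN, for comparison)} \\
  \frac{\partial h_t}{\partial h_{t-1}} &= \operatorname{diag}\!\left(1 - h_t^{2}\right) W
      && \text{(repeated products shrink or blow up)}
\end{align}
```

In the plain RNN the same matrix $W$ and the tanh derivative are multiplied at every step, whereas in the LSTM the network can learn to keep $f_t$ near 1 for information it wants to remember.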

Evaluation

On sequential data such as text, recurrent neural networks normally achieve a higher accuracy than neural networks with only one layer. I would monitor the loss function as an indicator of model quality; TensorFlow offers losses such as MAPE (TensorFlow, 2020), although for a classification task categorical cross-entropy is the usual choice. For comparing several models there are also other options, such as the Friedman test or the Nemenyi test. In class we had a look at the coincidence matrix (confusion matrix), which is a very good way to measure performance on classification problems. In addition, precision and recall should be analysed, as well as the F1 score and the Jaccard index, to avoid misinterpreting a possibly biased accuracy. We want to avoid that our model merely memorises the training data, because our goal is to generalise. Therefore, we also split our data into three sets: one for training, one for validation and one for testing.
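The metrics mentioned above can be computed directly from the true and predicted labels. The sketch below uses only the standard library; the labels are invented, and macro-averaging the F1 score is an assumption, since the averaging used by the hackathon jury is not stated here.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, jaccard)}."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        scores[label] = (precision, recall, f1, jaccard)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(s[2] for s in scores.values()) / len(scores)


# Invented labels for illustration only.
y_true = ["Account", "Account", "Printer", "Printer", "Network"]
y_pred = ["Account", "Printer", "Printer", "Printer", "Account"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

In this toy example the rare class "Network" is never predicted, which drags the macro F1 well below the accuracy. That gap is the kind of biased accuracy the additional metrics are meant to expose.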

Deployment

A use case is successfully finished when the goal that was specifically set at the beginning (what, where and how to get there) has been reached. To know that it has been reached, it must be measurable. The goal has to be assigned to people to make it attainable, it must be relevant and feasible, and the time frame should be goal-driven yet realistic. As the team leader, you assign individual data scientists to the different CRISP-DM phases. You can either do it in 30 hours, as we did at the hackathon, or choose a more realistic time frame, at the end of which the model is deployed after successful training and validation. For deployment, the Keras Sequential object can be saved and exported and then hosted in the cloud or locally. We then need the API endpoint of the deployed instance, to which we send the prediction requests. Finally, monitoring and re-training are essential for such a use case to succeed.
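Sending a prediction request to such an endpoint could look roughly like the sketch below. The endpoint URL is a placeholder and the JSON layout with an `instances` key is an assumption (it follows the convention of TensorFlow Serving's REST API); the actual format depends on how the model is hosted. The request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL of the deployed model.
ENDPOINT = "https://example.invalid/v1/models/mail-classifier:predict"


def build_prediction_request(email_texts, endpoint=ENDPOINT):
    """Build (but do not send) a JSON POST request for a batch of e-mails."""
    payload = json.dumps({"instances": email_texts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(["Mein Drucker druckt nicht."])
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

Logging the requests and the returned predictions at this point is also a natural hook for the monitoring and re-training mentioned above.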


Arbel, N. (27 November 2020). Medium.

Dataquest. (27 November 2020).

Friedl, J. (24 November 2020). Mastering Regular Expressions. 3rd ed.

Google for Education. (25 November 2020). Developers Google.

Kaplun, E. (24 November 2020).

Kristiadi, A. (27 November 2020). Wiseodd GitHub.

Lynn, S. (25 November 2020). Shane Lynn: Data science, Startups, Analytics, and Data visualisation.

Microsoft. (27 November 2020).

Rahuljha. (27 November 2020). Towards Data Science.

Rajendra, D. (24 November 2020).

Simoes Alberto, C. J. (24 November 2020). Metacpan.

spaCy. (24 November 2020).

TensorFlow. (27 November 2020). TensorFlow.

© Antonia Durisch