Using natural language processing and machine learning to predict severe injuries classification in the oil and gas industry

Marcelo Guarido, Daniel O. Trad

Severe injuries, such as fractures and amputations, are among the highest priorities for mitigation in any workplace. In this work, we use the incident descriptions in the severe injury reports of the Occupational Safety and Health Administration of the United States Department of Labor to build a machine learning model that standardizes the classification of incidents. We used natural language processing to convert each injury description into numerical features and applied the TF-IDF methodology to remove words that are not important to the classification of an injury. Models such as Extremely Randomized Trees and Multinomial Logistic Regression were trained and applied to reports from the oil and gas industry to test their accuracy. We reached the following conclusions: predictions improve when binary input features are used; Extremely Randomized Trees tends to predict the most frequent classes, with accuracy over 80%; and Logistic Regression works better for the other classes, with a balanced accuracy of 54% when implemented with balanced class weights.