Using natural language processing to convert mud-log chip descriptions to usable data tables

Marcelo Guarido, David J. Emery, Kristopher A. Innanen

We created a natural language processing pipeline that extracts mud-log cutting descriptions from PDF files and converts them into structured numerical tables that can be matched with wireline logs or seismic sections. The nature of the original tables required extensive preprocessing of the extracted object, including data manipulation, pattern recognition, missing-value treatment, and resampling. The extracted and processed table was merged with well logs and used to predict DTC. It improved the predictions over the baseline model that used wireline logs only: with a linear regression model, R² rose from 0.73 to 0.82. Feature selection with stepwise regression produced an optimized model that kept the quality of the predictions and gave equal importance to logs and cutting descriptions. Finally, an XGBoost regressor produced a non-linear model that improved the predictions to an R² of 0.88, relying more on the wireline logs. New tests were done on a train-validation split of 5% and 95% to avoid biased predictions. Both the stepwise and XGBoost regression predictions were less precise but still close to the actual values, showing the robustness of the methodology.
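As an illustration of the kind of preprocessing described above, the sketch below turns a few free-text cutting descriptions into a numerical table on a regular depth grid. It is only a minimal sketch under stated assumptions, not the pipeline used in this work: the line layout (`top-base: NN% lithology, ...`), the lithology vocabulary, the 1-unit depth step, and the helper names `parse_line` and `resample` are all hypothetical, and the input is typed in by hand rather than extracted from a PDF.

```python
import re

# Hypothetical description lines; the real mud-log layout is not given in the text.
RAW = [
    "1200-1205: 70% sandstone, 30% shale",
    "1205-1210: 100% shale",
    "1210-1215: 60% siltstone, 40% sandstone",
]

# Assumed lithology vocabulary; each entry becomes one numeric column.
LITHOLOGIES = ["sandstone", "shale", "siltstone"]

# Pattern recognition: depth interval, then the free-text description.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*:\s*(.*)$")
FRACTION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([a-z]+)")


def parse_line(line):
    """Turn one description line into (top, base, {lithology: fraction})."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    top, base = float(match.group(1)), float(match.group(2))
    text = match.group(3).lower()
    # Missing-value treatment: lithologies not mentioned are set to 0.
    fractions = {name: 0.0 for name in LITHOLOGIES}
    for percent, name in FRACTION_RE.findall(text):
        if name in fractions:
            fractions[name] = float(percent) / 100.0
    return top, base, fractions


def resample(intervals, step):
    """Repeat each interval's fractions on a regular depth grid (top inclusive)."""
    rows = []
    for top, base, fractions in intervals:
        n_steps = int(round((base - top) / step))
        for i in range(n_steps):
            rows.append({"depth": top + i * step, **fractions})
    return rows


intervals = [parse_line(line) for line in RAW]
table = resample(intervals, step=1.0)
for row in table[:3]:
    print(row)
```

Each row of `table` holds a depth and one fraction per lithology, so it can be joined to wireline-log samples at the same depths before fitting a regression model.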