Cat Data is a major Big Data player in the automotive industry, specializing in the collection, enrichment, and exploitation of vehicle-related data at scale. Its activities rely on the integration of heterogeneous information sources such as VIN identifiers, maintenance histories, technical specifications, warranties, and manufacturer metadata, originating from multiple countries and systems. These datasets form a critical backbone for decision-making by garages, insurers, and automotive manufacturers.
As the volume, heterogeneity, and fragmentation of automotive data continue to grow, current validation and integration pipelines remain largely manual or semi-automated, resulting in high operational costs, limited scalability, and persistent data quality issues. Automating data reconciliation, anomaly detection, and integrity validation has therefore become a strategic priority for Cat Data.
Within a CIFRE Ph.D. framework, Cat Data and ICAM Strasbourg-Europe propose a joint research project aimed at developing next-generation learning models capable of handling the structural complexity and scale of automotive data. The doctoral research will be conducted in close interaction with industrial teams, while benefiting from advanced academic supervision in machine learning and complex systems modeling.
Scientific Motivation
Automotive datasets exhibit a dual nature:
- Tabular data: structured attributes such as engine capacity, year of production, fuel type, or warranty duration.
- Relational data: complex dependencies between entities such as vehicles, maintenance records, parts, manufacturers, garages, and ownership histories.
State-of-the-art tabular models (e.g., gradient boosting, neural tabular models, TabPFN [1]) are highly effective on structured attributes but typically treat each row as an independent sample and therefore do not exploit inter-entity dependencies. Conversely, Graph Neural Networks (GNNs) [2, 3] excel at modeling relational structures but struggle to fully leverage high-dimensional tabular features and sparse categorical data.
This Ph.D. investigates whether hybrid graph–tabular learning architectures can overcome these limitations by preserving structured attribute information while exploiting relational dependencies, leading to better performance on real-world automotive data processing tasks.
Main hypothesis: A well-designed hybrid architecture will consistently outperform traditional approaches by jointly exploiting structured attributes and relational dependencies while remaining scalable and interpretable for industrial deployment.
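As a purely illustrative sketch of what "jointly exploiting structured attributes and relational dependencies" can mean, the Python snippet below augments each vehicle's tabular attributes with the mean of the same attributes over related vehicles (here, vehicles serviced by the same garage). This is one unweighted neighbor-aggregation step in the spirit of GNN message passing, not the architecture the project will develop. The toy data, the attribute names, and the helper functions are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one row of numeric tabular attributes per vehicle.
vehicles = {
    "V1": {"engine_cc": 1600.0, "year": 2015.0},
    "V2": {"engine_cc": 2000.0, "year": 2018.0},
    "V3": {"engine_cc": 1200.0, "year": 2012.0},
}

# Hypothetical relational data: (vehicle, garage) maintenance visits.
visits = [("V1", "G1"), ("V2", "G1"), ("V3", "G2")]


def neighbors_via_shared_entity(edges):
    """Link vehicles that share an entity (here, a garage)."""
    by_entity = defaultdict(set)
    for vehicle, entity in edges:
        by_entity[entity].add(vehicle)
    adjacency = defaultdict(set)
    for group in by_entity.values():
        for vehicle in group:
            adjacency[vehicle] |= group - {vehicle}
    return adjacency


def hybrid_features(rows, adjacency):
    """Concatenate each row's own attributes with its neighbors' mean attributes."""
    out = {}
    for vid, attrs in rows.items():
        feats = dict(attrs)
        nbrs = sorted(adjacency.get(vid, ()))
        for name, value in attrs.items():
            # Assumption: an isolated vehicle falls back to its own value.
            feats[f"nbr_mean_{name}"] = (
                mean(rows[n][name] for n in nbrs) if nbrs else value
            )
        out[vid] = feats
    return out


adjacency = neighbors_via_shared_entity(visits)
features = hybrid_features(vehicles, adjacency)
```

The resulting feature dictionaries could then be passed to any tabular model. A learned hybrid architecture would replace the fixed mean with trainable aggregation, for example attention weights as in [2].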
Application
Applicants are invited to send their CV and the email addresses of two referees to: rabih.amhaz(at)icam.fr
References:
[1] Hollmann, N., Müller, S., Purucker, L., et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature, 637, 319–326. https://doi.org/10.1038/s41586-024-08328-6
[2] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
[3] Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

