Developing a reliable framework for monitoring mortality risk in post-transplant patients requires methods that capture the complexity of patient risk trajectories. These trajectories can be dynamically influenced by the treatment and vary between patients. We address this challenge in two key ways. First, we treat the input data as multivariate time series, preserving temporal dependencies across multiple clinical variables. Second, we formulate the prediction task as a rolling short-term risk estimation problem: predicting mortality within the next seven days based on data from the preceding 14 days. This formulation reflects a clinically relevant monitoring scenario, where risk must be continuously reassessed as new data become available. Instead of reducing predictions to binary outcomes, we retain the model’s continuous output (a score between zero and one), allowing for a more nuanced and time-resolved assessment of mortality risk.
Feature selection was guided by the availability and consistency of routinely collected clinical data across patients, with a focus on parameters commonly measured during post-transplant care. This ensures that the proposed framework relies on features that are broadly available in clinical practice and reduces the risk of bias arising from feature-specific missingness.
All methods and experiments were performed in accordance with the relevant guidelines and regulations.
Data
The primary clinical cohort was collected at the University Hospital Düsseldorf (UKD) and used for model development, including training, validation, and internal evaluation. The external cohort, derived from the MIMIC-IV database15, was used exclusively for external testing to assess the generalizability of the proposed framework and was not involved in model training or parameter tuning.
In both cohorts, we included only adult patients (\(\ge 18\) years) who underwent allogeneic HSCT. No patients were prospectively enrolled, and for the UKD cohort, no interventions outside of standard clinical care were performed.
UKD cohort
The UKD cohort comprises clinical routine data (electronic health records, EHR) from 891 adult patients who underwent at least one allogeneic HSCT at the University Hospital of Düsseldorf between 2004 and 2019. It was compiled from two data sources to include both patient demographic data and diagnoses alongside laboratory values. As a result, the data contain both numerical time series and categorical variables. Patients who underwent autologous HSCT were not included.
All patients were hospitalized during transplantation and the early post-transplant phase as part of standard clinical care, but not necessarily throughout the full 100-day monitoring period. Some patients were readmitted during this period. After discharge, laboratory values were recorded during regular outpatient follow-up visits and, when applicable, during readmission. Accordingly, the frequency and density of available laboratory measurements vary across patients depending on hospitalization status, follow-up schedule, and individual clinical course.
The data were pseudonymized in line with the HIPAA Safe Harbor requirements. Specifically, all 18 categories of direct identifiers were removed, and patients are identified by double-blind assigned integer identifiers not related to their data. In addition, time-related information is presented only relative to the date of HSCT, and demographic characteristics such as age are grouped into sufficiently broad categories to prevent re-identification of individuals, including those with unusual characteristics or outlier status. The data made available for this manuscript therefore contain no information that could be used to identify participants.
The use of the UKD cohort was approved by the Ethics Committee of the Medical Faculty at Heinrich Heine University Düsseldorf (case-id: 2019-513) on August 12, 2019. The approval permits processing of the de-identified data used in this study. In light of this approval, the removal of all identifiers according to HIPAA Safe Harbor, and the secondary nature of the data use, we believe that no individual written informed consent is required for this study.
MIMIC-IV cohort
To evaluate the robustness and generalizability of the proposed framework, we additionally used an independent cohort extracted from the publicly available MIMIC-IV database15, which contains de-identified health-related data from intensive care unit (ICU) patients. Access to the database requires completion of a data use agreement and certification process. From this database, we identified patients who underwent HSCT based on procedure and diagnosis codes and constructed a cohort designed to mirror the UKD setting as closely as possible. Mappings of laboratory parameters to MIMIC-IV itemids and ICD procedure codes used for cohort identification are provided in Supplementary Tables S1 and S2.
However, notable differences in data availability required adaptations in feature extraction. In contrast to the more standardized data collection in the HSCT-specific UKD cohort, measurements in MIMIC-IV are recorded based on clinical necessity in an intensive care setting. As a result, several laboratory parameters that are routinely monitored after HSCT, including gamma-glutamyl transferase (GGT37), C-reactive protein (CRP), and total protein (EIWEISS), exhibit substantial missingness, with many patients lacking these measurements entirely.
This missingness is not random but reflects clinical decision-making, where laboratory tests are performed selectively depending on the patient’s condition. Consequently, the absence of these parameters may itself carry implicit clinical information, but their inconsistent availability limits their direct use in a standardized modeling pipeline.
In addition to laboratory features, transplant-specific categorical variables are also affected. In particular, donor and graft information (ops_type), which encodes HLA matching and donor relationship in the primary cohort, is not available in MIMIC-IV and cannot be reliably reconstructed from the available data.
To maintain consistency of the input representation while avoiding bias from feature availability, we adapted the preprocessing accordingly. For the laboratory parameters GGT37, CRP, and total protein, values were imputed using the mean value of the respective feature computed over the first 100 days post-HSCT, independent of the number of available measurements per patient. This approach allows the incorporation of these features without disproportionately favoring patients with more frequent laboratory testing. For ops_type, we assigned a fixed worst-case category corresponding to an unrelated donor without HLA matching for all patients in the MIMIC-IV cohort.
Overall, these differences highlight the challenges of transferring models between datasets with distinct clinical contexts and documentation practices.
Data preprocessing
Clinical routine data after HSCT are characterized by irregular sampling, variable feature availability, and treatment-dependent measurement frequency. These challenges are further amplified when combining data from different clinical settings, such as the UKD and MIMIC-IV cohorts described above. Therefore, a carefully designed preprocessing pipeline is required to ensure comparability across patients and over time.
The preprocessing includes data from the time interval of \(-14\) to 107 days relative to the day of transplantation (day zero) in order to predict the daily mortality risk within the interval of 0 to 100 days. We did not apply normalization or standardization to the data, as all patients share the same initial conditions on day zero, and we aim to preserve the natural interactions between features. The preprocessing pipeline consists of eight steps, as illustrated in Fig. 2. The following sections provide a detailed explanation of each step.
Fig. 2 Schematic visualization of the preprocessing pipeline. It shows each processing step applied to the raw data to obtain the data used for model training.
Extraction of subsequences
To create the required input for training and evaluation of our monitoring framework, we first excluded the data from the day of death for deceased patients. We then applied a sliding window with a length of 14 days and a step size of one day, moving it across the time series to extract subsequences for each patient. The first iteration of the sliding window starts at day \(-13\), whereas the final iteration extends to day 100.
Consequently, predictions made on day zero for the next seven days are based on data from the time interval \([-13,0]\) days. In contrast, predictions on day 100 for the interval [101, 107] rely on data from days 87 to 100. This sliding window formulation enables continuous reassessment of patient risk and reflects a real-time monitoring scenario in which predictions are updated as new data become available.
The window size of 14 days was chosen for three reasons. First, it ensures that the conditioning phase prior to transplantation is fully captured. Second, since engraftment typically begins within 10–14 days after HSCT, this window spans both the transplantation and early engraftment phases. Third, given that clinical follow-ups typically occur at least once per week, a 14-day window provides sufficient observations to capture temporal trends.
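The window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_windows`, the dictionary-based input, and the handling of the day of death (no window may end on or after it) are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple


def extract_windows(
    daily_values: Dict[int, float],
    death_day: Optional[int] = None,
    first_end: int = 0,
    last_end: int = 100,
    window: int = 14,
) -> List[Tuple[int, List[Optional[float]]]]:
    """Slide a 14-day window with a step of one day over one feature.

    daily_values maps the day relative to HSCT to a measurement.
    Each result is (end_day, values), where values cover the days
    [end_day - 13, end_day]; days without a measurement are None.
    """
    if death_day is not None:
        # Assumption: data from the day of death onward are never used.
        last_end = min(last_end, death_day - 1)
    windows = []
    for end_day in range(first_end, last_end + 1):
        start_day = end_day - window + 1
        values = [daily_values.get(day) for day in range(start_day, end_day + 1)]
        windows.append((end_day, values))
    return windows
```

For a patient with complete follow-up, this yields 101 windows: the first covers days \([-13, 0]\) and the last covers days [87, 100].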
Selection of features
Blood-related features play an important role in tracking and assessing patients’ risk trajectories post-HSCT. Since different complications require the analysis of specific laboratory parameters, the recorded features vary among patients. As a result, some parameters are documented for only a small subset of individuals, whereas others are consistently recorded for all patients. Furthermore, the admission status of patients introduces additional variability in the type, amount, and frequency of recorded parameters.
To prevent the deep learning models from learning outcome-related biases due to feature availability, we prioritize features that are consistently available across patients in the HSCT-specific UKD cohort. Based on these considerations, we selected 22 laboratory parameters that are routinely measured and broadly available during the first 100 days after HSCT. These variables capture key aspects of organ function, inflammation, and hematopoietic recovery and are commonly used in post-transplant monitoring.
While most selected laboratory parameters are consistently recorded in the UKD cohort, some variables exhibit substantial missingness in the external MIMIC-IV cohort, where measurements are recorded based on clinical necessity in an intensive care setting. In particular, gamma-glutamyl transferase (GGT37), C-reactive protein (CRP), and total protein (EIWEISS) are not available for all patients. To retain these clinically relevant features while ensuring compatibility across cohorts, we incorporate them as described in the “Data” section.
Additionally, we considered five categorical features, namely the relative day of prediction, age, sex, principal diagnosis, and donor/graft information. The relative day provides temporal context. Age and sex were included due to known differences in immune response and baseline laboratory values16. Since both cohorts contain only adult patients, age was grouped into three categories (18–29, 30–60, and \(>60\) years) to avoid overfitting to specific values17 while preserving clinically meaningful stratification. Principal diagnoses were grouped into malignant, non-malignant, and other conditions. A comprehensive overview of the primary diagnoses and their respective frequencies is presented in Supplementary Table S3. Donor/graft information was summarized into four categories based on HLA matching and donor relationship.
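As a small illustration of the age categorization, the three groups can be encoded as shown below. The function name and the string labels are hypothetical; only the category boundaries are taken from the text.

```python
def age_group(age_years: int) -> str:
    """Map an age in years to one of the three categories used in the text."""
    if age_years < 18:
        # Both cohorts contain adult patients only.
        raise ValueError("age must be at least 18 years")
    if age_years <= 29:
        return "18-29"
    if age_years <= 60:
        return "30-60"
    return ">60"
```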
In the MIMIC-IV cohort, donor and graft information (ops_type) is mostly missing and cannot be reliably reconstructed. To maintain a consistent feature space, it is therefore approximated as described in the “Data” section.
While additional clinical variables such as graft-versus-host disease (GvHD), conditioning intensity, infection status, or vital signs are known to influence post-transplant outcomes, these variables were not consistently available in structured form across cohorts. Including them would have introduced bias and reduced the number of usable patient risk trajectories. Therefore, this work prioritizes robustness and reproducibility over maximal feature richness, focusing on consistently recorded or reliably approximated variables. Incorporating richer clinical variables represents an important direction for future work.
All selected features and categorizations were approved by medical experts at the University Hospital of Düsseldorf and are listed in Table 1. Descriptive statistics of the features after preprocessing are presented in Table 2 and Table 3.
Table 1 Abbreviated feature names and their full descriptions used for the monitoring framework.
Table 2 Descriptive statistics for the UKD and MIMIC-IV cohorts across the selected laboratory features and patient age. For each feature, the table reports the number of recorded measurements and the mean with standard deviation.
Table 3 Distribution of categorical variables in the UKD and MIMIC-IV cohorts, together with category-specific mortality within 100 days after HSCT. For each variable and category, the table reports the number and percentage of patients, as well as the number and percentage who died within 100 days.
Aggregation of values
Some of the 22 selected laboratory parameters were measured more than once for the same patient on the same day. In such cases, we used the median value to obtain a single daily measurement per feature, reducing the influence of extreme values and measurement noise. We note that the frequency of measurements may itself carry clinical information. However, incorporating this explicitly would introduce treatment-related bias, as patients in poorer condition are typically monitored more frequently, as discussed in the “Adding noise” section.
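A minimal sketch of the daily aggregation for a single feature of a single patient is given below. The input layout, a list of (day, value) pairs, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def aggregate_daily_median(
    measurements: Iterable[Tuple[int, float]]
) -> Dict[int, float]:
    """Collapse repeated same-day measurements into one median per day."""
    by_day = defaultdict(list)
    for day, value in measurements:
        by_day[day].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}
```

With three measurements of 5.0, 7.0, and 100.0 on the same day, the daily value becomes 7.0, so the single extreme reading does not dominate.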
Filtering of subsequences
Given the importance of temporal trends for prediction, we excluded subsequences with fewer than two observations for a laboratory parameter within the 14-day window, as they do not allow reliable estimation of temporal dynamics. This threshold represents a trade-off between retaining sufficient data and ensuring meaningful temporal information within each subsequence.
For the MIMIC-IV cohort, the same criterion was applied at the subsequence level. However, due to substantial missingness of certain laboratory parameters (GGT37, CRP, and total protein), these features were imputed with cohort-level mean values as described in the “MIMIC-IV cohort” section. Therefore, these variables primarily contribute contextual information rather than temporal dynamics, while temporal patterns are captured by the remaining features. As a result, we obtained approximately 88,500 unique subsequences for the UKD cohort and 7,208 for the MIMIC-IV cohort.
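The filtering criterion can be illustrated with the sketch below. It assumes that a subsequence is kept only if every laboratory parameter has at least two observations within the window; the text does not spell out whether the threshold applies to each parameter individually, so this reading, the function name, and the data layout are assumptions.

```python
from typing import Dict, List, Optional


def keep_subsequence(
    window: Dict[str, List[Optional[float]]], min_observations: int = 2
) -> bool:
    """Return True if each feature has enough observed values in the window.

    window maps a feature name to its 14 daily values (None = not measured).
    """
    return all(
        sum(value is not None for value in values) >= min_observations
        for values in window.values()
    )
```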
Imputation
The numerical features within each subsequence are irregular and vary in length. To obtain a uniform representation suitable for model input, missing values were imputed using a combination of last observation carried forward, next observation carried backward, and linear interpolation, depending on the position of missing values within the sequence. This approach preserves temporal continuity while minimizing the introduction of artificial trends.
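One way to combine the three strategies is sketched below. The mapping from position to strategy (leading gaps filled backward, trailing gaps filled forward, interior gaps interpolated linearly) is an assumption that is consistent with the description but not confirmed by it, and the function name is hypothetical.

```python
from typing import List, Optional


def impute(values: List[Optional[float]]) -> List[float]:
    """Fill missing daily values (None) in one feature of one subsequence."""
    observed = [i for i, value in enumerate(values) if value is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    first, last = observed[0], observed[-1]
    # Leading gap: next observation carried backward.
    for i in range(first):
        filled[i] = values[first]
    # Trailing gap: last observation carried forward.
    for i in range(last + 1, len(values)):
        filled[i] = values[last]
    # Interior gaps: linear interpolation between neighboring observations.
    for left, right in zip(observed, observed[1:]):
        span = right - left
        for i in range(left + 1, right):
            filled[i] = values[left] + (values[right] - values[left]) * (i - left) / span
    return filled
```

For example, the sequence (missing, 2, missing, missing, 8, missing) becomes (2, 2, 4, 6, 8, 8).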
Clipping
We observed that some patients exhibited extreme laboratory values. To prevent such outliers from disproportionately influencing model training while preserving their clinical relevance, we applied percentile-based clipping. Specifically, for each feature, values below the \(0.5\text {th}\) percentile and above the \(99.5\text {th}\) percentile were clipped to these respective thresholds, computed separately for each cohort (UKD and MIMIC-IV). This preserves the relative ordering of values while limiting the impact of extreme outliers.
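The clipping step can be sketched as follows for the values of one feature within one cohort. The percentile definition (linear interpolation between closest ranks) is an assumption, since the text does not specify which estimator was used; both function names are hypothetical.

```python
from typing import List


def percentile(sorted_values: List[float], q: float) -> float:
    """Percentile q in [0, 100] using linear interpolation between ranks."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)  # position is non-negative, so this is a floor
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def clip_feature(
    values: List[float], low_q: float = 0.5, high_q: float = 99.5
) -> List[float]:
    """Clip values to the [0.5th, 99.5th] percentile range of the same list."""
    ordered = sorted(values)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    return [min(max(value, low), high) for value in values]
```

Calling `clip_feature` once per feature and per cohort mirrors the cohort-specific thresholds described above.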
Adding noise
Both cohorts exhibit treatment-related bias reflected in measurement frequency: patients in poorer condition are typically monitored more frequently than stable patients. Although aggregation and imputation reduce this effect, residual patterns may still be present. To mitigate this bias, we added Gaussian noise (mean 0, standard deviation 0.05) to each measurement, constrained within feature-specific value ranges. This encourages a deep learning model to focus on temporal patterns in the data rather than implicitly encoding measurement frequency.
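A minimal sketch of the noise step is shown below. It assumes that "constrained within feature-specific value ranges" means clamping the noisy value to a lower and upper bound per feature; the function name, the bounds passed as arguments, and the optional seed are assumptions made for illustration.

```python
import random
from typing import List, Optional


def add_noise(
    values: List[float],
    lower: float,
    upper: float,
    sigma: float = 0.05,
    seed: Optional[int] = None,
) -> List[float]:
    """Add zero-mean Gaussian noise and clamp to the feature's value range."""
    rng = random.Random(seed)
    return [
        min(max(value + rng.gauss(0.0, sigma), lower), upper) for value in values
    ]
```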
Creation of labels
Each subsequence was labeled as deceased if the patient died within seven days following the last day of the subsequence, and as survived otherwise. For example, a subsequence covering days \([-13, 0]\) is labeled as deceased if death occurs within days [1, 7]. Similarly, a subsequence from the time interval of day 87 to day 100 is labeled as deceased if the patient died within the interval from day 101 to day 107.
This labeling strategy aligns with the objective of short-term risk prediction in a clinical monitoring setting, enabling early detection of patient deterioration while aiming for a clinically actionable lead time. The resulting dataset is highly imbalanced, with approximately \(1\%\) of subsequences labeled as deceased.
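The labeling rule can be written compactly as shown below. The function name and the encoding (1 for deceased, 0 for survived) are assumptions; the seven-day horizon and the interval boundaries follow the examples in the text.

```python
from typing import Optional


def label_subsequence(
    end_day: int, death_day: Optional[int], horizon: int = 7
) -> int:
    """Return 1 if death occurs within the horizon after the window, else 0.

    end_day is the last day of the subsequence; death_day is None for
    patients without a recorded death.
    """
    if death_day is None:
        return 0
    return int(end_day < death_day <= end_day + horizon)
```

A window ending on day 0 is labeled as deceased for a death on days 1 to 7, and a window ending on day 100 for a death on days 101 to 107.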
Prediction models
During the selection of an appropriate deep learning model for our framework, we identified the following key requirements. The chosen model must capture temporal dependencies in multivariate time series data, so that it can learn and recognize patterns across time from clinical laboratory parameters. Additionally, the model has to generalize well on the limited dataset of 891 patients and, at the same time, scale effectively as the dataset grows. Another consideration is computational efficiency, i.e., the model needs to be compact enough to run locally without requiring extensive computational resources. Based on these requirements, we identified the Explainable Convolutional Neural Network for Multivariate Time Series Classification (XCM)18 as the most suitable choice.
In addition to the XCM, we considered a Long Short-Term Memory Network (LSTM)19 and a Multilayer Perceptron (MLP)20. In contrast to CNNs, which were originally developed for image classification, the LSTM architecture was explicitly designed for time series data and is commonly used as a baseline for time series classification tasks in the medical domain21,22,23,24,25. The MLP architecture, initially conceived to handle non-linearly separable data, serves as a second baseline that allows us to compare the performance of XCM with an approach that is not specifically designed or adapted to handle multivariate time series data. Like the LSTM, it is widely employed for classification tasks in the medical domain26,27,28,29,30.
Explainability method
The explainability method aims to enable physicians to understand the reasoning behind the predictions without requiring technical expertise. It should provide explanations fast enough to avoid interfering with everyday medical practice. Further, the calculated explanations should allow a quick visual identification of the relevant parameters. In addition, the explanation method should use the original model and be applicable to the original data. These requirements led us to apply Integrated Gradients (IG)31 in combination with Temporal Saliency Rescaling (TSR)32 in order to increase the interpretability of the explanations.
IG is a gradient-based explainability method that attributes feature importance by integrating gradients along a path from the baseline input to the actual input. Since IG was originally developed for explaining image classification tasks, it does not account for temporal dependencies between data points. To address this limitation, we applied TSR to the output of IG. TSR computes a time relevance score and a feature relevance score, whose product yields the time-resolved feature importance score.
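For reference, the standard definition of IG for a model \(F\), an input \(x\), and a baseline \(x'\) assigns to input component \(i\) the attribution

\[
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha ,
\]

which is commonly approximated by a Riemann sum with \(m\) steps:

\[
\mathrm{IG}_i(x) \approx (x_i - x'_i) \, \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m} (x - x')\big)}{\partial x_i} .
\]

The product structure of TSR described above can be written, with notation introduced here only for illustration, as

\[
R_{f,t} = \Delta^{\text{time}}_{t} \cdot \Delta^{\text{feature}}_{f} ,
\]

where \(\Delta^{\text{time}}_{t}\) denotes the time relevance score of time step \(t\), \(\Delta^{\text{feature}}_{f}\) the feature relevance score of feature \(f\), and \(R_{f,t}\) the resulting time-resolved feature importance. The baseline \(x'\) and the number of steps \(m\) used in this study are not specified in this section.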
Cross-validation and ensemble of models
In our study, we employed a 10-fold cross-validation strategy with an 8-1-1 split, ensuring that within each iteration, \(80\%\) of the patients were used for training, \(10\%\) for validation, and \(10\%\) for testing. Given the limited size of the dataset of 891 patients, of whom around \(10\%\) died within the first 100 days, the split was chosen to maximize the available training data while ensuring a robust evaluation of the model performance. To further assess the generalization capabilities of our models, we repeated this cross-validation procedure five times, each time using a new random assignment of patients to folds without replacement, while ensuring that the label distribution remained consistent across all folds.
Since the predictions of individual models vary depending on the training data, we aimed to provide physicians with uncertainty estimates for each mortality risk prediction to further support their decision-making. To compute uncertainty estimates for a given patient p at time step t, we selected the model from each cross-validation run where p was in the test fold, i.e., where the patient was not used for training. This procedure yielded five models per patient and time step and thus five independent mortality risk predictions, which we used to compute the mean risk score and the corresponding \(95\%\) confidence interval. Since the analysis of five explanations, each of size \(27 \times 14\), can be time-consuming and impractical, we average and normalize the five explainability results for each patient p at time step t to provide a unified explanation while preserving the findings.
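The aggregation of the five predictions can be sketched as follows. The text does not state how the confidence interval was constructed; the sketch assumes a Student-t interval around the mean with four degrees of freedom and clamps the bounds to the valid score range, both of which are assumptions. The function name is hypothetical.

```python
from statistics import mean, stdev
from typing import List, Tuple

# Two-sided 95% quantile of the Student-t distribution with 4 degrees of
# freedom (five predictions). Using a t-interval is an assumption.
T_CRIT_95_DF4 = 2.776


def risk_with_interval(predictions: List[float]) -> Tuple[float, float, float]:
    """Return (mean risk, lower bound, upper bound) for five risk scores."""
    if len(predictions) != 5:
        raise ValueError("expected one prediction per cross-validation run")
    center = mean(predictions)
    half_width = T_CRIT_95_DF4 * stdev(predictions) / len(predictions) ** 0.5
    # Risk scores lie in [0, 1], so the interval is clamped to that range.
    return center, max(0.0, center - half_width), min(1.0, center + half_width)
```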
For our experiments, all models were trained for 25 epochs using the preprocessed data described above and a weighted loss function that mitigates the massive label imbalance. Predicted risk scores were further calibrated by temperature scaling on the validation fold of the respective run 33.
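Temperature scaling divides the model's logits by a single scalar \(T > 0\) that is fitted on held-out data by minimizing the negative log-likelihood. The sketch below illustrates the idea under several assumptions that are not taken from the text: the model emits one logit per subsequence that is mapped to a risk score by a sigmoid, and \(T\) is found by a simple grid search rather than a gradient-based optimizer. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    """Mean binary negative log-likelihood of the temperature-scaled scores."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(
    logits: Sequence[float],
    labels: Sequence[int],
    grid: Optional[List[float]] = None,
) -> float:
    """Pick the temperature from a grid that minimizes the validation NLL."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))
```

A fitted temperature above one softens overconfident scores, whereas a value below one sharpens them; the ranking of the predictions is unchanged in either case.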

