In this study, we explored the prediction of parameters associated with clinical outcomes, including TMTV, lesion count, age, sex, and diagnosis status, from whole-body FDG-PET/CT using a deep regression approach, as shown in Figure 1.
Fig. 1: Overview of the end-to-end automated framework for clinical parameter prediction from tissue-wise FDG-PET/CT projections.
Dataset
Data collection
This study utilized the publicly available autoPET dataset from The Cancer Imaging Archive (TCIA) [9, 23, 24]. Released in 2022 as part of a tumor segmentation grand challenge, the autoPET dataset consisted of 1014 FDG-PET/CT volumes, each accompanied by manually annotated tumor delineations. Alongside the segmentation masks, clinical parameters such as age (in years), diagnosis type, and sex were also made available for each patient. The data encompassed three cancer types, lymphoma (144), lung cancer (167), and melanoma (188), together with a negative control group (513). All FDG-PET/CT scans and their manual annotations were provided as 3D volumes, typically from the head to the mid-thigh and, in some cases, covering the entire body, based on clinical relevance. The manual annotations were performed by two radiologists with ten and five years of experience. Publication of the anonymized autoPET dataset was approved by the Institutional Ethics Committee of the Medical Faculty at the Eberhard Karls University of Tübingen and the University Hospital of Tübingen (833/2020BO2). All subjects signed an informed-consent form.
Data acquisition
PET/CT examinations were conducted at the University Hospital Tübingen using a Siemens Biograph mCT scanner [9]. The protocol adhered to international guidelines for oncologic PET/CT, specifically the FDG PET/CT EANM procedure guidelines version 2.0 [25]. Diagnostic whole-body CT scans were acquired with standardized parameters: 200 mAs with automated exposure control (CareDose), a tube voltage of 120 kV, and a weight-adapted intravenous CT contrast agent (Ultravist 370, Bayer Healthcare), or without contrast agent in cases where contraindications existed. CT data were reconstructed in transverse orientation with a slice thickness ranging from 2.0 to 3.0 mm and an in-plane voxel size between 0.7 and 1.0 mm.
For PET imaging, FDG was intravenously administered after at least 6 hours of fasting, with a mean radioactivity of 314.7 MBq (range: 150 to 432 MBq) adjusted based on patient weight. PET scans covered four to eight bed positions (typically from skull base to mid-thigh) and were reconstructed using a 3D ordered-subset expectation maximization algorithm (2 iterations, 21 subsets, Gaussian filter 2.0 mm, matrix size 400 × 400, slice thickness 3.0 mm, voxel size 2.04 × 2.04 × 3 mm³). PET acquisition time was 2 min per bed position.
Ethical considerations
Ethical approval to conduct retrospective image analysis on the autoPET dataset was obtained from the Swedish Ethical Review Authority (Dnr 2023-02312-02). The study was conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki.
Data pre-processing
The PET data were converted to standardized uptake values (SUV), normalized by body weight. To ensure a consistent voxel size across the entire cohort, all CT and SUV data were resampled to a common image resolution of 2.04 × 2.04 × 3 mm³.
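The body-weight SUV normalization can be illustrated with a minimal sketch. It assumes that the PET volume is already expressed as a decay-corrected activity concentration in Bq/mL and that the injected dose is decay-corrected to the scan start; the function and parameter names are hypothetical and are not taken from the study's code.

```python
import numpy as np


def suv_body_weight(activity_bq_per_ml, injected_dose_bq, body_weight_kg):
    """Convert activity concentration (Bq/mL) to body-weight SUV (g/mL).

    Assumption: the injected dose is already decay-corrected to scan start.
    """
    if injected_dose_bq <= 0 or body_weight_kg <= 0:
        raise ValueError("dose and body weight must be positive")
    body_weight_g = body_weight_kg * 1000.0
    activity = np.asarray(activity_bq_per_ml, dtype=float)
    return activity * body_weight_g / injected_dose_bq


# Hypothetical example: 5 kBq/mL in a 70 kg patient injected with 350 MBq.
example_suv = suv_body_weight([5000.0], 350e6, 70.0)
```

With these illustrative inputs the tracer is distributed exactly as it would be if spread uniformly over the body mass, so the resulting SUV is 1.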
Proposed method: Efficient 2D representation of 3D PET/CT scans
Because processing 3D PET/CT volumes requires substantial GPU memory, 2D projections were derived instead. These provide compact 2D representations of the underlying 3D volumes while reducing overall memory usage. Although 2D projections can retain essential anatomical and metabolic information from the original PET/CT scans, some information is inevitably lost. To mitigate this, the original CT (CTorig) and SUV (SUVorig) volumes were supplemented with multi-channel 2D projections of individual tissues. This involved categorizing the CTorig and SUVorig volumes into four distinct tissue types based on the CT Hounsfield units (HU), according to Eqs. (1)-(4) [26, 27].
$$CT_{i}(\mathrm{bone})=\begin{cases}1 & \text{if } i\ge 200\\ 0 & \text{elsewhere}\end{cases}\tag{1}$$
$$CT_{i}(\mathrm{lean})=\begin{cases}1 & \text{if } i\in [-29,150]\\ 0 & \text{elsewhere}\end{cases}\tag{2}$$
$$CT_{i}(\mathrm{adipose})=\begin{cases}1 & \text{if } i\in [-190,-30]\\ 0 & \text{elsewhere}\end{cases}\tag{3}$$
$$CT_{i}(\mathrm{air})=\begin{cases}1 & \text{if } i < -190\\ 0 & \text{elsewhere}\end{cases}\tag{4}$$
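The thresholds in Eqs. (1)-(4) translate directly into boolean masks. The sketch below is illustrative only: the function names are hypothetical, and setting voxels outside a mask to zero is an assumption about how the masks are applied, not a detail confirmed by the text.

```python
import numpy as np


def tissue_masks(ct_hu):
    """Return boolean masks per tissue, using the HU limits of Eqs. (1)-(4)."""
    ct_hu = np.asarray(ct_hu)
    return {
        "bone": ct_hu >= 200,
        "lean": (ct_hu >= -29) & (ct_hu <= 150),
        "adipose": (ct_hu >= -190) & (ct_hu <= -30),
        "air": ct_hu < -190,
    }


def apply_mask(volume, mask):
    """Keep voxels inside the mask; set the rest to zero (assumed fill value)."""
    return np.where(mask, volume, 0)


# Toy example with one HU value per class, plus 175 HU.
# As the equations are written, values in (150, 200) HU match no mask.
ct = np.array([-1000, -100, 0, 175, 300])
suv = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
masks = tissue_masks(ct)
suv_bone = apply_mask(suv, masks["bone"])
```

The same masks would be applied to both the CT and the SUV volume, since the two are resampled to a common grid.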
This categorization resulted in a set of tissue-wise CT masks, representing bone tissue, lean soft tissue, adipose tissue, and air. Subsequently, tissue-wise CT and SUV volumes were generated by applying these masks to the respective CTorig and SUVorig volumes. Finally, maximum intensity projections (MIPs) were computed for all the SUV channels by capturing the highest intensity along the coronal (Θ = 0°) and sagittal (Θ = 90°) directions. Similarly, mean intensity projections (meanIP) were computed for all the CT channels by capturing the mean intensity along the respective direction. This process resulted in the creation of tissue-wise 2D projections, including \(C{T}_{orig}^{meanIP},\,SU{V}_{orig}^{MIP},\,C{T}_{bone}^{meanIP},\,SU{V}_{bone}^{MIP},\,C{T}_{lean}^{meanIP},\, SU{V}_{lean}^{MIP},\, C{T}_{adipose}^{meanIP},\,SU{V}_{adipose}^{MIP},\,C{T}_{air}^{meanIP}\), and \(SU{V}_{air}^{MIP}\).
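As a rough illustration of the projection step, the following sketch computes maximum and mean intensity projections with NumPy. The axis order (head-feet, anterior-posterior, left-right) and all names are assumptions made for this example; the actual orientation depends on how the volumes are loaded.

```python
import numpy as np

# Assumed axis order of the volume: (z, y, x) =
# (head-feet, anterior-posterior, left-right).
CORONAL_AXIS = 1   # collapsing anterior-posterior gives a coronal view
SAGITTAL_AXIS = 2  # collapsing left-right gives a sagittal view


def project(volume, axis, mode):
    """Collapse a 3D volume to 2D: mode 'max' is a MIP, 'mean' a meanIP."""
    if mode == "max":
        return volume.max(axis=axis)
    if mode == "mean":
        return volume.mean(axis=axis)
    raise ValueError("mode must be 'max' or 'mean'")


# Random stand-ins for an SUV and a CT volume on the same grid.
rng = np.random.default_rng(0)
suv = rng.random((8, 6, 4))
ct = rng.random((8, 6, 4))

suv_coronal_mip = project(suv, CORONAL_AXIS, "max")      # shape (8, 4)
suv_sagittal_mip = project(suv, SAGITTAL_AXIS, "max")    # shape (8, 6)
ct_coronal_meanip = project(ct, CORONAL_AXIS, "mean")    # shape (8, 4)
```

Repeating these calls for the original and the four tissue-wise volumes yields the ten projection channels per view listed above.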
The decision to include these specific tissue categories was based on the fact that they are important tissue types that can be easily identified from CT based on their HU. These categories also represent key anatomical components that provide valuable insights into structural and metabolic aspects of the whole-body and have the potential to aid future clinical outcome prediction. The selection of the specific projection angles, corresponding to coronal (0°) and sagittal (90°) views was based on their ability to provide complementary and comprehensive perspectives of the whole body PET/CT. These projections are utilized in medical imaging by experts, including radiologists, due to their effectiveness in visualizing various aspects of internal anatomy.
One of the important aspects of this research was to examine the effectiveness of incorporating multi-directional tissue-wise projections in the context of clinical parameter prediction. Multi-directional 2D projections refer to multiple CT or SUV projections obtained at different angles Θ with respect to the sagittal plane, where Θ ranges from −90° to +90° [28, 29]. Besides examining pure sagittal and coronal projections, we also evaluated the impact of introducing oblique projections, with Θ = ±45°. This idea originated from the fact that coronal and sagittal projections offer distinct and complementary information. Thus, the integration of oblique projections could potentially yield additional information.
Clinical parameter prediction: TMTV, lesion count, age, sex, diagnosis status
TMTV, lesion count and age prediction were formulated as regression tasks, given their nature as continuous variables. In contrast, prediction of sex and diagnosis status was approached as classification tasks. The number of images used for both the classification tasks was 1010.
Regression
TMTV and lesion count exhibit a positively skewed distribution, with values ranging from 0.1 to 2481 ml for TMTV and 1 to 900 for lesion count. In contrast, age shows an approximately Gaussian distribution, spanning from 11 to 95 years. The number of images used for age prediction was 1010 whereas TMTV and lesion count prediction employed 499 images.
Classification
Diagnosis status was approached as a binary classification task, distinguishing between cancer-positive and cancer-negative categories. Similarly, sex classification comprised male and female categories. The evaluation of diagnosis status involved 1014 scans, while sex classification utilized 1010 scans, excluding four cases with unspecified sex information.
Neural network training
Data preparation
MIPs and meanIPs were generated from tissue-wise SUV and CT volumes as discussed in section 2.3 along coronal and sagittal directions and normalized to [0, 1]. Subsequently, all the multi-channel coronal and sagittal projections were combined into a single image by placing them next to each other in the form of a collage, which resulted in a total of 1014 collages. This was done independently for all the CT and SUV projections, resulting in a 10-channel image collage for each patient, as shown in step 2 of Fig. 1.
Given the substantial variation in the PET/CT field of view, often encompassing regions from the head to mid-thigh and occasionally extending across the entire body, a simple padding-based approach was implemented. This involved applying zero padding and unpadding to the image collages, ensuring a uniform dimension of (512, 512) across the cohort. Subsequently, to assess the effectiveness of integrating multi-directional projections, oblique projections (Θ = ±45°) were introduced into the collage. This resulted in a uniform dimension of (512, 1024) across the entire cohort.
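The collage construction can be sketched for a single channel as follows. This is an illustration under stated assumptions, not the study's implementation: "unpadding" is interpreted here as center-cropping, the padding is centered, min-max scaling is applied per projection, and all function names are hypothetical.

```python
import numpy as np


def normalize01(img):
    """Min-max scale an image to [0, 1]; constant images map to zeros."""
    img = np.asarray(img, dtype=float)
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def pad_or_crop(img, target=(512, 512)):
    """Center-crop dimensions that are too large, zero-pad those too small."""
    slices = []
    for dim, size in zip(img.shape, target):
        start = max((dim - size) // 2, 0)
        slices.append(slice(start, start + size))
    out = img[tuple(slices)]
    pads = []
    for dim, size in zip(out.shape, target):
        missing = size - dim
        pads.append((missing // 2, missing - missing // 2))
    return np.pad(out, pads, mode="constant", constant_values=0)


def make_collage(coronal, sagittal, target=(512, 512)):
    """Place two projections of equal height side by side, then resize."""
    collage = np.concatenate(
        [normalize01(coronal), normalize01(sagittal)], axis=1
    )
    return pad_or_crop(collage, target)


# Toy projections sharing the same head-feet extent (300 rows).
coronal = np.arange(300 * 200).reshape(300, 200)
sagittal = np.arange(300 * 150).reshape(300, 150)
collage = make_collage(coronal, sagittal)
```

Stacking one such collage per projection type along a leading channel axis would give the 10-channel network input; adding the two oblique views widens the target to (512, 1024).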
Network configuration
A 2D CNN architecture, DenseNet-121 [30], was utilized for all the prediction tasks, including TMTV, lesion count, age, sex, and diagnosis status. In all cases, the input to the network was a combination of the multi-channel 2D collages consisting of \(C{T}_{orig}^{meanIP},\,SU{V}_{orig}^{MIP},\,C{T}_{bone}^{meanIP},\,SU{V}_{bone}^{MIP},\,C{T}_{lean}^{meanIP},\, SU{V}_{lean}^{MIP},\, C{T}_{adipose}^{meanIP},\,SU{V}_{adipose}^{MIP},\,C{T}_{air}^{meanIP}\), and \(SU{V}_{air}^{MIP}\). To evaluate the effectiveness of tissue-wise projections, several networks were evaluated that shared the same architecture and differed only in the number of input channels. These ablation experiments were primarily conducted for TMTV and age prediction, to investigate the impact of each individual channel. In all the prediction tasks, the baseline model utilized either the \(SU{V}_{orig}^{MIP}\) or \(C{T}_{orig}^{meanIP}\) as input, while the proposed model incorporated all tissue-wise CT and SUV projections.
To ensure consistent and unbiased evaluation, all models were independently assessed using 10-fold cross-validation, employing the same training-validation split. Stratification based on sex and cancer type was applied during cross-validation to maintain the same distribution across all the folds. All networks were trained on an NVIDIA RTX 3090 Ti GPU with 24 GB of memory. The training process used a smooth L1 loss for regression tasks and a binary cross-entropy loss for classification tasks. Both were optimized using the Adam optimizer with a learning rate of 0.0001, a weight decay of 0.00001, and a dropout rate of 0.25. A batch size of 30 was employed during training for faster convergence.
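For reference, the two loss functions named above can be written out in a few lines. The NumPy sketch below is for illustration only; the transition point `beta = 1.0` and the clipping constant `eps` are assumptions, since the text does not state them, and in practice a deep learning framework's built-in losses would be used.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones.

    beta (assumed to be 1.0 here) is the error size where the two meet.
    """
    diff = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(loss.mean())


def binary_cross_entropy(prob, label, eps=1e-7):
    """Mean binary cross-entropy for predicted probabilities in (0, 1)."""
    p = np.clip(np.asarray(prob, float), eps, 1.0 - eps)
    y = np.asarray(label, float)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


small_error = smooth_l1([0.5], [0.0])   # quadratic branch
large_error = smooth_l1([3.0], [0.0])   # linear branch
coin_flip = binary_cross_entropy([0.5], [1])
```

The linear branch limits the influence of large residuals, which is relevant for positively skewed targets such as TMTV and lesion count.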
Evaluation metrics
During 10-fold cross-validation, various evaluation metrics were employed to facilitate a comprehensive comparison between different models. For regression tasks, the metrics included mean absolute error (MAE), coefficient of determination (R²), and Pearson’s correlation coefficient (r), providing insights into predictive accuracy and consistency. For classification tasks, the metrics included area under the curve (AUC), sensitivity, precision, and specificity.
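The metrics listed above follow standard definitions, which the sketch below spells out. It is a minimal illustration with hypothetical function names; AUC is omitted because it requires ranking the predicted scores, and the classification metrics assume binary labels with predictions already thresholded.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, R² and Pearson's r for two equally long sequences."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return {"mae": float(mae), "r2": float(1.0 - ss_res / ss_tot), "r": float(r)}


def classification_metrics(y_true, y_pred):
    """Return sensitivity, precision and specificity for binary labels."""
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return {
        "sensitivity": float(tp / (tp + fn)),
        "precision": float(tp / (tp + fp)),
        "specificity": float(tn / (tn + fp)),
    }


reg = regression_metrics([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
cls = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

The regression example shows why R² and r are reported together: a constant offset leaves r at 1 while lowering R².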
Saliency analysis: Grad-CAM
To visualize the regions within the input images that contributed most to the neural network’s predictions, saliency analysis was performed using the Gradient-weighted Class Activation Mapping (Grad-CAM) approach. This algorithm computes the importance of a chosen feature map in the CNN in relation to the target variable. In our case, this was achieved by calculating the gradients of the output with respect to the feature map of the first convolutional layer. It generates a heatmap, indicating the regions of the input image that positively influenced the target prediction.
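The core Grad-CAM computation can be sketched independently of any deep learning framework. The example below assumes that the feature maps of the chosen layer and the gradients of the output with respect to them have already been extracted as arrays; the function name and the final rescaling to [0, 1] are choices made for this illustration.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    Both inputs have shape (channels, height, width).
    """
    # One weight per channel: the spatially averaged gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over channels, giving (height, width).
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions with a positive influence on the output.
    cam = np.maximum(cam, 0.0)
    # Rescale to [0, 1] for display (an illustrative choice).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam


# Toy input: two 2 x 2 feature maps with channel weights 1 and -2.
activations = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[1.0, 1.0], [1.0, 1.0]]])
gradients = np.array([np.ones((2, 2)), -2.0 * np.ones((2, 2))])
heatmap = grad_cam(activations, gradients)
```

The resulting map has the spatial size of the chosen feature map and is typically upsampled to the input resolution before being overlaid on the collage.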
Analyzing individual saliency maps can be time-consuming and sometimes challenging to interpret in large cohorts. Therefore, cohort saliency analysis was employed to reveal recurring patterns across the entire cohort [22, 31]. This involved spatially aligning all individual saliency maps to a reference image using a previously developed whole-body FDG-PET/CT image registration framework [26], so that regions that repeatedly contribute to the model’s predictions across subjects become visible. Females and males were registered to separate reference images due to anatomical differences, with template selection based on body fat percentage [32]. Cost function masking was applied around individual tumors to preserve their original shape. Further details regarding image registration and template selection for the autoPET cohort can be found in ref. 33.
The image registration technique was originally developed for 3D FDG-PET/CT data. However, in this study, where 2D projections were utilized for deep regression-based prediction, we first converted the 2D saliency maps into 3D, registered them in 3D, aggregated and then converted them back to 2D. This was made possible because nearly all the subjects had a similar field of view in terms of anatomical coverage, with only minor differences. Further details regarding cohort saliency analysis in the autoPET cohort can be found in ref. 22.
Statistics and Reproducibility
Prediction performance was assessed using R² and MAE for regression tasks (TMTV and age), and AUC for classification tasks (sex and diagnosis status). Statistical comparisons between the baseline and proposed models were conducted using two-sided Steiger’s Z-test for regression tasks and two-sided DeLong’s test for classification tasks. A p-value less than 0.05 was considered statistically significant, and no adjustments for multiple comparisons were applied.
All experiments were performed on the autoPET dataset, consisting of 1014 whole-body FDG-PET/CT scans. Model evaluation was conducted using 10-fold cross-validation. Each fold was trained once using identical preprocessing, network architecture, and training procedures, and no additional repeated runs with different random seeds were performed. In this study, the 10 folds serve as independent experimental replicates, as each fold provides an independent evaluation of model performance.

