Building prediction models
Normal prostate tissue
We built gene expression prediction models using Genotype-Tissue Expression (GTEx) Project [14], v8 genotype and gene expression data for normal prostate tissues from 183 European ancestry (EA) men who died of other causes, though we cannot discount the possibility of undiagnosed prostate cancer in their prostates (dbGap accession number phs000424.v8.p2). The EA was determined as self-reported non-Hispanic white in their phenotype data. Following the protocols of GTEx consortium [14], we selected genes with expression thresholds of >0.1 TPM (transcript per million) in at least 20% of samples and ≥6 reads in at least 20% of samples, resulting in a set of 26,585 out of 56,200 genes. Expression data were normalized between samples using TMM in edgeR [15] and then further normalized across samples using a quantile normal transform. The genotype data was obtained from whole genome sequencing (WGS). To boost identifications in the downstream association analysis, we filtered single nucleiotide polymorphisms (SNPs) in 1000 Genome and with minor allele frequency <1% in our data. We also removed all indels, monomorphisms, and ambiguous pairs (e.g. A/T, C/G), SNPs with >20% missing genotypes, and SNPs with Hardy–Weinberg equilibrium (HWE) test p-value < 1 × 10−5. To build the model, we included all SNPs located from 1 Mb upstream of the transcriptional start to 1 Mb downstream of the transcriptional end. This includes all intronic and exonic SNPs in the transcribed region plus those intergenic regions within 1 Mb of the gene.
To train the prediction models, we used the Transcriptome-Integrated Genetic Association Resource (TIGAR) [16] approach. TIGAR is a nonparametric method based on Bayesian Dirichlet Process Regression that demonstrated boosted prediction accuracy from parametric methods like PrediXcan. TIGAR is actively maintained and thus can be used to train comparable models in normal, primary tumor, and metastasis tumor samples. Covariates adjusted in model training included genomic principal components (PC) 1–5, PEER factors 1–30, PCR, platform, age, and RIN number. Imputation R2 was generated by 5-fold cross-validation (CV). To assess if the Bayesian models generated by TIGAR predict gene expression better than the elastic net models built with PrediXcan, we also trained models using the same reference genetic/gene expression data and the implementation of the PrediXcan algorithm in TIGAR.
Primary prostate tumors
To investigate the PrCa-survival associated genes expressed in primary tumors, we also built gene expression prediction models using the Cancer Genome Atlas (TCGA) [17] genotype and gene expression data for primary prostate tumors from 373 EA men (dbGap accession number phs000178.v11.p8). We downloaded the raw CEL files for the Affymetrix SNP 6.0 arrays from adjacent normal tissue from the legacy TCGA data portal and called genotypes using the apt-probeset-genotype command line tool from Affymetrix. We removed all indels, monomorphs, ambiguous allele pairs (e.g. A/T, C/G), SNPs with >5% missing genotypes, and HWE test p-value < 1 × 10−5. The remaining SNPs were aligned to build 38 coordinates, and imputation was performed on the TOPMed imputation server [18]. Then we used this processed genotype data to determine the ancestry by performing principal component analysis (PCA) using the FastPCA approximation in EIGENSOFT v6.1.4. From this, we identified a total of 373 men who appear to be of European ancestry for building the gene expression prediction models. We processed the expression data in the same manner as we did for GTEx, resulting in 22,474 out of 55,325 genes. The prediction models were trained using TIGAR for these 22,474 genes using SNPs within ±1 Mb of the gene start and end locations with imputation score >0.7, appearing in 1000 Genome, and with minor allele frequency >1% in our data. Covariates included PCs 1–10 and age.
Metastatic prostate tumors
To investigate the PrCa-survival associated genes expressed in metastatic PrCa, we built genetic prediction models using data from the Genomic Characterization of Metastatic Castration Resistant Prostate Cancer (GCMCRPC) study [19]. This study had WGS data and RNA-seq data from 99 subjects; of these 84 were of self-reported white race. The vast majority of samples come from either lymph node or bone metastases, though some come from other sites [19]. We followed the same protocols as GTEx for preprocessing the WGS and gene expression data, resulting in 28,472 out of 60,483 genes for model training. We trained the TIGAR prediction models for these 28,472 genes using SNPs within ±1 Mb of the gene start and end locations, using PCs 1–10, site of resection or biopsy, and hidden expression covariates 1–10 as covariates. Hidden covariates were calculated following the method by Mostafavi et al. [20], using gene expression data and site of biopsy.
Transcriptome-wide association study using MDC GWAS summary statistics
We had previously reported a GWAS of prostate cancer survival in the MDC cohort [13] which consists of 1053 men with prostate cancer (European ancestry, white), aged 45–73, in Malmö, Sweden. Of these, 245 men have died from prostate cancer after follow-up for 22 years. The summary data contained statistics for 6,246,818 variants for their association with PrCa survival, covering 62.5%, 67.2%, and 60.2% of the SNPs used in gene expression training model of GTEx, TCGA, and GCMCRPC datasets respectively. We calculated linkage disequilibrium (LD) for these SNPs using reference LD genotype covariance file generated separately from GTEx, TCGA and GCMCRPC datasets. The genome segmentation block used for LD calculation was ±1 Mb from the gene, consistent with the gene region used in predictive model training for gene expression. TWAS was conducted separately for normal prostate, primary PrCa tumor, and metastatic PrCa tumor tissues. For each gene in each data, we used TIGAR SNP weights, LD calculated from corresponding dataset, and GWAS summary Z-score statistics from MDC to obtain the test Z-score and p-value for gene expression associations with PrCa-specific survival using the S-PrediXcan strategy [21] as implemented in the TIGAR software. The Z-score statistics from the MDC come from our prior Cox proportional hazards survival analysis of the MDC data [13]. We calculated the false discovery rate (FDR) using the Benjamini–Hochberg procedure [22] and family-wise error rate (FWER) using the Bonferroni correction for the number genes tested in each tissue type.
Validation of TWAS hits in PLCO cohort
The subjects were genotyped on five different microarrays. For each array, only single nucleotide variants with unambiguous strand alignment, a missingness rate <5%, and a Hardy–Weinberg equilibrium p-value > 10–5 were considered. We processed the genotype data with a custom in-house script that converted all data to genome build 38. To do this, we first created a master list of mappings between genome build 37 and genome build 38 by running liftOver repeatedly for each possible chromosome position. We then used this table to map each SNP from build 37 to 38, taking into consideration the portion of the genome whose orientation flipped between the two builds. This step was necessary to ensure that these regions do not fail QC on the imputation server [23]. The resulting VCF files were then imputed to the TopMed imputation server using Eagle phasing [24,25,26]. Data was imputed as one batch per array except for the GSA array, which we split into four random sets due to limitations of the imputation server. To estimate the principal components of genetic ancestry, we used the build 38-aligned files prior to uploading to the imputation server. We merged the data population reference data [27, 28] lifted over to build 38. Variants with missingness less than 0.1% were then LD pruned and principal components were calculated using the “fast” mode of the smartpca program in EIGENSOFT [29]. To run the survival analysis, we extracted only those individuals with a prostate cancer diagnosis and only those imputed variants with r2 > 0.8.
As the majority PrCa patients die from other causes than the cancer, we modeled death of other causes (DoO) separately from death of disease (DoD) using competing risk models. Specifically, we first predicted the genetically regulated gene expression (GReX) for each significant genes with FDR < 10% in MDC, using TIGAR models trained on normal, primary tumor, and metastasis tumor tissues. Next, we leveraged a multi-state R package mstate [30] to perform competing risk modeling. To use the multi-state method, we defined three disease stages, PrCa diagnosis (stage 1), DoO (stage 2), and DoD (stage 3), and their transition probabilities. We required the transition probabilities to be nonzero for staying in the same stage (e.g. stage 1 → stage 1), or transiting from diagnosis to DoO (stage 1 → stage 2), or transiting from diagnosis to DoD (stage 1 → stage 3), but to be zero for the reverse of the transition or transition between stage 2 and 3. Then, we fitted the multi-state model between the imputed GReX and the three disease stages, adjusting for the top 10 genetic PCs, age at diagnosis, aggressive prostate cancer (Gleason ≥ 8), and advanced cancer (T stage ≥ 3 or M stage = 1). This analysis provided log hazard ratios for each gene on DoD. Genes that were significant at a nominal level of p < 0.05 and exhibited the same association direction as in MDC cohort were considered validated.
Colocalization analysis
To test if the same genetic variant is driving associations with gene expression changes and prostate cancer survival, we performed colocalization analysis using the coloc package [31]. This package first fine maps the signal for each trait (gene expression and survival) using Susie [32] and then uses a Bayesian approach to determine the posterior probability that the same variant is driving the association with both traits, taking linkage disequilibrium into account. To perform coloc analysis, for each identified TWAS hit we included all SNPs from the TIGAR expression imputation model and compared the association with gene expression from the training data with the association signal from the MDC cohort.
Association between gene expression and time to biochemical recurrence
To understand the effects of validated genes on the time to PrCa biochemical recurrence, we used measured transcriptomics data from three independent cohorts available in the Cambridge Carcinoma of the Prostate App [33]. These cohorts are referred to by their geographic origin – MSKCC [34], Stockholm, and Cambridge [35]. We performed Cox regression to assess the association between gene expression and time to biochemical recurrence for all genes validated in PLCO, adjusting for Gleason score (≤7 and ≥8), clinical stage (T1, T2 and T3, T4), PSA category (<10, 10–20, >20), and age at diagnosis (not in Stockholm cohort). .We then combined the results for each gene across three cohorts using random effects meta-analysis in R package metafor [36]. As a sensitivity analysis, we altered the categorization of covariates as: Gleason ≤6 vs. Gleason ≥7, T1/2 vs T3/4, PSA < 10 vs. PSA ≥ 10, and age at diagnosis <60 vs. ≥60.
Association between gene expression and tissue types in TCGA data
To understand the functional roles of significant genes, we compared the gene expression levels of all validated genes first between 43 tumor adjacent normal prostate tissues and 375 tumor tissues, and then between 208 low grade primary tumor and 167 high-grade primary tumor tissues in TCGA. Normalization was performed on the combined expression data from both normal and tumor tissues using TMM and then quantile normal transform as done for the TWAS model building. We compared tumors with Gleason score ≤7 to those with Gleason ≥8. Two-sided Student’s t test was used for the comparison. As a sensitivity analysis, we also compared Gleason 6 with Gleason ≥7, and Gleason 6 with Gleason ≥8.

