Unique TEs lie in chromatin variants of HSPC versus mature cells
Because chromatin accessibility enables the identification of chromatin variants that define cell-state identity (Extended Data Fig. 1a)6,7,8,35,36, we systematically studied TE subfamily enrichment within accessible chromatin across normal human hematopoietic populations. For this, we analyzed assay for transposase-accessible chromatin sequencing (ATAC–seq; Methods) data from six highly purified hematopoietic stem and progenitor populations (HSPC), including long-term (LT) and short-term (ST) HSCs, common myeloid progenitors, megakaryocyte-erythroid progenitors, multilymphoid progenitors (MLP) and granulocyte-monocyte progenitors, as well as seven mature myeloid and lymphoid fated populations isolated from human cord blood donors (Extended Data Fig. 1b)7. These populations were isolated using our sorting scheme (Methods), which achieves the highest purity compared to other studies8 and separates HSCs with distinct repopulation and self-renewal potential (~30%) for improved resolution at the apex of the normal hematopoietic hierarchy (Methods).
We quantified enrichment of all TE subfamilies (n = 971, hg38; Methods) within accessible chromatin using ChromVAR (Methods), substituting TE subfamily elements for transcription factor DNA motif positions (Methods) while controlling for GC content. Unsupervised clustering of TE accessibility z scores revealed that LT-, ST-HSCs and later progenitor populations, collectively referred to as HSPC, share similar TE enrichment profiles and differ from mature populations (Fig. 1a). The megakaryocyte-erythroid lineage is the exception, with megakaryocyte-erythroid progenitors clustered with erythroid precursors (Fig. 1a), consistent with strong lineage-specific enrichment of TE subfamilies. Indeed, removing megakaryocyte-erythroid lineage-specific TE subfamilies (Benjamini–Hochberg corrected P value (q) < 0.0001; Extended Data Fig. 1c) before clustering yielded a clear HSPC versus mature populations separation (Extended Data Fig. 1d). Thus, chromatin accessibility at specific TE subfamilies distinguishes HSPC from mature hematopoietic cells.
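The idea of an accessibility z score for one TE subfamily in one sample can be illustrated with a deliberately simplified sketch. It assumes the score is a standardized deviation of an observed accessibility statistic from the same statistic computed over GC-matched background peak sets. ChromVAR's actual bias-corrected deviation is more involved, and the function name and all numbers below are hypothetical.

```python
from statistics import mean, stdev


def deviation_z(observed, background):
    """Standardize an observed value against matched background draws.

    `observed` is a hypothetical accessibility statistic for one TE
    subfamily in one sample; `background` holds the same statistic
    computed on GC-matched background peak sets (an assumption, not the
    authors' exact procedure).
    """
    if len(background) < 2:
        raise ValueError("need at least two background draws")
    sd = stdev(background)
    if sd == 0:
        raise ValueError("background has zero variance")
    return (observed - mean(background)) / sd


# Hypothetical values for illustration only.
observed = 0.042
background = [0.030, 0.033, 0.029, 0.031, 0.032]
z = deviation_z(observed, background)
print(f"accessibility z score: {z:.2f}")
```

A positive score indicates that elements of the subfamily are more accessible than expected from matched background regions; a negative score indicates depletion.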
Fig. 1: Distinct TE subfamilies populate the accessible chromatin of primitive and mature hematopoietic cells.
a, Heatmap displaying differential accessibility z scores of all TE subfamilies across bulk ATAC–seq profiles of HSC, progenitor and mature hematopoietic populations; note that HSC and progenitors show similar TE subfamily enrichment profiles. b, Association of TE families enriched in accessible chromatin across HSC and progenitor (‘HSPC’) versus mature cells. Y axis lists TE families with enriched TE subfamilies (each dot denotes a different TE subfamily) and x axis presents the median z score difference between HSPC and mature cells (left panel). Number of enriched TE subfamilies in HSPC or mature cells, across TE families (right). TE families are ordered from the most common in HSPC to mature cells. c, Examples of TE subfamilies differentially enriched in accessible chromatin in HSPC and mature cells. Boxplots show differential accessibility z scores in hematopoietic stem, progenitor and mature populations. Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. The TE subfamilies shown here were found differentially enriched between HSPC versus mature hematopoietic populations (Benjamini–Hochberg-corrected two-sided Wilcoxon signed-rank test, q < 0.01). d, Genome browser view of ATAC–seq signal from hematopoietic populations for one element of each HSPC-enriched TE subfamily shown in c. e, Transcription factor cistromes enrichment at 2 of 23 TE subfamilies enriched in stem populations (HSCs). Every dot corresponds to a transcription factor cistrome. Red dashed lines correspond to −log10(q) = 1.3 (q = 0.05) and log(OR) = 1 thresholds. The q values correspond to FDR-corrected Fisher’s exact test. f, Individual transcription factors (left), including HOXA9, or three-dimensional genome organization factor (right) cistromes enriched over HSC-specific TE subfamilies (according to e). Boxplots showing enrichment GIGGLE scores of individual cistromes profiled in cell lines derived from a variety of tissue of origin. 
Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. Cistromes showcased here were found significantly enriched over the TE subfamilies of interest (two-tailed Fisher’s exact test, P < 0.05). g, Enrichment of HOXA9 and CTCF DNA motifs within TE subfamilies enriched in HSCs. Bars represent −log10(q) for each TE subfamily for transcription factor DNA motifs. The red dashed line corresponds to the −log10(q) = 1.3 (q = 0.05) threshold. The q values correspond to Benjamini–Hochberg-corrected P values from a one-sided hypergeometric test. CMP, common myeloid progenitors; MEP, megakaryocyte-erythroid progenitors; MLP, multilymphoid progenitors; GMP, granulocyte-monocyte progenitors; IQR, interquartile range; NS, not significant.
To identify TE subfamilies differentially enriched between HSPC and mature populations, we compared TE accessibility z scores using a two-sided Wilcoxon signed-rank test. We found 81 and 128 TE subfamilies belonging to 12 families enriched in HSPC and mature populations, respectively (q < 0.01; Fig. 1b, Extended Data Fig. 1e and Supplementary Tables 1–3). HSPCs were mainly enriched for subfamilies of the ERV1 and ERV3 families (Fig. 1b), including MER61A (q = 1.10 × 10−5), LTR39 (q = 1.87 × 10−6), LTR16E1 (q = 4.61 × 10−7), LTR33 (q = 9.05 × 10−7) and LTR50 (q = 4.22 × 10−5; Fig. 1c and Extended Data Fig. 1f–l), exemplified at unique elements (Fig. 1d). This aligns with previous studies of regulatory activity for LTR16E1 and LTR33 in pluripotent stem cells37. In contrast, mature populations were mainly enriched for SINE1, SINE2 and transposon subfamily members (Fig. 1b), such as AluJB (q = 1.26 × 10−7), AluSx (q = 1.08 × 10−7), MIR (q = 2.93 × 10−7) and MIR3 (q = 2.93 × 10−7; Fig. 1c and Extended Data Fig. 1f). Although mature populations had similar numbers of accessible DNA elements (peaks; Extended Data Fig. 1m), the enrichment of SINE1, SINE2 and transposon TE subfamilies relied on population-specific accessible elements. By comparison, HSPC shared accessibility at the same ERV1 and ERV3 subfamily elements across stem and progenitor populations. Together, our study reveals distinct TE subfamily accessibility patterns in HSPCs versus mature populations, supporting a role for TE-associated chromatin variants in hematopoietic cell-state identity.
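The q values reported above are Benjamini–Hochberg-corrected P values. As a minimal sketch of that correction (the standard step-up procedure, written here in plain Python; the input P values are hypothetical and not taken from the study):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (q values).

    Each P value is scaled by m / rank, and monotonicity is enforced by
    taking a running minimum from the largest P value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical P values, one per TE subfamily.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.01 for q in qvals]
print(qvals, significant)
```

With a threshold of q < 0.01, only subfamilies whose adjusted value falls below 0.01 would be called differentially enriched.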
Accessible TEs are docking sites for stem factors in HSPC
We next assessed if chromatin variants over TE subfamilies could distinguish LT-HSCs (the most primitive HSC subset) from other hematopoietic populations (Extended Data Fig. 1b). One-to-one comparison of TE subfamily accessibility z scores, such as LT-HSCs versus ST-HSCs, did not identify differentially-enriched TE subfamilies (Supplementary Table 4). We next compared stem versus mature populations, revealing 61 TE subfamilies enriched in stem versus 119 in mature populations (q < 0.01; Extended Data Fig. 2a and Supplementary Table 5). Progenitor versus mature populations identified 56 TE subfamilies enriched in progenitor and 89 in mature populations (q < 0.01; Supplementary Table 6). Among TE subfamilies distinguishing HSPC from mature populations (Fig. 1a), 23 were specific to stem populations (Extended Data Fig. 2b,c). Of these, only LTR39-int, LTR85a, MER61D, MLT1E2, HERV4_I-int, and UCON26 were enriched in stem relative to progenitor populations (q < 0.05; Extended Data Fig. 2d and Supplementary Table 7). Notably, MER4B-int, MER57B1, MER57B2, MER57-int, LTR16A2, LTR67B, and LTR85a showed the highest z scores within LT/HSPC and/or Act/HSPC chromatin signatures that discriminate between LT-HSC, ST-HSC and other progenitors7 (Supplementary Table 8). Collectively, chromatin variants over specific TE subfamilies separate HSPC from mature populations, with 23 being stem specific.
Because TE subfamilies harbor DNA motifs for transcription factors30,31,38,39,40,41, we used the ReMap atlas of 485 cistromes (Methods) to assess whether the 23 stem-specific TE subfamilies could serve as docking sites. HOXA9 binding to chromatin was enriched over MER57B1 (q = 2.88 × 10−13, log(OR) = 27.69) and MER57B2 (q = 0.0007, log(OR) = 11.95) elements (Fig. 1e and Extended Data Fig. 2e). HOXA9 is a canonical HSC factor42 and is more highly expressed in stem than in mature populations (Extended Data Fig. 2f). The cistrome of RUNX1, another known HSC transcription factor43, was enriched over MLT1E2 elements (q = 0.03, log(OR) = 2.12; Extended Data Fig. 2g), although RUNX1 was expressed at similar levels in stem and mature populations (Extended Data Fig. 2h). While many stem-specific TE subfamilies showed no enrichment for any of the 485 cistromes in ReMap (HERV4_I-int, HERVK9-int, LTR80A, LTR85a, LTR9A1, MamGypLTR1b, MER127, MER4B-int, MER57-int, MER83B-int, MLT2E and PRIMA41-int), CTCF and cohesin complex factors (SMC1A, SMC3 and RAD21), known regulators of genome topology44, were enriched over MamGypsy2-LTR and LTR16A2 elements (Fig. 1e, Extended Data Fig. 2i and Supplementary Table 9). We independently validated these observations using GIGGLE enrichment scores45, confirming HOXA9 enrichment over MER57B1 and MER57B2 (Fig. 1f and Extended Data Fig. 2j), RUNX1 enrichment over MLT1E2 elements (Extended Data Fig. 2k) and CTCF/cohesin enrichment over MamGypsy2-LTR and LTR16A2 elements (Fig. 1f and Extended Data Fig. 2l). Sequence analysis further supported these findings by identifying enrichment of HOXA9 or CTCF DNA motifs within MER57B1 and MamGypsy2-LTR elements, respectively (Fig. 1g). Collectively, these results suggest that TE subfamilies uniquely accessible in stem populations provide docking sites for ‘stem’ transcription factors and regulators of genome topology.
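The cistrome enrichments above are summarized by an odds ratio and a Fisher's exact test. The following sketch shows how both can be computed from a 2×2 overlap table. The table layout, the natural-log base, the 0.5 continuity correction and the counts are all assumptions made for illustration; the study's exact construction is described in its Methods, which are not reproduced here.

```python
from math import comb, log


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def pmf(k):
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    p = sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def log_odds_ratio(a, b, c, d):
    """Natural-log odds ratio with a 0.5 correction to avoid division by zero."""
    return log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))


# Hypothetical counts: rows = elements of one TE subfamily vs background
# elements; columns = overlapping vs not overlapping a cistrome's peaks.
a, b, c, d = 30, 70, 10, 90
p_value = fisher_exact_two_sided(a, b, c, d)
log_or = log_odds_ratio(a, b, c, d)
print(f"P = {p_value:.2e}, log(OR) = {log_or:.2f}")
```

In a screen over hundreds of cistromes and many subfamilies, the resulting P values would then be corrected for multiple testing before thresholding at q = 0.05.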
LSCs show HSPC-like TE enrichment in accessible chromatin
Considering the role of LSCs in AML, we tested whether chromatin variants over TE subfamilies could discriminate LSC+ from LSC− leukemic fractions, and from normal HSPC and mature hematopoietic populations. Uncultured AML patient samples (n = 15) were sorted into fractions based on CD34 and CD38 expression, and fractions were functionally classified as LSC+ (n = 11) or LSC− (n = 24) based on leukemic engraftment upon xenotransplantation into immunodeficient mice15,46. In parallel, we performed ATAC–seq on these fractions and measured TE subfamily enrichment within accessible chromatin. Unsupervised clustering of TE accessibility z scores across AML fractions and normal hematopoietic populations grouped all LSC+ fractions with HSPCs (Fig. 2a), whereas most LSC− fractions (17/24) clustered with mature hematopoietic populations (Fig. 2a). Seven LSC− fractions clustered with LSC+ (Fig. 2a,b), potentially reflecting low LSC frequency within these fractions. Notably, clustering based on genome-wide ATAC–seq peaks, including nonrepetitive sequences, did not reproduce these observations (Extended Data Fig. 3a,b), suggesting that TE-centered chromatin variants provide a nuanced classification of AML biology related to stemness.
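Clustering samples by their TE accessibility profiles depends on a pairwise distance. A minimal sketch of a correlation-based distance (1 minus the Pearson correlation) is shown below. The sample names and z scores are hypothetical, and the linkage method used for the published heatmaps is not specified in this section.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("profiles must not be constant")
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """Distance in [0, 2]; 0 means perfectly correlated profiles."""
    return 1.0 - pearson(x, y)


# Hypothetical z scores over four TE subfamilies per sample.
profiles = {
    "HSPC_like": [2.0, 1.5, -1.0, -2.0],
    "LSC_pos": [1.8, 1.2, -0.8, -1.9],
    "mature_like": [-1.5, -1.0, 1.2, 2.1],
}
d_close = correlation_distance(profiles["LSC_pos"], profiles["HSPC_like"])
d_far = correlation_distance(profiles["LSC_pos"], profiles["mature_like"])
print(f"LSC_pos vs HSPC_like: {d_close:.3f}; vs mature_like: {d_far:.3f}")
```

A hierarchical clustering routine fed with these distances would place the hypothetical LSC+ profile next to the HSPC-like profile and far from the mature-like one.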
Fig. 2: Leukemia stem cells containing fractions cluster with HSPCs and show distinct and common TE subfamily enrichment.
a, Heatmap of accessibility z scores of all TE subfamilies (rows) across bulk ATAC–seq profiles of HSPC, mature hematopoietic populations, LSC+ and LSC− fractions clustered based on correlation. LSC+ fractions clustered with HSPC, while LSC− fractions mostly clustered with mature populations. b, Heatmap of TE subfamilies with differential accessibility z scores (rows, q < 0.01) between ATAC–seq profiles of LSC+ (n = 11) and LSC− fractions (n = 24). c, Enrichment of TE families across LSC+ versus LSC− samples. The y axis shows all TE families; each dot denotes a TE subfamily. SINE1, SINE2 and transposons families are enriched in LSC− samples, whereas ERV1 and ERV3 families are enriched in LSC+ samples (left). Number of enriched TE subfamilies in LSC+ or LSC− samples, groups based on TE family (right). TE families are ordered from the most common in LSC+ versus LSC−. d, Comparison of the proportion of TE subfamilies enriched in HSPC and hematopoietic mature populations in LSC− and LSC+ fractions grouped and color-coded according to their TE family classification. The overall distribution of TE families within the hg38 build of the human genome (Genomic) is also shown as a reference. e, UpSet plot showing the intersection of TE subfamilies enriched in accessible chromatin across normal and leukemic populations, including between LSC+/HSPC (21 subfamilies) and LSC−/mature hematopoietic populations (53 subfamilies). Colors within bar graphs denote the grouping of TE subfamilies according to their TE families classification, as annotated in d.
Direct comparison of TE accessibility z scores between LSC+ and LSC− fractions revealed 44 subfamilies enriched in LSC+ and 77 in LSC− fractions (Fig. 2b,c, Extended Data Fig. 3c,d and Supplementary Tables 10–12). LSC+ fractions were mostly enriched for ERV1 and ERV3 related subfamilies, whereas LSC− fractions showed enrichment for SINE1, SINE2 and transposon related subfamilies (Fig. 2c). Comparing AML fractions to normal hematopoietic populations showed that the proportions of TE subfamilies discriminating HSPC versus mature and LSC+ versus LSC− fractions were skewed relative to genome-wide TE composition (Fig. 2d). Specifically, ERV1 and ERV3 related subfamilies were preferentially accessible in HSPC and LSC+, while accessible SINE1 and transposons related subfamilies were overrepresented in mature and LSC− populations (Fig. 2d). Most accessible TE subfamilies in LSC+ or LSC− fractions were shared with HSPC (21/44) or mature populations (53/77), respectively (Fig. 2e and Extended Data Fig. 4a,b). Shared subfamilies between LSC+ and HSPC were largely from the ERV1 and ERV3 families (14/21 subfamilies; Fig. 2e and Extended Data Fig. 4c,d), whereas subfamilies shared between LSC− fractions and mature populations were mainly SINE1 and transposons related (Fig. 2e and Extended Data Fig. 4c,d). Although LINE1 subfamilies did not distinguish LSC+ from LSC−, those accessible in HSPC versus mature populations were even more enriched in LSC+ than in HSPC (Extended Data Fig. 4d), in line with diverse LINE1 functions in normal tissues and cancers47, including regulating genome topology through L1M3f48, and gene expression regulatory functions through young LINE1 subfamilies (for example, L1PA5 and L1PA7) in colon and breast cancer49,50,51. Enrichment levels among shared subfamilies were largely concordant between HSPC/LSC+ and mature/LSC− (Extended Data Fig. 4d). 
Two TE subfamilies accessible in LSC+, namely LTR12C (ERV1 family) and L1PA7 (LINE1 family), showed the largest discordance with HSPC (Extended Data Fig. 4d). Overall, our data suggest that LSCs can be distinguished from committed leukemic cells based on accessible TE enrichment patterns that are largely shared with normal HSPC populations.
Accessible TE signature scores stemness and stratify AML risk
Stemness is related to survival and relapse in AML and other cancers52,53. We hypothesized that TE subfamilies differentially accessible in LSC+ and LSC− fractions (44 in LSC+ and 77 in LSC−; Fig. 2b,c), hereafter the LSCTE121 signature, could score TE-associated stemness in AML and be related to clinical outcomes. Thus, we performed ATAC–seq on peripheral blood from three independent AML cohorts (cohort1, n = 29; cohort2, n = 60; cohort3, n = 77; Supplementary Table 13). Unsupervised clustering using the LSCTE121 signature grouped subsets of patients with the LSC− fractions (cohort1, 9/29; cohort2, 20/60; cohort3, 36/77) while the remaining patients clustered with LSC+ fractions (cohort1, 20/29; cohort2, 40/60; cohort3, 41/77; Extended Data Fig. 5a–c). This AML patient classification into LSC-like versus non-LSC-like profiles mirrors recent gene expression-based patient stratification54,55. We next computed a single LSCTE121 z score by combining enrichment of LSC+ TE subfamilies and depletion of LSC− TE subfamilies using Stouffer’s method (Methods). This score is intended to measure LSC-like stemness properties rather than serve as a clinical biomarker. In all three cohorts, the dendrogram branch containing LSC+ fractions showed the highest LSCTE121 z scores, and LSC+ fractions scored higher than LSC− fractions (Extended Data Fig. 5a–d). LSCTE121 z score did not correlate with blast count (cohort1: R = 0.031, P = 0.88; cohort2: R = −0.17, P = 0.21; cohort3: R = 0.025, P = 0.83; Extended Data Fig. 5e–g), suggesting that the score reflected stemness rather than cellularity. In cohort1, Kaplan–Meier analysis comparing the top versus bottom quartile of LSCTE121 z score showed significant differences in disease-free and overall survival (Fig. 3a–c and Extended Data Fig. 5h). We also compared LSCTE121 with LSC17 (ref. 15). Sample classification differed between the two approaches (Fig. 3a,d). LSC17 best predicted overall survival (Extended Data Fig. 5i; log-rank, P = 2 × 10−4) and LSCTE121 showed low correlation with LSC17 (R = 0.27, P = 0.16; Fig. 3d). Mechanistically, TE accessibility near LSC17 genes supported their distinction. Among 1646 TE elements within ±2 kb of the 17 LSC17 genes, 811 (49%) belonged to TE subfamilies enriched in LSC− fractions, whereas 33 (2%) belonged to LSC+ enriched TE subfamilies (Supplementary Table 14). In agreement, ATAC–seq signal over these nearby TE elements was higher in LSC− than in LSC+ fractions (Extended Data Fig. 5j), was not elevated relative to TE elements surrounding other genes (Extended Data Fig. 5k), and did not differ from 100 random, size-matched 17-gene sets in LSC+ fractions (Extended Data Fig. 5l). Next, we assessed accessibility at transcription start sites (TSS) in cohort1 patients stratified based on LSC17 (high = 12, low = 17) or LSCTE121 (high = 14, low = 15), and performed gene set enrichment analysis to identify pathways linked to increased TSS accessibility in LSC17-high or LSCTE121-high patients. We identified 311 pathways with higher TSS accessibility in LSC17-high patients (Extended Data Fig. 5m and Supplementary Tables 15 and 16), with cell cycle-related pathways preferentially detected in LSC17-high versus LSCTE121-high patients (Fig. 3e,f and Extended Data Fig. 5n). In contrast, LSCTE121-high patients were associated with interleukin 10 signaling (Fig. 3g,h and Extended Data Fig. 5o), a pathway known to promote LSC stemness56. Concordantly, LSCTE121 z scores were highest in fractions with elevated limiting dilution assay-defined LSC frequency (Fig. 3i and Supplementary Table 17). Disease-free survival differences observed with LSCTE121 were also observed in cohort2 and cohort3 AML patients (Fig. 3j–o and Extended Data Fig. 5p,q). Together, our results suggest that LSCTE121 captures stemness biology distinct from that of LSC17, and that each provides AML risk stratification based on different stemness discriminators.
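The combination of subfamily-level z scores into a single LSCTE121 z score can be sketched with the unweighted form of Stouffer's method, Z = Σ z_i / √k. The sign flip for LSC− subfamilies (so that their depletion raises the score) and the absence of weights are assumptions made for illustration, as are the function names and example values; the exact formulation is given in the study's Methods.

```python
from math import sqrt


def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z scores over sqrt(k)."""
    if not z_scores:
        raise ValueError("need at least one z score")
    return sum(z_scores) / sqrt(len(z_scores))


def signature_score(lsc_pos_z, lsc_neg_z):
    """Combine enrichment of LSC+ subfamilies with depletion of LSC- ones.

    z scores of LSC- subfamilies are negated so that low accessibility at
    those subfamilies contributes positively (an assumed sign convention).
    """
    combined = list(lsc_pos_z) + [-z for z in lsc_neg_z]
    return stouffer_z(combined)


# Hypothetical per-subfamily accessibility z scores for one sample.
score = signature_score(lsc_pos_z=[1.2, 0.8, 2.0], lsc_neg_z=[-1.0])
print(f"combined signature z score: {score:.2f}")
```

Under this convention, a sample with accessible LSC+ subfamilies and inaccessible LSC− subfamilies receives a high combined score, while the reverse pattern yields a negative score.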
Fig. 3: The LSCTE121 signature stratifies AML patients with distinct disease-free intervals.
a, Heatmap showing differential accessibility z scores for LSCTE121 (q < 0.01 LSC+ versus LSC−) in bulk cohort1 AML tumors. Please note that bulk tumors do not cluster according to their LSC17 scores with LSC+ or LSC− fractions. b, Boxplot displaying the LSCTE121 z score in cohort1 patients (n = 29). Patients were divided into LSCTE121-high or low based on the heatmaps in a. Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. P-value results of the two-sided Wilcoxon test are showcased on the boxplots. c, Kaplan–Meier estimates using disease-free interval as a clinical endpoint for LSCTE121 z score top versus bottom 25 percentile in cohort1 (n = 8 per group). The log-rank (Mantel–Cox) P values are shown on the Kaplan–Meier curves. d, Pearson correlation between LSC17 score and LSCTE121 z score in cohort1 bulk AML patients. The linear regression line is shown in blue and the confidence interval is shown in gray. P-value results of null-hypothesis testing are showcased on the plots. e, GSEA enrichment plots showcasing significantly increased chromatin accessibility at cell cycle gene promoters (TSS) in LSC17-high patients (cohort1). P-value results of two-sided weighted Kolmogorov–Smirnov test are showcased on the plots. f, Volcano plot showing the log2 fold change chromatin accessibility for each cell cycle gene promoter (TSS) between LSC17-high and LSC17-low patients (cohort1) versus the −log10(P), resulting from the Wald test, for that fold change. g, GSEA enrichment plots showcasing significant increased chromatin accessibility at IL10 signaling gene promoters (TSS) in LSCTE121-high patients (cohort1). P values results of weighted Kolmogorov–Smirnov test are showcased on the plots. h, Volcano plot showing the log2 fold change chromatin accessibility for each IL10 signaling gene promoter (TSS) between LSCTE121-high and LSCTE121-low patients (cohort1) versus the −log10, resulting from the Wald test, P values for that fold change. 
i, Boxplot of LSCTE121 z score in LSC fractions (n = 35; high, n = 1; med, n = 6; low, n = 4; NE, n = 24). Please note that engrafting fractions show a higher LSCTE121 z score than nonengrafting fractions. Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. P values results of two-sided paired Wilcoxon test are showcased. j, Same as a, but for cohort2 patients (LSCTE121-high, n = 15; LSCTE121-low, n = 45). k, Same as b, but for cohort2 patients (LSCTE121-high, n = 15; LSCTE121-low, n = 45). Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. P-value results of the two-sided Wilcoxon test are shown in the boxplots. l, Same as c, but for cohort2 patients (LSCTE121-high, n = 15; LSCTE121-low, n = 45). The log-rank (Mantel–Cox) P values are shown on the Kaplan–Meier curves. m, Same as a, but for cohort3 patients (LSCTE121-high, n = 35; LSCTE121-low, n = 42). n, Same as b, but for cohort3 patients (LSCTE121-high, n = 35; LSCTE121-low, n = 42). Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. P-value results of the two-sided Wilcoxon test are showcased on the boxplots. o, Same as c, but for cohort3 patients (LSCTE121-high, n = 35; LSCTE121-low, n = 42). The log-rank (Mantel–Cox) P values are shown on the Kaplan–Meier curves. GSEA, gene set enrichment analysis.
LSC-accessible TEs dock AML essential factors
Using the ReMap cistrome data, we asked whether accessible TE subfamilies shared between HSPC and LSC+, or unique to either group (Fig. 2e), could provide docking sites for transcription factors. Some TE subfamilies showed enrichment for many cistromes, while others showed few (Extended Data Fig. 5), aligning with the role of certain TE subfamilies as transcription factor ‘hubs’30,34,57,58. ERG, RUNX1, LMO2 and TRIM28 cistromes were consistently enriched across accessible TEs in HSPC and LSC+ (Fig. 4a,b, Extended Data Fig. 6a,b and Supplementary Tables 18–20). These transcription factors contribute to normal hematopoiesis and/or leukemia43,59,60,61. Beyond roles in erythroblast differentiation60, TRIM28 also suppresses TE function62, suggesting involvement in HSPC and LSC+ fate commitment. The interplay between ERG and RUNX1 in hematopoiesis and AML63,64 is reflected by their enrichment over >70% of the same accessible TE subfamilies (10/15 HSPC-specific, 8/10 LSC+ specific, 6/8 shared; Extended Data Fig. 6c), including LTR78 (ERG: q = 0.005, log(OR) = 1.74; RUNX1: q = 0.0004, log(OR) = 1.9), LTR67B (ERG: q = 1.12 × 10−8, log(OR) = 2.73; RUNX1: q = 5.77 × 10−17, log(OR) = 4.09) and MLT1E2 (ERG: q = 0.0003, log(OR) = 2.02; RUNX1: q = 0.009, log(OR) = 1.73; Extended Data Fig. 6d and Supplementary Tables 18–20). Over half of the TE subfamilies enriched for ERG and/or RUNX1 harbored their DNA motifs (8/25 for ERG motif and 11/32 for RUNX1 motif; Extended Data Fig. 6e,f). ERG, RUNX1, LMO2 and TRIM28 were more highly expressed in HSPC than in mature populations (Extended Data Fig. 6g). Collectively, these results suggest that HSPC and LSC+ share a core TE-based regulatory layer driven by these four transcription factors.
Fig. 4: TE subfamilies enriched in accessible chromatin in LSC provide binding sites to transcription factors essential for AML.
a, Frequency plot of transcription factor cistromes enriched at TE subfamilies in accessible chromatin in LSC+ fractions. The top 5% most frequently enriched transcription factor cistromes are shown. b, UpSet plot showing the intersection of transcription factor cistromes enriched at accessible TE subfamilies in LSC+ and HSPC populations. c, Enrichment GIGGLE score for individual LYL1, NFYA and NFYB cistromes profiled in cell lines derived from various tissue states across TE subfamilies specifically accessible in LSC+. Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. Cistromes showcased here were found significantly enriched over the TE subfamilies of interest (two-tailed Fisher’s exact test, P < 0.05). d, Enrichment of NFY DNA motifs within TE subfamilies enriched in the accessible chromatin of LSC+. The red dashed line corresponds to the −log10(q) = 1.3 (q = 0.05) threshold. The q values correspond to Benjamini–Hochberg-corrected P values from a one-sided hypergeometric test. e, Overview of essentiality scores of LSC+ specific transcription factors across all cancer types available in DepMap with more than five cell lines, based on CRISPR data. The distribution of the essentiality scores of each transcription factor calculated in AML cell lines was compared to the distribution of the same transcription factor calculated in cell lines from every other cancer type. The inner color of each rectangle corresponds to the median essentiality score, while the border color corresponds to the Benjamini–Hochberg-corrected P value (adjusted P value) of a two-sided pairwise t test.
Cistrome enrichment over accessible TEs also identified 11 transcription factors specific to HSPC, namely HIF1A, SETDB1, GATA1, ZNF143, TEAD4, PBX2, EZH2, CBX3, SMC3, MAX and HDAC2 (Fig. 4b and Extended Data Fig. 6a). Although CBX3 and MAX have not been linked to hematopoiesis, both were more highly expressed in HSPC than in mature populations (Extended Data Fig. 7a). PBX2 is dispensable for normal hematopoiesis or immune response65, TEAD4 favors embryonic-to-blood commitment66, whereas GATA1, HIF1A, SETDB1, EZH2 and HDAC2 regulate HSC self-renewal and/or differentiation67,68,69,70,71. ZNF143 and SMC3 together with CTCF regulate 3D genome organization72,73,74. In HSPC, the cistromes of these factors, and those of known partners (SMC1A, RAD21 and CHD8)72,75,76, were enriched over LTR41B elements (q ≤ 3.22 × 10−34, OR ≤ 9.36; Extended Data Fig. 6d and Supplementary Table 18), aligning with a known role for LTR41B in genome topology77. DNA motif enrichment at some of these TEs (Extended Data Fig. 7b) is consistent with the role of CTCF as a gatekeeper of stemness in hematopoietic populations7. GATA1 and HDAC2 were expressed at higher levels in HSPC than in mature populations (Extended Data Fig. 7a), consistent with their role in normal hematopoiesis70,71. Together, these results support the role of HSPC-accessible TE subfamilies as docking sites for HSC regulators and genome topology factors.
Finally, the cistromes of LYL1, NFYA, NFYB and POU5F1 were enriched over LSC-specific accessible TEs (Fig. 4a,b and Supplementary Table 19). GIGGLE analysis showed a biased enrichment towards cistromes generated in blood or bone marrow samples for LYL1, NFYA and NFYB (Fig. 4c). These cistromes were not enriched over ATAC–seq peaks outside repeats in LSC+ (Extended Data Fig. 7c), suggesting that TE accessibility drives their enrichment. NFYA/NFYB DNA motifs, which are known to be prevalent within LTR12 subfamilies34,78,79, were significantly enriched within cistrome-enriched TE subfamilies (Fig. 4d). To test functional relevance, we examined DepMap CRISPR knockout essentiality screen results (Methods), which revealed negative essentiality scores for all four transcription factors in most AML cell lines (12/23; Extended Data Fig. 7d). Across 23 AML cell lines, LYL1, NFYA and NFYB were consistently essential (P < 0.05) whereas POU5F1 was not more essential than other genes (Extended Data Fig. 7d). This essentiality was not detected in the DepMap RNAi screen (Extended Data Fig. 7e). Across cancer types, LYL1 was significantly more essential in AML than in other cancers (CRISPR screen: 97%, 34/35; RNAi screen: 75%, 17/21) (Fig. 4e, Extended Data Fig. 7f and Supplementary Tables 21 and 22), aligning with its pro-oncogenic role in AML80. NFYA and NFYB were broadly essential across cancers (Fig. 4e and Extended Data Fig. 7f). Overall, our results suggest that LSC-specific TE accessibility defines docking sites for transcription factors, including LYL1, NFYA and NFYB, that are essential in AML.
LSC-accessible TEs are essential for stemness
We undertook functional studies to assess how TEs contribute to stemness using the OCI-AML22 model, which harbors functionally assessed LSC+ and LSC− fractions that can be enriched by FACS (CD34+CD38−, very high LSC frequency; CD34+CD38+, very low LSC frequency; CD34−CD38+/−, LSC−)46. Scoring for the LSCTE121 signature on ATAC–seq from each OCI-AML22 fraction revealed that the LSC+ fraction of OCI-AML22 cells clustered with LSC+ AML fractions (Fig. 5a), whereas the OCI-AML22 LSC− fraction46 clustered with most LSC− AML fractions (Fig. 5a). Thus, OCI-AML22 recapitulates TE accessibility heterogeneity seen in primary AML. In agreement, the CD34+/CD38− fractions showed a higher LSCTE121 z score than CD34+/CD38+ fractions (Extended Data Fig. 8a), supporting this score as a readout of stemness potential.
Fig. 5: Accessibility at LTR12C elements is essential for LSC stemness properties.
a, Heatmap showing unsupervised clustering from differential accessibility z scores for TE subfamilies accessible across LSC populations (q < 0.01 LSC+ versus LSC−), OCI-AML22 fractions. b, Boxplot displaying the LTR12C accessibility z scores in LSC+ (n = 11) and LSC− (n = 24). Boxplots show the median and upper and lower quartiles; whiskers represent 1.5× IQR. The q value corresponding to the Benjamini–Hochberg corrected adjusted two-sided Wilcoxon signed-rank test is shown on the boxplot. c, ATAC–seq signal profile across all LTR12C elements accessible in LSC+ and/or LSC− fractions. d, Violin plots for the ATAC–seq signal over LTR12C elements accessible in LSC+ or LSC−. P values results of two-sided paired Wilcoxon test are showcased. e, Same as panel c but for OCI-AML22 fractions, OCI-AML3 and MOLM13. f, Same as d, but for OCI-AML22 fractions, OCI-AML3 and MOLM13. P-value results of two-sided paired Wilcoxon test are showcased. g, Violin plots showcasing H3K27ac signal distribution in OCI-AML3 stably expressing both CRISPRi and scramble (control in gray) or LTR12C (purple) guide RNA combos, over LTR12C (target) or LTR12D (negative control). Every dot represents an individual TE element. Please note that the loss of H3K27ac signal is specific to LTR12C elements only. P-value results of two-sided Wilcoxon test are showcased. h, Same as g, but for H3K9me3 signal. P-value results of two-sided Wilcoxon test are showcased. i, Direct H3K27ac and H3K9me3 signal comparison over LTR12C elements. Axes represent log2(fold change) H3K27ac signal (x axis) or H3K9me3 signal (y axis) between CRISPRi OCI-AML3 expressing LTR12C guide RNA combo and CRISPRi OCI-AML3 expressing scramble guide RNA combo. Every dot is an LTR12C element. Please note that the majority of LTR12C elements (purple dots) showcase loss of H3K27ac signal (−log2(fold change)) and gain of H3K9me3 signal (+log2(fold change)). 
j, Growth curves of CRISPRi OCI-AML3 cells expressing either scramble (control) or LTR12C guide RNA combos. Results are shown as means ± s.d. from three independent transductions with the vector driving expression of either scramble (control) guide RNAs or guide RNAs targeting LTR12C elements. P values from a two-sided Student’s t test are shown on the plot. k, Violin plot of the chromatin accessibility signal distribution in OCI-AML22-LSC+ stably expressing CRISPRi together with scramble (control, gray) or LTR12C (purple) guide RNA combos, over LTR12C (target) or LTR12D (negative control) elements. A representative biological replicate of three is shown. Each dot represents an individual TE element. P values from a two-sided paired Wilcoxon test are shown on the violin plot. l, Percentage of CD34+/CD38−, CD34+/CD38+, CD34−/CD38+ and CD34−/CD38− cells upon CRISPRi-mediated chromatin editing at LTR12C elements in OCI-AML22-LSC+. Results are shown as means ± s.d. from three independent transductions with the vector driving expression of either five scramble (control) guide RNAs or five guide RNAs targeting LTR12C elements. The P value was generated by a two-sided t test.
To directly test whether TE subfamilies are required for stemness, we used chromatin editing to repress accessibility across elements from an individual TE subfamily. We focused on LTR12C, which, with L1PA7, exhibited strong differential accessibility between LSC+ and HSPC (Extended Data Fig. 4d). LTR12C and L1PA7 were more accessible in LSC+ than LSC− fractions in terms of ATAC–seq signal and absolute number of accessible elements (Fig. 5b–d and Extended Data Fig. 8b–f). Genes within 10 kb of the most accessible LTR12C and L1PA7 elements in LSC+ included known stemness-associated genes, such as TRIM71 (ref. 81) and PAK3 (ref. 82; Extended Data Fig. 8b,e and Supplementary Table 23). In OCI-AML22, LTR12C accessibility was highest in CD34+/CD38− LSC+ fractions and decreased as cells lost stemness across the cellular hierarchy, whereas L1PA7 did not show the same pattern (Fig. 5e,f and Extended Data Fig. 8g,h). Although commonly used AML cell lines (OCI-AML3 and MOLM13) showed low overall LTR12C accessibility, a subset of LTR12C elements was accessible (Fig. 5e,f and Extended Data Fig. 8i). We therefore first tested LTR12C function in OCI-AML3 by repressing accessibility at LTR12C elements using CRISPR/dCas9-KRAB (CRISPRi). Using Repguide83, we designed a pool of guide RNAs targeting the 2,765 LTR12C elements (predicted to hit 2,400 copies, 87%) and a scrambled guide control pool (Supplementary Table 24). In OCI-AML3 CRISPRi cells34, ATAC–seq and CUT&RUN for H3K27ac and H3K9me3 revealed a strong and specific reduction of chromatin accessibility over LTR12C elements (P = 9.8 × 10−16) with limited off-target effects on related TE subfamilies (for example, LTR12D; Extended Data Fig. 8j–l), accompanied by loss of H3K27ac and gain of H3K9me3 at LTR12C elements (Fig. 5g–i and Extended Data Fig. 8m–p). Repressing LTR12C decreased OCI-AML3 growth (Fig. 5j).
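As an illustration only, the guide-pool coverage quoted above and the per-element log2(fold change) comparison of H3K27ac and H3K9me3 signal can be sketched in a few lines of Python. This is a minimal sketch and not the authors' pipeline: the function names, the pseudocount of 1 and the "repressed" label are hypothetical choices made for this example, and only the counts 2,765 and 2,400 come from the text.

```python
import math


def coverage_percent(predicted_hits, total_elements):
    """Percentage of elements predicted to be hit by the guide RNA pool."""
    return 100.0 * predicted_hits / total_elements


def log2_fold_change(targeted, control, pseudocount=1.0):
    """log2 ratio of signal in LTR12C-guide cells over scramble-guide cells.

    The pseudocount is an assumption of this sketch; it only avoids
    taking the logarithm of zero for elements without signal.
    """
    return math.log2((targeted + pseudocount) / (control + pseudocount))


def classify_element(h3k27ac_lfc, h3k9me3_lfc):
    """Hypothetical label: loss of H3K27ac together with gain of H3K9me3."""
    if h3k27ac_lfc < 0 and h3k9me3_lfc > 0:
        return "repressed"
    return "other"


# Counts taken from the text: 2,400 of 2,765 LTR12C copies predicted to be hit.
print(round(coverage_percent(2400, 2765)))

# Toy signal values (hypothetical, not measured data) for a single element.
h3k27ac = log2_fold_change(targeted=1.0, control=3.0)
h3k9me3 = log2_fold_change(targeted=7.0, control=3.0)
print(h3k27ac, h3k9me3, classify_element(h3k27ac, h3k9me3))
```

Under these assumptions, an element falls in the "repressed" group when its H3K27ac log2(fold change) is negative and its H3K9me3 log2(fold change) is positive, which corresponds to the quadrant described for Fig. 5i.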
To assess whether LTR12C accessibility supports LSC+ properties, we stably expressed CRISPRi6,84,85 in OCI-AML22 and introduced either scrambled or LTR12C guides (Extended Data Fig. 9a–c; Methods). Across three replicates, ATAC–seq confirmed reduced accessibility at LTR12C elements (P ≤ 8 × 10−4) with limited off-target effects (Fig. 5k and Extended Data Fig. 9d–j). Functionally, LTR12C repression reduced the LSC+ CD34+/CD38− fraction (from 62.5% to 44.3%) and increased the more committed CD34+/CD38+ fraction (from 27.2% to 45.4%; Fig. 5l). Collectively, these results establish that LTR12C accessibility is a determinant of LSC stemness.
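The size of the shift between immunophenotypic fractions can be made explicit with a short worked calculation. The sketch below uses only the two pairs of percentages reported in the text, taking the first value of each pair as the scramble control and the second as the LTR12C-repressed condition; the helper names are hypothetical, and the calculation does not reproduce the statistical test shown in Fig. 5l.

```python
def percentage_point_change(before, after):
    """Absolute change between two percentages, in percentage points."""
    return round(after - before, 1)


def relative_change_percent(before, after):
    """Change expressed relative to the starting percentage."""
    return round(100.0 * (after - before) / before, 1)


# (scramble control, LTR12C repression) percentages as reported in the text.
fractions = {
    "CD34+/CD38-": (62.5, 44.3),
    "CD34+/CD38+": (27.2, 45.4),
}

for name, (control, repressed) in fractions.items():
    print(
        name,
        percentage_point_change(control, repressed),
        relative_change_percent(control, repressed),
    )
```

With these reported values, the loss from the CD34+/CD38− fraction and the gain in the CD34+/CD38+ fraction are both 18.2 percentage points, in line with a shift from the stem-like compartment toward the more committed one.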

