Case studies
These 10 case studies showcase representative questions from GeneBench-Pro. Each case study includes the original prompt, datasets, and supporting materials. For an overview of the benchmark and key findings, see the announcement blog.
Note: File previews show excerpts from the full datasets.
Case study 1
Somatic oncology: Structural variant-guided tumor therapy benefit-risk decision
Estimate whether a synthetic TXR1-directed inhibitor has positive clinical utility in tumors whose target activation is driven by a structural variant. TXR1, TXR1i, DLR1, and star-allele labels are synthetic benchmark labels.
_The target subgroup has to be recovered from long-read, expression, tumor-quality, and pharmacogenomic evidence before benefit and toxicity can be interpreted as a treatment decision._
Files provided to the model
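| patient_id | analysis_set | age | sex | site | calendar_period | ecog | tumor_burden | prior_lines | prior_resistance | lineage_class | therapy_class | assessed16 | benefit16 | tox_stop_8wk | time_zero_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MTB0001 | 1 | 73.8 | M | S1 | P2 | 2 | 0.787 | 3 | 1 | A | TXR1i | 0 | | 1 | 0 |
| MTB0002 | 1 | 55.2 | M | S3 | P1 | 1 | 2.637 | 0 | 1 | A | TXR1i | 1 | 0 | 0 | 0 |
| MTB0003 | 1 | 68.8 | F | S4 | P2 | 0 | 0.891 | 2 | 1 | A | TXR1i | 1 | 1 | 1 | 0 |
| MTB0004 | 1 | 82.8 | F | S2 | P2 | 2 | 4.101 | 0 | 0 | B | TXR1i | 1 | 0 | 0 | 0 |
| MTB0005 | 1 | 65.5 | F | S1 | P3 | 1 | 7.0 | 1 | 1 | A | TXR1i | 1 | 0 | 0 | 0 |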
Registry covariates, therapy, week-16 assessment, benefit, and early toxicity.
Case study 2
Functional genomics: CRISPR target validation: lncRNA transcript or genomic locus?
Decide whether an apparent lncRNA dependency is transcript-specific or driven by nearby-locus and neighbor-gene effects.
_Transcript-directed evidence has to survive controls for local DNA-locus perturbation, neighbor-gene repression, guide swaps, GC toxicity, and plate effects._
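| guide_id | nominal_target | chr | coord | strand | dist_lnc_tss_bp | dist_neighbor_tss_bp | guide_gc_frac |
|---|---|---|---|---|---|---|---|
| g001 | LINC473 | chr7 | 100014 | + | 14 | 30 | 0.624 |
| g002 | LINC473 | chr7 | 100035 | - | 43 | 67 | 0.584 |
| g003 | LINC473 | chr7 | 100051 | + | 116 | 56 | 0.622 |
| g004 | LINC473 | chr7 | 100066 | - | 59 | 66 | 0.617 |
| g005 | LINC473 | chr7 | 100088 | + | 74 | 77 | 0.715 |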
Guide coordinates, targets, distances, and GC features.
Case study 3
Statistical genetics: Prioritizing protein drug targets in a linked genetic locus
Estimate direct disease effects for two nearby proteins using cis multivariable Mendelian randomization (cis-MVMR) while handling assay scale, allele orientation, winner's curse, LD, and residual local pleiotropy.
_The two proteins share a correlated locus. The analysis has to move from marginal associations to conditional, LD-aware disease effects on a common protein scale._
| snp | pos_bp | effect_allele | other_allele | maf | beta | se | pval |
|---|---|---|---|---|---|---|---|
| rs200000 | 50000000 | A | C | 0.42215 | 0.006438668310706808 | 0.003267330091203412 | 0.04876727714241972 |
| rs200001 | 50010126 | A | C | 0.05709 | 0.011008993337581301 | 0.006955239208750407 | 0.11345916603941006 |
| rs200002 | 50020253 | G | T | 0.09021 | 0.009922014757116319 | 0.005633023027015518 | 0.07817048492026045 |
| rs200003 | 50030379 | G | T | 0.48399 | 0.010569215614164573 | 0.0032291419740237445 | 0.0010638520681901973 |
| rs200004 | 50040506 | A | G | 0.37703 | 0.007036551378238654 | 0.0033297592321269802 | 0.034580976884336506 |
Screening-stage protein association summaries for PROTA.
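The step from marginal to conditional effects can be illustrated with a small multivariable MR fit by generalized least squares, where the weight matrix carries the LD between variants. This is only a sketch under stated assumptions: the function name `mvmr_gls` and all input values are hypothetical toy data rather than benchmark data, the SNP-exposure estimates are treated as error-free, and winner's curse, allele harmonization, and residual pleiotropy are not addressed.

```python
import numpy as np

def mvmr_gls(beta_exp, beta_out, se_out, ld):
    """Conditional exposure effects from summary statistics (toy sketch).

    beta_exp: (n_snps, n_exposures) SNP-exposure effects on a common scale
    beta_out: (n_snps,) SNP-outcome effects for the same effect alleles
    se_out:   (n_snps,) standard errors of beta_out
    ld:       (n_snps, n_snps) SNP correlation matrix
    """
    # Covariance of the outcome estimates across correlated SNPs: S R S.
    omega = np.outer(se_out, se_out) * ld
    omega_inv = np.linalg.inv(omega)
    xtwx = beta_exp.T @ omega_inv @ beta_exp
    theta = np.linalg.solve(xtwx, beta_exp.T @ omega_inv @ beta_out)
    se_theta = np.sqrt(np.diag(np.linalg.inv(xtwx)))
    return theta, se_theta

# Hypothetical toy inputs: four SNPs, two proteins, noise-free outcome.
beta_exp = np.array([[0.10, 0.02],
                     [0.08, 0.05],
                     [0.03, 0.09],
                     [0.01, 0.12]])
true_theta = np.array([0.5, 0.0])   # only the first protein is causal
beta_out = beta_exp @ true_theta
se_out = np.full(4, 0.01)
ld = np.array([[1.0, 0.3, 0.1, 0.0],
               [0.3, 1.0, 0.3, 0.1],
               [0.1, 0.3, 1.0, 0.3],
               [0.0, 0.1, 0.3, 1.0]])

theta, se_theta = mvmr_gls(beta_exp, beta_out, se_out, ld)
print(theta)
```

In this toy setup the second protein has a nonzero marginal association with the outcome through its correlation with the first, while its conditional estimate is zero, which is the distinction the task description draws.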
Case study 4
Clinical genomics / carrier screening: DRX1 carrier-screening residual risk under CNV and pseudogene calibration
Estimate ancestry-specific carrier frequencies, residual risk after a negative screen, partner carrier frequency, and affected-conceptus risk from carrier-screening assay data.
_The residual-risk estimate depends on pseudogene-aware carrier calls, founder-haplotype collapse, ancestry-specific assay calibration, and standardization from tested partners back to the full partner roster._
| sample_id | collection | ancestry | family_history_tier |
|---|---|---|---|
| S_EUR_0001 | screening | EUR | 0 |
| S_EUR_0002 | screening | EUR | 0 |
| S_EUR_0003 | screening | EUR | 0 |
| S_EUR_0004 | screening | EUR | 0 |
| S_EUR_0005 | screening | EUR | 1 |
Screening-roster adults with ancestry and screening context.
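The core arithmetic behind a residual-risk estimate is a Bayes update on the prior carrier frequency. The sketch below assumes an autosomal recessive condition, an assay with no false positives, and hypothetical input values (a 1-in-50 carrier frequency and a 90% detection rate); none of these numbers come from the benchmark, and the function names are illustrative.

```python
def residual_carrier_risk(carrier_freq, detection_rate):
    """Posterior carrier probability after a negative screen.

    Assumes the assay never calls a non-carrier positive, so a negative
    result arises from a missed carrier or from a true non-carrier.
    """
    missed = carrier_freq * (1.0 - detection_rate)
    return missed / (missed + (1.0 - carrier_freq))

def affected_conceptus_risk(carrier_prob_a, carrier_prob_b):
    """Risk of an affected conceptus under autosomal recessive inheritance.

    Each carrier parent transmits the pathogenic allele with probability 1/2.
    """
    return carrier_prob_a * carrier_prob_b * 0.25

# Hypothetical inputs, not benchmark values.
prior = 1 / 50
residual = residual_carrier_risk(prior, detection_rate=0.90)
risk = affected_conceptus_risk(residual, prior)
print(f"residual carrier risk: {residual:.5f}")
print(f"affected-conceptus risk: {risk:.7f}")
```

Because the detection rate enters the update directly, any ancestry-specific difference in assay calibration changes the residual risk even when the prior carrier frequency is the same.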
Case study 5
Single-cell genomics: Activated-monocyte eQTL after ambient RNA correction
Estimate a genotype effect on activated-monocyte expression after removing ambient RNA and technical contamination from single-cell RNA-seq data.
_Ambient RNA affects both target expression and the marker panel used to call activation state, so correction has to occur before the eQTL model._
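| cell_id | donor | total_umi | HBB | IFI6 | ISG15 | LST1 | CXCL10 |
|---|---|---|---|---|---|---|---|
| D01_C001 | D01 | 1113 | 7 | 3 | 4 | 83 | 5 |
| D01_C002 | D01 | 1103 | 6 | 3 | 3 | 112 | 10 |
| D01_C003 | D01 | 1141 | 9 | 8 | 12 | 63 | 9 |
| D01_C004 | D01 | 1250 | 7 | 60 | 43 | 2 | 17 |
| D01_C005 | D01 | 1045 | 9 | 1 | 2 | 51 | 15 |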
Per-cell UMI counts for marker genes, contamination markers, and the target gene.
Case study 6
Structural genetics: Nested structural variant: expression support and clinical association
Estimate whether a nested structural subhaplotype inside an anonymous inversion-like locus has a calibrated clinical association and credible expression support.
_A nested copy-dosage signal can be confounded by the broader inversion orientation, so dosage calibration, expression support, and clinical modeling have to remain distinct._
| sample_id | case | age | age_band | sex | pc1 | pc2 | pc3 | ancestry_group | clinic_stratum | recruitment_stream |
|---|---|---|---|---|---|---|---|---|---|---|
| Q00012 | 1 | 50.45 | 50_64 | 0 | -1.01514 | -0.21032 | -0.08849 | EUR | tertiary | clinic |
| Q00028 | 0 | 57.39 | 50_64 | 0 | -1.25987 | -0.12498 | 0.2344 | EUR | regional | registry |
| Q00029 | 1 | 68.4 | 65_plus | 0 | 0.91598 | 0.62177 | 0.01891 | AFR | tertiary | clinic |
| Q00030 | 1 | 74.07 | 65_plus | 1 | 0.21125 | -0.59634 | -0.08197 | EAS | community | registry |
| Q00032 | 1 | 82.82 | 65_plus | 0 | -1.12034 | -0.24372 | 0.14665 | EUR | community | clinic |
Clinical and covariate data for the full cohort.
Case study 7
Regulatory genomics: Measuring chromatin loop strength after structural-variant and mapping artifact masking
Quantify a focal case-control Hi-C loop-strength difference after removing low-mappability and structural-variant artifacts from the expected-contact background.
_The target loop is defined at 20 kb resolution, but the expected-contact model is distorted unless low-mappability contacts and a case-only SV stripe are masked first._
| bin_id | chrom | start | end | gc_content | mappability | re_sites |
|---|---|---|---|---|---|---|
| 0 | chr8 | 400000 | 420000 | 0.46199033821572594 | 0.9787574214704273 | 5 |
| 1 | chr8 | 420000 | 440000 | 0.5044124208534677 | 0.8901084943498397 | 5 |
| 2 | chr8 | 440000 | 460000 | 0.43218451584938194 | 0.9056879289326712 | 3 |
| 3 | chr8 | 460000 | 480000 | 0.4733197282681218 | 0.9376529840664789 | 3 |
| 4 | chr8 | 480000 | 500000 | 0.4444956062150748 | 0.8682565517981877 | 4 |
Target-resolution bin annotations.
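To see why masking has to come before the expected-contact model, consider a distance-stratified expected value computed with and without an artifact bin. The sketch below uses a hypothetical 8-bin toy matrix, not the benchmark contact data; the function names, the inflated bin, and the loop position are all assumptions made for illustration.

```python
import numpy as np

def masked_expected(contacts, bad_bins):
    """Mean contact count at each bin distance, skipping masked bins."""
    n = contacts.shape[0]
    keep = np.ones(n, dtype=bool)
    keep[np.asarray(list(bad_bins), dtype=int)] = False
    valid = np.outer(keep, keep)          # pixel is usable if both bins are
    expected = np.full(n, np.nan)
    for d in range(n):
        vals = np.diagonal(contacts, offset=d)
        ok = np.diagonal(valid, offset=d)
        if ok.any():
            expected[d] = vals[ok].mean()
    return expected

def loop_enrichment(contacts, i, j, expected):
    """Observed over expected at pixel (i, j)."""
    return contacts[i, j] / expected[abs(j - i)]

# Toy matrix: contacts decay with distance.
n = 8
idx = np.arange(n)
contacts = 12.0 / (np.abs(idx[:, None] - idx[None, :]) + 1)
# Hypothetical artifact: bin 2 is inflated five-fold (stripe-like).
contacts[2, :] *= 5
contacts[:, 2] *= 5
# Hypothetical focal loop between bins 0 and 4, twice the background.
contacts[0, 4] = contacts[4, 0] = 2 * 12.0 / 5

naive = loop_enrichment(contacts, 0, 4, masked_expected(contacts, []))
masked = loop_enrichment(contacts, 0, 4, masked_expected(contacts, [2]))
print(f"unmasked enrichment: {naive:.3f}, masked enrichment: {masked:.3f}")
```

In this toy case the unmasked background is inflated by the artifact bin, so the loop appears depleted; once the bin is masked the same pixel shows enrichment above 1.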
Case study 8
Statistical genetics: Multi-parent QTL mapping with founder reconstruction
Map a chromosome-1 quantitative-trait locus in an eight-founder recombinant population by reconstructing founder ancestry before testing the phenotype association.
_The visible marker data are biallelic, but the biological signal is founder ancestry. A defensible analysis therefore has to reconstruct founder state, check marker orientation, and separate the QTL from a batch-aligned nuisance peak._
| marker_id | chr | pos_cM |
|---|---|---|
| m2_065 | 2 | 59.762431265596575 |
| m2_103 | 2 | 94.52656615104739 |
| m2_107 | 2 | 98.18761427503033 |
| m2_079 | 2 | 72.20130244108847 |
| m1_054 | 1 | 49.907510212292195 |
Marker identifiers, chromosomes, and genetic-map positions.
Case study 9
Population genetics: Parent-specific ancestry and recent admixture timing
Infer parent-specific ancestry proportions and recent admixture timing from phased local-ancestry tracts after repairing reciprocal artifacts and a chromosome-specific label inversion.
_Ancestry fractions and pulse times both change if reciprocal tract artifacts, chromosome-local label inversion, or map denominators are handled incorrectly._
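| chrom | hap | start_morgan | end_morgan | anc | posterior | low_complexity_frac |
|---|---|---|---|---|---|---|
| chr1 | h1 | 0.03 | 0.505 | A | 0.985 | 0.08 |
| chr1 | h1 | 0.505 | 0.535 | B | 0.62 | 0.92 |
| chr1 | h1 | 0.535 | 1.478849 | A | 0.985 | 0.08 |
| chr1 | h1 | 1.503727 | 1.852681 | B | 0.985 | 0.08 |
| chr1 | h1 | 1.852681 | 2.422373 | A | 0.985 | 0.08 |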
Phased local-ancestry tracts with coordinates, ancestry labels, posterior values, and QC annotations.
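The sensitivity to map denominators can be shown with the five preview tracts alone. The sketch below computes length-weighted ancestry fractions twice, once over the covered tract length and once over the full coordinate span; the function name is hypothetical, no artifact repair or label correction is applied, and five rows from one haplotype are not a meaningful estimate.

```python
# (start_morgan, end_morgan, anc) for the five preview rows of chr1, h1.
tracts = [
    (0.03, 0.505, "A"),
    (0.505, 0.535, "B"),
    (0.535, 1.478849, "A"),
    (1.503727, 1.852681, "B"),
    (1.852681, 2.422373, "A"),
]

def ancestry_fractions(tracts, map_length=None):
    """Length-weighted ancestry fractions in Morgans.

    If map_length is None, the denominator is the total covered length;
    otherwise the supplied map length is used, so uncovered gaps count
    against every ancestry.
    """
    lengths = {}
    for start, end, anc in tracts:
        lengths[anc] = lengths.get(anc, 0.0) + (end - start)
    denom = sum(lengths.values()) if map_length is None else map_length
    return {anc: total / denom for anc, total in lengths.items()}

covered = ancestry_fractions(tracts)
span = tracts[-1][1] - tracts[0][0]   # includes the gap after 1.478849
spanned = ancestry_fractions(tracts, map_length=span)
print(covered)
print(spanned)
```

The two denominators differ because the preview contains an uncovered gap between 1.478849 and 1.503727 Morgans, so fractions over the span do not sum to 1.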
Case study 10
Population genetics: Estimating selection from noisy ancient-DNA time series
Infer which of two haploid loci is under stronger positive selection from ancient allele-frequency time series while accounting for allele orientation, directional error, drift, and changing population size.
_Noisy ancient trajectories are not directly comparable until both loci are placed on the same derived-allele scale and the provided sample-level sequencing-error values are modeled directly._
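| generation | alt_reads | total_reads | seq_error | sample_year |
|---|---|---|---|---|
| 6 | 36 | 40 | 0.16 | -4500 |
| 12 | 34 | 45 | 0.16 | -4278 |
| 18 | 41 | 55 | 0.16 | -4056 |
| 24 | 38 | 70 | 0.16 | -3833 |
| 30 | 36 | 90 | 0.16 | -3611 |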
Read-count time series for locus A.
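One simple way to see how sequencing error distorts a read-count trajectory is a method-of-moments correction applied to the preview rows. This is a minimal sketch, not the intended solution: it assumes a two-rate misread model whose rates both default to the listed `seq_error`, the function and parameter names are hypothetical, and clipping to [0, 1] is a crude stand-in for a proper likelihood that models the error directly.

```python
# (generation, alt_reads, total_reads, seq_error) from the locus A preview.
rows = [
    (6, 36, 40, 0.16),
    (12, 34, 45, 0.16),
    (18, 41, 55, 0.16),
    (24, 38, 70, 0.16),
    (30, 36, 90, 0.16),
]

def corrected_alt_frequency(alt, total, err_ref_to_alt, err_alt_to_ref=None):
    """Invert P(alt read) = f * (1 - e10) + (1 - f) * e01 for f.

    e01 is the rate of reading ref as alt, e10 the rate of reading alt as
    ref. Directional error corresponds to e01 != e10; by default both are
    set to the same value.
    """
    e01 = err_ref_to_alt
    e10 = err_ref_to_alt if err_alt_to_ref is None else err_alt_to_ref
    observed = alt / total
    f = (observed - e01) / (1.0 - e01 - e10)
    return min(1.0, max(0.0, f))      # crude clipping to a valid frequency

freqs = [corrected_alt_frequency(a, n, e) for _, a, n, e in rows]
for (gen, a, n, _), f in zip(rows, freqs):
    print(f"generation {gen}: raw {a / n:.3f}, corrected {f:.3f}")
```

The corrected values still refer to the alt allele; whether alt is the derived allele at each locus has to be settled separately before the two loci are compared.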