API

This page describes the application programming interface (API) for PyPGx.

Below is the list of submodules available in the API:

  • core : The core submodule is the main suite of tools for PGx research.

  • genotype : The genotype submodule is primarily used to make final diplotype calls by interpreting candidate star alleles and/or detected structural variants.

  • pipeline : The pipeline submodule is used to provide convenient methods that combine multiple PyPGx actions and automatically handle semantic types.

  • plot : The plot submodule is used to plot various kinds of profiles such as read depth, copy number, and allele fraction.

  • utils : The utils submodule contains the main actions of PyPGx.

To get help on a specific submodule (e.g. utils):

from pypgx.api import utils
help(utils)

core

The core submodule is the main suite of tools for PGx research.

Functions:

build_definition_table(gene[, assembly])

Build the definition table of star alleles for specified gene.

collapse_alleles(gene, alleles[, assembly])

Collapse redundant candidate star alleles.

get_default_allele(gene[, assembly])

Get the default allele of specified gene.

get_exon_ends(gene[, assembly])

Get exon ends for specified gene.

get_exon_starts(gene[, assembly])

Get exon starts for specified gene.

get_function(gene, allele)

Get matched function from the allele table.

get_paralog(gene)

Get the paralog of specified gene.

get_priority(gene, phenotype)

Get matched priority from the phenotype table.

get_recommendation(drug, gene1, phenotype1[, gene2, phenotype2])

Get recommendation for specified drug-phenotype combination.

get_ref_allele(gene)

Get the reference allele for target gene.

get_region(gene[, assembly])

Get matched region from the gene table.

get_score(gene, allele)

Get matched score from the allele table.

get_strand(gene)

Get DNA strand (‘+’ or ‘-’) for specified gene.

get_variant_impact(variant)

Get variant impact from the variant table.

get_variant_synonyms(gene[, assembly])

Get variant synonyms.

has_phenotype(gene)

Return True if specified gene has phenotype data.

has_score(gene)

Return True if specified gene has activity score.

has_sv(gene[, allele])

Return True if specified gene or allele has SV.

is_legit_allele(gene, allele)

Return True if specified allele exists in the allele table.

is_target_gene(gene)

Return True if specified gene is one of the target genes.

list_alleles(gene[, variants, assembly])

List all star alleles present in the allele table.

list_functions([gene])

List all functions present in the allele table.

list_genes([mode])

List genes in the gene table.

list_phenotypes([gene])

List all phenotypes present in the phenotype table.

list_variants(gene[, alleles, mode, assembly])

List variants that are used to define star alleles.

load_allele_table()

Load the allele table.

load_cnv_table()

Load the CNV table.

load_cpic_table()

Load the CPIC table.

load_diplotype_table()

Load the diplotype table.

load_equation_table()

Load the phenotype equation table.

load_gene_table()

Load the gene table.

load_phenotype_table()

Load the phenotype table.

load_recommendation_table()

Load the recommendation table.

load_variant_table()

Load the variant table.

predict_phenotype(gene, a, b)

Predict phenotype based on two haplotype calls.

predict_score(gene, allele)

Predict activity score based on haplotype call.

sort_alleles(alleles[, by, gene, assembly])

Sort star alleles by either priority or name.

pypgx.api.core.build_definition_table(gene, assembly='GRCh37')

Build the definition table of star alleles for specified gene.

The table will only contain star alleles that are defined by SNVs and/or indels. It will not include alleles with SV (e.g. CYP2D6*5) or alleles with no variants (e.g. CYP2D6*2 for GRCh37 and CYP2D6*1 for GRCh38).

Parameters
  • gene (str) – Target gene.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Definition table.

Return type

fuc.api.pyvcf.VcfFrame

Examples

>>> import pypgx
>>> vf = pypgx.build_definition_table('CYP4F2')
>>> vf.df
  CHROM       POS         ID REF ALT QUAL FILTER      INFO FORMAT *2 *3
0    19  15990431  rs2108622   C   T    .      .  VI=V433M     GT  0  1
1    19  16008388  rs3093105   A   C    .      .   VI=W12G     GT  1  0
>>> vf = pypgx.build_definition_table('CYP4F2', assembly='GRCh38')
>>> vf.df
  CHROM       POS         ID REF ALT QUAL FILTER      INFO FORMAT *2 *3
0    19  15879621  rs2108622   C   T    .      .  VI=V433M     GT  0  1
1    19  15897578  rs3093105   A   C    .      .   VI=W12G     GT  1  0

pypgx.api.core.collapse_alleles(gene, alleles, assembly='GRCh37')

Collapse redundant candidate star alleles.

Note that this method only considers core variants for collapsing.

Parameters
  • gene (str) – Gene name.

  • alleles (list) – List of alleles.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Collapsed list of alleles.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_variants('CYP2B6', alleles='*6', mode='core')
['19-41512841-G-T', '19-41515263-A-G']
>>> pypgx.list_variants('CYP2B6', alleles='*7', mode='core')
['19-41512841-G-T', '19-41515263-A-G', '19-41522715-C-T']
>>> pypgx.collapse_alleles('CYP2B6', ['*6', '*7'])
['*7']
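
One way to picture the collapsing step in the example above: an allele is treated as redundant when its core variants form a proper subset of another candidate's core variants. The sketch below is a guess at that rule under this assumption, not PyPGx's actual implementation. `collapse_by_subset` is a hypothetical helper, and the variant lists are copied from the `list_variants` output above.

```python
def collapse_by_subset(core_variants):
    """Keep only alleles whose core variants are not a proper subset of another's.

    core_variants maps allele name -> list of core variant names.
    This is an illustrative rule, not PyPGx's own logic.
    """
    kept = []
    for name, variants in core_variants.items():
        redundant = any(
            other != name and set(variants) < set(other_variants)
            for other, other_variants in core_variants.items()
        )
        if not redundant:
            kept.append(name)
    return kept


# Core variants of CYP2B6*6 and CYP2B6*7 (GRCh37), taken from the example above.
candidates = {
    '*6': ['19-41512841-G-T', '19-41515263-A-G'],
    '*7': ['19-41512841-G-T', '19-41515263-A-G', '19-41522715-C-T'],
}

print(collapse_by_subset(candidates))
```

Because every core variant of `*6` is also carried by `*7`, only `*7` survives, which matches the documented result.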

pypgx.api.core.get_default_allele(gene, assembly='GRCh37')

Get the default allele of specified gene.

Parameters
  • gene (str) – Gene name.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Default allele.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_default_allele('CYP2D6')
'*2'
>>> pypgx.get_default_allele('CYP2D6', assembly='GRCh38')
'*1'

pypgx.api.core.get_exon_ends(gene, assembly='GRCh37')

Get exon ends for specified gene.

Parameters
  • gene (str) – Gene name.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

List of end positions.

Return type

list

See also

get_exon_starts

Get exon starts for specified gene.

Examples

>>> import pypgx
>>> pypgx.get_exon_ends('CYP2D6')
[42522754, 42522994, 42523636, 42523985, 42524352, 42524946, 42525187, 42525911, 42526883]
>>> pypgx.get_exon_ends('CYP2D6', assembly='GRCh38')
[42126752, 42126992, 42127634, 42127983, 42128350, 42128944, 42129185, 42129909, 42130810]

pypgx.api.core.get_exon_starts(gene, assembly='GRCh37')

Get exon starts for specified gene.

Parameters
  • gene (str) – Gene name.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

List of start positions.

Return type

list

See also

get_exon_ends

Get exon ends for specified gene.

Examples

>>> import pypgx
>>> pypgx.get_exon_starts('CYP2D6')
[42522500, 42522852, 42523448, 42523843, 42524175, 42524785, 42525034, 42525739, 42526613]
>>> pypgx.get_exon_starts('CYP2D6', assembly='GRCh38')
[42126498, 42126850, 42127446, 42127841, 42128173, 42128783, 42129032, 42129737, 42130611]
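
The two exon lists are most useful together. The sketch below pairs them into intervals, assuming the i-th start belongs with the i-th end (an assumption, though the example values are consistent with it). `pair_exons` is a hypothetical helper, and the coordinates are the CYP2D6 GRCh37 values shown in the examples.

```python
# CYP2D6 exon coordinates (GRCh37), copied from the examples above.
starts = [42522500, 42522852, 42523448, 42523843, 42524175,
          42524785, 42525034, 42525739, 42526613]
ends = [42522754, 42522994, 42523636, 42523985, 42524352,
        42524946, 42525187, 42525911, 42526883]


def pair_exons(starts, ends):
    """Zip start and end positions into (start, end) tuples."""
    if len(starts) != len(ends):
        raise ValueError('starts and ends must have the same length')
    return list(zip(starts, ends))


exons = pair_exons(starts, ends)
# Difference between end and start for each exon interval.
spans = [end - start for start, end in exons]

print(exons[0])
print(spans)
```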

pypgx.api.core.get_function(gene, allele)

Get matched function from the allele table.

Parameters
  • gene (str) – Gene name.

  • allele (str) – Star allele.

Returns

Function status.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_function('CYP2D6', '*1')
'Normal Function'
>>> pypgx.get_function('CYP2D6', '*4')
'No Function'
>>> pypgx.get_function('CYP2D6', '*22')
'Uncertain Function'
>>> pypgx.get_function('UGT1A1', '*80+*37')
'Decreased Function'
>>> pypgx.get_function('CYP2D6', '*140')
nan
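
The last call returns NaN rather than a string. NaN never compares equal to anything, itself included, so an `==` test will not detect it. A minimal sketch of a safer check follows, assuming the value is a float NaN as the printed output suggests. `is_missing` is a hypothetical helper, not part of PyPGx.

```python
import math


def is_missing(value):
    """Return True if value is a float NaN (i.e. no data for the allele)."""
    return isinstance(value, float) and math.isnan(value)


function = float('nan')  # stand-in for a lookup that returned nan

print(function == float('nan'))   # False: NaN is never equal to NaN
print(is_missing(function))       # True
print(is_missing('No Function'))  # False
```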

pypgx.api.core.get_paralog(gene)

Get the paralog of specified gene.

Parameters

gene (str) – Gene name.

Returns

Paralog gene. Empty string if none exists.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_paralog('CYP2D6')
'CYP2D7'
>>> pypgx.get_paralog('CYP2D7')
'CYP2D6'
>>> pypgx.get_paralog('CYP2B6')
'CYP2B7'
>>> pypgx.get_paralog('CYP2E1')
''

pypgx.api.core.get_priority(gene, phenotype)

Get matched priority from the phenotype table.

Parameters
  • gene (str) – Gene name.

  • phenotype (str) – Phenotype name.

Returns

EHR priority.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_priority('CYP2D6', 'Normal Metabolizer')
'Normal/Routine/Low Risk'
>>> pypgx.get_priority('CYP2D6', 'Ultrarapid Metabolizer')
'Abnormal/Priority/High Risk'
>>> pypgx.get_priority('CYP3A5', 'Normal Metabolizer')
'Abnormal/Priority/High Risk'
>>> pypgx.get_priority('CYP3A5', 'Poor Metabolizer')
'Normal/Routine/Low Risk'

pypgx.api.core.get_recommendation(drug, gene1, phenotype1, gene2=None, phenotype2=None)

Get recommendation for specified drug-phenotype combination.

Parameters
  • drug (str) – Drug name.

  • gene1 (str) – Gene name.

  • phenotype1 (str) – Phenotype name.

  • gene2 (str, optional) – Second gene name.

  • phenotype2 (str, optional) – Second phenotype name.

Returns

Drug recommendation.

Return type

str

Examples

>>> import pypgx
>>> # Codeine, an opiate and prodrug of morphine, is metabolized by CYP2D6
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Normal Metabolizer')
'Use codeine label recommended age- or weight-specific dosing.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Ultrarapid Metabolizer')
'Avoid codeine use because of potential for serious toxicity. If opioid use is warranted, consider a non-tramadol opioid.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Poor Metabolizer')
'Avoid codeine use because of possibility of diminished analgesia. If opioid use is warranted, consider a non-tramadol opioid.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Indeterminate')
'None'
>>> # It's possible to have an altered recommendation for Normal Metabolizer
>>> pypgx.get_recommendation('tacrolimus', 'CYP3A5', 'Normal Metabolizer')
'Increase starting dose 1.5 to 2 times recommended starting dose. Total starting dose should not exceed 0.3 mg/kg/day. Use therapeutic drug monitoring to guide dose adjustments.'
>>> # Some recommendations are determined by multiple genes (the order doesn't matter)
>>> pypgx.get_recommendation('fluvastatin', 'CYP2C9', 'Normal Metabolizer')
/Users/sbslee/Desktop/pypgx/pypgx/api/core.py:633: UserWarning: Recommendations for fluvastatin are determined by multiple genes (CYP2C9, SLCO1B1); for best results, specify phenotype for each gene
  warnings.warn(message)
'Prescribe desired starting dose and adjust doses of fluvastatin based on disease-specific guidelines.'
>>> pypgx.get_recommendation('fluvastatin', 'SLCO1B1', 'Normal Function')
/Users/sbslee/Desktop/pypgx/pypgx/api/core.py:633: UserWarning: Recommendations for fluvastatin are determined by multiple genes (CYP2C9, SLCO1B1); for best results, specify phenotype for each gene
  warnings.warn(message)
'Prescribe desired starting dose and adjust doses of fluvastatin based on disease-specific guidelines.'
>>> pypgx.get_recommendation('fluvastatin', 'CYP2C9', 'Normal Metabolizer', 'SLCO1B1', 'Normal Function')
'Prescribe desired starting dose and adjust doses of fluvastatin based on disease-specific guidelines.'
>>> pypgx.get_recommendation('fluvastatin', 'SLCO1B1', 'Normal Function', 'CYP2C9', 'Normal Metabolizer')
'Prescribe desired starting dose and adjust doses of fluvastatin based on disease-specific guidelines.'
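
The last two calls return the same text regardless of gene order. If you keep a similar lookup table of your own, one simple way to get order independence is to key on an unordered set of (gene, phenotype) pairs. This is only a sketch of that idea, not how PyPGx stores its recommendations. `TABLE` and `lookup` are hypothetical names, and the recommendation text is copied from the example above.

```python
# Hypothetical table keyed by drug plus an unordered set of (gene, phenotype) pairs.
TABLE = {
    ('fluvastatin', frozenset({
        ('CYP2C9', 'Normal Metabolizer'),
        ('SLCO1B1', 'Normal Function'),
    })): 'Prescribe desired starting dose and adjust doses of fluvastatin '
         'based on disease-specific guidelines.',
}


def lookup(drug, *pairs):
    """Look up a recommendation; pairs are (gene, phenotype) tuples in any order."""
    return TABLE.get((drug, frozenset(pairs)))


a = lookup('fluvastatin', ('CYP2C9', 'Normal Metabolizer'), ('SLCO1B1', 'Normal Function'))
b = lookup('fluvastatin', ('SLCO1B1', 'Normal Function'), ('CYP2C9', 'Normal Metabolizer'))
print(a == b)
```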

pypgx.api.core.get_ref_allele(gene)

Get the reference allele for target gene.

Parameters

gene (str) – Target gene.

Returns

Reference allele.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_ref_allele('CYP2D6')
'*1'
>>> pypgx.get_ref_allele('NAT1')
'*4'

pypgx.api.core.get_region(gene, assembly='GRCh37')

Get matched region from the gene table.

Parameters
  • gene (str) – Gene name.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Requested region.

Return type

str
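
No example is given for this function. Going by the GRCh37Region column of the gene table (e.g. 7:87130178-87345639 for ABCB1), the returned string appears to have the form chrom:start-end. Assuming that format, the sketch below splits such a string into its parts. `Region` and `parse_region` are hypothetical names, not part of PyPGx.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(region):
    """Split a 'chrom:start-end' string into a Region tuple."""
    chrom, _, span = region.partition(':')
    start, _, end = span.partition('-')
    return Region(chrom, int(start), int(end))


# ABCB1 GRCh37 region as listed in the gene table.
region = parse_region('7:87130178-87345639')
print(region)
```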

pypgx.api.core.get_score(gene, allele)

Get matched score from the allele table.

Parameters
  • gene (str) – Gene name.

  • allele (str) – Star allele.

Returns

Activity score.

Return type

float

See also

predict_score

Predict activity score based on haplotype call.

Examples

>>> import pypgx
>>> pypgx.get_score('CYP2D6', '*1')  # Allele with normal function
1.0
>>> pypgx.get_score('CYP2D6', '*4')  # Allele with no function
0.0
>>> pypgx.get_score('CYP2D6', '*22') # Allele with uncertain function
nan
>>> pypgx.get_score('CYP2B6', '*1')  # CYP2B6 does not have activity score
nan

pypgx.api.core.get_strand(gene)

Get DNA strand (‘+’ or ‘-’) for specified gene.

Parameters

gene (str) – Gene name.

Returns

‘+’ or ‘-’.

Return type

str

pypgx.api.core.get_variant_impact(variant)

Get variant impact from the variant table.

Parameters

variant (str) – Variant name.

Returns

Variant impact.

Return type

str

Examples

>>> import pypgx
>>> pypgx.get_variant_impact('22-42522580-C-T') # Missense variant
'R497H'
>>> pypgx.get_variant_impact('10-96541756-T-A') # Splice variant
'Splice Defect'
>>> pypgx.get_variant_impact('22-42524435-T-A') # Intron variant
''
>>> pypgx.get_variant_impact('22-42524435-T-C')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sbslee/Desktop/pypgx/pypgx/api/core.py", line 588, in get_variant_impact
    raise sdk.utils.VariantNotFoundError(variant)
pypgx.sdk.utils.VariantNotFoundError: 22-42524435-T-C

pypgx.api.core.get_variant_synonyms(gene, assembly='GRCh37')

Get variant synonyms.

Parameters
  • gene (str) – Target gene.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Variant synonyms.

Return type

dict

Examples

>>> import pypgx
>>> pypgx.get_variant_synonyms('UGT1A1')
{'2-234668879-CAT-CATAT': '2-234668879-C-CAT', '2-234668879-CAT-CATATAT': '2-234668879-C-CATAT'}
>>> pypgx.get_variant_synonyms('CYP2D6')
{}
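
A dictionary like this can be used to normalise variant names before comparing them. The sketch below assumes the keys are alternative representations and the values are the names to map them to; that direction is an assumption, so check it against your data. `normalize` is a hypothetical helper, and the dictionary is the UGT1A1 output shown above.

```python
# UGT1A1 synonyms (GRCh37), copied from the example output above.
synonyms = {
    '2-234668879-CAT-CATAT': '2-234668879-C-CAT',
    '2-234668879-CAT-CATATAT': '2-234668879-C-CATAT',
}


def normalize(variants, synonyms):
    """Replace each variant that has a synonym entry; leave the rest unchanged."""
    return [synonyms.get(variant, variant) for variant in variants]


observed = ['2-234668879-CAT-CATAT', '2-234669144-G-A']
print(normalize(observed, synonyms))
```

The second variant name in `observed` is made up for illustration; it only shows that names without an entry pass through untouched.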

pypgx.api.core.has_phenotype(gene)

Return True if specified gene has phenotype data.

Parameters

gene (str) – Gene name.

Returns

Whether phenotype is supported.

Return type

bool

Examples

>>> import pypgx
>>> pypgx.has_phenotype('CYP2D6')
True
>>> pypgx.has_phenotype('CYP4F2')
False

pypgx.api.core.has_score(gene)

Return True if specified gene has activity score.

Parameters

gene (str) – Gene name.

Returns

Whether activity score is supported.

Return type

bool

Examples

>>> import pypgx
>>> pypgx.has_score('CYP2D6')
True
>>> pypgx.has_score('CYP2B6')
False

pypgx.api.core.has_sv(gene, allele=None)

Return True if specified gene or allele has SV.

The method will return False regardless of the specified allele if the target gene is not in the list of genes with SV. Additionally, it will return False if the specified allele is 'Indeterminate'.

Parameters
  • gene (str) – Target gene.

  • allele (str, optional) – Allele to be tested.

Returns

True if the allele has SV.

Return type

bool

Examples

>>> import pypgx
>>> pypgx.has_sv('CYP2D6')            # PyPGx has SV data for CYP2D6
True
>>> pypgx.has_sv('CYP3A5')            # PyPGx does not have SV data for CYP3A5
False
>>> pypgx.has_sv('CYP2D6', '*1')      # No SV
False
>>> pypgx.has_sv('CYP2D6', '*5')      # Gene deletion
True
>>> pypgx.has_sv('CYP2D6', '*2x2')    # Gene duplication
True
>>> pypgx.has_sv('CYP2D6', '*36+*10') # Tandem arrangement
True
>>> pypgx.has_sv('CYP3A5', '*1x2+*2') # Imaginary SV
/Users/sbslee/Desktop/pypgx/pypgx/api/core.py:289: UserWarning: PyPGx currently has no SV data available for CYP3A5. For more details, please visit the Genes section (https://pypgx.readthedocs.io/en/latest/genes.html) in the Read the Docs.
  warnings.warn(f"PyPGx currently has no SV data available for {gene}. "
False
>>> pypgx.has_sv('CYP2D6', 'Indeterminate')
False

pypgx.api.core.is_legit_allele(gene, allele)

Return True if specified allele exists in the allele table.

Parameters
  • gene (str) – Target gene.

  • allele (str) – Allele to be tested.

Returns

True if the allele is legit.

Return type

bool

pypgx.api.core.is_target_gene(gene)

Return True if specified gene is one of the target genes.

Parameters

gene (str) – Gene name.

Returns

True if specified gene is one of the target genes.

Return type

bool

Examples

>>> import pypgx
>>> pypgx.is_target_gene('CYP2D6')
True
>>> pypgx.is_target_gene('CYP2D7')
False

pypgx.api.core.list_alleles(gene, variants=None, assembly='GRCh37')

List all star alleles present in the allele table.

Parameters
  • gene (str) – Target gene.

  • variants (str or list, optional) – Only list alleles carrying specified variant(s) as a part of definition.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Requested alleles.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_alleles('CYP4F2')
['*1', '*2', '*3']
>>> pypgx.list_alleles('CYP2B6', variants=['19-41515263-A-G'], assembly='GRCh37')
['*4', '*6', '*7', '*13', '*19', '*20', '*26', '*34', '*36', '*37', '*38']

pypgx.api.core.list_functions(gene=None)

List all functions present in the allele table.

Parameters

gene (str, optional) – Return only functions belonging to this gene.

Returns

Available functions.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_functions()
[nan, 'Normal Function', 'Uncertain Function', 'Increased Function', 'Decreased Function', 'No Function', 'Unknown Function', 'Possible Decreased Function', 'Possible Increased Function']
>>> pypgx.list_functions(gene='CYP2D6')
['Normal Function', 'No Function', 'Decreased Function', 'Uncertain Function', 'Unknown Function', nan]

pypgx.api.core.list_genes(mode='target')

List genes in the gene table.

Parameters

mode ({‘target’, ‘control’, ‘all’}, default: ‘target’) – Specify which gene set to return.

Returns

Gene set.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_genes(mode='target')[:5] # First five target genes
['CACNA1S', 'CFTR', 'CYP1A2', 'CYP2A6', 'CYP2A13']
>>> pypgx.list_genes(mode='control')
['EGFR', 'RYR1', 'VDR']
>>> pypgx.list_genes(mode='all')[:5] # Includes pseudogenes
['CACNA1S', 'CFTR', 'CYP1A2', 'CYP2A6', 'CYP2A7']

pypgx.api.core.list_phenotypes(gene=None)

List all phenotypes present in the phenotype table.

Parameters

gene (str, optional) – Return only phenotypes belonging to this gene.

Returns

Available phenotypes.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_phenotypes()
['Intermediate Metabolizer', 'Normal Metabolizer', 'Poor Metabolizer', 'Rapid Metabolizer', 'Ultrarapid Metabolizer', 'Likely Intermediate Metabolizer', 'Likely Poor Metabolizer', 'Possible Intermediate Metabolizer']
>>> pypgx.list_phenotypes(gene='CYP2D6')
['Ultrarapid Metabolizer', 'Normal Metabolizer', 'Intermediate Metabolizer', 'Poor Metabolizer']

pypgx.api.core.list_variants(gene, alleles=None, mode='all', assembly='GRCh37')

List variants that are used to define star alleles.

Some alleles, such as reference alleles, may return an empty list because they do not contain any variants.

Parameters
  • gene (str) – Target gene.

  • alleles (str or list, optional) – Allele name or list of alleles.

  • mode ({‘all’, ‘core’, ‘tag’}, default: ‘all’) – Whether to return all variants, core variants only, or tag variants only.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

Returns

Coordinate sorted list of variants.

Return type

list

Examples

>>> import pypgx
>>> pypgx.list_variants('CYP4F2')
['19-15990431-C-T', '19-16008388-A-C']
>>> pypgx.list_variants('CYP4F2', alleles=['*2'])
['19-16008388-A-C']
>>> pypgx.list_variants('CYP4F2', alleles=['*2', '*3'])
['19-15990431-C-T', '19-16008388-A-C']
>>> pypgx.list_variants('CYP4F2', alleles=['*2'], assembly='GRCh38')
['19-15897578-A-C']
>>> pypgx.list_variants('CYP4F2', alleles=['*1'])
[]
>>> pypgx.list_variants('CYP2B6', alleles=['*6'], mode='all')
['19-41495755-T-C', '19-41496461-T-C', '19-41512841-G-T', '19-41515263-A-G']
>>> pypgx.list_variants('CYP2B6', alleles=['*6'], mode='core')
['19-41512841-G-T', '19-41515263-A-G']
>>> pypgx.list_variants('CYP2B6', alleles=['*6'], mode='tag')
['19-41495755-T-C', '19-41496461-T-C']
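
The variant names returned here follow a chrom-pos-ref-alt pattern, judging by the examples. Assuming that pattern holds, the sketch below splits a name into typed fields and distinguishes SNVs from indels. `Variant`, `parse_variant` and `is_snv` are hypothetical names, not part of PyPGx.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant(name):
    """Split a 'chrom-pos-ref-alt' name into a Variant tuple."""
    chrom, pos, ref, alt = name.split('-')
    return Variant(chrom, int(pos), ref, alt)


def is_snv(variant):
    """True when both alleles are a single base; otherwise treat as an indel."""
    return len(variant.ref) == 1 and len(variant.alt) == 1


snv = parse_variant('19-41512841-G-T')     # CYP2B6 core variant from the examples
indel = parse_variant('2-234668879-C-CAT') # UGT1A1 variant from the synonyms example
print(snv, is_snv(snv))
print(indel, is_snv(indel))
```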

pypgx.api.core.load_allele_table()

Load the allele table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_allele_table()
>>> df.head()
      Gene StarAllele  ActivityScore                           Function                                    GRCh37Core GRCh37Tag                                    GRCh38Core GRCh38Tag     SV
0    ABCB1         *1            NaN                    Normal Function  7-87138645-A-G,7-87160618-A-C,7-87179601-A-G       NaN  7-87509329-A-G,7-87531302-A-C,7-87550285-A-G       NaN  False
1    ABCB1         *2            NaN                 Increased Function                                           NaN       NaN                                           NaN       NaN  False
2  CACNA1S  Reference            NaN                    Normal Function                                           NaN       NaN                                           NaN       NaN  False
3  CACNA1S   c.520C>T            NaN  Malignant Hyperthermia Associated                               1-201061121-G-A       NaN                               1-201091993-G-A       NaN  False
4  CACNA1S  c.3257G>A            NaN  Malignant Hyperthermia Associated                               1-201029943-C-T       NaN                               1-201060815-C-T       NaN  False

pypgx.api.core.load_cnv_table()

Load the CNV table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_cnv_table()
>>> df.head()
     Gene          Name
0  CYP2A6        Normal
1  CYP2A6  Deletion1Het
2  CYP2A6  Deletion1Hom
3  CYP2A6  Deletion2Het
4  CYP2A6  Deletion3Het

pypgx.api.core.load_cpic_table()

Load the CPIC table.

The copy of the CPIC table in PyPGx is current as of August 19, 2023. To obtain the latest CPIC table, you can visit the Genes-Drugs page on the CPIC website.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_cpic_table()
>>> df.head()
    Gene           Drug    RxNorm                                 ATC                                          Guideline CPICLevel CPICLevelStatus PharmGKBLevel             FDALabel               PMID
0    HLA-B       abacavir  190521.0  J05AF06, J05AR02, J05AR13, J05AR04  https://cpicpgx.org/guidelines/guideline-for-a...         A           Final            1A     Testing required  24561393;22378157
1    HLA-B    allopurinol     519.0                    M04AA01, M04AA51  https://cpicpgx.org/guidelines/guideline-for-a...         A           Final            1A  Testing recommended  23232549;26094938
2  MT-RNR1       amikacin     641.0           D06AX12, J01GB06, S01AA21  https://cpicpgx.org/guidelines/cpic-guideline-...         A           Final            1A                  NaN           34032273
3  CYP2C19  amitriptyline     704.0                    N06AA09, N06CA01  https://cpicpgx.org/guidelines/guideline-for-t...         A           Final            1A                  NaN  23486447;27997040
4   CYP2D6  amitriptyline     704.0                    N06AA09, N06CA01  https://cpicpgx.org/guidelines/guideline-for-t...         A           Final            1A       Actionable PGx  23486447;27997040

pypgx.api.core.load_diplotype_table()

Load the diplotype table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_diplotype_table()
>>> df.head()
      Gene            Diplotype                              Phenotype
0  CACNA1S  Reference/Reference               Uncertain Susceptibility
1  CACNA1S   Reference/c.520C>T  Malignant Hyperthermia Susceptibility
2  CACNA1S  Reference/c.3257G>A  Malignant Hyperthermia Susceptibility
3  CACNA1S    c.520C>T/c.520C>T  Malignant Hyperthermia Susceptibility
4  CACNA1S   c.520C>T/c.3257G>A  Malignant Hyperthermia Susceptibility

pypgx.api.core.load_equation_table()

Load the phenotype equation table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_equation_table()
>>> df.head()
     Gene                 Phenotype              Equation
0  CYP2C9          Poor Metabolizer        0 <= score < 1
1  CYP2C9  Intermediate Metabolizer        1 <= score < 2
2  CYP2C9        Normal Metabolizer            2 == score
3  CYP2D6          Poor Metabolizer     0 <= score < 0.25
4  CYP2D6  Intermediate Metabolizer  0.25 <= score < 1.25
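
Each Equation entry is a chained comparison on an activity score. The sketch below shows one way to evaluate such strings without `eval`, assuming every equation is a whitespace-separated chain of numbers, the word `score` and comparison operators, as in the five rows shown. `satisfies` and `match_phenotype` are hypothetical helpers; PyPGx's own handling may differ.

```python
import operator

OPS = {
    '<': operator.lt, '<=': operator.le, '==': operator.eq,
    '>=': operator.ge, '>': operator.gt,
}

# The five rows printed by df.head() above.
EQUATIONS = [
    ('CYP2C9', 'Poor Metabolizer', '0 <= score < 1'),
    ('CYP2C9', 'Intermediate Metabolizer', '1 <= score < 2'),
    ('CYP2C9', 'Normal Metabolizer', '2 == score'),
    ('CYP2D6', 'Poor Metabolizer', '0 <= score < 0.25'),
    ('CYP2D6', 'Intermediate Metabolizer', '0.25 <= score < 1.25'),
]


def satisfies(equation, score):
    """Evaluate a chained comparison such as '0.25 <= score < 1.25'."""
    tokens = equation.split()
    values = [score if t == 'score' else float(t) for t in tokens[0::2]]
    ops = [OPS[t] for t in tokens[1::2]]
    return all(op(a, b) for op, a, b in zip(ops, values, values[1:]))


def match_phenotype(gene, score):
    """Return the first phenotype whose equation holds, or None."""
    for row_gene, phenotype, equation in EQUATIONS:
        if row_gene == gene and satisfies(equation, score):
            return phenotype
    return None


print(match_phenotype('CYP2D6', 1.0))
print(match_phenotype('CYP2C9', 2.0))
```

Scores that fall outside the rows listed here return `None`, because only the first five rows of the table are reproduced.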

pypgx.api.core.load_gene_table()

Load the gene table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_gene_table()
>>> df.head()
      Gene  Target  Control Paralog  Variants     SV PhenotypeMethod  RefAllele GRCh37Default GRCh38Default Strand           GRCh37Region           GRCh38Region                                   GRCh37ExonStarts                                     GRCh37ExonEnds                                   GRCh38ExonStarts                                     GRCh38ExonEnds
0    ABCB1    True    False     NaN      True  False             NaN         *1            *2            *2      -    7:87130178-87345639    7:87500862-87716323  87133178,87135212,87138590,87144546,87145824,8...  87133765,87135359,87138797,87144744,87145981,8...  87503862,87505896,87509274,87515230,87516508,8...  87504449,87506043,87509481,87515428,87516665,8...
1  CACNA1S    True    False     NaN      True  False       Diplotype  Reference     Reference     Reference      -  1:201005639-201084694  1:201036511-201115426  201008639,201009358,201009749,201010631,201012...  201009210,201009502,201009841,201010717,201012...  201039511,201040230,201040621,201041503,201043...  201040082,201040374,201040713,201041589,201043...
2     CFTR    True    False     NaN      True  False       Diplotype  Reference     Reference     Reference      +  7:117117016-117311719  7:117477024-117671665  117120016,117144306,117149087,117170952,117174...  117120201,117144417,117149196,117171168,117174...  117480024,117504252,117509033,117530898,117534...  117480147,117504363,117509142,117531114,117534...
3   CYP1A1    True    False     NaN      True  False             NaN         *1            *1            *1      -   15:75008882-75020951   15:74716541-74728528  75011882,75013307,75013539,75013754,75013931,7...  75013115,75013394,75013663,75013844,75014058,7...  74719541,74720966,74721198,74721413,74721590,7...  74720774,74721053,74721322,74721503,74721717,7...
4   CYP1A2    True    False     NaN      True  False             NaN        *1A           *1A           *1A      +   15:75038183-75051941   15:74745844-74759607  75041183,75042070,75043529,75044105,75044464,7...  75041238,75042910,75043650,75044195,75044588,7...  74748844,74749729,74751188,74751764,74752123,7...  74748897,74750569,74751309,74751854,74752247,7...

pypgx.api.core.load_phenotype_table()

Load the phenotype table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_phenotype_table()
>>> df.head()
      Gene                              Phenotype                     Priority
0  CACNA1S               Uncertain Susceptibility                  Normal Risk
1  CACNA1S  Malignant Hyperthermia Susceptibility  Abnormal/Priority/High Risk
2     CFTR                     Favorable Response                         None
3     CFTR                   Unfavorable Response                         None
4     CFTR                          Indeterminate                         None

pypgx.api.core.load_recommendation_table()

Load the recommendation table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_recommendation_table()
>>> df.head()
         Drug   Gene1                         Phenotype1 Gene2 Phenotype2                                     Recommendation
0  tacrolimus  CYP3A5                 Normal Metabolizer  None       None  Increase starting dose 1.5 to 2 times recommen...
1  tacrolimus  CYP3A5           Intermediate Metabolizer  None       None  Increase starting dose 1.5 to 2 times recommen...
2  tacrolimus  CYP3A5  Possible Intermediate Metabolizer  None       None                                               None
3  tacrolimus  CYP3A5                   Poor Metabolizer  None       None  Initiate therapy with standard recommended dos...
4  tacrolimus  CYP3A5                      Indeterminate  None       None                                               None

pypgx.api.core.load_variant_table()

Load the variant table.

Returns

Requested table.

Return type

pandas.DataFrame

Examples

>>> import pypgx
>>> df = pypgx.load_variant_table()
>>> df.head()

pypgx.api.core.predict_phenotype(gene, a, b)

Predict phenotype based on two haplotype calls.

The method can handle star alleles with structural variation including gene deletion, duplication, and tandem arrangement.

For detailed implementation, please see the Phenotype prediction section.

Parameters
  • gene (str) – Target gene.

  • a, b (str) – Star allele for each haplotype. The order of alleles does not matter.

Returns

Phenotype prediction.

Return type

str

Examples

>>> import pypgx
>>> pypgx.predict_phenotype('CYP2D6', '*4', '*5')   # Both alleles have no function
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*5', '*4')   # The order of alleles does not matter
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*22')  # *22 has uncertain function
'Indeterminate'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*1x2') # Gene duplication
'Ultrarapid Metabolizer'
>>> pypgx.predict_phenotype('CYP2B6', '*1', '*4')   # *4 has increased function
'Rapid Metabolizer'

pypgx.api.core.predict_score(gene, allele)

Predict activity score based on haplotype call.

The method can handle star alleles with structural variation including gene deletion, duplication, and tandem arrangement.

Note that the method will return NaN for alleles with uncertain function as well as for alleles from a gene that does not use the activity score system.

For detailed implementation, please see the Phenotype prediction section.

Parameters
  • gene (str) – Gene name.

  • allele (str) – Star allele.

Returns

Activity score.

Return type

float

See also

get_score

Get matched score from the allele table.

Examples

Here are some examples for the CYP2D6 gene:

>>> import pypgx
>>> pypgx.predict_score('CYP2D6', '*1')            # Allele with normal function
1.0
>>> pypgx.predict_score('CYP2D6', '*1x2')          # Gene duplication of *1
2.0
>>> pypgx.predict_score('CYP2D6', '*1x4')          # Gene multiplication of *1
4.0
>>> pypgx.predict_score('CYP2D6', '*4')            # Allele with no function
0.0
>>> pypgx.predict_score('CYP2D6', '*4x2')          # Gene duplication of *4
0.0
>>> pypgx.predict_score('CYP2D6', '*22')           # Allele with uncertain function
nan
>>> pypgx.predict_score('CYP2D6', '*22x2')         # Gene duplication of *22
nan
>>> pypgx.predict_score('CYP2D6', '*36+*10')       # Tandem arrangement
0.25
>>> pypgx.predict_score('CYP2D6', '*1x2+*4x2+*10') # Complex event
2.25

We can also predict activity score for the DPYD gene:

>>> pypgx.predict_score('DPYD', 'Reference')
1.0
>>> pypgx.predict_score('DPYD', 'c.1905+1G>A (*2A)')
0.0
>>> pypgx.predict_score('DPYD', 'c.295_298delTCAT (*7)')
0.0
>>> pypgx.predict_score('DPYD', 'c.703C>T (*8)')
0.0

All CYP2B6 alleles will return NaN because the gene does not use the activity score system:

>>> pypgx.predict_score('CYP2B6', '*1')
nan
>>> pypgx.predict_score('CYP2B6', '*2')
nan
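
The CYP2D6 scores above are consistent with a simple additive rule: each component of a tandem arrangement contributes its per-copy score multiplied by its copy number. The sketch below illustrates that rule only. It is not PyPGx's implementation, and `SCORES` and `sketch_score` are hypothetical names whose values are copied from the examples above (the `*36` value is the one implied by `'*36+*10'` returning 0.25).

```python
import math

# Hypothetical per-copy activity scores, taken from the CYP2D6 examples above.
SCORES = {'*1': 1.0, '*4': 0.0, '*10': 0.25, '*36': 0.0, '*22': math.nan}

def sketch_score(allele):
    """Sum per-copy scores over the components of a star allele string."""
    total = 0.0
    for part in allele.split('+'):              # e.g. '*1x2+*4x2+*10'
        name, _, copies = part.partition('x')   # '*1x2' -> ('*1', 'x', '2')
        total += SCORES[name] * (int(copies) if copies else 1)
    return total
```

Because NaN propagates through addition and multiplication, any component with uncertain function makes the whole result NaN, which matches the `'*22x2'` example.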

pypgx.api.core.sort_alleles(alleles, by='priority', gene=None, assembly='GRCh37')[source]

Sort star alleles by either priority or name.

By default (by='priority') the method reports high priority alleles first. Alleles are sorted by the following criteria, in order: 1. allele function (e.g. ‘No Function’ > ‘Normal Function’), 2. number of core variants (e.g. three SNVs > one SNV), 3. number of core variants that impact protein coding (e.g. two missense variants > one missense variant plus one intron variant), and 4. reference allele status (e.g. non-reference allele with two SNVs > reference allele with two SNVs such that CYP2D6*46 > CYP2D6*1 in GRCh37). Note that the priority of allele function decreases in the following order: ‘No Function’, ‘Decreased Function’, ‘Possible Decreased Function’, ‘Increased Function’, ‘Possible Increased Function’, ‘Uncertain Function’, ‘Unknown Function’, ‘Normal Function’.

When by='name' the method will report alleles with a smaller number first. This means, for example, ‘*4’ will come before ‘*10’ whereas lexicographic sorting would produce the opposite result. This is particularly useful when forming a diplotype (e.g. ‘*4/*10’ vs. ‘*10/*4’).

Parameters
  • alleles (list) – List of alleles.

  • by ({‘priority’, ‘name’}, default: ‘priority’) – Determines which method to use for sorting alleles:

    • ‘priority’: Report high priority alleles first.

    • ‘name’: Report alleles with a smaller number first.

  • gene (str) – Target gene. Only required when by='priority'.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly. Only relevant when by='priority'.

Returns

Sorted list of alleles.

Return type

list

Examples

Assume we have the following alleles for the CYP2D6 gene:

>>> alleles = ['*1', '*2', '*4', '*10']

We can sort the alleles by their priority with by='priority':

>>> import pypgx
>>> alleles = pypgx.sort_alleles(alleles, by='priority', gene='CYP2D6', assembly='GRCh37')
>>> alleles
['*4', '*10', '*1', '*2']
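
To make the first sorting criterion concrete, here is a minimal sketch that ranks alleles by function alone, using the priority order listed in the description. It ignores the other three criteria (variant counts and reference allele status), so it is only an approximation of the behavior described. `FUNCTION_ORDER`, `FUNCTIONS`, and `sort_by_function` are hypothetical names, and the small lookup stands in for the allele table.

```python
# Priority order of allele function, highest priority first (from the description).
FUNCTION_ORDER = [
    'No Function', 'Decreased Function', 'Possible Decreased Function',
    'Increased Function', 'Possible Increased Function',
    'Uncertain Function', 'Unknown Function', 'Normal Function',
]

# Hypothetical lookup standing in for the allele table.
FUNCTIONS = {
    '*1': 'Normal Function', '*2': 'Normal Function',
    '*4': 'No Function', '*10': 'Decreased Function',
}

def sort_by_function(alleles):
    """Order alleles by function priority only; ties keep their input order."""
    return sorted(alleles, key=lambda a: FUNCTION_ORDER.index(FUNCTIONS[a]))
```

With the four alleles above, function alone is enough to reproduce the order shown, because `sorted` is stable and leaves `'*1'` ahead of `'*2'`.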

We can restore the original order by sorting again with by='name':

>>> alleles = pypgx.sort_alleles(alleles, by='name')
>>> alleles
['*1', '*2', '*4', '*10']

Note that we can also sort alleles by name for genes that do not use the star allele nomenclature (e.g. the DPYD gene):

>>> alleles = ['c.557A>G', 'c.2194G>A (*6)', 'c.496A>G', 'Reference', 'c.1627A>G (*5)']
>>> pypgx.sort_alleles(alleles, by='name')
['Reference', 'c.496A>G', 'c.557A>G', 'c.1627A>G (*5)', 'c.2194G>A (*6)']
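
The difference between name-based and lexicographic sorting can be sketched with a numeric sort key. This is an illustration of the idea under a simplifying assumption (order by the first integer in the name, with names lacking a number placed first), not PyPGx's own code. `name_key` is a hypothetical helper.

```python
import re

def name_key(allele):
    """Sort key: the first integer found in the allele name, or -1 if none."""
    match = re.search(r'\d+', allele)
    return int(match.group()) if match else -1

star = ['*10', '*4', '*2', '*1']
lexicographic = sorted(star)            # compares strings character by character
by_number = sorted(star, key=name_key)  # compares the numeric part
```

Here `lexicographic` places `'*10'` before `'*2'`, while `by_number` yields the order a reader would expect when writing a diplotype.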

genotype

The genotype submodule is primarily used to make final diplotype calls by interpreting candidate star alleles and/or detected structural variants.

Classes:

CYP2A6Genotyper(df, assembly)

Genotyper for CYP2A6.

CYP2B6Genotyper(df, assembly)

Genotyper for CYP2B6.

CYP2D6Genotyper(df, assembly)

Genotyper for CYP2D6.

CYP2E1Genotyper(df, assembly)

Genotyper for CYP2E1.

CYP4F2Genotyper(df, assembly)

Genotyper for CYP4F2.

G6PDGenotyper(df, assembly)

Genotyper for G6PD.

GSTM1Genotyper(df, assembly)

Genotyper for GSTM1.

GSTT1Genotyper(df, assembly)

Genotyper for GSTT1.

SLC22A2Genotyper(df, assembly)

Genotyper for SLC22A2.

SULT1A1Genotyper(df, assembly)

Genotyper for SULT1A1.

SimpleGenotyper(df, gene, assembly)

Genotyper for genes without SV.

UGT1A4Genotyper(df, assembly)

Genotyper for UGT1A4.

UGT2B15Genotyper(df, assembly)

Genotyper for UGT2B15.

UGT2B17Genotyper(df, assembly)

Genotyper for UGT2B17.

Functions:

call_genotypes([alleles, cnv_calls])

Call genotypes for target gene.

class pypgx.api.genotype.CYP2A6Genotyper(df, assembly)[source]

Genotyper for CYP2A6.

class pypgx.api.genotype.CYP2B6Genotyper(df, assembly)[source]

Genotyper for CYP2B6.

class pypgx.api.genotype.CYP2D6Genotyper(df, assembly)[source]

Genotyper for CYP2D6.

class pypgx.api.genotype.CYP2E1Genotyper(df, assembly)[source]

Genotyper for CYP2E1.

class pypgx.api.genotype.CYP4F2Genotyper(df, assembly)[source]

Genotyper for CYP4F2.

class pypgx.api.genotype.G6PDGenotyper(df, assembly)[source]

Genotyper for G6PD.

class pypgx.api.genotype.GSTM1Genotyper(df, assembly)[source]

Genotyper for GSTM1.

class pypgx.api.genotype.GSTT1Genotyper(df, assembly)[source]

Genotyper for GSTT1.

class pypgx.api.genotype.SLC22A2Genotyper(df, assembly)[source]

Genotyper for SLC22A2.

class pypgx.api.genotype.SULT1A1Genotyper(df, assembly)[source]

Genotyper for SULT1A1.

class pypgx.api.genotype.SimpleGenotyper(df, gene, assembly)[source]

Genotyper for genes without SV.

class pypgx.api.genotype.UGT1A4Genotyper(df, assembly)[source]

Genotyper for UGT1A4.

class pypgx.api.genotype.UGT2B15Genotyper(df, assembly)[source]

Genotyper for UGT2B15.

class pypgx.api.genotype.UGT2B17Genotyper(df, assembly)[source]

Genotyper for UGT2B17.

pypgx.api.genotype.call_genotypes(alleles=None, cnv_calls=None)[source]

Call genotypes for target gene.

Parameters
  • alleles (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[Alleles].

  • cnv_calls (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[CNVCalls].

Returns

Archive object with the semantic type SampleTable[Genotypes].

Return type

pypgx.Archive

pipeline

The pipeline submodule is used to provide convenient methods that combine multiple PyPGx actions and automatically handle semantic types.

Functions:

run_chip_pipeline(gene, output, variants[, …])

Run genotyping pipeline for chip data.

run_long_read_pipeline(gene, output, variants)

Run genotyping pipeline for long-read sequencing data.

run_ngs_pipeline(gene, output[, variants, …])

Run genotyping pipeline for NGS data.

pypgx.api.pipeline.run_chip_pipeline(gene, output, variants, assembly='GRCh37', panel=None, impute=False, force=False, samples=None, exclude=False)[source]

Run genotyping pipeline for chip data.

Parameters
  • gene (str) – Target gene.

  • output (str) – Output directory.

  • variants (str) – Input VCF file must be already BGZF compressed (.gz) and indexed (.tbi) to allow random access. Statistical haplotype phasing will be skipped if input VCF is already fully phased.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • panel (str, optional) – VCF file corresponding to a reference haplotype panel (compressed or uncompressed). By default, the 1KGP panel in the pypgx-bundle directory will be used.

  • impute (bool, default: False) – If True, perform imputation of missing genotypes.

  • force (bool, default: False) – Overwrite output directory if it already exists.

  • samples (str or list, optional) – Subset the VCF for specified samples. This can be a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.
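
Since the variants argument must point to a compressed and indexed VCF, a quick pre-flight check can catch obvious mistakes before a pipeline is started. The sketch below is a hypothetical helper (`check_indexed_vcf` is not part of PyPGx). It only inspects the file name and looks for a sibling .tbi file; it does not verify that the content is actually BGZF compressed.

```python
import os

def check_indexed_vcf(path):
    """Return a list of problems with a VCF path; an empty list means none were found."""
    problems = []
    if not path.endswith('.gz'):
        problems.append('file name does not end with .gz')
    if not os.path.exists(path + '.tbi'):
        problems.append('index file (.tbi) not found next to the VCF')
    return problems
```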

pypgx.api.pipeline.run_long_read_pipeline(gene, output, variants, assembly='GRCh37', force=False, samples=None, exclude=False)[source]

Run genotyping pipeline for long-read sequencing data.

Parameters
  • gene (str) – Target gene.

  • output (str) – Output directory.

  • variants (str) – Input VCF file must be already BGZF compressed (.gz) and indexed (.tbi) to allow random access.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • force (bool, default: False) – Overwrite output directory if it already exists.

  • samples (str or list, optional) – Subset the VCF for specified samples. This can be a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.

pypgx.api.pipeline.run_ngs_pipeline(gene, output, variants=None, depth_of_coverage=None, control_statistics=None, platform='WGS', assembly='GRCh37', panel=None, force=False, samples=None, exclude=False, samples_without_sv=None, do_not_plot_copy_number=False, do_not_plot_allele_fraction=False, cnv_caller=None)[source]

Run genotyping pipeline for NGS data.

During copy number analysis, if the input data is targeted sequencing, the method will apply inter-sample normalization using summary statistics across all samples. For best results, it is recommended to specify known samples without SV using samples_without_sv.

Parameters
  • gene (str) – Target gene.

  • output (str) – Output directory.

  • variants (str, optional) – Input VCF file must be already BGZF compressed (.gz) and indexed (.tbi) to allow random access. Statistical haplotype phasing will be skipped if input VCF is already fully phased.

  • depth_of_coverage (str or pypgx.Archive, optional) – Archive file or object with the semantic type CovFrame[DepthOfCoverage].

  • control_statistics (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[Statistics].

  • platform ({‘WGS’, ‘Targeted’}, default: ‘WGS’) – Genotyping platform.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • panel (str, optional) – VCF file corresponding to a reference haplotype panel (compressed or uncompressed). By default, the 1KGP panel in the pypgx-bundle directory will be used.

  • force (bool, default: False) – Overwrite output directory if it already exists.

  • samples (str or list, optional) – Subset the VCF for specified samples. This can be a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.

  • samples_without_sv (list, optional) – List of known samples without SV.

  • do_not_plot_copy_number (bool, default: False) – Do not plot copy number profile.

  • do_not_plot_allele_fraction (bool, default: False) – Do not plot allele fraction profile.

  • cnv_caller (str or pypgx.Archive, optional) – Archive file or object with the semantic type Model[CNV]. By default, a pre-trained CNV caller in the pypgx-bundle directory will be used.

plot

The plot submodule is used to plot various kinds of profiles such as read depth, copy number, and allele fraction.

Functions:

plot_bam_copy_number(copy_number[, fitted, …])

Plot copy number profile from CovFrame[CopyNumber].

plot_bam_read_depth(read_depth[, path, …])

Plot read depth profile with BAM data.

plot_cn_af(copy_number, imported_variants[, …])

Plot both copy number profile and allele fraction profile in one figure.

plot_vcf_allele_fraction(imported_variants)

Plot allele fraction profile with VCF data.

plot_vcf_read_depth(gene, vcf[, assembly, …])

Plot read depth profile with VCF data.

pypgx.api.plot.plot_bam_copy_number(copy_number, fitted=False, path=None, samples=None, ymin=-0.3, ymax=6.3, fontsize=25)[source]

Plot copy number profile from CovFrame[CopyNumber].

Parameters
  • copy_number (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[CopyNumber].

  • fitted (bool, default: False) – If True, show the fitted line as well.

  • path (str, optional) – Create plots in this directory (default: current directory). Use path='-' to return a list of matplotlib.figure.Figure objects instead of writing files.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • ymin (float, default: -0.3) – Y-axis bottom.

  • ymax (float, default: 6.3) – Y-axis top.

  • fontsize (float, default: 25) – Text fontsize.

Returns

Output type depends on path.

Return type

None or list

pypgx.api.plot.plot_bam_read_depth(read_depth, path=None, samples=None, ymin=None, ymax=None, fontsize=25)[source]

Plot read depth profile with BAM data.

Parameters
  • read_depth (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[ReadDepth].

  • path (str, optional) – Create plots in this directory (default: current directory). Use path='-' to return a list of matplotlib.figure.Figure objects instead of writing files.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • ymin (float, optional) – Y-axis bottom.

  • ymax (float, optional) – Y-axis top.

  • fontsize (float, default: 25) – Text fontsize.

Returns

Output type depends on path.

Return type

None or list

pypgx.api.plot.plot_cn_af(copy_number, imported_variants, path=None, samples=None, ymin=-0.3, ymax=6.3, fontsize=25)[source]

Plot both copy number profile and allele fraction profile in one figure.

Parameters
  • copy_number (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[CopyNumber].

  • imported_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Imported] or VcfFrame[Consolidated].

  • path (str, optional) – Create plots in this directory (default: current directory). Use path='-' to return a list of matplotlib.figure.Figure objects instead of writing files.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • ymin (float, default: -0.3) – Y-axis bottom.

  • ymax (float, default: 6.3) – Y-axis top.

  • fontsize (float, default: 25) – Text fontsize.

Returns

Output type depends on path.

Return type

None or list

pypgx.api.plot.plot_vcf_allele_fraction(imported_variants, path=None, samples=None, fontsize=25)[source]

Plot allele fraction profile with VCF data.

Parameters
  • imported_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Imported] or VcfFrame[Consolidated].

  • path (str, optional) – Create plots in this directory (default: current directory). Use path='-' to return a list of matplotlib.figure.Figure objects instead of writing files.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • fontsize (float, default: 25) – Text fontsize.

Returns

Output type depends on path.

Return type

None or list

pypgx.api.plot.plot_vcf_read_depth(gene, vcf, assembly='GRCh37', path=None, samples=None, ymin=None, ymax=None)[source]

Plot read depth profile with VCF data.

Parameters
  • gene (str) – Target gene.

  • vcf (str) – VCF file.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • path (str, optional) – Create plots in this directory (default: current directory). Use path='-' to return a list of matplotlib.figure.Figure objects instead of writing files.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • ymin (float, optional) – Y-axis bottom.

  • ymax (float, optional) – Y-axis top.

Returns

Output type depends on path.

Return type

None or list

utils

The utils submodule contains main actions of PyPGx.

Functions:

call_phenotypes(genotypes)

Call phenotypes for target gene.

combine_results([genotypes, phenotypes, …])

Combine various results for the target gene.

compare_genotypes(first, second[, verbose])

Calculate concordance between two genotype results.

compute_control_statistics(gene, bams[, …])

Compute summary statistics for control gene from BAM files.

compute_copy_number(read_depth, …[, …])

Compute copy number from read depth for target gene.

compute_target_depth(gene, bams[, assembly, bed])

Compute read depth for target gene from BAM files.

count_alleles(results)

Count star alleles from genotype calls.

create_consolidated_vcf(imported_variants, …)

Create a consolidated VCF file.

create_input_vcf(vcf, fasta, bams[, …])

Call SNVs/indels from BAM files for all target genes.

create_regions_bed([assembly, …])

Create a BED file which contains all regions used by PyPGx.

estimate_phase_beagle(imported_variants[, …])

Estimate haplotype phase of observed variants with the Beagle program.

filter_samples(archive, samples[, exclude])

Filter Archive for specified samples.

import_read_depth(gene, depth_of_coverage[, …])

Import read depth data for target gene.

import_variants(gene, vcf[, assembly, …])

Import SNV/indel data for target gene.

predict_alleles(consolidated_variants)

Predict candidate star alleles based on observed SNVs and indels.

predict_cnv(copy_number[, cnv_caller])

Predict CNV from copy number data for target gene.

prepare_depth_of_coverage(bams[, assembly, …])

Prepare a depth of coverage file for all target genes with SV from BAM files.

print_data(input)

Print the main data of specified archive.

print_metadata(input)

Print the metadata of specified archive.

slice_bam(input, output[, assembly, genes, …])

Slice BAM file for all genes used by PyPGx.

test_cnv_caller(cnv_caller, copy_number, …)

Test a CNV caller for the target gene.

train_cnv_caller(copy_number, cnv_calls[, …])

Train a CNV caller for the target gene.

pypgx.api.utils.call_phenotypes(genotypes)[source]

Call phenotypes for target gene.

Parameters

genotypes (str or pypgx.Archive) – Archive file or object with the semantic type SampleTable[Genotypes].

Returns

Archive object with the semantic type SampleTable[Phenotypes].

Return type

pypgx.Archive

pypgx.api.utils.combine_results(genotypes=None, phenotypes=None, alleles=None, cnv_calls=None)[source]

Combine various results for the target gene.

Parameters
  • genotypes (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[Genotypes].

  • phenotypes (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[Phenotypes].

  • alleles (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[Alleles].

  • cnv_calls (str or pypgx.Archive, optional) – Archive file or object with the semantic type SampleTable[CNVCalls].

Returns

Archive object with the semantic type SampleTable[Results].

Return type

pypgx.Archive
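
The semantic types documented for the individual actions suggest how they chain together for a gene without structural variation. The following sketch wires them up in that order. It assumes the input VCF is unphased (so that import_variants returns VcfFrame[Imported]) and that no CNV calls are needed. `genotype_from_variants` is a hypothetical wrapper, and the two submodules are passed in as arguments so the function can also be exercised with stand-in objects.

```python
def genotype_from_variants(gene, vcf, utils, genotype, assembly='GRCh37'):
    """Chain the SNV/indel-based actions; `utils` and `genotype` are pypgx.api submodules."""
    imported = utils.import_variants(gene, vcf, assembly=assembly)   # VcfFrame[Imported]
    phased = utils.estimate_phase_beagle(imported)                   # VcfFrame[Phased]
    consolidated = utils.create_consolidated_vcf(imported, phased)   # VcfFrame[Consolidated]
    alleles = utils.predict_alleles(consolidated)                    # SampleTable[Alleles]
    genotypes = genotype.call_genotypes(alleles=alleles)             # SampleTable[Genotypes]
    phenotypes = utils.call_phenotypes(genotypes)                    # SampleTable[Phenotypes]
    return utils.combine_results(
        genotypes=genotypes, phenotypes=phenotypes, alleles=alleles)  # SampleTable[Results]
```

A call might look like `genotype_from_variants('CYP3A5', 'variants.vcf.gz', utils, genotype)` after `from pypgx.api import utils, genotype`, where the file name is a placeholder. In practice the pipeline submodule covers this use case and handles the semantic types automatically.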

pypgx.api.utils.compare_genotypes(first, second, verbose=False)[source]

Calculate concordance between two genotype results.

Only samples that appear in both genotype results will be used to calculate concordance for genotype calls as well as CNV calls.

Parameters
  • first (str or pypgx.Archive) – First archive file or object with the semantic type SampleTable[Results].

  • second (str or pypgx.Archive) – Second archive file or object with the semantic type SampleTable[Results].

  • verbose (bool, default: False) – If True, print the verbose version of output, including discordant calls.

Examples

>>> import pypgx
>>> pypgx.compare_genotypes('results-1.zip', 'results-2.zip')
# Genotype
Total: 100
Compared: 100
Concordance: 1.000 (100/100)
# CNV
Total: 100
Compared: 100
Concordance: 1.000 (100/100)
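
The concordance figure can be understood as the fraction of shared samples whose calls agree. The sketch below computes it for two plain dictionaries. It is an illustration rather than the library's code: `concordance` is a hypothetical helper, and it assumes calls are compared as exact strings.

```python
def concordance(first, second):
    """Compare two {sample: call} mappings over the samples they share."""
    shared = sorted(set(first) & set(second))   # samples present in both results
    matches = sum(first[s] == second[s] for s in shared)
    rate = matches / len(shared) if shared else float('nan')
    return len(shared), matches, rate

first = {'A': '*1/*4', 'B': '*1/*1', 'C': '*2/*2'}
second = {'A': '*1/*4', 'B': '*1/*2', 'D': '*1/*1'}
compared, matched, rate = concordance(first, second)
print(f'Concordance: {rate:.3f} ({matched}/{compared})')
```

Samples C and D appear in only one result each, so they are left out of the comparison.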

pypgx.api.utils.compute_control_statistics(gene, bams, assembly='GRCh37', bed=None)[source]

Compute summary statistics for control gene from BAM files.

Note that for the arguments gene and bed, the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the input BAM’s contig names.

Parameters
  • gene (str) – Control gene (recommended choices: ‘EGFR’, ‘RYR1’, ‘VDR’). Alternatively, you can provide a custom region (format: chrom:start-end).

  • bams (str or list) – One or more input BAM files. Alternatively, you can provide a text file (.txt, .tsv, .csv, or .list) containing one BAM file per line.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • bed (str, optional) – By default, the input data is assumed to be WGS. If it’s targeted sequencing, you must provide a BED file to indicate probed regions.

Returns

Archive object with the semantic type SampleTable[Statistics].

Return type

pypgx.Archive
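
The automatic handling of the ‘chr’ prefix mentioned for compute_control_statistics can be pictured as follows. This is a minimal sketch of the idea, not the library's implementation; `match_chr_prefix` is a hypothetical helper that assumes the reference contig names consistently either use the prefix or omit it.

```python
def match_chr_prefix(contig, reference_contigs):
    """Add or remove a leading 'chr' so `contig` follows the style of `reference_contigs`."""
    wants_prefix = any(name.startswith('chr') for name in reference_contigs)
    has_prefix = contig.startswith('chr')
    if wants_prefix and not has_prefix:
        return 'chr' + contig
    if has_prefix and not wants_prefix:
        return contig[3:]
    return contig
```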

pypgx.api.utils.compute_copy_number(read_depth, control_statistics, samples_without_sv=None)[source]

Compute copy number from read depth for target gene.

The method will convert read depth from target gene to copy number by performing intra-sample normalization using summary statistics from control gene.

If the input data was generated with targeted sequencing as opposed to WGS, the method will also apply inter-sample normalization using summary statistics across all samples. For best results, it is recommended to manually specify a list of known reference samples that do not have SV.

Parameters
  • read_depth (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[ReadDepth].

  • control_statistics (str or pypgx.Archive) – Archive file or object with the semantic type SampleTable[Statistics].

  • samples_without_sv (list, optional) – List of known samples without SV.

Returns

Archive object with the semantic type CovFrame[CopyNumber].

Return type

pypgx.Archive
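
As a rough illustration of the intra-sample normalization described for compute_copy_number, the sketch below scales target-gene read depth by the depth of a control region within the same sample. It rests on two assumptions not stated in this section: the control region has two copies, and the mean is the summary statistic used. `depth_to_copy_number` is a hypothetical name.

```python
def depth_to_copy_number(target_depth, control_depth, ploidy=2):
    """Scale per-position target depth by the mean depth of a control region."""
    control_mean = sum(control_depth) / len(control_depth)
    return [ploidy * depth / control_mean for depth in target_depth]

# One sample: control depth averages 30, so a target depth of 15 suggests one copy.
copy_number = depth_to_copy_number([30, 15, 45], [28, 30, 32])
```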

pypgx.api.utils.compute_target_depth(gene, bams, assembly='GRCh37', bed=None)[source]

Compute read depth for target gene from BAM files.

By default, the input data is assumed to be WGS. If it’s targeted sequencing, you must provide a BED file with bed to indicate probed regions.

Parameters
  • gene (str) – Target gene.

  • bams (str or list) – One or more input BAM files. Alternatively, you can provide a text file (.txt, .tsv, .csv, or .list) containing one BAM file per line.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • bed (str, optional) – BED file.

Returns

Archive object with the semantic type CovFrame[ReadDepth].

Return type

pypgx.Archive

pypgx.api.utils.count_alleles(results)[source]

Count star alleles from genotype calls.

pypgx.api.utils.create_consolidated_vcf(imported_variants, phased_variants)[source]

Create a consolidated VCF file.

Parameters
  • imported_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Imported].

  • phased_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Phased].

Returns

Archive object with the semantic type VcfFrame[Consolidated].

Return type

pypgx.Archive

pypgx.api.utils.create_input_vcf(vcf, fasta, bams, assembly='GRCh37', genes=None, exclude=False, dir_path=None, max_depth=250)[source]

Call SNVs/indels from BAM files for all target genes.

To save computing resources, this method will call variants only for target genes that have at least one star allele defined by SNVs/indels. Therefore, variants will not be called for target genes whose star alleles are defined only by structural variation (e.g. UGT2B17).

Parameters
  • vcf (str) – Output VCF file. It must have .vcf.gz as suffix.

  • fasta (str) – Reference FASTA file.

  • bams (str or list) – One or more input BAM files. Alternatively, you can provide a text file (.txt, .tsv, .csv, or .list) containing one BAM file per line.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • genes (list, optional) – List of genes to include.

  • exclude (bool, default: False) – Exclude specified genes. Ignored when genes=None.

  • dir_path (str, optional) – By default, intermediate files (likelihoods.bcf, calls.bcf, and calls.normalized.bcf) will be stored in a temporary directory, which is automatically deleted after creating final VCF. If you provide a directory path, intermediate files will be stored there.

  • max_depth (int, default: 250) – At each position, read at most this many reads per input file. If your input data is from WGS (e.g. 30X), you don’t need to change this option. However, if it’s from targeted sequencing with ultra-deep coverage (e.g. 500X), then you need to increase the maximum depth.

pypgx.api.utils.create_regions_bed(assembly='GRCh37', add_chr_prefix=False, merge=False, target_genes=False, sv_genes=False, var_genes=False, genes=None, exclude=False)[source]

Create a BED file which contains all regions used by PyPGx.

Parameters
  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • add_chr_prefix (bool, default: False) – Whether to add the ‘chr’ string in contig names.

  • merge (bool, default: False) – Whether to merge overlapping intervals (gene names will be removed too).

  • target_genes (bool, default: False) – Whether to only return target genes, excluding control genes and paralogs.

  • sv_genes (bool, default: False) – Whether to only return target genes that have at least one star allele defined by structural variation.

  • var_genes (bool, default: False) – Whether to only return target genes that have at least one star allele defined by SNVs/indels.

  • genes (list, optional) – List of genes to include.

  • exclude (bool, default: False) – Exclude specified genes. Ignored when genes=None.

Returns

BED file.

Return type

fuc.api.pybed.BedFrame

Examples

>>> import pypgx
>>> bf = pypgx.create_regions_bed()
>>> bf.gr.df.head()
  Chromosome      Start        End     Name
0          1  201005639  201084694  CACNA1S
1          1   60355979   60395470   CYP2J2
2          1   47391859   47410148  CYP4A11
3          1   47600112   47618399  CYP4A22
4          1   47261669   47288021   CYP4B1
>>> bf = pypgx.create_regions_bed(assembly='GRCh38')
>>> bf.gr.df.head()
  Chromosome      Start        End     Name
0          1  201036511  201115426  CACNA1S
1          1   59890307   59929773   CYP2J2
2          1   46926187   46944476  CYP4A11
3          1   47134440   47152727  CYP4A22
4          1   46796045   46822413   CYP4B1
>>> bf = pypgx.create_regions_bed(add_chr_prefix=True)
>>> bf.gr.df.head()
  Chromosome      Start        End     Name
0       chr1  201005639  201084694  CACNA1S
1       chr1   60355979   60395470   CYP2J2
2       chr1   47391859   47410148  CYP4A11
3       chr1   47600112   47618399  CYP4A22
4       chr1   47261669   47288021   CYP4B1
>>> bf = pypgx.create_regions_bed(merge=True)
>>> bf.gr.df.head()
  Chromosome     Start       End
0          1  47261669  47288021
1          1  47391859  47410148
2          1  47600112  47618399
3          1  60355979  60395470
4          1  97540298  98389615
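
The effect of merge=True (overlapping intervals collapsed, gene names dropped) can be sketched in plain Python. This is an illustration of interval merging in general, not the code PyPGx uses. `merge_intervals` is a hypothetical helper, and treating intervals that merely touch as overlapping is a choice made for this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals; names are not kept."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(item) for item in merged]
```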

pypgx.api.utils.estimate_phase_beagle(imported_variants, panel=None, impute=False)[source]

Estimate haplotype phase of observed variants with the Beagle program.

Parameters
  • imported_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Imported]. The ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the reference VCF’s contig names.

  • panel (str, optional) – VCF file corresponding to a reference haplotype panel (compressed or uncompressed). By default, the 1KGP panel in the pypgx-bundle directory will be used.

  • impute (bool, default: False) – If True, perform imputation of missing genotypes.

Returns

Archive object with the semantic type VcfFrame[Phased].

Return type

pypgx.Archive

pypgx.api.utils.filter_samples(archive, samples, exclude=False)[source]

Filter Archive for specified samples.

Parameters
  • archive (str or pypgx.Archive) – Archive file or object.

  • samples (str or list) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Filtered Archive object.

Return type

pypgx.Archive
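
Several actions accept samples either as a list or as a text file with one sample per line, optionally combined with exclude. The sketch below shows one way such an argument could be resolved and applied to a list of sample names. `select_samples` is a hypothetical helper for illustration and does not operate on Archive objects.

```python
def select_samples(all_samples, samples, exclude=False):
    """Keep (or drop, when exclude=True) the requested samples, preserving order."""
    if isinstance(samples, str):
        # Treat a string as a path to a text file with one sample per line.
        with open(samples) as handle:
            samples = [line.strip() for line in handle if line.strip()]
    requested = set(samples)
    return [s for s in all_samples if (s in requested) != exclude]
```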

pypgx.api.utils.import_read_depth(gene, depth_of_coverage, samples=None, exclude=False)[source]

Import read depth data for target gene.

Parameters
  • gene (str) – Gene name.

  • depth_of_coverage (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[DepthOfCoverage].

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Archive object with the semantic type CovFrame[ReadDepth].

Return type

pypgx.Archive

pypgx.api.utils.import_variants(gene, vcf, assembly='GRCh37', platform='WGS', samples=None, exclude=False)[source]

Import SNV/indel data for target gene.

The method will slice the input VCF for the target gene to create an archive object with the semantic type VcfFrame[Imported] or VcfFrame[Consolidated].

Parameters
  • gene (str) – Target gene.

  • vcf (str or fuc.api.pyvcf.VcfFrame) – Input VCF file must be already BGZF compressed (.gz) and indexed (.tbi) to allow random access. Alternatively, you can provide a VcfFrame object.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • platform ({‘WGS’, ‘Targeted’, ‘Chip’, ‘LongRead’}, default: ‘WGS’) – Genotyping platform used. When the platform is ‘WGS’, ‘Targeted’, or ‘Chip’, the method will assess whether every genotype call in the sliced VCF is haplotype phased (e.g. ‘0|1’). If the sliced VCF is fully phased, the method will return VcfFrame[Consolidated] or otherwise VcfFrame[Imported]. When the platform is ‘LongRead’, the method will return VcfFrame[Consolidated] after applying the phase-extension algorithm to estimate haplotype phase of any variants that could not be resolved by read-backed phasing.

  • samples (str or list, optional) – Specify which samples should be included for analysis by providing a text file (.txt, .tsv, .csv, or .list) containing one sample per line. Alternatively, you can provide a list of samples.

  • exclude (bool, default: False) – If True, exclude specified samples.

Returns

Archive object with the semantic type VcfFrame[Imported] or VcfFrame[Consolidated].

Return type

pypgx.Archive
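
The rule that decides the returned semantic type in import_variants can be summarized in a few lines. The sketch below mirrors the description of the platform parameter. Both function names are hypothetical, and the phase check is simplified: it only looks at the separator in diploid genotype strings and does not consider missing or haploid calls.

```python
def is_fully_phased(genotypes):
    """Return True if every genotype string uses the phased separator '|'."""
    return all('|' in gt and '/' not in gt for gt in genotypes)

def expected_semantic_type(platform, genotypes):
    """Semantic type that the description above implies for the given input."""
    if platform == 'LongRead' or is_fully_phased(genotypes):
        return 'VcfFrame[Consolidated]'
    return 'VcfFrame[Imported]'
```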

pypgx.api.utils.predict_alleles(consolidated_variants)[source]

Predict candidate star alleles based on observed SNVs and indels.

Parameters

consolidated_variants (str or pypgx.Archive) – Archive file or object with the semantic type VcfFrame[Consolidated].

Returns

Archive object with the semantic type SampleTable[Alleles].

Return type

pypgx.Archive

pypgx.api.utils.predict_cnv(copy_number, cnv_caller=None)[source]

Predict CNV from copy number data for target gene.

Genomic positions with missing copy number (for example, because the input data is from targeted sequencing) will be imputed by forward filling.

Parameters
  • copy_number (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[CopyNumber].

  • cnv_caller (str or pypgx.Archive, optional) – Archive file or object with the semantic type Model[CNV]. By default, a pre-trained CNV caller in the pypgx-bundle directory will be used.

Returns

Archive object with the semantic type SampleTable[CNVCalls].

Return type

pypgx.Archive
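
Forward filling, as mentioned for predict_cnv, replaces each missing value with the most recent observed value before it. A minimal pure-Python sketch, with `forward_fill` as a hypothetical name and `None` standing in for a missing copy number:

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Positions 2-3 and 5 were not covered, so they inherit the preceding value.
profile = forward_fill([2.0, None, None, 1.0, None])
```

Note that a missing value at the very start has nothing to inherit and stays missing in this sketch.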

pypgx.api.utils.prepare_depth_of_coverage(bams, assembly='GRCh37', bed=None, genes=None, exclude=False)[source]

Prepare a depth of coverage file for all target genes with SV from BAM files.

To save computing resources, this method will count read depth only for target genes that have at least one star allele defined by structural variation. Therefore, read depth will not be computed for target genes whose star alleles are defined only by SNVs/indels (e.g. CYP3A5).

Parameters
  • bams (str or list) – One or more input BAM files. Alternatively, you can provide a text file (.txt, .tsv, .csv, or .list) containing one BAM file per line.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • bed (str, optional) – By default, the input data is assumed to be WGS. If it’s targeted sequencing, you must provide a BED file to indicate probed regions. Note that the ‘chr’ prefix in contig names (e.g. ‘chr1’ vs. ‘1’) will be automatically added or removed as necessary to match the input BAM’s contig names.

  • genes (list, optional) – List of genes to include.

  • exclude (bool, default: False) – Exclude specified genes. Ignored when genes=None.

Returns

Archive object with the semantic type CovFrame[DepthOfCoverage].

Return type

pypgx.Archive
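
The automatic handling of the 'chr' prefix described for the bed parameter can be sketched as follows. The helper match_chr_prefix is hypothetical and only illustrates the idea of adapting a contig name to the naming style found in the BAM file; it is not taken from the PyPGx source.

```python
def match_chr_prefix(contig, bam_contigs):
    """Add or remove 'chr' so that contig follows the BAM's naming style."""
    bam_has_prefix = any(name.startswith("chr") for name in bam_contigs)
    if bam_has_prefix and not contig.startswith("chr"):
        return "chr" + contig
    if not bam_has_prefix and contig.startswith("chr"):
        return contig[len("chr"):]
    # Already consistent with the BAM file.
    return contig


to_prefixed = match_chr_prefix("22", ["chr1", "chr22"])
to_bare = match_chr_prefix("chr22", ["1", "22"])
```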

pypgx.api.utils.print_data(input)[source]

Print the main data of specified archive.

Parameters

input (pypgx.Archive) – Archive file.

pypgx.api.utils.print_metadata(input)[source]

Print the metadata of specified archive.

Parameters

input (pypgx.Archive) – Archive file.

pypgx.api.utils.slice_bam(input, output, assembly='GRCh37', genes=None, exclude=False)[source]

Slice BAM file for all genes used by PyPGx.

Parameters
  • input (str) – Input BAM file. It must already be indexed to allow random access.

  • output (str) – Output BAM file.

  • assembly ({‘GRCh37’, ‘GRCh38’}, default: ‘GRCh37’) – Reference genome assembly.

  • genes (list, optional) – List of genes to include.

  • exclude (bool, default: False) – Exclude specified genes. Ignored when genes=None.

pypgx.api.utils.test_cnv_caller(cnv_caller, copy_number, cnv_calls, confusion_matrix=None, comparison_table=None)[source]

Test a CNV caller for the target gene.

Parameters
  • cnv_caller (str or pypgx.Archive) – Archive file or object with the semantic type Model[CNV].

  • copy_number (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[CopyNumber].

  • cnv_calls (str or pypgx.Archive) – Archive file or object with the semantic type SampleTable[CNVCalls].

  • confusion_matrix (str, optional) – Write the confusion matrix as a CSV file where rows indicate actual class and columns indicate prediction class.

  • comparison_table (str, optional) – Write a CSV file comparing actual vs. predicted CNV calls for each sample.
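
To make the confusion matrix orientation concrete (rows are actual classes, columns are predicted classes), here is a standalone sketch using only the standard library. The class labels and the exact CSV layout are assumptions for illustration; the file written by PyPGx may be formatted differently.

```python
import csv
import io
from collections import Counter


def confusion_matrix_csv(actual, predicted):
    """Return CSV text: rows are actual classes, columns are predicted classes."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["actual"] + labels)
    for a in labels:
        writer.writerow([a] + [counts[(a, p)] for p in labels])
    return buffer.getvalue()


# Hypothetical CNV calls for four samples.
actual = ["Normal", "Normal", "Deletion", "Duplication"]
predicted = ["Normal", "Deletion", "Deletion", "Duplication"]
table = confusion_matrix_csv(actual, predicted)
```

Here the "Normal" row reads 1, 0, 1: one truly normal sample was predicted as a deletion and one was predicted correctly.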

pypgx.api.utils.train_cnv_caller(copy_number, cnv_calls, confusion_matrix=None, comparison_table=None)[source]

Train a CNV caller for the target gene.

This method will return an SVM-based multiclass classifier that makes CNV calls using the one-vs-rest strategy.

Parameters
  • copy_number (str or pypgx.Archive) – Archive file or object with the semantic type CovFrame[CopyNumber].

  • cnv_calls (str or pypgx.Archive) – Archive file or object with the semantic type SampleTable[CNVCalls].

  • confusion_matrix (str, optional) – Write the confusion matrix as a CSV file where rows indicate actual class and columns indicate prediction class.

  • comparison_table (str, optional) – Write a CSV file comparing actual vs. predicted CNV calls for each sample.

Returns

Archive object with the semantic type Model[CNV].

Return type

pypgx.Archive
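
As a rough usage sketch, train_cnv_caller and test_cnv_caller can be chained using the signatures documented above. The wrapper name, its arguments, and the output file names are hypothetical, and the inputs are assumed to be archives with the semantic types CovFrame[CopyNumber] and SampleTable[CNVCalls]. PyPGx is imported inside the function so the definition itself loads without the package installed.

```python
def train_and_evaluate(train_copy_number, train_cnv_calls,
                       test_copy_number, test_cnv_calls, prefix="cnv"):
    """Train a CNV caller on one dataset and test it on another (sketch only)."""
    # Imported lazily; requires PyPGx to be installed when the function is called.
    from pypgx.api import utils

    # Returns an archive with the semantic type Model[CNV].
    model = utils.train_cnv_caller(
        train_copy_number,
        train_cnv_calls,
        confusion_matrix=f"{prefix}-train-confusion.csv",
        comparison_table=f"{prefix}-train-comparison.csv",
    )
    # Evaluate the trained model on separate data and write the CSV reports.
    utils.test_cnv_caller(
        model,
        test_copy_number,
        test_cnv_calls,
        confusion_matrix=f"{prefix}-test-confusion.csv",
        comparison_table=f"{prefix}-test-comparison.csv",
    )
    return model
```

Keeping the training and test inputs separate avoids judging the caller on the samples it was fitted to.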