Marker panels and genetic codes
Every query is scored against a fixed set of HMM marker panels. Each panel contributes a completeness column and, where applicable, a duplication column to the summary table. Gene calling for nucleotide input is run under nine genetic codes, and the selected code is recorded per query.
Marker panels
| Panel | Column prefix | Size | Purpose |
|---|---|---|---|
| GVOG4 | gvog4 | n/4 | Core NCLDV single-copy orthologs |
| GVOG8 | gvog8 | n/8 | Core NCLDV single-copy orthologs |
| BUSCO eukaryotic single-copy | busco | n/255 | Eukaryotic carry-over flag |
| Universal COG (UNI56) | cog | n/56 | Prokaryotic carry-over flag |
| Mryavirus | mrya | n/6 | Mryavirus markers |
| Phage (geNomad) | phage | n/20 | Phage contamination flag |
| Virophage core | vp | n/4 | MCP, Penton, ATPase, Protease |
| Mirusviricota core | mirus | n/4 | MCP, ATPase, Portal, Triplex |
| Capsid typing | capsid_group, ncldv_mcp_total | count | Capsid (MCP) type tally |
| PPV flag | plv | count | A32 (PLV_PC_054) proteins placing with PPV references |
For each panel, {panel}_completeness reports distinct marker models present over panel size (for example 8/8), and {panel}_dup reports total hits over distinct models present (a duplication factor).
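The two ratios can be illustrated with a minimal Python sketch. It assumes the hits for one panel are available as a list of marker model names, one entry per hit. The function name `panel_metrics`, the model names, and the value returned for a panel with no hits are hypothetical and are not taken from the tool itself.

```python
from collections import Counter


def panel_metrics(hit_models, panel_size):
    """Return (completeness, dup) for one panel.

    hit_models: list of marker model names, one entry per hit (assumed input shape).
    panel_size: number of marker models in the panel, e.g. 8 for GVOG8.
    """
    counts = Counter(hit_models)
    distinct = len(counts)          # distinct marker models present
    total = sum(counts.values())    # total hits across all models

    completeness = f"{distinct}/{panel_size}"
    # Duplication factor: total hits over distinct models present.
    # Returning 0.0 for an empty panel is an assumption of this sketch.
    dup = total / distinct if distinct else 0.0
    return completeness, dup


# Hypothetical GVOG4 example: three distinct models, one of them hit twice.
completeness, dup = panel_metrics(["m1", "m1", "m2", "m3"], panel_size=4)
print(completeness, round(dup, 2))
```

A duplication factor of 1.0 means every detected marker was hit exactly once; values above 1.0 mean at least one marker has multiple hits.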
Note
Two panels deviate from the {panel}_dup convention. The virophage panel emits vp_completeness and vp_mcp (the count of VP MCP hits) instead of vp_dup, and the Mirusviricota panel emits mirus_completeness only. Three further columns are counts or tallies rather than panel fractions. capsid_group is a label:count tally across the Nucleocytoviricota and Mirusviricota phyla and the Bellas & Sommaruga capsid groups. ncldv_mcp_total is the NCLDV-specific MCP count. plv counts A32 proteins (PLV_PC_054) that place with PPV references and flags Polinton-like viruses and virophages within the PPV (Preplasmiviricota) domain; plv is 0 for ordinary NCLDV.
Order-level markers are a separate panel of 576 order-conserved orthologous groups, built only when fast mode is turned off with -e/--extended; the default fast mode skips them. See Tune speed and accuracy for the speed and resolution trade-off.
Per-column definitions for the full summary table are in Output reference; interpretation of completeness, contamination, and duplication is in Quality metrics.
Genetic codes
Nine genetic codes are tested during gene calling:
| Code | Translation table |
|---|---|
| 0 | Pyrodigal meta mode (pretrained models) |
| 1 | NCBI standard |
| 4 | NCBI mold, protozoan, coelenterate mitochondrial; Mycoplasma, Spiroplasma |
| 6 | NCBI ciliate, dasycladacean, hexamita nuclear |
| 11 | NCBI bacterial, archaeal, plant plastid |
| 15 | NCBI Blepharisma nuclear |
| 29 | NCBI Mesodinium nuclear |
| 106 | Added by the pyrodigal fork |
| 129 | Added by the pyrodigal fork |
The codes tested and the selection margin are set in config (genetic_codes.codes, genetic_codes.improvement_threshold); see Configuration reference.
Selection rule, applied per query:
- Start from meta (code 0).
- Replace it when another code yields more complete marker hits (hits with over 66 percent HMM coverage).
- Or when it yields equal hits with an average hit score over 5 percent higher.
- Or when it yields equal hits with a coding density over 5 percent higher.
The margin for the second and third conditions is improvement_threshold (0.05).
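The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the `CodeResult` record, the `beats` and `select_code` names, and the sample numbers are hypothetical. The sketch also assumes that "over 5 percent higher" is a relative margin and that each candidate is compared against the current best rather than always against meta.

```python
from dataclasses import dataclass


@dataclass
class CodeResult:
    code: int              # genetic code; 0 is pyrodigal meta mode
    complete_hits: int     # marker hits with over 66 percent HMM coverage
    avg_score: float       # average hit score
    coding_density: float  # fraction of the sequence that is coding


def beats(candidate, best, threshold=0.05):
    """True if candidate should replace best under the three conditions."""
    if candidate.complete_hits > best.complete_hits:
        return True
    if candidate.complete_hits == best.complete_hits:
        # Assumption: the margin is relative to the current best value.
        if candidate.avg_score > best.avg_score * (1 + threshold):
            return True
        if candidate.coding_density > best.coding_density * (1 + threshold):
            return True
    return False


def select_code(results, threshold=0.05):
    """Start from meta (code 0) and replace it when another code wins."""
    best = next(r for r in results if r.code == 0)
    for r in results:
        if r.code != 0 and beats(r, best, threshold):
            best = r
    return best.code


# Hypothetical per-query results for three of the nine codes.
results = [
    CodeResult(0, 10, 100.0, 0.90),
    CodeResult(1, 10, 104.0, 0.92),   # equal hits, gains under 5 percent
    CodeResult(6, 11, 95.0, 0.88),    # more complete hits
]
print(select_code(results))
```

With these sample values, code 1 does not clear the margin on either score or density, while code 6 wins on the first condition because it has more complete marker hits.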
The selected code is written to the ttable column. When the pyrodigal meta model wins, ttable reads codemeta. Protein (.faa) input skips gene calling and reports ttable no_fna, and its nucleotide statistics (GCperc, CODINGperc) are 0.00.
For the full gene-calling and marker-detection sequence, see How it works.