Assess genome quality¶
Open gvclass_summary.tsv (or the .csv) after a run and read the columns below to decide whether a classified giant virus MAG (GVMAG) is high quality or needs manual curation. You need a finished classification first; see Classify bins. Full column definitions live in the output reference.
Read completeness¶
estimated_completeness is the percentage of the expected genome recovered for the assigned lineage. A value near 100 means little is missing. In the bundled example, PkV-RF01 and AC3300027503___Ga0255182_1000024 both score 100.00, while the 10-contig GVMAG-S-1096109-37 scores 82.86.
Read completeness_model_reliability next to it. The value (advisory_only, moderate, or high) describes the per-order model that produced the estimate, not your genome. A 100.00 tagged advisory_only means the assigned order has too few references to calibrate the model, so treat that number as a rough guide. The GVMAG-S-1096109-37 estimate is tagged moderate, which carries more weight.
Note
estimated_completeness is the only completeness field in the main table. Switch the underlying estimator with --completeness-mode legacy if you need the older behavior (see the CLI reference).
Read contamination¶
estimated_contamination is the production estimate from the trained model. On clean bins it sits near zero; the three example genomes report 0.00, 0.08, and 0.00.
contamination_type names the likely source. It reads clean below the threshold and carries a specific label once estimated_contamination reaches 10 or higher.
contamination_type |
likely source |
|---|---|
clean |
contamination below the threshold; nothing to remove |
cellular |
eukaryotic or prokaryotic cellular sequence |
mixed_viral |
two or more viral orders mixed in one bin |
phage |
bacteriophage sequence |
duplication |
duplicated content, often an assembly chimera |
uncertain |
ambiguous signal flagged for triage (can be a novel-virus signature, not true contamination) |
Read duplication¶
Duplication catches bins that merge multiple populations or chimeric assemblies. Check order_dup and gvog8_dup:
- above ~2: multiple populations or assembly chimeras are likely; inspect the bin
- below ~1.5: typically clean
Each {panel}_dup column is a duplication factor, total marker hits divided by distinct markers present. A single-copy marker found twice pushes the factor above 1.
Check for cellular carry-over¶
Giant viruses obligately lack the single-copy markers that cellular genomes carry, so any of those markers signals sequence from a host or a co-binned cell. Watch two panels and their duplication factors:
busco_completeness(n/255) withbusco_dup: eukaryotic BUSCO single-copy markerscog_completeness(n/56) withcog_dup: universal prokaryotic COG (UNI56) single-copy markers
A non-zero busco_completeness or cog_completeness, especially with an elevated _dup, points to cellular carry-over. See the markers reference for every panel.
Warning
Giant viruses carry many eukaryote-like genes acquired by horizontal gene transfer. GVClass does NOT count those as contamination; the cellular signal is restricted to conserved translation-machinery HMMs (BUSCO eukaryotic plus UNI56 prokaryotic) that giant viruses lack. mixed_viral likewise means several viral orders mixed in one bin, not host genes. See quality metrics for the reasoning.
Look deeper¶
Two extra outputs explain a flagged genome.
gvclass_summary.extended.tsvis always written. It holds the per-contig diagnostics (cellular_coherent_*,cellular_lineage_purity_median,viral_bearing_contig_count,contig_attribution_mode) that the main table omits to stay readable.<query>.contamination_candidates.tsvis written per query whenestimated_contaminationreaches 10, the type is interpretable, and suspicious contigs are found. It names the specific contigs to drop or check.
The verdict¶
Apply one combined rule:
- high
estimated_completeness, low duplication (order_dupandgvog8_dupbelow ~1.5), andcontamination_typeclean: a high-quality GVMAG - low completeness, high duplication, or any non-clean type: send the bin for manual curation
Tip
All three bundled example genomes pass: high completeness, low duplication, and clean. Use them as a reference point when reading your own summary table.