Skip to content

Output files and columns

GVClass writes a run-level summary plus per-query artifacts to the output directory (<query_dir>_results by default, or the path passed to -o). The flags that add or remove files are listed in the CLI reference.

Files

Run-level files

Written to the root of the output directory.

File Contents
gvclass_summary.tsv Main summary table, one row per query, tab-separated.
gvclass_summary.csv Same content as gvclass_summary.tsv, comma-separated.
gvclass_summary.extended.tsv Always-on per-contig contamination diagnostics, one row per query, tab-separated.
gvclass_summary.extended.csv Same content as the extended TSV, comma-separated.

Per-query files

Written for each input query.

File Contents
<query>.final_summary.tsv Single-query row in the 44-column main schema, used to rebuild the run summary on --resume.
<query>.summary.tab Legacy per-query summary table.
<query>.tar.gz Bundled per-query artifacts.
<query>.SUCCESS Resume sentinel; its presence marks the query complete for --resume.
<query>.contamination_candidates.tsv Suspicious contigs; written only when estimated_contamination is at least 10, contamination_type is interpretable, and suspicious contigs are found.

Species-tree files

Written only with --species-tree. See Build a species tree and the species tree explanation.

Path Contents
species_tree/<query>/<query>.treefile Concatenated-marker species tree for the query.
species_tree/<query>/<query>.partitions.txt Per-marker partition definitions for the query supermatrix.
species_tree/<query>/species_tree_taxonomy.tsv Taxonomy and full-precision nearest-reference distances for the query species tree.
species_tree/combined.* Combined tree across all queries; written only with --species-tree-combined.
species_tree/_combined/<panel>/ Per-panel combined trees for multi-domain batches; written only with --species-tree-combined.

Note

Without --species-tree, no species_tree/ directory is produced and the four species_tree_* summary columns are nd.

Summary columns

gvclass_summary.tsv and gvclass_summary.csv share the same 44 columns in this order.

Column Description
query Input filename for the query.
taxonomy_majority Full lineage from the per-marker single-gene tree nearest-neighbor majority vote.
species_tree_nn_taxonomy Lineage of the nearest reference in the concatenated-marker species tree; nd unless --species-tree was set.
taxonomy_confidence high when every emitted rank cleared its distinct-marker threshold, otherwise one or more of low_support, reduced_fastmode, no_support.
capsid_group Unified capsid-type tally as label:count across the Nucleocytoviricota, Mirusviricota, and Bellas & Sommaruga capsid groups.
species Species-rank call with per-taxon counts.
genus Genus-rank call with per-taxon counts.
family Family-rank call with per-taxon counts.
order Order-rank call with per-taxon counts.
class Class-rank call with per-taxon counts.
phylum Phylum-rank call with per-taxon counts.
domain Domain-rank call with per-taxon counts.
avgdist Average tree distance to the reference neighbors.
order_dup Average copy number of expected order-level markers; elevated values indicate duplicated, chimeric, or mixed bins.
estimated_completeness Estimated percent of the expected genome recovered for the assigned lineage; the only completeness field in the main table.
completeness_model_reliability advisory_only, moderate, or high, from the per-order model hold-out R^2; a property of the model, not the genome.
estimated_contamination Output of the trained extra_trees_v1 contamination model and the primary contamination estimate.
contamination_type Likely contamination source (clean, cellular, mixed_viral, phage, duplication, uncertain), populated when estimated_contamination is at least 10.
gvog4_completeness Distinct core NCLDV GVOG4 markers present, as n/4.
gvog4_dup GVOG4 duplication factor (total marker hits / distinct markers present).
gvog8_completeness Distinct core NCLDV GVOG8 markers present, as n/8.
gvog8_dup GVOG8 duplication factor.
busco_completeness Eukaryotic BUSCO single-copy markers present, as n/255.
busco_dup BUSCO duplication factor; elevated values indicate cellular (eukaryote) carry-over.
cog_completeness Universal COG (UNI56) single-copy markers present, as n/56.
cog_dup COG duplication factor; elevated values indicate cellular (prokaryote) carry-over.
mrya_completeness Mryavirus markers present, as n/6.
mrya_dup Mryavirus duplication factor.
phage_completeness Phage (geNomad) markers present, as n/20.
phage_dup Phage duplication factor.
ncldv_mcp_total Count of NCLDV-specific major capsid protein (MCP) markers.
vp_completeness Virophage core markers present (MCP, Penton, ATPase, Protease), as n/4.
vp_mcp Count of virophage MCP hits.
plv Count of A32 (PLV_PC_054) proteins placing with PPV references; 0 for ordinary NCLDV.
mirus_completeness Mirusviricota core markers present (MCP, ATPase, Portal, Triplex), as n/4.
contigs Number of contigs in the query.
LENbp Total length in base pairs.
GCperc GC content as a percentage.
genecount Number of predicted genes.
CODINGperc Coding density as a percentage.
ttable Genetic code used for gene calling; no_fna for protein inputs.
species_tree_nn_genome Nearest-reference genome id from the species tree; nd without --species-tree.
species_tree_nn_distance Distance to the nearest reference in the species tree; nd without --species-tree.
species_tree_clade_id Clade id of the nearest reference in the species tree; nd without --species-tree.

The per-contig diagnostic columns (cellular_coherent_*, cellular_lineage_purity_median, viral_bearing_contig_count, contig_attribution_mode) appear only in the extended table.

For how to read completeness, contamination, and duplication, see Quality metrics. For the marker panels behind the completeness columns, see Markers.