Skip to content

Classify a directory of bins

Metagenomic binning gives you one FASTA per putative genome. GVClass takes a directory of those bins and returns a taxonomy call and a quality table for each one. This is the recommended way to run the tool.

Prepare the input directory

Put one FASTA file per putative genome in a single directory. A file may hold several contigs that belong to the same genome; GVClass treats the whole file as one query. Use nucleotide FASTA (.fna) or protein FASTA (.faa).

For reliable giant-virus calls, mind the assembled length:

  • Minimum supported: about 20 kb.
  • Better reliability: at or above 30 kb.
  • Preferred: at or above 50 kb.

Keep filenames clean. The filename becomes the query name, so avoid ., ;, and :. Use _ or - instead. For protein input, write headers as filename|proteinid.

Tip

Filter short contigs (below a few kb) out of each bin before you run. For giant viruses, prefer bins assembled to at least 50 kb; short fragments add noise and weaken the marker signal.

Run the classification

Run from the cloned repository so the launcher can find src/:

pixi run gvclass my_bins -o my_results -t 32

Here my_bins is your input directory, -o my_results is the output directory, and -t 32 sets the total thread budget. Omit -o and results go to <query_dir>_results (for this command, my_bins_results).

Note

To run from any directory, install the CLI wrapper or use the Apptainer wrapper on a cluster. See Configure the database for shared database setups and Run on an HPC cluster for batch submission.

Choose parallelism

Two flags control throughput:

  • -t sets the total number of threads.
  • -j sets how many bins run at once (workers). GVClass picks a worker count automatically when you leave -j unset.

For a directory of many small bins, more workers help; for a few large genomes, give each worker more threads. See Tune speed and accuracy for the trade-offs.

Read the results

GVClass writes a combined table for the whole run plus per-query files:

  • gvclass_summary.tsv and gvclass_summary.csv: one row per bin with the taxonomy call and quality metrics.
  • per-query files such as <query>.final_summary.tsv and <query>.tar.gz inside the output directory.

See Output files and columns for every file and the full column layout. To turn the table into a curation decision, read Assess genome quality. For what happens between FASTA and taxonomy, see How GVClass works.