This GSVA_readme_6.txt file was generated on 2023-09-11 by Ruonan Wu and modified 2023-12-08 by Bill Nelson


GENERAL INFORMATION

1. Title of Dataset: Global Soil Virus (GSV) Atlas

2. Principal Researcher: Kirsten Hofmockel

3. Information about funding sources supporting the data: DOE BER GSP Soil Microbiome SFA, FWP 

4. Geographic location of data collection: a total of 2,953 soil metagenomes collected globally. The geographic coordinates of each sample are listed in GSVA_sample_metadata_5.csv in this data package.


5. Data History(2020-07-24 to 2023-09-01): 
	A. Reporting Date: 2023-09-01
	B. Data activity type: curation of soil metagenomes and collection of the geochemical data of each sample, identification of soil viral contigs, clustering of viral contigs, host predictions, functional annotations of viral genes
	D. Activity Reporter Contact Information: emilybgraham, Ruonan0101
	E. Who is the data given to next?: JGI: Antonio Camargo, Matt Nolan, Alex Copeland, Nikos C. Kyrpides


DATA & FILE OVERVIEW

1. Data Activity Type Record:
	A. File List: 
	GSVA_soil_viruses_1.fna.gz. Viral contigs detected that passed QA/QC. 
	GSVA_soil_viruses_genome_metadata_2.tsv.gz. A file containing metadata associated with each contig. This includes: sample ID, contig ID, contig length, host assignment, cluster assignment (vOTU, genus, family), CheckV and geNomad quality.
	GSVA_soil_viruses_3.faa.gz. A FASTA file containing all the predicted viral proteins.
	GSVA_soil_viruses_gene_metadata_4.tsv.gz. A file containing data associated with each gene. This would include: sample ID, contig ID, gene ID, KO annotation, KO quality/threshold, CAZy annotation, CAZy quality/threshold, Pfam annotation, Pfam quality/threshold.
	GSVA_sample_metadata_5.csv. Geographic and physicochemical data of the curated soil samples. 

	B. Relationship between files, if important: 
	GSVA_soil_viruses_genome_metadata_2.tsv.gz contains clustering and host prediction results of the viral contigs in GSVA_soil_viruses_1.fna.gz. Sequences in GSVA_soil_viruses_3.faa.gz were predicted from viral contigs in GSVA_soil_viruses_1.fna.gz. Functional annotation results in GSVA_soil_viruses_gene_metadata_4.tsv.gz are for the protein sequences in GSVA_soil_viruses_3.faa.gz. 
	C. Additional related data collected that was not included in the current data package: 
	NA
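
The file relationships in item B can be followed programmatically by joining the two metadata tables on contig_id. The following is a minimal Python sketch, not part of the original workflow. It assumes that both .tsv.gz files start with a header row whose column names match the variable lists in this README (contig_id, votu, gene_id); the function names are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def read_tsv_gz(path):
    """Yield each row of a gzip-compressed, tab-separated file as a dict."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def genes_by_contig(gene_metadata_path):
    """Group gene_id values by the contig_id they were predicted from."""
    grouped = defaultdict(list)
    for row in read_tsv_gz(gene_metadata_path):
        grouped[row["contig_id"]].append(row["gene_id"])
    return dict(grouped)


def attach_genes(genome_metadata_path, gene_metadata_path):
    """Return {contig_id: (votu, [gene_id, ...])} by joining on contig_id."""
    genes = genes_by_contig(gene_metadata_path)
    return {
        row["contig_id"]: (row["votu"], genes.get(row["contig_id"], []))
        for row in read_tsv_gz(genome_metadata_path)
    }
```

For the full data package, attach_genes("GSVA_soil_viruses_genome_metadata_2.tsv.gz", "GSVA_soil_viruses_gene_metadata_4.tsv.gz") would hold all gene identifiers in memory, which may be acceptable for about 1.4 million genes but is worth checking on small machines.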

METHODOLOGICAL INFORMATION

1. Data Activity Type Record:
	A. Description of methods used for collection/generation of data: 
	 We collected a total of 2,953 soil metagenomic samples from major repositories and ecological networks, including the JGI Integrated Microbial Genomes and Microbiomes (IMG/M) platform, the MG-RAST metagenomics analysis server, the Global Urban Soil Ecological Education Network (GLUSEEN), the Earth Microbiome Project (EMP), and the National Ecological Observatory Network (NEON), plus submissions from individual collaborators. This included 1,552 samples not previously included in IMG/M (Figure 1 and 2).
	For samples collected via JGI IMG/M, we retrieved all studies with GOLD ecosystem type of “Soil” as of August 2020. We manually curated metagenomic sequences to remove misclassified data as follows. We removed samples from studies with the following: (1) GOLD ecosystem types: Rock-dwelling, Deep subsurface, Plant litter, Geologic, Oil reservoir, Volcanic, and Contaminated; (2) GOLD ecosystem subtypes: Wetlands, Aquifer, Tar, Sediment, Fracking Water, and Soil crust; (4) words in title: wetland, sediment, acid mine, cave wall surface, mine tailings, rock biofilm, beach sand, Petroleum, Stalagmite, Subsurface hydrocarbon microbial communities, Vadose zone, mud volcano, Fumarolic, enriched, Composted filter cake, Ice psychrophilic, oil sands, groundwater, Contaminated, rock biofilm, Deep mine, coal mine fire, Hydrocarbon resource environments, Marine, enrichment, groundwater, mangrove, saline desert, Hydroxyproline, Rifle, coastal, compost, biocrust, crust, Creosote, soil warming, Testing DNA extraction, and/or Agave; (5) GOLD geographic location of wetland; and (6) GOLD project type of Metagenome - Cell Enrichment. Additionally, sample names that indicated experimental manipulation (e.g., CO2 enrichment or nitrogen fertilization) or were located in permafrost layers were manually excluded. This resulted in 1,480 curated metagenomes from publicly available data in IMG/M.
	After collating samples from JGI IMG/M and the newly collected samples from external networks and collaborators, the final dataset consisted of 2,953 soils with 2,015,688,128 contigs, representing 1.2 terabases of assembled DNA sequences.
	In parallel, we retrieved mean values for soil parameters at 0-5 cm depth from the SoilGrids250m database. SoilGrids250m is a spatial interpolation of global soil parameters using ~150,000 soil samples and 158 remote sensing-based products. Here, we focus on six parameters often associated with soil microbial communities: bulk density, cation exchange capacity, nitrogen, pH, soil organic carbon, and clay content. Because our focus is on spatial dynamics and soils were collected at various times, we did not include temporally dynamic variables such as soil moisture or temperature in our set of environmental parameters, though we acknowledge they may have profound impacts on the soil virosphere. 

	B. Methods for processing the data: 
	Viral identification: To standardize data analysis across all samples, the 1,552 soil metagenomic samples not collected from IMG/M were analyzed using the JGI’s Metagenome Workflow [117]. In brief, samples were individually assembled using metaSPAdes v. 3.1. Of the 1,552 assembled soil samples, 1,476 passed default quality control thresholds [118], yielding 133 gigabases of assembled DNA in 241,465,924 contigs. Additionally, three very large metagenomes (>1TB each) were assembled separately due to computational limitations in standard workflows [119]. The resulting assemblies were assigned GOLD identification numbers, imported into IMG/M, and processed using version 5.0.0 of the IMG/M Metagenome Annotation Pipeline to align with data obtained directly from IMG/M [117]. 
	We performed an initial identification of viral contigs using IMG/VR v3’s viral classification pipeline [53]. The pipeline uses 25,281 viral protein families contained within IMG/VR v3 [53], 16,712 protein families of viral origin from the Pfam database [120], and VirFinder [121] to identify putative viral genomes in contigs that were at least 1 kb long. During the course of this study, geNomad (version 1.3.3) [122], a tool for virus identification with improved classification performance, was released and incorporated into our pipeline to improve prediction confidence and perform taxonomic assignment. We also analyzed viral sequences with CheckV v1.0.1 (database version 1.5) [123] to estimate the quality of the viral genomes. As this study focused on non-integrated viral genomes, contigs that were flagged by either geNomad or CheckV as proviral were discarded. From the remaining contigs, viral genomes were selected using the following rules: (1) contigs of at least 1 kb with high similarity to genomes in the CheckV database (that is, that had high- or medium-quality completeness estimates) or that contained direct terminal repeats were automatically selected; (2) contigs longer than 10 kb were required to have a geNomad virus score higher than 0.8 and to either encode one virus hallmark (e.g., terminase, capsid proteins, portal protein, etc.) or to have a virus marker enrichment (as computed by geNomad) of at least 5.0; (3) contigs shorter than 10 kb and longer than 5 kb were required to have a geNomad virus score higher than 0.9, to encode at least one virus hallmark, and to have a virus marker enrichment higher than 2.0. This resulted in 49,649 viral contigs that we used for downstream analysis.
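
The three selection rules above can be written as a single predicate. This is a minimal Python sketch, not the code used to build the dataset. It assumes proviral contigs have already been removed, reads "encode one virus hallmark" as "at least one", treats any CheckV quality label beginning with "high" or "medium" as satisfying rule 1, and takes a hypothetical has_dtr flag for direct terminal repeats (no such column is listed in this README). It follows the text literally, so a contig of exactly 10 kb matches neither rule 2 nor rule 3.

```python
def passes_selection(length_bp, virus_score, n_hallmarks, marker_enrichment,
                     checkv_quality="", has_dtr=False):
    """Apply the three contig selection rules described in the methods text."""
    quality = checkv_quality.strip().lower()

    # Rule 1: >= 1 kb with high/medium CheckV quality, or direct terminal repeats.
    if length_bp >= 1_000 and (quality.startswith(("high", "medium")) or has_dtr):
        return True

    # Rule 2: > 10 kb, score > 0.8, and a hallmark OR marker enrichment >= 5.0.
    if length_bp > 10_000:
        return virus_score > 0.8 and (n_hallmarks >= 1 or marker_enrichment >= 5.0)

    # Rule 3: between 5 kb and 10 kb, score > 0.9, a hallmark AND enrichment > 2.0.
    if 5_000 < length_bp < 10_000:
        return virus_score > 0.9 and n_hallmarks >= 1 and marker_enrichment > 2.0

    return False
```

The arguments correspond to the contig_length, genomad_virus_score, genomad_n_virus_hallmarks, genomad_marker_enrichment_v, and checkv_quality columns of GSVA_soil_viruses_genome_metadata_2.tsv.gz.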
	
	Viral clustering: Viral genomes were clustered into viral operational taxonomic units (vOTUs) following MIUViG guidelines (95% average nucleotide identity, ANI; 85% aligned fraction, AF) [51]. In brief, we performed an all-versus-all BLAST (version 2.13.0+, ‘-task megablast -evalue 1e-5 -max_target_seqs 20000’) search to estimate pairwise ANIs and AFs, as described in Nayfach et al. [123], and used the Leiden algorithm (igraph Python library, version 0.9.10, resolution parameter = 1.0) to cluster the genomes. Viruses were also grouped at the genus (40% average amino acid identity, AAI; 20% shared genes) and family (20% AAI; 10% shared genes) levels using DIAMOND [124] for protein alignment and the Markov Cluster Process (MCL) [125] for clustering, as described in a previous study [52].
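
To illustrate how the vOTU thresholds act on pairwise comparisons, the sketch below keeps only genome pairs with ANI >= 95% and AF >= 85% and then groups genomes that are linked by such pairs. This is a simplified stand-in written for this README: it uses single-linkage connected components (union-find), not the Leiden algorithm used in the study, so its groups can differ from the published vOTUs. Pairwise ANI and AF values are assumed to be precomputed, and the function name is hypothetical.

```python
def group_by_thresholds(genomes, pairs, min_ani=95.0, min_af=85.0):
    """Group genomes connected by pairs that pass the ANI and AF thresholds.

    genomes: iterable of genome identifiers.
    pairs:   iterable of (genome_a, genome_b, ani_percent, af_percent).
    """
    genomes = list(genomes)
    parent = {g: g for g in genomes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, ani, af in pairs:
        if ani >= min_ani and af >= min_af:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), set()).add(g)
    return sorted(groups.values(), key=lambda members: sorted(members))
```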

	Host prediction: Viral sequences were assigned to putative host (bacterial and archaeal) taxa through matches to a previously described database of CRISPR spacers from 1.6 million bacterial and archaeal genomes from NCBI GenBank and MAGs (release 242; 15 February 2021) [126–130]. Sequences of viral genomes were queried against the spacer database (https://portal.nersc.gov/cfs/m342/crisprDB) using blastn (v2.9.0+, parameters: ‘-max_target_seqs = 1000 -word_size = 8 -dust = no’). Only alignments of at least 25 bp, with fewer than 2 mismatches, that covered ≥ 95% of the spacer length were considered. Viral sequences were assigned to the host taxon at the lowest taxonomic rank that had at least two spacers matched and that represented >70% of all matches. 
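
The hit filter and the rank-based host assignment described above can be sketched as follows. This is an illustrative Python example, not the pipeline code. It assumes each retained spacer match carries a host lineage ordered from domain to species, computes spacer coverage as aligned length divided by spacer length, and counts each match once; the rank names and function names are hypothetical.

```python
from collections import Counter

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def keep_hit(aligned_bp, mismatches, spacer_length):
    """True if a spacer alignment meets the length, mismatch and coverage cutoffs."""
    return (aligned_bp >= 25 and mismatches < 2
            and aligned_bp / spacer_length >= 0.95)


def assign_host(lineages, min_spacers=2, min_fraction=0.70):
    """Return (rank, taxon) at the lowest rank supported by the matches, or None.

    lineages: one tuple per retained spacer match, ordered as RANKS.
    """
    total = len(lineages)
    if total == 0:
        return None
    # Walk from species up to domain and stop at the first rank that qualifies.
    for depth in range(len(RANKS), 0, -1):
        counts = Counter(lin[:depth] for lin in lineages if len(lin) >= depth)
        if not counts:
            continue
        taxon, n = counts.most_common(1)[0]
        if n >= min_spacers and n / total > min_fraction:
            return RANKS[depth - 1], taxon[-1]
    return None
```

With three matches that agree at genus level but split two to one at species level, the species fails the >70% cutoff (2/3) and the genus is returned instead.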
	
	Functional annotation: We leveraged an intermediate output of geNomad (version 1.3.3) [122] (‘genes.tsv’) to screen putative auxiliary metabolic genes (AMGs) on the detected viral contigs. Proteins of the viral contigs were annotated by viral and microbial-specific HMMs implemented in geNomad. The identified viral hallmark (e.g., terminase, major capsid protein) and non-hallmark proteins were labeled as ‘VV-1’ and ‘V*-0’ in geNomad output, respectively. The rest of the viral proteins of the detected viral contigs that were annotated as non-virus-specific or unclassified were then classified into five categories of putative AMGs based on the presence of viral hallmark or non-hallmarks up- or down-stream as mentioned previously [46]. The AMGs with both virus-specific genes (‘VV-1’ or ‘V*-0’) were retained for the following analysis. To improve the functional annotations of the putative AMGs and highlight the viral potential for metabolizing carbohydrates and glycoconjugates, the AMG proteins were also annotated against the Carbohydrate-Active enZYmes (CAZy) database and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database using the default settings, in addition to the functional annotation databases implemented in geNomad. Each putative AMG was assigned the functional annotation with the highest bitscore (i.e., duplicate annotations were not allowed). Following Hurwitz and U’Ren [102] and Hurwitz et al. [131], we further screened putative AMGs to remove genes not found in KEGG pathways. Additionally, in recognition of the ambiguity in distinguishing genes encoding auxiliary metabolic functions versus core metabolic processes [102], we discuss the resulting set of genes presented here as ‘putative AMGs’.
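
The "highest bitscore" rule for resolving multiple annotations of one gene can be sketched in a few lines of Python. This is an illustration only: the input layout (gene identifier, annotation label, bitscore) and the function name are assumptions, and ties are resolved by keeping the first hit seen, which the methods text does not specify.

```python
def best_annotation_per_gene(hits):
    """Keep, for each gene, the single annotation with the highest bitscore.

    hits: iterable of (gene_id, annotation, bitscore) tuples.
    Returns {gene_id: (annotation, bitscore)}.
    """
    best = {}
    for gene_id, annotation, bitscore in hits:
        if gene_id not in best or bitscore > best[gene_id][1]:
            best[gene_id] = (annotation, bitscore)
    return best
```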

	C. Instrument- or software-specific information needed to interpret the data: 
	geNomad (version 1.3.3), CheckV (v1.0.1), IMG/M Metagenome pipeline (version 5.0.0), BLAST (version 2.13.0+), igraph Python library (version 0.9.10), DIAMOND, blastn (v2.9.0+), R (v 4.1.0), dbCAN HMMdb (release 7.0)
	D. Standards and calibration information, if appropriate: NA
	E. Environmental/experimental conditions: see GSVA_sample_metadata_5.csv for details. 
	F. Describe any quality-assurance procedures performed on the data: Details are included in the method description above (B. Methods for processing the data)


DATA-SPECIFIC INFORMATION FOR: 

GSVA_soil_viruses_1.fna.gz
1. Data Activity Type Record:
	A. Number of variables: NA 
	B. Number of records: 49649
	C. Variable List: NA 
	D. Missing data codes: NA
	E. Specialized formats or other abbreviations used: IUPAC nucleotide
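
A downloaded copy of the FASTA file can be checked against the record count in item B by counting header lines. The following is a minimal Python sketch using only the standard library; the function name is hypothetical.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a gzip-compressed FASTA file by its '>' header lines."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

If the file matches this README, count_fasta_records("GSVA_soil_viruses_1.fna.gz") should equal the number of records given in item B. The same function applies to GSVA_soil_viruses_3.faa.gz.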

GSVA_soil_viruses_genome_metadata_2.tsv.gz
1. Data Activity Type Record:
	A. Number of variables: 22
	B. Number of cases/rows: 49649
	C. Variable List:
1.	contig_id - contig identifier
2.	contig_length - contig length (bp)
3.	n_genes - number of genes predicted
4.	votu - vOTU assignment (MIUViG standard)
5.	genus_cluster - MCL (40% AAI, 20% shared)
6.	family_cluster - MCL (20% AAI, 10% shared)
7.	genomad_taxonomy - geNomad taxonomy assignment
8.	predicted_host - host organism predicted by CRISPR spacers
9.	genomad_chromosome_score - confidence (0-1) that sequence is chromosomal
10.	genomad_plasmid_score - confidence (0-1) that sequence is plasmid
11.	genomad_virus_score - confidence (0-1) that sequence is viral
12.	genomad_n_uscg - number of universal single-copy conserved genes
13.	genomad_n_plasmid_hallmarks - number of plasmid marker genes
14.	genomad_n_virus_hallmarks - number of virus marker genes
15.	genomad_marker_enrichment_c - measure of USCG prevalence
16.	genomad_marker_enrichment_p - measure of plasmid marker prevalence
17.	genomad_marker_enrichment_v - measure of viral marker prevalence
18.	checkv_quality - CheckV quality score
19.	checkv_completeness - CheckV completeness score
20.	checkv_completeness_method - CheckV completeness method
21.	checkv_aai - amino acid identity to closest reference sequence
22.	checkv_aai_shared_fraction - alignment fraction to closest reference sequence
	D. Missing data codes: NA
	E. Specialized formats or other abbreviations used: NA
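
The variable and row counts in items A and B can be verified on a downloaded copy of the table. This minimal Python sketch assumes the file begins with a single header row, which this README does not state explicitly; the function name is hypothetical.

```python
import csv
import gzip


def table_shape(path, delimiter="\t"):
    """Return (n_rows, n_columns) for a gzip-compressed delimited table.

    The first line is treated as the header and is not counted as a row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)
```

Under that assumption, table_shape("GSVA_soil_viruses_genome_metadata_2.tsv.gz") should return the row count from item B and the variable count from item A.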

GSVA_soil_viruses_3.faa.gz
1. Data Activity Type Record:
	A. Number of variables: NA
	B. Number of records: 1432147
	C. Variable List: NA
	D. Missing data codes: NA
	E. Specialized formats or other abbreviations used: IUPAC amino acid

GSVA_soil_viruses_gene_metadata_4.tsv.gz
1. Data Activity Type Record:
	A. Number of variables: 14
	B. Number of cases/rows: 1432147
	C. Variable List: 
1.	gene_id - gene identifier
2.	contig_id - contig identifier
3.	protein_family - protein family determined by ???
4.	start_coordinate - predicted CDS start coordinate
5.	end_coordinate - predicted CDS end coordinate
6.	strand - predicted CDS strand
7.	partial - binary indicator of N-terminal or C-terminal truncation (00/10/01/11)
8.	start_type - start codon
9.	rbs_motif - ribosome binding site motif
10.	genetic_code - NCBI genetic code id
11.	gc_content - fraction G+C (0.00-1.00)
12.	pfam - PFAM classification(s)
13.	cazyme - CAZy classification(s)
14.	kegg_ortholog - KEGG Orthology classification
	D. Missing data codes: NA
	E. Specialized formats or other abbreviations used: IUPAC nucleotide

GSVA_sample_metadata_5.csv
1. Data Activity Type Record:
	A. Number of variables: 327
	B. Number of cases/rows: 2953
	C. Variable List: metadata columns 1-39 (A-AM) were obtained from IMG (http://img.jgi.gov). The remaining columns (40-327/AN-LO) were obtained from SoilGrids (http://soilgrids.org). Descriptions of the SoilGrids columns can be found at https://ncss-tech.github.io/soilDB/reference/fetchSoilGrids.html.
	D. Missing data codes: NA
	E. Specialized formats or other abbreviations used:
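
The column ranges in item C are given both as 1-based positions and as spreadsheet letters. The Python sketch below converts letters to positions and splits a header into the IMG and SoilGrids parts. It is an illustration only: the function names are hypothetical, and the split assumes the CSV column order matches item C.

```python
def column_letters_to_index(letters):
    """Convert spreadsheet column letters (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def split_columns(header, n_img=39):
    """Split a list of column names into (IMG columns, SoilGrids columns)."""
    return header[:n_img], header[n_img:]
```

The letters in item C are consistent with the stated positions: AM maps to 39, AN to 40, and LO to 327.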