This readme.txt file was generated on 2022-10-03 by Ruonan Wu, Bill Nelson and Ryan Mcclure

GENERAL INFORMATION

1. Title of Dataset: High-Throughput Chromosomal Conformation Capture (Hi-C) Metagenome Sequencing Reveals Moisture Impact on Soil Phage-Host Interactions

2. Contributors:
   Principal Investigator: Kirsten Hofmockel
   Data contributor (creator, producer, etc.): Phase Genomics generated the bulk and Hi-C metagenomes; the metatranscriptome data were generated through the metaphenome experiment. RW, BN and RM analyzed the dataset (for details, see the author contribution section of the manuscript: https://pnnl.sharepoint.com/:w:/r/teams/SoilMicrobiomeSFA/Shared%20Documents/General/2_Intermediate%20Metaphenomes/A2.5%20Metaphenome%20Incubation/HI_C%20Metaphenome/MS/Hi_metaphenome_NM_01062023_MainText.docx?d=wc3efc45dae794bf097add45c3339bd3f&csf=1&web=1&e=we4HXa).

3. Information about funding sources supporting the data: DOE BER GSP Soil Microbiome SFA, FWP DOE BER GSP Soil Microbiome SFA, thrust 1: NK7748

4. Geographic location of data collection: Soil samples were collected at 46°15'04"N, 119°43'43"W (Prosser, WA, USA) and incubated in the PNNL laboratory.

DATA & FILE OVERVIEW

1. Data Activity Type Record:

A. File List:
   Dataset1: Hi-C Viral Host Links
      /rcfs/projects/SoilSFA_r0/Hi_C_ForDataHub/Hi_C_data_dir/HiC_filteredLinks_dir/viral_host_associations_HiC.tsv
   Dataset2: HiC manuscript: bulk metagenomic data_MG_raw Upload
      mmechs-324-sg_R1_001.fastq.gz, mmechs-324-sg_R2_001.fastq.gz
      mmechs-330-sg_R1_001.fastq.gz, mmechs-330-sg_R2_001.fastq.gz
      mmechs-335-sg_R1_001.fastq.gz, mmechs-335-sg_R2_001.fastq.gz
      mmechs-297-sg_R1_001.fastq.gz, mmechs-297-sg_R2_001.fastq.gz
      mmechs-306-sg_R1_001.fastq.gz, mmechs-306-sg_R2_001.fastq.gz
      mmechs-317-sg_R1_001.fastq.gz, mmechs-317-sg_R2_001.fastq.gz
   Dataset3: Assembly product (viral contigs and MAGs)
      /rcfs/projects/SoilSFA_r0/Hi_C_ForDataHub/MG_dir/MG_assembly_dir/ViralContigs_dir/viral_contigs.fna
      /rcfs/projects/SoilSFA_r0/Hi_C_ForDataHub/MG_dir/MG_assembly_dir/MAGs_dir/*.fasta
   Dataset4: Quality Filtered Metatranscript Reads
      /rcfs/projects/SoilSFA_r0/Hi_C_ForDataHub/MT_dir/*.fastq.gz

B. Relationship between files, if important: Datasets 1-3 were generated from samples collected at two timepoints of the metaphenome experiment (2 timepoints x 3 replicates each). The two timepoints are denoted D1 (before soil drying, after the pre-incubation) and D15 (after soil drying). Dataset 4 contains metatranscriptome data generated from all timepoints of the metaphenome experiment.

C. Additional related data collected that was not included in the current data package: NA

METHODOLOGICAL INFORMATION

1. Data Activity Type Record:

A. Description of methods used for collection/generation of data:

Metagenome sequencing: The genomic DNA extracted from the six samples was quality checked using Qubit BR (Invitrogen, Waltham, MA, USA) and shipped to Phase Genomics on dry ice. The paired-end deep sequencing libraries were prepared using ProxiMeta library preparation reagents (Phase Genomics, Seattle, WA). Sequencing was performed on an Illumina NovaSeq, generating an average of 177 million PE150 (paired-end, 150 bp) read pairs.
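
As a rough worked example of the sequencing depth quoted above, the number of read pairs can be converted into an approximate base yield. This is only a back-of-the-envelope sketch: it assumes that the stated average of 177 million read pairs applies per sample and that every read is a full 150 bp (trimming will reduce the real figure). The function name total_bases is illustrative and is not part of the data package.

```python
# Back-of-the-envelope sequencing yield for the bulk metagenomes.
# Assumption: the 177 million figure is an average per sample, and
# every read is a full 150 bp before trimming.
READ_PAIRS = 177_000_000  # average number of read pairs, from the text
READS_PER_PAIR = 2        # paired-end sequencing yields two reads per pair
READ_LENGTH_BP = 150      # PE150


def total_bases(read_pairs: int, read_length_bp: int = READ_LENGTH_BP) -> int:
    """Return the number of sequenced bases for a given number of read pairs."""
    return read_pairs * READS_PER_PAIR * read_length_bp


if __name__ == "__main__":
    bases = total_bases(READ_PAIRS)
    print(f"Approximate yield: {bases / 1e9:.1f} Gbp")  # 53.1 Gbp
```

Under these assumptions the average yield is about 53.1 Gbp of raw sequence.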

Hi-C cross-linking and sequencing: For each replicate, 5 g of soil per timepoint were sent on dry ice to Phase Genomics and processed according to their low-biomass protocol (ProxiMeta(TM) Hi-C Kit Protocol v4.0). In brief, samples were mixed in 25 mL of water and vortexed for 5 min. The tubes were centrifuged at 1000 x g for 10 min to allow sediment to settle. The supernatant was transferred to a new tube and formaldehyde was added to a final concentration of 1% (v/v). The tubes were incubated at room temperature for 20 min with occasional gentle mixing by inversion or rotation. Glycine was added to a final concentration of 1% (v/v) to quench the crosslinking reaction, and the samples were incubated at room temperature for 20 min with occasional gentle mixing by rotation. A Hi-C library was created using a ProxiMeta Hi-C Microbiome v4.0 Kit (Phase Genomics, Seattle, WA), which is the commercially available version of the Hi-C protocol (reference 23 in the manuscript). Following the manufacturer's instructions, the cross-linked DNA extracted from each replicate was digested with the Sau3AI and MluCI restriction enzymes and proximity-ligated with biotinylated nucleotides to create chimeric molecules composed of fragments from different regions of genomes that were physically proximal in vivo. Chimeric molecules were pulled down with streptavidin beads and processed using the ProxiMeta library preparation reagents (Phase Genomics, Seattle, WA). The Hi-C metagenomes were sequenced on an Illumina NovaSeq.

Metatranscriptome sequencing: The extracted RNA was treated with Turbo DNase (Invitrogen, Waltham, MA, USA) and then cleaned up with a Zymo RNA Clean and Concentrator kit (Zymo Research, Irvine, CA, USA). The resulting RNA was quality checked using an Agilent RNA 6000 Nano kit (Agilent, Santa Clara, CA, USA) and quantified using a Qubit (Invitrogen, Waltham, MA, USA).
The RNA extracted from the six samples was sent to the Joint Genome Institute (JGI) for sequencing with the standard metatranscriptome workflow (https://jgi.doe.gov/user-programs/pmo-overview/project-materials-submission-overview/rna-submission-guidelines/).

B. Methods for processing the data:
   Dataset1: Hi-C Viral Host Links: Virus-host associations indicated by Hi-C sequencing were processed by Phase Genomics following the protocol at https://www.biorxiv.org/content/10.1101/2021.06.14.448389v1.full
   Dataset2: The raw bulk metagenomic data were trimmed and assembled de novo. The assemblies were screened for viral contigs (with VirSorter2, DeepVirFinder, VIBRANT and CheckV) and binned to generate MAGs, which were further refined using Hi-C read links.
   Dataset3: Includes the dereplicated viral contigs and the finalized MAGs.
   Dataset4: The raw metatranscriptome data collected over the course of the metaphenome incubation were trimmed and quality controlled.
   The detailed methods and tools are documented in the manuscript draft (https://pnnl.sharepoint.com/:w:/r/teams/SoilMicrobiomeSFA/Shared%20Documents/General/2_Intermediate%20Metaphenomes/A2.5%20Metaphenome%20Incubation/HI_C%20Metaphenome/MS/Hi_metaphenome_NM_01062023_MainText.docx?d=wc3efc45dae794bf097add45c3339bd3f&csf=1&web=1&e=we4HXa).

C. Instrument- or software-specific information needed to interpret the data:
   Hi-C library: ProxiMeta Hi-C Microbiome v4.0 Kit
   Shotgun DNA library: ZymoBIOMICS DNA Miniprep Kit
   Sequencing platform: Illumina NovaSeq
   Bioinformatic tools: fastp (v0.20.1), MEGAHIT (v1.2.9), BWA-MEM (v0.7.17), SAMBLASTER (v0.1.24), samtools (v1.9), CheckM (v1.2.0), dRep (v3.4.0), GTDB-Tk (v2.1.0), VirSorter (v2), VIBRANT (v1.2.1), DeepVirFinder (v2020.11.21), CheckV (v0.7.0), ViPTreeGen (v1.1.2), vConTACT2 (v0.9.19), BBMap (v38.34)

D. Standards and calibration information, if appropriate: NA

E. Environmental/experimental conditions: As part of the metaphenome experiment, three replicate soil samples were collected at D1 (before soil drying, field moisture) and at D15 (after soil drying, constant weight).

F. Describe any quality-assurance procedures performed on the data: Replicate samples were collected to capture virus-host interactions.

DATA-SPECIFIC INFORMATION FOR: Dataset1: Hi-C Viral Host Links

1. Data Activity Type Record:

A. Number of variables: 18

B. Number of cases/rows: 79

C. Variable List:
   Sample
   viral_contig_name
   viral_contig_length (bp)
   viral_contig_read_count (reads)
   viral_contig_read_depth (reads/kbp)
   viral_contig_read_depth_in_this_cluster (reads/kbp)
   cluster_name
   cluster_length (bp)
   cluster_read_count (reads)
   cluster_read_depth (reads/kbp)
   intra_read_count (reads)
   intra_linkage_density (reads/kbp^2)
   inter_read_count (reads)
   raw_inter_linkage_density (reads/kbp^2)
   raw_inter_vs_intra_ratio
   viral_element_copies_per_cell
   adjusted_inter_connective_linkage_density (reads/kbp^2)
   adjusted_inter_vs_intra_ratio

D. Missing data codes: NA

E. Specialized formats or other abbreviations used: NA

DATA-SPECIFIC INFORMATION FOR: Dataset2: HiC manuscript: bulk metagenomic data_MG_raw Upload

1. Data Activity Type Record:

A. Number of variables: NA; sequencing data

B. Number of cases/rows: NA; sequencing data

C. Variable List: NA; sequencing data

D. Missing data codes: NA; sequencing data

E. Specialized formats or other abbreviations used: .fastq.gz

DATA-SPECIFIC INFORMATION FOR: Dataset3: Assembly product (viral contigs and MAGs)

1. Data Activity Type Record:

A. Number of variables: NA; sequencing data

B. Number of cases/rows: NA; sequencing data

C. Variable List: NA; sequencing data

D. Missing data codes: NA; sequencing data

E. Specialized formats or other abbreviations used: .fna; .fasta

DATA-SPECIFIC INFORMATION FOR: Dataset4: Quality Filtered Metatranscript Reads

1. Data Activity Type Record:

A. Number of variables: NA; sequencing data

B. Number of cases/rows: NA; sequencing data

C. Variable List: NA; sequencing data

D. Missing data codes: NA; sequencing data

E. Specialized formats or other abbreviations used: .fastq.gz
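
EXAMPLE: CHECKING THE SHAPE OF THE HI-C VIRAL HOST LINKS TABLE

The sketch below shows one possible way to confirm that a downloaded copy of viral_host_associations_HiC.tsv matches the 18 variables and 79 rows listed in this readme. It is an illustration, not part of the original processing. It assumes the file is tab-separated with a single header line and that the 79 rows are data rows excluding the header. Whether the header text in the file includes the units shown in the variable list (for example "(bp)") is not stated here, so the check only counts columns. The function names read_links and check_shape are hypothetical.

```python
import csv

# Shape stated in this readme for the Hi-C viral host links table.
EXPECTED_COLUMNS = 18
EXPECTED_ROWS = 79  # assumed to be data rows, excluding the header line


def read_links(path):
    """Read a tab-separated table and return (header, rows as dicts)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        header = list(reader.fieldnames or [])
    return header, rows


def check_shape(header, rows, n_columns=EXPECTED_COLUMNS, n_rows=EXPECTED_ROWS):
    """Return a list of mismatch messages; an empty list means the shape matches."""
    problems = []
    if len(header) != n_columns:
        problems.append(f"expected {n_columns} columns, found {len(header)}")
    if len(rows) != n_rows:
        problems.append(f"expected {n_rows} rows, found {len(rows)}")
    return problems
```

A typical call would be header, rows = read_links("viral_host_associations_HiC.tsv") followed by check_shape(header, rows). Any messages returned indicate that the local copy differs from the description in this readme.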