The following items will be uploaded to the DataHub.

#####################
./npy_blocks/


Our image dataset is stored as FP32 numpy arrays. I have pre-sorted the
arrays into static training-validation-testing splits. The npy_blocks data
are the main training input to the neural network: images of galaxies
taken from the GALEX, PanSTARRS, and WISE surveys. Each subset has been
further split into 16 chunks, one for each of the 16 workers running
training. The *_indices.npy files are integer numpy arrays that give,
for each subset and chunk, the mantis_shrimp_index of each object. This
allows us to query our spectroscopic data for targets.

./npy_blocks/
----train_indices.npy
----train/
    ----mantis_shrimp_*.npy [* in {0,...,15}]
----val_indices.npy
----val/
    ----mantis_shrimp_*.npy [* in {0,...,15}]
----test_indices.npy
----test/
    ----mantis_shrimp_*.npy [* in {0,...,15}]
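
A minimal sketch of loading one chunk from the layout above. It assumes each
mantis_shrimp_*.npy file is a standard numpy array whose first axis indexes
objects. The helper names load_chunk and load_indices, and the root argument,
are illustrative and not part of the release. How the entries of
*_indices.npy are laid out per chunk is not specified here, so the sketch
only loads the index array as-is.

```python
from pathlib import Path

import numpy as np


def load_chunk(root, split, chunk):
    """Load one image chunk (hypothetical helper, not part of the release).

    root  : path to the npy_blocks directory
    split : "train", "val", or "test"
    chunk : integer in 0..15
    """
    if split not in ("train", "val", "test"):
        raise ValueError(f"unknown split: {split!r}")
    if not 0 <= chunk <= 15:
        raise ValueError(f"chunk must be in 0..15, got {chunk}")
    path = Path(root) / split / f"mantis_shrimp_{chunk}.npy"
    # mmap_mode="r" maps the file read-only instead of reading it all into memory.
    return np.load(path, mmap_mode="r")


def load_indices(root, split):
    """Load the integer mantis_shrimp_index array for a split."""
    return np.load(Path(root) / f"{split}_indices.npy")
```

For example, load_chunk("./npy_blocks", "train", 0) would return the first
training chunk, and load_indices("./npy_blocks", "train") the matching
index array.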


######################
./spectroscopy/

The spectroscopy directory contains comma-separated-values (.csv) files
and binary Python pickle (.pkl) files of our learning targets and
additional metadata about each source. Each file has an index
representing the mantis_shrimp_index, plus metadata columns. Each metadata column is one of:
a) redshift, a number describing the cosmological redshift of the galaxy
b) ra, or right ascension, describing the position in the sky of the galaxy
c) dec, or declination describing the position in the sky of the galaxy
d) extinction values (ebv), describing the reddening due to dust along the line of sight
    between us and the galaxy
e) photosurvey_ID: if the original spectroscopic survey had an objID, here I include it.
    This can be either a big-int or text depending on the original survey
f) survey: text describing the original source of the target
g) photoz: number representing another project's predicted photometric redshift of the object
h) photoz_err: number representing the uncertainty of another project's predicted photometric redshift of the object
i) prob_gal: BeckPZ also report the probability of an object being a galaxy (rather than a star or QSO).
j) a value representing whether the WISExPS1 photometry is well represented in the original training dataset used in Beck et al 2023

Spectroscopy is taken from public astronomical spectroscopic surveys: Sloan Digital
Sky Survey (SDSS), Dark Energy Spectroscopic Instrument survey (DESI), DEEP2,
GAMA, VVDS, VIPERS, 6dF, and WiggleZ. Citations for these sources are provided
in Table 2 of our public report https://arxiv.org/pdf/2402.03535.pdf.


./spectroscopy/
----redshifts.pkl
----redshifts_withextinction.pkl
----./redshifts_broken_beck/
    ----chunk*.csv [* in {1,...,8}]
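
A minimal sketch of reading the chunk*.csv files with the Python standard
library. It assumes every chunk file begins with a header row; no specific
column names are assumed. The helper name read_beck_chunks is illustrative
and not part of the release. The .pkl files are not covered because the type
of the pickled object is not stated here; if they are pandas DataFrames (an
assumption), pandas.read_pickle would load them.

```python
import csv
from pathlib import Path


def read_beck_chunks(directory):
    """Read chunk1.csv ... chunk8.csv into one list of row dictionaries.

    Hypothetical helper: assumes every chunk file starts with a header row.
    Values are returned as strings; convert columns as needed.
    """
    rows = []
    for n in range(1, 9):
        path = Path(directory) / f"chunk{n}.csv"
        if not path.exists():
            continue  # tolerate a partial download
        with open(path, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

For example, read_beck_chunks("./spectroscopy/redshifts_broken_beck") would
return one dictionary per CSV row, keyed by the header names.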

#####################
./masks/

The masks directory contains an "exists_mask", a numpy boolean array indicating
whether an object exists given our queried cutouts. Essentially,
I chose to represent objects that do not have available photometry
as all zeros rather than remove them from the dataset entirely. This
choice is reflected in there being images with all zeros in the
npy_blocks/* files. A natural thing to do is simply remove these images
in preprocessing.

./masks/
----exists_mask.npy
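
A minimal sketch of removing the all-zero images in preprocessing. It
detects empty images directly from the pixel values rather than reading
exists_mask.npy, because how exists_mask is indexed is not stated here. It
assumes the first axis of an image array indexes objects. The helper names
are illustrative and not part of the release.

```python
import numpy as np


def nonzero_image_mask(images):
    """Return a boolean array that is True where an image has any nonzero pixel.

    Assumes the first axis of `images` indexes objects.
    """
    flat = np.asarray(images).reshape(len(images), -1)
    return flat.any(axis=1)


def drop_empty(images, indices):
    """Drop all-zero images and the matching entries of `indices`."""
    keep = nonzero_image_mask(images)
    return np.asarray(images)[keep], np.asarray(indices)[keep]
```

If exists_mask turns out to be indexed by mantis_shrimp_index (an
assumption), exists_mask[indices] should give an equivalent selection.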


#####################
./model_weights/

Directory for trained model weights. Currently empty. It will contain
both SSL pre-trained weights and fine-tuned weights of our models.

#####################
./etc/

Contains information queried from online, public astronomical surveys to
be matched onto our mantis_shrimp_index. For example, it contains the photon
flux received by the telescope from an object that exists in our image dataset.