post berkeley-schema-fy24 merge issue: review FileTypeEnum composition and correlation with other DataObject slots/relationships #2186

Open
turbomam opened this issue Sep 24, 2024 · 2 comments
Labels
FileTypeEnum composition or usage of FileTypeEnum, which fills a DataObject's data_object_type slot

Comments

@turbomam
Member

turbomam commented Sep 24, 2024

  • Can a DataObject that is the output of any process use any FileTypeEnum value in its data_object_type slot?
  • Do the different permissible values (PVs) come from different axes of differentiation?
  • Should we use an is_a hierarchy within the PVs?
  • Should we re-normalize all of the permissible values to lower_snake_case? (This would require a corresponding data migration and changes to the code that describes future DataObjects.)
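
To make the lower_snake_case question concrete, here is a minimal sketch of how an old-to-new mapping for such a data migration could be generated. The function names and the example labels are hypothetical and are not taken from the schema or from any existing migrator; the normalization rule (collapse every run of non-alphanumeric characters to one underscore) is also an assumption.

```python
import re


def to_lower_snake_case(label: str) -> str:
    """Collapse each run of non-alphanumeric characters to one underscore, then lowercase."""
    return re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_").lower()


def build_migration_map(labels: list[str]) -> dict[str, str]:
    """Map each old label to its normalized form, refusing ambiguous results."""
    mapping: dict[str, str] = {}
    seen: dict[str, str] = {}  # normalized form -> the old label that produced it
    for old in labels:
        new = to_lower_snake_case(old)
        if new in seen and seen[new] != old:
            # Two distinct old labels would become indistinguishable after migration.
            raise ValueError(f"{old!r} and {seen[new]!r} both normalize to {new!r}")
        seen[new] = old
        mapping[old] = new
    return mapping


# Illustrative labels only (assumption); not copied from FileTypeEnum.
example_labels = [
    "Metagenome Raw Read 1",
    "FT ICR-MS Analysis Results",
    "Kraken2 Krona Plot",
]
migration = build_migration_map(example_labels)
```

The collision check matters because punctuation is discarded: two labels that differ only in hyphens or spacing would otherwise be merged silently.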

As one example: what are the advantages and disadvantages of generality or specificity in

The same question might apply to other PVs in this enumeration.
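
One way to weigh generality against specificity: if the PVs carried is_a parents, a consumer could query at either level. The sketch below assumes a hypothetical parent table; the value names are made up for illustration and none of them are taken from the schema.

```python
from typing import Optional

# Hypothetical parent links between permissible values (assumption, not schema content).
IS_A: dict[str, Optional[str]] = {
    "annotation_result": None,
    "functional_annotation": "annotation_result",
    "structural_annotation": "annotation_result",
    "tigrfam_annotation": "functional_annotation",
}


def ancestors(value: str) -> list[str]:
    """Return the chain of parents for a value, nearest parent first."""
    chain: list[str] = []
    parent = IS_A[value]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_kind_of(value: str, general: str) -> bool:
    """True if value is the general term itself or one of its descendants."""
    return value == general or general in ancestors(value)
```

With such a table, a specific value stays available for precise provenance, while a search for the general term would still match it via `is_kind_of`.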

Low priority for now (in my opinion).

cc @mslarae13 @brynnz22

see also the following label (although we might want to remove it at some point)

For example, we could use a link like this instead of a label (berkeley-schema-fy24 in this case).

@turbomam turbomam self-assigned this Sep 24, 2024
@turbomam
Member Author

Claude finds these different axes of differentiation or concerns in FileTypeEnum:

  1. Data Type / Analysis Method:

    • Metagenome data
    • Metabolomics data (FT ICR-MS, GC-MS)
    • Metaproteomics data
    • Assembly data
    • Annotation data (various types)
    • Read-based analysis
    • Taxonomic classification (GOTTCHA2, Kraken2, Centrifuge)
  2. Processing Stage:

    • Raw data
    • Filtered data
    • Error-corrected data
    • Assembled data
    • Annotated data
  3. File Format:

    • FASTQ
    • BAM
    • FASTA
    • GFF
    • JSON
    • TSV
    • PDF
    • HTML
  4. Sequencing Read Type:

    • Raw Read 1 (forward)
    • Raw Read 2 (reverse)
    • Interleaved paired-end
  5. Quality Control Stage:

    • QC Statistics
    • QC non-rRNA reads
  6. Biological Entity Focus:

    • Protein-related
    • Peptide-related
    • RNA-related (rRNA, tRNA, etc.)
    • Gene-related
  7. Output Type:

    • Report files
    • Statistical files
    • Plot files (heatmap, barplot, Krona plot)
    • Binning results
  8. Annotation Type:

    • Structural annotation
    • Functional annotation
    • Various specific annotation types (e.g., TIGRFam, CRT, Genemark, etc.)
  9. Compression Status:

    • Compressed files (e.g., zip files for bins)
    • Uncompressed files
  10. Workflow Stage:

    • Intermediate files
    • Final output files
    • Workflow statistics
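
As a rough illustration of what separating these axes could look like, the sketch below factors a file type into independent fields instead of one flat value. Every class, field, and member name here is hypothetical; this is not how the schema models DataObject, and only a few of the listed axes are shown.

```python
from dataclasses import dataclass
from enum import Enum


class ProcessingStage(Enum):
    RAW = "raw"
    FILTERED = "filtered"
    ASSEMBLED = "assembled"
    ANNOTATED = "annotated"


class FileFormat(Enum):
    FASTQ = "fastq"
    FASTA = "fasta"
    GFF = "gff"
    TSV = "tsv"


@dataclass(frozen=True)
class FileTypeDescriptor:
    """Hypothetical factored view of what a single flat enum value bundles together."""
    data_type: str            # e.g. "metagenome", "annotation"
    stage: ProcessingStage    # processing-stage axis
    file_format: FileFormat   # file-format axis
    compressed: bool = False  # compression-status axis


raw_reads = FileTypeDescriptor("metagenome", ProcessingStage.RAW, FileFormat.FASTQ)
structural_gff = FileTypeDescriptor("annotation", ProcessingStage.ANNOTATED, FileFormat.GFF)

# Each axis can be filtered on independently of the others.
fastq_files = [d for d in (raw_reads, structural_gff) if d.file_format is FileFormat.FASTQ]
```

The trade-off is that a flat enumeration keeps each DataObject to one slot value, whereas a factored form needs several slots but avoids a combinatorial growth in permissible values.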

@turbomam turbomam added the FileTypeEnum composition or usage of FileTypeEnum, which fills a DataObject's data_object_type slot label Sep 24, 2024
@mslarae13
Contributor

@turbomam is this accomplished with the merged PR?
