domain examples

Below is one concrete, ready-to-download dataset or archive for each of the six domains you plan to feed into a “life-inquiry” LLM. Each example is open-access, richly annotated, and maps cleanly onto at least one of the framework’s life-dimensions (agency, adaptation, communication, etc.). Use the “Why it matters” (or “What’s inside”) and “How to fetch” notes to decide whether it fits your pipeline.


1 │ Biological life — Genomic “ground truth”

Dataset: Escherichia coli K-12 MG1655 complete chromosome, GenBank accession U00096.3.
Why it matters: Classic model-organism genome (4.64 Mbp) with exhaustively curated gene, RNA and operon annotation — a template for metabolism, replication and mutation studies.
How to fetch:

# FASTA
efetch -db nucleotide -format fasta -id U00096.3 > ecoli_K12_U00096.3.fna
# Full GenBank (features, qualifiers)
efetch -db nucleotide -format gbwithparts -id U00096.3 > ecoli_K12_U00096.3.gb

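No license restrictions for research use; the NCBI FTP mirrors are available if you need bulk downloads. The efetch commands require NCBI's Entrez Direct (EDirect) command-line tools. (NCBI)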
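
Once the FASTA file is on disk, a loader needs little more than a parser and a sanity check. The sketch below is one minimal way to do that with the Python standard library only; the function names (read_fasta, summarize, summarize_file) are illustrative, not part of any NCBI tool, and a production pipeline might prefer an established parser such as Biopython.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}.

    The record id is the first whitespace-delimited token after '>'.
    Sequence lines are upper-cased and concatenated.
    """
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {rid: "".join(parts) for rid, parts in records.items()}


def summarize(seq: str) -> Tuple[int, float]:
    """Return (length, GC fraction) for one sequence."""
    if not seq:
        return 0, 0.0
    gc = seq.count("G") + seq.count("C")
    return len(seq), gc / len(seq)


def summarize_file(path: str) -> Dict[str, Tuple[int, float]]:
    """Summarize every record in a FASTA file on disk."""
    with open(path, encoding="ascii") as handle:
        return {rid: summarize(seq) for rid, seq in read_fasta(handle).items()}
```

Calling summarize_file("ecoli_K12_U00096.3.fna") on the file produced by the first efetch command should report a single record; checking its length against the 4.64 Mbp figure above is a cheap way to confirm the download was complete.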


2 │ Artificial life / simulation — Digital-organism evolution

Dataset: Avida v2.14.0 Zenodo archive (DOI 10.5281/zenodo.5068026).
What’s inside: Source plus sample experiment folders containing every organism’s “genome” (instruction strings), fitness scores, event logs and full lines-of-descent across thousands of updates — perfect for supervised or contrastive learning on emergence and adaptation.
How to fetch: Download the one-click ZIP (9 MB) or pull it via the Zenodo REST API; logs are plain text and CSV-style for easy parsing. (Zenodo)
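
For scripted retrieval, the Zenodo record ID is the numeric suffix of the DOI, and the record metadata is served as JSON under /api/records/<id>. The sketch below shows one way to list a record's files with the standard library. It assumes the JSON layout Zenodo currently returns (a "files" list whose entries carry "key" and "links.self"); treat those field names as assumptions and verify them against a live response. The helper names are illustrative.

```python
import json
import re
import urllib.request
from typing import List, Optional, Tuple

ZENODO_API = "https://zenodo.org/api/records"


def zenodo_record_url(doi: str) -> str:
    """Map a Zenodo DOI such as 10.5281/zenodo.5068026 to its API URL."""
    match = re.fullmatch(r"10\.5281/zenodo\.(\d+)", doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return f"{ZENODO_API}/{match.group(1)}"


def list_files(doi: str, timeout: float = 30.0) -> List[Tuple[Optional[str], Optional[str]]]:
    """Fetch record metadata and return (filename, download_url) pairs.

    Performs a network request. The "files", "key" and "links.self"
    field names are assumptions about the response layout.
    """
    with urllib.request.urlopen(zenodo_record_url(doi), timeout=timeout) as resp:
        record = json.load(resp)
    pairs = []
    for entry in record.get("files", []):
        pairs.append((entry.get("key"), entry.get("links", {}).get("self")))
    return pairs
```

list_files("10.5281/zenodo.5068026") would then give the loader the archive name and URL to download, without hard-coding either.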


3 │ Astrobiology / space-mission data

Dataset: Mars 2020 Perseverance Mastcam-Z EDR/RDR bundle (PDS identifier urn:nasa:pds:mars2020_mastcamz).
What’s inside: Raw (EDR) and calibrated (RDR) multispectral images from Jezero Crater, Sol 0 → present, each with UTC time-stamps, rover pose, camera geometry and basic radiometric metadata.
How to fetch: Navigate to the PDS Imaging Node → “mars2020_mastcamz” bundle; download per-sol tarballs or automate with wget --recursive. All files carry a U.S. Government public-domain notice. (pds.nasa.gov)


4 │ Synthetic biology — Standardized gene-circuit part

Dataset: iGEM Registry BBa_C0062 (luxR quorum-sensing transcription factor) record — FASTA, SBOL 2.0, phenotype notes.
Why it matters: Encodes cell-density–responsive regulation, illustrating “communication” and “purpose” dimensions. SBOL metadata embeds roles, component hierarchy and ontology terms, ideal for a multimodal tokenizer (text + graph + sequence).
How to fetch: registry page → “Get Selected Sequence” for FASTA; “SBOL” button for machine-readable RDF/XML. CC-BY-SA license. (iGEM Parts Registry)


5 │ Philosophy / humanities — Conceptual discourse corpus

Dataset: PhilArchive pre-print “Information Theory, Evolution and the Origin of Life” (item code GRAARO-7).
Why it matters: 10 000-word review linking Shannon information, agency and biological evolution — prime textual material for the subjective/objective and information/physics axes of your map.
How to fetch: Direct PDF download from PhilArchive; full text is CC-BY. Pair with other PhilArchive entries via OAI-PMH for bulk harvest. (PhilArchive)


6 │ Multimodal bio-acoustics — Communication signals

Dataset: Xeno-Canto XC45953 — 21-second stereo recording of a Great Tit (Parus major) territorial song, with detailed geotag, date, and CC BY-NC-ND license.
Why it matters: Real-world, species-labeled acoustic signal useful for training cross-modal models on “communication” cues (pattern → taxonomy). Combine with spectrogram augmentation for self-supervised objectives.
How to fetch: Click “Download audio file” on entry page or script via Xeno-Canto JSON API (/api/2/recordings?query=XC45953). (Xeno-canto)
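
The text above does not say which spectrogram augmentation to use. One common choice is time and frequency masking in the style of SpecAugment, sketched below in pure Python on a spectrogram stored as a list of rows (frequency bins) by columns (time frames). The function name and parameters are illustrative, and computing the spectrogram from the audio file is left to whatever audio library your pipeline already uses.

```python
import random
from typing import List, Optional

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def mask_spectrogram(
    spec: Spectrogram,
    max_freq_width: int,
    max_time_width: int,
    rng: Optional[random.Random] = None,
) -> Spectrogram:
    """Return a copy of spec with one frequency band and one time band zeroed.

    Band widths are drawn uniformly from 0..max width (capped at the
    spectrogram size); start positions are drawn uniformly so the band fits.
    The input is not modified.
    """
    rng = rng or random.Random()
    out = [row[:] for row in spec]
    n_freq = len(out)
    n_time = len(out[0]) if out else 0
    if n_freq == 0 or n_time == 0:
        return out

    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)

    # Zero a horizontal band of frequency bins.
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time
    # Zero a vertical band of time frames.
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out
```

Passing a seeded random.Random makes the masks reproducible, which helps when the same recording is reused as a loader test fixture.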


Next steps

  • Assemble per-domain loaders that normalize file names, add provenance JSON, and emit a unified document schema (e.g., {text, modality, domain, dimension-tags}) before sharding into your LLM pipeline.
  • Use the example identifiers above as unit tests: each loader should ingest one of these records end-to-end, emitting a clean, self-describing sample ready for tokenization.
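
The two steps above can be sketched as a small shared record type that every per-domain loader emits. This is a minimal illustration, not a prescribed design: the field names follow the {text, modality, domain, dimension-tags} schema mentioned above (with dimension_tags spelled as a valid Python identifier), while the provenance keys (source_id, license, sha256) and the make_document helper are assumptions introduced here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Document:
    """One unified, self-describing sample ready for tokenization."""
    text: str
    modality: str
    domain: str
    dimension_tags: List[str]
    provenance: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)


def make_document(
    text: str,
    modality: str,
    domain: str,
    dimension_tags: Iterable[str],
    source_id: str,
    license_name: str,
) -> Document:
    """Build a Document and attach provenance, including a content hash."""
    text = text.strip()
    if not text:
        raise ValueError("refusing to emit an empty document")
    provenance = {
        "source_id": source_id,    # e.g. an accession, DOI or part number
        "license": license_name,   # license string as stated by the source
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    # Deduplicate and sort tags so identical inputs serialize identically.
    return Document(text, modality, domain, sorted(set(dimension_tags)), provenance)
```

A loader test for the iGEM example might call make_document(sequence, "sequence", "synthetic-biology", ["communication", "purpose"], "BBa_C0062", "CC-BY-SA") and assert that the JSON round-trips with all five fields present.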

These six concrete datasets give you a minimal, demonstrable slice of every knowledge domain the life-inquiry framework touches—enough to start building ingestion code and scaling out to the larger repositories listed in the previous overview.

