🦥 SLOTH Cookbook

Structural Loader with On-demand Traversal Handling

Lazy by design. Fast by default.

This cookbook demonstrates how to use SLOTH to parse, validate, modify, and write mmCIF files, using dot notation for data access and a high-performance gemmi backend for parsing.

Table of Contents

1. Setup and Installation
2. Import Required Libraries
3. Understanding SLOTH’s Core Components
4. Parsing mmCIF Files with Embedded Data
5. Exploring Data Structures with Dot Notation
6. Demonstrating 2D Slicing
7. Validating mmCIF Data
8. Modifying mmCIF Data
9. Creating Sample Data - Manual Approach
10. Creating Sample Data - Programmatic Approach
11. Creating Sample Data - Auto-Creation with Dot Notation
12. Exporting to Nested JSON
13. Importing from JSON
14. Round-Trip Validation
15. Writing Modified mmCIF Files
16. Complete Workflow Example

1. Setup and Installation

SLOTH can be installed via pip. Make sure you have Python 3.8 or higher.

[1]:

# Install SLOTH (if not already installed)
# !pip install -i https://test.pypi.org/simple/ sloth-mmcif

# Verify installation
import sloth
print(f"✅ SLOTH version: {sloth.__version__ if hasattr(sloth, '__version__') else 'installed'}")

✅ SLOTH version: 0.8.0
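
If `sloth.__version__` is not defined, the installed version can usually be read from the package metadata instead. The following is a minimal sketch using only the standard library (Python 3.8+). The distribution name `sloth-mmcif` is taken from the pip command above, and the helper name `installed_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "sloth-mmcif" is the distribution name used in the pip command above.
print(installed_version("sloth-mmcif") or "sloth-mmcif is not installed")
```

This works even when a package does not expose a `__version__` attribute, because it reads the metadata recorded at install time.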

2. Import Required Libraries

Let’s import all the SLOTH components we’ll need for this cookbook.

[2]:

import io
import os
import json
import tempfile
import warnings
from pathlib import Path

# SLOTH core components
from sloth.mmcif import (
    MMCIFHandler,
    ValidatorPlugin,
    DataSourceFormat,
    SchemaWarning,
    ValidationReport,
    SchemaValidator,
    MMCIFValidator,
    DataBlockValidator,
    ContainerValidator,
    mandatory_items,
    value_length,
)

# SLOTH data models
from sloth.mmcif.models import MMCIFDataContainer, DataBlock, Category

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!

3. Understanding SLOTH’s Core Components

SLOTH provides an elegant, Pythonic API for working with mmCIF data:

MMCIFDataContainer: The top-level container holding one or more data blocks
DataBlock: A named collection of categories (like data_1ABC)
Category: A collection of related items (like _atom_site or _entity)
Dot Notation: Access data naturally like container.data_1ABC._atom_site.Cartn_x

Key Features

✨ Auto-creation: Objects are created automatically as you access them
🚀 High Performance: Uses gemmi backend for fast parsing
🐍 Pythonic: Clean, intuitive API with dot notation
🔄 Round-trip: Full support for mmCIF → JSON → mmCIF conversions

4. Parsing mmCIF Files with Embedded Data

Let’s parse a comprehensive protein-ligand complex structure. We’ll use embedded demo data for convenience.

[3]:

# Comprehensive demo mmCIF data with TRUE hierarchical relationships
# This structure demonstrates actual nesting with proper parent-child relationships:
#   entity -> entity_poly -> entity_poly_seq (3 levels)
#   entity -> struct_asym -> atom_site (3 levels)
COMPREHENSIVE_DEMO_MMCIF = """data_DEMO
#
# Entry information (top level)
#
_entry.id DEMO
#
# Database references
#
loop_
_database_2.database_id
_database_2.database_code
PDB DEMO
EMDB DEMO
#
# Entity information (parent level - root of hierarchy)
#
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
1 polymer man 'Catalytic domain of model transferase'
2 water nat 'Water molecules'
#
# Entity polymer (child of entity via entity_id -> entity.id)
# Will nest under entity with id=1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_chirality
_entity_poly.pdbx_seq_one_letter_code
1 'polypeptide(L)' no MAGLY
#
# Entity polymer sequence (child of entity_poly)
# Will nest under entity_poly, creating 3-level hierarchy
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 1 MET
1 2 ALA
1 3 GLY
1 4 LEU
1 5 TYR
#
# Structural asymmetric unit (child of entity)
# Creates parallel branch from entity
#
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
A 1 'Protein chain A'
W 2 'Water chain'
#
# Atom sites (child of struct_asym via label_asym_id -> struct_asym.id)
# Creates deeper nesting under struct_asym
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_PDB_model_num
ATOM 1 N N MET A 1 1 20.154 6.718 46.973 1.00 25.00 1
ATOM 2 C CA MET A 1 1 21.618 6.765 47.254 1.00 24.50 1
ATOM 3 C C MET A 1 1 22.147 8.178 47.451 1.00 23.85 1
ATOM 4 N N ALA A 1 2 23.456 9.012 48.123 1.00 22.45 1
ATOM 5 C CA ALA A 1 2 24.123 10.234 48.567 1.00 21.30 1
HETATM 6 O O HOH W 2 . 12.345 15.678 35.432 1.00 18.56 1
HETATM 7 O O HOH W 2 . 13.456 16.789 36.543 1.00 19.67 1
#
"""

print("📝 Demo mmCIF data loaded with TRUE hierarchical relationships")
print("   🌲 3-level nesting: entity → entity_poly → entity_poly_seq")
print("   🌲 3-level nesting: entity → struct_asym → atom_site")
print("   ✨ This will create proper nested JSON output!")

📝 Demo mmCIF data loaded with TRUE hierarchical relationships
   🌲 3-level nesting: entity → entity_poly → entity_poly_seq
   🌲 3-level nesting: entity → struct_asym → atom_site
   ✨ This will create proper nested JSON output!

[4]:

# Parse the embedded demo data using in-memory file
print("⚡ Parsing mmCIF data with gemmi backend (in-memory)...")

# handler.read() takes a file path (the gemmi backend reads from disk),
# so write the embedded string to a temporary file first.
# tempfile is already imported in cell [2].
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(COMPREHENSIVE_DEMO_MMCIF)
    tmp_path = tmp.name

try:
    handler = MMCIFHandler()
    mmcif = handler.read(tmp_path)

    print(f"✅ Successfully parsed!")
    print(f"   Data blocks: {len(mmcif.data)}")

    if mmcif.data:
        block = mmcif.data[0]
        print(f"   Block name: '{block.name}'")
        print(f"   Categories: {len(block.categories)}")
        print(f"   Category names: {', '.join(block.categories)}")
finally:
    os.unlink(tmp_path)  # Clean up temp file

⚡ Parsing mmCIF data with gemmi backend (in-memory)...
✅ Successfully parsed!
   Data blocks: 1
   Block name: 'DEMO'
   Categories: 7
   Category names: _entry, _database_2, _entity, _entity_poly, _entity_poly_seq, _struct_asym, _atom_site
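
The write-to-temp-file, parse, and delete pattern used above recurs throughout this cookbook. It can be wrapped in a small context manager, sketched here with the standard library only. The name `temp_text_file` is hypothetical and not part of SLOTH, and `open()` stands in for a parser so that the example runs without SLOTH installed.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temp_text_file(text: str, suffix: str = ".cif") -> Iterator[str]:
    """Write text to a temporary file, yield its path, delete it on exit."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        yield path  # the file is closed here, so other readers can open it
    finally:
        os.unlink(path)


# Usage with any reader that expects a path; open() stands in for a parser.
with temp_text_file("data_DEMO\n_entry.id DEMO\n") as path:
    with open(path, encoding="utf-8") as fh:
        first_line = fh.readline().strip()
print(first_line)  # data_DEMO
```

With SLOTH, the body of the `with` block would presumably be the same call used in the cell above, `mmcif = handler.read(path)`.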

5. Exploring Data Structures with Dot Notation

SLOTH’s elegant dot notation makes accessing mmCIF data intuitive and Pythonic.

[5]:

# Access data block using dot notation
block = mmcif.data_DEMO  # Elegant dot notation for accessing block by name!
print(f"🧱 Block name: {block.name}")

# Access categories using dot notation (elegant!)
if "_entry" in block.categories:
    entry_category = block._entry  # Dot notation in action!
    print(f"\n📂 Entry category:")
    print(f"   Entry ID: {entry_category.id[0]}")
    print(f"   Entry type: {entry_category.type[0] if hasattr(entry_category, 'type') else 'N/A'}")

# Access database information
if "_database_2" in block.categories:
    db_category = block._database_2  # Direct dot notation!
    print(f"\n💾 Database category:")
    print(f"   Database IDs: {db_category.database_id}")
    print(f"   Database codes: {db_category.database_code}")

# Access entity information
if "_entity" in block.categories:
    entity_category = block._entity  # Elegant dot notation!
    print(f"\n🧬 Entity category:")
    print(f"   Entity IDs: {entity_category.id}")
    print(f"   Entity types: {entity_category.type}")
    print(f"   Descriptions: {entity_category.pdbx_description}")

print("\n💡 Tip: Use mmcif.data_BLOCKNAME to access blocks and block._category_name.item_name for data!")

🧱 Block name: DEMO

📂 Entry category:
   Entry ID: DEMO
   Entry type: N/A

💾 Database category:
   Database IDs: ['PDB', 'EMDB']
   Database codes: ['DEMO', 'DEMO']

🧬 Entity category:
   Entity IDs: ['1', '2']
   Entity types: ['polymer', 'water']
   Descriptions: ["'Catalytic domain of model transferase'", "'Water molecules'"]

💡 Tip: Use mmcif.data_BLOCKNAME to access blocks and block._category_name.item_name for data!
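
In the output above, the `pdbx_description` values keep their surrounding single quotes from the mmCIF source. If the quotes are not wanted for display, they can be stripped after reading. A minimal sketch follows; `unquote` is a hypothetical helper, not a SLOTH function, and the input list is copied from the output above.

```python
def unquote(value: str) -> str:
    """Strip one pair of matching single or double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


descriptions = ["'Catalytic domain of model transferase'", "'Water molecules'"]
print([unquote(d) for d in descriptions])
# ['Catalytic domain of model transferase', 'Water molecules']
```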

6. Demonstrating 2D Slicing

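SLOTH supports both column-wise and row-wise access: a column is read as an attribute (atom_site.Cartn_x), a single row by index (atom_site[0]), and a range of rows by slice (atom_site[0:3]). Values are returned as strings, exactly as they appear in the file.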

[6]:

# Column-wise access with dot notation
if "_atom_site" in block.categories:
    atom_site = block._atom_site

    print("📊 Column-wise access (the Pythonic way):")
    print(f"   Row count: {atom_site.row_count}")
    print(f"   Available items: {', '.join(atom_site.items[:5])}...")
    print(f"\n   Type symbols: {atom_site.type_symbol}")
    print(f"   X coordinates: {atom_site.Cartn_x}")
    print(f"   Y coordinates: {atom_site.Cartn_y}")
    print(f"   Z coordinates: {atom_site.Cartn_z}")

    # Row-wise access
    print(f"\n📋 Row-wise access (elegant and readable):")
    first_row = atom_site[0]
    print(f"   First atom:")
    print(f"     Type: {first_row.type_symbol}")
    print(f"     ID: {first_row.id}")
    print(f"     Position: ({first_row.Cartn_x}, {first_row.Cartn_y}, {first_row.Cartn_z})")

    # Slicing rows
    if atom_site.row_count >= 3:
        print(f"\n📑 Row slicing:")
        for i, row in enumerate(atom_site[0:3]):
            print(f"   Atom {i+1}: {row.type_symbol} at ({row.Cartn_x}, {row.Cartn_y}, {row.Cartn_z})")

print("\n💪 Dot notation makes your code readable, elegant, and Pythonic!")

📊 Column-wise access (the Pythonic way):
   Row count: 7
   Available items: group_PDB, id, type_symbol, label_atom_id, label_comp_id...

   Type symbols: ['N', 'C', 'C', 'N', 'C', 'O', 'O']
   X coordinates: ['20.154', '21.618', '22.147', '23.456', '24.123', '12.345', '13.456']
   Y coordinates: ['6.718', '6.765', '8.178', '9.012', '10.234', '15.678', '16.789']
   Z coordinates: ['46.973', '47.254', '47.451', '48.123', '48.567', '35.432', '36.543']

📋 Row-wise access (elegant and readable):
   First atom:
     Type: N
     ID: 1
     Position: (20.154, 6.718, 46.973)

📑 Row slicing:
   Atom 1: N at (20.154, 6.718, 46.973)
   Atom 2: C at (21.618, 6.765, 47.254)
   Atom 3: C at (22.147, 8.178, 47.451)

💪 Dot notation makes your code readable, elegant, and Pythonic!
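
Because the columns come back as lists of strings, numeric work needs an explicit conversion step. The sketch below converts the coordinate columns printed above to floats and computes their centroid. It treats the mmCIF placeholders `.` (inapplicable) and `?` (unknown) as missing. The helper names `to_float` and `column_mean` are made up for this example.

```python
from typing import List, Optional

MISSING = {".", "?"}  # mmCIF placeholders for inapplicable / unknown values


def to_float(value: str) -> Optional[float]:
    """Convert an mmCIF string value to float; placeholders become None."""
    return None if value in MISSING else float(value)


def column_mean(values: List[str]) -> float:
    """Mean of the numeric entries in a column, ignoring missing values."""
    present = [n for n in (to_float(v) for v in values) if n is not None]
    return sum(present) / len(present)  # raises ZeroDivisionError if all missing


# Column values copied from the output above.
xs = ['20.154', '21.618', '22.147', '23.456', '24.123', '12.345', '13.456']
ys = ['6.718', '6.765', '8.178', '9.012', '10.234', '15.678', '16.789']
zs = ['46.973', '47.254', '47.451', '48.123', '48.567', '35.432', '36.543']

centroid = (column_mean(xs), column_mean(ys), column_mean(zs))
print("Centroid:", tuple(round(c, 3) for c in centroid))
```

The same conversion applies to columns such as `label_seq_id`, which holds `.` for the water atoms in the demo data.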

7. Validating mmCIF Data

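SLOTH provides multi-level validation: MMCIFValidator.validate() runs the full dictionary and wwPDB checks, while ValidatorPlugin.validate() runs only the custom rules you register. Both accept a container, a block, or a single category. Schema-aware warnings flag items that are not in the mmCIF dictionary. For dot-notation validation, a validator is registered on a model instance, and the scope is inferred from the type of object you call register() on.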

[7]:

# --- 7a. MMCIFValidator.validate() — the recommended approach ---
print("🛡️  Validation with MMCIFValidator.validate()")
print("=" * 50)

# Full validation (dictionary schema + wwPDB rules)
vp = MMCIFValidator()
report = vp.validate(mmcif)
print(f"\n📊 Full validation report:")
print(f"   Valid:    {report.is_valid}")
print(f"   Errors:   {len(report.errors)}")
print(f"   Warnings: {len(report.warnings)}")

# Empty ValidatorPlugin — no rules, so empty report
empty_vp = ValidatorPlugin()
empty_report = empty_vp.validate(mmcif)
print(f"\n📊 Empty plugin (no rules):")
print(f"   Issues: {len(empty_report.all_issues)}")

# --- 7b. Validate a single block or category ---
print("\n\n🔍 Validating a single block:")
block_report = vp.validate(block)
print(f"   Valid: {block_report.is_valid}")

print("\n🔍 Validating a single category:")
cat_report = vp.validate(block._entry)
print(f"   _entry valid: {cat_report.is_valid}")

# --- 7c. Plugin registration for dot-notation validation ---
print("\n\n🔌 Plugin-based validation:")
# Build a custom validator with rule factories
custom_vp = ValidatorPlugin()
custom_vp.register_validator("_entry", mandatory_items(["id"]))
custom_vp.register_validator("_entry", value_length("id", min_len=1))

# Parse test data
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(COMPREHENSIVE_DEMO_MMCIF)
    tmp_path = tmp.name

try:
    mmcif_validated = handler.read(tmp_path)
    block_validated = mmcif_validated.data[0]

    # Register on a category for dot-notation .validate()
    block_validated._entry.register("validate", custom_vp)
    print("   _entry.validate():")
    block_validated._entry.validate()
    print("   ✅ Passed!")

    # Cross-category validation with .against()
    if len(block_validated.categories) >= 2:
        cat_a, cat_b = block_validated.categories[:2]
        block_validated[cat_a].register("validate", custom_vp)
        print(f"   {cat_a}.validate().against({cat_b}):")
        block_validated[cat_a].validate().against(block_validated[cat_b])
        print("   ✅ Passed!")
finally:
    os.unlink(tmp_path)

# --- 7d. Schema warnings ---
print("\n\n⚠️  Schema warnings demo:")
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    block._entry.my_unknown_field = ["test"]
    if w:
        print(f"   SchemaWarning: {w[0].message}")
    else:
        print("   (no warning — field may be known)")

print("\n✅ Validation features demonstrated!")

🛡️  Validation with MMCIFValidator.validate()
==================================================

📊 Full validation report:
   Valid:    False
   Errors:   7
   Warnings: 0

📊 Empty plugin (no rules):
   Issues: 0


🔍 Validating a single block:
   Valid: False

🔍 Validating a single category:
   _entry valid: True


🔌 Plugin-based validation:
   _entry.validate():
   ✅ Passed!
   _entry.validate().against(_database_2):
   ✅ Passed!


⚠️  Schema warnings demo:
   SchemaWarning: Item 'my_unknown_field' is not in the mmCIF dictionary for category '_entry'.

✅ Validation features demonstrated!
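
The calls `mandatory_items(["id"])` and `value_length("id", min_len=1)` above are rule factories: each returns a rule that is later applied to a category. The general pattern can be illustrated with plain closures. This sketch is independent of SLOTH; the rule signature (a dict of columns in, a list of messages out) and all names in it are assumptions chosen for illustration, not SLOTH's actual interface.

```python
from typing import Callable, Dict, List

Columns = Dict[str, List[str]]         # item name -> column of values
Rule = Callable[[Columns], List[str]]  # returns a list of problem messages


def require_items(names: List[str]) -> Rule:
    """Build a rule that reports items missing from a category."""
    def rule(columns: Columns) -> List[str]:
        return [f"missing item '{n}'" for n in names if n not in columns]
    return rule


def min_length(item: str, min_len: int) -> Rule:
    """Build a rule that reports values shorter than min_len characters."""
    def rule(columns: Columns) -> List[str]:
        return [
            f"{item}[{i}] is shorter than {min_len}"
            for i, value in enumerate(columns.get(item, []))
            if len(value) < min_len
        ]
    return rule


def run_rules(columns: Columns, rules: List[Rule]) -> List[str]:
    """Apply every rule and collect all messages."""
    return [msg for rule in rules for msg in rule(columns)]


entry = {"id": ["DEMO"]}
print(run_rules(entry, [require_items(["id"]), min_length("id", 1)]))  # []
```

Keeping each rule a small function makes it easy to combine rules per category, which is what registering several validators on `_entry` does in the cell above.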

8. Modifying mmCIF Data

Modify data using dot notation assignments. Columns behave like lists, so an individual value can be reassigned in place by index.

[8]:

print("✏️  Modifying data with dot notation...")

# Modify database information using elegant dot notation
if "_database_2" in block.categories:
    db_category = block._database_2

    print(f"\n📋 Original database_id: {db_category.database_id}")

    # Simple assignment with dot notation - change the last entry
    original_value = db_category.database_id[-1]
    db_category.database_id[-1] = "BMRB"  # Change EMDB to BMRB

    print(f"✏️  Modified database_id: '{original_value}' → '{db_category.database_id[-1]}'")
    print(f"   Using: block._database_2.database_id[-1] = 'BMRB'")
    print(f"\n📋 Updated database_id: {db_category.database_id}")

print("\n✅ Data modification complete!")

✏️  Modifying data with dot notation...

📋 Original database_id: ['PDB', 'EMDB']
✏️  Modified database_id: 'EMDB' → 'BMRB'
   Using: block._database_2.database_id[-1] = 'BMRB'

📋 Updated database_id: ['PDB', 'BMRB']

✅ Data modification complete!

9. Creating Sample Data - Manual Approach

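The traditional approach: manually writing mmCIF format strings. The content is built as an in-memory string; because the gemmi backend reads from a file path, the example writes it to a short-lived temporary file before parsing and deletes the file afterwards.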

[9]:

print("🖋️  Method 1: Manual mmCIF content creation (in-memory)")

# Create mmCIF content as a string
sample_content = """data_1ABC
_entry.id 1ABC_STRUCTURE
_database_2.database_id PDB
_database_2.database_code 1ABC
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N 10.123 20.456 30.789
ATOM 2 C 11.234 21.567 31.890
"""

# sample_content is already an in-memory string
# (len() counts characters, which equals bytes for ASCII content)
print(f"✅ Created in-memory mmCIF content ({len(sample_content)} bytes)")

# Parse from temporary file (gemmi requires file path)
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(sample_content)
    tmp_path = tmp.name

try:
    manual_mmcif = handler.read(tmp_path)
    print(f"✅ Verified: {len(manual_mmcif.data[0].categories)} categories")
finally:
    os.unlink(tmp_path)

🖋️  Method 1: Manual mmCIF content creation (in-memory)
✅ Created in-memory mmCIF content (275 bytes)
✅ Verified: 3 categories
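
Hand-writing loop blocks gets tedious as the number of rows grows. A small formatter can generate the `loop_` text from columns, as in the sketch below. The function names `quote` and `format_loop` are hypothetical, and the quoting rule is deliberately simplified: it handles only empty values and values containing whitespace, not values with embedded quotes, leading special characters, or multi-line text.

```python
from typing import Dict, List


def quote(value: str) -> str:
    """Quote a value if it is empty or contains whitespace (simplified)."""
    if value == "" or any(ch.isspace() for ch in value):
        return f"'{value}'"
    return value


def format_loop(category: str, columns: Dict[str, List[str]]) -> str:
    """Render equal-length columns as an mmCIF loop_ block."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) != 1:
        raise ValueError("columns must be non-empty and have equal row counts")
    lines = ["loop_"]
    lines += [f"{category}.{item}" for item in columns]
    for row in zip(*columns.values()):
        lines.append(" ".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


loop_text = format_loop(
    "_atom_site",
    {"group_PDB": ["ATOM", "ATOM"], "id": ["1", "2"], "type_symbol": ["N", "C"]},
)
print(loop_text)
```

The generated text has the same shape as the `_atom_site` loop in the sample content above and could be appended to a `data_` header line before parsing.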

10. Creating Sample Data - Programmatic Approach

Create mmCIF data programmatically using SLOTH’s API with dictionary-style assignments.

[10]:

print("⚙️  Method 2: Programmatic creation (in-memory)")

# Create container and block
mmcif_prog = MMCIFDataContainer()
block_prog = DataBlock("1ABC")

# Create categories and add data
entry_category = Category("_entry")
entry_category["id"] = ["1ABC_STRUCTURE"]

database_category = Category("_database_2")
database_category["database_id"] = ["PDB"]
database_category["database_code"] = ["1ABC"]

atom_site_category = Category("_atom_site")
atom_site_category["group_PDB"] = ["ATOM", "ATOM"]
atom_site_category["id"] = ["1", "2"]
atom_site_category["type_symbol"] = ["N", "C"]
atom_site_category["Cartn_x"] = ["10.123", "11.234"]
atom_site_category["Cartn_y"] = ["20.456", "21.567"]
atom_site_category["Cartn_z"] = ["30.789", "31.890"]

# Add categories to block
block_prog["_entry"] = entry_category
block_prog["_database_2"] = database_category
block_prog["_atom_site"] = atom_site_category

# Add block to container
mmcif_prog["1ABC"] = block_prog

# Write via a temporary file, then read the text back into a string
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif_prog, tmp.name)
    tmp_path = tmp.name

try:
    with open(tmp_path, 'r') as f:
        programmatic_content = f.read()
    print(f"✅ Created programmatic data in-memory ({len(programmatic_content)} bytes)")
    print(f"✅ Categories: {len(mmcif_prog.data[0].categories)}")
finally:
    os.unlink(tmp_path)

⚙️  Method 2: Programmatic creation (in-memory)
✅ Created programmatic data in-memory (277 bytes)
✅ Categories: 3

11. Creating Sample Data - Auto-Creation with Dot Notation ✨

This is SLOTH’s most powerful feature! Assigning to a dot-notation path creates any missing block or category along the way, so an empty container is all you need to start.

[11]:

print("✨ Method 3: Auto-creation with Elegant Dot Notation (in-memory)")
print("=" * 50)
print("SLOTH automatically creates nested objects!")
print()

# Create an empty container - this is all you need!
mmcif_auto = MMCIFDataContainer()

# Use dot notation to auto-create everything - just like magic!
mmcif_auto.data_1ABC._entry.id = ["1ABC_STRUCTURE"]
mmcif_auto.data_1ABC._database_2.database_id = ["PDB"]
mmcif_auto.data_1ABC._database_2.database_code = ["1ABC"]

# Add atom data
mmcif_auto.data_1ABC._atom_site.group_PDB = ["ATOM", "ATOM"]
mmcif_auto.data_1ABC._atom_site.id = ["1", "2"]
mmcif_auto.data_1ABC._atom_site.type_symbol = ["N", "C"]
mmcif_auto.data_1ABC._atom_site.Cartn_x = ["10.123", "11.234"]
mmcif_auto.data_1ABC._atom_site.Cartn_y = ["20.456", "21.567"]
mmcif_auto.data_1ABC._atom_site.Cartn_z = ["30.789", "31.890"]

# Convert to in-memory string
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif_auto, tmp.name)
    tmp_path = tmp.name

try:
    with open(tmp_path, 'r') as f:
        dot_notation_content = f.read()
    print(f"✅ Created dot notation data in-memory ({len(dot_notation_content)} bytes)")
finally:
    os.unlink(tmp_path)

print(f"\n🔍 What was auto-created:")
print(f"   📦 Container: {len(mmcif_auto)} block(s)")
print(f"   🧱 Block '1ABC': {len(mmcif_auto.data_1ABC.categories)} categories")
print(f"   📂 Categories: {', '.join(mmcif_auto.data_1ABC.categories)}")
print(f"\n💎 Elegant access examples:")
print(f"   Entry ID: {mmcif_auto.data_1ABC._entry.id[0]}")
print(f"   Database: {mmcif_auto.data_1ABC._database_2.database_id[0]}")
print(f"   Atom types: {mmcif_auto.data_1ABC._atom_site.type_symbol}")
print(f"\n🚀 Just write what you want, SLOTH creates what you need!")

✨ Method 3: Auto-creation with Elegant Dot Notation (in-memory)
==================================================
SLOTH automatically creates nested objects!

✅ Created dot notation data in-memory (277 bytes)

🔍 What was auto-created:
   📦 Container: 1 block(s)
   🧱 Block '1ABC': 3 categories
   📂 Categories: _entry, _database_2, _atom_site

💎 Elegant access examples:
   Entry ID: 1ABC_STRUCTURE
   Database: PDB
   Atom types: ['N', 'C']

🚀 Just write what you want, SLOTH creates what you need!

12. Exporting to Nested JSON

SLOTH exports mmCIF data to nested JSON format, resolving parent-child relationships automatically. Categories like entity_poly are nested under their parent entity, and entity_poly_seq is nested under entity_poly.

[12]:

print("📊 Demonstrating nested JSON export (in-memory):")

# Export to JSON string (no file writing)
print("\n🔧 Exporting to Nested JSON (in-memory):")
json_string = handler.export(mmcif)  # Returns string when no file_path provided
json_data = json.loads(json_string)

print(f"\n📁 Export Summary:")
print(f"   JSON size: {len(json_string):,} bytes")

# Pretty-print the full nested JSON
print("\n📄 Full nested JSON output:")
print(json.dumps(json_data, indent=2))

# Display the nested structure
print("\n🔍 Nested Structure Preview:")

# Show the hierarchy
if 'data_DEMO' in json_data and '_entity' in json_data['data_DEMO']:
    entities = json_data['data_DEMO']['_entity']
    if entities:
        entity = entities[0]
        print(f"   📦 _entity (top level):")
        print(f"      - id: {entity.get('id')}")
        print(f"      - type: {entity.get('type')}")

        if '_entity_poly' in entity:
            print(f"      └─ 📦 _entity_poly (nested child):")
            poly = entity['_entity_poly'][0] if entity['_entity_poly'] else {}
            print(f"         - entity_id: {poly.get('entity_id')}")
            print(f"         - type: {poly.get('type')}")

            if '_entity_poly_seq' in poly:
                seq_list = poly['_entity_poly_seq']
                print(f"         └─ 📦 _entity_poly_seq (nested grandchild): {len(seq_list)} residues")
                if seq_list:
                    print(f"            - First residue: {seq_list[0].get('mon_id')} at position {seq_list[0].get('num')}")

print("\n✨ All category names have the '_' prefix, whether nested or not!")

📊 Demonstrating nested JSON export (in-memory):

🔧 Exporting to Nested JSON (in-memory):
📦 Using cached mapping rules
📦 Using cached dictionary data

📁 Export Summary:
   JSON size: 3,129 bytes

📄 Full nested JSON output:
{
  "data_DEMO": {
    "_entry": [
      {
        "id": "DEMO",
        "my_unknown_field": "test"
      }
    ],
    "_database_2": [
      {
        "database_id": "BMRB",
        "database_code": "DEMO"
      },
      {
        "database_id": "PDB",
        "database_code": "DEMO"
      }
    ],
    "_entity": [
      {
        "id": "1",
        "type": "polymer",
        "src_method": "man",
        "pdbx_description": "'Catalytic domain of model transferase'",
        "_entity_poly": [
          {
            "entity_id": "1",
            "type": "'polypeptide(L)'",
            "nstd_chirality": "no",
            "pdbx_seq_one_letter_code": "MAGLY",
            "_entity_poly_seq": [
              {
                "entity_id": "1",
                "num": "1",
                "mon_id": "MET"
              },
              {
                "entity_id": "1",
                "num": "2",
                "mon_id": "ALA"
              },
              {
                "entity_id": "1",
                "num": "3",
                "mon_id": "GLY"
              },
              {
                "entity_id": "1",
                "num": "4",
                "mon_id": "LEU"
              },
              {
                "entity_id": "1",
                "num": "5",
                "mon_id": "TYR"
              }
            ]
          }
        ],
        "_struct_asym": [
          {
            "id": "A",
            "entity_id": "1",
            "details": "'Protein chain A'",
            "_atom_site": [
              {
                "group_PDB": "ATOM",
                "id": "1",
                "type_symbol": "N",
                "label_atom_id": "N",
                "label_comp_id": "MET",
                "label_asym_id": "A",
                "label_entity_id": "1",
                "label_seq_id": "1",
                "Cartn_x": "20.154",
                "Cartn_y": "6.718",
                "Cartn_z": "46.973",
                "occupancy": "1.00",
                "B_iso_or_equiv": "25.00",
                "pdbx_PDB_model_num": "1"
              },
              {
                "group_PDB": "ATOM",
                "id": "2",
                "type_symbol": "C",
                "label_atom_id": "CA",
                "label_comp_id": "MET",
                "label_asym_id": "A",
                "label_entity_id": "1",
                "label_seq_id": "1",
                "Cartn_x": "21.618",
                "Cartn_y": "6.765",
                "Cartn_z": "47.254",
                "occupancy": "1.00",
                "B_iso_or_equiv": "24.50",
                "pdbx_PDB_model_num": "1"
              },
              {
                "group_PDB": "ATOM",
                "id": "3",
                "type_symbol": "C",
                "label_atom_id": "C",
                "label_comp_id": "MET",
                "label_asym_id": "A",
                "label_entity_id": "1",
                "label_seq_id": "1",
                "Cartn_x": "22.147",
                "Cartn_y": "8.178",
                "Cartn_z": "47.451",
                "occupancy": "1.00",
                "B_iso_or_equiv": "23.85",
                "pdbx_PDB_model_num": "1"
              },
              {
                "group_PDB": "ATOM",
                "id": "4",
                "type_symbol": "N",
                "label_atom_id": "N",
                "label_comp_id": "ALA",
                "label_asym_id": "A",
                "label_entity_id": "1",
                "label_seq_id": "2",
                "Cartn_x": "23.456",
                "Cartn_y": "9.012",
                "Cartn_z": "48.123",
                "occupancy": "1.00",
                "B_iso_or_equiv": "22.45",
                "pdbx_PDB_model_num": "1"
              },
              {
                "group_PDB": "ATOM",
                "id": "5",
                "type_symbol": "C",
                "label_atom_id": "CA",
                "label_comp_id": "ALA",
                "label_asym_id": "A",
                "label_entity_id": "1",
                "label_seq_id": "2",
                "Cartn_x": "24.123",
                "Cartn_y": "10.234",
                "Cartn_z": "48.567",
                "occupancy": "1.00",
                "B_iso_or_equiv": "21.30",
                "pdbx_PDB_model_num": "1"
              }
            ]
          }
        ]
      },
      {
        "id": "2",
        "type": "water",
        "src_method": "nat",
        "pdbx_description": "'Water molecules'",
        "_struct_asym": [
          {
            "id": "W",
            "entity_id": "2",
            "details": "'Water chain'",
            "_atom_site": [
              {
                "group_PDB": "HETATM",
                "id": "6",
                "type_symbol": "O",
                "label_atom_id": "O",
                "label_comp_id": "HOH",
                "label_asym_id": "W",
                "label_entity_id": "2",
                "label_seq_id": ".",
                "Cartn_x": "12.345",
                "Cartn_y": "15.678",
                "Cartn_z": "35.432",
                "occupancy": "1.00",
                "B_iso_or_equiv": "18.56",
                "pdbx_PDB_model_num": "1"
              },
              {
                "group_PDB": "HETATM",
                "id": "7",
                "type_symbol": "O",
                "label_atom_id": "O",
                "label_comp_id": "HOH",
                "label_asym_id": "W",
                "label_entity_id": "2",
                "label_seq_id": ".",
                "Cartn_x": "13.456",
                "Cartn_y": "16.789",
                "Cartn_z": "36.543",
                "occupancy": "1.00",
                "B_iso_or_equiv": "19.67",
                "pdbx_PDB_model_num": "1"
              }
            ]
          }
        ]
      }
    ]
  }
}

🔍 Nested Structure Preview:
   📦 _entity (top level):
      - id: 1
      - type: polymer
      └─ 📦 _entity_poly (nested child):
         - entity_id: 1
         - type: 'polypeptide(L)'
         └─ 📦 _entity_poly_seq (nested grandchild): 5 residues
            - First residue: MET at position 1

✨ All category names have the '_' prefix, whether nested or not!
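
The nesting itself is a grouping of child rows under the parent row whose key they reference. The sketch below shows that idea on plain dictionaries. The function `nest` and its parameters are hypothetical. SLOTH appears to derive the parent-child keys from cached dictionary data (see the log lines above), whereas here they are passed in explicitly.

```python
import json
from typing import Dict, List

Row = Dict[str, object]


def nest(parents: List[Row], children: List[Row],
         parent_key: str, child_key: str, name: str) -> List[Row]:
    """Attach each child row to the parent whose key matches (copies rows)."""
    by_parent: Dict[object, List[Row]] = {}
    for child in children:
        by_parent.setdefault(child[child_key], []).append(dict(child))
    nested = []
    for parent in parents:
        row = dict(parent)
        matches = by_parent.get(parent[parent_key])
        if matches:
            row[name] = matches  # only add the key when children exist
        nested.append(row)
    return nested


entity = [{"id": "1", "type": "polymer"}, {"id": "2", "type": "water"}]
entity_poly = [{"entity_id": "1", "type": "polypeptide(L)"}]
entity_poly_seq = [
    {"entity_id": "1", "num": "1", "mon_id": "MET"},
    {"entity_id": "1", "num": "2", "mon_id": "ALA"},
]

# Build from the innermost level outwards.
poly = nest(entity_poly, entity_poly_seq, "entity_id", "entity_id", "_entity_poly_seq")
tree = nest(entity, poly, "id", "entity_id", "_entity_poly")
print(json.dumps(tree, indent=2))
```

As in the export above, the water entity ends up without an `_entity_poly` key because no child row references it.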

13. Importing from JSON

Import previously exported JSON files back into SLOTH’s data structures.

[13]:

print("📥 Demonstrating import functionality (in-memory):")

# Import from JSON string via temporary file
print("\n✅ Importing from JSON:")
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as tmp:
    tmp.write(json_string)
    tmp_path = tmp.name

try:
    imported_container = handler.load(tmp_path)
    print(f"   ✅ Successfully imported from JSON")
    print(f"      Data blocks: {len(imported_container.data)}")
    if imported_container.data:
        imported_block = imported_container.data[0]
        print(f"      Categories: {len(imported_block.categories)}")
        print(f"      Category names: {', '.join(imported_block.categories[:5])}...")
finally:
    os.unlink(tmp_path)

📥 Demonstrating import functionality (in-memory):

✅ Importing from JSON:
   ✅ Successfully imported from JSON
      Data blocks: 1
      Categories: 7
      Category names: _entry, _database_2, _entity, _entity_poly, _entity_poly_seq...
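
The import reports seven flat categories even though the JSON was nested, so the nested child categories must be pulled back out to the block level. A rough sketch of that flattening step on plain dictionaries is shown below. It relies on the convention, visible in the export above, that nested categories are list values under keys starting with `_`. The function `flatten` is hypothetical and is not how `handler.load` is necessarily implemented.

```python
from typing import Dict, List


def flatten(block: Dict[str, list]) -> Dict[str, List[dict]]:
    """Collect rows per category, pulling nested '_'-prefixed keys back out."""
    flat: Dict[str, List[dict]] = {}

    def visit(category: str, rows: list) -> None:
        bucket = flat.setdefault(category, [])  # register parent before children
        for row in rows:
            plain: dict = {}
            bucket.append(plain)
            for key, value in row.items():
                if key.startswith("_") and isinstance(value, list):
                    visit(key, value)   # nested child category
                else:
                    plain[key] = value  # ordinary item

    for category, rows in block.items():
        visit(category, rows)
    return flat


nested_block = {
    "_entity": [
        {"id": "1", "_entity_poly": [
            {"entity_id": "1", "_entity_poly_seq": [
                {"entity_id": "1", "num": "1", "mon_id": "MET"},
            ]},
        ]},
        {"id": "2"},
    ]
}
flat = flatten(nested_block)
print(list(flat))  # ['_entity', '_entity_poly', '_entity_poly_seq']
```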

14. Round-Trip Validation

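Validate data integrity by comparing original and imported data. The example below is a spot check: it compares category counts and the first value of one shared item rather than every value.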

[14]:

print("🔄 Demonstrating round-trip validation:")

# Compare original and imported data
original_block = mmcif.data[0]
imported_block = imported_container.data[0]

print(f"\n📊 Comparing original vs imported:")
print(f"   Original categories: {len(original_block.categories)}")
print(f"   Imported categories: {len(imported_block.categories)}")

# Find common categories
common_categories = set(original_block.categories).intersection(
    set(imported_block.categories)
)
print(f"   ✓ Common categories: {len(common_categories)}")

# Check a sample category in detail
if common_categories:
    sample_cat = sorted(common_categories)[0]  # Sort for deterministic output
    print(f"\n🔍 Checking category: {sample_cat}")

    original_cat = original_block[sample_cat]
    imported_cat = imported_block[sample_cat]

    # Compare item names
    original_items = set(original_cat.items)
    imported_items = set(imported_cat.items)
    common_items = original_items.intersection(imported_items)

    if common_items:
        sample_item = sorted(common_items)[0]  # Sort for deterministic output
        original_values = original_cat[sample_item]
        imported_values = imported_cat[sample_item]

        print(f"   Item: '{sample_item}'")
        print(f"   Original values: {len(original_values)} items")
        print(f"   Imported values: {len(imported_values)} items")

        if len(original_values) > 0 and len(imported_values) > 0:
            if original_values[0] == imported_values[0]:
                print(f"   ✅ First value matches: '{original_values[0]}'")
            else:
                print(f"   ⚠️ First value differs!")

print("\n✅ Round-trip validation complete!")

🔄 Demonstrating round-trip validation:

📊 Comparing original vs imported:
   Original categories: 7
   Imported categories: 7
   ✓ Common categories: 7

🔍 Checking category: _atom_site
   Item: 'B_iso_or_equiv'
   Original values: 7 items
   Imported values: 7 items
   ✅ First value matches: '25.00'

✅ Round-trip validation complete!
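
A stricter check compares every item of every category. The sketch below does this on plain nested dictionaries (category → item → list of values). `diff_blocks` is a hypothetical helper, not part of SLOTH.

```python
from typing import Dict, List

Block = Dict[str, Dict[str, List[str]]]  # category -> item -> column


def diff_blocks(original: Block, imported: Block) -> List[str]:
    """Return a human-readable list of differences; empty means identical."""
    problems: List[str] = []
    for cat in sorted(set(original) | set(imported)):
        if cat not in original or cat not in imported:
            problems.append(f"{cat}: present on one side only")
            continue
        a, b = original[cat], imported[cat]
        for item in sorted(set(a) | set(b)):
            if item not in a or item not in b:
                problems.append(f"{cat}.{item}: present on one side only")
            elif list(a[item]) != list(b[item]):
                problems.append(f"{cat}.{item}: values differ")
    return problems


before = {"_database_2": {"database_id": ["PDB", "BMRB"]}}
after = {"_database_2": {"database_id": ["BMRB", "PDB"]}}
print(diff_blocks(before, before))  # []
print(diff_blocks(before, after))   # ['_database_2.database_id: values differ']
```

Using the accessors shown in the cell above, a SLOTH block could presumably be converted to this shape with `{c: {i: list(block[c][i]) for i in block[c].items} for c in block.categories}`. Note that the comparison is order-sensitive: the JSON export earlier lists the `_database_2` rows in a different order than the original column, so row order may need to be normalized before comparing.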

15. Writing Modified mmCIF Files

Write modified mmCIF data using the handler’s write method. For demos, we use temporary files that are immediately cleaned up.

[15]:

print("💾 Writing mmCIF to in-memory string:")

# handler.write() targets a file path, so write to a temp file and read it back
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif, tmp.name)
    tmp_path = tmp.name

try:
    with open(tmp_path, 'r') as f:
        output_content = f.read()
    print(f"✅ Written to in-memory string ({len(output_content)} bytes)")

    # Parse it back to verify
    with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp2:
        tmp2.write(output_content)
        tmp2_path = tmp2.name

    try:
        verify_data = handler.read(tmp2_path)
        verify_block = verify_data.data[0]

        print("\n🔍 Verifying output:")
        print(f"   ✅ Data blocks: {len(verify_data.data)}")
        print(f"   ✅ Categories: {len(verify_block.categories)}")
        print(f"   ✅ Block name: '{verify_block.name}'")

        # Show a sample of the data
        if "_database_2" in verify_block.categories:
            db_cat = verify_block._database_2
            print(f"\n📋 Verification - Database information:")
            print(f"   Database IDs: {db_cat.database_id}")
            print(f"   Database codes: {db_cat.database_code}")
    finally:
        os.unlink(tmp2_path)
finally:
    os.unlink(tmp_path)

💾 Writing mmCIF to in-memory string:
✅ Written to in-memory string (1363 bytes)

🔍 Verifying output:
   ✅ Data blocks: 1
   ✅ Categories: 7
   ✅ Block name: 'DEMO'

📋 Verification - Database information:
   Database IDs: ['PDB', 'BMRB']
   Database codes: ['DEMO', 'DEMO']
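
The write-then-read-back pattern above recurs several times in this cookbook. One way to factor it out is a small context manager that hands out a temporary path and always removes it afterwards. This is a stdlib-only sketch; `temporary_path` and `write_to_string` are hypothetical helper names, not SLOTH functions. The only SLOTH call it assumes is `handler.write(mmcif, path)` as used in the cell above, and that appears only in a comment.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_path(suffix=""):
    """Yield a path to an empty temp file and delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # close our handle so another writer can open the path
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.unlink(path)


def write_to_string(write_func, suffix=".cif"):
    """Call write_func(path) on a temp path and return the file's text."""
    with temporary_path(suffix) as path:
        write_func(path)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()


# Stand-in writer so the sketch runs without SLOTH installed.
def fake_writer(path):
    with open(path, "w", encoding="utf-8") as f:
        f.write("data_DEMO\n")


print(write_to_string(fake_writer))

# With SLOTH, the call would presumably look like:
# output_content = write_to_string(lambda p: handler.write(mmcif, p))
```

Closing the file descriptor before handing out the path means the writer never targets a file that is still open, which some platforms do not allow.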

16. Complete Workflow Example

Let’s put everything together in a complete workflow. Temporary files are used only where a file path is required and are deleted right away, so nothing is left on disk.

[16]:

print("🚀 Complete SLOTH Workflow (in-memory)")
print("=" * 50)

# Step 1: Create a new container with dot-notation (pending proxies auto-commit)
print("\n1️⃣ Creating new mmCIF data with dot notation...")
workflow_mmcif = MMCIFDataContainer()

workflow_mmcif.data_WORKFLOW._entry.id = ["WORKFLOW_DEMO"]
workflow_mmcif.data_WORKFLOW._database_2.database_id = ["PDB", "BMRB"]
workflow_mmcif.data_WORKFLOW._database_2.database_code = ["WORK", "WORK"]
workflow_mmcif.data_WORKFLOW._entity.id = ["1", "2"]
workflow_mmcif.data_WORKFLOW._entity.type = ["polymer", "non-polymer"]
workflow_mmcif.data_WORKFLOW._entity.pdbx_description = ["Protein", "Ligand"]

print("   ✅ Created with dot-notation auto-creation!")

# Step 2: Validate with MMCIFValidator
print("\n2️⃣ Validating data...")
workflow_vp = MMCIFValidator()
report = workflow_vp.validate(workflow_mmcif)
print(f"   Valid: {report.is_valid}")
print(f"   Errors: {len(report.errors)}, Warnings: {len(report.warnings)}")
print("   ✅ Validation complete!")

# Step 3: Modify the data
print("\n3️⃣ Modifying data...")
workflow_mmcif.data_WORKFLOW._entry.id[0] = "MODIFIED_WORKFLOW"
print(f"   ✅ Modified entry ID to: {workflow_mmcif.data_WORKFLOW._entry.id[0]}")

# Step 4: Export to JSON (in-memory)
print("\n4️⃣ Exporting to JSON (in-memory)...")
workflow_handler = MMCIFHandler()
workflow_json_string = workflow_handler.export(workflow_mmcif)
print(f"   ✅ Exported to in-memory JSON ({len(workflow_json_string)} bytes)")

# Step 5: Import from JSON (in-memory via temp file)
print("\n5️⃣ Importing from JSON (in-memory)...")
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as tmp:
    tmp.write(workflow_json_string)
    tmp_path = tmp.name

try:
    reimported = workflow_handler.load(tmp_path)
    print(f"   ✅ Reimported! Categories: {len(reimported.data[0].categories)}")
finally:
    os.unlink(tmp_path)

# Step 6: Write to in-memory mmCIF string
print("\n6️⃣ Writing final mmCIF (in-memory)...")
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    workflow_handler.write(reimported, tmp.name)
    tmp_path = tmp.name

try:
    with open(tmp_path, 'r') as f:
        final_content = f.read()
    print(f"   ✅ Written to in-memory string ({len(final_content)} bytes)")
finally:
    os.unlink(tmp_path)

# Step 7: Verify round-trip
print("\n7️⃣ Verifying round-trip integrity...")
original_cats = len(workflow_mmcif.data[0].categories)
final_cats = len(reimported.data[0].categories)
print(f"   Original categories: {original_cats}")
print(f"   Final categories: {final_cats}")
print(f"   ✅ Round-trip successful!" if original_cats == final_cats else "   ⚠️ Category count changed")

print("\n" + "=" * 50)
print("🎉 Complete workflow finished successfully!")
print("💡 Temp files are deleted right after use - nothing is left on disk!")

🚀 Complete SLOTH Workflow (in-memory)
==================================================

1️⃣ Creating new mmCIF data with dot notation...
   ✅ Created with dot-notation auto-creation!

2️⃣ Validating data...
   Valid: False
   Errors: 2, Warnings: 0
   ✅ Validation complete!

3️⃣ Modifying data...
   ✅ Modified entry ID to: MODIFIED_WORKFLOW

4️⃣ Exporting to JSON (in-memory)...
📦 Using cached mapping rules
📦 Using cached dictionary data
   ✅ Exported to in-memory JSON (318 bytes)

5️⃣ Importing from JSON (in-memory)...
   ✅ Reimported! Categories: 3

6️⃣ Writing final mmCIF (in-memory)...
   ✅ Written to in-memory string (213 bytes)

7️⃣ Verifying round-trip integrity...
   Original categories: 3
   Final categories: 3
   ✅ Round-trip successful!

==================================================
🎉 Complete workflow finished successfully!
💡 Temp files are deleted right after use - nothing is left on disk!
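
Step 2 of the workflow reports `Valid: False` with 2 errors, yet the run carries on without showing what failed. A small helper that prints the errors makes that outcome easier to act on. The sketch below relies only on the three attributes used in the cell above (`is_valid`, `errors`, `warnings`) and assumes `errors` is iterable; `summarize_report` is a hypothetical name, and the report object here is a stand-in so the example runs without SLOTH.

```python
from types import SimpleNamespace


def summarize_report(report, max_shown=5):
    """Format a short summary from an object exposing is_valid, errors, warnings."""
    status = "valid" if report.is_valid else "INVALID"
    lines = [
        f"Status: {status} "
        f"({len(report.errors)} errors, {len(report.warnings)} warnings)"
    ]
    # Show at most max_shown errors so long reports stay readable
    for err in list(report.errors)[:max_shown]:
        lines.append(f"  error: {err}")
    hidden = len(report.errors) - max_shown
    if hidden > 0:
        lines.append(f"  ... and {hidden} more")
    return "\n".join(lines)


# Stand-in report with placeholder messages, for demonstration only.
# A real report would come from MMCIFValidator().validate(container).
fake_report = SimpleNamespace(
    is_valid=False,
    errors=["placeholder error A", "placeholder error B"],
    warnings=[],
)
print(summarize_report(fake_report))
```

In a real pipeline, the same check could decide whether to continue to the export step or stop and fix the data first.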

Summary

You’ve learned how to:

✅ Parse mmCIF files with the high-performance gemmi backend
✅ Access data elegantly using dot notation with tab completion
✅ Slice data both column-wise and row-wise
✅ Validate data with MMCIFValidator().validate() (standalone or registered)
✅ Use schema warnings to catch unknown categories/items
✅ Write custom rules with composable rule factories and the plugin system
✅ Register plugins on models for dot-notation validation
✅ Modify data with simple assignments
✅ Create data three ways: manual, programmatic, and auto-creation
✅ Export to nested JSON with automatic relationship resolution
✅ Import from JSON with full round-trip support
✅ Write modified mmCIF files
✅ Execute complete workflows combining all features

Key Takeaways

🦥 Lazy by design, fast by default — SLOTH combines elegant APIs with high performance
✨ Pending proxies — non-existent categories/blocks are created on first write, not on access
🐍 Pythonic — dot notation, tab completion, and fuzzy “Did you mean …?” errors
🔄 Round-trip support — full mmCIF → JSON → mmCIF conversion
🌲 Nested JSON — automatically resolves parent-child relationships for hierarchical data
🛡️ Model-level validation — register validators on any model, or call vp.validate() standalone
🔌 Plugin system — validation is just one example; register any callable on a model

Next Steps

Explore the SLOTH documentation
Check out real-world examples in the repository
Contribute to the project on GitHub

Happy coding with SLOTH! 🦥