🦥 SLOTH Cookbook
Structural Loader with On-demand Traversal Handling
Lazy by design. Fast by default.
This cookbook demonstrates how to use SLOTH to parse, validate, modify, and write mmCIF files using dot notation and a high-performance gemmi backend.
1. Setup and Installation
SLOTH can be installed via pip. Make sure you have Python 3.8 or higher.
[1]:
# Install SLOTH (if not already installed)
# !pip install -i https://test.pypi.org/simple/ sloth-mmcif
# Verify installation
import sloth
print(f"✅ SLOTH version: {getattr(sloth, '__version__', 'installed')}")
✅ SLOTH version: 0.8.0
2. Import Required Libraries
Let's import all the SLOTH components we'll need for this cookbook.
[2]:
import io
import os
import json
import tempfile
import warnings
from pathlib import Path
# SLOTH core components
from sloth.mmcif import (
MMCIFHandler,
ValidatorPlugin,
DataSourceFormat,
SchemaWarning,
ValidationReport,
SchemaValidator,
MMCIFValidator,
DataBlockValidator,
ContainerValidator,
mandatory_items,
value_length,
)
# SLOTH data models
from sloth.mmcif.models import MMCIFDataContainer, DataBlock, Category
print("✅ All libraries imported successfully!")
✅ All libraries imported successfully!
3. Understanding SLOTH's Core Components
SLOTH provides a Pythonic API for working with mmCIF data:
MMCIFDataContainer: The top-level container holding one or more data blocks
DataBlock: A named collection of categories (like data_1ABC)
Category: A collection of related items (like _atom_site or _entity)
Dot Notation: Access data naturally, like container.data_1ABC._atom_site.Cartn_x
Key Features
✨ Auto-creation: Objects are created automatically as you access them
High Performance: Uses the gemmi backend for fast parsing
Pythonic: Clean, intuitive API with dot notation
Round-trip: Full support for mmCIF → JSON → mmCIF conversions
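To make the auto-creation idea concrete, here is a minimal sketch of how objects can be created on first attribute access using Python's `__getattr__` hook. The `AutoNode` class is hypothetical and written only for illustration; it is not SLOTH's implementation, which may work quite differently.

```python
class AutoNode:
    """Toy stand-in for a container, block, or category (not SLOTH's real class)."""

    _INTERNAL = ("_children", "_values")

    def __init__(self):
        # Write through __dict__ so our own __setattr__ is bypassed.
        self.__dict__["_children"] = {}
        self.__dict__["_values"] = {}

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails.
        if name in self._INTERNAL or name.startswith("__"):
            raise AttributeError(name)
        if name in self._values:
            return self._values[name]
        if name not in self._children:
            self._children[name] = AutoNode()  # create the child on first access
        return self._children[name]

    def __setattr__(self, name, value):
        # Assignments store data instead of creating child nodes.
        self._values[name] = value


container = AutoNode()
container.data_1ABC._entry.id = ["1ABC_STRUCTURE"]
container.data_1ABC._atom_site.type_symbol = ["N", "C"]

print(container.data_1ABC._entry.id)
print(sorted(container.data_1ABC._children))
```

One consequence of this pattern is that merely reading a misspelled name creates an empty node, so a real library would likely defer or validate creation.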
4. Parsing mmCIF Files with Embedded Data
Let's parse a small demo structure containing one protein chain and two water molecules. We'll use embedded demo data for convenience.
[3]:
# Comprehensive demo mmCIF data with TRUE hierarchical relationships
# This structure demonstrates actual nesting with proper parent-child relationships:
# entity -> entity_poly -> entity_poly_seq (3 levels)
# entity -> struct_asym -> atom_site (3 levels)
COMPREHENSIVE_DEMO_MMCIF = """data_DEMO
#
# Entry information (top level)
#
_entry.id DEMO
#
# Database references
#
loop_
_database_2.database_id
_database_2.database_code
PDB DEMO
EMDB DEMO
#
# Entity information (parent level - root of hierarchy)
#
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
1 polymer man 'Catalytic domain of model transferase'
2 water nat 'Water molecules'
#
# Entity polymer (child of entity via entity_id -> entity.id)
# Will nest under entity with id=1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_chirality
_entity_poly.pdbx_seq_one_letter_code
1 'polypeptide(L)' no MAGLY
#
# Entity polymer sequence (child of entity_poly)
# Will nest under entity_poly, creating 3-level hierarchy
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 1 MET
1 2 ALA
1 3 GLY
1 4 LEU
1 5 TYR
#
# Structural asymmetric unit (child of entity)
# Creates parallel branch from entity
#
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
A 1 'Protein chain A'
W 2 'Water chain'
#
# Atom sites (child of struct_asym via label_asym_id -> struct_asym.id)
# Creates deeper nesting under struct_asym
#
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_PDB_model_num
ATOM 1 N N MET A 1 1 20.154 6.718 46.973 1.00 25.00 1
ATOM 2 C CA MET A 1 1 21.618 6.765 47.254 1.00 24.50 1
ATOM 3 C C MET A 1 1 22.147 8.178 47.451 1.00 23.85 1
ATOM 4 N N ALA A 1 2 23.456 9.012 48.123 1.00 22.45 1
ATOM 5 C CA ALA A 1 2 24.123 10.234 48.567 1.00 21.30 1
HETATM 6 O O HOH W 2 . 12.345 15.678 35.432 1.00 18.56 1
HETATM 7 O O HOH W 2 . 13.456 16.789 36.543 1.00 19.67 1
#
"""
print("Demo mmCIF data loaded with TRUE hierarchical relationships")
print(" 🌲 3-level nesting: entity → entity_poly → entity_poly_seq")
print(" 🌲 3-level nesting: entity → struct_asym → atom_site")
print(" ✨ This will create proper nested JSON output!")
Demo mmCIF data loaded with TRUE hierarchical relationships
 🌲 3-level nesting: entity → entity_poly → entity_poly_seq
 🌲 3-level nesting: entity → struct_asym → atom_site
 ✨ This will create proper nested JSON output!
[4]:
# Parse the embedded demo data
print("⚡ Parsing mmCIF data with gemmi backend (in-memory)...")
# handler.read() is given a file path here, so the demo string is first
# written to a temporary file that is deleted right after parsing
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(COMPREHENSIVE_DEMO_MMCIF)
    tmp_path = tmp.name
try:
    handler = MMCIFHandler()
    mmcif = handler.read(tmp_path)
    print("✅ Successfully parsed!")
    print(f" Data blocks: {len(mmcif.data)}")
    if mmcif.data:
        block = mmcif.data[0]
        print(f" Block name: '{block.name}'")
        print(f" Categories: {len(block.categories)}")
        print(f" Category names: {', '.join(block.categories)}")
finally:
    os.unlink(tmp_path)  # Clean up temp file
⚡ Parsing mmCIF data with gemmi backend (in-memory)...
✅ Successfully parsed!
 Data blocks: 1
 Block name: 'DEMO'
 Categories: 7
 Category names: _entry, _database_2, _entity, _entity_poly, _entity_poly_seq, _struct_asym, _atom_site
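The write-to-temp-file, parse, delete pattern used in the cell above recurs throughout this cookbook. A small context manager keeps it in one place. This is a sketch using only the standard library; `temp_text_file` is a hypothetical helper name, not part of SLOTH.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_text_file(content, suffix=".cif"):
    """Write content to a temporary file, yield its path, and always delete it."""
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=suffix, delete=False)
    try:
        with tmp:  # closes (and flushes) the file before the path is handed out
            tmp.write(content)
        yield tmp.name
    finally:
        os.unlink(tmp.name)


with temp_text_file("data_DEMO\n_entry.id DEMO\n") as path:
    with open(path) as fh:
        first_line = fh.readline().strip()
    existed = os.path.exists(path)

print(first_line, existed, os.path.exists(path))
```

With SLOTH, the body of the `with` block would be the place to call `handler.read(path)`, as the cell above does with its own temporary file.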
5. Exploring Data Structures with Dot Notation
SLOTH's dot notation makes accessing mmCIF data intuitive and Pythonic.
[5]:
# Access data block using dot notation
block = mmcif.data_DEMO  # Access the block by name
print(f"🧱 Block name: {block.name}")
# Access categories using dot notation
if "_entry" in block.categories:
    entry_category = block._entry
    print("\nEntry category:")
    print(f" Entry ID: {entry_category.id[0]}")
    print(f" Entry type: {entry_category.type[0] if hasattr(entry_category, 'type') else 'N/A'}")
# Access database information
if "_database_2" in block.categories:
    db_category = block._database_2
    print("\n💾 Database category:")
    print(f" Database IDs: {db_category.database_id}")
    print(f" Database codes: {db_category.database_code}")
# Access entity information
if "_entity" in block.categories:
    entity_category = block._entity
    print("\n🧬 Entity category:")
    print(f" Entity IDs: {entity_category.id}")
    print(f" Entity types: {entity_category.type}")
    print(f" Descriptions: {entity_category.pdbx_description}")
print("\n💡 Tip: Use mmcif.data_BLOCKNAME to access blocks and block._category_name.item_name for data!")
🧱 Block name: DEMO
Entry category:
 Entry ID: DEMO
 Entry type: N/A
💾 Database category:
 Database IDs: ['PDB', 'EMDB']
 Database codes: ['DEMO', 'DEMO']
🧬 Entity category:
 Entity IDs: ['1', '2']
 Entity types: ['polymer', 'water']
 Descriptions: ["'Catalytic domain of model transferase'", "'Water molecules'"]
💡 Tip: Use mmcif.data_BLOCKNAME to access blocks and block._category_name.item_name for data!
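Note that the descriptions in the output above still carry their mmCIF quoting (`'Water molecules'` keeps its single quotes). If you need the bare text, a small helper can strip one pair of matching quotes. This is a sketch; `strip_cif_quotes` is a hypothetical name, and it does not handle multi-line (semicolon-delimited) text fields.

```python
def strip_cif_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


# Values copied from the output above
descriptions = ["'Catalytic domain of model transferase'", "'Water molecules'"]
cleaned = [strip_cif_quotes(d) for d in descriptions]
print(cleaned)
```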
6. Demonstrating 2D Slicing
SLOTH supports both column-wise and row-wise access, including row slicing.
[6]:
# Column-wise access with dot notation
if "_atom_site" in block.categories:
    atom_site = block._atom_site
    print("Column-wise access (the Pythonic way):")
    print(f" Row count: {atom_site.row_count}")
    print(f" Available items: {', '.join(atom_site.items[:5])}...")
    print(f"\n Type symbols: {atom_site.type_symbol}")
    print(f" X coordinates: {atom_site.Cartn_x}")
    print(f" Y coordinates: {atom_site.Cartn_y}")
    print(f" Z coordinates: {atom_site.Cartn_z}")
    # Row-wise access
    print("\nRow-wise access (elegant and readable):")
    first_row = atom_site[0]
    print(" First atom:")
    print(f" Type: {first_row.type_symbol}")
    print(f" ID: {first_row.id}")
    print(f" Position: ({first_row.Cartn_x}, {first_row.Cartn_y}, {first_row.Cartn_z})")
    # Slicing rows
    if atom_site.row_count >= 3:
        print("\nRow slicing:")
        for i, row in enumerate(atom_site[0:3]):
            print(f" Atom {i+1}: {row.type_symbol} at ({row.Cartn_x}, {row.Cartn_y}, {row.Cartn_z})")
print("\n💪 Dot notation makes your code readable, elegant, and Pythonic!")
Column-wise access (the Pythonic way):
 Row count: 7
 Available items: group_PDB, id, type_symbol, label_atom_id, label_comp_id...
 Type symbols: ['N', 'C', 'C', 'N', 'C', 'O', 'O']
 X coordinates: ['20.154', '21.618', '22.147', '23.456', '24.123', '12.345', '13.456']
 Y coordinates: ['6.718', '6.765', '8.178', '9.012', '10.234', '15.678', '16.789']
 Z coordinates: ['46.973', '47.254', '47.451', '48.123', '48.567', '35.432', '36.543']
Row-wise access (elegant and readable):
 First atom:
 Type: N
 ID: 1
 Position: (20.154, 6.718, 46.973)
Row slicing:
 Atom 1: N at (20.154, 6.718, 46.973)
 Atom 2: C at (21.618, 6.765, 47.254)
 Atom 3: C at (22.147, 8.178, 47.451)
💪 Dot notation makes your code readable, elegant, and Pythonic!
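As the output above shows, column values come back as strings. Any arithmetic therefore needs an explicit conversion. The sketch below converts the coordinate columns to floats and computes the centroid; it works on plain lists copied from the output, and `to_float` is a hypothetical helper. In mmCIF, `.` and `?` mark inapplicable and unknown values, so they are mapped to `None` rather than passed to `float()`.

```python
def to_float(value):
    """Convert an mmCIF string value to float; '.' and '?' mean 'no value'."""
    return None if value in (".", "?") else float(value)


# Columns copied from the output above
xs = ['20.154', '21.618', '22.147', '23.456', '24.123', '12.345', '13.456']
ys = ['6.718', '6.765', '8.178', '9.012', '10.234', '15.678', '16.789']
zs = ['46.973', '47.254', '47.451', '48.123', '48.567', '35.432', '36.543']

# One (x, y, z) tuple per atom
coords = [tuple(to_float(v) for v in row) for row in zip(xs, ys, zs)]

# Average each axis over all atoms
centroid = tuple(sum(axis) / len(coords) for axis in zip(*coords))
print(centroid)
```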
7. Validating mmCIF Data
SLOTH provides multi-level validation with ValidatorPlugin.validate(), schema-aware warnings, and a flexible plugin system for custom rules. Registration happens on model instances; the scope is inferred from which class you call register() on.
[7]:
# --- 7a. MMCIFValidator.validate(): the recommended approach ---
print("🛡️ Validation with MMCIFValidator.validate()")
print("=" * 50)
# Full validation (dictionary schema + wwPDB rules)
vp = MMCIFValidator()
report = vp.validate(mmcif)
print("\nFull validation report:")
print(f" Valid: {report.is_valid}")
print(f" Errors: {len(report.errors)}")
print(f" Warnings: {len(report.warnings)}")
# Empty ValidatorPlugin: no rules, so the report is empty
empty_vp = ValidatorPlugin()
empty_report = empty_vp.validate(mmcif)
print("\nEmpty plugin (no rules):")
print(f" Issues: {len(empty_report.all_issues)}")
# --- 7b. Validate a single block or category ---
print("\n\nValidating a single block:")
block_report = vp.validate(block)
print(f" Valid: {block_report.is_valid}")
print("\nValidating a single category:")
cat_report = vp.validate(block._entry)
print(f" _entry valid: {cat_report.is_valid}")
# --- 7c. Plugin registration for dot-notation validation ---
print("\n\nPlugin-based validation:")
# Build a custom validator with rule factories
custom_vp = ValidatorPlugin()
custom_vp.register_validator("_entry", mandatory_items(["id"]))
custom_vp.register_validator("_entry", value_length("id", min_len=1))
# Parse test data
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(COMPREHENSIVE_DEMO_MMCIF)
    tmp_path = tmp.name
try:
    mmcif_validated = handler.read(tmp_path)
    block_validated = mmcif_validated.data[0]
    # Register on a category for dot-notation .validate()
    block_validated._entry.register("validate", custom_vp)
    print(" _entry.validate():")
    block_validated._entry.validate()
    print(" ✅ Passed!")
    # Cross-category validation with .against()
    if len(block_validated.categories) >= 2:
        cat_a, cat_b = block_validated.categories[:2]
        block_validated[cat_a].register("validate", custom_vp)
        print(f" {cat_a}.validate().against({cat_b}):")
        block_validated[cat_a].validate().against(block_validated[cat_b])
        print(" ✅ Passed!")
finally:
    os.unlink(tmp_path)
# --- 7d. Schema warnings ---
print("\n\n⚠️ Schema warnings demo:")
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    block._entry.my_unknown_field = ["test"]
    if w:
        print(f" SchemaWarning: {w[0].message}")
    else:
        print(" (no warning; field may be known)")
print("\n✅ Validation features demonstrated!")
🛡️ Validation with MMCIFValidator.validate()
==================================================
Full validation report:
 Valid: False
 Errors: 7
 Warnings: 0
Empty plugin (no rules):
 Issues: 0
Validating a single block:
 Valid: False
Validating a single category:
 _entry valid: True
Plugin-based validation:
 _entry.validate():
 ✅ Passed!
 _entry.validate().against(_database_2):
 ✅ Passed!
⚠️ Schema warnings demo:
 SchemaWarning: Item 'my_unknown_field' is not in the mmCIF dictionary for category '_entry'.
✅ Validation features demonstrated!
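The rule factories used in 7c (mandatory_items, value_length) follow a common closure pattern: a factory takes configuration and returns a function that checks one category. The sketch below illustrates that pattern with plain dictionaries. The names `require_items`, `min_length`, and `run_rules` are hypothetical and this is not SLOTH's implementation; its real rule signatures and report objects may differ.

```python
def require_items(names):
    """Factory: build a rule that reports items missing from a category."""
    def rule(category):
        return [f"missing item '{n}'" for n in names if n not in category]
    return rule


def min_length(item, min_len):
    """Factory: build a rule that reports values shorter than min_len."""
    def rule(category):
        return [
            f"'{item}' value {v!r} is shorter than {min_len}"
            for v in category.get(item, [])
            if len(v) < min_len
        ]
    return rule


def run_rules(category, rules):
    """Apply every rule and collect all reported issues in order."""
    issues = []
    for rule in rules:
        issues.extend(rule(category))
    return issues


rules = [require_items(["id"]), min_length("id", 1)]
issues_ok = run_rules({"id": ["DEMO"]}, rules)
issues_bad = run_rules({"id": [""]}, rules + [require_items(["title"])])
print(issues_ok)
print(issues_bad)
```

Because each rule is just a callable, new checks can be added without touching the code that runs them.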
8. Modifying mmCIF Data
Modify data using dot notation assignments.
[8]:
print("Modifying data with dot notation...")
# Modify database information using dot notation
if "_database_2" in block.categories:
    db_category = block._database_2
    print(f"\nOriginal database_id: {db_category.database_id}")
    # Simple assignment with dot notation: change the last entry
    original_value = db_category.database_id[-1]
    db_category.database_id[-1] = "BMRB"  # Change EMDB to BMRB
    print(f"Modified database_id: '{original_value}' → '{db_category.database_id[-1]}'")
    print(" Using: block._database_2.database_id[-1] = 'BMRB'")
    print(f"\nUpdated database_id: {db_category.database_id}")
print("\n✅ Data modification complete!")
Modifying data with dot notation...
Original database_id: ['PDB', 'EMDB']
Modified database_id: 'EMDB' → 'BMRB'
 Using: block._database_2.database_id[-1] = 'BMRB'
Updated database_id: ['PDB', 'BMRB']
✅ Data modification complete!
9. Creating Sample Data - Manual Approach
The traditional approach: write the mmCIF text by hand as a string, then parse it. The string is built in memory; this demo writes it to a short-lived temporary file only so that the handler can read it from a path.
[9]:
print("Method 1: Manual mmCIF content creation (in-memory)")
# Create mmCIF content as a string
sample_content = """data_1ABC
_entry.id 1ABC_STRUCTURE
_database_2.database_id PDB
_database_2.database_code 1ABC
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N 10.123 20.456 30.789
ATOM 2 C 11.234 21.567 31.890
"""
print(f"✅ Created in-memory mmCIF content ({len(sample_content)} bytes)")
# Parse from a temporary file (handler.read() is given a file path)
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    tmp.write(sample_content)
    tmp_path = tmp.name
try:
    manual_mmcif = handler.read(tmp_path)
    print(f"✅ Verified: {len(manual_mmcif.data[0].categories)} categories")
finally:
    os.unlink(tmp_path)
Method 1: Manual mmCIF content creation (in-memory)
✅ Created in-memory mmCIF content (275 bytes)
✅ Verified: 3 categories
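Hand-writing loop_ blocks is error-prone because every row must line up with the list of item names. The sketch below generates a loop from column lists instead. `format_loop` and `quote` are hypothetical helpers, and the quoting rule is deliberately simplified: it only wraps values that contain whitespace and ignores embedded quotes, empty strings, and multi-line text.

```python
def quote(value):
    """Quote values containing whitespace (simplified mmCIF quoting)."""
    return f"'{value}'" if any(ch.isspace() for ch in value) else value


def format_loop(category, columns):
    """Render {item_name: [values]} as an mmCIF loop_ block."""
    names = list(columns)
    lines = ["loop_"] + [f"{category}.{name}" for name in names]
    # zip(*...) turns the column lists into rows
    for row in zip(*(columns[name] for name in names)):
        lines.append(" ".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


text = format_loop(
    "_atom_site",
    {
        "group_PDB": ["ATOM", "ATOM"],
        "id": ["1", "2"],
        "type_symbol": ["N", "C"],
    },
)
print(text)
```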
10. Creating Sample Data - Programmatic Approach
Create mmCIF data programmatically using SLOTH's API with dictionary-style assignments.
[10]:
print("Method 2: Programmatic creation (in-memory)")
# Create container and block
mmcif_prog = MMCIFDataContainer()
block_prog = DataBlock("1ABC")
# Create categories and add data
entry_category = Category("_entry")
entry_category["id"] = ["1ABC_STRUCTURE"]
database_category = Category("_database_2")
database_category["database_id"] = ["PDB"]
database_category["database_code"] = ["1ABC"]
atom_site_category = Category("_atom_site")
atom_site_category["group_PDB"] = ["ATOM", "ATOM"]
atom_site_category["id"] = ["1", "2"]
atom_site_category["type_symbol"] = ["N", "C"]
atom_site_category["Cartn_x"] = ["10.123", "11.234"]
atom_site_category["Cartn_y"] = ["20.456", "21.567"]
atom_site_category["Cartn_z"] = ["30.789", "31.890"]
# Add categories to block
block_prog["_entry"] = entry_category
block_prog["_database_2"] = database_category
block_prog["_atom_site"] = atom_site_category
# Add block to container
mmcif_prog["1ABC"] = block_prog
# Write to a temporary file, then read the text back into a string
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif_prog, tmp.name)
    tmp_path = tmp.name
try:
    with open(tmp_path, 'r') as f:
        programmatic_content = f.read()
    print(f"✅ Created programmatic data in-memory ({len(programmatic_content)} bytes)")
    print(f"✅ Categories: {len(mmcif_prog.data[0].categories)}")
finally:
    os.unlink(tmp_path)
Method 2: Programmatic creation (in-memory)
✅ Created programmatic data in-memory (277 bytes)
✅ Categories: 3
11. Creating Sample Data - Auto-Creation with Dot Notation ✨
This is SLOTH's most powerful feature! Objects are automatically created as you access them using dot notation.
[11]:
print("✨ Method 3: Auto-creation with Elegant Dot Notation (in-memory)")
print("=" * 50)
print("SLOTH automatically creates nested objects!")
print()
# Create an empty container - this is all you need!
mmcif_auto = MMCIFDataContainer()
# Use dot notation to auto-create everything - just like magic!
mmcif_auto.data_1ABC._entry.id = ["1ABC_STRUCTURE"]
mmcif_auto.data_1ABC._database_2.database_id = ["PDB"]
mmcif_auto.data_1ABC._database_2.database_code = ["1ABC"]
# Add atom data
mmcif_auto.data_1ABC._atom_site.group_PDB = ["ATOM", "ATOM"]
mmcif_auto.data_1ABC._atom_site.id = ["1", "2"]
mmcif_auto.data_1ABC._atom_site.type_symbol = ["N", "C"]
mmcif_auto.data_1ABC._atom_site.Cartn_x = ["10.123", "11.234"]
mmcif_auto.data_1ABC._atom_site.Cartn_y = ["20.456", "21.567"]
mmcif_auto.data_1ABC._atom_site.Cartn_z = ["30.789", "31.890"]
# Write to a temporary file, then read the text back into a string
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif_auto, tmp.name)
    tmp_path = tmp.name
try:
    with open(tmp_path, 'r') as f:
        dot_notation_content = f.read()
    print(f"✅ Created dot notation data in-memory ({len(dot_notation_content)} bytes)")
finally:
    os.unlink(tmp_path)
print("\nWhat was auto-created:")
print(f" 📦 Container: {len(mmcif_auto)} block(s)")
print(f" 🧱 Block '1ABC': {len(mmcif_auto.data_1ABC.categories)} categories")
print(f" Categories: {', '.join(mmcif_auto.data_1ABC.categories)}")
print("\nElegant access examples:")
print(f" Entry ID: {mmcif_auto.data_1ABC._entry.id[0]}")
print(f" Database: {mmcif_auto.data_1ABC._database_2.database_id[0]}")
print(f" Atom types: {mmcif_auto.data_1ABC._atom_site.type_symbol}")
print("\nJust write what you want, SLOTH creates what you need!")
✨ Method 3: Auto-creation with Elegant Dot Notation (in-memory)
==================================================
SLOTH automatically creates nested objects!
✅ Created dot notation data in-memory (277 bytes)
What was auto-created:
 📦 Container: 1 block(s)
 🧱 Block '1ABC': 3 categories
 Categories: _entry, _database_2, _atom_site
Elegant access examples:
 Entry ID: 1ABC_STRUCTURE
 Database: PDB
 Atom types: ['N', 'C']
Just write what you want, SLOTH creates what you need!
12. Exporting to Nested JSON
SLOTH exports mmCIF data to nested JSON format, resolving parent-child relationships automatically. Categories like entity_poly are nested under their parent entity, and entity_poly_seq is nested under entity_poly.
[12]:
print("Demonstrating nested JSON export (in-memory):")
# Export to JSON string (no file writing)
print("\n🔧 Exporting to Nested JSON (in-memory):")
json_string = handler.export(mmcif)  # Returns string when no file_path provided
json_data = json.loads(json_string)
print("\nExport Summary:")
print(f" JSON size: {len(json_string):,} bytes")
# Pretty-print the full nested JSON
print("\nFull nested JSON output:")
print(json.dumps(json_data, indent=2))
# Display the nested structure
print("\nNested Structure Preview:")
# Show the hierarchy
if 'data_DEMO' in json_data and '_entity' in json_data['data_DEMO']:
    entities = json_data['data_DEMO']['_entity']
    if entities:
        entity = entities[0]
        print(" 📦 _entity (top level):")
        print(f" - id: {entity.get('id')}")
        print(f" - type: {entity.get('type')}")
        if '_entity_poly' in entity:
            print(" └─ 📦 _entity_poly (nested child):")
            poly = entity['_entity_poly'][0] if entity['_entity_poly'] else {}
            print(f" - entity_id: {poly.get('entity_id')}")
            print(f" - type: {poly.get('type')}")
            if '_entity_poly_seq' in poly:
                seq_list = poly['_entity_poly_seq']
                print(f" └─ 📦 _entity_poly_seq (nested grandchild): {len(seq_list)} residues")
                if seq_list:
                    print(f" - First residue: {seq_list[0].get('mon_id')} at position {seq_list[0].get('num')}")
print("\n✨ All category names have the '_' prefix, whether nested or not!")
Demonstrating nested JSON export (in-memory):
🔧 Exporting to Nested JSON (in-memory):
📦 Using cached mapping rules
📦 Using cached dictionary data
Export Summary:
 JSON size: 3,129 bytes
Full nested JSON output:
{
"data_DEMO": {
"_entry": [
{
"id": "DEMO",
"my_unknown_field": "test"
}
],
"_database_2": [
{
"database_id": "BMRB",
"database_code": "DEMO"
},
{
"database_id": "PDB",
"database_code": "DEMO"
}
],
"_entity": [
{
"id": "1",
"type": "polymer",
"src_method": "man",
"pdbx_description": "'Catalytic domain of model transferase'",
"_entity_poly": [
{
"entity_id": "1",
"type": "'polypeptide(L)'",
"nstd_chirality": "no",
"pdbx_seq_one_letter_code": "MAGLY",
"_entity_poly_seq": [
{
"entity_id": "1",
"num": "1",
"mon_id": "MET"
},
{
"entity_id": "1",
"num": "2",
"mon_id": "ALA"
},
{
"entity_id": "1",
"num": "3",
"mon_id": "GLY"
},
{
"entity_id": "1",
"num": "4",
"mon_id": "LEU"
},
{
"entity_id": "1",
"num": "5",
"mon_id": "TYR"
}
]
}
],
"_struct_asym": [
{
"id": "A",
"entity_id": "1",
"details": "'Protein chain A'",
"_atom_site": [
{
"group_PDB": "ATOM",
"id": "1",
"type_symbol": "N",
"label_atom_id": "N",
"label_comp_id": "MET",
"label_asym_id": "A",
"label_entity_id": "1",
"label_seq_id": "1",
"Cartn_x": "20.154",
"Cartn_y": "6.718",
"Cartn_z": "46.973",
"occupancy": "1.00",
"B_iso_or_equiv": "25.00",
"pdbx_PDB_model_num": "1"
},
{
"group_PDB": "ATOM",
"id": "2",
"type_symbol": "C",
"label_atom_id": "CA",
"label_comp_id": "MET",
"label_asym_id": "A",
"label_entity_id": "1",
"label_seq_id": "1",
"Cartn_x": "21.618",
"Cartn_y": "6.765",
"Cartn_z": "47.254",
"occupancy": "1.00",
"B_iso_or_equiv": "24.50",
"pdbx_PDB_model_num": "1"
},
{
"group_PDB": "ATOM",
"id": "3",
"type_symbol": "C",
"label_atom_id": "C",
"label_comp_id": "MET",
"label_asym_id": "A",
"label_entity_id": "1",
"label_seq_id": "1",
"Cartn_x": "22.147",
"Cartn_y": "8.178",
"Cartn_z": "47.451",
"occupancy": "1.00",
"B_iso_or_equiv": "23.85",
"pdbx_PDB_model_num": "1"
},
{
"group_PDB": "ATOM",
"id": "4",
"type_symbol": "N",
"label_atom_id": "N",
"label_comp_id": "ALA",
"label_asym_id": "A",
"label_entity_id": "1",
"label_seq_id": "2",
"Cartn_x": "23.456",
"Cartn_y": "9.012",
"Cartn_z": "48.123",
"occupancy": "1.00",
"B_iso_or_equiv": "22.45",
"pdbx_PDB_model_num": "1"
},
{
"group_PDB": "ATOM",
"id": "5",
"type_symbol": "C",
"label_atom_id": "CA",
"label_comp_id": "ALA",
"label_asym_id": "A",
"label_entity_id": "1",
"label_seq_id": "2",
"Cartn_x": "24.123",
"Cartn_y": "10.234",
"Cartn_z": "48.567",
"occupancy": "1.00",
"B_iso_or_equiv": "21.30",
"pdbx_PDB_model_num": "1"
}
]
}
]
},
{
"id": "2",
"type": "water",
"src_method": "nat",
"pdbx_description": "'Water molecules'",
"_struct_asym": [
{
"id": "W",
"entity_id": "2",
"details": "'Water chain'",
"_atom_site": [
{
"group_PDB": "HETATM",
"id": "6",
"type_symbol": "O",
"label_atom_id": "O",
"label_comp_id": "HOH",
"label_asym_id": "W",
"label_entity_id": "2",
"label_seq_id": ".",
"Cartn_x": "12.345",
"Cartn_y": "15.678",
"Cartn_z": "35.432",
"occupancy": "1.00",
"B_iso_or_equiv": "18.56",
"pdbx_PDB_model_num": "1"
},
{
"group_PDB": "HETATM",
"id": "7",
"type_symbol": "O",
"label_atom_id": "O",
"label_comp_id": "HOH",
"label_asym_id": "W",
"label_entity_id": "2",
"label_seq_id": ".",
"Cartn_x": "13.456",
"Cartn_y": "16.789",
"Cartn_z": "36.543",
"occupancy": "1.00",
"B_iso_or_equiv": "19.67",
"pdbx_PDB_model_num": "1"
}
]
}
]
}
]
}
}
Nested Structure Preview:
 📦 _entity (top level):
 - id: 1
 - type: polymer
 └─ 📦 _entity_poly (nested child):
 - entity_id: 1
 - type: 'polypeptide(L)'
 └─ 📦 _entity_poly_seq (nested grandchild): 5 residues
 - First residue: MET at position 1
✨ All category names have the '_' prefix, whether nested or not!
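The nesting in the JSON above follows foreign keys: each child row is attached to the parent row whose key it references (for example, _entity_poly.entity_id points at _entity.id). The sketch below shows one way such nesting can be built from flat row lists. `nest` is a hypothetical function for illustration only; SLOTH derives its relationships from mapping rules and dictionary data, as the log lines above indicate, and its algorithm may differ.

```python
from collections import defaultdict


def nest(parents, children, child_name, parent_key, child_key):
    """Attach each child row to the parent whose parent_key matches child_key."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[child_key]].append(child)
    nested = []
    for parent in parents:
        row = dict(parent)  # copy so the input rows stay untouched
        matches = grouped.get(parent[parent_key])
        if matches:
            row[child_name] = matches
        nested.append(row)
    return nested


entities = [{"id": "1", "type": "polymer"}, {"id": "2", "type": "water"}]
polys = [{"entity_id": "1", "type": "polypeptide(L)"}]
seqs = [
    {"entity_id": "1", "num": "1", "mon_id": "MET"},
    {"entity_id": "1", "num": "2", "mon_id": "ALA"},
]

# Build from the deepest level upward: seq -> poly -> entity
polys_nested = nest(polys, seqs, "_entity_poly_seq", "entity_id", "entity_id")
tree = nest(entities, polys_nested, "_entity_poly", "id", "entity_id")
print(tree)
```

As in the exported JSON, a parent without matching children (the water entity here) simply has no child key.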
13. Importing from JSON
Import previously exported JSON files back into SLOTH's data structures.
[13]:
print("📥 Demonstrating import functionality (in-memory):")
# Import from JSON string via temporary file
print("\n✅ Importing from JSON:")
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as tmp:
    tmp.write(json_string)
    tmp_path = tmp.name
try:
    imported_container = handler.load(tmp_path)
    print(" ✅ Successfully imported from JSON")
    print(f" Data blocks: {len(imported_container.data)}")
    if imported_container.data:
        imported_block = imported_container.data[0]
        print(f" Categories: {len(imported_block.categories)}")
        print(f" Category names: {', '.join(imported_block.categories[:5])}...")
finally:
    os.unlink(tmp_path)
📥 Demonstrating import functionality (in-memory):
✅ Importing from JSON:
 ✅ Successfully imported from JSON
 Data blocks: 1
 Categories: 7
 Category names: _entry, _database_2, _entity, _entity_poly, _entity_poly_seq...
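The import above returns seven flat categories even though the JSON nests some of them, so nested children must be lifted back to the top level at some point. A recursive walk is one way to do that. The sketch below is an illustration with a hypothetical `flatten` function; it is not how handler.load is implemented.

```python
def flatten(rows, category, out=None):
    """Collect nested rows into {category: [flat rows]}, parents first."""
    out = {} if out is None else out
    bucket = out.setdefault(category, [])  # register the parent before its children
    for row in rows:
        flat = {}
        for key, value in row.items():
            if isinstance(value, list):
                flatten(value, key, out)  # a list value is a nested child category
            else:
                flat[key] = value
        bucket.append(flat)
    return out


nested = [
    {
        "id": "1",
        "_entity_poly": [
            {
                "entity_id": "1",
                "_entity_poly_seq": [
                    {"entity_id": "1", "num": "1", "mon_id": "MET"}
                ],
            }
        ],
    },
    {"id": "2"},
]

flat = flatten(nested, "_entity")
print(list(flat))
```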
14. Round-Trip Validation
Validate data integrity by comparing original and imported data.
[14]:
print("Demonstrating round-trip validation:")
# Compare original and imported data
original_block = mmcif.data[0]
imported_block = imported_container.data[0]
print("\nComparing original vs imported:")
print(f" Original categories: {len(original_block.categories)}")
print(f" Imported categories: {len(imported_block.categories)}")
# Find common categories
common_categories = set(original_block.categories).intersection(
    set(imported_block.categories)
)
print(f" ✓ Common categories: {len(common_categories)}")
# Check a sample category in detail
if common_categories:
    sample_cat = sorted(common_categories)[0]  # Sort for deterministic output
    print(f"\nChecking category: {sample_cat}")
    original_cat = original_block[sample_cat]
    imported_cat = imported_block[sample_cat]
    # Compare item names
    original_items = set(original_cat.items)
    imported_items = set(imported_cat.items)
    common_items = original_items.intersection(imported_items)
    if common_items:
        sample_item = sorted(common_items)[0]  # Sort for deterministic output
        original_values = original_cat[sample_item]
        imported_values = imported_cat[sample_item]
        print(f" Item: '{sample_item}'")
        print(f" Original values: {len(original_values)} items")
        print(f" Imported values: {len(imported_values)} items")
        if len(original_values) > 0 and len(imported_values) > 0:
            if original_values[0] == imported_values[0]:
                print(f" ✅ First value matches: '{original_values[0]}'")
            else:
                print(" ⚠️ First value differs!")
print("\n✅ Round-trip validation complete!")
Demonstrating round-trip validation:
Comparing original vs imported:
 Original categories: 7
 Imported categories: 7
 ✓ Common categories: 7
Checking category: _atom_site
 Item: 'B_iso_or_equiv'
 Original values: 7 items
 Imported values: 7 items
 ✅ First value matches: '25.00'
✅ Round-trip validation complete!
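The cell above spot-checks only the first value of one item. A stricter check compares every item in every category. The sketch below does this on plain dictionaries of the form {category: {item: [values]}}; `diff_tables` is a hypothetical helper, and copying SLOTH categories into such dictionaries (for instance from `cat.items` and `cat[item]`, as used above) is left as an assumption.

```python
def diff_tables(original, imported):
    """List differences between two {category: {item: [values]}} mappings."""
    problems = []
    for cat in sorted(set(original) | set(imported)):
        if cat not in original or cat not in imported:
            problems.append(f"{cat}: present on one side only")
            continue
        for item in sorted(set(original[cat]) | set(imported[cat])):
            # .get() returns None for a missing item, which never equals a list
            if original[cat].get(item) != imported[cat].get(item):
                problems.append(f"{cat}.{item}: values differ")
    return problems


before = {
    "_entry": {"id": ["DEMO"]},
    "_database_2": {"database_id": ["PDB", "EMDB"], "database_code": ["DEMO", "DEMO"]},
}
after = {
    "_entry": {"id": ["DEMO"]},
    "_database_2": {"database_id": ["PDB", "BMRB"], "database_code": ["DEMO", "DEMO"]},
}

same = diff_tables(before, before)
changed = diff_tables(before, after)
print(same)
print(changed)
```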
15. Writing Modified mmCIF Files
Write modified mmCIF data using the handler's write method. For demos, we use temporary files that are immediately cleaned up.
[15]:
print("💾 Writing mmCIF to in-memory string:")
# handler.write() is given a file path, so write to a temp file and read it back
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(mmcif, tmp.name)
    tmp_path = tmp.name
try:
    with open(tmp_path, 'r') as f:
        output_content = f.read()
    print(f"✅ Written to in-memory string ({len(output_content)} bytes)")
    # Parse it back to verify
    with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp2:
        tmp2.write(output_content)
        tmp2_path = tmp2.name
    try:
        verify_data = handler.read(tmp2_path)
        verify_block = verify_data.data[0]
        print("\nVerifying output:")
        print(f" ✅ Data blocks: {len(verify_data.data)}")
        print(f" ✅ Categories: {len(verify_block.categories)}")
        print(f" ✅ Block name: '{verify_block.name}'")
        # Show a sample of the data
        if "_database_2" in verify_block.categories:
            db_cat = verify_block._database_2
            print("\nVerification - Database information:")
            print(f" Database IDs: {db_cat.database_id}")
            print(f" Database codes: {db_cat.database_code}")
    finally:
        os.unlink(tmp2_path)
finally:
    os.unlink(tmp_path)
💾 Writing mmCIF to in-memory string:
✅ Written to in-memory string (1363 bytes)
Verifying output:
 ✅ Data blocks: 1
 ✅ Categories: 7
 ✅ Block name: 'DEMO'
Verification - Database information:
 Database IDs: ['PDB', 'BMRB']
 Database codes: ['DEMO', 'DEMO']
16. Complete Workflow Example
Let's put everything together in a complete workflow. Temporary files are removed right after use, so no disk clutter is left behind.
[16]:
print("Complete SLOTH Workflow (in-memory)")
print("=" * 50)
# Step 1: Create a new container with dot-notation (pending proxies auto-commit)
print("\n1️⃣ Creating new mmCIF data with dot notation...")
workflow_mmcif = MMCIFDataContainer()
workflow_mmcif.data_WORKFLOW._entry.id = ["WORKFLOW_DEMO"]
workflow_mmcif.data_WORKFLOW._database_2.database_id = ["PDB", "BMRB"]
workflow_mmcif.data_WORKFLOW._database_2.database_code = ["WORK", "WORK"]
workflow_mmcif.data_WORKFLOW._entity.id = ["1", "2"]
workflow_mmcif.data_WORKFLOW._entity.type = ["polymer", "non-polymer"]
workflow_mmcif.data_WORKFLOW._entity.pdbx_description = ["Protein", "Ligand"]
print(" ✅ Created with dot-notation auto-creation!")
# Step 2: Validate with MMCIFValidator
print("\n2️⃣ Validating data...")
workflow_vp = MMCIFValidator()
report = workflow_vp.validate(workflow_mmcif)
print(f" Valid: {report.is_valid}")
print(f" Errors: {len(report.errors)}, Warnings: {len(report.warnings)}")
print(" ✅ Validation complete!")
# Step 3: Modify the data
print("\n3️⃣ Modifying data...")
workflow_mmcif.data_WORKFLOW._entry.id[0] = "MODIFIED_WORKFLOW"
print(f" ✅ Modified entry ID to: {workflow_mmcif.data_WORKFLOW._entry.id[0]}")
# Step 4: Export to JSON (in-memory)
print("\n4️⃣ Exporting to JSON (in-memory)...")
workflow_handler = MMCIFHandler()
workflow_json_string = workflow_handler.export(workflow_mmcif)
print(f" ✅ Exported to in-memory JSON ({len(workflow_json_string)} bytes)")
# Step 5: Import from JSON (via a temporary file)
print("\n5️⃣ Importing from JSON (in-memory)...")
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as tmp:
    tmp.write(workflow_json_string)
    tmp_path = tmp.name
try:
    reimported = handler.load(tmp_path)
    print(f" ✅ Reimported! Categories: {len(reimported.data[0].categories)}")
finally:
    os.unlink(tmp_path)
# Step 6: Write mmCIF to a temporary file and read it back as a string
print("\n6️⃣ Writing final mmCIF (in-memory)...")
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as tmp:
    handler.write(reimported, tmp.name)
    tmp_path = tmp.name
try:
    with open(tmp_path, 'r') as f:
        final_content = f.read()
    print(f" ✅ Written to in-memory string ({len(final_content)} bytes)")
finally:
    os.unlink(tmp_path)
# Step 7: Verify round-trip
print("\n7️⃣ Verifying round-trip integrity...")
original_cats = len(workflow_mmcif.data[0].categories)
final_cats = len(reimported.data[0].categories)
print(f" Original categories: {original_cats}")
print(f" Final categories: {final_cats}")
print(" ✅ Round-trip successful!" if original_cats == final_cats else " ⚠️ Category count changed")
print("\n" + "=" * 50)
print("Complete workflow finished successfully!")
print("💡 SLOTH works entirely in-memory - no disk I/O needed!")
Complete SLOTH Workflow (in-memory)
==================================================
1️⃣ Creating new mmCIF data with dot notation...
 ✅ Created with dot-notation auto-creation!
2️⃣ Validating data...
 Valid: False
 Errors: 2, Warnings: 0
 ✅ Validation complete!
3️⃣ Modifying data...
 ✅ Modified entry ID to: MODIFIED_WORKFLOW
4️⃣ Exporting to JSON (in-memory)...
📦 Using cached mapping rules
📦 Using cached dictionary data
 ✅ Exported to in-memory JSON (318 bytes)
5️⃣ Importing from JSON (in-memory)...
 ✅ Reimported! Categories: 3
6️⃣ Writing final mmCIF (in-memory)...
 ✅ Written to in-memory string (213 bytes)
7️⃣ Verifying round-trip integrity...
 Original categories: 3
 Final categories: 3
 ✅ Round-trip successful!
==================================================
Complete workflow finished successfully!
💡 SLOTH works entirely in-memory - no disk I/O needed!
Summary
You've learned how to:
MMCIFValidator().validate() (standalone or registered)
Key Takeaways
vp.validate() standalone
Next Steps
Explore the SLOTH documentation
Check out real-world examples in the repository
Contribute to the project on GitHub
Happy coding with SLOTH! 🦥