Data Versioning: Managing Your Dataset Evolution 📚

Learn how to track, manage, and restore different versions of your datasets in Dataloop. Versioning is your key to maintaining data lineage and reproducibility.

Getting Started with Versioning 🌟

1. Getting Your Dataset

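# Get your dataset (assumes you are logged in; replace the placeholders)
import dtlpy as dl
project = dl.projects.get(project_name='<project_name>')
dataset = project.datasets.get(dataset_id='<dataset_id>')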

2. Dataset Cloning

# Clone an entire dataset
dataset.clone(clone_name='dataset_v2',
              filters=None,
              with_items_annotations=True,
              with_metadata=True,
              with_task_annotations_status=True)

# Clone with filters
filters = dl.Filters()
filters.add(field='dir', values='/specific/folder')
dataset.clone(clone_name='filtered_dataset',
              filters=filters,
              with_items_annotations=True)

Dataset Management 📊

1. Listing Datasets

# List all datasets in the project and keep the result for later use
datasets = project.datasets.list()

2. Merging Datasets

# Merge two datasets
dataset_ids = ["dataset-1-id", "dataset-2-id"]
project_ids = ["project-1-id", "project-2-id"]
dataset_merge = dl.datasets.merge(
    merge_name="my_merged_dataset",
    project_ids=project_ids,
    dataset_ids=dataset_ids,
    with_items_annotations=True,
    with_metadata=False,
    with_task_annotations_status=False
)
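
Before calling the merge above, it can help to sanity-check the ID lists, since a typo or a duplicated ID is easy to miss. The sketch below is one possible pre-flight check in plain Python. The helper name `check_merge_inputs` and its rules (at least two datasets, no duplicates, non-empty string IDs) are assumptions made for illustration, not part of the SDK.

```python
def check_merge_inputs(dataset_ids, project_ids):
    """Return a list of problems found in the ID lists (an empty list means OK).

    Hypothetical helper: the rules below are illustrative assumptions,
    not requirements documented by the SDK.
    """
    problems = []
    if len(dataset_ids) < 2:
        problems.append("need at least two dataset IDs to merge")
    if len(set(dataset_ids)) != len(dataset_ids):
        problems.append("duplicate dataset IDs")
    if not project_ids:
        problems.append("project_ids is empty")
    # Every ID should be a non-blank string.
    for value in list(dataset_ids) + list(project_ids):
        if not isinstance(value, str) or not value.strip():
            problems.append(f"invalid ID: {value!r}")
    return problems


ok = check_merge_inputs(["dataset-1-id", "dataset-2-id"],
                        ["project-1-id", "project-2-id"])
bad = check_merge_inputs(["dataset-1-id", "dataset-1-id"], [])
print(ok)
print(bad)
```

Run the check first and call the merge only when the returned list is empty.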

Best Practices 👑

1. Dataset Organization

- Use clear, consistent naming conventions for cloned datasets
- Document the purpose of each dataset version
- Record which source dataset each clone or merge came from, so lineage stays traceable
- Validate merged datasets before using them
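
One way to keep clone names consistent is to generate them instead of typing them by hand. The sketch below assumes a hypothetical `<base>_v<N>` scheme (matching the `dataset_v2` name used earlier) and treats the unversioned original as version 1. The helper `next_clone_name` is illustrative and not part of the SDK.

```python
import re


def next_clone_name(existing_names, base):
    """Return '<base>_v<N>', where N is one more than the highest version in use.

    Assumption: the unversioned original counts as version 1, so the
    first clone is named '<base>_v2'.
    """
    pattern = re.compile(rf"^{re.escape(base)}_v(\d+)$")
    versions = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    return f"{base}_v{max(versions, default=1) + 1}"


names = ["dataset", "dataset_v2", "dataset_v3", "other_v9"]
print(next_clone_name(names, "dataset"))
```

The list of existing names could be built from the project's dataset list, assuming each listed dataset exposes a `name` attribute, and the result passed as `clone_name` when cloning.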

2. Version Management

- Clone datasets before making major changes
- Use filters to create specific subset versions
- Maintain documentation of the differences between versions
- Test merged datasets thoroughly
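
Documenting version differences is easier with a concrete summary of what changed. The sketch below compares two lists of item paths using plain Python sets. How the paths are collected from each dataset version is left out, and the helper name `diff_versions` and the sample paths are hypothetical.

```python
def diff_versions(old_paths, new_paths):
    """Compare two collections of item paths and report what changed."""
    old_set, new_set = set(old_paths), set(new_paths)
    return {
        "added": sorted(new_set - old_set),      # present only in the new version
        "removed": sorted(old_set - new_set),    # present only in the old version
        "unchanged_count": len(old_set & new_set),
    }


# Hypothetical item paths for two versions of the same dataset.
v1 = ["/train/a.jpg", "/train/b.jpg", "/val/c.jpg"]
v2 = ["/train/a.jpg", "/val/c.jpg", "/val/d.jpg"]
summary = diff_versions(v1, v2)
print(summary)
```

This only compares paths. It does not detect items whose content or annotations changed while the path stayed the same.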

3. Error Prevention

# Confirm the dataset exists before running further operations
try:
    dataset = project.datasets.get(dataset_id='<dataset_id>')
    # Proceed with operations
except dl.exceptions.NotFound:
    print("Dataset not found!")
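
If the same check is repeated in many places, the try/except pattern above can be wrapped in a small reusable helper. The sketch below is generic Python: `get_or_none` is a hypothetical name, and `FakeNotFound` and `fake_get` are stand-ins so the example runs without a platform connection. In real use you would pass `project.datasets.get` and `dl.exceptions.NotFound` instead.

```python
def get_or_none(getter, not_found_error, **kwargs):
    """Call getter(**kwargs); return None instead of raising when the entity is missing."""
    try:
        return getter(**kwargs)
    except not_found_error:
        return None


# Stand-ins for the SDK call and its exception (hypothetical, for illustration only).
class FakeNotFound(Exception):
    pass


def fake_get(dataset_id):
    known = {"abc123": "my-dataset"}
    if dataset_id not in known:
        raise FakeNotFound(dataset_id)
    return known[dataset_id]


found = get_or_none(fake_get, FakeNotFound, dataset_id="abc123")
missing = get_or_none(fake_get, FakeNotFound, dataset_id="missing")
print(found, missing)
```

Only the named exception is caught, so other failures still surface as errors.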

Pro Tips 💡

Clone with Purpose
- Always specify meaningful clone names
- Include relevant metadata and annotations
- Document the reason for cloning

Merge with Care
- Ensure datasets have compatible recipes
- Verify project and dataset IDs
- Test merged dataset integrity

Version Control
- Keep track of dataset versions
- Document changes between versions
- Maintain clear version naming conventions
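
The tip about compatible recipes can be made concrete by comparing the label names of the datasets you plan to merge. The sketch below works on plain lists of label names. How the labels are read from each recipe is not shown, the helper `compare_labels` is hypothetical, and treating "compatible" as "identical label sets" is an assumption rather than the platform's definition.

```python
def compare_labels(labels_a, labels_b):
    """Report which labels two datasets share and which appear in only one."""
    set_a, set_b = set(labels_a), set(labels_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
        # Assumption: identical label sets are treated as compatible.
        "compatible": set_a == set_b,
    }


# Hypothetical label names from two dataset recipes.
report = compare_labels(["car", "person", "bike"], ["car", "person", "truck"])
print(report)
```

A non-empty `only_in_a` or `only_in_b` list is a prompt to reconcile the recipes before merging.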

Ready to explore pipelines and automation? Let's move on to the next chapter! 🚀