Data Versioning: Managing Your Dataset Evolution 📚
Learn how to track, manage, and restore different versions of your datasets in Dataloop, which is key to maintaining data lineage and reproducibility.
Getting Started with Versioning 🌟
1. Getting Your Dataset
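# Import the SDK, get your project, then get your dataset
import dtlpy as dl
project = dl.projects.get(project_name='<project_name>')
dataset = project.datasets.get(dataset_id='<dataset_id>')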
2. Dataset Cloning
# Clone an entire dataset
dataset.clone(clone_name='dataset_v2',
              filters=None,
              with_items_annotations=True,
              with_metadata=True,
              with_task_annotations_status=True)
# Clone with filters
filters = dl.Filters()
filters.add(field='dir', values='/specific/folder')
dataset.clone(clone_name='filtered_dataset',
              filters=filters,
              with_items_annotations=True)
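The clone calls above can be wrapped in a small helper so that every clone gets a predictable name. This is a minimal sketch, not part of the Dataloop SDK: `clone_as_version` and the `<base_name>_v<N>` naming scheme are assumptions made for illustration, and the helper relies only on the `clone_name`, `filters`, and `with_items_annotations` parameters shown above. `FakeDataset` is a hypothetical stand-in so the sketch runs without a Dataloop connection.

```python
def clone_as_version(dataset, base_name, version, filters=None):
    """Clone `dataset` under the name '<base_name>_v<version>'.

    `dataset` is any object exposing the clone() call shown above;
    the naming scheme is an illustrative assumption.
    """
    if version < 1:
        raise ValueError("version must be >= 1")
    clone_name = f"{base_name}_v{version}"
    return dataset.clone(clone_name=clone_name,
                         filters=filters,
                         with_items_annotations=True)


class FakeDataset:
    """Hypothetical stand-in that records clone() arguments (no SDK needed)."""

    def __init__(self):
        self.calls = []

    def clone(self, **kwargs):
        self.calls.append(kwargs)
        return kwargs["clone_name"]


fake = FakeDataset()
result = clone_as_version(fake, base_name="dataset", version=2)
```

With a real dataset object, the same helper would be called as `clone_as_version(dataset, "dataset", 2)`, which would return whatever `dataset.clone` returns rather than the name string the stand-in produces.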
Dataset Management 📊
1. Listing Datasets
# List all datasets in the project
datasets = project.datasets.list()
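Once the datasets are listed, their names can be grouped to see which versions of each dataset exist. The sketch below works on plain name strings and assumes a `_v<N>` suffix convention (as in `dataset_v2`); both `group_versions` and the rule that an unsuffixed name counts as version 1 are illustrative assumptions, not SDK behavior.

```python
import re
from collections import defaultdict

# Matches names such as 'dataset_v2' (assumed convention).
VERSION_SUFFIX = re.compile(r"^(?P<base>.+)_v(?P<num>\d+)$")


def group_versions(names):
    """Map each base name to its sorted list of version numbers.

    Names without a '_v<N>' suffix are treated as version 1 of themselves.
    """
    families = defaultdict(list)
    for name in names:
        match = VERSION_SUFFIX.match(name)
        if match:
            families[match.group("base")].append(int(match.group("num")))
        else:
            families[name].append(1)
    return {base: sorted(nums) for base, nums in families.items()}


# In practice the names could come from: [d.name for d in project.datasets.list()]
names = ["dataset", "dataset_v2", "dataset_v3", "filtered_dataset"]
families = group_versions(names)
```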
2. Merging Datasets
# Merge two datasets
dataset_ids = ["dataset-1-id", "dataset-2-id"]
project_ids = ["project-1-id", "project-2-id"]
dataset_merge = dl.datasets.merge(
    merge_name="my_merged_dataset",
    project_ids=project_ids,
    dataset_ids=dataset_ids,
    with_items_annotations=True,
    with_metadata=False,
    with_task_annotations_status=False
)
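Before calling the merge, it can help to sanity-check the ID lists locally. The following is a rough sketch under stated assumptions: `check_merge_inputs` is a hypothetical helper, and its rules (at least two dataset IDs, no duplicates, no unfilled `<...>` placeholders) are illustrative choices rather than documented SDK requirements.

```python
def check_merge_inputs(dataset_ids, project_ids):
    """Return a list of problems found in the ID lists (empty list = OK).

    These checks are illustrative assumptions, not SDK requirements.
    """
    problems = []
    if len(dataset_ids) < 2:
        problems.append("need at least two dataset IDs to merge")
    if len(set(dataset_ids)) != len(dataset_ids):
        problems.append("duplicate dataset IDs")
    for label, ids in (("dataset", dataset_ids), ("project", project_ids)):
        for value in ids:
            if not isinstance(value, str) or not value.strip():
                problems.append(f"empty or non-string {label} ID: {value!r}")
            elif value.startswith("<") and value.endswith(">"):
                problems.append(f"unfilled placeholder {label} ID: {value!r}")
    return problems


ok = check_merge_inputs(["dataset-1-id", "dataset-2-id"],
                        ["project-1-id", "project-2-id"])
bad = check_merge_inputs(["dataset-1-id", "dataset-1-id"], ["<project_id>"])
```

An empty result means the lists passed these local checks; it does not confirm that the IDs exist on the platform.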
Best Practices 👑
1. Dataset Organization
- Use clear naming conventions for cloned datasets
- Document the purpose of each dataset version
- Keep track of dataset lineage
- Validate merged datasets before using them
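One lightweight way to document purpose and lineage is a small log kept alongside the project. The sketch below is one possible shape, not a Dataloop feature: the `DatasetVersionRecord` schema, its field names, and `lineage_chain` are all hypothetical, and the example entries are made up for illustration.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class DatasetVersionRecord:
    """One entry in a hand-maintained lineage log (hypothetical schema)."""
    name: str                      # e.g. the clone_name passed to dataset.clone
    purpose: str                   # why this version exists
    parent: Optional[str] = None   # name of the dataset it was cloned from
    merged_from: List[str] = field(default_factory=list)  # sources, if merged


def lineage_chain(records, name):
    """Walk parent links from `name` back to the root dataset."""
    by_name = {r.name: r for r in records}
    chain = []
    seen = set()  # guards against accidental cycles in the log
    while name is not None and name in by_name and name not in seen:
        seen.add(name)
        chain.append(name)
        name = by_name[name].parent
    return chain


records = [
    DatasetVersionRecord("dataset", "original upload"),
    DatasetVersionRecord("dataset_v2", "snapshot before major changes", parent="dataset"),
    DatasetVersionRecord("filtered_dataset", "subset of /specific/folder", parent="dataset"),
]
chain = lineage_chain(records, "dataset_v2")
log_text = json.dumps([asdict(r) for r in records], indent=2)
```

The JSON text in `log_text` can be committed to version control or stored next to the project so that each clone's purpose and parent stay on record.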
2. Version Management
- Clone datasets before making major changes
- Use filters to create specific subset versions
- Maintain documentation of version differences
- Test merged datasets thoroughly
3. Error Prevention
# Confirm the dataset exists before operating on it
try:
    dataset = project.datasets.get(dataset_id='<dataset_id>')
    # Proceed with operations
except dl.exceptions.NotFound:
    print("Dataset not found!")
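The same try/except pattern can be factored into a reusable helper that returns `None` when a lookup fails. This is a minimal sketch: `get_or_none` is a hypothetical name, and `FakeNotFound` and `fake_get` are stand-ins so the example runs without the SDK. Only the not-found error is caught; other failures (for example permissions or network errors) still propagate.

```python
def get_or_none(getter, not_found_error, **lookup):
    """Call `getter(**lookup)`; return None if it raises `not_found_error`."""
    try:
        return getter(**lookup)
    except not_found_error:
        return None


# Stand-ins for the SDK lookup and its exception (hypothetical).
class FakeNotFound(Exception):
    pass


def fake_get(dataset_id):
    known = {"abc123": "dataset"}
    if dataset_id not in known:
        raise FakeNotFound(dataset_id)
    return known[dataset_id]


found = get_or_none(fake_get, FakeNotFound, dataset_id="abc123")
missing = get_or_none(fake_get, FakeNotFound, dataset_id="nope")
```

With the SDK, the equivalent call would be `get_or_none(project.datasets.get, dl.exceptions.NotFound, dataset_id='<dataset_id>')`.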
Pro Tips 💡
Clone with Purpose
- Always specify meaningful clone names
- Include relevant metadata and annotations
- Document the reason for cloning
Merge with Care
- Ensure datasets have compatible recipes
- Verify project and dataset IDs
- Test merged dataset integrity
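One simple integrity test is to confirm that every item path from the source datasets also appears in the merged dataset. The sketch below operates on plain lists of path strings collected beforehand. `missing_from_merge` is a hypothetical helper, and it assumes that merging preserves each item's path, which should be verified for your own data.

```python
def missing_from_merge(source_paths, merged_paths):
    """Return source item paths that do not appear in the merged dataset.

    `source_paths` maps a source dataset name to its item paths.
    Assumes (for illustration) that merging keeps each item's path.
    """
    merged = set(merged_paths)
    report = {}
    for dataset_name, paths in source_paths.items():
        lost = sorted(set(paths) - merged)
        if lost:
            report[dataset_name] = lost
    return report


# Made-up paths for illustration.
sources = {
    "dataset-1": ["/a/1.jpg", "/a/2.jpg"],
    "dataset-2": ["/b/1.jpg", "/a/2.jpg"],
}
merged = ["/a/1.jpg", "/a/2.jpg"]
report = missing_from_merge(sources, merged)
```

An empty report means no source path is missing from the merged list; it says nothing about annotations or metadata, which need their own checks.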
Version Control
- Keep track of dataset versions
- Document changes between versions
- Maintain clear version naming conventions
Ready to explore pipelines and automation? Let's move on to the next chapter! 🚀