Feature Vectors and Feature Sets

Feature Vectors

Feature Vectors are numerical representations of entities, such as items or annotations. These vectors can vary in length and capture essential characteristics of the entities they represent.

Key Concepts

  • Representation: Feature vectors can represent various entities, including items and annotations.
  • Flexibility: The length of a feature vector can vary based on the specific use case or model.
  • Origin: Feature vectors can be derived from raw data or generated by machine learning models.

Examples

  1. Image Embedding - a pre-trained model like ResNet can generate a fixed-length feature vector for an image (2048 dimensions for ResNet-50, for example), capturing its visual characteristics.
  2. Natural Language Processing (NLP) - language models like BERT and RoBERTa use embeddings to represent textual elements (words, phrases, sentences) as dense vectors. These embeddings encapsulate semantic meaning and linguistic relationships.
  3. Customer Segmentation - in marketing and business analytics, customer data can be represented as feature vectors. For example, a customer's feature vector might include:
Age: 35
Annual Income: 75000
Years as Customer: 5
Number of Purchases: 20
Average Purchase Value: 150

This feature vector [35, 75000, 5, 20, 150] represents key characteristics of the customer, which can be used for segmentation, personalized marketing, or predictive modeling without involving neural networks.
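As a rough illustration of how such a vector might be built and compared, here is a minimal sketch in plain Python. The customer records, the field names, and the min-max scaling step are assumptions made for this example, not part of the SDK. Scaling is included because raw fields such as income would otherwise dominate any distance calculation.

```python
import math

# Hypothetical customer records; "alice" uses the values from the example above.
FIELDS = ["age", "annual_income", "years_as_customer",
          "num_purchases", "avg_purchase_value"]
customers = {
    "alice": {"age": 35, "annual_income": 75000, "years_as_customer": 5,
              "num_purchases": 20, "avg_purchase_value": 150},
    "bob": {"age": 36, "annual_income": 72000, "years_as_customer": 6,
            "num_purchases": 22, "avg_purchase_value": 140},
    "carol": {"age": 62, "annual_income": 30000, "years_as_customer": 1,
              "num_purchases": 2, "avg_purchase_value": 40},
}


def to_vector(record):
    """Turn a record into a fixed-order list of floats."""
    return [float(record[field]) for field in FIELDS]


def min_max_scale(vectors):
    """Rescale every column to [0, 1] so no single field dominates distances."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]


names = list(customers)
raw = [to_vector(customers[name]) for name in names]
scaled = min_max_scale(raw)

# Euclidean distance between scaled vectors (math.dist needs Python 3.8+)
d_alice_bob = math.dist(scaled[0], scaled[1])
d_alice_carol = math.dist(scaled[0], scaled[2])
print(f"alice-bob: {d_alice_bob:.3f}, alice-carol: {d_alice_carol:.3f}")
```

In this toy data, alice and bob end up much closer to each other than either is to carol, which is the kind of signal a segmentation step would use.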

Feature Set

Feature vectors are grouped under a Feature Set. A Feature Set holds the metadata that describes its feature vectors, such as their size and type.
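To make the role of this metadata concrete, the sketch below models a feature set as a small record and checks a vector against its declared size. `FeatureSetSpec` and `accepts` are hypothetical names invented for this illustration; they are not SDK classes, and the platform's own validation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSetSpec:
    """Hypothetical stand-in for the metadata a feature set carries."""
    name: str
    size: int          # expected length of every vector in the set
    set_type: str      # free-form label, e.g. 'embeddings'
    entity_type: str   # what the vectors describe, e.g. 'item'

    def accepts(self, vector) -> bool:
        """True if the vector is numeric and matches the declared size."""
        return (len(vector) == self.size
                and all(isinstance(v, (int, float)) for v in vector))


spec = FeatureSetSpec(name="customer-profile", size=5,
                      set_type="tabular", entity_type="item")
print(spec.accepts([35, 75000, 5, 20, 150]))  # True
print(spec.accepts([35, 75000]))              # False
```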

Create, Get, Delete

First, create a feature set in a project using the SDK:

import dtlpy as dl

project = dl.projects.get(project_id='60c8561b374417847ff59fba')
feature_set = project.feature_sets.create(name='rag_embeddings_BAAI_bge-large-en',
                                          size=1024,
                                          set_type='embeddings',
                                          entity_type=dl.FeatureEntityType.ITEM)

A Dataloop model entity can be connected directly to the feature set:

feature_set = project.feature_sets.create(name='text-embedding-3',
                                          set_type='embeddings',
                                          entity_type=dl.FeatureEntityType.ITEM,
                                          model_id=project.models.get(model_name='my-embedder').id,
                                          size=1536)

Now we can add feature vectors for items, here using a HuggingFace embedding model:

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
import torch
import dtlpy as dl
import tqdm

model_name = "BAAI/bge-large-en"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
embeddings = HuggingFaceBgeEmbeddings(cache_folder='.cache',
                                      model_name=model_name,
                                      model_kwargs=model_kwargs,
                                      encode_kwargs=encode_kwargs
                                      )

dataset = dl.datasets.get(dataset_id='<dataset id>')
items = dataset.items.list()
pbar = tqdm.tqdm(total=items.items_count)

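for item in items.all():
    # Download the item into memory and decode it as UTF-8 text
    text = item.download(save_locally=False).read().decode('utf8')
    # Keep the vector under its own name so the `embeddings` model is not overwritten
    vector = embeddings.embed_documents([text])[0]
    feature = feature_set.features.create(value=vector,
                                          entity=item)
    pbar.update()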
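The `normalize_embeddings` option in `encode_kwargs` asks the encoder to return unit-length vectors. As a minimal sketch of what that L2 normalization does (the helper name `l2_normalize` is made up for this example), in plain Python:

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

For unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so sorting by Euclidean distance and sorting by cosine similarity give the same neighbour order.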

Getting and deleting a feature set works the same way as for other SDK entities:

import dtlpy as dl

feature_set = dl.feature_sets.get(feature_set_name='rag_embeddings_BAAI_bge-large-en')
feature_set.delete()

Getting the feature vectors of a single item:

item = dl.items.get(item_id='618126e38f1fa2b52ae96d05')
items_features = list(item.features.list().all())
print(f'This item has {len(items_features)} feature vectors')

Querying and Nearest Neighbours

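We can query items by feature vector distance. The filter below joins items to their feature vectors in a given feature set and sorts the results by ascending Euclidean distance from the `input` vector, which should have the same length as the vectors in that set: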

custom_filter = {
    'filter': {'$and': [{'hidden': False}, {'type': 'file'}]},
    'page': 0,
    'pageSize': 1000,
    'resource': 'items',
    'join': {
        'on': {
            'resource': 'feature_vectors',
            'local': 'entityId',
            'forigen': 'id'
        },
        'filter': {
            'value': {
                '$euclid': {
                    'input': [5, 5],
                    '$euclidSort': {'eu_dist': 'ascending'}
                }
            },
            'featureSetId': feature_set.id
        },
    }
}
filters = dl.Filters(custom_filter=custom_filter,
                     resource=dl.FiltersResource.ITEM)

res = dataset.items.list(filters=filters)
print(res.items_count)

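# Print the feature vector of the first few results
for i, f in enumerate(res.items):
    # Fetch the feature vectors in this set that belong to the current item
    filt = dl.Filters(resource=dl.FiltersResource.FEATURE, field='entityId', values=f.id)
    p = list(feature_set.features.list(filters=filt).all())
    print(p[0].value)
    if i == 10:
        break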
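Conceptually, the query above ranks items by the Euclidean distance between the `input` vector and each stored vector. The platform does this on the server; the sketch below reproduces the same ranking locally with a brute-force search, which can be handy for sanity-checking results on a small sample. The item ids and vectors are hypothetical, and `nearest_neighbours` is a helper written for this example, not an SDK function.

```python
import math


def nearest_neighbours(query, vectors_by_id, k=3):
    """Return the k (entity_id, distance) pairs closest to `query`, nearest first."""
    distances = [(math.dist(query, vec), entity_id)
                 for entity_id, vec in vectors_by_id.items()]
    distances.sort()  # ascending distance, ties broken by id
    return [(entity_id, dist) for dist, entity_id in distances[:k]]


# Hypothetical 2-dimensional vectors keyed by item id
stored = {
    "item-a": [5.0, 5.0],
    "item-b": [6.0, 5.0],
    "item-c": [0.0, 0.0],
    "item-d": [5.0, 8.0],
}

result = nearest_neighbours([5, 5], stored, k=3)
print(result)  # [('item-a', 0.0), ('item-b', 1.0), ('item-d', 3.0)]
```

Brute force is linear in the number of stored vectors, so it only suits small samples; for full datasets the server-side query is the practical route.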