A fast nomic embedding generator

By: Dan
Categories: Tech
Languages: python php

Introduction

In this article I explain how to create a simple Python web service to calculate text embeddings using the nomic-embed-text model, and how to consume it from PHP code.

The header image is a representation of the nomic vector space.

The full code for this post, and more, can be found here.

A CSV file with the text of 25,000 Wikipedia articles and their nomic text embeddings can be downloaded here (196 MB).

Motivation

I wanted to play with the Simple English Wikipedia dataset hosted by OpenAI (~700 MB zipped, 1.7 GB CSV file) in a PHP project using LLPhant. The dataset contains 25,000 Wikipedia articles with their text embeddings. However, the embeddings in the file were calculated using OpenAI's embedding models, and for my tests I wanted to use only self-hosted LLMs. So I decided to recalculate the embeddings using the nomic-embed-text encoder, as it is distributed under the Apache License.

My first approach was to use Ollama or LocalAI to run the model. However, the performance was quite poor: overnight, my computer was only able to generate embeddings for a few hundred articles.

I needed a faster solution...

What's the point of these embeddings?

When you have the embeddings of all the Wikipedia articles, it is possible to perform a similarity search on a query.

To do this, we embed the text of the query into a vector, then conduct a similarity search with the vectors in the database, and retrieve a list of documents with similar content. Using another LLM to rerank the documents, we can filter out the ones that do not correspond to the query's semantics.
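To make the similarity step concrete, here is a minimal sketch using cosine similarity and made-up three-dimensional vectors (real nomic embeddings have far more dimensions, and a vector database would normally do this search for you). The article titles and numbers are purely illustrative:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
query_vector = np.array([0.1, 0.3, 0.5])
articles = {
    'Python (programming language)': np.array([0.1, 0.2, 0.6]),
    'Python (snake)': np.array([0.9, 0.1, 0.0]),
}

# Rank the articles by similarity to the query, most similar first
for title, vector in sorted(articles.items(),
                            key=lambda item: cosine_similarity(query_vector, item[1]),
                            reverse=True):
    print(f'{title}: {cosine_similarity(query_vector, vector):.3f}')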

The final list of documents can be fed into a question-answering LLM to provide context. This is the principle of Retrieval-Augmented Generation (RAG).

In the screenshot below, note how the article about the python snake is retrieved but then completely filtered out by the reranking.

Screenshot of the app running

Calculate the embedding

With the help of the Sentence Transformers Python library, generating text embeddings is very simple.

from sentence_transformers import SentenceTransformer

embedding_generator = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
embeddings = embedding_generator.encode(['This is the text to embed'])
print(embeddings[0].tolist())

The first time the code runs, it will download the nomic-embed-text model. The model is about 500 MB, so the download takes some time and requires disk space.

By default, the model tries to use the GPU and falls back to the CPU if no GPU is available. On my machine, I encountered "heating problems" while using the GPU, so it can be useful to force the model to run on the CPU:

embedding_generator = SentenceTransformer(..., device='cpu')

A simple Python API

Flask allows you to quickly create an API without much boilerplate code.

An API that responds on the /hello route and says... well... hello can be implemented as:

#!/usr/bin/env python
from flask import Flask

api = Flask(__name__)

@api.route('/hello', methods=['GET'])
def hello():
    return 'Hello API', 200

if __name__ == "__main__":
    api.run()

That's everything I needed.
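As a quick sanity check, the route can be exercised without even starting a server by using Flask's built-in test client (assuming the api object from the snippet above):

# Quick check of the /hello route using Flask's built-in test client
with api.test_client() as client:
    response = client.get('/hello')
    print(response.status_code)                # 200
    print(response.get_data(as_text=True))     # Hello API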

Request Validation

Even though it’s not strictly required for such a small project, I wanted to validate the incoming requests. I used the marshmallow library, which makes it easy to validate a JSON payload against a schema.

The API expects a request in the form { "document": "the data to embed" }, so the code is simply:

from flask import request
from marshmallow import Schema, fields, ValidationError

class EmbeddingSchema(Schema):
    document = fields.Str(required=True)

@api.route('/embedding', methods=['POST'])
def embed():
    body = request.get_json(force=True)
    schema = EmbeddingSchema()

    try:
        result = schema.load(body)
    except ValidationError as err:
        return 'Invalid request', 400
    
    document = result['document']
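To see what marshmallow actually does with a payload, the schema can also be exercised on its own; here is a small standalone sketch:

from marshmallow import Schema, fields, ValidationError

class EmbeddingSchema(Schema):
    document = fields.Str(required=True)

schema = EmbeddingSchema()

# A valid payload is returned as a plain dict
print(schema.load({'document': 'some text to embed'}))

# A payload without the required field raises a ValidationError
try:
    schema.load({})
except ValidationError as err:
    print(err.messages)  # {'document': ['Missing data for required field.']}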

Put it all together

The final API looks like this.

The embedding_generator service is created once at startup (rather than in the route) because it takes time to load and does not need to be recreated for every request. Once created, it remains in memory for the duration of the API's runtime.

#!/usr/bin/env python
import json
from typing import Any

from marshmallow import ValidationError, Schema, fields
from sentence_transformers import SentenceTransformer

from flask import Flask, request


class EmbeddingSchema(Schema):
    document = fields.Str(required=True)

embedding_generator = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

api = Flask(__name__)

def generate_error(error: str, payload: Any, status: int = 400):
    return json.dumps({
        'error': error,
        'payload': payload,
    }), status

def generate_response(data: Any, status: int = 200):
    return json.dumps({
        'data': data,
    }), status

@api.route('/embedding', methods=['POST'])
def embed():
    body = request.get_json(force=True)
    schema = EmbeddingSchema()

    try:
        result = schema.load(body)
    except ValidationError as err:
        return generate_error('Validation failed', err.messages)

    query = result['document']

    if query == '':
        return generate_error('No query given.')

    embeddings = embedding_generator.encode([query])
    return generate_response(embeddings[0].tolist())


if __name__ == "__main__":
    api.run()
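
Once the service is started (for example with python api.py), a quick smoke test from Python might look like this; it assumes the API is listening on Flask's default port 5000:

import requests

# Send a document to the local embedding API and read the vector back
response = requests.post(
    'http://127.0.0.1:5000/embedding',
    json={'document': 'This is the text to embed'},
)

payload = response.json()
print(len(payload['data']))   # dimensionality of the embedding vector
print(payload['data'][:5])    # first few values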

A little something extra

Everything works pretty straightforwardly with Python. However, to add a little extra, we can build an executable from the Python code with the help of Cython.

cython3 --embed -o api.c api.py
python_include=$(python -c "import sysconfig; print(sysconfig.get_path('include'))")
gcc -Wall api.c -o api -L$(python-config --prefix)/lib64 $(python-config --embed --ldflags) -I"$python_include"
rm api.c

Then we can simply copy the api executable around.

Call the API from PHP

With the help of Guzzle, calling the API from PHP is easy.

<?php declare(strict_types=1);

use GuzzleHttp\Client;

$text = "this is the text to be embedded";

$client = new Client(['base_uri' => 'http://127.0.0.1:5000']);

$response = $client->post('/embedding', [
    'json' => ['document' => $text],
    'headers' => [
        'Content-Type' => 'application/json',
    ],
]);

$body = $response->getBody()->getContents();

/** @var array{error:string,data:float[]} */
$result = json_decode($body, true);

// The embedding is an array of floats, so print it as a comma-separated list
echo implode(', ', $result['data']);

Conclusion

With this small API hosted on my computer and the model running on the CPU, I was able to embed the 25,000 Wikipedia articles in about 16 hours. In comparison, it would have taken an eternity with the initial approach.