Project description

This is an example application for searching a large text collection using semantic search combined with structured field search. The text of each source article is processed by an LLM, and the structured results are saved to a database. Embedding vectors are then computed from this derived data with an embedding model. All of this is a preparatory stage that is completed before any search takes place.

Once prepared, the collection is available for local search, both by semantic proximity (vector similarity) and by keywords and structured fields.

What this application does

This is a semantic and structured text search tool powered by EmbeddingGemma 300M (via wllama) and DuckDB-WASM. It runs entirely in your browser; no server is required.

Architecture
  • Embedding model: EmbeddingGemma 300M (Q4_0) via wllama — runs client-side.
  • Vector search: Cosine similarity over flat parquet partitions (text_embeddings view).
  • Structured data: 30-column indexed_text view loaded from a single parquet file, including nested STRUCT, MAP, and ARRAY types.
  • Filters: DuckDB SQL WHERE-clause filters applied directly to structured columns.
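
As a rough illustration of the layout described above, the two views could be defined over parquet files along these lines. This is only a sketch under assumptions: the file names and the `buildViewSql` helper are hypothetical, and the application's actual view definitions are not shown in this description. The only DuckDB feature relied on is `read_parquet` accepting a list of files.

```typescript
// Quote a value as a SQL string literal (single quotes doubled).
function sqlString(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Hypothetical helper: builds a CREATE VIEW statement over one or more parquet files.
function buildViewSql(viewName: string, files: string[]): string {
  if (files.length === 0) {
    throw new Error("at least one parquet file is required");
  }
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(viewName)) {
    throw new Error(`invalid view name: ${viewName}`);
  }
  const fileList = files.map(sqlString).join(", ");
  return `CREATE OR REPLACE VIEW ${viewName} AS SELECT * FROM read_parquet([${fileList}])`;
}

// Assumed file names, for illustration only.
const embeddingsViewSql = buildViewSql("text_embeddings", [
  "embeddings_0.parquet",
  "embeddings_1.parquet",
]);
const indexedTextViewSql = buildViewSql("indexed_text", ["indexed_text.parquet"]);
```

Keeping the embeddings in several flat partitions and the structured data in a single file matches the split described in the list above; the statements themselves would be run once through the DuckDB-WASM connection at startup.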

Semantic search workflow
  1. Enter a free-text query.
  2. The embedding model, running in the browser via wllama, computes a 768-dimensional, L2-normalized embedding vector for the query.
  3. DuckDB-WASM compares this vector against all stored embeddings using list_cosine_similarity.
  4. Results are joined with indexed_text to display original text and metadata.
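
The workflow above can be sketched as follows. The two numeric helpers show what L2 normalization and cosine similarity compute; `buildSearchSql` shows one possible shape of the query. Note the assumptions: the column names `embedding` and `task_id` on `text_embeddings`, the join key, and the helper names are all hypothetical, since this description does not list the columns of that view. `list_cosine_similarity` is the DuckDB function named in step 3.

```typescript
// Scale a vector so that its Euclidean (L2) length is 1.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
// For L2-normalized inputs this reduces to the dot product.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Build the search query. ASSUMPTION: text_embeddings has `embedding` and
// `task_id` columns; the real view may use different names.
function buildSearchSql(queryVector: number[], limit: number): string {
  if (limit <= 0 || Math.floor(limit) !== limit) {
    throw new Error("limit must be a positive integer");
  }
  if (queryVector.length === 0 || !queryVector.every((x) => isFinite(x))) {
    throw new Error("query vector must be non-empty and finite");
  }
  const literal = `[${queryVector.join(", ")}]::FLOAT[]`;
  return [
    "SELECT t.task_id, t.url, t.summary,",
    `       list_cosine_similarity(e.embedding, ${literal}) AS similarity`,
    "FROM text_embeddings AS e",
    "JOIN indexed_text AS t ON t.task_id = e.task_id",
    "ORDER BY similarity DESC",
    `LIMIT ${limit}`,
  ].join("\n");
}
```

In the application the query vector would come from the embedding model and have 768 components; inlining it as a list literal keeps the sketch short, though a prepared statement with a bound parameter may be preferable in practice.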

Available embedded fields

Each record can produce embeddings for multiple sub-fields:

Name            Source
summary         s.summary
meaning         s.meaning
nsfwReason      s.nsfw_reasons[]
insight         s.key_insights[]
contradictory   s.contradictory_statements[]
theme           s.themes[]
topic           s.topics[]['topic_description']
audience        s.target_audience['audience_description']
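
To make the mapping above concrete, here is a minimal sketch of how one record could be expanded into the individual texts that get embedded. The `EmbeddableRecord` and `EmbeddingInput` types and the `collectEmbeddingInputs` function are hypothetical names introduced for illustration; the sub-field names are read from the Name column in singular form, and skipping empty or missing values is an assumption rather than documented behavior.

```typescript
// Subset of an indexed_text row that feeds the embedding model.
interface EmbeddableRecord {
  summary: string | null;
  meaning: string | null;
  nsfw_reasons: string[] | null;
  key_insights: string[] | null;
  contradictory_statements: string[] | null;
  themes: string[] | null;
  topics: { name: string; topic_description: string }[] | null;
  target_audience: { level: string; audience_description: string } | null;
}

interface EmbeddingInput {
  name: string; // sub-field name from the table above
  text: string; // text that would be passed to the embedding model
}

function collectEmbeddingInputs(s: EmbeddableRecord): EmbeddingInput[] {
  const inputs: EmbeddingInput[] = [];
  // Add one input, skipping null or blank text (an assumption of this sketch).
  const add = (name: string, text: string | null | undefined): void => {
    if (text && text.trim() !== "") inputs.push({ name: name, text: text });
  };
  // Array-valued sources ("[]" in the table) yield one input per element.
  const addAll = (name: string, texts: string[] | null): void => {
    (texts || []).forEach((text) => add(name, text));
  };

  add("summary", s.summary);
  add("meaning", s.meaning);
  addAll("nsfwReason", s.nsfw_reasons);
  addAll("insight", s.key_insights);
  addAll("contradictory", s.contradictory_statements);
  addAll("theme", s.themes);
  (s.topics || []).forEach((topic) => add("topic", topic.topic_description));
  add("audience", s.target_audience ? s.target_audience.audience_description : null);
  return inputs;
}
```

A record with two key insights and one theme would therefore contribute three separate vectors for those sub-fields, in addition to any produced by its other fields.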

Filters

Filters apply a DuckDB WHERE clause to the indexed_text view.

  • Select — single-value (Primary Genre, Sentiment Polarity, Completeness, Target Audience, Demagoguery Severity).
  • Chips — multi-value (Secondary Genres, Keywords, Demagoguery Techniques).
  • Tristate — Yes / No / All (Is NSFW, Has Advertising).

Multiple filters are AND-ed together in the SQL.
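
A minimal sketch of how such a WHERE clause could be assembled is shown below. The `FilterState` shape and the `buildWhereClause` function are hypothetical and cover only some of the filters listed above. Two further assumptions are worth flagging: chips are treated as "match any selected value" via DuckDB's `list_has_any`, and an unset select or a tristate of "all" adds no condition. The column paths come from the indexed_text schema.

```typescript
type Tristate = "yes" | "no" | "all";

// Hypothetical UI state for a subset of the filters.
interface FilterState {
  primaryGenre?: string; // Select
  sentimentPolarity?: string; // Select
  secondaryGenres?: string[]; // Chips
  keywords?: string[]; // Chips
  isNsfw?: Tristate; // Tristate
  hasAdvertising?: Tristate; // Tristate
}

// Quote a value as a SQL string literal (single quotes doubled).
function sqlString(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

function sqlStringList(values: string[]): string {
  return "[" + values.map(sqlString).join(", ") + "]";
}

function buildWhereClause(f: FilterState): string {
  const conditions: string[] = [];

  // Select filters: equality on a STRUCT field.
  if (f.primaryGenre) {
    conditions.push(`genre.primary_genre = ${sqlString(f.primaryGenre)}`);
  }
  if (f.sentimentPolarity) {
    conditions.push(`sentiment.polarity = ${sqlString(f.sentimentPolarity)}`);
  }

  // Chips filters: ASSUMED to match rows containing any selected value.
  if (f.secondaryGenres && f.secondaryGenres.length > 0) {
    conditions.push(`list_has_any(genre.secondary_genres, ${sqlStringList(f.secondaryGenres)})`);
  }
  if (f.keywords && f.keywords.length > 0) {
    conditions.push(`list_has_any(keywords, ${sqlStringList(f.keywords)})`);
  }

  // Tristate filters: "all" (or unset) adds no condition.
  if (f.isNsfw === "yes") conditions.push("is_not_safe_for_work = TRUE");
  if (f.isNsfw === "no") conditions.push("is_not_safe_for_work = FALSE");
  if (f.hasAdvertising === "yes") conditions.push("presence_of_advertising = TRUE");
  if (f.hasAdvertising === "no") conditions.push("presence_of_advertising = FALSE");

  // All active conditions are AND-ed together.
  return conditions.length > 0 ? "WHERE " + conditions.join(" AND ") : "";
}
```

For example, selecting the genre "news", the keyword chips "ai" and "privacy", and NSFW set to No would yield `WHERE genre.primary_genre = 'news' AND list_has_any(keywords, ['ai', 'privacy']) AND is_not_safe_for_work = FALSE`. If chips were instead meant to require every selected value, `list_has_all` would be the corresponding DuckDB function.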

Indexed text schema

The indexed_text view exposes columns over a parquet file:

task_id                  BIGINT
url                      VARCHAR
title                    VARCHAR
summary                  VARCHAR
meaning                  VARCHAR
target_audience          STRUCT("level" VARCHAR, audience_description VARCHAR)
genre                    STRUCT(primary_genre VARCHAR, secondary_genres VARCHAR[])
topics                   STRUCT("name" VARCHAR, topic_description VARCHAR)[]
themes                   VARCHAR[]
key_insights             VARCHAR[]
is_not_safe_for_work     BOOLEAN
keywords                 VARCHAR[]
keyword_taxonomy         VARCHAR[]
nsfw_reasons             VARCHAR[]
metadata                 MAP(VARCHAR, VARCHAR)
user_rating              INTEGER
metadata_create_time     TIMESTAMP
sentiment                STRUCT(polarity VARCHAR, confidence DOUBLE, tone VARCHAR, explanation VARCHAR)
completeness             STRUCT(score DOUBLE, "level" VARCHAR, missing_elements VARCHAR[])
contradictory_statements VARCHAR[]
demagoguery_analysis     STRUCT(detected_techniques_used_in_this_text VARCHAR[], severity VARCHAR, explanation VARCHAR)
presence_of_advertising  BOOLEAN
advertising_details      STRUCT(advertising_items VARCHAR[], confidence DOUBLE, explanation VARCHAR)