Haversine KNN Candidate Generation

Generate geographically proximate candidate pairs for entity matching using haversine-distance KNN, with optional partitioning by country to improve recall and

triggers: haversine knn, geo candidate generation, entity matching, partition by country, nearest neighbors, geospatial deduplication

What it does

This skill provides a practical implementation and guidance for generating candidate pairs for entity matching using KNN with the haversine distance. It's designed for location-based entities (POIs, stores, addresses) where comparing all N^2 pairs is infeasible. The approach computes K nearest geographic neighbors per record (optionally within country/region partitions), returns neighbor ranks and geographic distances (km), and produces a candidate DataFrame ready for downstream scoring/classification.
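For reference, the haversine distance behind the "geographic distances (km)" mentioned above can be sketched in plain Python. This is a minimal illustration, not the skill's own script: the function name `haversine_km` and the mean Earth radius constant are assumptions made here.

```python
import math

# Mean Earth radius in km (an assumed constant; the skill may use a rounded 6371).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Computing this for every pair is the N^2 comparison the skill avoids; a KNN index returns only the K closest records per point.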

When to use it

Use this skill when you need an efficient candidate generation step for deduplication or record linkage tasks involving geospatial data. It's ideal for datasets with latitude/longitude columns where you want to limit comparisons to nearby records, or when you want to combine geo-based recall with text-based matching to improve overall candidate quality. Partitioning by country is recommended for global datasets to avoid irrelevant cross-region matches.

What's included

Scripts: Python example code showing conversion to radians, per-partition KNN using scikit-learn NearestNeighbors (haversine metric), and assembling candidate pairs with distance and neighbor rank.
References: Linked Kaggle examples for real-world baselines.
Instructions: Step-by-step workflow (convert lat/lon to radians, partition, fit KNN, extract neighbors, build candidate DataFrame) and configuration guidance (n_neighbors typical range, distance unit conversion).
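The workflow listed above (convert to radians, partition, fit KNN, extract neighbors, build the candidate DataFrame) might look roughly like the following sketch. It is not the skill's bundled script: the function name `haversine_knn_candidates`, the column names (`id`, `latitude`, `longitude`), the default `n_neighbors=10`, and the Earth radius constant are all assumptions for illustration, and record ids are assumed to be unique.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius, used to convert radians to km


def haversine_knn_candidates(df, n_neighbors=10, partition_col=None,
                             id_col="id", lat_col="latitude", lon_col="longitude"):
    """Return one row per (record, neighbor) pair with distance in km and neighbor rank."""
    # Without a partition column, treat the whole frame as a single partition.
    groups = df.groupby(partition_col, sort=False) if partition_col else [(None, df)]
    parts = []
    for _, part in groups:
        if len(part) < 2:
            continue  # a lone record has no neighbor inside its partition
        ids = part[id_col].to_numpy()
        # scikit-learn's haversine metric expects [lat, lon] columns in radians.
        coords = np.radians(part[[lat_col, lon_col]].to_numpy(dtype=float))
        # Ask for one extra neighbor because each point normally finds itself first.
        k = min(n_neighbors + 1, len(part))
        knn = NearestNeighbors(n_neighbors=k, metric="haversine", algorithm="ball_tree")
        knn.fit(coords)
        dist, idx = knn.kneighbors(coords)  # rows are sorted by increasing distance
        pairs = pd.DataFrame({
            "id": np.repeat(ids, k),
            "match_id": ids[idx.ravel()],
            "distance_km": dist.ravel() * EARTH_RADIUS_KM,  # radians -> km
        })
        # Drop self-pairs by id rather than by position, in case of duplicate coordinates.
        parts.append(pairs[pairs["id"] != pairs["match_id"]])
    if not parts:
        return pd.DataFrame(columns=["id", "match_id", "distance_km", "rank"])
    out = pd.concat(parts, ignore_index=True)
    # Rank 1 is the closest remaining neighbor of each record.
    out["rank"] = out.groupby("id").cumcount() + 1
    return out[out["rank"] <= n_neighbors].reset_index(drop=True)
```

Calling `haversine_knn_candidates(df, n_neighbors=10, partition_col="country")` restricts neighbors to the same country (assuming a `country` column exists); omitting `partition_col` searches the whole dataset. The pairs are directed, so (a, b) and (b, a) can both appear, which a downstream step may want to deduplicate.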

Compatible agents

Best used with Python-capable data-science agents (code assistants such as Copilot, or agent environments that can run Python) that can execute the provided example code or adapt it for production pipelines.

Not yet audited

This skill has not been reviewed by our automated audit pipeline yet.

Information

Repository: ds-skills
Stars: 46
Installs: 0

More from ds-skills

Transitive Match Closure

Post-process entity-match predictions to enforce symmetry and transitivity (graph closure) to improve recall in deduplication and record linkage.
