
from ds-skills46
Generate geographically proximate candidate pairs for entity matching using haversine-distance KNN, with optional partitioning by country to improve recall and
This skill provides a practical implementation of, and guidance for, generating candidate pairs for entity matching using k-nearest-neighbor (KNN) search under the haversine (great-circle) distance. It is designed for location-based entities (POIs, stores, addresses) where comparing all N^2 pairs is infeasible. The approach finds the K nearest geographic neighbors of each record (optionally within country/region partitions), returns each neighbor's rank and geographic distance in km, and produces a candidate DataFrame ready for downstream scoring or classification.
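The approach described above can be sketched as follows. This is a minimal illustration, not the skill's own code: it assumes scikit-learn's `BallTree` with `metric="haversine"` as the neighbor index. The function name `knn_candidates`, the column names (`id`, `lat`, `lon`), and the output columns (`id_1`, `id_2`, `rank`, `dist_km`) are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, converts radians to km


def knn_candidates(df, k=3, id_col="id", lat_col="lat", lon_col="lon",
                   partition_col=None):
    """Return candidate pairs (id_1, id_2, rank, dist_km) from haversine KNN.

    Hypothetical helper: all names and defaults are assumptions for this sketch.
    """
    columns = ["id_1", "id_2", "rank", "dist_km"]
    if partition_col is None:
        groups = [(None, df)]
    else:
        # Note: rows with a missing partition value are dropped by groupby.
        groups = df.groupby(partition_col, sort=False)

    frames = []
    for _, part in groups:
        n = len(part)
        if n < 2:
            continue  # a single record has no neighbors inside its partition

        # BallTree's haversine metric expects [lat, lon] in radians.
        coords = np.radians(part[[lat_col, lon_col]].to_numpy(dtype=float))
        tree = BallTree(coords, metric="haversine")

        # Ask for one extra neighbor because each point matches itself.
        k_eff = min(k + 1, n)
        dist, idx = tree.query(coords, k=k_eff)

        ids = part[id_col].to_numpy()
        src_pos = np.repeat(np.arange(n), k_eff)
        dst_pos = idx.ravel()
        pairs = pd.DataFrame({
            "src_pos": src_pos,
            "id_1": ids[src_pos],
            "id_2": ids[dst_pos],
            "dist_km": dist.ravel() * EARTH_RADIUS_KM,
        })

        # Drop self-matches, then rank the remaining neighbors by distance
        # (query results are already sorted nearest-first per source row).
        pairs = pairs.loc[src_pos != dst_pos].copy()
        pairs["rank"] = pairs.groupby("src_pos").cumcount() + 1
        pairs = pairs[pairs["rank"] <= k]
        frames.append(pairs[columns])

    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

Calling `knn_candidates(df, k=10, partition_col="country")` on a DataFrame with `id`, `lat`, `lon`, and `country` columns would yield up to 10 directed pairs per record, restricted to records in the same country. Pairs are directed, so (A, B) and (B, A) can both appear; deduplicating them before scoring is left to the downstream step.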
Use this skill when you need an efficient candidate generation step for deduplication or record linkage tasks involving geospatial data. It's ideal for datasets with latitude/longitude columns where you want to limit comparisons to nearby records, or when you want to combine geo-based recall with text-based matching to improve overall candidate quality. Partitioning by country is recommended for global datasets to avoid irrelevant cross-region matches.
Best used with Python-capable data-science agents (Copilot-style code assistants, agent environments that can run Python) that can execute the provided example code or adapt it for production pipelines.
This skill has not been reviewed by our automated audit pipeline yet.