Experimental

This module contains experimental functionality that may not be ready for production use.

Approximate Nearest Neighbors

class perception.experimental.ann.ApproximateNearestNeighbors(con, table, paramstyle, index, hash_length, metadata_columns=None, dtype='uint8', distance_metric='euclidean')

A wrapper for a FAISS index.

Parameters:
  • con – A database connection from which to obtain metadata for matched hashes.
  • table – The table in the database that we should query for metadata.
  • paramstyle – The DB-API parameter style used by the database driver (e.g. "qmark" for sqlite3).
  • index – A FAISS index (or filepath to a FAISS index)
  • hash_length – The length of the hash that is being matched against.
  • metadata_columns – The metadata that should be returned for queries.
  • dtype – The data type for the vectors.
  • distance_metric – The distance metric for the vectors.
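
For example, a minimal sketch of wrapping a previously saved index; the database file, table name, metadata column, and hash length below are hypothetical:

import sqlite3

from perception.experimental.ann import ApproximateNearestNeighbors

con = sqlite3.connect("hashes.db")
ann = ApproximateNearestNeighbors(
    con=con,
    table="hashes",
    paramstyle="qmark",        # sqlite3 uses the qmark parameter style
    index="index.faiss",       # filepath to a previously saved FAISS index
    hash_length=256,           # length of the stored hash vectors (hypothetical)
    metadata_columns=["url"],
)
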
classmethod from_database(con, table, paramstyle, hash_length, ids_train=None, train_size=None, chunksize=100000, metadata_columns=None, index=None, gpu=False, dtype='uint8', distance_metric='euclidean')

Train and build a FAISS index from a database connection.

Parameters:
  • con – A database connection from which to obtain metadata for matched hashes.
  • table – The table in the database that we should query for metadata.
  • paramstyle – The DB-API parameter style used by the database driver (e.g. "qmark" for sqlite3).
  • hash_length – The length of the hash that is being matched against.
  • ids_train – The IDs for the vectors to train on.
  • train_size – The number of vectors to use for training. The training vectors are selected by randomly drawing IDs between 1 and the number of vectors in the database. Ignored if ids_train is not None.
  • chunksize – The chunks of data to draw from the database at a time when adding vectors to the index.
  • metadata_columns – The metadata that should be returned for queries.
  • index – If a pretrained index is provided, training will be skipped, any existing vectors will be discarded, and the index will be repopulated with the current contents of the database.
  • gpu – If true, will attempt to carry out training on a GPU.
  • dtype – The data type for the vectors.
  • distance_metric – The distance metric for the vectors.
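
For example, a minimal sketch of building and saving an index from a SQLite database; the file names, table name, metadata column, and hash length are hypothetical:

import sqlite3

from perception.experimental.ann import ApproximateNearestNeighbors

con = sqlite3.connect("hashes.db")
ann = ApproximateNearestNeighbors.from_database(
    con=con,
    table="hashes",
    paramstyle="qmark",
    hash_length=256,           # length of the stored hash vectors (hypothetical)
    train_size=100000,         # train on a random sample of 100,000 vectors
    metadata_columns=["url"],
)
ann.save("index.faiss")        # persist the trained index for later reuse
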
nlist

The number of lists in the index.

nprobe

The current value of nprobe.

ntotal

The number of vectors in the index.

query_by_id(ids, include_metadata=True, include_hashes=False)

Get data from the database.

Parameters:
  • ids – The hash IDs to get from the database.
  • include_metadata – Whether to include metadata columns.
  • include_hashes – Whether to include the hashes.
Return type:

DataFrame
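
For example, assuming ann was built as in the earlier sketches (the IDs are hypothetical):

metadata = ann.query_by_id(ids=[1, 2, 3], include_metadata=True)
print(metadata)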

save(filepath)

Save an index to disk.

Parameters: filepath – Where to save the index.
search(queries, threshold=None, threshold_func=None, hash_format='base64', k=1)

Search the index and return matches.

Parameters:
  • queries (List[QueryInput]) – A list of queries in the form of {"id": <id>, "hash": "<hash_string>"}
  • threshold (Optional[int]) – The threshold to use for matching. Takes precedence over threshold_func.
  • threshold_func (Optional[Callable[[ndarray], int]]) – A function that, given a query vector, returns the desired match threshold for that query.
  • hash_format – The hash format used for the strings in the query.
  • k – The number of nearest neighbors to return.
Returns:

Matches in the form of a list of dicts of the form:

{"id": <query ID>, "matches": [{"distance": <distance>, "id": <match ID>, "metadata": {}}]}

The metadata consists of the contents of the metadata columns specified for this matching instance.
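
A minimal sketch of a search call, assuming ann was built as in the earlier sketches; the hash string and threshold are placeholders:

matches = ann.search(
    queries=[{"id": "example-1", "hash": "<hash string>"}],
    threshold=1200,            # placeholder distance threshold
    hash_format="base64",
    k=5,
)
for result in matches:
    print(result["id"], [match["id"] for match in result["matches"]])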

set_nprobe(nprobe)

Set the value of nprobe.

Parameters: nprobe – The new value for nprobe.
Return type: int
string_to_vector(s, hash_format='base64')

Convert a string to vector form.

Parameters:
  • s (str) – The hash string
  • hash_format – The format for the hash string
Return type:

ndarray

tune(n_query=100, min_recall=99, max_noise=3)

Obtain the minimum value of nprobe that achieves a target level of recall.

Parameters:
  • n_query – The number of hashes to use as test hashes.
  • min_recall – The minimum desired recall for the index.
  • max_noise – The maximum amount of noise to add to each test hash.

Returns: A tuple of recall, latency (in ms), and nprobe, where the nprobe value is the one that achieved the resulting recall.
Raises: TuningFailure if no suitable nprobe value is found.
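
For example, tuning the index and then applying the resulting nprobe value (parameter values are illustrative):

recall, latency_ms, nprobe = ann.tune(n_query=100, min_recall=99, max_noise=3)
ann.set_nprobe(nprobe)         # apply the value that achieved the target recall
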
vector_to_string(vector, hash_format='base64')

Convert a vector back to a string.

Parameters:
  • vector – The hash vector
  • hash_format – The format for the hash
Return type:

Optional[str]
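
For example, a round trip between the two representations; the hash string is a placeholder:

vector = ann.string_to_vector("<hash string>", hash_format="base64")
s = ann.vector_to_string(vector, hash_format="base64")   # back to string form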

perception.experimental.ann.serve(index, default_threshold=None, default_threshold_func=None, default_k=1, concurrency=2, log_level=20, host='localhost', port=8080)

Serve an index as a web API. This function does not block. If you wish to use the function in a blocking manner, you can do something like

loop = asyncio.get_event_loop()
loop.run_until_complete(serve(...))
loop.run_forever()

You can query the API with something like:

curl --header "Content-Type: application/json" \
     --request POST \
     --data '{"queries": [{"hash": "<hash string>", "id": "bar"}], "threshold": 1200}' \
     http://localhost:8080/v1/similarity
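
Or, equivalently, from Python using only the standard library (the hash string and threshold are placeholders):

import json
import urllib.request

request = urllib.request.Request(
    "http://localhost:8080/v1/similarity",
    data=json.dumps(
        {"queries": [{"hash": "<hash string>", "id": "bar"}], "threshold": 1200}
    ).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
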
Parameters:
  • index (ApproximateNearestNeighbors) – The underlying index
  • default_threshold (Optional[int]) – The default threshold for matches
  • default_threshold_func – The default threshold function for matches, used when a request does not specify one (see threshold_func in search)
  • default_k (int) – The default number of nearest neighbors to look for
  • concurrency (int) – The number of concurrent requests served
  • log_level – The log level to use for the logger
  • host – The host for the service
  • port – The port for the service

Local Descriptor Deduplication

perception.experimental.local_descriptor_deduplication.deduplicate(filepaths, max_features=256, min_features=10, max_size=256, coarse_pct_probe=0, coarse_threshold=100, minimum_coarse_overlap=0.01, minimum_validation_match=0.4, minimum_validation_intersection=0.6, minimum_validation_inliers=5, ratio=0.5, max_workers=None, use_gpu=True)

Deduplicate images by doing the following:

  1. Unletterbox all images and resize to some maximum size, preserving aspect ratio.
  2. Compute the SIFT descriptors and keypoints for all the resulting images.
  3. Perform a coarse, approximate search for images with common features.
  4. For each candidate pair, validate it pairwise by checking the features and keypoints with the traditional approach using the ratio test. See validate_match for more information.
Parameters:
  • filepaths (Iterable[str]) – The list of images to deduplicate.
  • max_features (int) – The maximum number of features to extract.
  • min_features (int) – The minimum number of features to extract.
  • max_size (int) – The maximum side length for an image.
  • coarse_pct_probe (float) – The minimum fraction of nearest lists to search. If the product of coarse_pct_probe and the number of lists is less than 1, one list will be searched.
  • coarse_threshold – The threshold for a match, expressed as a Euclidean distance.
  • minimum_coarse_overlap (float) – The minimum overlap between two files to qualify as a match.
  • minimum_validation_match (float) – The minimum number of matches passing the ratio test.
  • minimum_validation_intersection (float) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
  • minimum_validation_inliers (int) – The minimum number of inliers for the transformation matrix.
  • ratio (float) – The ratio to use for Lowe’s ratio test.
  • max_workers (Optional[int]) – The maximum number of threads to use for doing the final validation step.
  • use_gpu (bool) – If True, attempt to use a GPU for the coarse search step.
Return type:

List[Tuple[str, str]]

Returns:

A list of pairs of file duplicates.
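
For example, a minimal sketch of deduplicating a directory of images; the glob pattern is hypothetical:

from glob import glob

from perception.experimental.local_descriptor_deduplication import deduplicate

filepaths = glob("images/*.jpg")
duplicate_pairs = deduplicate(filepaths, max_workers=4)
for filepath_a, filepath_b in duplicate_pairs:
    print(filepath_a, "appears to duplicate", filepath_b)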

perception.experimental.local_descriptor_deduplication.validate_match(kp1, des1, kp2, des2, dims1, dims2, minimum_match=0.4, minimum_intersection=0.6, minimum_inliers=5, ratio=0.5)

Validate the match between two sets of keypoints and descriptors. The validation algorithm is as follows:

  1. Compute the mutual set of matches between the two sets of descriptors and filter them using Lowe’s ratio test.
  2. If the number of matches passing the ratio test is less than minimum_match, the match fails. This ensures we don't have trivial matches.
  3. Compute the intersection area of the matched keypoints versus the raw keypoints. If the area overlap is less than minimum_intersection, the match fails. This ensures we don’t match on small subsegments of an image, such as logos.
  4. Compute a transformation matrix using cv2.findHomography. If we cannot obtain a transformation matrix, the match fails. If the sum total of inliers for the transformation matrix is less than minimum_inliers, the match fails.
  5. Finally, use the transformation matrix on a set of points representing the bounding box of each image. If less than minimum_intersection of the bounding box fits within the bounds of the transformed version, the match fails. This is a second pass safety check for logos and other subsegments of images.
Parameters:
  • kp1 (ndarray) – The first set of keypoints.
  • des1 (ndarray) – The first set of descriptors.
  • kp2 (ndarray) – The second set of keypoints.
  • des2 (ndarray) – The second set of descriptors.
  • dims1 (Tuple[int, int]) – The dimensions (width, height) for the first image.
  • dims2 (Tuple[int, int]) – The dimensions (width, height) for the second image.
  • minimum_match (float) – The minimum number of matches passing the ratio test.
  • minimum_intersection (float) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
  • minimum_inliers (int) – The minimum number of inliers for the transformation matrix.
  • ratio – The ratio to use for Lowe’s ratio test.
Return type:

float

Returns:

True if the match passes, False otherwise.
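
A minimal sketch of calling validate_match on features extracted with OpenCV SIFT. The N x 2 (x, y) keypoint array format and the raw float32 SIFT descriptors are assumptions about the expected inputs; in practice, the keypoints and descriptors produced by the deduplication pipeline itself are the intended arguments:

import cv2
import numpy as np

from perception.experimental.local_descriptor_deduplication import validate_match

def sift_features(filepath, max_features=256):
    # Extract SIFT keypoints and descriptors with OpenCV. The keypoints are
    # returned as an N x 2 array of (x, y) coordinates, which is an assumption
    # about the format validate_match expects.
    image = cv2.imread(filepath)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=max_features)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    coordinates = np.array([keypoint.pt for keypoint in keypoints], dtype=np.float32)
    dims = (image.shape[1], image.shape[0])  # (width, height)
    return coordinates, descriptors, dims

kp1, des1, dims1 = sift_features("a.jpg")   # hypothetical file paths
kp2, des2, dims2 = sift_features("b.jpg")
result = validate_match(kp1, des1, kp2, des2, dims1, dims2)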