Experimental

This module contains experimental functionality that may not be ready for production use.

Approximate Nearest Neighbors

class perception.experimental.ann.ApproximateNearestNeighbors(con, table, paramstyle, index, hash_length, metadata_columns=None, dtype='uint8', distance_metric='euclidean')

A wrapper for a FAISS index.

Parameters:

con – A database connection from which to obtain metadata for matched hashes.
table – The table in the database that we should query for metadata.
paramstyle – The parameter style for the given database.
index – A FAISS index (or filepath to a FAISS index).
hash_length – The length of the hash that is being matched against.
metadata_columns – The metadata that should be returned for queries.
dtype – The data type for the vectors.
distance_metric – The distance metric for the vectors.
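The con, table, paramstyle, and metadata_columns parameters together describe how metadata is looked up for matched IDs. The following is a minimal sketch of that kind of lookup, not the library's own query code. It uses an in-memory SQLite database, and the table name hashes, the columns id, hash, and source, and the helper placeholders are all hypothetical.

```python
import sqlite3


def placeholders(n, paramstyle):
    """Return n comma-separated placeholders for a DB-API parameter style."""
    if paramstyle == "qmark":
        return ", ".join("?" for _ in range(n))
    if paramstyle == "format":
        return ", ".join("%s" for _ in range(n))
    raise ValueError(f"unsupported paramstyle in this sketch: {paramstyle}")


# Hypothetical schema: an ID, a hash string, and one metadata column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashes (id INTEGER PRIMARY KEY, hash TEXT, source TEXT)")
con.executemany(
    "INSERT INTO hashes VALUES (?, ?, ?)",
    [(1, "AAEC", "a.jpg"), (2, "AwQF", "b.jpg"), (3, "BgcI", "c.jpg")],
)

# Fetch the metadata column for a set of matched IDs.
ids = [1, 3]
sql = (
    "SELECT id, source FROM hashes "
    f"WHERE id IN ({placeholders(len(ids), sqlite3.paramstyle)}) ORDER BY id"
)
rows = con.execute(sql, ids).fetchall()
```

sqlite3 reports its parameter style as "qmark"; other drivers report a different value, which is why the style has to be passed alongside the connection.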

classmethod from_database(con, table, paramstyle, hash_length, ids_train=None, train_size=None, chunksize=100000, metadata_columns=None, index=None, gpu=False, dtype='uint8', distance_metric='euclidean')

Train and build a FAISS index from a database connection.

Parameters:

con – A database connection from which to obtain metadata for matched hashes.
table – The table in the database that we should query for metadata.
paramstyle – The parameter style for the given database.
hash_length – The length of the hash that is being matched against.
ids_train – The IDs for the vectors to train on.
train_size – The number of vectors to use for training. Will be randomly selected from 1 to the number of vectors in the database. Ignored if ids_train is not None.
chunksize – The chunks of data to draw from the database at a time when adding vectors to the index.
metadata_columns – The metadata that should be returned for queries.
index – If a pretrained index is provided, training will be skipped, any existing vectors will be discarded, and the index will be repopulated with the current contents of the database.
gpu – If true, will attempt to carry out training on a GPU.
dtype – The data type for the vectors.
distance_metric – The distance metric for the vectors.

nlist: The number of lists in the index.

nprobe: The current value of nprobe.

ntotal: The number of vectors in the index.

query_by_id(ids, include_metadata=True, include_hashes=False)

Get data from the database.

Parameters:

ids – The hash IDs to get from the database.
include_metadata – Whether to include metadata columns.
include_hashes – Whether to include the hashes.

Return type:	`DataFrame`

save(filepath)

Save an index to disk.

Parameters:	filepath – Where to save the index.

search(queries, threshold=None, threshold_func=None, hash_format='base64', k=1)

Search the index and return matches.

Parameters:

queries (List[QueryInput]) – A list of queries in the form of {"id": <id>, "hash": "<hash_string>"}
threshold (Optional[int]) – The threshold to use for matching. Takes precedence over threshold_func.
threshold_func (Optional[Callable[[ndarray], int]]) – A function that, given a query vector, returns the desired match threshold for that query.
hash_format – The hash format used for the strings in the query.
k – The number of nearest neighbors to return.

Returns:

Matches in the form of a list of dicts of the form

{"id": <query ID>, "matches": [{"distance": <distance>, "id": <match ID>, "metadata": {}}]}

The metadata consists of the contents of the metadata columns specified for this matching instance.
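To make the precedence rule and the result shape concrete, here is a small sketch in plain Python. It is illustrative only: the helpers resolve_threshold and format_result are hypothetical, the candidate tuples stand in for what an index lookup might produce, and the comparison distance <= threshold is an assumption about how the threshold is applied.

```python
def resolve_threshold(vector, threshold=None, threshold_func=None):
    """Choose the threshold for one query; an explicit threshold wins."""
    if threshold is not None:
        return threshold
    if threshold_func is not None:
        return threshold_func(vector)
    return None  # no thresholding requested


def format_result(query_id, candidates, threshold):
    """Build one result dict from (match_id, distance, metadata) tuples."""
    matches = [
        {"distance": distance, "id": match_id, "metadata": metadata}
        for match_id, distance, metadata in candidates
        # Assumption: a candidate matches when its distance does not
        # exceed the threshold.
        if threshold is None or distance <= threshold
    ]
    return {"id": query_id, "matches": matches}


vector = [0, 1, 2, 255]
# Both are supplied here, so the fixed threshold is the one used.
limit = resolve_threshold(vector, threshold=1200, threshold_func=lambda v: 10 * len(v))
result = format_result(
    "bar",
    [(17, 350.0, {"source": "a.jpg"}), (42, 1800.0, {"source": "b.jpg"})],
    limit,
)
```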

set_nprobe(nprobe)

Set the value of nprobe.

Parameters:	nprobe – The new value for nprobe
Return type:	`int`

string_to_vector(s, hash_format='base64')

Convert a string to vector form.

Parameters:

s (`str`) – The hash string.
hash_format – The format for the hash string.

Return type:	`ndarray`

tune(n_query=100, min_recall=99, max_noise=3)

Obtain the minimum value of nprobe that achieves a target level of recall.

Parameters:

n_query – The number of hashes to use as test hashes.
min_recall – The minimum desired recall for the index.
max_noise – The maximum amount of noise to add to each test hash.

Returns:	A tuple of recall, latency (in ms), and nprobe where the nprobe value is the one that achieved the resulting recall.
Raises:	TuningFailure if no suitable nprobe value is found.
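The idea behind tuning can be sketched as a search over nprobe values that stops at the first one meeting the recall target. The sketch below assumes a simple linear scan, which may not be the strategy the library uses. The function tune_nprobe, the callback measure_recall, and the local TuningFailure class are stand-ins defined only so the example runs on its own.

```python
import time


class TuningFailure(Exception):
    """Local stand-in for the exception named above."""


def tune_nprobe(measure_recall, nlist, min_recall=99):
    """Return (recall, latency_ms, nprobe) for the smallest passing nprobe."""
    for nprobe in range(1, nlist + 1):
        start = time.perf_counter()
        recall = measure_recall(nprobe)
        latency_ms = (time.perf_counter() - start) * 1000
        if recall >= min_recall:
            return recall, latency_ms, nprobe
    raise TuningFailure(f"no nprobe in 1..{nlist} reached recall {min_recall}")


def fake_recall(nprobe):
    """Hypothetical recall curve: probing more lists finds more neighbors."""
    return min(100.0, 60.0 + 10.0 * nprobe)


recall, latency_ms, nprobe = tune_nprobe(fake_recall, nlist=8)
```

In a real run, measure_recall would query the index with noisy copies of known hashes and report the percentage recovered.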

vector_to_string(vector, hash_format='base64')

Convert a vector back to string

Parameters:

vector – The hash vector.
hash_format – The format for the hash.

Return type:	`Optional`[`str`]
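The two conversion methods are inverses of each other. As a rough illustration for the default 'base64' format and 'uint8' vectors, the round trip can be sketched with the standard library alone. The helper names decode_hash and encode_hash are hypothetical, and a plain list of integers stands in for the ndarray the real methods work with.

```python
import base64


def decode_hash(s):
    """Decode a base64 hash string into a list of byte values (0-255)."""
    return list(base64.b64decode(s))


def encode_hash(vector):
    """Encode a sequence of byte values (0-255) as a base64 string."""
    return base64.b64encode(bytes(vector)).decode("ascii")


vector = [0, 1, 2, 255]
hash_string = encode_hash(vector)
round_trip = decode_hash(hash_string)
```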

perception.experimental.ann.serve(index, default_threshold=None, default_threshold_func=None, default_k=1, concurrency=2, log_level=20, host='localhost', port=8080)

Serve an index as a web API. This function does not block. If you wish to use the function in a blocking manner, you can do something like:

loop = asyncio.get_event_loop()
loop.run_until_complete(serve(...))
loop.run_forever()

You can query the API with something like:

curl --header "Content-Type: application/json" \
     --request POST \
     --data '{"queries": [{"hash": "<hash string>", "id": "bar"}], "threshold": 1200}' \
     http://localhost:8080/v1/similarity
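The same request can be built from Python. This is a sketch that mirrors the curl command using only the standard library. The helper build_similarity_request is hypothetical, and the request is constructed but not sent, since sending it requires a running server.

```python
import json
import urllib.request


def build_similarity_request(queries, threshold=None, host="localhost", port=8080):
    """Build (but do not send) a POST request for the similarity endpoint."""
    payload = {"queries": queries}
    if threshold is not None:
        payload["threshold"] = threshold
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/similarity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_similarity_request(
    [{"hash": "<hash string>", "id": "bar"}], threshold=1200
)
# With a server running, the request could be sent with:
#     urllib.request.urlopen(request)
```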

Parameters:

index (ApproximateNearestNeighbors) – The underlying index
default_threshold (Optional[int]) – The default threshold for matches
default_threshold_func – The default function for computing a per-query threshold (see threshold_func in search)
default_k (int) – The default number of nearest neighbors to look for
concurrency (int) – The number of concurrent requests served
log_level – The log level to use for the logger
host – The host for the service
port – The port for the service

Local Descriptor Deduplication

perception.experimental.local_descriptor_deduplication.deduplicate(filepaths, max_features=256, min_features=10, max_size=256, coarse_pct_probe=0, coarse_threshold=100, minimum_coarse_overlap=0.01, minimum_validation_match=0.4, minimum_validation_intersection=0.6, minimum_validation_inliers=5, ratio=0.5, max_workers=None, use_gpu=True)

Deduplicate images by doing the following:

Unletterbox all images and resize to some maximum size, preserving aspect ratio.
Compute the SIFT descriptors and keypoints for all the resulting images.
Perform a coarse, approximate search for images with common features.
For each candidate pair, validate it pairwise by checking the features and keypoints with the traditional approach using the ratio test. See validate_match for more information.

Parameters:

filepaths (`Iterable`[`str`]) – The list of images to deduplicate.
max_features (`int`) – The maximum number of features to extract.
min_features (`int`) – The minimum number of features to extract.
max_size (`int`) – The maximum side length for an image.
coarse_pct_probe (`float`) – The minimum fraction of nearest lists to search. If the product of coarse_pct_probe and the number of lists is less than 1, one list will be searched.
coarse_threshold – The threshold for a match as a Euclidean distance.
minimum_coarse_overlap (`float`) – The minimum overlap between two files to qualify as a match.
minimum_validation_match (`float`) – The minimum number of matches passing the ratio test.
minimum_validation_intersection (`float`) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
minimum_validation_inliers (`int`) – The minimum number of inliers for the transformation matrix.
ratio (`float`) – The ratio to use for Lowe's ratio test.
max_workers (`Optional`[`int`]) – The maximum number of threads to use for doing the final validation step.

Return type:	`List`[`Tuple`[`str`, `str`]]
Returns:	A list of pairs of file duplicates.
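Because the result is a list of pairs, a common follow-up step is to merge the pairs into groups of mutually duplicated files. The following sketch does this with a small union-find structure. It is not part of the library: group_duplicates is a hypothetical helper and the file names are made up.

```python
def group_duplicates(pairs):
    """Merge duplicate pairs into sorted groups of connected files."""
    parent = {}

    def find(path):
        # Register unseen paths as their own root, then walk to the root,
        # shortening the chain along the way.
        parent.setdefault(path, path)
        while parent[path] != path:
            parent[path] = parent[parent[path]]
            path = parent[path]
        return path

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    groups = {}
    for path in list(parent):
        groups.setdefault(find(path), set()).add(path)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical output of a deduplication run.
pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("d.jpg", "e.jpg")]
groups = group_duplicates(pairs)
```

Here a.jpg and c.jpg end up in the same group even though they were never reported as a direct pair, because both were paired with b.jpg.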

perception.experimental.local_descriptor_deduplication.validate_match(kp1, des1, kp2, des2, dims1, dims2, minimum_match=0.4, minimum_intersection=0.6, minimum_inliers=5, ratio=0.5)

Validate the match between two sets of keypoints and descriptors. The validation algorithm is as follows:

Compute the mutual set of matches between the two sets of descriptors and filter them using Lowe’s ratio test.
If the minimum number of passing matches is less than “minimum_match”, the match fails. This ensures we don’t have trivial matches.
Compute the intersection area of the matched keypoints versus the raw keypoints. If the area overlap is less than minimum_intersection, the match fails. This ensures we don’t match on small subsegments of an image, such as logos.
Compute a transformation matrix using cv2.findHomography. If we cannot obtain a transformation matrix, the match fails. If the sum total of inliers for the transformation matrix is less than minimum_inliers, the match fails.
Finally, use the transformation matrix on a set of points representing the bounding box of each image. If less than minimum_intersection of the bounding box fits within the bounds of the transformed version, the match fails. This is a second pass safety check for logos and other subsegments of images.
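The first step above, mutual matching filtered by Lowe's ratio test, can be sketched in plain Python on tiny descriptors. This is an illustration under stated assumptions rather than the library's implementation: it uses brute-force Euclidean distances, the helper names are hypothetical, and a match is kept only when the best distance is strictly below ratio times the second-best distance in both directions.

```python
import math


def ratio_test_matches(des_a, des_b, ratio=0.5):
    """Map each index in des_a to its match in des_b if it passes the ratio test."""
    matches = {}
    for i, a in enumerate(des_a):
        ranked = sorted((math.dist(a, b), j) for j, b in enumerate(des_b))
        if len(ranked) < 2:
            continue  # the ratio test needs two neighbors to compare
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best neighbor is clearly closer
        # than the runner-up.
        if best < ratio * second:
            matches[i] = j
    return matches


def mutual_matches(des_a, des_b, ratio=0.5):
    """Keep only the matches that pass the ratio test in both directions."""
    forward = ratio_test_matches(des_a, des_b, ratio)
    backward = ratio_test_matches(des_b, des_a, ratio)
    return [(i, j) for i, j in sorted(forward.items()) if backward.get(j) == i]


# Two-dimensional toy descriptors; real SIFT descriptors have 128 dimensions.
des_a = [(0, 0), (10, 10)]
des_b = [(0, 1), (10, 11), (50, 50)]
matched = mutual_matches(des_a, des_b, ratio=0.5)
```

The third descriptor in des_b is roughly equally far from both descriptors in des_a, so it fails the ratio test and produces no match.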

Parameters:

kp1 (`ndarray`) – The first set of keypoints.
des1 (`ndarray`) – The first set of descriptors.
kp2 (`ndarray`) – The second set of keypoints.
des2 (`ndarray`) – The second set of descriptors.
dims1 (`Tuple`[`int`, `int`]) – The dimensions (width, height) for the first image.
dims2 (`Tuple`[`int`, `int`]) – The dimensions (width, height) for the second image.
minimum_match (`float`) – The minimum number of matches passing the ratio test.
minimum_intersection (`float`) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
minimum_inliers (`int`) – The minimum number of inliers for the transformation matrix.
ratio – The ratio to use for Lowe's ratio test.

Return type:	`float`
Returns:	True if the match passes, False otherwise.