Experimental
This module contains experimental functionality that may not be ready for production use.
Approximate Nearest Neighbors
-
class perception.experimental.ann.ApproximateNearestNeighbors(con, table, paramstyle, index, hash_length, metadata_columns=None, dtype='uint8', distance_metric='euclidean')
A wrapper for a FAISS index.
Parameters: - con – A database connection from which to obtain metadata for matched hashes.
- table – The table in the database that we should query for metadata.
- paramstyle – The parameter style for the given database
- index – A FAISS index (or filepath to a FAISS index)
- hash_length – The length of the hash that is being matched against.
- metadata_columns – The metadata that should be returned for queries.
- dtype – The data type for the vectors
- distance_metric – The distance metric for the vectors
-
classmethod from_database(con, table, paramstyle, hash_length, ids_train=None, train_size=None, chunksize=100000, metadata_columns=None, index=None, gpu=False, dtype='uint8', distance_metric='euclidean')
Train and build a FAISS index from a database connection.
Parameters: - con – A database connection from which to obtain metadata for matched hashes.
- table – The table in the database that we should query for metadata.
- paramstyle – The parameter style for the given database
- hash_length – The length of the hash that is being matched against.
- ids_train – The IDs for the vectors to train on.
- train_size – The number of vectors to use for training. The training vectors are selected at random from IDs between 1 and the number of vectors in the database. Ignored if ids_train is not None.
- chunksize – The number of rows to draw from the database at a time when adding vectors to the index.
- metadata_columns – The metadata that should be returned for queries.
- index – If a pretrained index is provided, training will be skipped, any existing vectors will be discarded, and the index will be repopulated with the current contents of the database.
- gpu – If true, will attempt to carry out training on a GPU.
- dtype – The data type for the vectors
- distance_metric – The distance metric for the vectors
-
nlist
The number of lists in the index.
-
nprobe
The current value of nprobe.
-
ntotal
The number of vectors in the index.
-
query_by_id(ids, include_metadata=True, include_hashes=False)
Get data from the database.
Parameters: - ids – The hash IDs to get from the database.
- include_metadata – Whether to include metadata columns.
- include_hashes – Whether to include the hashes
Return type: DataFrame
-
save(filepath)
Save an index to disk.
Parameters: filepath – Where to save the index.
-
search(queries, threshold=None, threshold_func=None, hash_format='base64', k=1)
Search the index and return matches.
Parameters: - queries (List[QueryInput]) – A list of queries in the form of {"id": <id>, "hash": "<hash_string>"}
- threshold (Optional[int]) – The threshold to use for matching. Takes precedence over threshold_func.
- threshold_func (Optional[Callable[[ndarray], int]]) – A function that, given a query vector, returns the desired match threshold for that query.
- hash_format – The hash format used for the strings in the query.
- k – The number of nearest neighbors to return.
Returns: Matches in the form of a list of dicts of the form { "id": <query ID>, "matches": [{"distance": <distance>, "id": <match ID>, "metadata": {}}]}
The metadata consists of the contents of the metadata columns specified for this matching instance.
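The threshold precedence rule and the documented result shape can be illustrated with a small sketch. It does not call the library: resolve_threshold and closest_match are hypothetical helpers, and the sample data is invented to match the shapes described above.

```python
from typing import Callable, Optional, Sequence


def resolve_threshold(
    vector: Sequence[int],
    threshold: Optional[int] = None,
    threshold_func: Optional[Callable[[Sequence[int]], int]] = None,
) -> Optional[int]:
    """Pick the threshold for one query; an explicit threshold wins."""
    if threshold is not None:
        return threshold
    if threshold_func is not None:
        return threshold_func(vector)
    return None


def closest_match(result: dict) -> Optional[dict]:
    """Return the lowest-distance match from one search result, if any."""
    matches = result["matches"]
    return min(matches, key=lambda m: m["distance"]) if matches else None


# Hypothetical data shaped like the documented input and output.
queries = [{"id": "q1", "hash": "<hash_string>"}]
results = [
    {
        "id": "q1",
        "matches": [
            {"distance": 40, "id": 7, "metadata": {}},
            {"distance": 12, "id": 3, "metadata": {}},
        ],
    }
]
best = closest_match(results[0])
```

Whether the real method orders matches by distance is not stated here, so the helper takes the minimum explicitly rather than relying on list order.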
-
set_nprobe(nprobe)
Set the value of nprobe.
Parameters: nprobe – The new value for nprobe.
Return type: int
-
string_to_vector(s, hash_format='base64')
Convert a string to vector form.
Parameters: - s (str) – The hash string.
- hash_format – The format for the hash string.
Return type: ndarray
-
tune(n_query=100, min_recall=99, max_noise=3)
Obtain the minimum value for nprobe that achieves a target level of recall.
Parameters: - n_query – The number of hashes to use as test hashes.
- min_recall – The minimum desired recall for the index.
- max_noise – The maximum amount of noise to add to each test hash.
Returns: A tuple of recall, latency (in ms), and nprobe, where the nprobe value is the one that achieved the resulting recall.
Raises: TuningFailure if no suitable nprobe value is found.
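The idea of searching for the smallest nprobe that reaches a recall target can be sketched as follows. This is only an illustration under stated assumptions: it scans nprobe linearly, treats recall as a percentage (suggested by the default of 99), and uses a made-up recall curve. The TuningFailure class and the find_min_nprobe and fake_recall functions are local stand-ins, not the library's implementation.

```python
from typing import Callable, Tuple


class TuningFailure(Exception):
    """Local stand-in for the exception named in the documentation."""


def find_min_nprobe(
    measure_recall: Callable[[int], float],
    nlist: int,
    min_recall: float = 99,
) -> Tuple[float, int]:
    """Return (recall, nprobe) for the smallest nprobe that meets min_recall."""
    for nprobe in range(1, nlist + 1):
        recall = measure_recall(nprobe)
        if recall >= min_recall:
            return recall, nprobe
    raise TuningFailure(f"No nprobe in 1..{nlist} reached recall {min_recall}.")


def fake_recall(nprobe: int) -> float:
    """Hypothetical recall curve: grows with nprobe and is capped at 100."""
    return min(100.0, 90.0 + 2.5 * nprobe)


recall, nprobe = find_min_nprobe(fake_recall, nlist=16)
```

In practice the recall measurement would come from querying the index with noisy copies of known hashes, which is what n_query and max_noise appear to control.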
-
vector_to_string(vector, hash_format='base64')
Convert a vector back to a string.
Parameters: - vector – The hash vector
- hash_format – The format for the hash
Return type: str
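As a rough picture of the round trip between hash strings and vectors, the sketch below assumes that a base64 hash string encodes the raw bytes of a uint8 vector. That encoding is an assumption, and b64_to_uint8 and uint8_to_b64 are hypothetical names that use only the standard library rather than the methods documented above.

```python
import base64
from typing import List


def b64_to_uint8(s: str) -> List[int]:
    """Decode a base64 hash string into a list of uint8 values."""
    return list(base64.b64decode(s))


def uint8_to_b64(vector: List[int]) -> str:
    """Encode a list of uint8 values back into a base64 string."""
    return base64.b64encode(bytes(vector)).decode("ascii")


vector = [0, 17, 255, 128]
encoded = uint8_to_b64(vector)
decoded = b64_to_uint8(encoded)
```

The decoded list equals the original vector, which is the property the two documented methods are expected to share for a given hash_format.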
-
perception.experimental.ann.serve(index, default_threshold=None, default_threshold_func=None, default_k=1, concurrency=2, log_level=20, host='localhost', port=8080)
Serve an index as a web API. This function does not block. To use it in a blocking manner, you can do something like:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(serve(...))
    loop.run_forever()
You can query the API with something like:
    curl --header "Content-Type: application/json" \
         --request POST \
         --data '{"queries": [{"hash": "<hash string>", "id": "bar"}], "threshold": 1200}' \
         http://localhost:8080/v1/similarity
Parameters: - index (ApproximateNearestNeighbors) – The underlying index.
- default_threshold (Optional[int]) – The default threshold for matches.
- default_k (int) – The default number of nearest neighbors to look for.
- concurrency (int) – The number of concurrent requests served.
- log_level – The log level to use for the logger.
- host – The host for the service.
- port – The port for the service.
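The same request as the curl example can be built from Python with the standard library. The sketch below only constructs the request and does not send it. build_similarity_request is a hypothetical helper, and the payload fields are taken from the curl example above.

```python
import json
import urllib.request


def build_similarity_request(
    hash_string: str,
    query_id: str,
    threshold: int,
    host: str = "localhost",
    port: int = 8080,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/similarity route."""
    payload = {
        "queries": [{"hash": hash_string, "id": query_id}],
        "threshold": threshold,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/similarity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_similarity_request("<hash string>", "bar", threshold=1200)
# With a server running, the request could be sent with:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```

The response format is not described here, so the sketch stops at reading the raw body.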
Local Descriptor Deduplication
-
perception.experimental.local_descriptor_deduplication.deduplicate(filepaths, max_features=256, min_features=10, max_size=256, coarse_pct_probe=0, coarse_threshold=100, minimum_coarse_overlap=0.01, minimum_validation_match=0.4, minimum_validation_intersection=0.6, minimum_validation_inliers=5, ratio=0.5, max_workers=None)
Deduplicate images by doing the following:
- Unletterbox all images and resize to some maximum size, preserving aspect ratio.
- Compute the SIFT descriptors and keypoints for all the resulting images.
- Perform a coarse, approximate search for images with common features.
- For each candidate pair, validate it pairwise by checking the features and keypoints with the traditional approach using the ratio test. See validate_match for more information.
Parameters: - filepaths (Iterable[str]) – The list of images to deduplicate.
- max_features (int) – The maximum number of features to extract.
- min_features (int) – The minimum number of features to extract.
- max_size (int) – The maximum side length for an image.
- coarse_pct_probe (float) – The minimum fraction of nearest lists to search. If the product of coarse_pct_probe and the number of lists is less than 1, one list will be searched.
- coarse_threshold – The threshold for a match as a Euclidean distance.
- minimum_coarse_overlap (float) – The minimum overlap between two files to qualify as a match.
- minimum_validation_match (float) – The minimum number of matches passing the ratio test.
- minimum_validation_intersection (float) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
- minimum_validation_inliers (int) – The minimum number of inliers for the transformation matrix.
- ratio (float) – The ratio to use for Lowe's ratio test.
- max_workers (Optional[int]) – The maximum number of threads to use for doing the final validation step.
Return type: List[Tuple[str, str]]
Returns: A list of pairs of file duplicates.
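Because the result is a list of pairs, a common follow-up is to merge the pairs into groups of mutually duplicated files. The sketch below is a post-processing step under that assumption and is not part of the library. group_duplicates is a hypothetical helper, and the file names are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_duplicates(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Merge duplicate pairs into groups using connected components."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    groups: List[Set[str]] = []
    seen: Set[str] = set()
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk over every file reachable from this one.
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical output of deduplicate(): pairs of duplicate file paths.
pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("x.jpg", "y.jpg")]
groups = group_duplicates(pairs)
```

Here a.jpg, b.jpg, and c.jpg end up in one group even though a.jpg and c.jpg were never reported as a pair directly.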
-
perception.experimental.local_descriptor_deduplication.validate_match(kp1, des1, kp2, des2, dims1, dims2, minimum_match=0.4, minimum_intersection=0.6, minimum_inliers=5, ratio=0.5)
Validate the match between two sets of keypoints and descriptors. The validation algorithm is as follows:
- Compute the mutual set of matches between the two sets of descriptors and filter them using Lowe’s ratio test.
- If the number of matches passing the ratio test is less than minimum_match, the match fails. This ensures we don't have trivial matches.
- Compute the intersection area of the matched keypoints versus the raw keypoints. If the area overlap is less than minimum_intersection, the match fails. This ensures we don’t match on small subsegments of an image, such as logos.
- Compute a transformation matrix using cv2.findHomography. If we cannot obtain a transformation matrix, the match fails. If the sum total of inliers for the transformation matrix is less than minimum_inliers, the match fails.
- Finally, use the transformation matrix on a set of points representing the bounding box of each image. If less than minimum_intersection of the bounding box fits within the bounds of the transformed version, the match fails. This is a second pass safety check for logos and other subsegments of images.
Parameters: - kp1 (ndarray) – The first set of keypoints.
- des1 (ndarray) – The first set of descriptors.
- kp2 (ndarray) – The second set of keypoints.
- des2 (ndarray) – The second set of descriptors.
- dims1 (Tuple[int, int]) – The dimensions (width, height) for the first image.
- dims2 (Tuple[int, int]) – The dimensions (width, height) for the second image.
- minimum_match (float) – The minimum number of matches passing the ratio test.
- minimum_intersection (float) – The minimum overlapping area between the keypoints in the filtered set of matches and the original keypoints.
- minimum_inliers (int) – The minimum number of inliers for the transformation matrix.
- ratio – The ratio to use for Lowe's ratio test.
Return type: float
Returns: True if the match passes, False otherwise.
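The first validation step, mutual matching filtered by Lowe's ratio test, can be sketched in pure Python. This is an illustration only. It assumes that "mutual" means a pair must pass the ratio test in both directions, and it uses brute-force Euclidean distances instead of whatever matcher the library uses. ratio_test_matches and mutual_matches are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def ratio_test_matches(
    des1: Sequence[Descriptor],
    des2: Sequence[Descriptor],
    ratio: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return (index in des1, index in des2) pairs that pass Lowe's ratio test."""
    matches = []
    for i, d1 in enumerate(des1):
        # Rank every descriptor in the other set by distance to this one.
        ranked = sorted((math.dist(d1, d2), j) for j, d2 in enumerate(des2))
        if len(ranked) < 2:
            continue
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best neighbor is clearly closer than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


def mutual_matches(
    des1: Sequence[Descriptor],
    des2: Sequence[Descriptor],
    ratio: float = 0.5,
) -> List[Tuple[int, int]]:
    """Keep only pairs that pass the ratio test in both directions."""
    forward = set(ratio_test_matches(des1, des2, ratio))
    backward = {(i, j) for j, i in ratio_test_matches(des2, des1, ratio)}
    return sorted(forward & backward)


# Tiny invented descriptors: two clear matches and one unrelated point.
des_a = [[0.0, 0.0], [10.0, 10.0]]
des_b = [[0.0, 1.0], [10.0, 10.5], [50.0, 50.0]]
matched = mutual_matches(des_a, des_b)
```

A lower ratio makes the test stricter, since the best neighbor must then be much closer than the second best before the pair is kept.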