Benchmarking
-
class perception.benchmarking.BenchmarkImageDataset(df)
-
categories
The categories included in the dataset
-
deduplicate(hasher, threshold=0.001, isometric=False)
Remove duplicate files from dataset.
Parameters:
- files – A list of file paths
- hasher (ImageHasher) – A hasher to use for finding a duplicate
- threshold – The threshold required for a match
- isometric – Whether to compute the rotated versions of the images
Return type: Tuple[BenchmarkImageDataset, Set[Tuple[str, str]]]
Returns: A list where each entry is a list of files that are duplicates of each other. We keep only the last entry.
-
filter(**kwargs)
Obtain a new dataset filtered with the given keyword arguments.
-
classmethod from_tuples(files)
Build dataset from a set of files.
Parameters: files (List[Tuple[str, str]]) – A list of tuples where each entry is a (filepath, category) pair.
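One way to produce the list of (filepath, category) pairs is to walk a directory in which each subdirectory holds one category. The sketch below is only an illustration: the helper name collect_labeled_files, the directory layout, and the extension filter are assumptions, not part of the library.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def collect_labeled_files(
    root: str, extensions: Sequence[str] = (".jpg", ".png")
) -> List[Tuple[str, str]]:
    """Return sorted (filepath, category) pairs, one category per subdirectory.

    Hypothetical helper: it assumes files are laid out as root/<category>/<file>.
    """
    pairs = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            # The name of the immediate parent directory serves as the category.
            pairs.append((str(path), path.parent.name))
    return pairs


# Possible usage (requires the perception package to be installed):
# from perception.benchmarking import BenchmarkImageDataset
# dataset = BenchmarkImageDataset.from_tuples(files=collect_labeled_files("images/"))
```

The same list shape is accepted by BenchmarkVideoDataset.from_tuples, so the helper could be reused with video extensions.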
-
classmethod load(path_to_zip_or_directory, storage_dir=None, verify_md5=True)
Load a dataset from a ZIP file or directory.
Parameters:
- path_to_zip_or_directory (str) – Path to the ZIP file or directory to load the dataset from.
- storage_dir (Optional[str]) – If providing a ZIP file, where to extract the contents. If None, contents will be extracted to a folder with the same name as the ZIP file in the same directory as the ZIP file.
- verify_md5 – Whether to verify MD5 checksums when loading.
-
save(path_to_zip_or_directory)
Save a dataset to a directory or ZIP file.
Parameters: path_to_zip_or_directory – Path to the directory or ZIP file to save the dataset to.
-
transform(transforms, storage_dir, errors='raise')
Prepare files to be used as part of a benchmarking run.
Parameters:
- transforms (Dict[str, Augmenter]) – A dictionary of transformations. The only required key is noop, which determines how the original, untransformed image is saved. For a true copy, set the value for the noop key to imgaug.augmenters.Noop().
- storage_dir (str) – A directory to store all the images along with their transformed counterparts.
- errors (str) – How to handle errors reading files. If “raise”, exceptions are raised. If “warn”, the error is printed as a warning.
Returns: A BenchmarkImageTransforms object
Return type: BenchmarkImageTransforms
-
-
class perception.benchmarking.BenchmarkImageTransforms(df)
-
categories
The categories included in the dataset
-
compute_hashes(hashers, max_workers=5)
Compute hashes for a series of files given some set of hashers.
Parameters:
- hashers (Dict[str, ImageHasher]) – A dictionary of hashers.
- max_workers (int) – Maximum number of workers for parallel hash computation.
Returns: A BenchmarkHashes object.
Return type: BenchmarkHashes
-
filter(**kwargs)
Obtain a new dataset filtered with the given keyword arguments.
-
classmethod load(path_to_zip_or_directory, storage_dir=None, verify_md5=True)
Load a dataset from a ZIP file or directory.
Parameters:
- path_to_zip_or_directory (str) – Path to the ZIP file or directory to load the dataset from.
- storage_dir (Optional[str]) – If providing a ZIP file, where to extract the contents. If None, contents will be extracted to a folder with the same name as the ZIP file in the same directory as the ZIP file.
- verify_md5 – Whether to verify MD5 checksums when loading.
-
save(path_to_zip_or_directory)
Save a dataset to a directory or ZIP file.
Parameters: path_to_zip_or_directory – Path to the directory or ZIP file to save the dataset to.
-
-
class perception.benchmarking.BenchmarkVideoDataset(df)
-
categories
The categories included in the dataset
-
filter(**kwargs)
Obtain a new dataset filtered with the given keyword arguments.
-
classmethod from_tuples(files)
Build dataset from a set of files.
Parameters: files (List[Tuple[str, str]]) – A list of tuples where each entry is a (filepath, category) pair.
-
classmethod load(path_to_zip_or_directory, storage_dir=None, verify_md5=True)
Load a dataset from a ZIP file or directory.
Parameters:
- path_to_zip_or_directory (str) – Path to the ZIP file or directory to load the dataset from.
- storage_dir (Optional[str]) – If providing a ZIP file, where to extract the contents. If None, contents will be extracted to a folder with the same name as the ZIP file in the same directory as the ZIP file.
- verify_md5 – Whether to verify MD5 checksums when loading.
-
save(path_to_zip_or_directory)
Save a dataset to a directory or ZIP file.
Parameters: path_to_zip_or_directory – Path to the directory or ZIP file to save the dataset to.
-
transform(transforms, storage_dir, errors='raise')
Prepare files to be used as part of a benchmarking run.
Parameters:
- transforms (Dict[str, Callable]) – A dictionary of transformations. The only required key is noop, which determines how the original, untransformed video is saved. Each transform should be a callable that accepts an input_filepath and an output_filepath argument and returns the output_filepath (which may have a different extension appended by the transform function).
- storage_dir (str) – A directory to store all the videos along with their transformed counterparts.
- errors (str) – How to handle errors reading files. If “raise”, exceptions are raised. If “warn”, the error is printed as a warning.
Returns: A BenchmarkVideoTransforms object
Return type: BenchmarkVideoTransforms
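The callable contract described for transforms can be illustrated with a plain copy. This is a minimal sketch, not the library's own noop: the function name noop_transform is hypothetical, and it assumes the output_filepath it receives has no extension, so the transform appends the input's extension itself.

```python
import os
import shutil


def noop_transform(input_filepath: str, output_filepath: str) -> str:
    """Copy the source video unchanged and return the path that was written."""
    # Assumption: output_filepath arrives without an extension, so reuse the
    # input's extension to keep the same container format.
    ext = os.path.splitext(input_filepath)[1]
    final_path = output_filepath + ext
    shutil.copy(input_filepath, final_path)
    # Returning the final path tells the caller where the file actually landed.
    return final_path


# The dictionary passed as `transforms` maps a transform name to such a callable.
transforms = {"noop": noop_transform}
```

Any other transform would follow the same shape: read input_filepath, write a file derived from output_filepath, and return the path that was written.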
-
-
class perception.benchmarking.BenchmarkVideoTransforms(df)
-
categories
The categories included in the dataset
-
compute_hashes(hashers, max_workers=5)
Compute hashes for a series of files given some set of hashers.
Parameters:
- hashers (Dict[str, VideoHasher]) – A dictionary of hashers.
- max_workers (int) – Maximum number of workers for parallel hash computation.
Returns: A BenchmarkHashes object.
Return type: BenchmarkHashes
-
filter(**kwargs)
Obtain a new dataset filtered with the given keyword arguments.
-
classmethod load(path_to_zip_or_directory, storage_dir=None, verify_md5=True)
Load a dataset from a ZIP file or directory.
Parameters:
- path_to_zip_or_directory (str) – Path to the ZIP file or directory to load the dataset from.
- storage_dir (Optional[str]) – If providing a ZIP file, where to extract the contents. If None, contents will be extracted to a folder with the same name as the ZIP file in the same directory as the ZIP file.
- verify_md5 – Whether to verify MD5 checksums when loading.
-
save(path_to_zip_or_directory)
Save a dataset to a directory or ZIP file.
Parameters: path_to_zip_or_directory – Path to the directory or ZIP file to save the dataset to.
-
-
class perception.benchmarking.BenchmarkHashes(df)
A dataset of hashes for transformed images. It is essentially a wrapper around a pandas.DataFrame with the following columns:
- guid
- error
- filepath
- category
- transform_name
- hasher_name
- hasher_dtype
- hasher_distance_metric
- hasher_hash_length
- hash
-
categories
The categories included in the dataset
-
compute_threshold_recall(precision_threshold=99.9, grouping=None, **kwargs)
Compute a table of threshold and recall for each combination of category, hasher, and transformation. Additional keyword arguments are passed to compute_metrics.
Parameters:
- precision_threshold – The precision threshold to use for choosing a distance threshold for each hasher.
- grouping – List of fields to group by. By default, all fields are used (category and transform_name).
Return type: DataFrame
Returns: A pandas DataFrame with 7 columns. The key columns are threshold (the optimal distance threshold for detecting a match for this combination), recall (the number of correct matches divided by the number of possible matches), and precision (the number of correct matches divided by the total number of matches, whether correct or incorrect).
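The relationship between threshold, recall, and precision can be made concrete with a small standalone calculation. The sketch below is not the library's implementation. It assumes a pair counts as a match when its distance is at or below the threshold, that both metrics are expressed as percentages (suggested by the default of 99.9), and that the "optimal" threshold is the one with the highest recall that still meets the precision bar. The function name pick_threshold and its inputs are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    true_distances: List[float],
    false_distances: List[float],
    precision_threshold: float = 99.9,
) -> Tuple[Optional[float], float, float]:
    """Return (threshold, recall %, precision %) maximizing recall at the precision bar.

    true_distances: distances between pairs that should match.
    false_distances: distances between pairs that should not match.
    """
    best: Tuple[Optional[float], float, float] = (None, 0.0, 0.0)
    if not true_distances:
        return best
    # Only observed distances can change the counts, so they are the candidates.
    for threshold in sorted(set(true_distances + false_distances)):
        correct = sum(d <= threshold for d in true_distances)
        incorrect = sum(d <= threshold for d in false_distances)
        # precision: correct matches / all matches, correct or incorrect
        precision = 100.0 * correct / (correct + incorrect)
        # recall: correct matches / possible matches
        recall = 100.0 * correct / len(true_distances)
        if precision >= precision_threshold and recall > best[1]:
            best = (threshold, recall, precision)
    return best


result = pick_threshold([0.1, 0.2, 0.3, 0.6], [0.5, 0.7, 0.8])
print(result)  # (0.3, 75.0, 100.0)
```

In this toy input, raising the threshold past 0.3 would admit the non-matching pair at distance 0.5, so precision would fall below the bar. Recall therefore tops out at three of the four possible matches.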
-
filter(**kwargs)
Obtain a new dataset filtered with the given keyword arguments.
-
show_histograms(grouping=None, precision_threshold=99.9, **kwargs)
Plot histograms for true and false positives, similar to https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid/
Additional keyword arguments are passed to compute_metrics.
Parameters: grouping – List of fields to group by. By default, all fields are used (category and transform_name).
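Taken together, the classes in this section chain into a pipeline: dataset, then transforms, then hashes, then metrics. The sketch below strings the documented methods together under several assumptions: the wrapper name run_image_benchmark is hypothetical, the caller supplies an ImageHasher instance, and the "shrink" transform built with imgaug's Resize is only an illustrative choice. It has not been run against any particular version of the library.

```python
from typing import List, Tuple


def run_image_benchmark(files: List[Tuple[str, str]], hasher, storage_dir: str):
    """Sketch: dataset -> transforms -> hashes -> threshold/recall table."""
    # Imported lazily so this module loads even without the optional packages.
    import imgaug.augmenters as iaa
    from perception import benchmarking

    # files is a list of (filepath, category) pairs.
    dataset = benchmarking.BenchmarkImageDataset.from_tuples(files=files)
    transformed = dataset.transform(
        transforms={
            "noop": iaa.Noop(),  # required key: stores the untouched original
            "shrink": iaa.Resize(0.5),  # illustrative transform; the name is arbitrary
        },
        storage_dir=storage_dir,
    )
    # The dictionary key labels the hasher in the resulting table.
    hashes = transformed.compute_hashes(hashers={"candidate": hasher})
    return hashes.compute_threshold_recall(precision_threshold=99.9)
```

The returned DataFrame can then be inspected per category and transform to compare hashers.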
Video Transforms
Transforming videos can be more complex than transforming images, so we provide the following tools.
-
perception.benchmarking.video_transforms.get_simple_transform(width=-1, height=-1, pad=None, codec=None, clip_pct=None, clip_s=None, sar=None, fps=None, output_ext=None)
Resize to a specific size and re-encode.
Parameters:
- width (Union[str, int]) – The target width (-1 to maintain aspect ratio)
- height (Union[str, int]) – The target height (-1 to maintain aspect ratio)
- pad (Optional[str]) – An ffmpeg pad argument provided as a string.
- codec (Optional[str]) – The codec for encoding the video.
- fps – The new frame rate for the video.
- clip_pct (Optional[Tuple[float, float]]) – The video start and end in percentages of video duration.
- clip_s (Optional[Tuple[float, float]]) – The video start and end in seconds (used over clip_pct if both are provided).
- sar – Whether to make all videos have a common sample aspect ratio (i.e., for all square pixels, set this to ‘1/1’).
- output_ext – The extension to use when re-encoding (used to select video format). It should include the leading ‘.’.
-
perception.benchmarking.video_transforms.get_black_frame_padding_transform(duration_s=0, duration_pct=0)
Get a transform that adds black frames at the start and end of a video.
Parameters:
- duration_s – The duration of the black frames in seconds.
- duration_pct – The duration of the black frames as a percentage of video duration. If both duration_s and duration_pct are provided, the maximum value is used.
-
perception.benchmarking.video_transforms.get_slideshow_transform(frame_input_rate, frame_output_rate, max_frames=None, offset=0)
Get a slideshow transform to create slideshows from videos.
Parameters:
- frame_input_rate – The rate at which frames will be sampled from the source video (e.g., a rate of 1 means we collect one frame per second of the input video).
- frame_output_rate – The rate at which the sampled frames are played in the slideshow (e.g., a rate of 0.5 means each frame will appear for 2 seconds).
- max_frames – The maximum number of frames to write.
- offset – The number of seconds to wait before beginning the slide show.
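The two rates are easiest to reason about with a quick calculation. The helper below is a rough sketch of the arithmetic only, not of what the transform does internally. The name slideshow_plan is hypothetical, and it assumes that offset skips that many seconds at the start of the source video and that partial frames are rounded down.

```python
import math
from typing import Optional, Tuple


def slideshow_plan(
    video_duration_s: float,
    frame_input_rate: float,
    frame_output_rate: float,
    max_frames: Optional[int] = None,
    offset: float = 0,
) -> Tuple[int, float]:
    """Estimate (number of frames, slideshow duration in seconds)."""
    # Assumption: the first `offset` seconds of the source are skipped.
    usable_s = max(video_duration_s - offset, 0.0)
    # frame_input_rate is in frames sampled per second of source video.
    n_frames = math.floor(usable_s * frame_input_rate)
    if max_frames is not None:
        n_frames = min(n_frames, max_frames)
    # Each frame is shown for 1 / frame_output_rate seconds.
    return n_frames, n_frames / frame_output_rate


print(slideshow_plan(60, frame_input_rate=1, frame_output_rate=0.5))
# (60, 120.0)
print(slideshow_plan(60, 1, 0.5, max_frames=10, offset=5))
# (10, 20.0)
```

Under these assumptions, a one-minute video sampled at one frame per second and played back at 0.5 frames per second yields a two-minute slideshow unless max_frames cuts it short.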