Media Deduplication
Perceptual hashes can be used to deduplicate sets of images. Below we provide two examples (one simple, one larger scale).
For most use cases, we recommend using PHash with hash_size=16 and a distance threshold of 0.2, as in the example below.
You may wish to adjust this threshold up or down based on your tolerance for false negatives and false positives.
For large collections, deduplicating in memory on a single machine with the methods below may be impractical. For larger-scale applications, you may wish to use tools like FAISS or Annoy, or a database that supports distance-based queries, such as MemSQL.
For the supported hashers, below are our recommended thresholds with expected false positive rates of <1%.
hasher | threshold |
---|---|
ahash (hash_size=16) | 0.008 |
blockmean | 0.008 |
dhash (hash_size=16) | 0.07 |
marrhildreth | 0.1 |
pdq | 0.2 |
phash (hash_size=16) | 0.2 |
wavelet (hash_size=16) | 0.02 |
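To make the thresholds above more concrete, here is a minimal sketch of how a threshold could be applied, assuming the distance is a normalized Hamming distance (the fraction of hash bits that differ) and that a hash_size=16 hash holds 16 * 16 = 256 bits. The helpers `normalized_hamming` and `is_duplicate` are hypothetical names used only for illustration and are not part of the library.

```python
def normalized_hamming(bits_a, bits_b):
    """Return the fraction of positions at which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    differing = sum(a != b for a, b in zip(bits_a, bits_b))
    return differing / len(bits_a)


def is_duplicate(bits_a, bits_b, threshold):
    """Treat two hashes as duplicates when their distance is below the threshold."""
    return normalized_hamming(bits_a, bits_b) < threshold


# Assumption: a hash_size=16 hash contains 16 * 16 = 256 bits.
n_bits = 16 * 16
reference = [0] * n_bits
near_copy = [1] * 40 + [0] * (n_bits - 40)    # 40 of 256 bits differ
different = [1] * 128 + [0] * (n_bits - 128)  # 128 of 256 bits differ

print(normalized_hamming(reference, near_copy))  # 0.15625
print(is_duplicate(reference, near_copy, 0.2))   # True
print(is_duplicate(reference, different, 0.2))   # False
```

Under these assumptions, a threshold of 0.2 tolerates roughly 51 differing bits out of 256, while a threshold of 0.008 tolerates about 2.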
Simple example
In this example, we download a ZIP file containing 18 images. One of the images is duplicated twice and another image is duplicated once.
import os
import glob
import zipfile
import urllib.request
import tabulate
import pandas as pd
from perception import tools, hashers
urllib.request.urlretrieve(
    "https://thorn-perception.s3.amazonaws.com/thorn-perceptual-deduplication-example.zip",
    "thorn-perceptual-deduplication-example.zip"
)
with zipfile.ZipFile('thorn-perceptual-deduplication-example.zip') as f:
    f.extractall('.')
filepaths = glob.glob('thorn-perceptual-deduplication-example/*.jpg')
duplicate_pairs = tools.deduplicate(files=filepaths, hashers=[(hashers.PHash(hash_size=16), 0.2)])
print(tabulate.tabulate(pd.DataFrame(duplicate_pairs), showindex=False, headers=['file1', 'file2'], tablefmt='rst'))
# Now we can do whatever we want with the duplicates. We could just delete
# the first entry in each pair or manually verify the pairs to ensure they
# are, in fact, duplicates.
file1 | file2 |
---|---|
thorn-perceptual-deduplication-example/309b.jpg | thorn-perceptual-deduplication-example/309.jpg |
thorn-perceptual-deduplication-example/309b.jpg | thorn-perceptual-deduplication-example/309a.jpg |
thorn-perceptual-deduplication-example/309a.jpg | thorn-perceptual-deduplication-example/309.jpg |
thorn-perceptual-deduplication-example/315a.jpg | thorn-perceptual-deduplication-example/315.jpg |
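Because a file can appear in several pairs (as the three 309 images do above), it may be convenient to merge the pairs into groups before deciding what to delete. The following is one possible sketch using a small union-find structure; `group_duplicates` is a hypothetical helper rather than a library function, and the file names are shortened for readability.

```python
from collections import defaultdict


def group_duplicates(pairs):
    """Merge pairwise duplicates into groups using union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    groups = defaultdict(list)
    for item in parent:
        groups[find(item)].append(item)
    return sorted(sorted(members) for members in groups.values())


# Pairs shaped like the output above, with the directory prefix dropped.
pairs = [
    ("309b.jpg", "309.jpg"),
    ("309b.jpg", "309a.jpg"),
    ("309a.jpg", "309.jpg"),
    ("315a.jpg", "315.jpg"),
]
groups = group_duplicates(pairs)
print(groups)  # [['309.jpg', '309a.jpg', '309b.jpg'], ['315.jpg', '315a.jpg']]

# Keep the first file of each group and mark the rest for removal.
to_remove = [name for group in groups for name in group[1:]]
print(to_remove)  # ['309a.jpg', '309b.jpg', '315a.jpg']
```

Keeping one file per group avoids the case where deleting the first entry of every pair removes more copies than intended.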
Real-world example
In the example below, we use the Caltech 256 Categories dataset. Like most other public image datasets, it contains a handful of duplicates in some categories.
The code below will:
- Download the dataset
- Group all the filepaths by category (the dataset is provided in folders)
- Within each group, find duplicates using PHash. We will compare not just the original images, but also the 8 isometric transformations for each image.
import os
import tarfile
from glob import glob
import urllib.request
import tqdm
from perception import hashers, tools
urllib.request.urlretrieve(
    "http://www.vision.caltech.edu/Image_Datasets/Caltech256/256_ObjectCategories.tar",
    "256_ObjectCategories.tar"
)
with tarfile.open('256_ObjectCategories.tar') as tfile:
    tfile.extractall()
files = glob('256_ObjectCategories/**/*.jpg')
# To reduce the number of pairwise comparisons,
# we can deduplicate within each image category
# (i.e., we don't need to compare images of
# butterflies with images of chess boards).
filepath_group = [
    (
        filepath,
        os.path.normpath(filepath).split(os.sep)[-2]
    ) for filepath in files
]
groups = list(set([group for _, group in filepath_group]))
# We consider any pair of images with a PHash distance of < 0.2
# a duplicate.
comparison_hashers = [(hashers.PHash(hash_size=16), 0.2)]
duplicate_pairs = []
for current_group in groups:
    current_filepaths = [
        filepath for filepath, group in filepath_group if group == current_group
    ]
    current_duplicate_pairs = tools.deduplicate(
        files=current_filepaths,
        hashers=comparison_hashers,
        isometric=True,
        progress=tqdm.tqdm
    )
    duplicate_pairs.extend(current_duplicate_pairs)
# Now we can do whatever we want with the duplicates. We could just delete
# the first entry in each pair or manually verify the pairs to ensure they
# are, in fact, duplicates.
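The 8 isometric transformations mentioned above are the rotations and reflections of a square image. The sketch below enumerates them for a tiny grid of numbers to show the idea; it is an illustration only and is not meant to describe how the library implements `isometric=True`. The function names are hypothetical.

```python
def rotate90(grid):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def isometric_variants(grid):
    """Return the 8 rotations and reflections of a 2D grid."""
    variants = []
    current = [list(row) for row in grid]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


image = [[1, 2],
         [3, 4]]
variants = isometric_variants(image)
print(len(variants))  # 8
print(variants[2])    # [[3, 1], [4, 2]] (one clockwise rotation)
```

Comparing all of these variants lets a rotated or mirrored copy be matched against the original, at the cost of more hashes per image.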
Video deduplication
Video deduplication requires more thought depending on your tolerance for false positives and how important temporal relationships are. Below is one example approach for deduplicating a group of videos by taking frames from each video that are sufficiently different from each other (to avoid keeping too many) and then using them all to find pairs of videos that have matching frames.
import urllib.request
import zipfile
import glob
import tqdm
import perception.hashers
import perception.tools
# Download some example videos.
urllib.request.urlretrieve(
    "https://thorn-perception.s3.amazonaws.com/thorn-perceptual-video-deduplication-example.zip",
    "thorn-perceptual-video-deduplication-example.zip"
)
with zipfile.ZipFile('thorn-perceptual-video-deduplication-example.zip') as f:
    f.extractall('.')
# By default, this will use TMK L1 with PHashU8.
hasher = perception.hashers.SimpleSceneDetection(max_scene_length=5)
# Collect the example video and GIF files to deduplicate.
filepaths = glob.glob('thorn-perceptual-video-deduplication-example/*.m4v') + \
    glob.glob('thorn-perceptual-video-deduplication-example/*.gif')
# Returns a list of dicts with a "filepath" and "hash" key. "hash" contains a
# list of hashes.
hashes = hasher.compute_parallel(filepaths=filepaths, progress=tqdm.tqdm)
# Flatten the hashes into a list of (filepath, hash) tuples.
hashes_flattened = perception.tools.flatten([
    [(hash_group['filepath'], hash_string) for hash_string in hash_group['hash']]
    for hash_group in hashes
])
duplicates = perception.tools.deduplicate_hashes(
    hashes=hashes_flattened,
    threshold=50,
    hasher=hasher
)
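Since each video contributes several hashes, the same pair of videos may match on more than one frame. One way to trade off false positives, sketched below, is to count matches per video pair and require a minimum number of them. This assumes the result is a list of (filepath, filepath) pairs that may contain repeats, which has not been verified here; `count_video_matches`, the sample file names, and the minimum of 2 are all hypothetical.

```python
from collections import Counter


def count_video_matches(frame_pairs):
    """Count matching frame pairs for each unordered pair of distinct videos."""
    counts = Counter()
    for first, second in frame_pairs:
        if first == second:
            continue  # ignore matches between frames of the same video
        counts[tuple(sorted((first, second)))] += 1
    return counts


# Hypothetical frame-level matches; the file names are made up.
frame_pairs = [
    ("a.m4v", "b.m4v"),
    ("b.m4v", "a.m4v"),
    ("a.m4v", "b.m4v"),
    ("a.m4v", "c.gif"),
    ("c.gif", "c.gif"),
]
matches = count_video_matches(frame_pairs)

# Require at least 2 matching frame pairs before calling two videos duplicates.
video_duplicates = sorted(pair for pair, n in matches.items() if n >= 2)
print(video_duplicates)  # [('a.m4v', 'b.m4v')]
```

Raising the minimum count reduces false positives from a single coincidental frame match, while lowering it catches videos that share only a short clip.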