Compare two images of different versions of a page #15

anjackson · 2019-04-03T11:09:37Z

We would like to compare, for example, the crawl-time or current live version of a site with its archival-playback version. One way to do so is to compare images of the two versions using standard image comparison techniques.

The same approach could also be applied to comparing the same site over time, etc.

Related work:

PROMISE (see below)
wa_screenshot_compare
Brozzler captures crawl-time screenshots.
See capturing crawl images tool here and render WARCs in browser here. Example output here.
Detecting similar images at scale via fingerprints
Older SCAPE Project stuff:
- Browser Screenshot Comparison Tool
- browser-shots-tool
- Pagelyzer and its user manual

edipretoro · 2019-04-03T11:52:21Z

Here are some details about what I've done so far.

I've been using SSIM (https://ieeexplore.ieee.org/document/1284395 and https://ece.uwaterloo.ca/~z70wang/publications/ssim.pdf). Here is the Python code used:

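#!/usr/bin/env python

import cv2
# compare_ssim was removed from skimage.measure in scikit-image 0.18;
# structural_similarity in skimage.metrics is its replacement.
from skimage.metrics import structural_similarity

original_filename = './static/screenshots/fff6db9a99884cca50e01cbef58a4f33d797920f.jpeg'
archived_filename = './static/screenshots/fff6db9a99884cca50e01cbef58a4f33d797920f-archived.jpeg'

original = cv2.imread(original_filename)
archived = cv2.imread(archived_filename)

# cv2.imread returns None when a file cannot be read.
if original is not None and archived is not None:
    # SSIM is computed here on grayscale versions of the two screenshots.
    original_gray = cv2.cvtColor(original, cv2.COLOR_BGR2GRAY)
    archived_gray = cv2.cvtColor(archived, cv2.COLOR_BGR2GRAY)
    # score is 1.0 for identical images; diff is the per-pixel SSIM map.
    (score, diff) = structural_similarity(original_gray, archived_gray, full=True)
    print("SSIM similarity between the two images is %f" % score)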
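One practical point about the SSIM script above: scikit-image's SSIM function raises an error when the two arrays do not have the same dimensions, and screenshots of two versions of a page often differ in height. A minimal sketch of one possible workaround, assuming it is acceptable to compare only the shared top-left region (the helper name `crop_to_common_size` and the synthetic arrays are hypothetical, not part of the original script):

```python
import numpy as np


def crop_to_common_size(first, second):
    """Crop two image arrays (H x W or H x W x C) to their shared top-left region."""
    height = min(first.shape[0], second.shape[0])
    width = min(first.shape[1], second.shape[1])
    return first[:height, :width], second[:height, :width]


# Synthetic stand-ins for two grayscale screenshots of different heights.
live = np.zeros((1200, 1024), dtype=np.uint8)
archived = np.zeros((900, 1024), dtype=np.uint8)

live_cropped, archived_cropped = crop_to_common_size(live, archived)
print(live_cropped.shape, archived_cropped.shape)
```

The cropped arrays can then be passed to the SSIM call in place of the grayscale images. Cropping ignores whatever lies below the shorter screenshot, so a large height difference is probably worth reporting separately.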

Here is another attempt, this time with a perceptual hash:

#!/usr/bin/env python

from PIL import Image
import imagehash

original_filename = './data/fff6db9a99884cca50e01cbef58a4f33d797920f.jpeg'
archived_filename = './data/fff6db9a99884cca50e01cbef58a4f33d797920f-archived.jpeg'

hash_original = imagehash.phash(Image.open(original_filename))
hash_archived = imagehash.phash(Image.open(archived_filename))
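# Equal perceptual hashes mean the images look alike, not that the files are byte-identical.
if hash_archived == hash_original:
    print("Those two images have the same perceptual hash")
else:
    print("Those two images have different perceptual hashes")
# Subtracting two hashes gives their Hamming distance (number of differing bits).
print("Here is the difference: %d" % (hash_original - hash_archived))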
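The difference printed by the perceptual-hash script is a Hamming distance, that is, the number of bits in which the two hashes differ. A minimal pure-Python sketch of that idea, working on the hexadecimal strings that `str()` of an imagehash hash produces. The function names, the sample hash strings and the threshold of 10 bits are assumptions for illustration and would need tuning on real screenshots:

```python
def hamming_distance(hex_a, hex_b):
    """Count the differing bits between two equal-length hexadecimal hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def looks_similar(hex_a, hex_b, max_distance=10):
    """Treat two hashes as 'similar' when at most max_distance bits differ."""
    return hamming_distance(hex_a, hex_b) <= max_distance


# Two made-up 64-bit hashes that differ only in their final hex digit.
hash_original = "c3d4e5f60718293a"
hash_archived = "c3d4e5f60718293b"
print(hamming_distance(hash_original, hash_archived))
print(looks_similar(hash_original, hash_archived))
```

A threshold lets near-identical renderings (for example, a slightly different JPEG encoding) count as a match instead of requiring exact hash equality.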

And here is another algorithm, named VQI, which is used by Switzerland:

import cv2
from math import sqrt

def vqi_compute(original, archived, sample_size, url):
    # Both arguments are file paths; the two images are assumed to have the same dimensions.
    original = cv2.imread(original)
    archived = cv2.imread(archived)
    original_means_regions = []
    archived_means_regions = []
    # Sample every other sample_size x sample_size block and record its mean colour.
    for x in range(1, int(original.shape[1]/sample_size), 2):
        k = sample_size * x
        l = sample_size * x + sample_size
        for y in range(1, int(original.shape[0]/sample_size), 2):
            i = sample_size * y
            j = sample_size * y + sample_size
            original_means_regions.append(cv2.mean(original[i:j,k:l])[:3])
            archived_means_regions.append(cv2.mean(archived[i:j,k:l])[:3])
    # Sum the Euclidean distances between the mean colours of matching blocks.
    total = 0
    for o, a in zip(original_means_regions, archived_means_regions):
        total = total + sqrt(pow(o[0] - a[0], 2) + pow(o[1] - a[1], 2) + pow(o[2] - a[2], 2))
    return (url, total)

For that one, I just copied the function, because I haven't finished the implementation yet.
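To make the behaviour of the function above easier to check without screenshot files, here is a rough sketch of the same block-sampling idea operating on in-memory NumPy arrays. It is an illustration of what the copied function appears to compute, not a reference implementation of VQI. The name `block_mean_distance` and the synthetic test images are assumptions:

```python
import numpy as np


def block_mean_distance(first, second, sample_size):
    """Sum the Euclidean distances between mean colours of sampled blocks.

    Both inputs are H x W x C arrays of the same shape. Every other block is
    sampled, starting from the second one, as in the function above.
    """
    if first.shape != second.shape:
        raise ValueError("images must have the same shape")
    height, width, channels = first.shape
    total = 0.0
    for x in range(1, width // sample_size, 2):
        for y in range(1, height // sample_size, 2):
            rows = slice(sample_size * y, sample_size * (y + 1))
            cols = slice(sample_size * x, sample_size * (x + 1))
            # Mean colour of the block, one value per channel.
            mean_first = first[rows, cols].reshape(-1, channels).mean(axis=0)
            mean_second = second[rows, cols].reshape(-1, channels).mean(axis=0)
            total += float(np.linalg.norm(mean_first - mean_second))
    return total


# An 8 x 8 black image, and a copy whose colour is shifted by (3, 4, 0) everywhere.
base = np.zeros((8, 8, 3), dtype=np.uint8)
shifted = base.copy()
shifted[..., 0] = 3
shifted[..., 1] = 4

print(block_mean_distance(base, base, 2))
print(block_mean_distance(base, shifted, 2))
```

Identical images give a total of zero, and the total grows with both the colour difference per block and the number of sampled blocks, so scores are only comparable between image pairs of the same size and `sample_size`.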

Hope this helps!

anjackson · 2019-04-03T17:08:55Z

This Python implementation of e.g. pHash might be helpful: https://github.com/JohannesBuchner/imagehash

ibnesayeed · 2019-04-03T17:45:45Z

While ImageMagick can be used for pixel-level diffs, https://github.com/myint/perceptualdiff is a great tool to measure perceptible differences.
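For contrast with perceptual approaches, a naive pixel-level diff can be sketched in a few lines of NumPy. This is not how ImageMagick or perceptualdiff work internally. It only illustrates the idea of counting pixels whose values differ, and the function name `changed_fraction` and the `tolerance` parameter are assumptions:

```python
import numpy as np


def changed_fraction(first, second, tolerance=0):
    """Return the fraction of pixels whose values differ by more than tolerance."""
    if first.shape != second.shape:
        raise ValueError("images must have the same shape")
    # Widen to a signed type so that uint8 subtraction cannot wrap around.
    delta = np.abs(first.astype(np.int16) - second.astype(np.int16))
    if delta.ndim == 3:
        # For colour images, use the largest difference across channels.
        delta = delta.max(axis=2)
    return float((delta > tolerance).mean())


# A 4 x 4 black image, and a copy with a single red pixel.
before = np.zeros((4, 4, 3), dtype=np.uint8)
after = before.copy()
after[0, 0] = (255, 0, 0)

print(changed_fraction(before, after))
```

A pixel diff like this flags every change equally, including anti-aliasing or compression noise that a viewer would not notice, which is the gap that perceptual tools aim to close.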

anjackson · 2019-04-04T09:32:12Z

There's also https://github.com/commonsmachinery/blockhash-python/ (via @zuups)

anjackson added the visual-qa-stream label Apr 3, 2019
