[Figure: Pixel-based eigenface approach]
[Figure: An early prototype interface using the 68-point dlib model]

The underlying model for the various experiments in this project was created in Python using emotion-tagged face-image datasets from Kaggle.

First Pass

At first, a straightforward eigenface approach was taken, using principal component analysis (PCA) to identify the "average" grayscale pixel representation for each of the tagged emotions. Combining these through weighted addition creates composite images that, in some sense, correspond to more complex emotional expressions.
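Below is a minimal sketch of what this first pass might look like, assuming the Kaggle images are already loaded as equally sized grayscale arrays grouped by emotion tag; the function names, variable names, and the 48×48 example size are illustrative rather than taken from the project code.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_pixel_model(images_by_emotion, n_components=32):
    """Fit PCA over flattened grayscale faces and compute a mean face
    per emotion tag, expressed in PCA space."""
    # One (n_samples, n_pixels) matrix per emotion tag.
    stacks = {
        emotion: np.stack([img.ravel() for img in imgs]).astype(float)
        for emotion, imgs in images_by_emotion.items()
    }
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack(list(stacks.values())))

    # The "average" pixel representation of each tagged emotion.
    emotion_means = {e: pca.transform(m).mean(axis=0) for e, m in stacks.items()}
    return pca, emotion_means

def composite(pca, emotion_means, weights, shape):
    """Weighted addition of per-emotion means, mapped back to pixel space."""
    coords = sum(w * emotion_means[e] for e, w in weights.items())
    return pca.inverse_transform(coords.reshape(1, -1))[0].reshape(shape)
```

A call like `composite(pca, means, {"happy": 0.6, "surprise": 0.4}, (48, 48))` would then blend two tags into a single composite image.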

While simple, a pixel-based approach unfortunately has several drawbacks:

  • Many of the most significant PCA components are not expression-related, but rather represent variations in lighting, pose, skin color(!), and other factors. The input faces need to be as uniformly normalized as possible to limit these effects. (Spoiler Alert: This is never perfectly possible.)
  • The very nature of facial expression is about spatial distortion of features, not pixel-value changes at fixed image coordinates, which is fundamentally at odds with the normalization requirements above.
  • The PCA components need to be able to "cancel" each other out in superposition, which means many components are in an inverted grayscale. Humans are decidedly not unperturbed by such inversion.

Face Detection

In addition to grayscale-value adjustments like gamma correction, an attempt was made to improve this naive pixel-based approach by using the dlib face detection library to better normalize the images through rotation, centering, scaling, and masking. Despite noticeable improvements, the fundamental issues with the approach remained. While continuing to fine-tune and hunt for workarounds (such as Sobel filters for edge detection), the epiphany arrived: use the landmark point coordinates from the face detection library as the data source itself. Instead of pixel-vector components, PCA could now produce coordinate-displacement vectors representing the emotions (a sketch of this pipeline follows the list below). This quickly solved almost every issue:

  • Coordinates could easily be normalized through affine transformations
  • "Superficial" properties (lighting, skin-tone, etc.) now only effected the model in so much that they effected correct detection of faces (and could easily be skipped if problematic)
  • The polarity of the displacement values had a clear interpretation in terms of direction of movement of individual landmarks
  • Decisions on how to visualize the resulting expressions could now be made at will
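As a concrete illustration of the switch, the sketch below uses dlib's frontal face detector and the standard 68-point shape predictor to extract landmark coordinates, normalizes them with a simple similarity (affine) transform, and runs PCA over the flattened coordinates so that each component becomes a set of per-landmark displacement vectors. The specific normalization choices and helper names are assumptions made for illustration, not the project's exact code.

```python
import dlib
import numpy as np
from sklearn.decomposition import PCA

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(gray_image):
    """Return the 68 landmark coordinates of the first detected face, or None."""
    faces = detector(gray_image, 1)
    if not faces:
        return None  # problematic images can simply be skipped
    shape = predictor(gray_image, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], float)

def normalize(points):
    """Center on the centroid, rotate the eye line horizontal, and scale so the
    inter-ocular distance is 1 (points 36/45 are the outer eye corners)."""
    centered = points - points.mean(axis=0)
    eye_vec = centered[45] - centered[36]
    angle = np.arctan2(eye_vec[1], eye_vec[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centered @ np.array([[c, -s], [s, c]]).T
    return rotated / np.linalg.norm(rotated[45] - rotated[36])

def fit_displacement_model(normalized_faces, n_components=16):
    """PCA over flattened, normalized coordinates: each component reshapes to
    (68, 2), i.e. a displacement vector for every landmark."""
    pca = PCA(n_components=n_components)
    pca.fit(np.stack([f.ravel() for f in normalized_faces]))
    return pca
```

Because the components are now displacements, a negative weight simply moves a landmark in the opposite direction, which resolves the grayscale-inversion problem noted earlier.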

Micro-Expressions

The dlib model used consisted of 68 landmark points representing the key features of the eyes, nose, mouth, and outer boundary of the face. The resulting emotional expression model appeared to cover a wide range of composite emotions, but seemed limited in representing a few of the key primary emotions tagged in the dataset. Surprise and happiness, for example, were easily recognizable thanks to the major changes in the size and shape of the eyes and mouth. But disgust was not: the model was unable to capture (or visualize) the "scrunching" of the face that occurs separately from the displacement of major features. More points were needed.

3D Rendering

MediaPipe's face detection model provides a 478-point landmark set, and so a new expression model was built using the same PCA process as before. Initially, additional landmarks were chosen to draw "wrinkles" on the two-dimensional visualization, but the "drawn-on" nature of this approach looked rather unnatural. Another aha moment came in realizing that MediaPipe's three-dimensional output could be rendered in the browser using Three.js. By switching to 3D rendering, light and shadow could naturally indicate more of the changes outside of major feature displacement. A cel-shader was used to give the model a simplified but "friendly" appearance, avoiding possible uncanny valley effects. These aesthetic choices could easily be changed in the rendering pipeline. And more excitingly, the underlying expression model could still be visualized in entirely different ways (cf. the pareidolia experiment).
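On the data side, the 478-point landmarks can be pulled from MediaPipe's Python face-mesh solution and fed into the same PCA process; the snippet below is a hedged sketch of that extraction step (refine_landmarks=True is what raises the count from 468 to 478 by adding the iris points). The Three.js and cel-shader work happens in the browser and is not shown here; the helper name is illustrative.

```python
import numpy as np
import mediapipe as mp

def landmarks_3d(rgb_image):
    """Return a (478, 3) array of normalized x/y/z coordinates, or None
    if no face is detected."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         refine_landmarks=True,
                                         max_num_faces=1) as mesh:
        results = mesh.process(rgb_image)
    if not results.multi_face_landmarks:
        return None
    face = results.multi_face_landmarks[0]
    return np.array([[lm.x, lm.y, lm.z] for lm in face.landmark], float)
```

The extra z coordinate is what enables the Three.js rendering described above, where light and shadow can reveal the subtler changes that the 2D visualization missed.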