This README was put together somewhat hastily, so please message me if you want clarification on something; if you don't understand something, it's most likely because my documentation isn't sufficient!
The following assumes you have:
- A set of images for a left-hand and a right-hand speaker that each include:
  - two directories of images with the face + eyelids: a blinking sequence and an "eye open" (non-blinking) image
  - a directory of potential iris positions (if you don't plan to use moving eyes, this can just be a single image)
  - four directories of images with the mouth: a sequence each for short, medium, and long syllables, plus an image to use for "listening" (mouth closed)
  - examples here.
- A background image
- A stereo .wav file featuring the speech you want in your video (I use one speaker on the left and the other on the right; example); a quick way to verify channel count and sample rate is sketched just after this list
- A TextGrid file with annotations for where short, medium, and long syllables are in each speaker's speech signal, plus indications of when each speaker is listening, when they should blink, and, if applicable, where their eyes should be looking at any given moment (example).
- A tab-delimited text file generated from the TextGrid (see instructions below; note that the script is sensitive to row order; layer the rows as eyes > blinks > speech as in the example)
- Software: Praat, Python, ffmpeg, and ImageMagick (note that the example commands at the bottom of this README use GraphicsMagick's gm frontend). I use Paintbrush (on OSX) to generate the images, but anything that produces PNGs will do.
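A quick optional sanity check (not part of the original materials): ffprobe, which ships with ffmpeg, reports a file's channel count and sample rate, which is handy both for the stereo .wav above and later, when the silences you create must match your recordings. The name dialogue01.wav is a placeholder.
ffprobe -v error -show_entries stream=channels,sample_rate -of default=noprint_wrappers=1 dialogue01.wav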
- record all your sentences (in mono) to your liking, clip the edges closely to the beginning and end of each turn, and equalize the volume of the sentence clips (e.g., in Praat, Modify/Scale intensity... 75)
- choose which speaker will be left and which will be right (I create both versions for each stim video)
- create a silent version of each sentence (e.g., in Praat use Modify/Set part to zero... all)
- combine the silent and non-silent versions of each turn so that each speaker's utterances are on the correctly assigned channel (e.g., speaker A on channel 1, speaker B on channel 2). In Praat, you can open both, select them, and click Combine/Combine to stereo; the channels will be combined in vertical order (top = 1/left, bottom = 2/right)
- create a 500ms silence (or a duration of your choosing) to go between all utterances and at the start and end of the videos. Make it in stereo, with sampling characteristics that match your recorded sentences (otherwise it will not concatenate)
- concatenate all the stereo sentences and silences for a dialogue. I do that in Praat by opening silence-utt1-silence-utt2-...-utt9-silence (in that order), selecting them all, and clicking Combine/Concatenate recoverably. Save both the resulting sound and the TextGrid (a rough ffmpeg-based alternative to the last few audio steps is sketched just after this list).
- manually add in the blink, speech, and (if applicable) eye tiers (see the example). The values for each annotation should match the directory structure of your source images. For this example:
  - the blink tier has two possible actions: B (blink) and N (nonblink)
  - the speech tier has four possible actions: listening, short (syllable; <200ms), medium (syllable; 200–500ms), and long (syllable; >500ms)
  - the eye tier has many possible actions, but they all sit on a grid encoded as two integers (e.g., 13), where the first integer indicates the vertical position of the iris (1 = high, 3 = low) and the second indicates the horizontal position (1 = left, 5 = right). You can generate all the 1–3 and 1–5 combinations and then run the script "make-grid-paths" to create all the possible eye movements on this grid automatically (the full label set can be printed with the one-liner at the bottom of this README).
- add a tier called "Snippets" with the name of the dialogue file you want as the annotation value, and use that same name for the .TextGrid and .wav files.
- Some extra notes about this example: I am having the two characters look away and down the whole time, and they only blink once per third of the dialogue. I am also making sure that blinks only happen while a speaker is speaking so they don't cause a distraction. Note that syllable onset placement isn't exactly where it'd be in, e.g., a phonetic analysis; generally speaking I put them in the middle of onset consonants for each syllable (so no boundaries between syllables ending and starting with vowels).
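If you'd rather script the audio preparation above than click through the Praat GUI, here is a rough ffmpeg sketch of the same steps (silencing a clip, putting each turn on its assigned channel, generating silence, and concatenating). This is not part of the original workflow: all file names (uttA1.wav, silence.wav, list.txt, dialogue01.wav) and the 44100 Hz rate are placeholders, every clip must share the same format for the concatenation to work, and you will still need Praat's Concatenate recoverably (or hand alignment) to get the matching TextGrid.
- silence a clip while keeping its duration:
ffmpeg -i uttA1.wav -af volume=0 uttA1-silent.wav
- put a speaker-A turn on channel 1 and its silenced copy on channel 2 (swap the two inputs for speaker-B turns):
ffmpeg -i uttA1.wav -i uttA1-silent.wav -filter_complex "[0:a][1:a]join=inputs=2:channel_layout=stereo[a]" -map "[a]" uttA1-stereo.wav
- generate 500ms of stereo silence at a matching sample rate:
ffmpeg -f lavfi -i anullsrc=channel_layout=stereo:sample_rate=44100 -t 0.5 silence.wav
- concatenate all the clips named in list.txt (one line per clip, e.g. file 'uttA1-stereo.wav', in playback order):
ffmpeg -f concat -safe 0 -i list.txt -c copy dialogue01.wav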
Convert the annotations (I use ELAN)
- When you're sure that all the annotations are correct and that they line up correctly with the stereo audio, you can convert the TextGrid to a tab-delimited text file. I find ELAN useful for this (see images below). Export it using the same dialogue file name you used for the .wav and .TextGrid files.
- For each video you want to create, make a folder with the same name that you've used for the .wav, .TextGrid, and .txt files. Put the associated .wav, .TextGrid, and .txt files in it, along with the directory of images that you want to use and a background image against which your characters will appear. This folder should sit next to the other Python and shell scripts for the animation (directory structure example here).
- Navigate to the directory that contains both these scripts and the dialogue directory (or directories) you made, then run ./animate DIR/ to begin the animation. The finished video should appear inside the dialogue directory; a hypothetical worked example of these last two steps follows just below.
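To make the last two steps concrete, here is one hypothetical invocation for a dialogue named dialogue01 (everything except ./animate is a placeholder; substitute your own file names, image directory, and background image):
mkdir dialogue01 && cp dialogue01.wav dialogue01.TextGrid dialogue01.txt dialogue01/ && cp -r images background.png dialogue01/ && ./animate dialogue01/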
In case they're helpful to you... (they're mostly useful ImageMagick commands + reminders for animation script usage)
- Make transparent background:
gm mogrify -transparent white $(find ./ -name '*.png')
- Create composites for left/right images:
gm composite -geometry -100+500 LFT-IMG.png BGD.png COMPOSITE.png
- Resize:
find . -name '*.png' -exec gm mogrify -resize 720x480 {} +
- Resize w/ crop:
find . -name '*.png' -exec gm mogrify -resize 720x405 -crop 720x404+0+1 {} +
- Make the eye movement paths:
./make-grid-paths DIR/images/L-eye/ && ./make-grid-paths DIR/images/R-eye/
- Create the video:
./animate DIR
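- Print every eye-position label on the 3-by-5 grid (just a reminder of the labeling scheme from the annotation section, not one of the repository's scripts; make-grid-paths is what actually builds the movement paths):
for v in 1 2 3; do for h in 1 2 3 4 5; do echo ${v}${h}; done; done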
These scripts were originally developed by @sctice!
Please cite this repository and/or the following paper when using the code:
Lammertink, Imme, de Vries, Maartje, Rowland, Caroline, & Casillas, Marisa (in prep). You and I: Using epistemic cues to predict who will talk next in conversation.