System Design |
|||
Visual |
| Video Input System | ||
![]() |
||
| Skeletonization | ||
![]() |
When the video comes into the Sender, it is first thresholded to produce a binary silhouette image, which is then passed to a distance transform algorithm. The distance transform scans both vertically and horizontally across the image, counting up along successive white pixels that represent a silhouette. If a pixel has already been counted on a previous pass, the algorithm checks its current count against the previous count and keeps the minimum. The result is that the largest values in the image lie along the middle of the silhouette. The goal of skeletonizing the image is to extract just this middle ridge.
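As a rough illustration of the distance transform described above, here is a minimal sketch in Python. It uses the standard two-pass city-block (L1) formulation rather than the project's actual code, which is not shown in this write-up; the function name and the list-of-lists image format are assumptions made for the example.

```python
def cityblock_distance_transform(mask):
    """Return, for every white (1) pixel, its L1 distance to the nearest black (0) pixel.

    `mask` is a list of rows containing 0 or 1. Black pixels get distance 0.
    """
    h, w = len(mask), len(mask[0])
    inf = h + w  # exceeds any L1 distance between two pixels of the image
    dist = [[inf if mask[y][x] else 0 for x in range(w)] for y in range(h)]

    # Forward pass: propagate counts from the top and left neighbors.
    for y in range(h):
        for x in range(w):
            if dist[y][x]:
                if y > 0:
                    dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
                if x > 0:
                    dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)

    # Backward pass: compare against counts from the bottom and right
    # neighbors and keep the minimum.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if dist[y][x]:
                if y < h - 1:
                    dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
                if x < w - 1:
                    dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist
```

For a small rectangular silhouette, the largest values end up along the middle row, which is the ridge that the skeletonization step tries to extract.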
This process has a few issues. First, the ridge value is not an actual local maximum. It simply tells you how far, in pixels, the middle of the shape is from the nearest edge pixel. Moreover, this distance is measured in the L1 norm and not the more familiar L2 (Euclidean) norm, which leaves ridge-like artifacts that can be difficult to handle. We filtered most of these out by recognizing that neighboring pixels in an image processed by the distance transform differ by one. A ridge was thus defined by a triangular profile extending 3 pixels in opposite directions. If a ridge value was 8, the triangle would be 5 6 7 8 7 6 5. Once the ridge pixels were extracted, we had to figure out how to connect them to form a skeleton. We did this by starting at the bottom left-hand corner of the image and scanning up successive rows. For each row, we kept track of the new ridge pixels we encountered and tried to find the most likely previously encountered ridge pixel to connect each of them to. Likelihood was determined by a number of factors, including proximity and the slope between points. The slope factor helped us overcome the horizontal bias in our scanning pattern. Once all of the ridge pixels were connected to likely matches, we applied a filter to join shorter line segments into longer lines. If 2 touching segments had a similar slope or were in close proximity, we removed the common point and treated the remaining endpoints as a new line. On average, this reduced our data from 300 pairs of points to around 60 while still closely approximating the original 300-segment skeleton. |
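The triangular ridge test can be sketched for a single row (or column) of distance values as follows. This is only an illustration of the 5 6 7 8 7 6 5 profile described above; the function names and the `reach` parameter are hypothetical, and the sketch does not handle flat-topped ridges where two adjacent pixels share the peak value.

```python
def is_ridge(values, i, reach=3):
    """True if values[i] is the peak of a triangle falling by 1 per pixel
    for `reach` pixels on both sides, e.g. 5 6 7 8 7 6 5 around the 8."""
    if i - reach < 0 or i + reach >= len(values):
        return False  # not enough pixels on one side to test the profile
    peak = values[i]
    return all(values[i + k] == peak - abs(k) for k in range(-reach, reach + 1))


def ridge_indices(values, reach=3):
    """Indices in a 1D scan line that match the triangular ridge profile."""
    return [i for i in range(len(values)) if is_ridge(values, i, reach)]
```

Running `ridge_indices` over every row and column of the distance image would give a set of candidate ridge pixels, which still have to be connected into a skeleton.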
|
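The segment-joining filter from the skeletonization step, which drops the shared point of two touching segments when they have a similar slope or lie close together, might look something like the sketch below. It assumes the skeleton is available as an ordered list of `(x, y)` points; the thresholds `max_angle_deg` and `max_offset` are illustrative values, not the ones used in the project.

```python
import math


def _turn_angle_deg(a, b, c):
    """Unsigned angle in degrees between the directions a->b and b->c."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.degrees(abs(math.atan2(cross, dot)))


def _offset_from_chord(a, b, c):
    """Perpendicular distance of b from the straight line through a and c."""
    cx, cy = c[0] - a[0], c[1] - a[1]
    bx, by = b[0] - a[0], b[1] - a[1]
    length = math.hypot(cx, cy)
    if length == 0:
        return math.hypot(bx, by)  # a and c coincide
    return abs(cx * by - cy * bx) / length


def merge_segments(points, max_angle_deg=10.0, max_offset=0.5):
    """Drop interior points whose two adjoining segments are nearly
    collinear or which lie very close to the line joining their neighbors."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        a, b, c = kept[-1], points[i], points[i + 1]
        if (_turn_angle_deg(a, b, c) <= max_angle_deg
                or _offset_from_chord(a, b, c) <= max_offset):
            continue  # remove the common point b; a..c becomes one line
        kept.append(b)
    kept.append(points[-1])
    return kept
```

Using an angle instead of a raw slope avoids dividing by zero on vertical segments. An L-shaped run of seven points, for example, collapses to its two endpoints plus the corner.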
| Rendering | ||
![]() <Membrain Texture Rendering> |
All rendering is done on the Receiver. It takes both local and remote skeletons and renders the virtual space from the point of view of the local skeletons. That is, the local skeletons are in the foreground with the remote skeletons in the background. In between them is the membrane, a tube-like, elastic surface that reacts to participants' motion both locally and remotely.
The skeletons are rendered by taking the line segments and turning them into textured quads. The quads have a Gaussian-kernel-type texture applied to them, and they are all blended together to form a filled-out person. This is then captured to a texture and further refined through fragment shaders. Finally, the rendered skeletons are textured onto the membrane and also brought back into software, where they are used to displace the membrane, creating a more dimensional look. The membrane itself consists of 3 pieces: the 2 surfaces onto which the local and remote skeletons are rendered, and the interactive form separating them. This middle form has a 2D wave simulation running along its surface such that, when it is stimulated, it fluctuates wildly according to the wave simulation with a damping constant near 1 (meaning that it takes a bit of time for it to return to steady state). The wave surface is textured with a particle-streaking texture that illuminates the surface according to participants' lateral position. |
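To give a sense of how a damped 2D wave simulation of this kind behaves, here is a minimal sketch of one time step on a height grid. It uses a common explicit finite-difference update and is not the project's implementation; the names `step_wave`, `c2` (squared wave speed in grid units) and `damping`, as well as the fixed zero-height border, are assumptions for the example.

```python
def step_wave(prev, cur, c2=0.25, damping=0.99):
    """Advance a 2D height field by one time step.

    `prev` and `cur` are equally sized lists of rows holding the heights at
    the previous and current steps. Border cells are held at 0. A `damping`
    value just below 1 lets ripples persist for a while before dying out.
    """
    h, w = len(cur), len(cur[0])
    nxt = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Discrete Laplacian: how far this cell sits from its 4 neighbors.
            lap = (cur[y - 1][x] + cur[y + 1][x]
                   + cur[y][x - 1] + cur[y][x + 1] - 4 * cur[y][x])
            nxt[y][x] = damping * (2 * cur[y][x] - prev[y][x] + c2 * lap)
    return nxt
```

Stimulating the surface amounts to adding a value to one or more cells of `cur` before calling `step_wave`, then feeding `(cur, nxt)` back in as the next `(prev, cur)` pair. Keeping `c2` at or below 0.5 keeps this particular update scheme stable.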
|