Recent advances in sensor technology and a continuous increase in available network bandwidth allow expansive camera networks to be deployed to survey extended areas. The Multi-Joint Vision project aims to acquire image data from such a large-scale multi-camera system and to process it in real time, distributed over multiple computers, in order to obtain information about persons, robots and objects within the surveyed area. Example applications include multi-person tracking, human gesture interpretation and object recognition. The project involves the integration of a flexibly expandable architecture and of interfaces for distributed computer vision applications. It is integrated into a demonstration scenario developed for the CoTeSys cluster of excellence, which aims to further research into multi-joint action of humans and cognitive systems.
To achieve a redundant survey of the scene, 40 Ethernet-connected cameras were installed on a metal scaffolding at ceiling height, approximately 3.5 m above floor level. The cameras' fields of view (FOVs) cover the whole experimental area with a top-down view. They were set up to achieve a coverage redundancy of approximately 75 %, measured at a height of 1.7 m (the average height of an adult person). The cameras used in the setup are Baumer TXG08c industrial imaging cameras, each of which provides images of 1024 x 768 pixels at a rate of 28 frames per second. Image acquisition occurs asynchronously over Gigabit Ethernet, using the GigE-Vision (GEV) standard.
Each of the cameras is connected to one of 40 diskless client nodes, where image capturing and processing are handled. Cameras with adjacent FOVs are assigned to different camera groups. This has two benefits. First, it compensates for the observation that human beings in social scenarios, such as the coffee-break demonstration scenario, tend to flock together rather than distribute evenly over the surveyed area, and thus improves load balancing between the image processing nodes. Second, it reduces the likelihood of adjacent cameras becoming unavailable simultaneously when a single processing node has problems, and thus improves system robustness. The diskless client nodes are assembled from off-the-shelf components, with x64 hexa-core processors and a Linux operating system. Each node is equipped with two network adapters. One of these adapters connects the node to the local camera group network, while the other connects it to the client network via a 48-port switch.
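The grouping rule above (cameras with adjacent FOVs never share a group) can be illustrated with a small sketch. It is not the project's actual assignment scheme. It assumes that the 40 cameras are arranged in a regular grid of 8 rows and 5 columns, that "adjacent" means the eight surrounding grid cells, and that four groups are used; the grid layout, the number of groups and all function names are hypothetical.

```python
from itertools import product

# Assumed layout: 40 cameras in a regular 8 x 5 grid (hypothetical).
ROWS, COLS = 8, 5


def camera_group(row: int, col: int) -> int:
    """Map a grid position to one of four groups (0..3).

    The group depends only on the parity of row and column, so any two
    cameras that touch horizontally, vertically or diagonally differ in
    at least one parity and therefore land in different groups.
    """
    return (row % 2) * 2 + (col % 2)


def build_assignment(rows: int = ROWS, cols: int = COLS) -> dict:
    """Return a dict mapping (row, col) to a group index."""
    return {(r, c): camera_group(r, c)
            for r, c in product(range(rows), range(cols))}


def adjacent_share_group(assignment: dict) -> bool:
    """Return True if any two 8-connected neighbours share a group."""
    for (r, c), group in assignment.items():
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) == (0, 0):
                continue  # skip the camera itself
            neighbour = (r + dr, c + dc)
            if assignment.get(neighbour) == group:
                return True
    return False


if __name__ == "__main__":
    assignment = build_assignment()
    for r in range(ROWS):
        print(" ".join(str(assignment[(r, c)]) for c in range(COLS)))
    print("adjacent cameras share a group:", adjacent_share_group(assignment))
```

With such a pattern, a crowd standing under a 2 x 2 block of cameras loads four different groups, and the failure of one group leaves every neighbouring camera operational.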
A variety of perception tasks have to be addressed by the described camera system. A common denominator of all these tasks is that they benefit from a complete survey of the scene. To allow the robots to approach specific persons for interaction, humans in the scenario have to be detected and tracked across the whole apartment in real time, without their identities being confused in the process. To allow robots to plan and execute the manipulation of objects, viable candidates such as tools or containers have to be detected and identified. For robot movement planning, the experimental space has to be segmented into traversable and obstructed areas by obstacle detection and floor segmentation.
Since the camera system is designed to cover the whole area, the challenges start with the scale of the system that has to be designed and integrated. At any single moment, a full-size combined image from all 40 cameras would measure 5120 x 6144 pixels, while the combined data rate generated by the cameras amounts to approximately 7.6 Gbps. Since this exceeds the capacity of a single GigE adapter by far, the image processing has to be distributed. This creates challenges regarding the integration of data across all the processing nodes serving the cameras, such as the real-time exchange of extracted features to track persons and objects. A common approach to tracking consists of two phases. In the detection phase, an initial estimate of the position of an object of interest is derived from an image in a computationally expensive process, without prior knowledge of its position. In the tracking phase, the object is followed through successive images by a predictive algorithm that exploits the knowledge of its position in the preceding images. To implement such a tracking approach efficiently on a distributed multi-camera system, world positions and tracked features have to be exchanged between the involved processing clients. This avoids repeated detection phases and thus improves the performance of the system beyond that of the sum of its parts.
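The following minimal sketch illustrates the idea of handing a track over between processing nodes so that the receiving node can skip its own detection phase. It is an illustration under stated assumptions, not the project's implementation: the text only speaks of "a predictive algorithm", so a simple constant-velocity prediction is substituted here, camera footprints are modelled as axis-aligned rectangles in world coordinates, and the names `Track`, `CameraNode` and `nodes_to_notify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """State of one tracked object in world coordinates (metres)."""
    track_id: int
    x: float
    y: float
    vx: float = 0.0  # velocity estimate in m/s
    vy: float = 0.0

    def predict(self, dt: float) -> tuple:
        """Constant-velocity prediction (assumed model) dt seconds ahead."""
        return self.x + self.vx * dt, self.y + self.vy * dt

    def update(self, x: float, y: float, dt: float) -> None:
        """Tracking phase: refresh position and velocity from a new measurement."""
        self.vx = (x - self.x) / dt
        self.vy = (y - self.y) / dt
        self.x, self.y = x, y


@dataclass
class CameraNode:
    """Processing node with the floor footprint of its camera (assumed rectangular)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def covers(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def nodes_to_notify(track: Track, nodes: list, dt: float) -> list:
    """Names of all nodes whose footprint contains the predicted position.

    These nodes would receive the track state (position, features) and
    could start in the tracking phase instead of running a detection.
    """
    px, py = track.predict(dt)
    return [node.name for node in nodes if node.covers(px, py)]


if __name__ == "__main__":
    nodes = [CameraNode("A", 0.0, 2.0, 0.0, 2.0),
             CameraNode("B", 1.5, 3.5, 0.0, 2.0)]  # footprints overlap
    person = Track(track_id=1, x=1.0, y=1.0)
    print(nodes_to_notify(person, nodes, dt=0.5))  # person is standing still
    person.update(1.4, 1.0, dt=0.5)                # person moves towards B
    print(nodes_to_notify(person, nodes, dt=0.5))
```

In the usage example, the person is first seen only by node A. After one update the predicted position falls into the overlap of both footprints, so node B would be notified before the person fully enters its field of view.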