Bidimensional Bar Code Project
The aim of the BBC project is to apply High Performance Computer Networks (HPCN)
technology to the real-time decoding of new bi-dimensional, optically readable
codes that are an evolution of traditional linear bar codes. The HPCN approach
is necessary to adapt the performance of the products to the requirements
of this new market. The consortium comprises Datalogic, a manufacturer of bar
code readers, acting as End User; the University of Parma, which has expertise
in HPCN architectures and acts as HPCN Expert; and Elsag Bailey, a manufacturer
of machines for service automation with previous experience in applying HPCN
to image processing, acting as HPCN Technology Provider. The main benefits
within the consortium will go to Datalogic, whose purpose is to lay down the
foundation for a new generation of readers. The results of the work can also be
used to disseminate the usage of HPCN in other embedded systems with real-time
constraints.
The BBC Consortium
The BBC Consortium is composed of Datalogic, the Dipartimento
di Ingegneria dell'Informazione of the University
of Parma and Elsag Bailey.
Datalogic S.p.A. has been designing, manufacturing and selling products for
automatic identification since 1972 and is the European leader in the sector of
bar code readers, with worldwide market coverage.
Datalogic is active in the following areas (Business Units):
Fixed Position Scanners, for reading
bar codes or other printed codes without the supervision of a human operator;
in this field Datalogic occupies a leading position in Europe and the second
position in the world.
Hand Held Readers, for reading bar codes
or other printed codes in industrial or retail environments; in this field
Datalogic occupies the second position in the European market.
Photocells, in which Datalogic is
the Italian leader and one of the main competitors in Europe.
The Dipartimento di Ingegneria dell'Informazione
acts as an expert in High Performance Computer Architectures, Parallel
Programming and Distributed Systems. Its task is to design and evaluate
solutions for Bidimensional Bar Code processing using an approach based
on common off-the-shelf (COTS) hardware and software. Nowadays such a
solution is often more cost effective than a custom hardware/software co-designed
system; in particular, different combinations of platforms will be evaluated
and multiple parallelization strategies will be employed. In addition, the
Dipartimento di Ingegneria dell'Informazione will provide training and
support related to software and architectures for parallel and distributed
computing.
Elsag Bailey, being an established producer of industrial-strength
image processing systems, acts as a provider of algorithms and
software for bidimensional bar code recognition and decoding. Elsag Bailey,
a Finmeccanica Company, is the leader of a Group of Companies operating
in the business area of Electronics for Automation. The fields of activity
of the Group are Automation of Services and Automation of Discrete and Continuous
Industrial Processes. The Company designs, produces and installs electronic
systems for the automation of industrial processes, the distribution of electrical
energy, equipment and information systems for factory automation, mechanization
and automation of postal services, electronic mail services, image processing
and recognition, and automatic document reading.
The application domain
Bar codes are a widely used system
for attaching information (such as vendor or product description) to goods.
Until now the most important automatic identification symbologies have
been the linear bar-codes. Though many different kinds of
bar-codes exist, they all encode the information in the width of the elements
of the symbol (black bars and white spaces). A bar-code can store only
a limited number of characters (typically ten to thirty) and these
characters are often used as a key into a computerized database where the
complete information is available. A typical example is the UPC
(Universal Product Code) symbology, used to identify products
in the retail environment.
Moreover, the intrinsic robustness of bar code symbologies is not very
high, given that many codes do not have any check capability and only some
of them make use of a check digit, which reduces but does not eliminate the
risk of misreading. In this situation a typical way to improve the reliability
level is to exploit the height of the code and to validate a decoding operation
only if a certain number of identical results are obtained from the same
symbol.
For these reasons some new kinds of bar code were introduced, specifically
Bidimensional Bar Codes: the aim was to increase both the
density, that is the amount of information (i.e. the number of
characters) stored per unit area, and the code reliability,
taking advantage of Error Correction Code (ECC) schemes. This required
moving from unidimensional to bidimensional codes, to better exploit
all the available space on goods packages.
The figure below shows an ordinary linear bar code, along with some
bidimensional codes extending its capabilities.
The PDF "stacked" code uses all the available height to fit
multiple code rows into the same area of a linear bar code, increasing
code spatial density; click here
to see some examples of real shots of packages labeled with PDF
The MAXICODE symbol uses a matrix of black and white hexagons
to code information bits, protected by Reed-Solomon polynomial block
codes. In order to ease code location, a suitable bull's eye pattern
is put in the centre of the code. This is made up of a series of alternating
black and white circles, as can be seen in the figure below.
The DATAMATRIX symbol is a bi-dimensional code without a locating
target, suitable to be used when only a small amount of information is required
and the available area is too small to hold a linear bar code, e.g. on silicon
wafers, electronic components or small pharmaceutical items. Its nature is
that of a truly bidimensional, black and white dot matrix.
Due to these complex features, the process of decoding a bi-dimensional
code is a computing intensive task that requires far more computing power
than the decoding of a traditional, linear bar code.
Bar code reading can be either attended or unattended:
in attended reading a human operator is physically present to control
the reader (a typical example is a hand-held reader, with which the user
"shoots" at the item carrying the codes, triggering an image acquisition
from a small digital camera inside the reader with each shot). Unattended
reading, instead, does not involve any direct intervention by human
operators and is used in industrial environments, where conveyor belts
carry the packages around and fixed cameras take still pictures of the
goods or even acquire a continuous video stream.
The typical steps required to decode a symbol are:
Acquire a bi-dimensional image, in grey levels, of the label containing
the symbol: this can be done by means of a 2-D CCD camera or, in the case
of moving targets, by a system based on a linear CCD.
Locate the symbol inside the image: this is often the slowest, most time-consuming
processing activity, because of the large size of the image and because
the type, number and resolution of the symbols present in the image,
as well as the context of the image itself, are typically unknown.
Its goal is to find all the possible regions (called Regions of Interest)
that could contain a symbol. Location must be robust against noise,
distortion, interference, scaling and rotation.
Extract the characteristic features of the symbols from the regions of interest
found in the previous step (e.g. for a MAXICODE the features are the grey
intensity and the coordinates of the centre of each hexagon in the symbol).
Decode the symbol, classifying the extracted features using the syntactic
rules of the symbology and translating the information into an ASCII string.
This step often includes an error detection and
correction technique (e.g. Reed-Solomon algorithms are often used) whose
computation time increases rapidly with the correction capability.
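To make the flow of these four steps concrete, the following is a minimal C sketch of such a decoding pipeline. It is purely illustrative: all type and function names (acquire_image, locate_symbols, extract_features, decode_symbol) are hypothetical and do not correspond to the actual project code; the stub implementations only show how data moves from one step to the next.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { int width, height; unsigned char *pixels; } Image;  /* grey-level image */
    typedef struct { int x, y, w, h; } RegionOfInterest;                 /* candidate symbol area */

    /* Step 1: acquire a grey-level image (here just a synthetic blank frame). */
    static Image acquire_image(void)
    {
        Image img = { 660, 494, NULL };
        img.pixels = calloc((size_t)img.width * img.height, 1);
        return img;
    }

    /* Step 2: locate candidate symbols; this stub reports a single region
       covering the whole image. */
    static int locate_symbols(const Image *img, RegionOfInterest *rois, int max_rois)
    {
        if (max_rois < 1)
            return 0;
        rois[0] = (RegionOfInterest){ 0, 0, img->width, img->height };
        return 1;
    }

    /* Step 3: extract symbology-specific features from one region (stub). */
    static int extract_features(const Image *img, const RegionOfInterest *roi,
                                double *features, int max_features)
    {
        (void)img; (void)roi;
        if (max_features < 1)
            return 0;
        features[0] = 0.0;
        return 1;
    }

    /* Step 4: classify the features, apply error correction, emit ASCII text (stub). */
    static int decode_symbol(const double *features, int n_features, char *text, size_t len)
    {
        (void)features; (void)n_features;
        strncpy(text, "<decoded content>", len - 1);
        text[len - 1] = '\0';
        return 0;
    }

    int main(void)
    {
        Image img = acquire_image();
        RegionOfInterest rois[16];
        int n = locate_symbols(&img, rois, 16);
        for (int i = 0; i < n; i++) {
            double feats[64];
            int nf = extract_features(&img, &rois[i], feats, 64);
            char text[256];
            if (decode_symbol(feats, nf, text, sizeof text) == 0)
                printf("symbol %d: %s\n", i, text);
        }
        free(img.pixels);
        return 0;
    }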
The figures below show the simplified structure of both attended and
unattended readers.
The problem consists in the real-time decoding of these bidimensional
symbologies, and this is a very computing intensive task.
The term 'real time' means
different things depending on the system: in the case of an attended
system, in which a human operator is using the reader,
real time means having a response time that feels immediate to the operator,
e.g. a decoding time of about 500 msec is acceptable.
In an unattended system,
in which the reader is part of a fully automatic system, real
time means both being able to manage a very large data throughput
and performing the decoding operation in a very short time (e.g. a throughput
of up to 160 MPixels/sec and an average response time of 160 msec are desirable).
Datalogic has already developed two readers to cover these applications,
but the performance obtained is still far from this target, as is that
of the other competitors.
The HPCN approach
The aim of the project is to select
and use an HPCN architecture that allows an improvement over the performance
currently achieved by the systems already available.
Moreover, because the current systems make use of dedicated hardware (ASICs),
another aim of the project is to obtain greater flexibility of the
system and a shorter time to market,
thanks to a fully programmable and scalable architecture. As
a working methodology it has been chosen to concentrate the work not on
the design of new HPCN hardware, but on the selection of a commercially
available HPCN architecture and on the porting of the software onto such
an architecture.
High performance computing has focused more and more on standard architectures,
languages and networks during the last few years, gradually shifting away
from completely custom machines and adopting first common off-the-shelf
(COTS) microprocessors, then COTS operating systems and now
even COTS interconnection networks. In line with this trend, the
basic technology we expect to use is a multi-processor environment in
an embedded system, where the processors are COTS, general purpose
CPUs. The proper choice in this area will be made during the project,
taking into account the need for commercial availability and scalability
of the architecture. What we expect to obtain is not only a demonstration
that an HPCN architecture allows the required performance to be reached
and guarantees greater flexibility
of the system, but also the ability to quickly transform the technology
demonstrator into one or more engineered products.
A promising strategy will be examined
with particular attention in the BBC project, owing to its potential advantages.
The main tenet of the approach is the integration, to the largest possible
extent, of PC-based technology and components within a HPCN architecture.
The motivations underlying this approach stem from the observation that PC
microprocessors (and PC technology in general) exhibit the fastest performance
growth and dramatic reductions in price/performance ratio.
These trends have already had a large impact on the HPCN market for scientific
computing, where architectures based on very specialised approaches (e.g.,
the Connection Machine) have suffered large market reductions or have even been
driven out of the market in favour of architectures based on low-cost CMOS
microprocessors (so-called "killer" microprocessors).
Integration of PC-derived technology
into a HPCN embedded system has the potential to bring similar price and
performance trends to the benefit of image processing for automatic identification.
Indeed, PC components can be considered as commodities, due to the large
market they are developed for, and hence enjoy unprecedented technological
evolution and low cost. The large market has also led to vendor-supported
standards that improve interoperability and allow open platforms. Similar
economies of scale and performance trends hold for the software tools and
programming environments. Often neglected cost factors of an embedded system
include run-time software components (real time kernel, libraries) and
development environments and tools (debuggers, compilers). Software components
and tools are available at much lower prices for PC-based platforms from
many different suppliers, and often exhibit better quality owing to the
stiff competition (with the possible exception of real-time OSes).
While full-featured real-time OSes
have existed for a long time for the x86 architecture, an intriguing
question is whether general-purpose software tools can be adapted to embedded,
real-time applications. If viable, this approach would extend the benefits
of the fast-moving PC technology to the software side as well. As a matter
of fact, standard, general-purpose OSes like Windows NT, Solaris
2.x and Linux, together with the ongoing POSIX 1003 standardisation
effort, exhibit a number of real-time oriented features that may enable
their integration within the HPCN architecture.
These features include:
User-level and/or kernel-supported
multi-threading libraries and lightweight processes, providing improved
context-switch times, more efficient inter-thread communication via shared
address spaces, and the potential for effective exploitation of hardware
parallelism (SMP -- symmetric multiprocessing);
Real-time scheduling classes and scheduling policies;
Low-latency preemptable kernel with
frequent pre-emption points;
Ability to lock memory pages in RAM for critical, time-constrained code
(a minimal sketch of these features is given after this list).
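As an illustration of how these OS features can be used, the following is a minimal C sketch, assuming a POSIX-like environment (Linux, Solaris or a POSIX subsystem): it locks the process pages in RAM and starts a worker thread under an explicit real-time scheduling policy. It is not project code; error handling is abbreviated, and real-time priorities and mlockall() normally require administrative privileges.

    /* Minimal sketch: POSIX real-time features available on general-purpose OSes. */
    #include <pthread.h>
    #include <sched.h>
    #include <sys/mman.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        /* time-critical image processing would run here */
        (void)arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param sp;

        /* Lock all current and future pages in RAM, so that no page fault can
           delay the time-critical processing. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* Create a thread with an explicit real-time scheduling class. */
        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);

        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        return 0;
    }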
PC-type technology allows three levels
of parallelism. None of them is peculiar to PC technology, but each of
them may bring specific price, performance, or scalability advantages in
an embedded system.
Instruction Level Parallelism (ILP). This type of parallelism is exhibited by
many current super-scalar, dynamic execution CPUs, both general purpose and
DSPs. The recent multimedia extensions (MMX) of the Pentium processor
instruction set allow limited SIMD processing that may be very effective for
image processing tasks (an illustrative loop of this kind is shown after this
list). Compiler quality is crucial to exploit this type of parallelism,
or a direct resort to assembly language is needed. With the fast-paced processor
evolution, timely compiler availability is also crucial to take full advantage
of the latest technological developments.
Shared memory, symmetric multiprocessing
(SMP). SMP PCs with 2 or 4 Pentium or Pentium II
nodes have been developed for heavier duty computing tasks such as departmental
servers, web servers, and database engines. This intermediate grain of
parallelism, though, is very promising for embedded systems as well, especially
since it can be exploited by the multi-threading OS features previously
mentioned. Extension to 8 nodes and integration of MMX nodes can
provide further opportunities for performance scaling.
Network of workstations
(NOW). Clusters of PC nodes, possibly SMP, offer coarser
grain parallelism and further extendibility of the architecture. Current
low-cost interconnection technology
(e.g. Fast Ethernet, FDDI) does not provide the high bandwidth
and low latency required by the proposed demonstrator. However, a number
of developments are in progress (e.g. Gigabit
Ethernet, ATM, wormhole-routed LANs), spurred by the multimedia and
video distribution market, and hence a further scalability dimension may
become available in the future.
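As a concrete illustration of the first of these levels (ILP), the kernels that benefit most are regular per-pixel loops with independent iterations, of the kind sketched below. The project coded such kernels directly in MMX assembly; this portable C version of a hypothetical saturating brightness increase is only meant to show the loop structure that superscalar execution and MMX-style SIMD instructions can exploit.

    /* Illustrative only (not project code): a regular per-pixel loop with
     * independent iterations, the kind of kernel that ILP and MMX-style SIMD
     * execution can accelerate. */
    #include <stddef.h>

    void brighten_saturating(unsigned char *pix, size_t n, unsigned char delta)
    {
        for (size_t i = 0; i < n; i++) {
            unsigned int v = pix[i] + delta;              /* independent per-pixel work */
            pix[i] = (v > 255) ? 255 : (unsigned char)v;  /* saturate at pure white */
        }
    }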
Works in progress
Currently, the Dipartimento
di Ingegneria dell'Informazione is using the following
Intel-based personal computer to develop parallel code for bidimensional
barcode processing, based on sequential programs provided by the other
members of the BBC consortium:
Processor:                         Intel Pentium II (Deschutes-MMX)
Number of Processors:              2
Level 1 Cache Memory (per CPU):    16 kB code + 16 kB data
Level 2 Cache Memory (per CPU):
Secondary Memory (Disk Storage):   9 GB (Ultra Wide SCSI II)
Such an architecture can effectively exploit the first two levels of
parallelism (i.e. ILP and SMP).
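Since the architecture is meant to be scalable, the number of worker threads need not be hard-wired to two. A small, hedged example of how code could size its thread pool at run time on Linux or Solaris is shown below; _SC_NPROCESSORS_ONLN is a common sysconf() extension on those systems, not part of strict POSIX, so a fallback is provided.

    /* Sketch: query the number of on-line CPUs to size the thread pool (Nproc). */
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        long nproc = 1;                          /* safe fallback: sequential run */
    #ifdef _SC_NPROCESSORS_ONLN
        nproc = sysconf(_SC_NPROCESSORS_ONLN);   /* Linux/Solaris extension */
        if (nproc < 1)
            nproc = 1;
    #endif
        printf("using %ld worker thread(s)\n", nproc);
        return 0;
    }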
In order to select a parallelization approach, some general knowledge of
the original software is needed. The code provided by the industrial partners
is written in C and can be roughly described as a pipeline taking grey-level
images as input and composed of the following steps:
Some generic analysis is carried out over the whole image to extract low
level contrast and gradient information. Classical techniques such as bidimensional
convolution filtering are used, and the image is divided into tiles. Then
initial guesses are made about what kind of code could be contained in every tile.
The outcome of the previous step is stored in various queues, containing
the set of guesses for each kind of code (these guesses are termed "seeds").
Then a code-specific algorithm is used to check whether every seed in a
queue indeed points to a code instance. If so, a refinement algorithm is
run to gather all available information about the code instance (position,
orientation, perspective deformations, etc.). Eventually, the code instance
is decoded and its content is returned as a character string.
Though both steps were coded in C, the first step's implementation exhibited
regular code patterns such as for loops and accumulated multiplications;
besides, simple bidimensional arrays containing original or filtered
image pixels were the only data structures used. This code lends itself
well to a pure data parallel approach, taking advantage both of ILP (multiple
pixels per assembly instruction) and of SMP (reduced or total lack of
synchronization, due to almost completely disjoint working sets between the
two processors). A sketch of such a regular, low-level kernel is shown below.
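The following is a minimal sketch, not project code, of the kind of regular low-level kernel just described: a 3x3 bidimensional convolution over a grey-level image stored as a plain array. The kernel values and the border handling (borders simply skipped) are arbitrary choices for illustration; the point is the regular loop structure and simple data layout that make data-parallel execution straightforward.

    /* Sketch: 3x3 bidimensional convolution over a grey-level image.
     * Border pixels are skipped for brevity; divisor must be non-zero. */
    void convolve3x3(const unsigned char *in, unsigned char *out,
                     int width, int height, const int kernel[3][3], int divisor)
    {
        for (int y = 1; y < height - 1; y++) {
            for (int x = 1; x < width - 1; x++) {
                int acc = 0;                      /* accumulated multiplications */
                for (int ky = -1; ky <= 1; ky++)
                    for (int kx = -1; kx <= 1; kx++)
                        acc += kernel[ky + 1][kx + 1] *
                               in[(y + ky) * width + (x + kx)];
                acc /= divisor;
                if (acc < 0)   acc = 0;           /* clamp to valid grey levels */
                if (acc > 255) acc = 255;
                out[y * width + x] = (unsigned char)acc;
            }
        }
    }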
On the other hand, nearly all the other pipeline steps strongly depend
on specific bidimensional code features and use advanced algorithms and
complex, irregular data structures, fully exploiting the data modeling
capabilities of the C language. A fine-grained parallelization of such code
is nearly unachievable without deep algorithm redesign, but this approach
would tightly entangle the parallelization effort and algorithm development.
It is a goal of the BBC project to keep these two facets apart as much as possible.
The differences between the two image processing sections are not an
accidental property of the current software version. Rather, they are an
essential feature of the problem, stemming from the fact that in low level
image processing the design space is explored by changing data (e.g. the
convolution matrix of a bidimensional filter), while the code structure and
image representation stay the same. When dealing with higher level tasks, instead
(such as code localization, region of interest growing and labelling, or
contour identification), many algorithms can be tried, each one requiring
specific code and data structures.
The above considerations resulted in the following parallelization approach.
With this approach, the only burden placed on algorithm designers and coders
is to preserve program safety within a concurrent environment; that is,
either the code must be reentrant or it must be synchronized with mutexes
or spinlocks (a minimal example is sketched below). Program parallelization
and algorithm design can thus proceed largely independently.
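As an illustration of the synchronization requirement, the sketch below shows a shared queue of "seeds" protected by a POSIX mutex, so that concurrent tasks can push and pop candidates safely. The queue layout and all names are hypothetical; the project's actual data structures are not reproduced here.

    /* Sketch (hypothetical data structure): a fixed-size queue of seeds shared
     * between threads, made safe with a pthread mutex. */
    #include <pthread.h>

    typedef struct { int x, y; int kind; } Seed;   /* candidate symbol position and type */

    typedef struct {
        Seed items[256];
        int count;
        pthread_mutex_t lock;
    } SeedQueue;

    void seedqueue_init(SeedQueue *q)
    {
        q->count = 0;
        pthread_mutex_init(&q->lock, NULL);
    }

    /* Returns 0 on success, -1 if the queue is full. */
    int seedqueue_push(SeedQueue *q, Seed s)
    {
        int rc = -1;
        pthread_mutex_lock(&q->lock);              /* only one thread touches count/items */
        if (q->count < 256) {
            q->items[q->count++] = s;
            rc = 0;
        }
        pthread_mutex_unlock(&q->lock);
        return rc;
    }

    /* Returns 0 and fills *out on success, -1 if the queue is empty. */
    int seedqueue_pop(SeedQueue *q, Seed *out)
    {
        int rc = -1;
        pthread_mutex_lock(&q->lock);
        if (q->count > 0) {
            *out = q->items[--q->count];
            rc = 0;
        }
        pthread_mutex_unlock(&q->lock);
        return rc;
    }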
The first, low level step has been dealt with using custom assembly language
with MMX instructions for the core loops. This requires heavy programming
effort and can only be worthwhile for basic image processing, because this
step is shared across all bidimensional code kinds and here ILP
parallelization should have the biggest payoff. The image has also been divided
into Nproc stripes, each of them
processed by the assembly routine on a different CPU, adding SMP
parallelism on top of ILP (a pthreads sketch of this stripe decomposition
is given below).
The second processing step uses only SMP parallelization; the existing
processing code has been divided into a small number of concurrent tasks
(e.g. different bidimensional codes will be searched for by different threads).
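The following is a minimal sketch, assuming POSIX threads, of the stripe decomposition used for the low-level step: the image rows are split into horizontal stripes and each stripe is processed by a separate thread. The routine process_stripe() is only a placeholder standing in for the actual MMX assembly kernel.

    /* Sketch (not project code): SMP parallelization of the low-level step by
     * dividing the image into horizontal stripes, one thread per stripe. */
    #include <pthread.h>

    #define NPROC 2                      /* number of CPUs / worker threads */

    typedef struct {
        unsigned char *pixels;           /* whole grey-level image */
        int width;
        int y_start, y_end;              /* rows [y_start, y_end) of this stripe */
    } StripeJob;

    /* Placeholder for the real low-level kernel (MMX assembly in the project). */
    static void process_stripe(unsigned char *pixels, int width, int y_start, int y_end)
    {
        for (int y = y_start; y < y_end; y++)
            for (int x = 0; x < width; x++)
                pixels[y * width + x] = (unsigned char)(255 - pixels[y * width + x]);
    }

    static void *stripe_worker(void *arg)
    {
        StripeJob *job = arg;
        process_stripe(job->pixels, job->width, job->y_start, job->y_end);
        return NULL;
    }

    /* Filter a width x height image using NPROC threads, one stripe each. */
    void filter_image_smp(unsigned char *pixels, int width, int height)
    {
        pthread_t tid[NPROC];
        StripeJob job[NPROC];
        int rows = (height + NPROC - 1) / NPROC;   /* stripe height, rounded up */

        for (int i = 0; i < NPROC; i++) {
            job[i].pixels  = pixels;
            job[i].width   = width;
            job[i].y_start = i * rows;
            job[i].y_end   = (i + 1) * rows < height ? (i + 1) * rows : height;
            pthread_create(&tid[i], NULL, stripe_worker, &job[i]);
        }
        for (int i = 0; i < NPROC; i++)
            pthread_join(tid[i], NULL);    /* stripes are disjoint: no locking needed */
    }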
Currently, the modifications of the software concern only the steps following
the loading of the image into memory; however, as described above, the strategy
adopted is different for each of these steps.
A sample of the most significant results of the parallelization effort
is reported in the following table and figure.
For low level image preprocessing, two techniques have been used:
The work is divided between the two microprocessors, simply by splitting
each image into two parts.
The MMX instruction set has been used.
|                            | Average execution time | Standard deviation (50 samples) |
| Single thread with MMX     |                        |                                 |
| Single thread without MMX  |                        |                                 |
| Two threads with MMX       |                        |                                 |
| Two threads without MMX    |                        |                                 |
The tests reported in the table above have been made on 50 images (660x494
pixels) using Linux as the operating system and GCC 2.8
as the compiler (similar results were obtained under Windows NT
and Solaris 2.x). The procedures exploiting the MMX instruction set
have been developed in assembly (paying much attention to Intel's
Application Notes). This was necessary because of the absence of
a compiler supporting MMX instructions, but it required a significant
programming effort.
The table above shows that the software based on MMX instructions
achieves worse performance than ordinary, compiler-optimized software
(even though this handcrafted implementation performed better than some
commercial libraries with analogous targets). We did not deem it worthwhile
to spend more time improving this performance, also considering the upcoming
specifications by Intel.
The results of the SMP parallelization are quite satisfactory: the
obtained speed-up is 1.8. The small difference from the ideal value (2)
is due to conflicts on the bus for memory accesses.
Another project goal was to study how different operating systems and
compilers affect this kind of image processing task. The following figure
plots software execution time versus image area for the different operating
systems available. Execution times refer to the first part of the software
(basic image analysis and region of interest location; neither refinement
nor decoding is considered).
It must be stressed that this kind of software issues a small number
of system calls, mainly related to threads and memory management. The threading
model has been designed trying to minimize concurrency overheads; these
are, in decreasing cost order:
Thread creation and deletion.
Synchronization among threads.
Context switch among threads.
Since this software is strictly CPU-bound, it does not interact much with
external subsystems (apart from memory). In this case an optimizing compiler
becomes more important than an efficient operating system. The small difference
between Linux and Solaris is due only to the operating system,
because the compiler is the same GCC 2.8 for both; the better
performance is probably due to better thread management. The poor
NT performance, instead, seems to be connected to the adopted development
environment (Power++, based on the Watcom compiler); some preliminary
tests using Cygnus Win32
seem to strengthen this hypothesis.
The performance results obtained in the BBC project confirm the
validity of a HPCN approach for fast processing of Bidimensional
Bar Codes.
At the University of Parma
we are now investigating alternative approaches to parallelization, while
also considering support for additional families of codes.
The software developed within the BBC project is also making
its way into real products at Datalogic.