\begin{abstract}
    % Start with a two-sentence (at most) description of the big-picture
    % problem and why we care, and a sentence at the end that emphasizes how
    % your work is part of the solution.

    Heterogeneous systems that combine central processing units (CPUs) with
    computational accelerators such as GPUs, FPGAs, or the upcoming Intel MIC
    are becoming mainstream.  In these systems, peak performance includes the
    performance of not just the CPUs but also all available accelerators.  In
    spite of this fact, the majority of programming models for heterogeneous
    computing focus on only one of these device types.  With the development
    of Accelerated OpenMP for GPUs by both PGI and Cray, we have a clear path
    to extend traditional OpenMP applications incrementally to use GPUs.
    These extensions, however, are geared toward switching from CPU
    parallelism to GPU parallelism; they do not preserve the former while
    adding the latter.  Thus, computational potential is wasted, since either
    the CPU cores or the GPU cores are left idle.  Our goal is to create a
    runtime system that can intelligently and automatically divide an
    accelerated OpenMP region across all available resources.  This paper
    presents our proof-of-concept runtime system for dynamic task scheduling
    across CPUs and GPUs.  Further, we motivate the addition of this system to
    the proposed \emph{OpenMP for Accelerators} standard.  Finally, we show
    that this approach can produce as much as a two-fold performance
    improvement over using either the CPU or GPU alone.

\end{abstract}
