Electronic Book
Author Barlas, Gerassimos, author.

Title Multicore and GPU programming : an integrated approach / Gerassimos Barlas.

Publication Info. Amsterdam : Morgan Kaufmann, 2015.

Copies

Location                            Call No.           OPAC Message   Status
Axe Elsevier ScienceDirect Ebook    Electronic Book    ---            Available
Description 1 online resource
Content Type text (rdacontent)
Media Type computer (rdamedia)
Carrier Type online resource (rdacarrier)
Note CIP data; item not viewed.
Summary Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today's computing platforms incorporating CPU and GPU hardware, and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
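As a purely illustrative sketch (not drawn from the book itself), the following C++ fragment shows the kind of sequential-to-parallel transition the summary describes, using a single OpenMP directive; the vector sizes and variable names are arbitrary and assumed for the example.

    // Sketch: a sequential dot product made multi-threaded with OpenMP.
    // Build with OpenMP enabled, e.g.  g++ -fopenmp dot.cpp -o dot
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000000;
        std::vector<double> a(N, 1.0), b(N, 2.0);

        double sum = 0.0;
        // The directive splits the loop iterations across threads;
        // reduction(+:sum) gives each thread a private partial sum that is
        // combined at the end, avoiding a race on the shared variable.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += a[i] * b[i];

        std::printf("dot product = %f\n", sum);   // expected: 2000000.000000
        return 0;
    }

Removing the pragma (or compiling without -fopenmp) leaves a valid sequential program, which is the transition path the summary refers to.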
Contents Front Cover -- Multicore and GPU Programming: An Integrated Approach -- Copyright -- Dedication -- Contents -- List of Tables -- Preface -- What Is in This Book -- Using This Book as a Textbook -- Software and Hardware Requirements -- Sample Code -- Chapter 1: Introduction -- 1.1 The era of multicore machines -- 1.2 A taxonomy of parallel machines -- 1.3 A glimpse of contemporary computing machines -- 1.3.1 The Cell BE processor -- 1.3.2 Nvidia's Kepler -- 1.3.3 AMD's APUs -- 1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi -- 1.4 Performance metrics -- 1.5 Predicting and measuring parallel program performance -- 1.5.1 Amdahl's law -- 1.5.2 Gustafson-Barsis's rebuttal -- Exercises -- Chapter 2: Multicore and parallel program design -- 2.1 Introduction -- 2.2 The PCAM methodology -- 2.3 Decomposition patterns -- 2.3.1 Task parallelism -- 2.3.2 Divide-and-conquer decomposition -- 2.3.3 Geometric decomposition -- 2.3.4 Recursive data decomposition -- 2.3.5 Pipeline decomposition -- 2.3.6 Event-based coordination decomposition -- 2.4 Program structure patterns -- 2.4.1 Single-program, multiple-data -- 2.4.2 Multiple-program, multiple-data -- 2.4.3 Master-worker -- 2.4.4 Map-reduce -- 2.4.5 Fork/join -- 2.4.6 Loop parallelism -- 2.5 Matching decomposition patterns with program structure patterns -- Exercises -- Chapter 3: Shared-memory programming: threads -- 3.1 Introduction -- 3.2 Threads -- 3.2.1 What is a thread? -- 3.2.2 What are threads good for? -- 3.2.3 Thread creation and initialization -- 3.2.3.1 Implicit thread creation -- 3.2.4 Sharing data between threads -- 3.3 Design concerns -- 3.4 Semaphores -- 3.5 Applying semaphores in classical problems -- 3.5.1 Producers-consumers -- 3.5.2 Dealing with termination -- 3.5.2.1 Termination using a shared data item -- 3.5.2.2 Termination using messages.
3.5.3 The barbershop problem: introducing fairness -- 3.5.4 Readers-writers -- 3.5.4.1 A solution favoring the readers -- 3.5.4.2 Giving priority to the writers -- 3.5.4.3 A fair solution -- 3.6 Monitors -- 3.6.1 Design approach 1: critical section inside the monitor -- 3.6.2 Design approach 2: monitor controls entry to critical section -- 3.7 Applying monitors in classical problems -- 3.7.1 Producers-consumers revisited -- 3.7.1.1 Producers-consumers: buffer manipulation within the monitor -- 3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor -- 3.7.2 Readers-writers -- 3.7.2.1 A solution favoring the readers -- 3.7.2.2 Giving priority to the writers -- 3.7.2.3 A fair solution -- 3.8 Dynamic vs. static thread management -- 3.8.1 Qt's thread pool -- 3.8.2 Creating and managing a pool of threads -- 3.9 Debugging multithreaded applications -- 3.10 Higher-level constructs: multithreaded programming without threads -- 3.10.1 Concurrent map -- 3.10.2 Map-reduce -- 3.10.3 Concurrent filter -- 3.10.4 Filter-reduce -- 3.10.5 A case study: multithreaded sorting -- 3.10.6 A case study: multithreaded image matching -- Exercises -- Chapter 4: Shared-memory programming: OpenMP -- 4.1 Introduction -- 4.2 Your First OpenMP Program -- 4.3 Variable Scope -- 4.3.1 OpenMP Integration V.0: Manual Partitioning -- 4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition -- 4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking -- 4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction -- 4.3.5 Final Words on Variable Scope -- 4.4 Loop-Level Parallelism -- 4.4.1 Data Dependencies -- 4.4.1.1 Flow Dependencies -- 4.4.1.2 Antidependencies -- 4.4.1.3 Output Dependencies -- 4.4.2 Nested Loops -- 4.4.3 Scheduling -- 4.5 Task Parallelism -- 4.5.1 The sections Directive -- 4.5.1.1 Producers-Consumers in OpenMP.
4.5.2 The task Directive -- 4.6 Synchronization Constructs -- 4.7 Correctness and Optimization Issues -- 4.7.1 Thread Safety -- 4.7.2 False Sharing -- 4.8 A Case Study: Sorting in OpenMP -- 4.8.1 Bottom-Up Mergesort in OpenMP -- 4.8.2 Top-Down Mergesort in OpenMP -- 4.8.3 Performance Comparison -- Exercises -- Chapter 5: Distributed memory programming -- 5.1 Communicating Processes -- 5.2 MPI -- 5.3 Core concepts -- 5.4 Your first MPI program -- 5.5 Program architecture -- 5.5.1 SPMD -- 5.5.2 MPMD -- 5.6 Point-to-Point communication -- 5.7 Alternative Point-to-Point communication modes -- 5.7.1 Buffered Communications -- 5.8 Non blocking communications -- 5.9 Point-to-Point Communications: Summary -- 5.10 Error reporting and handling -- 5.11 Collective communications -- 5.11.1 Scattering -- 5.11.2 Gathering -- 5.11.3 Reduction -- 5.11.4 All-to-All Gathering -- 5.11.5 All-to-All Scattering -- 5.11.6 All-to-All Reduction -- 5.11.7 Global Synchronization -- 5.12 Communicating objects -- 5.12.1 Derived Datatypes -- 5.12.2 Packing/Unpacking -- 5.13 Node management: communicators and groups -- 5.13.1 Creating Groups -- 5.13.2 Creating Intra-Communicators -- 5.14 One-sided communications -- 5.14.1 RMA Communication Functions -- 5.14.2 RMA Synchronization Functions -- 5.15 I/O considerations -- 5.16 Combining MPI processes with threads -- 5.17 Timing and Performance Measurements -- 5.18 Debugging and profiling MPI programs -- 5.19 The Boost.MPI library -- 5.19.1 Blocking and non blocking Communications -- 5.19.2 Data Serialization -- 5.19.3 Collective Operations -- 5.20 A case study: diffusion-limited aggregation -- 5.21 A case study: brute-force encryption cracking -- 5.21.1 Version #1: "plain-vanilla" MPI -- 5.21.2 Version #2: combining MPI and OpenMP -- 5.22 A Case Study: MPI Implementation of the Master-Worker Pattern.
5.22.1 A Simple Master-Worker Setup -- 5.22.2 A Multithreaded Master-Worker Setup -- Exercises -- Chapter 6: GPU programming -- 6.1 GPU Programming -- 6.2 CUDA's programming model: threads, blocks, and grids -- 6.3 CUDA's execution model: streaming multiprocessors and warps -- 6.4 CUDA compilation process -- 6.5 Putting together a CUDA project -- 6.6 Memory hierarchy -- 6.6.1 Local Memory/Registers -- 6.6.2 Shared Memory -- 6.6.3 Constant Memory -- 6.6.4 Texture and Surface Memory -- 6.7 Optimization techniques -- 6.7.1 Block and Grid Design -- 6.7.2 Kernel Structure -- 6.7.3 Shared Memory Access -- 6.7.4 Global Memory Access -- 6.7.5 Page-Locked and Zero-Copy Memory -- 6.7.6 Unified Memory -- 6.7.7 Asynchronous Execution and Streams -- 6.7.7.1 Stream Synchronization: Events and Callbacks -- 6.8 Dynamic parallelism -- 6.9 Debugging CUDA programs -- 6.10 Profiling CUDA programs -- 6.11 CUDA and MPI -- 6.12 Case studies -- 6.12.1 Fractal Set Calculation -- 6.12.1.1 Version #1: One thread per pixel -- 6.12.1.2 Version #2: Pinned host and pitched device memory -- 6.12.1.3 Version #3: Multiple pixels per thread -- 6.12.1.4 Evaluation -- 6.12.2 Block Cipher Encryption -- 6.12.2.1 Version #1: The case of a standalone GPU machine -- 6.12.2.2 Version #2: Overlapping GPU communication and computation -- 6.12.2.3 Version #3: Using a cluster of GPU machines -- 6.12.2.4 Evaluation -- Exercises -- Chapter 7: The Thrust template library -- 7.1 Introduction -- 7.2 First steps in Thrust -- 7.3 Working with Thrust datatypes -- 7.4 Thrust algorithms -- 7.4.1 Transformations -- 7.4.2 Sorting and searching -- 7.4.3 Reductions -- 7.4.4 Scans/prefix sums -- 7.4.5 Data management and manipulation -- 7.5 Fancy iterators -- 7.6 Switching device back ends -- 7.7 Case studies -- 7.7.1 Monte Carlo integration -- 7.7.2 DNA Sequence alignment -- Exercises.
Chapter 8: Load balancing -- 8.1 Introduction -- 8.2 Dynamic load balancing: the Linda legacy -- 8.3 Static Load Balancing: The Divisible Load Theory Approach -- 8.3.1 Modeling Costs -- 8.3.2 Communication Configuration -- 8.3.3 Analysis -- 8.3.3.1 N-Port, Block-Type, Single-Installment Solution -- 8.3.3.2 One-Port, Block-Type, Single-Installment Solution -- 8.3.4 Summary -- Short Literature Review -- 8.4 DLTlib: A library for partitioning workloads -- 8.5 Case studies -- 8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing -- 8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing -- Appendix A: Compiling Qt programs -- A.1 Using an IDE -- A.2 The qmake Utility -- Appendix B: Running MPI programs -- B.1 Preparatory Steps -- B.2 Computing Nodes Discovery for MPI Program Deployment -- B.2.1 Host Discovery with the nmap Utility -- B.2.2 Automatic Generation of a Hostfile -- Appendix C: Time measurement -- C.1 Introduction -- C.2 POSIX High-Resolution Timing -- C.3 Timing in Qt -- C.4 Timing in OpenMP -- C.5 Timing in MPI -- C.6 Timing in CUDA -- Appendix D: Boost.MPI -- D.1 Mapping from MPI C to Boost.MPI -- Appendix E: Setting up CUDA -- E.1 Installation -- E.2 Issues with GCC -- E.3 Running CUDA without an Nvidia GPU -- E.4 Running CUDA on Optimus-Equipped Laptops -- E.5 Combining CUDA with Third-Party Libraries -- Appendix F: DLTlib -- F.1 DLTlib Functions -- F.1.1 Class Network: Generic Methods -- F.1.2 Class Network: Query Processing -- F.1.3 Class Network: Image Processing -- F.1.4 Class Network: Image Registration -- F.2 DLTlib Files -- Glossary -- Bibliography -- Index.
Subject Multiprocessors -- Programming.
Parallel programming (Computer science)
Graphics processing units -- Programming.
Other Form: Print version: 9780124171374
ISBN 9780124171404 (electronic bk.)
0124171400 (electronic bk.)
9780124171374
0124171370
Standard No. AU@ 000062467447
CHBIS 010547549
CHNEW 001012570
CHVBK 341783854
DEBSZ 427902681
GBVCP 820850314
NZ1 15919186
UKMGB 016941139