In-Cache Query Co-Processing for Coupled CPU-GPU Architectures (Persian Translation)


23 October 2022
Word and PowerPoint file
Price: 69,700 Toman

Note: along with the Word file of this product, a PowerPoint file and its slides will be provided as a free bonus.

This article is a translation of an authoritative English reference paper, translated to a high standard by specialists in this field, and is delivered as a Microsoft Word file.

The text of the article is well written, rich in content, and easy to understand; we guarantee the quality of this translation.

The Word file is cleanly typed, fully editable and copyable, with carefully prepared formatting. Along with the Word file, you will also receive a PowerPoint file with an attractive template and a range of presentation settings.

Note: if the text below appears jumbled, this is because it was copied out of the file; the original file of "In-Cache Query Co-Processing on Coupled CPU-GPU Architectures" contains no such formatting problems.

Number of pages in this file: 25


Excerpt from the translation:

Excerpt from the English article. English title: In-Cache Query Co-Processing on Coupled CPU-GPU Architectures

Abstract

Recently, there have been some emerging processor designs in which the CPU and the GPU (Graphics Processing Unit) are integrated in a single chip and share the Last Level Cache (LLC). However, the main memory bandwidth of such coupled CPU-GPU architectures can be much lower than that of a discrete GPU. As a result, current GPU query co-processing paradigms can suffer severely from memory stalls. In this paper, we propose a novel in-cache query co-processing paradigm for main memory On-Line Analytical Processing (OLAP) databases on coupled CPU-GPU architectures. Specifically, we adapt CPU-assisted prefetching to minimize cache misses in GPU query co-processing and CPU-assisted decompression to improve query execution performance. Furthermore, we develop a cost-model-guided adaptation mechanism for distributing the workload of prefetching, decompression, and query execution between the CPU and the GPU. We implement a system prototype and evaluate it on two recent AMD APUs, the A8 and A10. The experimental results show that 1) in-cache query co-processing can effectively improve the performance of the state-of-the-art GPU co-processing paradigm by up to 30% and 33% on the A8 and A10, respectively, and 2) our workload distribution adaptation mechanism can significantly improve query performance by up to 36% and 40% on the A8 and A10, respectively.

1 Introduction

The query co-processing paradigm on GPUs has been an effective means of improving the performance of main memory databases for OLAP (e.g., [15, 17, 22, 28, 25, 13, 30, 29]). Currently, most systems are based on discrete CPU-GPU architectures, where the CPU and the GPU are connected via the relatively slow PCI-e bus. Recently, some emerging processor designs have integrated the CPU and the GPU in a single chip with a shared LLC. For example, the AMD Accelerated Processing Unit (APU) architecture integrates the CPU and GPU in a single chip, and Intel released its latest generation Ivy Bridge processor in late April 2012. On these emerging heterogeneous architectures, the low speed of PCI-e is no longer an issue. Coupled CPU-GPU architectures call for new data processing mechanisms. There have been studies on more collaborative and fine-grained schemes for query co-processing [19, 38] and other data processing workloads (e.g., key-value stores [21] and MapReduce [7]). Despite the effectiveness of previous studies on query co-processing on coupled architectures, both the CPU and the GPU execute homogeneous workloads in those studies [19, 7, 38, 21]. However, due to the unique architectural design of coupled CPU-GPU architectures, such homogeneous workload distribution schemes can hinder query co-processing performance on the GPU. On the one hand, the GPU in a coupled architecture is usually less powerful than the one in a discrete architecture. On the other hand, the GPU in a coupled architecture accesses main memory (usually DDR3), which has a much lower bandwidth than discrete GPU memory (usually GDDR5). These two factors lead to severe underutilization of the GPU in the coupled architecture because of memory stalls. The GPU's inherent Single Program Multiple Data (SPMD) execution model and the in-order nature of GPU cores make the GPU in the coupled architecture even more sensitive to memory stalls.
In this paper, we investigate how to reduce the memory stalls suffered by the GPU and further improve the performance of query co-processing on the coupled CPU-GPU architecture. On recent coupled CPU-GPU architectures, the computational capability of the GPU is still much higher than that of the CPU. For example, the GPU delivers 5 and 6 times higher Giga Floating Point Operations per Second (GFLOPS) than the CPU on the AMD APUs A8 and A10, respectively. The superb raw computational capability of the GPU leads to a very similar speedup if the input data is in cache. However, due to the above-mentioned impact of memory stalls on GPU co-processing, the speedup is as low as 2 when the data cannot fit into the cache (more detailed results can be found in Section 3.1). Thus, the natural question is whether, and how, we can ensure that the working set of query co-processing fits in the cache as much as possible, to fully unleash the GPU's power. In this paper, we propose a novel in-cache query co-processing paradigm for main memory databases on coupled CPU-GPU architectures. Specifically, we adapt CPU-assisted prefetching to minimize the cache misses from the GPU and CPU-assisted decompression schemes to improve query execution performance. Whether or not decompression is involved, our scheme ensures that the input data for GPU query co-processing has been prefetched. Thus, the GPU mostly executes on in-cache data, without suffering from memory stalls. Specifically, unlike the homogeneous workload distributions in previous query co-processing paradigms [19, 7], our workload distribution is heterogeneous: a CPU core can now perform memory prefetching, decompression, and even query processing, and the GPU can now perform decompression and query processing. We further develop a cost-model-guided adaptation mechanism for distributing the workload of prefetching, decompression, and query evaluation between the CPU and the GPU.
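To make the heterogeneous division of labor concrete, the following is a minimal, hypothetical sketch (not the authors' code, and using threads rather than OpenCL) of the pipeline the paragraph describes: one CPU-side worker stages cache-sized chunks by decompressing them, while a second worker, standing in for the GPU, evaluates a selection predicate only on chunks that are already staged. The chunk size, queue depth, and use of `zlib` are assumptions made for illustration.

```python
# Hypothetical sketch of CPU-assisted staging (prefetch + decompression)
# overlapped with query execution on a second device.
import threading
import queue
import zlib

CHUNK_ROWS = 1024  # assumed cache-sized chunk, in rows


def cpu_stager(compressed_chunks, staged):
    """CPU side: decompress each chunk so the consumer sees ready, in-cache data."""
    for blob in compressed_chunks:
        rows = list(zlib.decompress(blob))  # bytes -> list of ints (0..255)
        staged.put(rows)
    staged.put(None)  # end-of-stream marker


def query_worker(staged, predicate):
    """'GPU' side: evaluate a selection predicate over already-staged chunks."""
    matches = 0
    while True:
        rows = staged.get()
        if rows is None:
            break
        matches += sum(1 for v in rows if predicate(v))
    return matches


def run_pipeline(column, predicate):
    """Compress the column in chunks, then overlap staging with scanning."""
    blobs = [zlib.compress(bytes(column[i:i + CHUNK_ROWS]))
             for i in range(0, len(column), CHUNK_ROWS)]
    staged = queue.Queue(maxsize=2)  # double buffering: stage one chunk ahead
    t = threading.Thread(target=cpu_stager, args=(blobs, staged))
    t.start()
    result = query_worker(staged, predicate)
    t.join()
    return result
```

The bounded queue of depth 2 is what keeps the working set small: the staging worker can run at most one chunk ahead of the scan, mirroring the idea that prefetched data should still be resident in the LLC when the query kernel reaches it.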
Fine-grained resource allocation is achieved by device fission, which divides the CPU or the GPU into smaller scheduling units (either by the OpenCL runtime or by our software-based approaches). We implement a system prototype and evaluate it on two recent AMD APUs, the A8 and A10. The experimental results show that 1) in-cache query co-processing is able to effectively improve the performance of GPU query co-processing by up to 30% and 33% on the A8 and A10, respectively, and 2) our cost model can effectively predict a suitable workload distribution, and our distribution adaptation mechanisms significantly improve query performance by 36-40%. The remainder of this paper is organized as follows. In Section 2, we introduce the background and preliminaries on coupled architectures and OpenCL. In Section 3, we elaborate on the design and implementation of in-cache query co-processing, followed by the cost model in Section 4. We present the experimental results in Section 5. We review related work in Section 6 and conclude in Section 7.
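The cost-model-guided adaptation can be illustrated with a deliberately simplified sketch (the paper's actual model, in Section 4, is more detailed): pick the fraction of chunks sent to the GPU that minimizes the completion time of the slower device, accounting for the fact that the CPU both scans its own share and stages the GPU's share. The throughput parameters below are hypothetical placeholders, not measured values from the paper.

```python
# Hypothetical cost-model sketch: grid search for the workload split that
# balances CPU time (own scan + staging the GPU's data) against GPU time.
def best_split(n_chunks, cpu_scan_tput, gpu_scan_tput, cpu_stage_tput, steps=1000):
    """Return (r, makespan): r is the fraction of chunks processed on the GPU.

    Throughputs are in chunks per unit time. The CPU spends time both
    scanning its (1 - r) share and staging the GPU's r share; the GPU
    only scans. The makespan is the slower of the two.
    """
    best = (0.0, float("inf"))
    for i in range(steps + 1):
        r = i / steps
        cpu_time = (1 - r) * n_chunks / cpu_scan_tput + r * n_chunks / cpu_stage_tput
        gpu_time = r * n_chunks / gpu_scan_tput
        makespan = max(cpu_time, gpu_time)
        if makespan < best[1]:
            best = (r, makespan)
    return best
```

For example, with a GPU that scans 5x faster than the CPU and cheap staging, nearly all chunks go to the GPU; as staging becomes the bottleneck, the optimizer shifts work back to the CPU. This is the intuition behind adapting the distribution per workload rather than fixing a homogeneous split.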


  Purchase guide:
  • The download link will be displayed immediately after payment.
  • The download link will also be sent to your email, so enter your email address carefully.
  • The email may be delivered to your spam or Bulk folder.
  • If for any reason you are unable to download the file, please contact us.