diff --git a/images/CERNVolumes.png b/images/CERNVolumes.png
index 67b38f7..6537f2a 100644
Binary files a/images/CERNVolumes.png and b/images/CERNVolumes.png differ
diff --git a/images/google-dc-map.png b/images/google-dc-map.png
new file mode 100644
index 0000000..9f18813
Binary files /dev/null and b/images/google-dc-map.png differ
diff --git a/images/jean-zay-hpc.png b/images/jean-zay-hpc.png
new file mode 100644
index 0000000..39cbcc3
Binary files /dev/null and b/images/jean-zay-hpc.png differ
diff --git a/src/00_SDD_DE_Course_Introduction.md b/src/00_SDD_DE_Course_Introduction.md
index 760f445..e60d7cb 100644
--- a/src/00_SDD_DE_Course_Introduction.md
+++ b/src/00_SDD_DE_Course_Introduction.md
@@ -14,14 +14,15 @@ Harnessing the complexity of large amounts of data is a challenge in itself.
 But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety,
 the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure),
-which will be explored in this module. We'll also take a dive in new programming and infrastructure technologies
-that emerged from these concepts.
+which will be explored in this module.
+
+We'll also take a dive in new programming and infrastructure technologies that emerged from these concepts.
 ## Objectives
 By the end of this module, participants will be able to:
-- Understand the differences and usage between main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
+- Understand the differences and usages of main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
 - Implement the distribution of simple operations via the Map/Reduce principle in PySpark and Dask
 - Understand the principle of Kubernetes
 - Deploy a Big Data Processing Platform on the Cloud
@@ -77,6 +78,7 @@ What is this course module main subject?
 ## Big Data & Distributed Computing (3h)
+- [Current introduction (30min)](00_SDD_DE_Course_Introduction.html)
 - [Introduction to Big Data and its ecosystem (1h)](01_Introduction_Big_Data.html)
 - What is Big Data?
 - Legacy “Big Data” ecosystem
@@ -92,14 +94,15 @@ What is this course module main subject?
 ## Deployment & Intro to Kubernetes (3h)
-- MLOps: deploying your model as a Web App
+MLOps: deploying your model as a Web App
+
 - [Introduction to Orchestration](https://supaerodatascience.github.io/DE/slides/2_2b_orchestration.html)
 - [Introduction to Kubernetes](12_OrchestrationKubernetes.html)
 ## Kubernetes hands on (3h)
 - Zero to Jupyterhub: deploy a Jupyterhub on Kubernetes
-- Deploy a Daskhub: a Dask enables Jupyterhub (for later use)
+- Deploy a Daskhub: a Dask enabled Jupyterhub (for later use)
 [Slides](13_Dask_On_Cloud.html)
@@ -113,11 +116,11 @@ What is this course module main subject?
 - Machine and Deep Learning (Sickit Learn, TensorFlow, Pytorch)
 - Jupyter notebooks, Binder, Google Colab
 - [Spark Introduction (30m)](03_Spark_Introduction.html)
-- Play with MapReduce through Spark (Notebook on small datasets) (1.5h)
+- Play with MapReduce using Spark (Notebook on small datasets) (1.5h)
 ## Distributed Processing and Dask hands on (3h)
-- [Manage large datasets(30m)](24_Large_Datasets.html)
+- [Manage large datasets (30m)](24_Large_Datasets.html)
 - [Dask Introduction (30m)](22_Dask_Pangeo.html)
 - Includes [Dask tutorial(2h)](https://github.com/dask/dask-tutorial).
@@ -127,7 +130,7 @@ What is this course module main subject?
 - Subject presentation
 - Everyone should have a Daskhub cloud platform setup or Dask on local computer
 - Get the data
-- Notebook with cell codes to fill or answers to give
+- Notebook with codes cell to fill and answers to give
 - Clean big amounts of data using Dask in the cloud or on a big computer
 - Train machine learning models in parallel (hyper parameter search)
 - Complete with yor own efforts!
@@ -145,4 +148,3 @@ What will we do today?
 ![Answer](https://cdn.strawpoll.com/images/polls/qr/xVg71DedQyr.png)
 [Answer link](https://strawpoll.com/xVg71DedQyr)
-
diff --git a/src/01_Introduction_Big_Data.md b/src/01_Introduction_Big_Data.md
index 2a27c80..5c09698 100644
--- a/src/01_Introduction_Big_Data.md
+++ b/src/01_Introduction_Big_Data.md
@@ -17,7 +17,8 @@ date: 2026
 ## Some figures
-![Volume of data produced in a day in 2019 (source www.visualcapitalist.com)](images/a-day-in-data.jpg){width="50%"}
+![](images/a-day-in-data.jpg){width="50%"}
+![](https://www.digitalsilk.com/wp-content/uploads/2024/12/how-much-data-is-generated-per-day-hero-image.jpg)
 ## Some figures in sciences
@@ -78,7 +79,7 @@ Not a technology.
 ## Quizz
-What is the estimated size of the global data sphere?
+What is the estimated size of the global data sphere in 2025?
 - Answer A: 175 Petabytes
 - Answer B: 175 Exabytes
@@ -253,7 +254,7 @@ Data production or scientific exploration:
 ## Quizz
-What is the typical volumes of scientific Datasets (multiple choices)?
+What are the typical volumes of scientific Datasets (multiple choices)?
 - Answer A: MBs
 - Answer B: GBs
diff --git a/src/02_Big_Data_Platforms.md b/src/02_Big_Data_Platforms.md
index f7b6e16..6d18715 100644
--- a/src/02_Big_Data_Platforms.md
+++ b/src/02_Big_Data_Platforms.md
@@ -532,23 +532,22 @@ python /data/training/SLURM/plot_template.py
 :::
 ::: {.column width="50%"}
-![Jean-Zay supercomputer](http://www.idris.fr/media/images/jean-zay-annonce-01.jpg?id=web%3Aeng%3Ajean-zay%3Acpu%3Ajean-zay-cpu-hw-eng)
+![Jean-Zay supercomputer](images/jean-zay-hpc.png)
 :::
 ::::::::::::::
 ## TOP500
-| Rank | System | Cores | Rmax (TFlop/s) | Rpeak (PFlop/s) | Power (kW) |
+| Rank | System | Cores | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (kW) |
 |------| -------|-------|----------------|-----------------|------------|
-| 1 | Frontier - United States | 8,699,904 | 1,194.00 | 1,679.82 | 22,703 |
-| 2 | Aurora - United States | 4,742,808 | 585.34 | 1,059.33 | 24,687 |
-| 4 | Supercomputer Fugaku - Japan | 7,630,848 | 442.01 | 537.21 | 29,899 |
-| 5 | LUMI - Finland | 2,752,704 | 2379.70 | 531.51 | 7,107 |
-| 17 | Adastra - France | 319,072 | 46.10 | 61.61 | 921 |
-| 167 | Jean Zay - France | 93,960 | 4.48 | 7.35 | |
+| 1 | El Capitan - United States | 11,340,000 | 1,809.00 | 2,821.10 | 29,685 |
+| 4 | JUPITER Booster - Germany | 4,801,344 | 1,000.00 | 1,226.28 | 15,794 |
+| 7 | Supercomputer Fugaku - Japan | 7,630,848 | 442.01 | 537.21 | 29,899 |
+| 26 | CEA-HE - France | 548,352 | 90.79 | 171.26 | 1,770 |
+| 290 | Jean Zay - France | 93,960 | 4.48 | 7.35 | |
-[Top 500 (november 2023)](https://top500.org/lists/top500/2023/11/)
+[Top 500 (november 2025)](https://top500.org/lists/top500/2025/11/)
 ## Big Data and Hadoop
@@ -628,7 +627,7 @@ Hence the cloud computing model...
 ### GPGPU
 - Specific hardware (expensive)
-- Really efficient for Deep Learning algorithms
+- Really efficient for Deep Learning algorithms (learning and inference)
 - Image processing, Language processing
 ## Quizz
diff --git a/src/10_Cloud_Computing.md b/src/10_Cloud_Computing.md
index efc5b9a..1b53838 100644
--- a/src/10_Cloud_Computing.md
+++ b/src/10_Cloud_Computing.md
@@ -35,8 +35,7 @@ I took most of the content from theirs:
 :::
 ::: {.column width="70%"}
-![](https://www.datacenterknowledge.com/sites/datacenterknowledge.com/files/wp-content/uploads/2013/06/lulea-rows.jpg){width="35%"}
-![](https://www.datacenterknowledge.com/sites/datacenterknowledge.com/files/wp-content/uploads/2013/06/fb-lulea-external-fans.jpg){width="35%"}
+![](https://www.akita.co.uk/wp-content/uploads/2023/09/cloud-storage-facilities-1.jpg)
 (Facebook's data center & server racks)
@@ -45,7 +44,7 @@ I took most of the content from theirs:
 ## Google Cloud Data Center locations
-![Data Centers](https://cloud.google.com/images/locations/regions.png)
+![Data Centers](images/google-dc-map.png)
 ## Cloud Definition
@@ -226,13 +225,13 @@ What means IaaS?
 ## Public (European)
 ![](https://www.comptoir-hardware.com/images/stories/_logos/ovhcloud.png){width=20%}
-![](https://cloud.orange.com/ui/app/static/assets/brand/logo_header_login.png){width=20%}
+![](https://www.orange-business.com/sites/default/files/illustration-obs---cloud---infrastructures.png){width=20%}
 ![](images/open_telekom_cloud.png){width=20%}
 Academic, public founded:
 ![gaiax](https://gaia-x.eu/wp-content/uploads/2022/12/Gaia-X_Logo_Inverted_White_Transparent_210401-3-1000x687.png){width=20%}
-![EOSC](https://eosc-portal.eu/sites/all/themes/theme1/logo.png){width=20%}
+![EOSC](https://eosc.eu/wp-content/uploads/2023/08/EOSCA_logo.svg){width=20%}
 ## Private/on premise
diff --git a/src/14_ObjectStorage.md b/src/14_ObjectStorage.md
index 9f73077..4be8ecb 100644
--- a/src/14_ObjectStorage.md
+++ b/src/14_ObjectStorage.md
@@ -146,7 +146,7 @@ What is Cloud Optimized?
 :::
 ::: {.column width="50%"}
-![](https://staging.dev.element84.com/wp-content/uploads/2019/04/smiley_tiled.png)
+![](https://guide.cloudnativegeo.org/images/cog-diagram-2.png)
 :::
 ::::::::::::::