Merged
Binary file modified images/CERNVolumes.png
Binary file added images/google-dc-map.png
Binary file added images/jean-zay-hpc.png
20 changes: 11 additions & 9 deletions src/00_SDD_DE_Course_Introduction.md
@@ -14,14 +14,15 @@ Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety,
the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure),
-which will be explored in this module. We'll also take a dive in new programming and infrastructure technologies
-that emerged from these concepts.
+which will be explored in this module.
+
+We'll also take a dive in new programming and infrastructure technologies that emerged from these concepts.

## Objectives

By the end of this module, participants will be able to:

-- Understand the differences and usage between main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
+- Understand the differences and usages of main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
- Implement the distribution of simple operations via the Map/Reduce principle in PySpark and Dask
- Understand the principle of Kubernetes
- Deploy a Big Data Processing Platform on the Cloud
@@ -77,6 +78,7 @@ What is this course module main subject?

## Big Data & Distributed Computing (3h)

+- [Current introduction (30min)](00_SDD_DE_Course_Introduction.html)
- [Introduction to Big Data and its ecosystem (1h)](01_Introduction_Big_Data.html)
- What is Big Data?
- Legacy “Big Data” ecosystem
@@ -92,14 +94,15 @@ What is this course module main subject?

## Deployment & Intro to Kubernetes (3h)

-- MLOps: deploying your model as a Web App
+MLOps: deploying your model as a Web App
+
- [Introduction to Orchestration](https://supaerodatascience.github.io/DE/slides/2_2b_orchestration.html)
- [Introduction to Kubernetes](12_OrchestrationKubernetes.html)

## Kubernetes hands on (3h)

- Zero to Jupyterhub: deploy a Jupyterhub on Kubernetes
-- Deploy a Daskhub: a Dask enables Jupyterhub (for later use)
+- Deploy a Daskhub: a Dask enabled Jupyterhub (for later use)

[Slides](13_Dask_On_Cloud.html)

@@ -113,11 +116,11 @@ What is this course module main subject?
- Machine and Deep Learning (Sickit Learn, TensorFlow, Pytorch)
- Jupyter notebooks, Binder, Google Colab
- [Spark Introduction (30m)](03_Spark_Introduction.html)
-- Play with MapReduce through Spark (Notebook on small datasets) (1.5h)
+- Play with MapReduce using Spark (Notebook on small datasets) (1.5h)

## Distributed Processing and Dask hands on (3h)

-- [Manage large datasets(30m)](24_Large_Datasets.html)
+- [Manage large datasets (30m)](24_Large_Datasets.html)
- [Dask Introduction (30m)](22_Dask_Pangeo.html)
- Includes [Dask tutorial(2h)](https://github.com/dask/dask-tutorial).

@@ -127,7 +130,7 @@ What is this course module main subject?
- Subject presentation
- Everyone should have a Daskhub cloud platform setup or Dask on local computer
- Get the data
-- Notebook with cell codes to fill or answers to give
+- Notebook with codes cell to fill and answers to give
- Clean big amounts of data using Dask in the cloud or on a big computer
- Train machine learning models in parallel (hyper parameter search)
- Complete with yor own efforts!
@@ -145,4 +148,3 @@ What will we do today?
![Answer](https://cdn.strawpoll.com/images/polls/qr/xVg71DedQyr.png)

[Answer link](https://strawpoll.com/xVg71DedQyr)
-
7 changes: 4 additions & 3 deletions src/01_Introduction_Big_Data.md
@@ -17,7 +17,8 @@ date: 2026

## Some figures

-![Volume of data produced in a day in 2019 (source www.visualcapitalist.com)](images/a-day-in-data.jpg){width="50%"}
+![](images/a-day-in-data.jpg){width="50%"}
+![](https://www.digitalsilk.com/wp-content/uploads/2024/12/how-much-data-is-generated-per-day-hero-image.jpg)

## Some figures in sciences

@@ -78,7 +79,7 @@ Not a technology.

## Quizz

-What is the estimated size of the global data sphere?
+What is the estimated size of the global data sphere in 2025?

- Answer A: 175 Petabytes
- Answer B: 175 Exabytes
@@ -253,7 +254,7 @@ Data production or scientific exploration:

## Quizz

-What is the typical volumes of scientific Datasets (multiple choices)?
+What are the typical volumes of scientific Datasets (multiple choices)?

- Answer A: MBs
- Answer B: GBs
19 changes: 9 additions & 10 deletions src/02_Big_Data_Platforms.md
@@ -532,23 +532,22 @@ python /data/training/SLURM/plot_template.py
:::
::: {.column width="50%"}

-![Jean-Zay supercomputer](http://www.idris.fr/media/images/jean-zay-annonce-01.jpg?id=web%3Aeng%3Ajean-zay%3Acpu%3Ajean-zay-cpu-hw-eng)
+![Jean-Zay supercomputer](images/jean-zay-hpc.png)

:::
::::::::::::::

## TOP500

-| Rank | System | Cores | Rmax (TFlop/s) | Rpeak (PFlop/s) | Power (kW) |
+| Rank | System | Cores | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (kW) |
|------| -------|-------|----------------|-----------------|------------|
-| 1 | Frontier - United States | 8,699,904 | 1,194.00 | 1,679.82 | 22,703 |
-| 2 | Aurora - United States | 4,742,808 | 585.34 | 1,059.33 | 24,687 |
-| 4 | Supercomputer Fugaku - Japan | 7,630,848 | 442.01 | 537.21 | 29,899 |
-| 5 | LUMI - Finland | 2,752,704 | 2379.70 | 531.51 | 7,107 |
-| 17 | Adastra - France | 319,072 | 46.10 | 61.61 | 921 |
-| 167 | Jean Zay - France | 93,960 | 4.48 | 7.35 | |
+| 1 | El Capitan - United States | 11,340,000 | 1,809.00 | 2,821.10 | 29,685 |
+| 4 | JUPITER Booster - Germany | 4,801,344 | 1,000.00 | 1,226.28 | 15,794 |
+| 7 | Supercomputer Fugaku - Japan | 7,630,848 | 442.01 | 537.21 | 29,899 |
+| 26 | CEA-HE - France | 548,352 | 90.79 | 171.26 | 1,770 |
+| 290 | Jean Zay - France | 93,960 | 4.48 | 7.35 | |

-[Top 500 (november 2023)](https://top500.org/lists/top500/2023/11/)
+[Top 500 (november 2025)](https://top500.org/lists/top500/2025/11/)

## Big Data and Hadoop

@@ -628,7 +627,7 @@ Hence the cloud computing model...
### GPGPU

- Specific hardware (expensive)
-- Really efficient for Deep Learning algorithms
+- Really efficient for Deep Learning algorithms (learning and inference)
- Image processing, Language processing

## Quizz
9 changes: 4 additions & 5 deletions src/10_Cloud_Computing.md
@@ -35,8 +35,7 @@ I took most of the content from theirs:
:::
::: {.column width="70%"}

-![](https://www.datacenterknowledge.com/sites/datacenterknowledge.com/files/wp-content/uploads/2013/06/lulea-rows.jpg){width="35%"}
-![](https://www.datacenterknowledge.com/sites/datacenterknowledge.com/files/wp-content/uploads/2013/06/fb-lulea-external-fans.jpg){width="35%"}
+![](https://www.akita.co.uk/wp-content/uploads/2023/09/cloud-storage-facilities-1.jpg)

(Facebook's data center & server racks)

@@ -45,7 +44,7 @@ I took most of the content from theirs:

## Google Cloud Data Center locations

-![Data Centers](https://cloud.google.com/images/locations/regions.png)
+![Data Centers](images/google-dc-map.png)

## Cloud Definition

@@ -226,13 +225,13 @@ What means IaaS?
## Public (European)

![](https://www.comptoir-hardware.com/images/stories/_logos/ovhcloud.png){width=20%}
-![](https://cloud.orange.com/ui/app/static/assets/brand/logo_header_login.png){width=20%}
+![](https://www.orange-business.com/sites/default/files/illustration-obs---cloud---infrastructures.png){width=20%}
![](images/open_telekom_cloud.png){width=20%}

Academic, public founded:

![gaiax](https://gaia-x.eu/wp-content/uploads/2022/12/Gaia-X_Logo_Inverted_White_Transparent_210401-3-1000x687.png){width=20%}
-![EOSC](https://eosc-portal.eu/sites/all/themes/theme1/logo.png){width=20%}
+![EOSC](https://eosc.eu/wp-content/uploads/2023/08/EOSCA_logo.svg){width=20%}

## Private/on premise

2 changes: 1 addition & 1 deletion src/14_ObjectStorage.md
@@ -146,7 +146,7 @@ What is Cloud Optimized?
:::
::: {.column width="50%"}

-![](https://staging.dev.element84.com/wp-content/uploads/2019/04/smiley_tiled.png)
+![](https://guide.cloudnativegeo.org/images/cog-diagram-2.png)

:::
::::::::::::::