Commit 4e654c7

edit round of DataFrames
1 parent 746f18c commit 4e654c7


lectures/polars.md

Lines changed: 77 additions & 106 deletions
@@ -66,11 +66,10 @@ as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit
 This lecture will provide a basic introduction to polars.
 
 ```{tip}
-**Why use Polars over pandas?** The main reason is `performance`. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
+**Why use Polars over pandas?** One reason is **performance**. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
 ```
 
-Throughout the lecture, we will assume that the following imports have taken
-place
+Throughout the lecture, we will assume that the following imports have taken place
 
 ```{code-cell} ipython3
 import polars as pl
@@ -101,11 +100,7 @@ s
 ```
 
 ```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series).
-
-This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks.
-
-Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
 ```
 
 Polars `Series` are built on top of Apache Arrow arrays and support many similar
@@ -127,9 +122,9 @@ For example they have some additional (statistically oriented) methods
 s.describe()
 ```
 
-However the Polars `series` cannot be used in the same as as a Pandas `series` when pairing data with indices.
+However, the `pl.Series` object cannot be used in the same way as a `pd.Series` when pairing data with indices.
 
-For example, using a Pandas `series` you can do the following:
+For example, using a `pd.Series` you can do the following:
 
 ```{code-cell} ipython3
 s = pd.Series(np.random.randn(4), name='daily returns')
@@ -139,42 +134,42 @@ s
 
 However, in Polars you will need to use the `DataFrame` object to do the same task.
 
-This means you will use the `DataFrame` object more commonly when using polars if you
-are interested in relationships between data.
+This means you will use the `DataFrame` object more often when using polars if you
+are interested in relationships between data.
 
-Essentially any column in a Polars `DataFrame` can be used as an indices through the `filter` method.
+Let's create a `pl.DataFrame` containing the equivalent data in the `pd.Series`.
 
 ```{code-cell} ipython3
-df_temp = pl.DataFrame({
+df = pl.DataFrame({
     'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
     'daily returns': s.to_list()
 })
-df_temp
+df
 ```
 
 To access specific values by company name, we can filter the DataFrame filtering on
-the `AMZN` ticker code and selecting the `daily returns`.
+the `AMZN` ticker code and selecting the `daily returns` column.
 
 ```{code-cell} ipython3
-df_temp.filter(pl.col('company') == 'AMZN').select('daily returns').item()
+df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
 ```
 
 If we want to update `AMZN` return to 0, you can use the following chain of methods.
 
 ```{code-cell} ipython3
-df_temp = df_temp.with_columns(
-    pl.when(pl.col('company') == 'AMZN')
-    .then(0)
-    .otherwise(pl.col('daily returns'))
-    .alias('daily returns')
+df = df.with_columns(                      # with_columns is similar to select but adds columns to the same DataFrame
+    pl.when(pl.col('company') == 'AMZN')   # select rows where company is AMZN
+    .then(0)                               # set the value to 0
+    .otherwise(pl.col('daily returns'))    # otherwise keep the value in the daily returns column
+    .alias('daily returns')                # assign back to the daily returns column
 )
-df_temp
+df
 ```
 
-You could also check if `AAPL` is in a column.
+You can check if a ticker code is in the company list.
 
 ```{code-cell} ipython3
-'AAPL' in df_temp.get_column('company')
+'AAPL' in df['company']
 ```
 
 ## DataFrames
@@ -188,7 +183,8 @@ In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel s
 
 Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
 
-Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
+Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`,
+which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
 
 The dataset contains the following indicators
 
@@ -204,19 +200,21 @@ The dataset contains the following indicators
 We'll read this in from a URL using the `polars` function `read_csv`.
 
 ```{code-cell} ipython3
-df = pl.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv')
+URL = 'https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv'
+df = pl.read_csv(URL)
 type(df)
 ```
 
-Here's the content of `test_pwt.csv`
+Here is the content of `test_pwt.csv`
 
 ```{code-cell} ipython3
 df
 ```
 
 ### Select Data by Position
 
-In practice, one thing that we do all the time is to find, select and work with a subset of the data of our interests.
+In practice, one thing that we do all the time is to find, select and work with a
+subset of the data that interests us.
 
 We can select particular rows using array slicing notation
 
@@ -254,14 +252,17 @@ The most straightforward way is with the `filter` method.
 df.filter(pl.col('POP') >= 20000)
 ```
 
-To understand what is going on here, notice that `pl.col('POP') >= 20000` creates a boolean expression.
+In this case, `df.filter()` takes a boolean expression and only returns rows with `True` values.
+
+We can see this boolean mask by saving the results of the comparison in the following table.
 
 ```{code-cell} ipython3
-df.select(pl.col('POP') >= 20000)
+df.select(
+    pl.col('country'),                                # include country for reference
+    (pl.col('POP') >= 20000).alias('meets_criteria')  # the result of the comparison expression
+)
 ```
 
-In this case, `df.filter()` takes a boolean expression and only returns rows with the `True` values.
-
 Take one more example,
 
 ```{code-cell} ipython3
@@ -277,7 +278,8 @@ We can also allow arithmetic operations between different columns.
 df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000))
 ```
 
-For example, we can use the conditioning to select the country with the largest household consumption - gdp share `cc`.
+For example, we can use conditioning to select the country with the largest
+household consumption to GDP share `cc`.
 
 ```{code-cell} ipython3
 df.filter(pl.col('cc') == pl.col('cc').max())
@@ -291,13 +293,13 @@ df.filter((pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)).select
 
 **Application: Subsetting Dataframe**
 
-Real-world datasets can be [enormous](https://developers.google.com/machine-learning/crash-course/overfitting).
+Real-world datasets can be very large.
 
 It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.
 
 Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
 
-One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+One way to strip the data frame `df` down to only these variables is to overwrite the `DataFrame` using the selection method described above
 
 ```{code-cell} ipython3
 df_subset = df.select(['country', 'POP', 'tcgdp'])
@@ -329,19 +331,16 @@ df.select([
 For more complex operations, we can use `map_elements` (similar to pandas' apply):
 
 ```{code-cell} ipython3
-# A trivial example using map_elements
-df.with_row_index().select([
-    pl.col('index'),
+df.select([
     pl.col('country'),
-    pl.col('POP').map_elements(lambda x: x * 2, return_dtype=pl.Float64).alias('POP_doubled')
+    pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
 ])
 ```
 
-However as you can see from the Warning issued by Polars there is often a better way to achieve this using the Polars API.
+However, as you can see from the warning issued by Polars, there is often a better way to achieve this using the Polars API.
 
 ```{code-cell} ipython3
-df.with_row_index().select([
-    pl.col('index'),
+df.select([
     pl.col('country'),
     (pl.col('POP') * 2).alias('POP_doubled')
 ])
@@ -351,9 +350,9 @@ We can use complex filtering conditions with boolean logic:
 
 ```{code-cell} ipython3
 complex_condition = (
-    pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
-    .then(pl.col('POP') > 40000)
-    .otherwise(pl.col('POP') < 20000)
+    pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))  # for the countries that match those in the list
+    .then(pl.col('POP') > 40000)                                              # mark True if population is > 40,000
+    .otherwise(pl.col('POP') < 20000)                                         # otherwise mark True if population is < 20,000
 )
 
 df.filter(complex_condition).select(['country', 'year', 'POP', 'XRAT', 'tcgdp'])
@@ -366,22 +365,22 @@ The ability to make changes in dataframes is important to generate a clean datas
 **1.** We can use conditional logic to "keep" certain values and replace others
 
 ```{code-cell} ipython3
-df.with_columns(
-    pl.when(pl.col('POP') >= 20000)
-    .then(pl.col('POP'))
-    .otherwise(None)
-    .alias('POP_filtered')
-).select(['country', 'POP', 'POP_filtered'])
+df.with_columns(                     # add a column to the same DataFrame
+    pl.when(pl.col('POP') >= 20000)  # when population is at least 20,000
+    .then(pl.col('POP'))             # keep the population value
+    .otherwise(None)                 # otherwise set the value to null
+    .alias('POP_filtered')           # save results in column POP_filtered
+).select(['country', 'POP', 'POP_filtered'])  # select the columns of interest
 ```
 
 **2.** We can modify specific values based on conditions
 
 ```{code-cell} ipython3
-df_modified = df.with_columns(
-    pl.when(pl.col('cg') == pl.col('cg').max())
-    .then(None)
-    .otherwise(pl.col('cg'))
-    .alias('cg')
+df_modified = df.with_columns(
+    pl.when(pl.col('cg') == pl.col('cg').max())  # when a value in the cg column equals the max cg value
+    .then(None)                                  # set it to null
+    .otherwise(pl.col('cg'))                     # otherwise keep the value in the cg column
+    .alias('cg')                                 # write back to the cg column
 )
 df_modified
 ```
@@ -390,17 +389,19 @@ df_modified
 
 ```{code-cell} ipython3
 df.with_columns([
-    pl.when(pl.col('POP') <= 10000).then(None).otherwise(pl.col('POP')).alias('POP'),
-    (pl.col('XRAT') / 10).alias('XRAT')
+    pl.when(pl.col('POP') <= 10000)  # when population is <= 10,000
+    .then(None)                      # set the value to null
+    .otherwise(pl.col('POP'))        # otherwise keep the existing value
+    .alias('POP'),                   # update the POP column
+    (pl.col('XRAT') / 10).alias('XRAT')  # divide the XRAT values by 10 and update the column in place
 ])
 ```
 
-**4.** We can use in-built functions to modify all individual entries in specific columns.
+**4.** We can use in-built functions to modify all individual entries in specific columns by data type.
 
 ```{code-cell} ipython3
-# Round all decimal numbers to 2 decimal places in numeric columns
 df.with_columns([
-    pl.col(pl.Float64).round(2)
+    pl.col(pl.Float64).round(2)  # round all Float64 columns to 2 decimal places
 ])
 ```
 
@@ -440,10 +441,10 @@ For example, we can use forward fill, backward fill, or interpolation
 
 ```{code-cell} ipython3
 # Fill with column means for numeric columns
-df_filled = df_with_nulls.with_columns([
-    pl.col(pl.Float64).fill_null(pl.col(pl.Float64).mean())
+cols = ["cc", "tcgdp", "POP", "XRAT"]
+df_with_nulls.with_columns([
+    pl.col(cols).fill_null(pl.col(cols).mean())  # fill null values with the column mean
 ])
-df_filled
 ```
 
 Missing value imputation is a big area in data science involving various machine learning techniques.
@@ -454,15 +455,13 @@ There are also more [advanced tools](https://scikit-learn.org/stable/modules/imp
 
 Let's imagine that we're only interested in the population (`POP`) and total GDP (`tcgdp`).
 
-One way to strip the data frame `df` down to only these variables is to overwrite the dataframe using the selection method described above
+One way to strip the data frame `df` down to only these variables is to overwrite the `DataFrame` using the selection method described above
 
 ```{code-cell} ipython3
 df = df.select(['country', 'POP', 'tcgdp'])
 df
 ```
 
-Here the index `0, 1,..., 7` is redundant because we can use the country names as an index.
-
 While polars doesn't have a traditional index like pandas, we can work with country names directly
 
 ```{code-cell} ipython3
@@ -483,7 +482,11 @@ df = df.with_columns((pl.col('population') * 1e3).alias('population'))
 df
 ```
 
-Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions
+Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.
+
+```{note}
+Polars (and pandas) has no way of recording dimensional-analysis units such as GDP represented in millions of dollars. It is left to the user to keep track of their own units when undertaking analysis.
+```
 
 ```{code-cell} ipython3
 df = df.with_columns(
@@ -626,43 +629,7 @@ Note that polars offers many other file type alternatives.
 
 Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read excel, json, parquet or plug straight into a database server.
 
-### Using {index}`wbgapi <single: wbgapi>` and {index}`yfinance <single: yfinance>` to Access Data
-
-The [wbgapi](https://pypi.org/project/wbgapi/) python library can be used to fetch data from the many databases published by the World Bank.
-
-```{note}
-You can find some useful information about the [wbgapi](https://pypi.org/project/wbgapi/) package in this [world bank blog post](https://blogs.worldbank.org/en/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data), in addition to this [tutorial](https://github.com/tgherzog/wbgapi/blob/master/examples/wbgapi-quickstart.ipynb)
-```
-
-We will also use [yfinance](https://pypi.org/project/yfinance/) to fetch data from Yahoo finance
-in the exercises.
-
-For now let's work through one example of downloading and plotting data --- this
-time from the World Bank.
-
-The World Bank [collects and organizes data](https://data.worldbank.org/indicator) on a huge range of indicators.
-
-For example, [here's](https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS) some data on government debt as a ratio to GDP.
-
-The next code example fetches the data for you and plots time series for the US and Australia
-
-```{code-cell} ipython3
-import wbgapi as wb
-wb.series.info('GC.DOD.TOTL.GD.ZS')
-```
-
-```{code-cell} ipython3
-govt_debt_pandas = wb.data.DataFrame('GC.DOD.TOTL.GD.ZS', economy=['USA','AUS'], time=range(2005,2016))
-govt_debt_pandas = govt_debt_pandas.T # move years from columns to rows for plotting
-
-# Convert to polars
-govt_debt = pl.from_pandas(govt_debt_pandas.reset_index())
-```
-
-```{code-cell} ipython3
-# For plotting, convert back to pandas format
-govt_debt.to_pandas().set_index('index').plot(xlabel='year', ylabel='Government debt (% of GDP)');
-```
+++++
 
 ## Exercises
 
@@ -695,6 +662,10 @@ ticker_list = {'INTC': 'Intel',
 
 Here's the first part of the program
 
+```{note}
+Many Python packages return pandas DataFrames by default. In this example we use the `yfinance` package and convert the data to a polars DataFrame.
+```
+
 ```{code-cell} ipython3
 def read_data(ticker_list,
               start=dt.datetime(2021, 1, 1),