@@ -66,11 +66,10 @@ as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit
66
66
This lecture will provide a basic introduction to polars.
67
67
68
68
```{tip}
69
-
**Why use Polars over pandas?** The main reason is `performance`. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
69
+
**Why use Polars over pandas?** One reason is **performance**. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
70
70
```
71
71
72
-
Throughout the lecture, we will assume that the following imports have taken
73
-
place
72
+
Throughout the lecture, we will assume that the following imports have taken place
74
73
75
74
```{code-cell} ipython3
76
75
import polars as pl
@@ -101,11 +100,7 @@ s
101
100
```
102
101
103
102
```{note}
104
-
You may notice the above series has no indices, unlike in [pd.Series](pandas:series).
105
-
106
-
This is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks.
107
-
108
-
Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
103
+
You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
109
104
```
110
105
111
106
Polars `Series` are built on top of Apache Arrow arrays and support many similar
@@ -127,9 +122,9 @@ For example they have some additional (statistically oriented) methods
127
122
s.describe()
128
123
```
129
124
130
-
However the Polars `series`cannot be used in the same as as a Pandas `series` when pairing data with indices.
125
+
However the `pl.Series` object cannot be used in the same way as a `pd.Series` when pairing data with indices.
131
126
132
-
For example, using a Pandas `series` you can do the following:
127
+
For example, using a `pd.Series` you can do the following:
133
128
134
129
```{code-cell} ipython3
135
130
s = pd.Series(np.random.randn(4), name='daily returns')
@@ -139,42 +134,42 @@ s
139
134
140
135
However, in Polars you will need to use the `DataFrame` object to do the same task.
141
136
142
-
This means you will use the `DataFrame` object more commonly when using polars if you
143
-
are interested in relationships between data.
137
+
This means you will use the `DataFrame` object more often when using polars if you
138
+
are interested in relationships between data.
144
139
145
-
Essentially any column in a Polars `DataFrame`can be used as an indices through the `filter` method.
140
+
Let's create a `pl.DataFrame` containing data equivalent to the `pd.Series`.
146
141
147
142
```{code-cell} ipython3
148
-
df_temp = pl.DataFrame({
143
+
df = pl.DataFrame({
149
144
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
150
145
'daily returns': s.to_list()
151
146
})
152
-
df_temp
147
+
df
153
148
```
154
149
155
150
To access specific values by company name, we can query the DataFrame by filtering on
156
-
the `AMZN` ticker code and selecting the `daily returns`.
151
+
the `AMZN` ticker code and selecting the `daily returns`.
To update the `AMZN` return to 0, you can use the following chain of methods.
163
158
164
159
```{code-cell} ipython3
165
-
df_temp = df_temp.with_columns(
166
-
pl.when(pl.col('company') == 'AMZN')
167
-
.then(0)
168
-
.otherwise(pl.col('daily returns'))
169
-
.alias('daily returns')
160
+
df = df.with_columns( # with_columns is similar to select but adds columns to the same DataFrame
161
+
pl.when(pl.col('company') == 'AMZN') # filter for rows relating to AMZN in company column
162
+
.then(0) # set values to 0
163
+
.otherwise(pl.col('daily returns')) # otherwise keep the value in daily returns column
164
+
.alias('daily returns') # assign back to the daily returns column
170
165
)
171
-
df_temp
166
+
df
172
167
```
173
168
174
-
You could also check if `AAPL`is in a column.
169
+
You can check if a ticker code is in the company list
175
170
176
171
```{code-cell} ipython3
177
-
'AAPL' in df_temp.get_column('company')
172
+
'AAPL' in df['company']
178
173
```
179
174
180
175
## DataFrames
@@ -188,7 +183,8 @@ In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel s
188
183
189
184
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
190
185
191
-
Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
186
+
Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`,
187
+
which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
192
188
193
189
The dataset contains the following indicators
194
190
@@ -204,19 +200,21 @@ The dataset contains the following indicators
204
200
We'll read this in from a URL using the `polars` function `read_csv`.
Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions
485
+
Next, we're going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.
486
+
487
+
```{note}
488
+
Neither Polars nor pandas has a way of recording units of measurement, such as GDP being expressed in millions of dollars. It is left to the user to track units when undertaking analysis.
489
+
```
487
490
488
491
```{code-cell} ipython3
489
492
df = df.with_columns(
@@ -626,43 +629,7 @@ Note that polars offers many other file type alternatives.
626
629
627
630
Polars has [a wide variety](https://docs.pola.rs/user-guide/io/) of methods that we can use to read Excel, JSON and Parquet files, or to plug straight into a database server.
628
631
629
-
### Using {index}`wbgapi <single: wbgapi>` and {index}`yfinance <single: yfinance>` to Access Data
630
-
631
-
The [wbgapi](https://pypi.org/project/wbgapi/) python library can be used to fetch data from the many databases published by the World Bank.
632
-
633
-
```{note}
634
-
You can find some useful information about the [wbgapi](https://pypi.org/project/wbgapi/) package in this [world bank blog post](https://blogs.worldbank.org/en/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data), in addition to this [tutorial](https://github.com/tgherzog/wbgapi/blob/master/examples/wbgapi-quickstart.ipynb)
635
-
```
636
-
637
-
We will also use [yfinance](https://pypi.org/project/yfinance/) to fetch data from Yahoo finance
638
-
in the exercises.
639
-
640
-
For now let's work through one example of downloading and plotting data --- this
641
-
time from the World Bank.
642
-
643
-
The World Bank [collects and organizes data](https://data.worldbank.org/indicator) on a huge range of indicators.
644
-
645
-
For example, [here's](https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS) some data on government debt as a ratio to GDP.
646
-
647
-
The next code example fetches the data for you and plots time series for the US and Australia
Many Python packages return pandas DataFrames by default. In this example we use the `yfinance` package and convert the data to a Polars DataFrame.