Commit a6d6b7b

Merge pull request #93 from tombarys/apriori
Update several things, corrected typos and ugly things, added simple core correlation code and credits
2 parents e8364d0 + 6fa78b7 commit a6d6b7b

5 files changed

+84
-38
lines changed

src/data_analysis/book_sales_analysis/about_apriori.clj

Lines changed: 38 additions & 3 deletions
@@ -21,6 +21,7 @@
2121
[data-analysis.book-sales-analysis.core-helpers-v2 :as helpers]
2222
[data-analysis.book-sales-analysis.market-basket-analysis-v2 :as mba]))
2323

24+
2425
;; # From Correlations to Recommendations
2526
;; ## A Publisher's Journey into Data-Driven Book Sales
2627

@@ -95,6 +96,31 @@
9596

9697
;; My first instinct was to calculate correlations between all books. A correlation tells you how often two books appear together compared to what you'd expect by chance. When I visualized this as a heatmap, with books ordered chronologically, something fascinating emerged:
9798

99+
;; The core of the correlation calculation is this function (thank you, [@generateme](https://github.com/generateme), for the huge [help](https://clojurians.zulipchat.com/#narrow/channel/151924-data-science/topic/Correlation.20matrix.20best.20practice.3F/with/530339272)):
100+
101+
^:kindly/hide-code
102+
(kind/code
103+
"; ...
104+
105+
(defn corr-a-x-b [ds]
106+
(let
107+
[columns (tc/column-names ds)
108+
clean-ds (tc/drop-columns ds [:zakaznik])]
109+
(-> (zipmap columns (stats/correlation-matrix (tc/columns clean-ds)))
110+
tc/dataset
111+
(tc/add-column :book columns))))
112+
113+
; ...")
114+
115+
;; ...which after chronological sorting (see article's repo) enters this plotly element:
116+
117+
^:kindly/hide-code
118+
(def example-corr-matrix-calculation
119+
(-> data/anonymized-shareable-ds
120+
helpers/onehot-encode-by-customers
121+
(helpers/corr-matrix data/books-meta)))
122+
123+
98124
(kind/plotly
99125
{:data [{:type "heatmap"
100126
:z (tc/columns data/corr-matrix-precalculated)
@@ -114,7 +140,7 @@
114140
:font {:size 12 :color "black"}
115141
:bgcolor "rgba(255, 255, 200, 0.9)" :bordercolor "yellow" :borderwidth 2}]}})
116142

117-
;; The bright red square in the upper-left corner revealed that **recently published books have much stronger co-purchase patterns** than older titles. This made intuitive sense—customers discovering our catalog tend to buy multiple new releases together.
143+
;; The bright red square in the lower-left corner revealed that **recently published books have much stronger co-purchase patterns** than older titles. This made intuitive sense—customers discovering our catalog tend to buy multiple new releases together.
118144

119145
;; ## A Surprising Discovery: Czech vs. Foreign Authors
120146

@@ -234,7 +260,7 @@ scatter-plot
234260
;; **Lift** measures whether this happens more than random chance:
235261

236262
;; $$
237-
;; "\text{Lift}(A \rightarrow B) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Support}(\{B\})}
263+
;; \text{Lift}(A \rightarrow B) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Support}(\{B\})}
238264
;; $$
239265

240266
;; A lift greater than 1 indicates positive association—the items are purchased together more often than if they were independent. A lift of 2.3 means the combination is 2.3 times more likely than chance.
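;; To make the definition concrete, here is a minimal sketch in plain Clojure. It assumes each basket is a set of book keywords; the names `support`, `confidence`, `lift` and `toy-baskets` are illustrative only and are not the functions in the `mba` namespace.

```clojure
(defn support
  "Fraction of baskets that contain every item in `items`."
  [baskets items]
  (/ (count (filter #(every? % items) baskets))
     (count baskets)))

(defn confidence
  "Share of baskets containing `a` that also contain `b`.
  Throws on division by zero if `a` never occurs."
  [baskets a b]
  (/ (support baskets [a b])
     (support baskets [a])))

(defn lift
  "Confidence of a -> b divided by the baseline support of `b`."
  [baskets a b]
  (/ (confidence baskets a b)
     (support baskets [b])))

;; Hypothetical toy data: six baskets, three titles.
(def toy-baskets
  [#{:book-a :book-b}
   #{:book-a :book-b}
   #{:book-a}
   #{:book-b}
   #{:book-c}
   #{:book-c}])

(lift toy-baskets :book-a :book-b)
;; => 4/3
```

;; Here `:book-b` appears in half of all baskets but in two thirds of the baskets that contain `:book-a`, so the lift is (2/3) / (1/2) = 4/3, a mild positive association.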
@@ -255,7 +281,7 @@ scatter-plot
255281
(tc/head (:rules-grouped quick-formatted) 15)
256282
{:element/max-height "500px"})
257283

258-
;; Reading these rules is straightforward. For example, if a customer buys Yuval Noah Harari's "Sapiens," there's a 72% chance they'll also purchase "Nexus" (another Harari title), and this combination is nearly twice as likely as random chance (lift = 1.99).
284+
;; Reading these rules is straightforward. For example, if a customer buys "Zamilujte se do angličtiny", there's a 37% chance they'll also purchase "365 anglických frází a výrazů" (a title by a different author), and this combination is about 6 times as likely as random chance (lift = 6.12).
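
;; As a sanity check on these figures (a rough calculation from the rounded values quoted in the example, not an output of the analysis code), the lift definition can be rearranged to give the baseline support of the second title:

;; $$
;; \text{Support}(\{B\}) = \dfrac{\text{Confidence}(A \rightarrow B)}{\text{Lift}(A \rightarrow B)} \approx \dfrac{0.37}{6.12} \approx 0.06
;; $$

;; In other words, roughly 6% of all baskets contain the second title, compared with 37% of the baskets that contain the first one.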
259285

260286
;; ## Visualizing the Network
261287

@@ -344,6 +370,15 @@ scatter-plot
344370
;; - Full presentation code: *to be published*
345371
;; - SciCloj community: [scicloj.github.io](https://scicloj.github.io)
346372

373+
;; ## Credits
374+
375+
;; I would like to thank [Daniel Slutsky](https://github.com/daslu) for inviting me to participate in the conference and for constantly encouraging me not to give up. I think he is doing an amazing job for this great community.
376+
;; I would also like to thank [Timothy Pratley](https://github.com/timothypratley) for his support during the publication process.
377+
;; Extra thanks to my friends at [Not Null Makers](https://notnullmakers.com/) for making me a better Clojure person :).
378+
379+
380+
381+
347382
;; ---
348383

349384
;; *This article is based on a presentation at Macroexpand conference, October 2025.*
5.91 KB
Binary file not shown.

src/data_analysis/book_sales_analysis/core_helpers_v2.clj

Lines changed: 42 additions & 22 deletions
@@ -112,8 +112,8 @@
112112
(map #(str/replace % #"\+" ""))
113113
(map #(str/trim %))
114114
(map sanitize-str)
115-
(map #(str/replace % #"\-\-.+$" ""))
116-
(map #(str/replace % #"\-+$" ""))
115+
(map #(str/replace % #"\-\-.+$" ""))
116+
(map #(str/replace % #"\-+$" ""))
117117
(map #(str/replace % #"^3" "k3"))
118118
(map #(str/replace % #"^5" "k5"))
119119
(remove (fn [item] (some (fn [substr] (str/includes? (name item) substr))
@@ -125,6 +125,28 @@
125125

126126
;; ### Metadata Enriching and Convenience Functions
127127

128+
(def end-time
129+
(jt/local-date 2025 10 1))
130+
131+
(defn months-between "Calculate how many months a product has been on the market"
132+
[start-date end-date]
133+
(let [days (if (and start-date end-date)
134+
(jt/time-between start-date end-date :days)
135+
0)]
136+
(long (Math/round (/ days 30.4375)))))
137+
138+
(defn months-on-market
139+
"Number of months `book` has been on the market; zero if no release date is found."
140+
[books-ds book end-date]
141+
(let [date (try
142+
(-> books-ds
143+
(tc/select-columns [:titul :datum-zahajeni-prodeje])
144+
(tc/select-rows #(str/starts-with? (name (:titul %)) (name book)))
145+
(tc/get-entry :datum-zahajeni-prodeje 0))
146+
(catch Exception _ nil))
147+
month (if (nil? date) 0 (months-between date end-date))]
148+
month))
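
;; The day-to-month conversion in `months-between` can be tried in isolation. This is a sketch only: it calls `java.time` directly instead of going through the `jt` alias, and the name `months-between-sketch` is hypothetical. The divisor 30.4375 is the average month length in days (365.25 / 12).

```clojure
(import '(java.time LocalDate)
        '(java.time.temporal ChronoUnit))

(defn months-between-sketch
  "Whole months between two LocalDates, rounded to the nearest month.
  Returns 0 when either date is missing."
  [^LocalDate start ^LocalDate end]
  (if (and start end)
    ;; Days between the dates, divided by the average month length.
    (Math/round (/ (.between ChronoUnit/DAYS start end) 30.4375))
    0))

(months-between-sketch (LocalDate/of 2024 10 1) (LocalDate/of 2025 10 1))
;; => 12

(months-between-sketch nil (LocalDate/of 2025 10 1))
;; => 0
```

;; Because the result is rounded, a title released 30 days before the cut-off counts as one month on the market, not zero.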
149+
128150
(defn czech-author? [book-title]
129151
(let [czech-books #{:k30-hodin
130152
:k365-anglickych-cool-fraz-a-vyrazov
@@ -159,7 +181,7 @@
159181

160182
;; ### One-Hot Encoding Functions
161183

162-
184+
163185
(defn onehot-encode-by-customers ;; FIXME needs refactor and simplification :)
164186
"One-hot encode dataset aggregated by customer.
165187
Each customer gets one row with 0/1 values for each book they bought.
@@ -279,25 +301,23 @@
279301
tc/dataset
280302
(tc/add-column :book columns))))
281303

282-
(defn corr-3-col
283-
"Creates a correlation matrix with two columns of books \n
284-
=> _unnamed [4 3]: \n
285-
| :book-0 | :book-1 | :correlation |
286-
|---------|---------|-------------:|
287-
| :a | :a | 1.00000000 |
288-
| :a | :b | -0.12121831 |
289-
| :b | :a | -0.12121831 |
290-
| :b | :b | 1.00000000 | \n
291-
- `flatten` is used here to make a linear sequence of numbers which should match corresponding variable names. \n
292-
- Since we make pairs of names `((for...[a b])` creates a cartesian product) we need to separate these to individual columns, tc/seperate-column does the trick, refer: https://scicloj.github.io/tablecloth/#separate"
293-
[ds]
294-
(let [names (tc/column-names ds)
295-
mat (flatten (stats/correlation-matrix (tc/columns ds)))]
296-
(-> (tc/dataset {:book (for [a names b names] [a b])
297-
:correlation mat})
298-
(tc/separate-column :book)
299-
(tc/rename-columns {":book-0" :titul-knihy
300-
":book-1" :book-1}))))
304+
(defn corr-matrix
305+
"Creates a correlation matrix with books sorted by publication date (chronological order) \n
306+
`books-onehot` – one-hot encoded dataset"
307+
[books-onehot books-meta]
308+
(-> (corr-a-x-b (-> books-onehot
309+
(tc/reorder-columns
310+
(sort-by #(months-on-market books-meta % end-time)
311+
(tc/column-names books-onehot)))
312+
(tc/drop-columns [:zakaznik])))
313+
(tc/reorder-columns
314+
(sort-by #(months-on-market books-meta % end-time)
315+
(tc/column-names books-onehot)))
316+
(tc/add-column :sort-col
317+
(fn [ds] (map #(months-on-market books-meta % end-time)
318+
(tc/column ds :book))))
319+
(tc/order-by :sort-col)
320+
(tc/drop-columns :sort-col)))
301321

302322

303323
;; ### Export helper functions from other namespaces for convenience

src/data_analysis/book_sales_analysis/data_sources_v2.clj

Lines changed: 4 additions & 12 deletions
@@ -7,16 +7,6 @@
77

88
;; ### Main Orders Data (WooCommerce exports)
99

10-
#_(def anonymized-presentation-ds
11-
"Anonymized customers full dataset of all orders - for presentation purposes
12-
❗ NOT FOR SHARING"
13-
(tc/dataset
14-
(helpers/merge-csvs
15-
["data/anonymized-customers-only-presentation-v2.csv"]
16-
{:header? true
17-
:separator ","
18-
:key-fn #(keyword (helpers/sanitize-str %))})))
19-
2010
(def anonymized-shareable-ds
2111
"Fully anonymized slice of db - for sharing purposes
2212
✅ SAFE TO SHARE"
@@ -29,13 +19,15 @@
2919

3020
;; ### Book Metadata
3121

22+
(def books-meta
23+
(tc/dataset "src/data_analysis/book_sales_analysis/books_meta.nippy"))
24+
3225
;; ## Quick Access
3326
;; Most commonly used datasets with short aliases
3427

3528
(def orders-share anonymized-shareable-ds)
36-
#_(def orders-slides anonymized-presentation-ds)
3729

38-
(def corr-matrix-precalculated
30+
(def corr-matrix-precalculated
3931
(tc/dataset "src/data_analysis/book_sales_analysis/corr-matrix.nippy"))
4032

4133
corr-matrix-precalculated

src/data_analysis/book_sales_analysis/market_basket_analysis_v2.clj

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@
1212
[clojure.math.combinatorics :as combo]
1313
[clojure.string :as str]
1414
[clojure.set]
15-
[data-analysis.book-sales-analysis.data-sources-v2 :as data]
1615
[data-analysis.book-sales-analysis.core-helpers-v2 :as helpers]))
1716

1817
;; # Market Basket Analysis
