3 changes: 3 additions & 0 deletions pydata-amsterdam-2023/category.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{
"title": "PyData Amsterdam 2023"
}
@@ -0,0 +1,44 @@
{
"description": "Pickle files can be evil and simply loading them can run arbitrary code on your system. This talk presents why that is, how it can be exploited, and how skops is tackling the issue for scikit-learn/statistical ML models. We go through some lower level pickle related machinery, and go in detail how the new format works.\n\nThe pickle format has many vulnerabilities and loading them alone can run arbitrary code on the user\u2019s system [1]. In this session we go through the process used by the pickle module to persist python objects, while demonstrating how they can be exploited. We go through how __getstate__ and __setstate__ are used, and how the output of a __reduce__ method is used to reconstruct an object, and how one can have a malicious implementation of these methods to create a malicious pickle file without knowing how to manually create a pickle file by manipulating a file on a lower level. We also briefly touch on other known exploits and issues related to the format [2].\n\nWe also show how one can look inside a pickle file and the operations run by it while loading it, and how one could get an equivalent python script which would result in the output of the pickle file [3]\nThen I present an alternative format from the skops library [4] which can be used to store scikit-learn based models. We talk about what the format is, and how persistence and loading is done, and what we do to prevent loading malicious objects or to avoid running arbitrary code. This format can be used to store almost any scikit-learn estimator, as well as xgboost, lightgbm, and catboost models.\n\n[1] https://peps.python.org/pep-0307/#security-issues\n[2] https://github.com/moreati/pickle-fuzz\n[3] https://github.com/trailofbits/fickling\n[4] https://skops.readthedocs.io/en/stable/persistence.html\n\nBio:\nAdrin\nAdrin works on a few open source projects including skops which tackles some of the MLOps challenges related to scikit-learn models. 
He has a PhD in Bioinformatics, has worked as a consultant, and in an algorithmic privacy and fairness team. He's also a core developer of scikit-learn and fairlearn.\n\n\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. \n\nPyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.\n\n00:00 Welcome!\n00:10 Help us add time stamps or captions to this video! See the description for details.\n\nWant to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps",
"duration": 1339,
"language": "eng",
"recorded": "2023-09-14",
"related_urls": [
{
"label": "Conference Website",
"url": "https://amsterdam2023.pydata.org/cfp/schedule/"
},
{
"label": "https://github.com/numfocus/YouTubeVideoTimestamps",
"url": "https://github.com/numfocus/YouTubeVideoTimestamps"
},
{
"label": "https://github.com/trailofbits/fickling",
"url": "https://github.com/trailofbits/fickling"
},
{
"label": "https://github.com/moreati/pickle-fuzz",
"url": "https://github.com/moreati/pickle-fuzz"
},
{
"label": "https://skops.readthedocs.io/en/stable/persistence.html",
"url": "https://skops.readthedocs.io/en/stable/persistence.html"
},
{
"label": "https://peps.python.org/pep-0307/#security-issues",
"url": "https://peps.python.org/pep-0307/#security-issues"
}
],
"speakers": [
"Adrin Jalali"
],
"tags": [],
"thumbnail_url": "https://i.ytimg.com/vi/9w_H5OSTO9A/maxresdefault.jpg",
"title": "Let's exploit pickle, and `skops` to the rescue!",
"videos": [
{
"type": "youtube",
"url": "https://www.youtube.com/watch?v=9w_H5OSTO9A"
}
]
}
@@ -0,0 +1,29 @@
{
"description": "Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy\n\nData scientists in industry often have to wear many hats. They must navigate statistical validity, business acumen and strategic thinking, while also representing the end user. In this talk, we will talk about the pillars that make a metric the right one for a job, and how to choose appropriate Key Performance Indicators (KPIs) to drive product success and strategic gains.\n\nOur presentation will traverse the relationship of data science skills in product strategy - embracing the multifaceted role of the data scientist and navigating the journey from user segmentation to making data-driven decisions.\n\nThe Data Scientist's Hat Trick: We initiate by emphasising the assorted roles that a data scientist plays in today's business landscape - from being a statistician ensuring the accuracy and validity of data to a strategist driving business decisions. [5 mins]\n\nChoosing Significant Metrics: Next, we'll delve into the nuances of selecting the right metric for the job. Specifically, we\u2019ll talk about the different pillars of metrics setting, for common data science responsibilities such as randomised controlled trials, offline evaluation, opportunity analysis etc. [7 mins]\n\nSetting The Right KPIs: Once metrics are defined, we'll venture into setting the correct KPIs - the small set of top line numbers that say if our venture is doing the job. [7 mins]\n\nData-Driven Decision Making: Lastly, we'll elucidate how to leverage the data you've gathered to make informed, strategic decisions. This necessitates interpreting your metrics and KPIs, spotting trends, and making necessary adjustments to stay on course. 
[7 mins]\n\nIncorporating real-world case studies, we'll demonstrate how these concepts intertwine to contribute to product success.\n\nLearning Objectives:\n* Appreciate the multifaceted role of a data scientist in driving product strategies.\n* Learn to set realistic and challenging KPIs that align with your company's overarching objectives.\n* Gain insights into leveraging data for informed decision-making and product strategy adjustments.\n\nBio:\nAlon Nir\nData scientist (Data Lead) at Spotify. Dismal scientist by education. Advocating against pie charts since 2015. Self-proclaimed GIF connoisseur.\n\nDror A. Guldin\nData Scientist (Tech Lead) at Meta\n\n\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. \n\nPyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.\n\n00:00 Welcome!\n00:10 Help us add time stamps or captions to this video! See the description for details.\n\nWant to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps",
"duration": 1707,
"language": "eng",
"recorded": "2023-09-14",
"related_urls": [
{
"label": "Conference Website",
"url": "https://amsterdam2023.pydata.org/cfp/schedule/"
},
{
"label": "https://github.com/numfocus/YouTubeVideoTimestamps",
"url": "https://github.com/numfocus/YouTubeVideoTimestamps"
}
],
"speakers": [
"Alon Nir",
"Dror A. Guldin"
],
"tags": [],
"thumbnail_url": "https://i.ytimg.com/vi/Yd35Q2oclY8/maxresdefault.jpg",
"title": "Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy",
"videos": [
{
"type": "youtube",
"url": "https://www.youtube.com/watch?v=Yd35Q2oclY8"
}
]
}
@@ -0,0 +1,28 @@
{
"description": "This informative talk aims to close the gap between the theory of data contracts and their real-life implementations. It contains a few Python code snippets and is aimed primarily at data and software engineers. However, it could be food for thought for machine learning engineers, data scientists, and other data consumers.\n\nTopic: There are a lot of ongoing discussions happening about data contracts. I would like to share with you some lessons learned from data contract implementations and show you some Python examples.\n\nAudience: data and software engineers; potentially could be interesting for machine learning engineers, data scientists, and other data consumers. Some affinity with Pandas, Great Expectations, and Open Table Formats are desirable.\n\nType: Informative with some hands-on examples\n\nMain takeaways:\n- better understanding of the data contracts concept\n- tips for batch data contracts implementations\n- tips for streaming data contracts implementations\n\nBio: \nAlyona Galyeva\nAlyona Galyeva is an organizer of PyLadies Amsterdam, co-organizer of MLOps and Crafts, Microsoft AI MVP and Principal Engineer at Thoughtworks\nObserve - Optimize - Learn - Repeat\nPassionate about encouraging others to see different perspectives and constructively break the rules.\nI found my joy in building, optimizing, and deploying end-to-end AI and Data Engineering Solutions.\n\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. 
\n\nPyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.\n\n00:00 Welcome!\n00:10 Help us add time stamps or captions to this video! See the description for details.\n\nWant to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps",
"duration": 1504,
"language": "eng",
"recorded": "2023-09-14",
"related_urls": [
{
"label": "Conference Website",
"url": "https://amsterdam2023.pydata.org/cfp/schedule/"
},
{
"label": "https://github.com/numfocus/YouTubeVideoTimestamps",
"url": "https://github.com/numfocus/YouTubeVideoTimestamps"
}
],
"speakers": [
"Alyona Galyeva"
],
"tags": [],
"thumbnail_url": "https://i.ytimg.com/vi/YGKqvMhaEVA/maxresdefault.jpg",
"title": "Data Contracts in action powered by Python open source ecosystem",
"videos": [
{
"type": "youtube",
"url": "https://www.youtube.com/watch?v=YGKqvMhaEVA"
}
]
}
@@ -0,0 +1,28 @@
{
"description": "Have you ever struggled with a multitude of columns created by One Hot Encoder? Or decided to look beyond it, but found it hard to decide which feature encoder would be a good replacement?\n\nGood news, there are many encoding techniques that have been developed to address different types of categorical data. This talk will provide an overview on various encoding methods available in data science, and a guidance on decision making about which one is appropriate for the data at hand.\n\nJoin this talk if you would like to hear about the importance of feature encoding and why it is important to not default to One Hot Encoding in every scenario. It will start with commonly used approaches and will progress into more advanced and powerful techniques which can help extract meaningful information from the data.\n\nFor each presented encoder, after this talk you will know:\n- When to use it\n- When NOT to use it\n- Important considerations specific to the encoder\n- Python library that offers a built-in method with the encoder, facilitating easy integration into feature engineering pipelines.\n\nI will explore different feature encoding approaches and provide guidance for decision-making. I will cover simpler methods like Label, One Hot, and Frequency encoding, progressing to powerful techniques like Target and Rare Label encoding. Finally, I will explain more complex approaches like Weight of Evidence, Hash and Catboost encoding. I will close the talk with summarizing the key takeaways.\n\nTarget Audience:\nData scientists and anyone interested in feature encoding\n\nPrevious experience with feature encoders can be useful but is not mandatory to follow the talk.\n\n\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. 
The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. \n\nPyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.\n\n00:00 Welcome!\n00:10 Help us add time stamps or captions to this video! See the description for details.\n\nWant to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps",
"duration": 1628,
"language": "eng",
"recorded": "2023-09-14",
"related_urls": [
{
"label": "Conference Website",
"url": "https://amsterdam2023.pydata.org/cfp/schedule/"
},
{
"label": "https://github.com/numfocus/YouTubeVideoTimestamps",
"url": "https://github.com/numfocus/YouTubeVideoTimestamps"
}
],
"speakers": [
"Ana Chaloska"
],
"tags": [],
"thumbnail_url": "https://i.ytimg.com/vi/4Opsiqj6gcY/maxresdefault.jpg",
"title": "To One-Hot or Not: A guide to feature encoding and when to use what",
"videos": [
{
"type": "youtube",
"url": "https://www.youtube.com/watch?v=4Opsiqj6gcY"
}
]
}
@@ -0,0 +1,28 @@
{
"description": "Pick your next hot LLM prompt using a Bayesian tournament! Get a quick LLM dopamine hit with a side of decision theory vegetables. It's Bayesian Thunderdome: many prompts enter, one prompt leaves.\n\nHow do you chose the best LLM prompt systematically beyond guessing and vibes? Use the winner of a Bayesian tournament! Get a quick dopamine hit from fun LLM prompt magic with a side of Bayesian decision theory vegetables. If you are doing stuff with LLMs \u2014 you'll get a serious tool to improve your prompting game. If you're not using LLMs \u2014 you'll learn about Bayesian tournaments. They are not well known but have wide applicability: they help you optimally choose a winner using a minimal number of matches.\n\nBio:\nAndy Kitchen\nI've helped found multiple start-ups, including CorticalLabs an AI+Biotech company working on \"Synthetic Biological Intelligence\". I've co-authored several papers and patents in deep learning and neuroscience. I've made a mess in more than a dozen programming languages over my career. My stack is full. I've worked on custom neural interface hardware to web apps and everything in between. I've won a few hack-a-thons. I started the Machine Learning and AI meetup in Melbourne Australia, helped found & organize the Compose :: Melbourne conference. I have two cats, I scoop their poop most days.\n\n\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. 
\n\nPyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.\n\n00:00 Welcome!\n00:10 Help us add time stamps or captions to this video! See the description for details.\n\nWant to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps",
"duration": 1746,
"language": "eng",
"recorded": "2023-09-14",
"related_urls": [
{
"label": "Conference Website",
"url": "https://amsterdam2023.pydata.org/cfp/schedule/"
},
{
"label": "https://github.com/numfocus/YouTubeVideoTimestamps",
"url": "https://github.com/numfocus/YouTubeVideoTimestamps"
}
],
"speakers": [
"Andy Kitchen"
],
"tags": [],
"thumbnail_url": "https://i.ytimg.com/vi/UY3wxjk2o6o/maxresdefault.jpg",
"title": "Promptly Evaluating Prompts with Bayesian Tournaments",
"videos": [
{
"type": "youtube",
"url": "https://www.youtube.com/watch?v=UY3wxjk2o6o"
}
]
}