🥏 Recently, I found this box of frisbees in my parents' basement, and it's what's left of my biggest failure: a search engine #startup. 𝗙𝗮𝗶𝗹𝘂𝗿𝗲 𝘀𝗼𝘂𝗻𝗱𝘀 𝗵𝗮𝗿𝘀𝗵, but we learn by trial and error, so every mistake is an opportunity for growth.
📊 One of the biggest lessons I learned was technical: the importance of #analytics and of 𝗱𝗲𝗯𝘂𝗴𝗴𝗶𝗻𝗴 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝘀 to find points of failure. It was then that I realized Machine Learning had a problem. After all, how do you debug ML models? This is how, in 2017, I first stumbled upon Interpretable ML / Explainable AI research. Fast forward to 2020, and I was writing a book about it! And I spoke about this journey to San Francisco-based A.I. startup entrepreneurs and workers.
💪 𝐼𝑛 𝐶𝑜𝑛𝑐𝑙𝑢𝑠𝑖𝑜𝑛: the frisbees may have been the only tangible items, but my failure left behind stories, ideas, lessons, and a brand new perspective that has only made me 𝘀𝘁𝗿𝗼𝗻𝗴𝗲𝗿! As for the frisbees, they will find a new home with Goodwill.
I learned to program on this computer — I was a child during the '80s 🤓. It had a 4.77 MHz CPU, 256 KB RAM, monochrome display, and no hard drive, so you had to be creative to overcome 𝗿𝗲𝘀𝗼𝘂𝗿𝗰𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀 — not to mention exercise patience!
We are 𝘀𝗼 𝘀𝗽𝗼𝗶𝗹𝗲𝗱 these days! To put it in context, most smartphones 📱 have over 16 thousand times the RAM and more storage than would have fit in a room in the '80s. Add to that cheap, limitless cloud storage. I am not complaining. That is great! However, I wonder how much resource constraints foster software innovation and optimal code.
Today, trillion-parameter deep learning 🤖 models are pushing the envelope. Still, it seems illogical to think that they represent the most 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 grounded in, for instance, biology, causal understanding of the world, or statistics. So before ushering in the age of quantum computing, I'm hoping we hit some resource limitations so that we focus more energy on more creative and intuitive solutions, not to mention cost-effective ones.
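As a rough check on that RAM figure, here is a minimal sketch. It treats "16 thousand" as 16 × 1024, which is an assumption made only to keep the arithmetic in powers of two.

```python
# Back-of-the-envelope check: 256 KB of RAM scaled up about 16 thousand times.
KB = 1024
old_ram_bytes = 256 * KB        # RAM of the 1980s machine described above
factor = 16 * 1024              # "16 thousand", taken here as 16,384 (assumption)

modern_ram_bytes = old_ram_bytes * factor
modern_ram_gib = modern_ram_bytes / 1024 ** 3

print(modern_ram_gib)  # 4.0
```

So "over 16 thousand times" corresponds to a phone with more than 4 GiB of RAM.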
What do you think? How much does an abundance of resources hinder or enable creative solutions?
🇨🇷 7 years ago, I had a fantastic 4-day journey trekking through the 𝗖𝗼𝘀𝘁𝗮 𝗥𝗶𝗰𝗮𝗻 𝗿𝗮𝗶𝗻𝗳𝗼𝗿𝗲𝘀𝘁. On the 1st day, we had to cross a wild river in a metal basket hanging from a rusty rope. And I thought to myself, "What the hell have I gotten into?!"
🐒 On that journey, I saw 𝗺𝗮𝗻𝘆 𝘀𝗽𝗲𝗰𝗶𝗲𝘀 of wildlife. I slept smelling the moss on the bark and wet ferns. And I woke up every morning to a majestic orchestra of birds, insects, monkeys, and frogs. It's also hard to realize the sheer scale of a rainforest when you are in it. On peaks, we could see the many green valleys we had crossed with Ceiba trees towering 17 stories high over the canopy!
🌎 We only have 36% of rainforests left. When I was born it was well over 50%. Today is #WorldRainforestDay and I thought I’d share a story of why I care. In #DataScience, we think 𝘧𝘢𝘤𝘵𝘴 & 𝘧𝘪𝘨𝘶𝘳𝘦𝘴 alone are convincing. But often it's the 𝘭𝘪𝘷𝘦𝘥 𝘦𝘹𝘱𝘦𝘳𝘪𝘦𝘯𝘤𝘦 & 𝘦𝘮𝘰𝘵𝘪𝘰𝘯𝘴 that come with them that make things matter to us. I don't regret crossing the river in the basket because the journey that followed was life-changing. If I was an environmentalist before because of the facts I knew, now I had more conviction than ever that #nature had to be preserved for future generations!
Today is 𝐖𝐨𝐫𝐥𝐝 𝐅𝐨𝐨𝐝 𝐒𝐚𝐟𝐞𝐭𝐲 𝐃𝐚𝐲. For me, it's a day of reflection.
🦠 After all, 𝗖𝗢𝗩𝗜𝗗𝟭𝟵 had a food safety-related genesis. Natural disasters and clearing land for urbanization + agriculture push wildlife closer to human settlements, which fuels pandemic risk.
🌽 Food safety is essential, no doubt, but it's intrinsically related to 𝗳𝗼𝗼𝗱 𝘀𝗲𝗰𝘂𝗿𝗶𝘁𝘆 — and this is what worries me the most. There's a need to feed another 2 billion mouths by 2050 and double food production to that end. So, as a data scientist in agriculture, I'm inspired to make my tiny contribution to improving food security.
🌎 However, 𝗰𝗹𝗶𝗺𝗮𝘁𝗲 𝗰𝗵𝗮𝗻𝗴𝗲 can make our food production goals nearly impossible. Under a high-emission scenario, by 2050, a huge swath of the United States is projected to suffer a sizable decline in crop yields, although this is offset by the fact that other areas of the country will experience an increase. Other countries won't be that lucky since they are entirely vulnerable given so many land challenges: desertification, land degradation, climate change adaptation, undernourishment, biodiversity, groundwater stress, and water quality (see IPCC for details). It's an existential threat to humanity, and we have only a few years to reverse this trajectory.
When discussing human judgment and, by extension, algorithmic decisions, we are used to talking about 𝐛𝐢𝐚𝐬, but what about 𝐧𝐨𝐢𝐬𝐞?
🎯 Nobel Laureate ᴅᴀɴɪᴇʟ ᴋᴀʜɴᴇᴍᴀɴ and co-authors make a case for why we should pay close attention to it in their new book 𝑁𝑜𝑖𝑠𝑒: 𝐴 𝐹𝑙𝑎𝑤 𝑖𝑛 𝐻𝑢𝑚𝑎𝑛 𝐽𝑢𝑑𝑔𝑚𝑒𝑛𝑡. It uses compelling stories and succinct illustrations to show how widespread the problem is in business and government. For instance, I love the target illustration and the error decompositions.
📢 The book covers group dynamics such as information cascades, social pressure, and group polarization as amplifiers of noise, and some cognitive #biases to boot. Lastly, it outlines noise mitigation strategies, including decision hygiene, decision observers, and noise audits, which were BY FAR the biggest takeaways for me.
😒 However, if you are already familiar with the topic, the book will likely disappoint (at least a little). It can feel very repetitive without getting into enough depth, and its entanglement with bias means it keeps referring to concepts covered in 𝑇ℎ𝑖𝑛𝑘𝑖𝑛𝑔, 𝐹𝑎𝑠𝑡 𝑎𝑛𝑑 𝑆𝑙𝑜𝑤, as if it were some long-lost final chapter. I still enjoyed it, regardless.
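One common way to write such an error decomposition is: mean squared error = bias² + noise², where noise is the spread of the judgments around their average. Here is a minimal sketch with made-up numbers; `true_value` and `judgments` are hypothetical and not taken from the book.

```python
from statistics import mean, pvariance

true_value = 100                       # hypothetical correct answer
judgments = [90, 110, 120, 130, 100]   # hypothetical judgments of the same case

errors = [j - true_value for j in judgments]
mse = mean([e ** 2 for e in errors])   # overall error
bias = mean(errors)                    # average error (the systematic miss)
noise_sq = pvariance(errors)           # spread of the errors around the bias

# The decomposition: overall error equals bias squared plus noise squared.
print(mse == bias ** 2 + noise_sq)  # True
```

The point of the split is that shrinking either term lowers overall error, so noise is worth reducing even when the bias is unknown.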
Have you read it? Do you want to?
🎲 𝐂𝐡𝐚𝐧𝐜𝐞 & 📈 𝐃𝐞𝐜𝐢𝐬𝐢𝐨𝐧-𝐌𝐚𝐤𝐢𝐧𝐠 — Long before I called myself a data scientist, I helped build a backend website for sports betting on my very first job. After that, for about a decade, I improved the user experience for gambling sites of all kinds. As it turns out, the first data I engaged with professionally taught me a lot about human nature.
Throughout human history, we have been fascinated with chance. The first known tools used to this end were knucklebones in ancient Sumer, either for fortune-telling or games of chance. Better tools have been invented since, like dice, playing cards, and more recently, random number generators (RNGs). However, now we wield randomness for business/scientific purposes and not just mysticism/entertainment. In fact, the most powerful #MachineLearning methods depend on RNGs.
I recently read the book 𝘛𝘩𝘦 𝘋𝘳𝘶𝘯𝘬𝘢𝘳𝘥'𝘴 𝘞𝘢𝘭𝘬, which made me reflect on what drew me to the discipline. We are surrounded by randomness, but humans want to be in control, often attributing successful random events to skill and unsuccessful ones to a lack of it (ɪʟʟᴜꜱɪᴏɴ ᴏꜰ ᴄᴏɴᴛʀᴏʟ). #Data can improve decisions by separating the signal from the noise and tracing outcomes to plausible causes. This possibility is what inspires my journey! What's yours?
𝗪𝗵𝘆 𝗶𝘀 𝗲𝗻𝗱𝗶𝗻𝗴 𝗭𝗼𝗼𝗺 𝗺𝗲𝗲𝘁𝗶𝗻𝗴𝘀 𝙨𝙤 𝙖𝙬𝙠𝙬𝙖𝙧𝙙? First, you say bye and then, for what seems like an eternity, have to gracelessly stare at the host and any remaining attendees as you all fumble around looking for the end meeting button or keyboard shortcut!
I realize there are more pressing issues to solve with #machinelearning, but can't Zoom come up with a gesture or voice-activated feature to stop the meeting as soon as it's over, to spare introverts like me from those clumsy moments? Does this happen to you? If so, what do you think should prompt the ending?
- 👋🏼 A wave gesture?
- 💬 The words "zoom end"?
- ⬅️ A slide left gesture?
- 🤷🏽 None: Suck it up!
We often hear "𝙘𝙤𝙧𝙧𝙚𝙡𝙖𝙩𝙞𝙤𝙣 𝙙𝙤𝙚𝙨 𝙣𝙤𝙩 𝙞𝙢𝙥𝙡𝙮 𝙘𝙖𝙪𝙨𝙖𝙩𝙞𝙤𝙣". And when working with data, it's easy to fall into this trap! Even aided by domain knowledge and complex models, it's often tough to disentangle both.
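To make the trap concrete, here is a minimal simulated sketch. The variables `z`, `x`, and `y`, and every number in it, are made up for illustration. A hidden factor drives two variables that never influence each other, yet they end up strongly correlated. Once `x` is set by hand instead, the correlation with `y` all but disappears.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient, written out by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical setup: a hidden factor z drives both x and y; x never affects y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]
observed = pearson(x, y)

# "Intervene" on x: choose its values ourselves, ignoring z. y is unchanged.
x_forced = [rng.gauss(0, 1) for _ in range(n)]
after_intervention = pearson(x_forced, y)

print(f"observed correlation:   {observed:.2f}")
print(f"after forcing x values: {after_intervention:.2f}")
```

In theory the first figure should sit near 0.8 and the second near 0. A model trained on `x` to predict `y` would do well here without `x` causing anything.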
📈 I'm an advocate of 𝐈𝐧𝐭𝐞𝐫𝐩𝐫𝐞𝐭𝐚𝐛𝐥𝐞 𝐌𝐚𝐜𝐡𝐢𝐧𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 (also known as XAI) because, using approximations, it can help us understand models. However, I must accept that the most popular XAI methods rely on correlations, which is a significant limitation.
🔄 The solution to this problem is 𝐜𝐚𝐮𝐬𝐚𝐛𝐢𝐥𝐢𝐭𝐲, which yields a causal explanation rather than a correlation-based one. The authors of a recent paper (Leon Chou, Catarina Moreira, Peter Bruza, Chun Ouyang, and Joaquim Jorge) propose counterfactuals as a means to provide causability.
🤔 Counterfactuals are a good fit because they ask the question, "𝙬𝙝𝙖𝙩 𝙞𝙛?" which comes naturally to us humans and, given some properties, serves as a satisfactory causal explanation. There's a family of counterfactual methods that meet many of the properties. But, unfortunately, in recent years they have been overshadowed by other XAI methods.
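To give a feel for the "what if?" question, here is a minimal sketch of a counterfactual on a toy model. The `approve` rule, its weights, and the applicant are all hypothetical, and the brute-force search is only one simple way to do this, not the method from the paper.

```python
def approve(income, debt):
    """Hypothetical toy model: approve when a linear score reaches 20."""
    return 0.5 * income - 0.8 * debt >= 20

def counterfactual_income(income, debt, step=1, max_steps=1000):
    """Smallest income (raised in `step` increments) that gets approved."""
    for k in range(max_steps + 1):
        candidate = income + k * step
        if approve(candidate, debt):
            return candidate
    return None  # no counterfactual found within the search range

applicant = {"income": 40, "debt": 10}

print(approve(**applicant))                # False: 0.5*40 - 0.8*10 = 12
print(counterfactual_income(**applicant))  # 56: 0.5*56 - 0.8*10 = 20
```

The answer reads as an explanation a person can act on: "had your income been 56 instead of 40, you would have been approved". Note that it describes the model's behavior and is only a causal claim about the world to the extent the model itself is.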
The authors of the paper performed topic modeling and a word co-occurrence analysis on academic research since 2012. The analysis shows a node for each keyword, where size denotes frequency and color the most popular year. While it's good news that the discussion has evolved from machine-centric topics such as pattern recognition to more human-centric ones such as XAI (see 1st figure), there are clear research gaps between causality and XAI, not to mention between counterfactuals and causality (see 2nd figure). Check out their amazing paper for more details. Featured image by: Michal Jarmoluk from Pixabay
#Python is merely a toolbox 🧰
It's a magical bottomless toolbox but a toolbox nonetheless.
Don't get me wrong. Tools are essential, but no tool should define the data science discipline. Tools come and go, but the fundamentals of our discipline don't.
For instance, 𝗰𝗮𝗿𝗽𝗲𝗻𝘁𝗿𝘆 didn't always involve power tools like circular saws, but it has ALWAYS involved 𝘄𝗼𝗼𝗱. So carpenters must first and foremost understand wood. Its many properties such as varieties, strengths, malleability, moisture, and grain. Its limitations and applications. Not to mention the language, diagrams, and math used to discuss wood.
Likewise, the skill every 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝘀𝘁 should have is understanding 𝗱𝗮𝘁𝗮. Its properties, limitations, and applications. Also how to effectively communicate findings to all audiences.
Articles tend to conflate data science tools with skills, and data science books are mostly tool-centric, not 𝗱𝗮𝘁𝗮-𝗰𝗲𝗻𝘁𝗿𝗶𝗰 or 𝗺𝗶𝘀𝘀𝗶𝗼𝗻-𝗰𝗲𝗻𝘁𝗿𝗶𝗰. So it's no wonder I get messages from aspiring data scientists asking me what machine learning library they should learn first, which is tantamount to a novice carpenter playing with a circular saw on their first day! 🤷🏻‍♂️
I recently finished Bill Gates's book on #ClimateChange. It's an urgent topic. And as a data scientist, any book that begins with a KPI of sorts and then spends the rest of the book breaking it down and explaining how to address each part will have me hooked! I applaud how he approaches some challenges. For instance, advocating for nuclear to address growing energy needs.
That being said, it has some disappointing blind spots:
- 💸 𝗜𝗻𝗰𝗲𝗻𝘁𝗶𝘃𝗲𝘀: expects people to be swayed only by lower costs ("green premiums"). If only people were that rational!
- 🐟 𝗢𝗰𝗲𝗮𝗻𝘀: doesn't mention how large-scale fishing operations, not to mention container ships, are destroying the oceans, which sequester massive amounts of greenhouse gases (see 𝘚𝘦𝘢𝘴𝘱𝘪𝘳𝘢𝘤𝘺 on Netflix). He does mention mangroves but underestimates their role.
- 🌳 𝗘𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁𝗮𝗹 𝗱𝗲𝗴𝗿𝗮𝗱𝗮𝘁𝗶𝗼𝗻: a single-KPI approach cannot address how the degradation of other finite resources, already constrained and contaminated, is magnified by climate change while simultaneously contributing to it!
- 🗳️ 𝗣𝗼𝗹𝗶𝘁𝗶𝗰𝘀: no mention of how lobbying has slowed down progress on the climate change front and will continue to do so — unless stopped!