Executives at PyData Global 2022
Write-up by: Ian Ozsvald and Lauren Oldja
License: Creative Commons, By Attribution (CC BY 4.0)
Event date: 2022-12-01
This is the 6th Executives at PyData, a discussion session for anyone in a leadership role who wants to share solutions and discuss issues focused on running successful data science projects. This is the first time we’ve written up the discussion for wider distribution. A set of questions was discussed during a 2-hour Zoom call, and a summary of the resulting discussion is given below. The sessions aren’t exhaustive — we’ve captured a summary of the conversation as it happened. The session was hosted by Ian Ozsvald, Lauren Oldja and special guest Douglas Squirrel — biographies and links are at the end of the document.
The full video is available here!
We thank the organizers of PyData Global and the worldwide PyData community for such an open and welcoming event and, in particular, the chairs for letting us run another Executives at PyData discussion session. Thank you also to the attendees whose active participation generated an insightful discussion. Particular thanks to Sophia and Conrad for your note-taking!
● What tools and processes support a remote-first culture?
● How do you get good projects for your backlog?
● Is the T-shaped team (deep specialism & broad shallow knowledge) the ideal? Is there a better profile for a team?
● How do you get DS to work with other business units?
● Which tools help data science teams?
● In what sort of organizations does DS work well?
What tools and processes support a remote-first culture?
Switching from in-person to remote-first has been a big change for many teams. Different work styles are required, along with new tools. A hybrid culture seems to have settled post-Covid-19, rather than a return to in-person only.
Prioritizing social interactions and open communication is important. Avoid fragmentation by agreeing on a small number of tools; Zoom, Slack, and Atlassian (e.g. Confluence, JIRA, Trello) all garnered honorable mentions. One participant noted that “a [corporate] culture of actually reading and writing is a surprisingly big game-changer and surprisingly rare. I’ve only seen very few orgs where people will write coherent and comprehensive emails and be able to assume that anyone will read them.”
Regular calls supplemented by occasional in-real-life meetings are a reasonable substitute for in-person teams. Regular online meetings, with a goal of clear objective-setting, help to avoid the communication gaps that can otherwise occur. Automations such as GeekBot support stand-ups with asynchronous updates in Slack and Teams.
Onboarding is surprisingly important. If the new hire gets to meet the bosses and has a “buddy” outside their direct reporting structure, they’ll feel closer to the team very quickly and will be more self-sufficient in navigating remote work.
One downside of being remote is when a problem occurs — getting everyone “in the room” to solve a problem is less efficient than having them physically present and focused. No clear solutions were identified, prompting one participant to note that “maybe it is not a realistic target for things to go as smooth as face to face when working remotely.”
How do you get good projects for your backlog?
Not having a good backlog of potential projects is painful for a team, as new projects can take months to find and warm up. This may be less of a problem for mature teams, which may be better served by “throwing out their backlog” when a large backlog prevents them from seeking out higher-value or more strategically aligned projects. Anything that’s really important will soon be re-prioritized, and clearing the backlog relieves the team of a weight on their shoulders.
If you have the luxury of an office where most teams work in-person, it is possible to “walk around” and ask people about their problems. That is, data scientists can build relationships and improve their reputation as “problem solvers” by seeking others out. This is aligned with a model of data science working as an internal consultancy to other teams. While a remote-first culture removes physical co-location, scheduling meetings to ask about pains in different teams is an adequate alternative.
A useful technique is to hold co-creation workshops with potential clients. If one simply asks “what do you want,” the answer is likely to be something like “more reporting” or “a better dashboard”, which potentially reflects a gap in understanding what business value data science solutions can bring. Instead, focus your requirements gathering on building empathy for their pain points or soliciting their big ideas. Try to figure out what’s really valuable to them; an automated report or dashboard may still be the first step towards building rapport with your team.
Given a set of ideas, prioritizing them is very important. Choosing a project just because it is available, but not because it has been derisked and evaluated, is a recipe for a poor-quality project. It is essential to review each possible project and identify if 1) it offers enough value to the business, 2) there’s a set of hungry users, 3) a deployment route exists and, critically, 4) if data is available.
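For illustration only, the four-point screen above could be captured as a simple checklist in code; the class and field names here are hypothetical, not a real tool:

```python
from dataclasses import dataclass

# Hypothetical sketch of the four-point project screen described above.
@dataclass
class ProjectIdea:
    name: str
    business_value: bool    # 1) offers enough value to the business
    hungry_users: bool      # 2) a set of hungry users exists
    deployment_route: bool  # 3) a route to deployment exists
    data_available: bool    # 4) critically, data is available

def is_worth_pursuing(idea: ProjectIdea) -> bool:
    """A project that fails any one criterion stays off the backlog."""
    return all([idea.business_value, idea.hungry_users,
                idea.deployment_route, idea.data_available])
```

A team might extend this with scores rather than booleans to rank, not just filter, candidate projects.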
When is the last time anyone on your team spoke with a customer? If nobody in the team really understands the customer, you’re far too removed. Go talk to the customer and learn their needs and their process to unlock a deeper understanding of the business problem you’re solving. This is particularly important for data science teams embedded in Product teams building product features!
Is the T-shaped data scientist (deep specialism & broad shallow knowledge) the ideal? Is there a better profile for a team?
A T-shaped team, with complementary deep skills, should naturally form around business units. Specializations might include deployment, data access, front-ends, statistics, ML, or project ownership.
From those who had experience working in the T-shape model, there was consensus that it can work well. Coupling a data scientist with a good data engineer is “a joy” — the complementary skills become a force multiplier to help both members achieve a larger goal far more efficiently. The T-shape paradigm can also inform career development trajectories for individuals, as they move from generalist to expertise in set skills verticals, and inform hiring or training needs as gaps in team coverage become apparent.
It is also popular for data scientists to embed in respective Product teams, with the data science team acting as a consultancy of technical peers. If a data scientist is working solo, however, they will need to find ways to get enough of the missing skills to supply value to their organization.
How should Data Scientists work with other business units, de-risk, and show progress?
While embedding individual data scientists into business units may not make full use of the advantages of “T-shape teams” (that is, teams made up of T-shaped data scientists), this model of work does have the advantage of naturally building a backlog of ideas to work on. Being an isolated data science team “searching for work” is likely harder than when you’re embedded in a working unit (such as finance, marketing, or people & culture) or product team and share a team with stakeholders.
If a business unit focuses on “value streams” rather than fixed traditional business units, it can form around problems. (This is a very enlightened view of agile team organization!) These value-stream teams might use data science (or any other specialization) to solve the problems they encounter. Since these teams must focus on the necessary skills to provide value, they’ll naturally form “T-shaped teams.”
Speaking of agile, there is a clear consensus that data projects cannot be delivered in a “waterfall” manner; that is, gathering requirements at the beginning of a project and some months later having the data science team deliver to spec is far too risky for many of the same reasons software is rarely built this way.
One must avoid having data science team members disappear for three months to write a solution. Instead focus on regular client collaboration with regular deployments of new value, starting from whatever’s the easiest thing to deliver to show progress. This unlocks new learning and builds confidence, delivering value while tightening the feedback loop.
If your data science team has a tendency to want to hide away and build something, it may indicate that they would benefit from engaging with a Product Owner (a.k.a. Product Manager). The Product Owner guides the features and sets priorities in the project, and their frequent collaboration with clients and other stakeholders builds accountability and trust that enables projects to run smoothly. They can also protect your team’s developer time from the burden of too many meetings, changing priorities, and scope creep.
Many data science teams lack a dedicated Data Product Owner, leaving them forced either to make product decisions themselves or to be at the whim of changing priorities.
Squirrel noted the “elephant carpaccio” method, where you break huge tasks down into tiny pieces that can each be tackled. It works well if the team can deliver value in very small increments — even weekly or daily.
Sometimes you may need to use different terminology — if “agile teams” are seen as dirty words, try “cross-functional teams” instead. Know your audience!
On Managing Expectations and Communicating Risk
If you’re in a physical office, take over a wall to show results. If you can highlight interesting graphics which include problems (e.g. charts with “red dots showing problems — very visibly”) you’ll get interest from anyone who walks past. If senior people see these problems, they’ll get involved with solving them. If you’re remote it is harder to make and maintain this sort of passive report, but it is still possible, such as by keeping some simple graphics up-to-date in Slack or Confluence. These walls, whether physical or virtual, serve to both showcase team wins and communicate risk.
Data Scientists who act as their own product or project managers must be mindful of “building a fence” to manage expectations without offending key stakeholders. Rather than saying “no” to new or changing requirements, consider using the improvisational theatre technique of “yes, and” to turn blockers and problems into discussions that can help conversations avoid being stuck.
An example might be “we now need this new tool, although not finished, to be deployed next month”, with a “yes, and that means our prioritization will have to change significantly — what other projects should be deprioritized as a consequence?” This is an efficient path to positive conflict.
If management has set bad priorities or under-resourced a team, it is very acceptable to point this out. It is ok to “be the bad person and pour water on the fire” if the project is at risk; it is in fact your professional obligation.
Often the senior people haven’t realized the consequence of their decisions because they’re remote from the details — if you find the right stakeholder, tell them that there’s a problem and help them to fix it. Normally there’s someone in the hierarchy who has the resources to fix the problem that otherwise will turn into their failure.
Finally, you simply do not have a data science project until you have access to the data. Make sure this is in the requirements of your project plan.
Which tools help data science teams?
Participants gave honorable mention to the following tools and resources:
● https://greatexpectations.io/ for data stability
● https://pandera.readthedocs.io/ for stable Pandas schemas
● https://pypi.org/project/pandas-profiling/ for early exploration
● https://career-ladders.dev/ lists example career descriptions for technical teams
● https://basecamp.com/shapeup an alternative to choosing between waterfall and scrum
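As a taste of what the validation tools above automate, here is a hand-rolled sketch of the kind of schema check pandera or Great Expectations would express declaratively; the column names and rules here are hypothetical:

```python
import pandas as pd

# Hypothetical orders data, the kind of frame a team might validate
# before it feeds downstream models or reports.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [9.99, 24.50, 3.10],
    "country": ["GB", "US", "DE"],
})

def validate_orders(frame: pd.DataFrame) -> pd.DataFrame:
    """Hand-rolled checks; pandera and Great Expectations automate,
    extend, and report on this pattern."""
    missing = {"order_id", "amount", "country"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not frame["order_id"].is_unique:
        raise ValueError("order_id must be unique")
    if (frame["amount"] < 0).any():
        raise ValueError("amount must be non-negative")
    return frame

validated = validate_orders(df)
```

Catching a schema break at the boundary like this is far cheaper than debugging a silently wrong model downstream.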
Are there examples of organizations where data science works well?
Any organization that turns data directly into money (e.g. hedge funds, sports betting firms) is an obvious candidate for mature, proven methods and processes. Organizations with a longer feedback loop between using the data and making money tend to be harder candidates. Older organizations with diverse data systems, which haven’t needed to use their data for business decisions, tend to be worse choices for data science, as the data often needs much work to yield value.
Any organization with low-hanging fruit such as process automation (e.g. replacing human-generated Excel sheets with machine-produced Excel sheets) is a good candidate for a young team. Whilst the work isn’t cutting edge, it will quickly provide value to an organization, which will open new opportunities. Automating paper-form transcription with OCR will streamline a process and again offers new value. Taking it to the next level by running it as a scheduled task, while simple, can be a forcing mechanism for building the infrastructure that can run your team’s first ML models.
Sometimes a tool such as an ML system isn’t useful in itself, but taking the explanations from the system will enable business to change. An example was given of a churn predictor that wasn’t used to reduce churn directly, but where the top-features and their interactions were used to gain greater understanding about the levers that might help the business to improve its process.
While data science remains a recent field that continues to progress rapidly, best practices do exist and have begun to coalesce, including steps to de-risk projects and deliver high-value results.
Keeping a focus on the task that needs to be solved, and on the business changes required to enable positive change, is critical. Knowing the metrics that engage your business colleagues will keep the team aligned on what matters to the business. Delivering value in frequent incremental steps will help to build confidence amongst the stakeholders and will provide critical feedback to the DS team about the problem they’re solving.
PyData Global 2022
● Executives at PyData — https://global2022.pydata.org/cfp/talk/DP7GJC/ — 2 hour video
● Data Science Project Patterns that Work — https://global2022.pydata.org/cfp/talk/9GYEJB/ | https://youtu.be/pPgic2V7oWg?t=19875
● The 10 Commandments of Reliable Data Science — https://global2022.pydata.org/cfp/talk/VKEWPE/ | https://youtu.be/7uKc8RsZgR8?t=14809
● Steering a Data Science Project — https://global2022.pydata.org/cfp/talk/HDNA9X/ |
PyData London 2022
PyData Global 2021
● Building a Data-Driven Product from Scratch, How Hard Can It Be? — https://pydata.org/global2021/schedule/presentation/217/building-a-data-driven-product-from-scratch-how-hard-can-it-be/ |
PyData NYC 2019
● Managing Stakeholders: The Key to Successful Data Science for Business — https://pydata.org/nyc2019/schedule/presentation/51/managing-stakeholders-the-key-to-a-successful-data-science-project/ | https://www.youtube.com/watch?v=gTxdF3nA70A
Ian is a Chief Data Scientist, strategic advisor to data science teams, co-founder of the PyData London meetup and conference series, author of the Successful Data Science Projects course and several PyData conference talks, and author of O’Reilly’s High Performance Python (2nd ed).
Join Ian’s newsletter (https://notanumber.email/) for ideas on getting to success and higher performance Python. Ian can be found at:
Lauren is Principal Data Scientist on the newly minted data science team at Bonterra, a lead organiser for the NYC Python meetup, and three-time Chair of the PyData NYC conference.
Squirrel is looking to invite executives (all execs, CEOs to CFOs to CTOs) to learn about making their data science and tech teams more profitable by joining the Squirrel Squadron at https://squirrelsquadron.com. For a free book on management tools send an email to Squirrel at email@example.com.
Squirrel has been coding for forty years and has led software teams for twenty. He uses the power of conversations to create dramatic productivity gains in technology organizations of all sizes. Squirrel’s experience includes growing software teams as a CTO in startups from fintech to biotech to music, and everything in between; consulting on product improvement at over 175 organizations in the UK, US, Australia, and Europe; and coaching a wide variety of leaders in improving their conversations, aligning to business goals, and creating productive conflict. He lives in Frogholt, England, in a timber-framed cottage built in the year 1450.
The mission of NumFOCUS is to promote open practices in research, data, and scientific computing by serving as a fiscal sponsor for open source projects and organising community-driven educational programs. NumFOCUS is a 501(c)(3) public charity in the United States.
NumFOCUS provides a stable, independent, and professional home for the open source projects powering contemporary scientific inquiry and business processes. We aim to ensure that funding and resources are available to sustain projects in the scientific data stack over the long haul.