Artsy Engineering Radio

How Data and Engineering Work Together At Artsy

October 14, 2021 Artsy Engineering Season 1 Episode 38

At Artsy, the Data and Engineering teams work together really closely. Listen as Abhiti Prabahar and Jon Allured chat about this collaboration and how it has evolved over time. Also find out: what even is a data pipeline?

Jon Allured:

Hi, everyone, this is Jon. Welcome to another episode of Artsy Engineering Radio. I'm an engineer on the Grow team, and I'm joined today by Abhiti. Please introduce yourself.

Abhiti Prabahar:

Hi, my name is Abhiti. I'm a senior data analyst here at Artsy. I've been here for around three years, and I'm based in the New York office.

Jon Allured:

You're actually in the office today. Yes, very exciting. So we work together on the Grow team, which is a product team here at Artsy focused on the top of the funnel. We've had episodes where we talk about how we're kind of focused on SEO, and we're also interested in signup, retaining users, and some of those onboarding flows. But I wanted to focus today on how the data and engineering teams work together and learn a bit more about that. So maybe just start by talking more about your role here at Artsy, and how you found Artsy. How did you come to work here?

Abhiti Prabahar:

Yeah. So before Artsy, I was at another startup called Civis Analytics, based in Chicago, and I was actually on the data engineering team there. I was hired for a different position, an analyst position, and then immediately switched to the data engineering position, and was very nervous because I did not want to be doing engineering, I thought at the time. But I ended up being there for a little over a year and realized that, first of all, I learned so much and was really able to develop my technical skills, which is what I wanted to do. But then I kind of realized I wanted to be working with data more strategically; I wanted to get more involved with product strategy and business strategy and stuff like that. So I started looking for data analyst positions. I knew I wanted to be in New York, and I wanted to work for a mission-driven company. So I literally just googled "data analyst jobs NYC" and found Artsy at the top of the Google jobs board, which I did not know existed. Yeah, I had no idea that Google Jobs was a thing. But I just read the role, and I think in the data field over the past few years, and even now, there are a lot of buzzwords; the data field has really blown up. Something that really stuck out to me about the data analyst position at Artsy was that, reading through it, there were none of those buzzwords. It just seemed very down to earth, and it aligned with the kind of data work I wanted to be doing. So that really stuck out to me, and I still tell Ani, the team lead now, about how well written the job listing was. So I applied, interviewed, and I've been here since. It's been really awesome. I've learned a lot, including a lot about the art industry, and it's been really cool to work with everyone here.

Jon Allured:

It is a fun thing about working at Artsy that you learn our market and our industry stuff. But you've seen a lot of evolutions for our data team, and we can talk about that, but maybe we can focus on where we are today. So I mentioned that we work together on the Grow team, but that's only kind of one of your teams, because you also self-identify as a data person. You're on the data team the way I'm on the engineering team. So anyway, maybe you could talk more about that division of your time.

Abhiti Prabahar:

Yeah, no, totally. The data team, the way we've been structured since I joined, is we have our core central data team, which is within the Product, Data, Design, Engineering org, so we're technically with the product org. Our core central team has its own ceremonies: we have sprint planning, we have standup, we have knowledge shares and retros and stuff like that. What that allows is for all of us to make sure, first of all, that we're sharing all of our work with each other and that we're all using the same processes. Yeah, cross-pollinating, but I think it also helps standardize the data work that comes from our team, so you kind of know what you're getting into if you work with the data team in some way. So I think that's really nice, too, working together in that way. But then each of us, like you said, Jon, works with a different product team, and then a different business team. The actual format is that there are seven analysts, but one is the team lead, Ani, so there are six of us, with one of us on each product team. And so we don't necessarily go to all of the product team ceremonies, because we have our own ceremonies, but we go to the ones where we think we can contribute and help support. And then each product team works with a business team, and so we also work very closely with that business team, both on product-related things and non-product-related things. So for example, I'm on Grow with Jon, but I also work with marketing very, very closely.

Jon Allured:

Yeah. And in the context of the data work there, it's analysis, triggering events and following them through the lifecycle of a user's journey or whatever. Exactly. Yeah, cool. So like we talked about, there have been evolutions. One big change is the emphasis we've put on the data team in general. I think we once only really had one data engineer; that's way different these days, when we have an entire data engineering team. So I wonder if you could talk about that a little bit.

Abhiti Prabahar:

Yeah, that's been really interesting. So when I first joined, our team lead at the time actually was, I think, technically our data engineer. Yeah, I think so. He did a lot of the data engineering work at that time. But one other thing is, we have a huge data pipeline that runs every night that powers basically all of our analysis. When I first joined, that was completely, 100% owned by the data team. So we had to know a little bit of data engineering to get around the pipeline and be able to debug it and contribute to parts of it. That has definitely changed since I first joined, with the addition of more data engineers and the creation of the data platform team. In the beginning, anytime the pipeline failed, we were basically on call, and we'd have to respond to it and figure out why it was broken. We estimated, I don't know, maybe 20-30% of our time went towards the data pipeline, which isn't a great use of our skills, because none of us really has a data engineering background. I had one year of work experience, but I didn't know how to write Ruby and things like that. So it was very difficult for us to be anything more than reactive with the pipeline. For a long time we wanted a whole data engineering team so that they could work more proactively on the pipeline and, you know, cut down its build time and all that kind of stuff. So then a year ago, we hired our first actual full-time data engineer, and that was really awesome. They worked on migrating our pipeline, and ever since then we've hired more and more data engineers, who we're now trying to transfer ownership of the pipeline to. I spend very little time on the pipeline now, which is very exciting. But also, I do miss it, because, yeah, I kind of was the data engineer once.

Jon Allured:

Cool. So I'm gonna try to play the naive role here: how can we define a data pipeline for people? Let's try to construct some kind of map. What are the parts of our data pipeline? What does it mean when we say "our data pipeline"?

Abhiti Prabahar:

So a data pipeline, I think, is usually described with the steps they call ELT or ETL: extract, load, and transform. The idea is that, basically, you're just trying to get data from one place to another. Usually it's not actually one to one; it's data from many different places to many other different places, which is what our pipeline is. But yeah, it goes through this ELT process. The way our pipeline works is we first extract a bunch of data from a bunch of our systems, like our web systems and our apps, and all of our internal databases and stuff. We extract that into S3, and then we load that into Redshift. Then we transform all of that data, and that's where we as the data team are super involved, because that's thousands and thousands of lines of SQL code, where we're transforming the data to get it into a format that's more digestible and easier to analyze, both by us and by people at the company through the BI tool we use, which is Looker. So then it's transformed and kept in Redshift, and we expose that in Looker. Looker is where most people at the company go to query the data, visualize it, and do some analysis. So yeah, that's how the pipeline roughly works.
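
To make the shape of that concrete, here's a minimal sketch of one hop through such a pipeline in Ruby (the conversation mentions the pipeline involves Ruby and SQL), assuming the aws-sdk-s3 and pg gems; the bucket, cluster, file, table, and IAM role names are all invented for illustration, not Artsy's actual setup:

```ruby
require "aws-sdk-s3"
require "pg"

# Extract: dump a batch of records from a source system and stage it in S3.
s3 = Aws::S3::Client.new(region: "us-east-1")
s3.put_object(
  bucket: "example-data-staging",          # hypothetical bucket
  key: "extracts/users/2021-10-14.csv",
  body: File.read("users_extract.csv")     # hypothetical extract file
)

# Load: Redshift's COPY command pulls the staged file into a raw table.
redshift = PG.connect(host: "example-cluster.example.com", dbname: "warehouse", user: "etl")
redshift.exec(<<~SQL)
  COPY raw.users
  FROM 's3://example-data-staging/extracts/users/2021-10-14.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
  CSV IGNOREHEADER 1;
SQL

# Transform: SQL reshapes the raw data into something easier to analyze,
# which is the layer the data team owns in thousands of lines of SQL.
redshift.exec(<<~SQL)
  CREATE TABLE analytics.users AS
  SELECT id, email, created_at::date AS signup_date
  FROM raw.users;
SQL
```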

Jon Allured:

Yeah, and in that extract phase, it's not just our systems, it's also vendors. We have some vendors, like Braze, which is our email service provider. So there are still ways to make it even more complicated and get data in there from outside of, strictly speaking, our systems.

Abhiti Prabahar:

Yes. So apart from our systems, I think the other biggest category is marketing tools, like Braze, which, like you just said, is our email service provider, where Jon and I worked together very closely on migrating from our old one to Braze. In that case, we are both sending data, transformed data from our pipeline or just data from our front-end events, to those tools, and also getting data from those marketing tools into our pipeline. To talk about the latter first: the reason we want data from those tools in our pipeline is that we want that performance data. We want to know how campaigns did in Braze, how email campaigns or push notification campaigns performed, and things like that. So that's why we extract data from those tools.

Jon Allured:

Some of the work we talked about here is in service of helping decision makers make decisions, you know, surfacing data in a way that is easy to explore and slice and dice. I think one of the things that y'all do, too, is some care and feeding and maintenance of dashboards. So I wonder if you could talk about: what are dashboards, and how have we figured out how to make the best use of them?

Abhiti Prabahar:

Yeah, that's been a really interesting evolution, also in the data culture of Artsy. Like I said, we use Looker, which is an amazing tool because it really allows for a self-service data culture. The way it works is it's point and click: it lets you just choose the variables you want and what you want to graph, and then it crafts the SQL query and visualizes it, and you can edit the visualization any way you want. It's really cool; I love Looker. But what that also created, when we first launched Looker, which was before I joined, was that people were just querying everything, right? The world is your oyster when it comes to Looker. So you get a lot of questions like, why doesn't this number here match this number over here? And it's like, oh, because you're actually pulling it slightly incorrectly. It's not their fault; it's just not super clear in Looker. We had a lot of those problems for many years. But I think it was also good, because it scaled data-driven decision making across the org; it was good that people got to explore with Looker. What we realized, I guess in this past year, was that we needed more centralized data and metrics. So what we decided, I think at the beginning of this year, was to create these centralized dashboards that are data certified. We have our level one dashboard, which is just the top company-wide metrics, pulled exactly the right way: these are the correct numbers. If you ever want to learn more, you should go from the link in the dashboard and explore from there; don't try to recreate it. That just makes sure that, for example, Mike, our CEO, isn't hearing two different numbers for a top company-wide KPI, because that's not great. So we had that initiative, and it's been working really well. It's been really good for helping focus the metrics of the company as well.

Jon Allured:

Maybe the questions we want to answer from these dashboards change over time. So do you see it as something that you'll continue to tinker with? If, let's say, there's a new metric you care about, might we see it added to a dashboard?

Abhiti Prabahar:

Yeah, for sure. The way we communicated the dashboards out was: if you find that your team is starting to go off of something else, or you find something actually incorrect, then definitely let us know, and we'll think through whether it makes sense to strategically modify it, and things like that. I don't know how often we're changing it, but it's definitely an ongoing process; it's not static at all. Sometimes we'll have larger rollouts. Actually, we just rolled out a new, slightly refined metric, and the way we did that was we just changed it and then communicated to the team: hey, this might look a little different; it's because we changed it, and this is the reason why.

Jon Allured:

Yeah, okay. So maybe another way we can describe our data pipeline is by the kinds of tables, the kinds of items, that are in it. Just as a way to talk about it: the data pipeline would have a users table, and it would have an artworks table, and it would have, you know, partners, and it would have the same kind of structure that our main API has for, you know, its Rails Mongo database. It's going to be a superset of that, plus our e-commerce Postgres database's schema; these things will all sort of ladder into each other. And so I was thinking about that transform step, where you're getting it in, and maybe you're going to prefix a table with the service name it comes from. You're shaking your head. But you also do enhancements, right? We take in these inputs, and then someone in the business layer has said that this thing should be true, or whatever. So I'm curious how you think about the way enhancement goes down, and how enhancement is consumed back into the system?

Abhiti Prabahar:

So yeah, I think the users table is a great example. Obviously we want Gravity, our internal database, to be kind of the source of truth on who our users are at a given point in time. But like you said, there's only a set number of fields on there, and we on the data team, and at the company, want to know more about a user. We want to know how many page views they've had, how many times they've inquired on an artwork, all these different things, what we call roll-ups. So what we do is we pull directly from the Gravity users table, the internal user database, and then, like you said, enhance that data with a bunch of roll-up data that we've created through that transform stuff I described earlier. So the users table is one of those tables that has a ton of dependencies: you have to wait for so many roll-ups to be built before we're able to join back onto it. But yeah, that's roughly how it works.
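
As an illustration of the kind of roll-up she's describing, here's a hedged sketch of the transform step, with invented schema, table, and column names, that builds a page-view roll-up and joins it back onto a users table:

```ruby
require "pg"

# Hypothetical warehouse connection; all names below are invented for illustration.
redshift = PG.connect(host: "example-cluster.example.com", dbname: "warehouse", user: "etl")

# Build a roll-up: one row per user, counting their page-view events.
redshift.exec(<<~SQL)
  CREATE TABLE rollups.user_page_views AS
  SELECT user_id, COUNT(*) AS page_view_count
  FROM events.page_views
  GROUP BY user_id;
SQL

# Enhance the users table by joining the roll-up back on. A real users table
# waits on many such roll-ups, which is why it has so many dependencies.
redshift.exec(<<~SQL)
  CREATE TABLE analytics.users_enhanced AS
  SELECT u.*, COALESCE(r.page_view_count, 0) AS page_view_count
  FROM raw.users u
  LEFT JOIN rollups.user_page_views r ON r.user_id = u.id;
SQL
```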

Jon Allured:

Yeah. Something else you just made me think of: you talked about waiting, so there is some latency in our data pipeline. I think the data is rebuilt daily. Is that still accurate?

Abhiti Prabahar:

Every night, yeah.

Jon Allured:

So a user signs up. It's 3:39 on my clock here. I wouldn't see that user in Looker, and I wouldn't see any of their activities in our various enhancements or roll-ups or anything, until the next day. Is that right?

Abhiti Prabahar:

Correct. Yeah, so we do have a day lag, because our pipeline actually takes, I think, 12 hours now. It's kicked off in the evening, East Coast time, and then builds, and it's usually ready by about 10am Eastern Time the next morning. So it can only have as much data as was available before the extract began. So yeah, usually we just tell people it's a one-day lag.

Jon Allured:

Gotcha. So there's that kind of data, the users table with a person's first and last name and all kinds of other things you would imagine. But another big part of our data pipeline is events: consuming events, specifying the schema of our events. So maybe you could talk a little bit about how we approach this problem, keeping them all in step, and some of those efforts.

Abhiti Prabahar:

Yeah, so one of the things we work on with product teams is instrumenting events, front-end events on the website and the apps. Those are things like page-view events, click events, and tap events, and they help us understand how users are literally using a feature, which is super helpful for us to learn from and then iterate on. It can be kind of a tedious process to add new events, and I don't think engineers really enjoy implementing them either, but it's super important, obviously, to our pipeline and to our analysis efforts. The way it usually works is the team will be working on a new feature, and someone on the team will ping their team's analyst and ask, you know, what events do we want? It's not a perfect science, but we usually just think through what questions we're going to want to answer and how we'll know if it's a success, and work backwards from that, exactly, because if you try to implement every single event, it becomes crazy. That's definitely what I did at the beginning when I first joined Artsy, because, yeah, I like to be thorough, but this is not a case in which you want to be. You want to be very smart about it, because it's also extra work for engineering to implement the events. Because we've been doing it for so long, we do have generally good standards for what events should look like. But we also introduced a new tool, I think last year, called Cohesion, where we try to standardize the events that are going in. It also helps enforce the schema, because we saw a lot of issues where the properties or the events weren't what we expected them to be. So that has helped as well, because everything is code reviewed. And then, what's the word?

Jon Allured:

Code that, like, self-documents?

Abhiti Prabahar:

Yeah, it's all documented, exactly. If you have to add a new event, you can look there and be like, okay, this is a very similar event, and it kind of reduces the burden of having to think through, okay, these are all the properties I need to add. So that has helped a lot with reducing some of the burden of adding new events. But in general, yeah, it's still a little bit of a tedious process, but one that must be done.
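
The conversation doesn't spell out Cohesion's actual API, but the idea of schema-enforced events can be sketched in a few lines of Ruby; the event name, required properties, and helper here are all hypothetical:

```ruby
# Hypothetical sketch of schema-enforced analytics events; not Cohesion's API.
# Each event declares the properties it requires, and the tracker rejects
# payloads that don't match, so analysts always get the shape they expect.
EVENT_SCHEMAS = {
  "tappedArtwork" => %i[context_module context_screen artwork_id], # invented schema
}.freeze

def track_event(name, properties)
  required = EVENT_SCHEMAS.fetch(name) { raise ArgumentError, "unknown event: #{name}" }
  missing = required - properties.keys
  raise ArgumentError, "#{name} missing: #{missing.join(', ')}" unless missing.empty?
  # Hand the validated payload off to the analytics client (e.g. Segment).
end

track_event("tappedArtwork", context_module: "artworkGrid",
                             context_screen: "home",
                             artwork_id: "abc123")
```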

Jon Allured:

Yeah. So typically, as we're working on a new feature, we're, like you say, communicating with our data analysts and trying to understand: what does success look like? Let's work backwards into event types. Do we have them? Do we need to add them? And then these events are going through Segment, correct? Can you talk about the role Segment plays in this?

Abhiti Prabahar:

Yeah, exactly. So Segment is the tool we use that basically connects all of our data sources to all of our data destinations. Segment is what we use to fire those events we were just talking about, and then we can send the events we need from the sources to all the destinations. The events that we instrument on the website and the apps all go to Redshift, because in that transformation stuff I was talking about earlier, we take both the data from the back-end databases and the front-end events that Segment is firing, and that's where we're transforming all of that. That's how we get page-view events and the page-view data. So we send that data to Redshift, but we also send it to all these marketing integrations. So we send to Braze, Google Analytics, and, yeah, I can't think of more right now.
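
For a concrete picture of what "firing an event through Segment" looks like, here's a minimal sketch using Segment's server-side Ruby SDK (the analytics-ruby gem); Artsy's events actually come from the web and apps, and the write key, event name, and properties below are invented:

```ruby
require "segment/analytics"

# One client per process; the write key identifies the Segment source.
analytics = Segment::Analytics.new(write_key: "YOUR_WRITE_KEY")

# A single track call fans out to every destination configured in Segment:
# Redshift, Braze, Google Analytics, ad platforms, and so on.
analytics.track(
  user_id: "user-123",                     # invented user ID
  event: "Viewed Artwork",                 # invented event name
  properties: { artwork_id: "abc123", context_module: "artworkGrid" }
)

analytics.flush # events are batched, so flush before the process exits
```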

Jon Allured:

Even some ad platforms, yeah. So if the marketing team is spending dollars on a thing, and there's some way we can connect that ad spend to, maybe, commercial activity or other kinds of activity, whatever, then we can kind of complete that feedback loop and make sure we're spending money on good ads.

Abhiti Prabahar:

Exactly, yeah. So the biggest thing is with marketing campaigns: usually, to optimize a campaign, you need feedback from our websites to see what the user actually interacted with. So we're sending those events to those platforms, and then they can optimize based off the data we're sending them. Exactly.

Jon Allured:

So Abhiti, thanks so much for joining me and talking about all things data. Hopefully this gives people a better sense of how things have changed here, where we are today, what even is a data pipeline, and some of the choices we've made. Thanks so much for speaking with me.

Abhiti Prabahar:

Yeah, this was great. Thank you for having me.

Jon Allured:

And that's it for this episode of Artsy Engineering Radio. We'll catch you next time. Thanks for listening. You can follow us on Twitter at @ArtsyOpenSource and keep up with our blog at artsy.github.io. This episode was produced by Asia Simpson. Thank you to Eve Essex for our theme music; you can find her on all major streaming platforms. Until next time, this is Artsy Engineering Radio.