Get the most out of Microsoft Fabric as a Data Scientist (Public Preview)

– [Both] Yooo! – What's up? I'm Adam. – And I'm Nellie. – And I'm so happy Nellie
is here with us today. We're gonna talk about data science as it relates to Microsoft Fabric. Nellie, thank you so much
for joining us today. Tell us a little bit about who you are and like the team that you work on. – My name is Nellie Gustafsson. I'm a product manager in
the Microsoft Fabric team, working specifically on
Synapse data science, and I lead these Synapse experiences.

All right, Nellie, can you explain to me what does Microsoft Fabric bring to the table for data scientists? – First and foremost, in Microsoft Fabric, we're bringing data scientists
into the same platform as the other team players, and this has a bunch of advantages. You kind of need data for data science, so it makes it super convenient for data science teams to be able to work on top of the same
security and governed data. Data scientists are ultimately developers, and we are bringing in a lot of enhanced developer experiences. Like we have notebooks, you
can do a lot of code authoring. We're just trying to make
it easier for developers to be able to do their work
and automate their work. And, of course, for data science, you also need kind of
machine learning tools, so we bring a lot to the
table when it comes to, for example, model and
experiment tracking. And we're gonna talk about
that in a little bit. – All right, Nellie,
enough of all the talking, you know what we like to do
here at Guy in a Cube.

Let's do what? Let's head
over to your machine. – [Nellie] We are in the
data science persona. So at the top here, you
can see different items. First, for example, you can see that you can work with
machine learning models. – [Adam] Yeah, not a data set. That's a different model. – [Nellie] Yes, I agree. The name is a little bit overloaded. We may do some name changes. You typically develop these from code. That's the most natural way. So you see here that you
have model and experiment. We also have the notebook, and that's where you author the code. But what a lot of people don't
actually know is that a model in itself is more
like a container of stuff that you can put in it, like an empty bag. Let's say you're taking a
trip to the grocery store, and you're writing up stuff, right? And then you're kind of like, okay, so I got this, and you check it off. So it's just a way for you to like as you're experimenting or testing things or as you're completing steps, you basically know, oh,
I already got this one, or, you know what? Actually, you know, this
didn't work out for me.

So I don't know if the grocery store list
actually is a good analogy. But, you know, let's say I
actually want to jump in, and I want to create a
machine learning model. We could actually give it
a name yo-model, right? (Adam laughs) That's a good name. So you can actually start off
with a template, for example. It actually just creates a dummy notebook, and then you can see there, oh, I need to use MLflow when I create models and experiments.

But typically with data science, you also need to solve a problem and- – You have a theory or a hypothesis, and you're trying to prove it. – Exactly, yes, it's science. You have something to test and
then you create an experiment and then you can try different iterations, and they become runs under the experiment. We'll go through that in a bit. So this notebook was just an example. I wanted to actually walk
you through a scenario where you can just go ahead and see, hey, can we solve a problem? So in this case, we have a
bunch of New York taxi data. Before the trip starts, we want to predict the
duration of the trip. You see here that I have
the lakehouse attached. This is the lakehouse that data engineering teams have prepped. – [Adam] And I'll tie this back to some other videos we've had talking about the lakehouse and other items. Once the data's in the
lakehouse and in OneLake, it can be reused for things
like data science, right? So we're not copying the data somewhere, having to rewire things up.

We're just using the data
that's already there. – Exactly. You know, I'm not gonna go through every detail in this, but eventually you want to get
to the machine learning model so that you can run it and
get the predicted values and then analyze them. But before you get there, there's a bunch of steps
that data science teams have to do in terms of preprocessing the data beyond what the data engineers did. So let's load some of
the data that we have.
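In a Fabric notebook you would read the attached lakehouse table with Spark and convert it to pandas; the commented table name below is an assumption, and a tiny synthetic frame stands in for the real millions of rows so the sketch is self-contained. It also shows the label being predicted: trip duration, computable from pickup and dropoff times.

```python
import pandas as pd

# In a Fabric notebook you'd read the prepped lakehouse table with Spark, e.g.:
#   df = spark.read.table("nyctaxi_prepped").toPandas()   # table name assumed
# A tiny synthetic frame stands in for it here:
df = pd.DataFrame({
    "pickup_datetime":  pd.to_datetime(["2023-05-01 08:00", "2023-05-01 09:15"]),
    "dropoff_datetime": pd.to_datetime(["2023-05-01 08:25", "2023-05-01 09:47"]),
    "trip_distance":    [3.1, 4.7],
})

# The label we want to predict before the trip starts: duration in minutes.
df["trip_duration_min"] = (
    (df["dropoff_datetime"] - df["pickup_datetime"]).dt.total_seconds() / 60
)
print(df["trip_duration_min"].tolist())  # → [25.0, 32.0]
```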

I think it's like millions
of records that are actually sitting in the lakehouse. – [Adam] Oh, it's baby data. – [Nellie] I'm loading it
into a pandas DataFrame because we're working with Python here. So I mentioned that data
prep is pretty tedious. And what I've done is that I actually navigated to the data tab and I'm gonna launch an experience that we're introducing
called Data Wrangler. It's actually gonna go
through your notebook. It's like, oh, you have
a lot of data frames, let me help you list them. And then you can click on one, and you can actually open it up in a grid. And now you can just apply
a bunch of operations here like low-code experiences, and then it's gonna generate code for you.

– [Adam] I got an initial Power Query vibe when you pulled this up,
but it's not Power Query.
towards me as a user. Something we're actually working
on is to really make sure that we align a lot of these operations, like how we name them. You can probably do the same
things, but here I can analyze.

I can see, you know, do I have any missing values
in any of the columns? Maybe I want to take
one of the columns here. I want to say, you know what? I want to drop the missing values. For example, on this
column you see here that, oh, I got some code, and
if I want to apply this, I can do that.
– [Adam] Nice. – [Nellie] And I can also
just add this code cell back to my notebook. So let's dive back into
the problem we're solving. So you can actually say, hey, I want to export
the code to a notebook. In this case, you know-
– [Adam] We don't. – [Nellie] I don't want to do that. The next step is basically
you're gonna explore your data, visualize your data. Python users like to do
this with Python libraries. So you can basically install
third-party libraries.
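In a notebook cell that inline install is a `%pip` magic (for example `%pip install seaborn`), followed by a plot. A minimal matplotlib sketch of the visualization step, with made-up durations and a headless backend so it runs outside a notebook:

```python
# In a notebook cell you'd install a third-party library inline first, e.g.:
#   %pip install seaborn
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside a notebook
import matplotlib.pyplot as plt

trip_durations = [12, 25, 8, 31, 17, 22, 9, 40]  # made-up minutes

# A quick look at the distribution of the target we want to predict.
fig, ax = plt.subplots()
ax.hist(trip_durations, bins=5)
ax.set_xlabel("trip duration (min)")
ax.set_ylabel("trips")
fig.savefig("durations.png")
```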
