{"id":5154,"date":"2022-09-14T03:35:02","date_gmt":"2022-09-13T20:35:02","guid":{"rendered":"https:\/\/fazz.com\/?p=5154"},"modified":"2024-02-21T12:45:31","modified_gmt":"2024-02-21T05:45:31","slug":"a-suite-of-data-tools-for-beginners","status":"publish","type":"post","link":"https:\/\/fazz.com\/newsroom\/company\/a-suite-of-data-tools-for-beginners\/","title":{"rendered":"A Suite of Data Tools for Beginners"},"content":{"rendered":"\n
I have too many tools, said no woodworker ever! <\/p>Lisa Stronzi<\/cite><\/blockquote>\n\n\n\n
This article will share the tools used by Data Scientists in our daily work, and explore how they can also be helpful for other teams. Because Python is where we do most of our work, the tools we are using are also mainly collections of Python libraries.<\/p>\n\n\n\n
What does a Data Scientist do?<\/h2>\n\n\n\n
\n\n
- Data Acquisition and Exploration<\/li>
- Model Development<\/li>
- Model Deployment<\/li>
- Monitoring<\/li><\/ul>\n<\/div>\n\n\n\n
\nLife of a Data Scientist, day in – day out.<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n We will share some of the tools for the first 2 steps (Data Acquisition and Exploration, and Model Development) in this article.<\/p>\n\n\n\n
How to get these tasks done efficiently?<\/h2>\n\n\n\n
Below are the Python tools that help us work efficiently and offer a wide range of customisation.<\/p>\n\n\n\n
Source : <\/a>https:\/\/numpy.org\/<\/a><\/figcaption><\/figure>\n\n\n\n Before sharing about Python and its libraries, we will first go through SQL and its use for analysis and predictive modelling.<\/p>\n\n\n\n
SQL<\/code> <\/p>\n\n\n\n
- SQL is a standard language for storing, manipulating and retrieving data in databases. SQL can be used in database systems like MySQL, SQL Server, MS Access, Postgres, as well as data warehouse systems like BigQuery and Redshift. <\/li><\/ul>\n\n\n\n
<\/figure>\n\n\n\n
- A good data scientist must be able to retrieve and work with data, and that requires being well versed in SQL, the standard language for communicating with database systems. Every organisation has a database, and to read and retrieve data from it for data modelling we use SQL, either directly or by querying from Python or other tools.<\/li><\/ul>\n\n\n\n
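As a minimal sketch of querying with SQL from Python, the snippet below uses the sqlite3 module from the standard library with an in-memory database. The payments table, its columns and its values are made up purely for illustration; a real project would connect to its own database or warehouse instead.

```python
import sqlite3

# In-memory SQLite database so the sketch runs without any external server.
conn = sqlite3.connect(':memory:')

# Hypothetical table and rows, invented for this example only.
conn.execute('CREATE TABLE payments (merchant TEXT, amount REAL)')
conn.executemany(
    'INSERT INTO payments (merchant, amount) VALUES (?, ?)',
    [('alpha', 120.0), ('beta', 80.0), ('alpha', 30.0)],
)

# Aggregate in SQL, then pull the result into Python for further analysis.
query = '''
    SELECT merchant, SUM(amount) AS total_amount
    FROM payments
    GROUP BY merchant
    ORDER BY merchant
'''
rows = conn.execute(query).fetchall()
conn.close()

for merchant, total_amount in rows:
    print(merchant, total_amount)
```

The same query text could be sent to other database systems through their own Python drivers; only the connection step would differ.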
General Tools<\/h3>\n\n\n\n
About Python and its libraries<\/p>\n\n\n\n
jupyter notebook
<\/code><\/p>\n\n\n\n
- The name \u201cJupyter\u201d comes from the core programming languages it supports: Julia, Python, and R. \u201cNotebook\u201d denotes a document that contains both code and rich text elements. Because it mixes the two, it\u2019s a great place to keep interactive analysis code and its output together.<\/li><\/ul>\n\n\n\n
How to install jupyter notebook<\/p>\n\n\n\n
- Using Anaconda:<\/strong> Install Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science. <\/li>
- Using pip:<\/strong> Install Jupyter using pip, the package manager<\/strong> used to install and manage software packages\/libraries written in Python. – When using
Jupyter<\/code>, we can choose to execute individual cells rather than run the whole script all at once.<\/li><\/ul>\n\n\n\n
\n\nYou will see a few examples of
jupyter<\/code> notebooks for libraries mentioned below.
For more details you can go to jupyter.org<\/a><\/p>\n<\/div>\n<\/div>\n\n\n\nData Acquisition and Exploration<\/h3>\n\n\n\n
Jupyter Notebook for Data Exploration <\/h4>\n\n\n\n
- One of the reasons Jupyter Notebook is so widely used is that we can run code step by step, see the outcome of each step, and visualise the data. This helps in the early phases of a project, especially when exploring data and initially developing the model.<\/li><\/ul>\n\n\n\n
Import
pandas<\/code> as pd<\/h4>\n\n\n\n
pandas<\/code> is one of the first libraries we use when we want to load data. It creates a two-dimensional tabular data structure (called a DataFrame in
pandas<\/code>) from the data we load, for example from a database, and exposes named columns for data manipulation. It is similar to a spreadsheet, a SQL table or the data.frame in R.<\/li><\/ul>\n\n\n\n
Source : <\/a>https:\/\/www.datacamp.com\/tutorial\/tutorial-jupyter-notebook<\/a><\/figcaption><\/figure>\n\n\n\n \n\n
pandas<\/code> makes it easier to explore and manipulate data through the features shown in the image.<\/p>\n<\/div>\n\n\n\n
\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n
\ud83d\udca1pandas<\/code> is often preferred over other tools such as SQL or R because it makes analyses on a DataFrame or Series faster and easier to perform.<\/pre>\n\n\n\n
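To make the DataFrame idea concrete, here is a small sketch of typical exploration steps: building a DataFrame, filtering rows, and aggregating by a named column. The column names and values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data; in practice this would come from a file or a database query.
df = pd.DataFrame({
    'merchant': ['alpha', 'beta', 'alpha', 'gamma'],
    'amount': [120.0, 80.0, 30.0, 55.5],
})

# Filter rows with a boolean condition on a named column.
large = df[df['amount'] > 50]

# Group by one column and aggregate another.
totals = df.groupby('merchant')['amount'].sum()

# Quick summary statistics for the numeric columns.
print(df.describe())
print(large)
print(totals)
```

In a Jupyter notebook, each of these steps would usually live in its own cell so the intermediate result can be inspected before moving on.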
Import
NumPy<\/code> as np<\/h4>\n\n\n\n
While
pandas<\/code> helps us manipulate and analyze data in tabular format, NumPy helps us with numerical computing:<\/p>\n\n\n\n
- It provides various methods and functions for arrays, matrices, and linear algebra.<\/li>
- NumPy stands for Numerical Python and provides many useful features for operations on n-dimensional arrays and matrices in Python.<\/li>
- This library provides vectorisation of mathematical operations on the
NumPy<\/code> array type, which enhances performance and speeds up the execution.<\/li>
- It\u2019s very easy to work with large multidimensional arrays and matrices using
NumPy<\/code>.<\/li><\/ul>\n\n\n\n
A small example of data handling using
numpy<\/code><\/p>\n\n\n\n
<\/figure>\n\n\n\n
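A minimal sketch of the vectorised array operations described above, using made-up values. It is an illustration written for this article, not the example from the original screenshot.

```python
import numpy as np

# A small 2x2 matrix and a vector with invented values.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([10.0, 20.0])

# Vectorised arithmetic: applied to every element without a Python loop.
scaled = matrix * 2

# Broadcasting: the vector is added to each row of the matrix.
shifted = matrix + vector

# Linear algebra: matrix-vector product.
product = matrix @ vector

# Aggregation along an axis: the mean of each column.
column_means = matrix.mean(axis=0)

print(scaled)
print(shifted)
print(product)
print(column_means)
```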
\ud83d\udca1pandas<\/code> vs
NumPy<\/code> - NumPy generally performs better than pandas for complex mathematical calculations on multidimensional arrays with fewer than about 50k rows. We often use both tools when manipulating and exploring data.<\/pre>\n\n\n\n
Data Visualisation<\/h3>\n\n\n\n
import
matplotlib.pyplot<\/code> as plt<\/h4>\n\n\n\n
One important part of storytelling through data is \u201cCorrect visualisation\u201d.<\/p>\n\n\n\n
Several Python libraries help us draw plots, with options to customise and save the image.
matplotlib<\/code> is one of them!<\/p>\n\n\n\n
matplotlib.pyplot<\/code> is a collection of command-style functions that make
matplotlib<\/code> work like MATLAB.<\/p>\n\n\n\n
Scatter plot using matplotlib<\/figcaption><\/figure>\n\n\n\n \ud83d\udca1matplotlib<\/code> works efficiently with DataFrames and arrays. There are other libraries like
seaborn<\/code> and
plotly<\/code> that are also widely used for interactive and customisable plots, and often produce better-looking graphs.<\/pre>\n\n\n\n
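The following is a hedged sketch of a scatter plot with matplotlib, including the customise-and-save step mentioned above. The data is randomly generated for illustration, and the image is written to an in-memory buffer rather than a file; the Agg backend line is only there so the sketch runs outside a notebook.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus some noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

# Draw and customise the scatter plot.
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(x, y, color='tab:blue', alpha=0.7)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Scatter plot using matplotlib')

# Save the figure as PNG into an in-memory buffer (a file path works as well).
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
plt.close(fig)
```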
Model Development<\/h3>\n\n\n\n
import
sklearn<\/code> (the scikit-learn package)<\/h4>\n\n\n\n
There are multiple open source and paid tools available for predictive analytics, one of which is
scikit-learn<\/code>, a library used by many Data Scientists to build ML models. Its main selling points:<\/p>\n\n\n\n
- Simple and efficient tools for predictive data analysis<\/li>
- Accessible to everybody, and reusable in various contexts<\/li>
- Built on
numpy<\/code>,
scipy<\/code>, and
matplotlib<\/code><\/li>
- Open source, commercially usable – BSD license<\/li><\/ul>\n\n\n\n
sklearn<\/code> provides access to high-quality, easy-to-use implementations of popular algorithms, which makes
scikit-learn<\/code> a great place to start.<\/p>\n\n\n\n
Source: <\/a>https:\/\/github.com\/microsoft\/hummingbird\/wiki\/Supported-Operators<\/a><\/figcaption><\/figure>\n\n\n\n <\/p>\n\n\n\n
Example of logistic regression model in
sklearn<\/code><\/p>\n\n\n\n
<\/figure>\n\n\n\n
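A minimal sketch of a logistic regression model in sklearn. It uses the iris dataset bundled with scikit-learn as a stand-in, since the original example data is not available; the split ratio, random seed and max_iter value are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: features X and class labels y.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the model on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the model; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out set and measure accuracy.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print('Test accuracy:', round(accuracy, 3))
```

The same fit and predict pattern applies to most other sklearn estimators, which is part of what makes the library convenient for trying out several basic models.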
While
sklearn<\/code> provides a large library of model types, data scientists also make frequent use of other packages, especially when creating more specialised types of machine learning models. For example, there are libraries for creating tree-based models, which often perform better, especially when training on large amounts of data, such as
CatBoost<\/code> and
XGBoost<\/code>. Neural network models, which are also very popular especially for natural language processing and computer vision models, are commonly built using
TensorFlow<\/code> or
PyTorch<\/code> .<\/p>\n\n\n\n
Examples of packages for tree-based models<\/h4>\n\n\n\n
\n\nClassifier model built using XGBoost<\/figcaption><\/figure>\n<\/div>\n\n\n\n \nModel built in CatBoost. It shows the performance metric of train and test datasets as samples are included.<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n Examples of neural network model packages<\/h4>\n\n\n\n
Neural network model built in TensorFlow and visualised using TensorBoard<\/p>\n\n\n\n
<\/figure>\n\n\n\n
\ud83d\udca1sklearn<\/code> is one of the most commonly used libraries and provides a good baseline for learning machine learning and experimenting with many basic models.<\/pre>\n\n\n\n
Conclusion<\/h2>\n\n\n\n
Data teams prefer Python libraries because they make it easier and faster to perform data analytics tasks all in one place. People who are used to MS Excel can also switch to these Python libraries to fast-track their work, whether or not big data is involved.<\/p>\n\n\n\n
Refer to the documentation below for each library:<\/p>\n\n\n\n
- pandas<\/a> – to analyze DataFrames<\/li>
- numpy<\/a> – to compute numerical operations on multidimensional arrays<\/li>
- matplotlib<\/a> – to visualize data<\/li>
- scikit-learn<\/a> – to conduct predictive analytics<\/li><\/ol>\n\n\n\n
Written by Mitali Chotrani<\/p>\n\n\n\n
If you are passionate about data and want to be involved in making the future of finance accessible to businesses of all scales, come and be part of our team. Find out more about our open vacancies at this link<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"
I have too many tools, said no woodworker ever! Lisa Stronzi This article will share the tools used by Data Scientists in our daily work, and explore how they can also be helpful for other teams. Because Python is where we do most of our work, the tools we are using are also mainly collections […]<\/p>\n","protected":false},"author":19,"featured_media":5191,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[550],"tags":[359,75,48,72],"yoast_head":"\n
A Suite of Data Tools for Beginners - Fazz<\/title>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\n\t\n\n\n\n\n\n\t\n\t\n\t\n