Why Python is Good for Data Science and Application Development

The design of any programming language involves a compromise. Low-level languages are difficult to learn and require the programmer to do a lot manually, but they allow fine-grained optimization and high execution speed.

High-level languages let you solve the same problems more conveniently and simply, but offer fewer means of optimization. Python is one such language.

The Big Data Directorate X5 Retail Group has been around for more than two years. It employs Data Science experts who process data arrays of customers and products, as well as developers who create software products for working with big data.

When we first started, we faced the question of tools, primarily programming languages.

Our first impulse was to pick the most powerful tools available, such as Java: a general-purpose, performant, constantly evolving, and extremely popular language.

However, a significant part of our tasks simply did not require such a sophisticated tool.

In addition, Java is quite difficult to learn, and for our data science specialists, who are more analysts and mathematicians than programmers, that could become a problem.

We needed a language that was equally convenient for the entire directorate, so we turned our attention to Python.

Strengths

How is Python good for a team that has both developers and data science experts?

Below I list the properties of the language that led us to choose it for our tasks.

High development productivity

The language is interpreted, so there is no separate compilation step, and the cycle of writing, running, and fixing code is shorter than in C.

Dynamic but strong typing means that solving a problem takes less code than in Java.

A concise and clear syntax allows you to quickly write readable code. For a person who knows C or Java, Python is generally intuitive.

Compare how the same function looks in Java and in Python.

Factorial calculation in Java:

class Factorial {
    static int factorial(int n) {
        if (n == 0) return 1;
        return n * factorial(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(factorial(5));
    }
}

Factorial calculation in Python:

def factorial(n):
    return 1 if n in (0, 1) else n * factorial(n - 1)

print(factorial(5))

Low barrier to entry

Python is widely used in the Big Data field.

The need for data analysis most often arises among those who run the business – analysts, economists.

Learning heavyweight languages like Java or C is not practical for them, unlike Python, which can be learned quite quickly.

Comparison of documentation sizes for different languages

Interactivity of the language (calculations without compilation)

Analysts also appreciate Python for its ability to code on the go thanks to its built-in interpreter.

In Data Science, this matters for testing hypotheses on the fly.

Integrated features for optimizing source code

The built-in interpreter is also useful for developers. Because Python is dynamically typed, how well a piece of code is optimized can only be judged while it runs, and the interpreter helps here.

It compiles the source code into bytecode instructions, and looking at them can suggest ideas for optimization. For example, by comparing the instructions generated for two pieces of code, you can understand why one is faster than the other.

This is an important advantage when working with Big Data because, in addition to the data analysis itself, a lot of work goes into improving the algorithms that process the data.

The difference in the speed of execution of identical, at first glance, functions
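As a minimal sketch of this kind of inspection, the standard `dis` module prints the bytecode the interpreter generates, and `timeit` measures running time. The two functions below are hypothetical examples, not taken from our codebase; they build the same list in two different ways.

```python
import dis
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop and append()."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


# Both functions return the same result...
assert squares_loop(1000) == squares_comprehension(1000)

# ...but compile to different bytecode. dis.dis() prints the instructions
# so that the two versions can be compared side by side.
dis.dis(squares_loop)
dis.dis(squares_comprehension)

# timeit shows how the difference affects running time on this machine.
loop_time = timeit.timeit(lambda: squares_loop(1000), number=1000)
comp_time = timeit.timeit(lambda: squares_comprehension(1000), number=1000)
print(f"loop: {loop_time:.3f}s, comprehension: {comp_time:.3f}s")
```

The exact instructions and timings depend on the Python version and the machine, so the numbers are worth measuring rather than assuming.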

Dynamic language development

Another argument in favor of Python for us was that this language is developing rapidly and intensively.

With each version, both the performance and the syntax of the language improve. For example, version 3.8 introduced the walrus operator `:=`, which assigns a value inside an expression. A new operator is a serious event for any language.
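A short illustration of what the walrus operator changes, assuming Python 3.8 or later. The log line and variable names are made up for the example.

```python
import re

LOG_LINE = "order=1042 total=359.90"  # hypothetical input for illustration

# Without the walrus operator: assign first, then test.
match = re.search(r"total=(\d+\.\d+)", LOG_LINE)
if match:
    total_old_style = float(match.group(1))

# With the walrus operator (Python 3.8+): assign and test in one expression.
if (m := re.search(r"total=(\d+\.\d+)", LOG_LINE)):
    total_new_style = float(m.group(1))

print(total_old_style, total_new_style)
```

Both branches extract the same value; the second simply avoids a separate assignment line.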

In lower-level languages such as C++ or Java, the pace of change is much slower: changes go through a formal committee process, and major revisions of the language appear only once every several years.

In Python, the process is more open to the community: anyone can propose ideas, and new versions follow one another quickly.

The need for teamwork solutions

The features of Python make it an interesting tool for team development.

Because the interpreter hides the details of low-level computation, developers have to discuss the project with each other and dig into its details more thoroughly.

For example, in Java the developer declares the return type of a function, and if a value of the wrong type is returned, the program simply does not compile.

A Python program may start, but it will not work correctly if the value type is fundamentally important.

Such errors can be difficult to find in the development stage.
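The snippet below is a small hypothetical illustration of such an error: the helper returns a string where the caller expects a number, and Python runs the code without complaint.

```python
def read_quantity(raw):
    """Hypothetical helper that forgets to convert its input to a number."""
    return raw.strip()  # bug: returns a str, while the caller expects an int


quantity = read_quantity(" 3 ")

# No error is raised here: multiplying a str by an int repeats the string.
total_items = quantity * 2
print(total_items)  # prints 33, not 6

# The fix is an explicit conversion at the boundary.
fixed_total = int(read_quantity(" 3 ")) * 2
print(fixed_total)  # prints 6
```

The program starts and finishes normally in both cases; only the result reveals that something went wrong.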

To some, this circumstance will seem rather a minus, but a collective discussion often helps to find the most successful solutions.

It also lets developers feel involved in a common cause, which has a positive effect on motivation.

Ability to quickly expand applications with new features

As I said, in addition to data engineering, we also have tasks for developing web applications and microservices.

Python may not be the best choice here, since it can be slower at run time than other languages.

But for web applications of medium load and at the MVP stage, Python is more than convenient due to the fact that the development of new features takes less time.

Top 5 of the TIOBE index of programming language popularity, March 2020

Python in the chain of our tasks using an example

I will give a concrete example where we use Python.

Business challenge

Provide regular collection and analysis of information about the customers of our retail chains.

Based on this data, we can segment the audience by highlighting specific characteristics (attributes) for each customer – for example, his tendency to make premium purchases.

There are many such attributes, and implementing the methodology for calculating them is the work of a Data Science specialist.

But the data is updated over time, which means the calculation needs to be automated.

To do this, we developed a system in Python that allows us to calculate these attributes with a given frequency.
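To make the idea of an attribute concrete, here is a minimal plain-Python sketch. The real calculations, as described further on, run on PySpark; the records, the field layout, and the definition of the attribute (the share of a customer's spending that goes to premium goods) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, amount, is_premium).
purchases = [
    ("c1", 120.0, True),
    ("c1", 80.0, False),
    ("c2", 50.0, False),
    ("c2", 50.0, False),
    ("c3", 200.0, True),
]


def premium_share(records):
    """Return, per customer, the share of spending that went to premium goods."""
    total = defaultdict(float)
    premium = defaultdict(float)
    for customer_id, amount, is_premium in records:
        total[customer_id] += amount
        if is_premium:
            premium[customer_id] += amount
    return {cid: premium[cid] / total[cid] for cid in total}


attributes = premium_share(purchases)
print(attributes)
```

Rerunning such a function on fresh data at a fixed interval is exactly the part that needs automation.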

The main difficulty is that we need to give regular access to these data to other departments of the company, who will use them to build all kinds of marketing models and create reports.

Organizing access to the results of these calculations per se is an interesting technical task.

Solution

Apache airflow

The calculations are based on the PySpark framework in the Hadoop ecosystem.

The service we developed has a microservice architecture, and the microservices are orchestrated through Kubernetes.

In this stack, Apache Airflow deserves special mention: it is an open-source tool for scheduling and monitoring data processing workflows.

It is written in Python, which makes it accessible to both developers and data analysts.

Airflow is extremely convenient for this because it allows you to simply describe complex data pipelines.

Pipeline example in Airflow
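Airflow describes a pipeline as a directed acyclic graph (DAG) of tasks. The sketch below deliberately does not use Airflow's API; it only illustrates the underlying idea with the standard library (`graphlib`, available from Python 3.9). The task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_purchases": set(),
    "load_customers": set(),
    "compute_attributes": {"load_purchases", "load_customers"},
    "publish_results": {"compute_attributes"},
}

# static_order() yields the tasks so that every task comes after
# all of its dependencies, which is the order a scheduler must respect.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

A scheduler such as Airflow adds what this sketch leaves out: running the tasks on a schedule, retrying failures, and monitoring the results.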

We had some difficulty getting Airflow to work with Kubernetes.

Airflow is actively developing, so documentation often lags behind the current version of the code.

Support for Kubernetes is a relatively new feature, so much of it can only be understood from the code and its comments.

Here the fact that Airflow is written in Python helps.

When there is little documentation, it is extremely important to be able to understand the code.

Sample code with comments: the function for calculating the inverse square root in the Quake 3 source code

Overall, our service consists of two parts: the first schedules calculations through Airflow, and the second is responsible for filling data marts with the results.

Both parts use Python: the first is Airflow with its pipeline, and the second is a system of integration microservices.

Conclusion

The developed system allows you to automate and schedule regular calculations of 45 attributes of consumer behavior.

The amount of data accumulated from these calculations over three years is 4.5 terabytes, and other departments of the company can easily access it.

Thus, Python allows you to solve the most diverse problems.

It brings together on one project developers and specialists for whom programming is not a core skill: business analysts, data analysts, and data scientists.

It is well suited to agile development and to quick, iterative optimization.

For a company that has many multi-level tasks and is working with big data, Python is a great addition to competencies in low-level languages.

The post Why Python is Good for Data Science and Application Development appeared first on Creador.
