The design of any programming language involves a compromise. Low-level languages are difficult to learn and require the programmer to do a lot manually, but they allow fine-grained code optimization and high execution speed.
High-level languages allow you to solve the same problems in a more convenient and simple way but have fewer methods and tools for optimization. One such language is Python.
The Big Data Directorate of X5 Retail Group has been around for more than two years. It employs Data Science experts who process large data sets on customers and products, as well as developers who create software products for working with big data.
When we first started, we faced the question of tools, primarily programming languages.
The first desire was to take the most advanced tools, for example, Java – a universal, productive, constantly evolving, and extremely popular language.
However, a significant part of our tasks simply did not require such a sophisticated tool.
In addition, Java is quite difficult to learn, and for our DS specialists, who are more analysts and mathematicians than programmers, that could become a problem.
We needed a language that was equally convenient for the entire directorate, so we turned our attention to Python.
Strengths
What makes Python good for a team that has both developers and data science experts?
I will list the properties of the language that led us to choose it for our tasks.
High development productivity
The language is interpreted: there is no separate compilation step, so code can be written and tried out faster than in C.
Dynamic but strong typing means that solving a problem takes less code than in Java.
A concise and clear syntax allows you to quickly write readable code. For a person who knows C or Java, Python is generally intuitive.
Compare how the same function looks in Java and in Python:
Factorial calculation in Java:
    class Factorial
    {
        static int factorial(int n)
        {
            if (n == 0)
                return 1;
            return n * factorial(n - 1);
        }
        public static void main(String[] args)
        {
            System.out.println(factorial(5));
        }
    }
Factorial calculation in Python:
def factorial(n): return 1 if (n==1 or n==0) else n * factorial(n - 1)
print(factorial(5))
Low barrier to entry
Python is widely used in the Big Data field.
The need for data analysis most often arises among those who run the business – analysts, economists.
Learning heavy languages like Java or C is not practical for them – unlike Python, which can be learned quite quickly.
Comparison of documentation sizes for different languages
Interactivity of the language (calculations without compilation)
Analysts also appreciate Python for its ability to code on the go thanks to its built-in interpreter.
In Data Science, this is useful for testing hypotheses on the fly.
Integrated features for optimizing source code
The interpreter is useful for developers too: because Python is dynamically typed, how well a piece of code is optimized can only be judged while it runs.
Python compiles source code into bytecode instructions for its virtual machine, and inspecting them can suggest ideas for optimization. For example, by comparing the instructions generated for two functions, you can understand why one is faster than the other.
This is an important advantage for working with Big Data because, in addition to analyzing data, a lot of work goes into improving the algorithms that process it.
The difference in the speed of execution of identical, at first glance, functions
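As a rough illustration of this kind of comparison, the sketch below uses the standard `dis` module to list the bytecode of two functions that build the same list. The function names are hypothetical and not taken from our code base, and the exact instructions printed depend on the Python version. In CPython, the loop version looks up and calls `append` on every iteration, while the comprehension typically uses a dedicated `LIST_APPEND` instruction.

```python
import dis


def squares_loop(n):
    # Builds the list with an explicit loop and a method call per item.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    # Builds the same list with a list comprehension.
    return [i * i for i in range(n)]


def opnames(func):
    """Return the names of the bytecode instructions compiled for func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


for func in (squares_loop, squares_comprehension):
    print(func.__name__, "-", len(opnames(func)), "top-level instructions")
    dis.dis(func)  # full listing, including nested code objects
```

Timing both functions with `timeit` on realistic input sizes is a reasonable next step: the instruction listing suggests where the difference comes from, but only a measurement confirms it.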
Dynamic language development
Another argument in favor of Python for us was that this language is developing rapidly and intensively.
With each version, performance improves and the syntax gains new features. For example, version 3.8 introduced the walrus operator, :=, which performs assignment inside an expression.
A new operator is a serious event for any language.
In lower-level languages such as C++ or Java, the pace of change is usually slower: changes go through a formal committee process, and in the case of C++ a new standard is published only once every few years.
In Python, the process is more open to the community: anyone can propose ideas, and the number of proposals is growing rapidly.
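For readers who have not met the walrus operator mentioned above, here is a minimal sketch of what it does (Python 3.8 or later). The function name and sample strings are made up for illustration.

```python
import re


def first_number(text):
    # := assigns the match object to `match` and tests it in one expression,
    # so no separate assignment line is needed before the `if`.
    if (match := re.search(r"\d+", text)) is not None:
        return int(match.group())
    return None


print(first_number("order 42 shipped"))
print(first_number("no digits here"))
```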
The need for teamwork solutions
The features of Python make it an interesting tool for team development.
Because the interpreter hides the details of low-level machine computing, developers need to discuss the project with each other and examine its details more closely.
For example, in Java the developer declares the return type of a function, and if there is a problem with the type of the value, the program simply does not compile.
A Python program may start, but it will not work correctly if the value type is fundamentally important.
Such errors can be difficult to find at the development stage.
Some will see this as a drawback, but a collective discussion often helps to find the most successful solutions.
It also lets developers feel involved in the common cause, which has a positive effect on motivation.
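The kind of type problem described above can be sketched as follows. The function names are hypothetical. The first version forgets to convert a value read as text, and the program still runs, only with the wrong result. Type hints, as in the second version, give a static checker such as mypy a chance to catch a mismatch before the code is run, although Python itself does not enforce them.

```python
def parse_quantity(raw):
    # Bug: the value read from a text field is returned unchanged, as a str.
    return raw


def parse_quantity_fixed(raw: str) -> int:
    # The annotation documents the intended return type; int() enforces it.
    return int(raw)


# No exception is raised here: multiplying a str repeats it.
doubled_wrong = parse_quantity("3") * 2
doubled_right = parse_quantity_fixed("3") * 2

print(repr(doubled_wrong))  # '33'
print(repr(doubled_right))  # 6
```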
Ability to quickly expand applications with new features
As I said, in addition to data engineering, we also have tasks for developing web applications and microservices.
Python may not be the best choice here, because it can be less performant than other languages.
But for web applications under medium load and at the MVP stage, Python is more than convenient because new features take less time to develop.
Top 5 TIOBE index of the popularity of programming languages in March 2020
Python in our chain of tasks: an example
I will give a concrete example where we use Python.
Business challenge
Provide regular collection and analysis of information about the customers of our retail chains.
Based on this data, we can segment the audience by highlighting specific characteristics (attributes) for each customer, for example, their tendency to make premium purchases.
There are many such attributes, and implementing the methodology for calculating them is the work of a Data Science specialist.
But the data is updated over time, so the calculation needs to be automated.
To do this, we developed a system in Python that allows us to calculate these attributes with a given frequency.
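To make the idea of a customer attribute concrete, here is a minimal pure-Python sketch of one possible attribute: the share of a customer's spending that goes to premium products. The record layout, the function name, and the sample values are all assumptions made for illustration. They are not our actual methodology or data, and the production calculations run on PySpark rather than on plain Python lists.

```python
from collections import defaultdict


def premium_share(purchases):
    """Return {customer_id: share of spending on premium items}.

    `purchases` is an iterable of (customer_id, amount, is_premium) tuples.
    """
    total = defaultdict(float)
    premium = defaultdict(float)
    for customer_id, amount, is_premium in purchases:
        total[customer_id] += amount
        if is_premium:
            premium[customer_id] += amount
    # Customers with zero total spending are skipped to avoid dividing by zero.
    return {cid: premium[cid] / total[cid] for cid in total if total[cid] > 0}


# Hypothetical sample records, for illustration only.
sample = [
    ("a", 100.0, True),
    ("a", 300.0, False),
    ("b", 50.0, False),
]
print(premium_share(sample))
```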
The main difficulty is that we need to give regular access to these data to other departments of the company, who will use them to build all kinds of marketing models and create reports.
Organizing access to the results of these calculations per se is an interesting technical task.
Solution
The calculations are based on the PySpark framework in the Hadoop ecosystem.
The service we developed has a microservice architecture, and the microservices are orchestrated through Kubernetes.
Among these tools, Apache Airflow stands out: it is an open-source tool for scheduling and monitoring data processing workflows.
It is written in Python, which makes it accessible to both developers and data analysts.
Airflow is extremely convenient here because it lets you describe complex data pipelines simply.
Pipeline example in Airflow
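The core idea of such a pipeline is a set of tasks plus the dependencies between them, executed in dependency order. The sketch below shows that idea using only the standard library (`graphlib`, Python 3.9 or later). It is deliberately not Airflow's API, and the task names and sample data are hypothetical. In Airflow the same structure would be declared as a DAG of operators and run by its scheduler.

```python
from graphlib import TopologicalSorter


def load_purchases(ctx):
    # Stand-in for reading raw data; the values are made up.
    ctx["purchases"] = [("c1", 120.0), ("c2", 40.0), ("c1", 60.0)]


def compute_totals(ctx):
    # Aggregate spending per customer.
    totals = {}
    for customer, amount in ctx["purchases"]:
        totals[customer] = totals.get(customer, 0.0) + amount
    ctx["totals"] = totals


def publish(ctx):
    # Stand-in for writing results where other teams can read them.
    ctx["published"] = sorted(ctx["totals"].items())


TASKS = {
    "load_purchases": load_purchases,
    "compute_totals": compute_totals,
    "publish": publish,
}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "load_purchases": set(),
    "compute_totals": {"load_purchases"},
    "publish": {"compute_totals"},
}


def run_pipeline():
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
print(order)
print(ctx["published"])
```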
There were some difficulties in getting Airflow to work with Kubernetes.
Airflow is actively developing, so documentation often lags behind the current version of the code.
Support for Kubernetes is a relatively new feature, so much of it can only be understood from the code and comments.
This is where it helps that Airflow is written in Python.
When there is little documentation, it is extremely important to be able to understand the code.
Sample code with comments: the function for calculating the inverse square root in the Quake 3 source code
Overall, our service consists of two parts: the first schedules calculations through Airflow, and the second is responsible for filling data marts with data.
Both parts of the service use Python: in the first it is Airflow with its pipelines, and in the second it is a system of integration microservices.
Conclusion
The developed system allows you to automate and schedule regular calculations of 45 attributes of consumer behavior.
The amount of data accumulated from these calculations over three years is 4.5 terabytes, and other departments of the company can easily access it.
Thus, Python allows you to solve the most diverse problems.
It brings together, on one project, developers and specialists for whom programming is not a core skill: business analysts, data analysts, and data scientists.
It is well suited to agile development and to fast, iterative optimization.
For a company that has many multi-level tasks and is working with big data, Python is a great addition to competencies in low-level languages.
The post Why Python is Good for Data Science and Application Development appeared first on Creador.