I still use Windows every day and consider it a great operating system, but the reality is that Windows shields the user from many things. That protection is good in most cases, but in a way it also leaves you with less control over your PC.

One of the commands typed over and over was `pip install ...`, and I was told it was for installing Python packages. At the time I didn't even know what Python was, let alone `pip`, but it stuck in my head.

The truth is that you don't need Linux to use `pip`; it works perfectly well on Windows or Mac, but that is the context in which I first came across it. I am a civil engineer by profession, but I taught myself programming. Programming is a very powerful tool for anyone, so much so that I believe it should be part of basic education. Just as mathematics is taught, the basics of programming should be explained too: at least something about data structures, how the internet works, servers, APIs, etc.

As the saying goes, everything seems impossible until it is done. With that in mind, a few days ago I set out to build my first `pip` project.

I decided to create a `pip` project for one fundamental reason: I have several Python programs that I use frequently across different projects, and I need to be able to import them without copying the code into each one. My goal is for every project folder to be "isolable", that is, to include all the files it needs to work independently.

The advantage of using a library published on PyPI is that it avoids code repetition, and you can update it in one place so that every project that uses it picks up the new version the next time it upgrades the package. This makes managing your Python programs much more efficient and convenient.

I also think it is great to contribute an open-source program to the community, one that others may be able to use and improve. Sharing our programming knowledge and skills not only helps others solve problems, it also helps us grow as programmers and gives us a broader view of what this fascinating field has to offer.

The Python project I chose for my first library is quite simple, but it is very useful at the company where I work, where we use it frequently. It takes a payment list (nómina de pagos) in Excel, laid out in a specific format, and converts it to the format required by other banks. A payment list is a set of transfers to different recipients, which lets you make the payments more efficiently instead of one by one.

Despite its simplicity, this `pip` library can be very helpful to any company that needs to automate its payments. In just a few steps, the tool can be integrated into any Python project to make payment processes considerably more efficient.

The program is built around functions that take the path of an Excel file as input. These functions use the Pandas library to carry out the transformations needed to produce the desired output. The program also has a user interface built with Tkinter, which makes the code easy to use for people with no programming knowledge.

- Create a directory structure for the project:

```
nombre_del_proyecto/
|-- nombre_del_proyecto/
| |-- __init__.py
| |-- archivo1.py
| |-- archivo2.py
|-- README.md
|-- LICENSE
|-- setup.py
```

- Add code to `archivo1.py` and `archivo2.py`. These files will contain the project's code.
- Create a `README.md` file to provide a description and documentation of the project.
- Create a `LICENSE` file to state the terms under which the project can be used.
- Create a `setup.py` file to define the project's metadata, such as its name, version, author and dependencies. The following example can be used as a guide.

```
from setuptools import setup, find_packages

# Package metadata; replace the placeholder values with your own.
setup(
    name='nombre_del_proyecto',
    version='0.1',
    author='Tu Nombre',
    author_email='tu_email@ejemplo.com',
    description='Project description',
    packages=find_packages(),
    # Third-party packages that pip must install alongside this one.
    install_requires=[
        'paquete1',
        'paquete2',
    ],
)
```

I went through the whole process and it worked. Along the way a few questions came up that I was able to resolve, for example:

- Because the project has a graphical interface, the question was: how can I open the interface once the pip package is installed? This can be solved by adding the following code to `setup.py`:

```
entry_points={
    'console_scripts': [
        'start_menu_conversor_nominas = conversor_nominas_bancos_chile.bank_tkinter_menu:iniciar_menu'
    ]
},
```

- The package uploaded to PyPI without problems, but running it raised errors because some of the libraries listed in `install_requires` were missing, while others were unnecessary because they are part of the Python standard library.

When I tried installing the library, I noticed a message saying something along the lines that `setup.py` should be replaced by another file called `pyproject.toml`. GPT had not mentioned this detail, I imagine because it was trained on data up to the end of 2021.

The instructions GPT had given me for packaging the project were to run the following commands:

```
python setup.py sdist bdist_wheel
twine upload dist/*
```

However, I found this article, which explains that invoking `setup.py` directly is not recommended.

In the end I deleted `setup.py` and set up the following `pyproject.toml`:

```
[tool.poetry]
name = "conversor_nominas_bancos_chile"
version = "1.8.2"
description = "Librería que convierte el formato de nóminas del BCI al formato del resto de bancos."
authors = [
    "Antonio Canada Momblant <xxxx@gmail.com>"
]
license = "MIT"
readme = "README.md"
repository = "https://github.com/tonicanada/conversor_nominas_bancos_chile"

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^1.5.3"
numpy = "^1.24.2"
datetime = "^5.1"
pathlib = "^1.0.1"
tk = "^0.1.0"
openpyxl = "^3.1.2"
xlrd = "^2.0.1"

[tool.poetry.scripts]
start_menu_conversor_nominas = "conversor_nominas_bancos_chile.bank_tkinter_menu:iniciar_menu"

[build-system]
requires = ["poetry>=1.0"]
build-backend = "poetry.masonry.api"
```

To upload the package to PyPI, I installed Poetry and ran the following commands:

```
poetry build
poetry publish
```

The library can be installed with the following command:

```
pip install conversor_nominas_bancos_chile
```

Once installed, the graphical interface can be opened simply by running `start_menu_conversor_nominas`.

Below is an example of how to use the library's functions.

```
from conversor_nominas_bancos_chile import bank_functions
import pkg_resources
import json
from pathlib import Path
import pandas as pd

# ___________________________________________________________________
# The calls below show how to use the conversion functions.
# When run, each one saves its output file in the given path.
path_nomina = Path(
    "/home/acm/Coding/acm_pip_packages/conversor_nominas_bancos_chile/conversor_nominas_bancos_chile/planillas_test/input_nomina_bci_ejemplo.xls")
path_datosempresa = Path(
    "/home/acm/Coding/acm_pip_packages/conversor_nominas_bancos_chile/conversor_nominas_bancos_chile/planillas_test/datos_empresas.xlsx")

# Conversion to the "Banco Chile Nóminas Transferencias Masivas" format
bank_functions.bci_to_bancochile_nomina_transferencias(
    path_nomina, "98765432-1", path_datosempresa)

# Conversion to the "Banco Chile Pagos Masivos" format
bank_functions.bci_to_bancochile_pagosmasivos(
    path_nomina, "98765432-1", path_datosempresa, "812", "prov")

# Conversion to the "Santander" format
bank_functions.bci_to_santander_transferenciasmasivas(
    path_nomina, "76234531-2", path_datosempresa)

# Conversion to the "BICE" format
bank_functions.bci_to_bice_nomina(
    path_nomina, "87543201-9", path_datosempresa)

# ___________________________________________________________________
# The code below shows how to access some files shipped with the library.


def get_file_from_package(path):
    """Return the filesystem path of a file bundled inside the package."""
    resource = pkg_resources.resource_filename(
        'conversor_nominas_bancos_chile', path)
    return resource


# JSON file 'bancos_codigos.json'
bancos_codigos = get_file_from_package('bancos_codigos.json')
with open(bancos_codigos) as f:
    bancos_codigos = json.loads(f.read())

# JSON file 'bancos_headers_nomina.json', with the headers used by each bank
bancos_headers_nomina = get_file_from_package('bancos_headers_nomina.json')
with open(bancos_headers_nomina) as f:
    bancos_headers_nomina = json.loads(f.read())

# Excel file with the sample sheet 'datos_empresas.xlsx'
datos_empresas = get_file_from_package('planillas_test/datos_empresas.xlsx')
df = pd.read_excel(datos_empresas)

# Excel file with the sample BCI input sheet
planilla_ejemplo_input = get_file_from_package(
    'planillas_test/input_nomina_bci_ejemplo.xls')
df = pd.read_excel(planilla_ejemplo_input)
```

I hope you find it useful, both the information on how to publish a library on PyPI and the program itself! Here is the link to the project repository.

If you liked this article, please give it a 👏 and share it! You can follow me on my blog, LinkedIn, Twitter, Facebook, Medium.

Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?

Despite how simple it is to describe the problem (even a child would understand what we are looking for), no efficient algorithm is known for finding the optimal solution. The problem is classified as NP-hard, and no polynomial-time method has been found so far. As we can see in this article, a typical modern computer can find the optimal solution for up to about 22-24 points (cities).

I find it inspiring that today, despite our great technological advances, there are problems so easy to describe still unresolved. We are not talking about quantum physics or fluid mechanics! We just want to find the shortest route given a certain number of points to visit!

In this article, we'll explain and implement algorithms that find the optimal solution, as well as some approximations. We are going to use C#, and the code will result in a plugin for AutoCAD, which provides a very friendly GUI.

This form is the program's interface, which lets the user play with different TSP algorithms and numbers of nodes. It has 3 tabs, as can be seen in the next picture.

Clicking on `insert sample nodes` draws a given number of nodes in the CAD model; the user can then choose an algorithm to solve the TSP for those nodes. The next picture shows an example with 20 nodes.

A naive approach to solving the TSP is brute force, which means evaluating all possible routes for the given nodes. This is a very expensive way to solve it, with a time complexity of O(n!). To be exact, the number of distinct tours is (n-1)!/2. Imagine you have n nodes: to enumerate all possible tours we fix one start node, then we have (n-1) options for the next node, (n-2) for the one after, and so on. This gives us (n-1)!, but with symmetric distances the tour (1 > 2 > 3 > 4 > 1) is the same as (1 > 4 > 3 > 2 > 1) traversed in reverse, which is why we divide by 2.
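To make the count concrete, here is a minimal Python sketch of the brute-force search (the plugin itself is written in C#, so this is only an illustration). It assumes Euclidean distances between 2D points and at least 3 nodes; the function name and the sample points are hypothetical. Node 0 is fixed as the start, and a permutation is skipped when it is the mirror image of one already considered.

```python
from itertools import permutations
from math import dist, factorial


def brute_force_tsp(points):
    """Try every distinct tour starting and ending at node 0 (needs n >= 3)."""
    n = len(points)
    best_length, best_tour, checked = float("inf"), None, 0
    for perm in permutations(range(1, n)):
        # A tour and its reverse have the same length: keep only one of them.
        if perm[0] > perm[-1]:
            continue
        checked += 1
        tour = (0, *perm, 0)
        length = sum(dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour, checked


# Hypothetical sample: the four corners of a unit square, deliberately unordered.
square = [(0, 0), (1, 1), (0, 1), (1, 0)]
length, tour, checked = brute_force_tsp(square)
print(length, tour, checked)  # 4.0 (0, 2, 1, 3, 0) 3
```

For the 4 nodes above only (4-1)!/2 = 3 tours are evaluated, which matches the formula.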

In this article, we'll analyze 2 ways of computing the optimal solution, Integer Linear Programming and Dynamic Programming, which improve on the brute-force method but still take exponential time. Afterward, we'll explore 2 approximation algorithms, which run much faster and still give reasonably good tours: the 2T (Double-Tree) and Christofides approximations. In the worst-case scenario, the 2T tour is 2 times the optimal length, and the Christofides tour is 1.5 times.

Finally, we'll talk about the Google OR-Tools routing library, which is free and provides powerful approximations to the TSP that run very fast and combine more than 1 algorithmic strategy.

Linear Programming (LP) is a powerful way to solve problems, and part of its beauty is its simplicity: we only need to formulate (express) our input in the required way, and the LP solver does the rest of the job and returns the solution. This formulation consists of:

Cost Function to optimize (maximize or minimize):

$$c_1x_1+c_2x_2+... +c_nx_n$$

Variables must be positive or equal to 0:

$$x_1\geq0, x_2\geq0, ..., x_n\geq0$$

List of constraints:

$$a_{i1}x_1+a_{i2}x_2+... +a_{in}x_n \leq b_i, \quad i = 1, ..., m$$

This can be expressed in matrix form as follows:

$$\begin{align} c^Tx \ \ \text{subject to:} \newline x \geq 0 \newline A \ x \leq b \newline \text{Where} \ c \in \mathbb{R}^n, b \in \mathbb{R}^m, A \in \mathbb{R}^{m \times n} \end{align}$$

Integer Linear Programming adds one more constraint: the variables \(x\) must be non-negative integers, meaning \(x \in \mathbb{Z}^n\).

The key part of using LP is finding the correct formulation for the problem. Sometimes there's more than 1 possible formulation, and one can be more efficient than the other. In fact, we'll explore 2 possible ways to formulate the TSP, and we'll see how they differ in their performance.

We can follow our intuition to formulate this problem: we need to define our `variables`, `constraints` and the `objective function`. It's easy to think about it if we work with an example:

**Variables**:

What we are looking for is a tour of minimum length that passes through every node exactly once. We can declare our variables as the edges of the complete graph formed by the nodes. If the variable (edge) \(x_{ij}\) is equal to 1, the edge is part of the optimal tour; if it's equal to 0, it is not.

**Objective function**:

We want to find the tour with the minimum distance, so it makes sense to write our objective function as follows, where \(c_{ij}\) denotes the distance from node \(i\) to node \(j\):

$$\min \sum_{i}\sum_{j \neq i} c_{ij}x_{ij}$$

**Constraints**:

Intuitively we can state:

Each node is the start point of exactly one edge that belongs to the optimal tour:

$$\sum_{j \neq i} x_{ij} = 1 \quad \forall i$$

Each node is the end point of exactly one edge that belongs to the optimal tour:

$$\sum_{i \neq j} x_{ij} = 1 \quad \forall j$$

So... are we ready? Is that all?

Unfortunately not! There's something we are not taking into account...

There isn't any constraint to eliminate possible **subtours**! The following picture shows 3 setups without the subtour elimination constraints. Note that a subtour with only 2 nodes is also possible: it starts from one node, goes to another and comes straight back.

So... we need to add some constraints to eliminate these subtours. How can we do it? Next, we'll discuss 2 possible ways:

Adding constraints relating each **subset size** to its number of **activated edges**. An "active edge" is one whose variable is equal to 1, so it belongs to the optimal tour. For example, for the subset {0,1,2} at most 2 edges between its nodes can be active, not 3, because 3 active edges would close a cycle inside the subset. We can state, for each possible subset:

$$\text{number of activated edges in the subset} \leq \text{subset size}-1$$

Expressed more formally, with \(V\) the set of all nodes:

$$\sum_{i \in S}\sum_{j \in S, j \neq i} x_{ij} \leq |S|-1 \quad \forall S \subset V, \ 2 \leq |S| \leq n-1$$

This method is easy to understand, but it adds a very large number of constraints, which makes solving the LP quite inefficient. The number of constraints added by this way of eliminating subtours is computed next:

| Item | Add or Deduct | Number of constraints |
|---|---|---|
| Possible subsets | Add | \(2^n\) |
| Subsets with just one node | Deduct | \(n\) |
| Subset with all 0 (no nodes) | Deduct | \(1\) |
| Subset with all 1 (every node) | Deduct | \(1\) |
| Total of constraints | | \(2^n-n-2\) |
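The total can be double-checked by enumeration. The short Python sketch below (the function name is hypothetical) counts every subset of the \(n\) nodes that has at least 2 nodes and at most \(n-1\), which are exactly the subsets that receive a constraint, and compares the count with \(2^n-n-2\).

```python
from itertools import combinations


def count_subtour_constraints(n):
    """Count the node subsets of size 2..n-1, one constraint per subset."""
    count = 0
    for size in range(2, n):
        for _subset in combinations(range(n), size):
            count += 1
    return count


for n in (4, 5, 10):
    print(n, count_subtour_constraints(n), 2**n - n - 2)
```

Both columns agree, and the exponential growth is already visible at small sizes.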

There is another way to eliminate subtours, which may be less intuitive but is very smart, and provides a more compact formulation. It was proposed by Miller, Tucker and Zemlin in 1960. This formulation introduces new `time variables`, which we call \(u_i\).

The idea is to find a relation between \(x_{ij}\), \(u_i\) and \(u_j\).

\(x_{ij} = 1 \implies u_j \geq u_i + 1\)

\(x_{ij} =0 \implies \text{There's no direct relation}\)

We can model this using the `big number technique`:

$$u_j \geq u_i + 1 - M(1 - x_{ij})$$

Where \(M\) is some large number. We can choose \(M = n-1\), because \(u_i \in [1,n-1]\). We can sum up these time constraints as follows:

$$u_i - u_j + (n-1)\,x_{ij} \leq n-2 \quad \forall i \neq j, \ \ i,j \in \{1, ..., n-1\}$$

It's important to note that node 0 is the only node not restricted by these constraints. With this formulation we have drastically reduced the number of subtour constraints, from the order of \(2^n\) to the order of \(n^2\) (one per ordered pair of nodes other than node 0), at the price of \(n-1\) extra variables.
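To see these constraints at work without a solver, the following Python sketch checks by exhaustive search whether any assignment of the time variables satisfies them for a fixed choice of edges. The edges are given as a hypothetical `successor` list, where `successor[i] = j` means \(x_{ij} = 1\). It is only feasible for very small graphs, since it tries every combination of \(u\) values.

```python
from itertools import product


def mtz_feasible(successor):
    """Return True if some u in [1, n-1] satisfies every time constraint."""
    n = len(successor)
    inner = range(1, n)  # node 0 is not restricted
    for values in product(range(1, n), repeat=n - 1):
        u = dict(zip(inner, values))
        if all(
            u[i] - u[j] + (n - 1) * (1 if successor[i] == j else 0) <= n - 2
            for i in inner
            for j in inner
            if i != j
        ):
            return True
    return False


single_tour = [1, 2, 3, 4, 0]   # 0 > 1 > 2 > 3 > 4 > 0
two_subtours = [1, 0, 3, 4, 2]  # 0 > 1 > 0 and 2 > 3 > 4 > 2

print(mtz_feasible(single_tour))   # True
print(mtz_feasible(two_subtours))  # False
```

The cycle 2 > 3 > 4 > 2 would need \(u_2 < u_3 < u_4 < u_2\), which is impossible, so any solution containing a subtour that avoids node 0 is rejected.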

The code implementing both ILP formulations uses the linear solver offered by Google's OR-Tools library. With this code, the ILP formulation with time variables runs faster than the other one.

These are the steps to solve a Dynamic Programming problem:

- Identify the `recurrence relation` and solve the problem with a `top-down` approach
- Optimize the solution by adding `memoization`
- Optimize the solution using iteration, with a `bottom-up` approach

Let's work through one example by hand to see if we can detect the recurrence relation. We are going to work with a 4-node graph with the following distance matrix. It's not symmetric, but that's perfectly fine: imagine it's a road system where the route going from node A to B is shorter than the route back.

The next diagram shows all possible tours we can take starting from node 0, for example, 0 > 1 > 2 > 3 > 0, 0 > 1 > 3 > 2 > 0, etc.

If we look carefully, we can see that we do the same thing at every node. The recurrence relation is presented next:

Or expressed in a more general way:

Where \(g(i,S)\) is the minimum cost from node \(i\) to the subset \(S\) of nodes; in other words, it is the optimal cost of the subset \(\{i\} \cup S\), starting from some `starting_node` (0 in our example) and ending in \(i\). The code in the repository solves the problem using this recurrence relation with a `top-down` approach.

Now that we have solved the problem using the `recurrence relation`, the next step is to see whether we can avoid repeated recursive calls using `memoization`. We can create a 2D table to store the computed values of our recursive function: each row corresponds to a subset \(\{i\} \cup S\) and each column to the last index visited \(i\). Therefore, there will be \(2^n\) rows and \(n\) columns. The following table corresponds to the example we're working on.

We can also use this memoization table to recover the optimal tour, meaning the order of node indexes, starting and ending in 0, that has the minimum cost. In our example it is "0 -> 1 -> 3 -> 2 -> 0". The repository code includes this optimization.
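As an illustration, here is a minimal Python sketch of the memoized top-down idea; it is not the C# code of the plugin. It assumes one common form of the recurrence, in which \(g(i, S)\) is the cheapest way to leave node \(i\), visit every node in \(S\) once and finish at node 0, so that \(g(i, S) = \min_{k \in S} \{ d_{ik} + g(k, S \setminus \{k\}) \}\) and \(g(i, \emptyset) = d_{i0}\). The distance matrix is a hypothetical textbook-style example, not necessarily the one in the figure, and subsets are stored as bitmasks.

```python
from functools import lru_cache


def held_karp_top_down(d):
    """Memoized recursion; returns (optimal cost, optimal tour from node 0)."""
    n = len(d)
    choice = {}  # best next node for each (current node, remaining subset)

    @lru_cache(maxsize=None)
    def g(i, remaining):
        if remaining == 0:
            return d[i][0]  # nothing left to visit: go back to node 0
        best = float("inf")
        for k in range(1, n):
            if remaining & (1 << k):
                cost = d[i][k] + g(k, remaining & ~(1 << k))
                if cost < best:
                    best = cost
                    choice[(i, remaining)] = k
        return best

    full = (1 << n) - 2  # bitmask with every node except node 0
    best_cost = g(0, full)

    # Follow the stored choices to rebuild the tour.
    tour, node, remaining = [0], 0, full
    while remaining:
        node = choice[(node, remaining)]
        tour.append(node)
        remaining &= ~(1 << node)
    tour.append(0)
    return best_cost, tour


# Hypothetical asymmetric distance matrix for 4 nodes.
distances = [
    [0, 10, 15, 20],
    [5, 0, 9, 10],
    [6, 13, 0, 12],
    [8, 8, 9, 0],
]
print(held_karp_top_down(distances))  # (35, [0, 1, 3, 2, 0])
```

`lru_cache` plays the role of the memo table here: each pair (node, remaining subset) is computed only once.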

The last step solves the TSP without recursion, using a bottom-up approach. The memo table is filled from the bottom of the tree to the top.
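A minimal bottom-up sketch in Python could look like this (again an illustration rather than the plugin's C# code). It assumes the table layout described earlier: one row per subset of visited nodes, stored as a bitmask, and one column per last node visited. The distance matrix is a hypothetical example and at least 2 nodes are assumed.

```python
def held_karp_bottom_up(d):
    """Iterative DP; memo[mask][i] = cheapest path from 0 through mask, ending at i."""
    n = len(d)
    size = 1 << n
    inf = float("inf")
    memo = [[inf] * n for _ in range(size)]
    memo[1][0] = 0  # only node 0 visited, standing on node 0

    for mask in range(1, size):
        if not mask & 1:
            continue  # every partial path must contain the start node
        for last in range(n):
            if memo[mask][last] == inf:
                continue  # state not reachable
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue  # node already visited
                new_mask = mask | (1 << nxt)
                candidate = memo[mask][last] + d[last][nxt]
                if candidate < memo[new_mask][nxt]:
                    memo[new_mask][nxt] = candidate

    # Close the tour by returning to node 0 from the best last node.
    return min(memo[size - 1][i] + d[i][0] for i in range(1, n))


# Hypothetical asymmetric distance matrix for 4 nodes.
distance_matrix = [
    [0, 10, 15, 20],
    [5, 0, 9, 10],
    [6, 13, 0, 12],
    [8, 8, 9, 0],
]
print(held_karp_bottom_up(distance_matrix))  # 35
```

Because a larger subset always has a larger bitmask than the subsets it is built from, looping over the masks in increasing order guarantees every row is complete before it is used.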

The Dynamic Programming approach runs in O(n^2 * 2^n) time, which is a great improvement over *brute-force*. As can be seen in the following table, from \(n = 10\) onward the DP operation count is lower than the brute-force count of \((n-1)!/2\) tours.

| \(n\) | \((n-1)!/2\) | \(n^2 2^n\) |
|---|---|---|
| 3 | 1 | 72 |
| 4 | 3 | 256 |
| 5 | 12 | 800 |
| 6 | 60 | 2,304 |
| 7 | 360 | 6,272 |
| 8 | 2,520 | 16,384 |
| 9 | 20,160 | 41,472 |
| 10 | 181,440 | 102,400 |
| 11 | 1,814,400 | 247,808 |
| 12 | 19,958,400 | 589,824 |
| 13 | 239,500,800 | 1,384,448 |
| 14 | 3,113,510,400 | 3,211,264 |
| 15 | 43,589,145,600 | 7,372,800 |

As we have seen, the optimal approaches run in exponential time, so we can't use them for more than 24-25 nodes. What can we do? The TSP appears many times in our daily lives; for example, companies need a way to schedule their delivery orders at the minimum possible cost.

For this reason, there are approximation algorithms for the TSP that run much faster than the optimal ones. We are going to analyze 2 of them: the 2T (Double-Tree) approximation and the Christofides algorithm. In the worst-case scenario, the 2T tour is 2 times the optimal solution, and the Christofides tour 1.5 times.

Both approximation solutions (2T and Christofides) are based on the concept of a `minimum spanning tree` (MST). What is an MST?

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight.

In the code, we use Kruskal's algorithm with a Union-Find data structure to find the MST.
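For readers who have not seen it before, this is a minimal Python sketch of Kruskal's algorithm with a small Union-Find (the plugin's own implementation is in C#). Edges are given as `(weight, u, v)` tuples; the function name and the sample graph are hypothetical.

```python
def kruskal_mst(n, edges):
    """Return (total weight, list of chosen edges) of a minimum spanning tree."""
    parent = list(range(n))

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, chosen = 0, []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two different components
            parent[root_u] = root_v
            chosen.append((u, v, weight))
            total += weight
    return total, chosen


# Hypothetical weighted graph with 4 nodes.
sample_edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(kruskal_mst(4, sample_edges))
# (6, [(0, 1, 1), (1, 3, 2), (1, 2, 3)])
```

Sorting the edges first and accepting only those that connect two different components is what keeps the result free of cycles.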

The cost of an MST is a lower bound on the optimal solution of the TSP: \(c(T_G)\leq c(H^*_G)\)

Notation:

\(T_G\): MST of a graph G.

\(H^*_G\): Hamiltonian cycle that is the optimal solution of the TSP on G.

Why is \(c(T_G)\) a lower bound on \(c(H^*_G)\)? This is easy to show: if we take \(H^*_G\) (the optimal TSP solution) and remove one edge, we get a spanning tree. Its cost is at most \(c(H^*_G)\) and, since \(T_G\) is the minimum spanning tree, at least \(c(T_G)\).

Once we have understood the concept of the MST and checked that it is always a lower bound on the TSP solution, it's easy to build an algorithm that returns an approximation of the TSP with a maximum error factor of 2 (meaning that in the worst-case scenario our approximation will be double the optimal solution).

These are the steps to build the 2T approximation:

- Find an MST, which we call \(T_G\)
- Duplicate the edges of \(T_G\)
- Find an Eulerian tour using Hierholzer's algorithm (DFS traversal)
- Shortcut the Eulerian tour (remove duplicate vertices)

If we follow these steps, the Eulerian tour traverses every MST edge exactly twice (DFS traversal), so it costs \(2\,c(T_G) \leq 2\,c(H^*_G)\); that's the reason it's called the 2T approximation. It's worth saying that when we remove duplicates (step 4), the cost can only decrease due to the triangle inequality.
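The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the plugin's C# code: it assumes Euclidean distances between 2D points, builds the MST with Prim's algorithm instead of Kruskal's to stay short, and replaces the explicit edge duplication and Hierholzer tour with a preorder DFS of the tree, which visits the nodes in the same order as the shortcut Eulerian tour. The function name and sample points are hypothetical.

```python
from math import dist


def double_tree_tsp(points):
    """Return (tour, tour length, MST cost) for the Double-Tree approximation."""
    n = len(points)
    inf = float("inf")
    in_tree = [False] * n
    best = [inf] * n      # cheapest edge linking each node to the tree
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[0] = 0.0

    # Prim's algorithm on the complete graph.
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v]:
                w = dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    mst_cost = sum(best)

    # Preorder DFS = Eulerian tour of the doubled tree with repeats removed.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    tour.append(0)

    length = sum(dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))
    return tour, length, mst_cost


# Hypothetical sample points.
sample_points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 5)]
tour, length, mst_cost = double_tree_tsp(sample_points)
print(tour, round(length, 2), round(mst_cost, 2))
```

For any set of points the printed tour length lies between the MST cost and twice the MST cost, which is the guarantee discussed above.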

We can improve on the Double-Tree approximation with the Christofides algorithm, which in the worst-case scenario is 3/2 times the optimal solution. First we'll explain the steps, and afterward we'll show why it is a 1.5 approximation.

- Find an MST, which we call \(T_G\)
- Find the subset of vertices in \(T_G\) with odd degree, which we call \(S\) (there will always be an even number of vertices with odd degree; later we'll explain why).
- Find a Minimum Perfect Matching \(M\) on \(S\). In the code, we use linear programming to find \(M\).
- Add the set of edges of \(M\) to \(T_G\). As you can see in the image below, multi-edges are allowed (look at the edges between nodes P and N).
- Find an Eulerian tour
- Shortcut the Eulerian tour (remove duplicate vertices)

Now that we've understood the steps of the Christofides algorithm, let's try to understand the reason behind them.

**Why do we want to find the set of odd-degree vertices S?**

The main strategy of the Christofides algorithm is to find an Eulerian tour from the MST and then shortcut it (removing the duplicate nodes). For a graph to have an Eulerian tour, every vertex must have even degree. We want to find the set of odd-degree vertices because we need to turn them into even-degree ones somehow.

**Why do we compute the Minimum Perfect Matching on S?**

The idea is to add one degree to every odd-degree vertex. We can achieve this by finding a perfect matching on S (the set of odd-degree nodes): once its edges are added, every vertex has even degree and we can find an Eulerian tour. The Minimum Perfect Matching is the optimal way to add these edges (adding the minimum possible cost).

**Why will there always be an even number of odd-degree vertices?**

We know by the handshaking lemma that the sum of all vertex degrees in a graph is double the number of edges:

$$\sum_{v \in V} \deg v = 2|E|$$

Where \(V\) is the set of all vertices in \(G\). Let's divide \(V\) into 2 sets of vertices:

\(V = R \ \cup \ S\)

\(R =\) Set of even-degree vertices in \(G\)

\(S=\) Set of odd-degree vertices in \(G\)

So we can express the handshaking lemma as follows:

$$(\deg r_1 + \deg r_2 + \dots + \deg r_k) + (\deg s_1 + \deg s_2 + \dots + \deg s_p) = 2|E|$$

The right side of the equation (\(2|E|\)) is an even number, so the left side has to be even as well. The sum of the degrees of the even-degree vertices is a sum of even numbers, so it is even.

$$(\deg{r_1} + \deg r_2 + \dots + \deg r_k)\ \text{is an even number}$$

This means that the sum of the degrees of the odd-degree vertices must be even as well, to keep the whole left side even.

$$(\deg{s_1} + \deg s_2 + \dots + \deg s_p)\ \text{is an even number}$$

A sum of \(p\) odd numbers is even only if \(p\) is even, so the number of odd-degree vertices is even.
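The parity argument is easy to check on a concrete graph. The short Python sketch below (function name and sample tree are hypothetical) computes the degree of every vertex from an edge list and returns the odd-degree ones, which is exactly the set \(S\) that Christofides needs.

```python
def odd_degree_vertices(n, edges):
    """Return the vertices with odd degree in an undirected graph."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Handshaking lemma: every edge adds 2 to the sum of degrees.
    assert sum(degree) == 2 * len(edges)
    return [v for v in range(n) if degree[v] % 2 == 1]


# Hypothetical spanning tree on 5 vertices.
tree_edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
odd = odd_degree_vertices(5, tree_edges)
print(odd, len(odd) % 2 == 0)  # [0, 1, 2, 4] True
```

Since the count is always even, the odd-degree vertices can always be paired up, so a perfect matching on \(S\) exists.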

**Why is Christofides a 1.5 approximation of the TSP?**

The first step of Christofides is to find an MST (as in the 2T Double-Tree discussed before), and we already know that its cost is a lower bound on the optimal solution on G. The question then is why adding the Minimum Perfect Matching edges adds, in the worst-case scenario, at most 0.5 times the optimal cost. To understand this, let's think about the 2 perfect matchings shown in the following picture, which are built from the optimal TSP solution on the set S, \(H_S^*\), by taking alternate edges.

\(M_1\) and \(M_2\) are perfect matchings on \(S\) taken from \(H_S^*\), but they are not necessarily the Minimum Perfect Matching \(M\), so we can state:

\(c(M) \leq c(M_1) \ \text{and} \ \ c(M) \leq c(M_2)\)

This implies that \(c(M)\) is less than or equal to the average of \(c(M_1)\) and \(c(M_2)\).

$$c(M) \leq \frac{1}{2} (c(M_1) + c(M_2))$$

As said before, the MST is a lower bound on the optimal TSP solution, meaning \(c(T_G) \leq c(H_G^*)\).

We also know that \(M_1\) and \(M_2\) together make up \(H_S^*\):

\(c(H_S^*) = c(M_1)+c(M_2)\)

Set S has fewer vertices than G, so, by the triangle inequality:

\(c(H_S^*) \leq c(H_G^*)\)

Then we can conclude:

$$c(M) \leq \frac{1}{2} c(H_S^*) \leq \frac{1}{2} c(H_G^*)$$

So finally:

$$c(\text{Eulerian Tour}) = c(T_G) + c(M) \leq \frac{3}{2}c(H_G^*)$$

Until now we have explored some optimal approaches (linear and dynamic programming) and some approximation algorithms (Double-Tree and Christofides). I think it's worth understanding things from the ground up and, in computer science, testing your knowledge by implementing the concepts in code yourself. As much as you can, avoid "black boxes".

However, once we know what we are talking about, it's also important to explore the tools out there that are already implemented, optimized and maintained; maybe there is an open-source tool we can use to achieve our goal. This also matters because we can build from there instead of reinventing the wheel.

Google OR-Tools is an open-source library that can help us a lot with the TSP and related concepts (for example linear and integer programming). OR stands for "Operations Research".

OR-Tools is an open source software suite for optimization, tuned for tackling the world's toughest problems in vehicle routing, flows, integer and linear programming, and constraint programming.

We are going to add the option to use OR-Tools to solve the TSP. OR-Tools also provides an approximation of the TSP, but it applies a `first solution strategy` and afterward refines the result with other algorithms.

First solution strategies are listed here; some of them are:

- CHRISTOFIDES: We know about it!
- PATH_CHEAPEST_ARC: Starting from a route "start" node, connect it to the node which produces the cheapest route segment, then extend the route by iterating on the last node added to the route.
- GLOBAL_CHEAPEST_ARC: Iteratively connect two nodes which produce the cheapest route segment.

The next image shows different TSP solutions obtained with OR-Tools for a 50-vertex graph, using different first solution strategies.

As we have seen, the OR-Tools TSP implementation provides very good approximations, and the algorithms run quite fast even when dealing with graphs with many vertices. On my computer (which is not a super-computer) it takes 3.19 s to give a solution for a graph of 500 points, and 13.47 s for one of 1,000 points.

Another cool thing about OR-Tools is that it can solve `vehicle routing` problems, which can be seen as an extension of the TSP. Imagine that you have a company that has to deliver to 200 different points in the city, and you have 4 vehicles. What route would you give to each vehicle in order to optimize the delivery time? Well, OR-Tools can help you with this!

You can find the full code in this Github Repository.

If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below. You can connect with me on **LinkedIn**, **Medium**, **Twitter**, **Facebook**.

Thanks for reading!

Recently I read about the Maximum Network Flow Problem, whose goal is to find the maximum flow rate for a given network. Imagine a network of pipes represented as a weighted graph, with a source S and a sink T (see graph below). The problem can be stated as: what is the maximum flow of water that I can put into S without the pipes breaking? There are many applications where we can use this problem (bipartite matching problem, baseball elimination, airline scheduling, and many more).

The solution for this example graph is 21, which means you can't put more than 21 units of flow into S without breaking pipes. Usually there isn't a unique flow assignment that supports that amount of flow; the following is one of them. As can be seen, the sum of the edge weights leaving S is 18 + 3 = 21.

The first approach I took to understand this problem (and implement it in code) was the Ford-Fulkerson algorithm. I had to read many articles and watch lots of videos to get there. The process was a bit frustrating because there were many new concepts to learn, such as residual graphs and augmenting paths... but it was all worth it and a great experience. Then, while I was doing more research on this problem, a new mysterious concept appeared... Linear Programming (LP)! At first I thought it was simply another way to solve the problem, but then I noticed the beauty and the power of this technique. With it I was able to model and solve the same problem much faster than with the previous algorithm.

But the power of linear programming goes far beyond solving this particular problem. As long as you can model your problem in the way this technique needs, there is a good chance you can use it to solve it. Some examples where LP can be used are finding a minimum-cost perfect matching, optimal assignments, shortest paths or even solving a sudoku!

The purpose of this article is to show this, using the max network flow problem as an example. First I will explain the Ford-Fulkerson algorithm implementation, then I'm going to solve the same problem using Linear Programming. In order to show different LP libraries we'll solve it using Python (pulp library) and C# (or-tools).

I'm not going to explain in detail how this algorithm works, but I want to give some intuition about it and then compare its complexity to the LP modeling approach.

The input for the Ford-Fulkerson algorithm is a network capacities graph, as shown above. Then we follow these steps:

- Initialize a `flow network graph` with all edge weights equal to 0. We will update this graph in each iteration. At step 0 we haven't sent any flow, which is why all its weights are 0.
- Initialize a `residual network graph`, which at step 0 is a copy of the input capacities graph. This graph will be used to find possible `augmenting paths`. It's important to note that once you have sent flow, you can `undo` it; this is represented by red lines in the figure below.
- Find a possible path from S to T in the `residual network graph`. If there is a path, it means it is still possible to add more flow to the network, which is why these paths are called `augmenting paths`.
- Compute the `bottleneck` for the current `augmenting path`.
- Update both the `flow network graph` and the `residual network graph`.
- Go back to step 3 until there isn't a path from S to T in the `residual graph`.

The following images show this process for our example.

Next I describe my implementation of the Ford-Fulkerson algorithm. It can probably be improved in many ways, but it has all the steps mentioned before. Here is a brief summary of the code:

- `getMaxFlowNetwork`: This is the main function. It receives the capacities graph as an adjacency matrix and returns the maximum flow.
- `findPossiblePath`: Returns a possible path from S to T for a given residual network graph; if there is no path, it returns undefined. Uses DFS traversal.
- `getBottleneck`: Receives a possible path from S to T and the current residual network graph, then returns the bottleneck, meaning the minimum edge weight along that path.
- `updateFlowNetwork`: Updates the current flow network graph, receiving as input the augmenting path found (path and bottleneck).
- `updateResidualNetwork`: Updates the current residual network graph, receiving as input the updated flow network graph.

You can find this code in the following Gist.
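As a complement to the Gist, the steps described above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the Gist: the function names `find_path` and `max_flow` and the small 4-node network are hypothetical, and for brevity it only tracks the residual graph and the total flow instead of a separate flow network graph.

```python
def find_path(residual, source, sink):
    """Return one source-to-sink path with spare capacity (DFS), or None."""
    stack = [[source]]
    visited = {source}
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == sink:
            return path
        for nxt, cap in enumerate(residual[node]):
            if cap > 0 and nxt not in visited:
                visited.add(nxt)
                stack.append(path + [nxt])
    return None


def max_flow(capacity, source, sink):
    """Ford-Fulkerson on an adjacency matrix of capacities."""
    residual = [row[:] for row in capacity]  # residual graph starts as a copy
    total = 0                                # no flow has been sent yet
    path = find_path(residual, source, sink)
    while path is not None:                  # repeat while an augmenting path exists
        edges = list(zip(path, path[1:]))
        bottleneck = min(residual[u][v] for u, v in edges)
        for u, v in edges:
            residual[u][v] -= bottleneck     # less spare capacity forward
            residual[v][u] += bottleneck     # flow that can later be undone
        total += bottleneck
        path = find_path(residual, source, sink)
    return total


# Hypothetical 4-node network (not the article's graph): 0 = S, 1 = A, 2 = B, 3 = T
capacity = [
    [0, 3, 2, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
print(max_flow(capacity, 0, 3))
```

The reverse entries `residual[v][u]` are what allow a later augmenting path to undo flow sent earlier.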

Informally, Linear Programming is a way to find the maximum or the minimum value of a function (the cost function, which has to be linear), given a set of constraints (which also have to be linear).

Example:

- Cost Function: $$ C(x_1, x_2, x_3) = 7x_1 + 3 x_2 - x_3 $$
- Constraints: $$ x_1 \geq 0 $$ $$ x_2 \geq 0 $$ $$ x_3 \geq 0 $$ $$ x_1 + 2x_2 + x_3 = 3 $$

The first 3 constraints are common in linear programming problems; they mean the solution has to be in the positive octant. As we can see in the picture, the last constraint is a plane. This leads us to a triangle, where one of its vertices will be the solution of the optimization. $$ A = (3,0,0) \implies C(A) = 7(3)+3(0)-(0) = 21 $$ $$ B = (0,1.5,0) \implies C(B) = 7(0) + 3(1.5) -(0) = 4.5 $$ $$ C = (0,0,3) \implies C(C) = 7(0) + 3(0) -(3) = -3 $$

Therefore:

- Point A leads us to the maximum of C, obtaining 21
- Point C leads us to the minimum of C, obtaining -3
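The same check can be done in a few lines of plain Python. This is only a sketch of the vertex evaluation for this particular example (the names `cost` and `vertices` are illustrative); a real LP solver explores the feasible region for us.

```python
def cost(x1, x2, x3):
    """Cost function C(x1, x2, x3) = 7*x1 + 3*x2 - x3 from the example."""
    return 7 * x1 + 3 * x2 - x3


# Vertices of the triangle x1 + 2*x2 + x3 = 3 in the positive octant
vertices = {"A": (3, 0, 0), "B": (0, 1.5, 0), "C": (0, 0, 3)}

values = {name: cost(*point) for name, point in vertices.items()}
best = max(values, key=values.get)
worst = min(values, key=values.get)

print(values)
print("maximum at", best, "minimum at", worst)
```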

Maybe after seeing this basic example you are thinking that LP isn't as powerful a tool as I said, but believe me, this tool shines because of the simplicity of its models. You only need a linear cost function and linear constraints, but they can be far more complex than in this basic example. Furthermore, you can add the constraint that variables should be integers (ILP); in fact, we will use this to solve the previous Maximum Network Flow problem.

In order to use LP to solve Max Network Flow we only need to model it as LP requires. As long as we can express the variables, the cost function to maximize, and the constraints (in a linear way), LP will do the job for us!

Intuitively, it is easy to see that our variables should be the final flow values for each graph edge. Then, our variables can be named as follows:

- Edge_S_A
- Edge_A_B
- Edge_A_C
- Edge_A_D
- etc

What do we want to maximize? The answer is not hard to find: we want to maximize the amount of flow sent from the source (S). So, in our case:

- Cost Function = Edge_S_A + Edge_S_C

We want to find the maximum of this value.

What will our constraints be? Our intuition can lead us easily to them:

**Conservation of flow**: For each node, the amount of flow received must be equal to the amount sent. This can be expressed using our variables as:

- Edge_S_A = Edge_A_B + Edge_A_C + Edge_A_D
- Edge_A_B + Edge_D_B = Edge_B_T
- Edge_S_C + Edge_A_C = Edge_C_D
- Edge_C_D + Edge_A_D = Edge_D_B + Edge_D_T

**Edge capacities**: This is easy, each edge can't exceed its capacity:

- Edge_S_A ≤ 18
- Edge_A_B ≤ 9
- Edge_A_C ≤ 2
- Edge_A_D ≤ 10
- etc
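To make the two families of constraints concrete, here is a small Python sketch that checks whether a candidate solution satisfies them. The function `is_feasible`, the dictionary layout and the 4-node network are hypothetical (the full capacities of the article's graph are not listed here), and the check also assumes flows cannot be negative.

```python
def is_feasible(flow, capacity, source, sink):
    """Check edge capacities and conservation of flow for a candidate solution.

    flow and capacity are dicts keyed by (from_node, to_node).
    """
    # Edge capacities: 0 <= flow <= capacity on every edge (non-negativity assumed)
    for edge, value in flow.items():
        if value < 0 or value > capacity[edge]:
            return False
    # Conservation of flow at every node except the source and the sink
    inner_nodes = {n for edge in capacity for n in edge} - {source, sink}
    for node in inner_nodes:
        received = sum(v for (a, b), v in flow.items() if b == node)
        sent = sum(v for (a, b), v in flow.items() if a == node)
        if received != sent:
            return False
    return True


# Hypothetical network, not the one from the article
capacity = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1, ("A", "T"): 2, ("B", "T"): 3}
candidate = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1, ("A", "T"): 2, ("B", "T"): 3}

print(is_feasible(candidate, capacity, "S", "T"))
# Value of the cost function: total flow leaving the source
print(sum(v for (a, b), v in candidate.items() if a == "S"))
```

An LP solver searches among all assignments that pass this kind of check for the one with the largest cost function value.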

And that's it! This is all we need to solve the problem using LP! See how easy it was to model the problem using LP compared with the previous method! Of course, the Ford-Fulkerson algorithm has a lot of value and it allows us to understand more about how to find a solution, but LP is like a black box that does the job for us.

PuLP is an LP library for Python with very good documentation, full of examples.

Here we use another LP library called OR-Tools to solve our sample problem with the C# language.

If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below. You can connect with me on LinkedIn, Medium, Twitter, Facebook.

Thanks for reading!

Autodesk AutoCAD is a program widely used by engineers and designers to create 2D and 3D models. It has a great interface with lots of options, and it's intuitive, easy-to-learn software that allows the user to achieve great results quickly. CAD stands for Computer-Aided Design, and Auto for Autodesk. The latest stable version was released in 2022.

At the same time, being able to use AutoCAD through programming can give us a powerful tool to solve many problems. This article explains how to do this. We'll use the .NET AutoCAD API to create a plugin that will compute the shortest path for a given graph, from each node to every other node (using Dijkstra's algorithm).

Our input is an `undirected graph` where the edge weights correspond to their length.

The output we want to get is 2 tables:

- Shortest path `distance` from each node to every other node.
- Shortest path `route` from each node to every other node.

Now, for example, if we want to know the shortest path between B and E, we know the shortest route is `B I G J E` and the length of that path is 48.79.

Here we'll make a very simple plugin that, once loaded in AutoCAD, will respond to the command `hello` and draw the following circle in the model. This will be useful to learn the first steps to create any .NET plugin for AutoCAD.

The first step is to download and install Visual Studio Community, then create a new project selecting `C# Class Library` for `.NET Framework`. We can name it `MyFirstCadPlugin`.

Once the new project is created, we need to add the `AutoCAD dll references` to access the .NET AutoCAD API. These references are listed below, and are located in the `Program Files` folder where AutoCAD is installed.

- acmgd.dll
- acdbmgd.dll
- accoremgd.dll

Now we have to configure the project's debug properties, setting the option `start external program` to start AutoCAD (`acad.exe`) while debugging.

Next we should uncheck `loader lock` in the Exception Settings in order to allow Visual Studio to execute AutoCAD while debugging.

We can use the following code in `Class1.cs` to create the plugin. This code, as explained in its comments, first connects to the active AutoCAD document and database, then creates a transaction where a Circle and a Text entity are defined.

Finally, if we press `Start` in Visual Studio, a new AutoCAD instance will appear. Then we can load our plugin by typing the "netload" command and searching for `MyFirstCadPlugin.dll`, stored in `/bin/Debug` in our C# project files. Once loaded, typing hello in the command bar will make the circle with "hello!" inside appear in the model!

We already know how to create a basic .NET plugin for AutoCAD, so we can go deeper and focus on our real goal, which is to create a program that will compute the `Shortest Path Matrices` for a given graph. The following Windows form summarizes the functionality of the program.

Here is how it works:

- The `Insert Sample Graph` button will draw a sample graph into the model. This is useful to show the user an example to try the program. In order to use a custom graph, the block nodes should be of the same type as in the sample graph (block's name: "node", and with a text label).
- The `Generate Shortest Path Matrix` button will prompt the user to select a graph, and then will generate the output matrices and save them as CSV files in the folder selected by the user.

We know a graph is composed of a set of edges and nodes, but we have to use AutoCAD elements to represent them. The edges can be easily treated as `Lines` or `Polylines`, but for the nodes there is no such direct AutoCAD object. Every node has 2 properties: position (x and y) and label. For example, the following picture shows a node where label = "B" and position (x = 138.89, y = 169.11).

There is an element in AutoCAD that can be used to represent nodes in a very simple and natural way, and it's called a `Block`. We will create a custom block with the desired shape to represent the nodes.

The following code (commented below) is used to create custom blocks in C#.

These are the main functions:

- `CircleBlockNodeEntities`: This method returns a list of entities to create a block node shaped as a circle with a letter inside. There are 2 entities in this block: circle and text.
- `LeaderBlockNodeEntities`: Returns a list of entities to create a block node shaped as a leader line with its label above, like the following picture. There are 3 entities in this block: polyline, circle, and text.
- `InsertBlockNodeToDb`: This method creates a block in the current model database. Its arguments are the list of entities returned by one of the methods explained before and the name we want to give to that block. For example, the following code will create a block named "node" with the `CircleBlock` entities.

```
List<Entity> blockNodeEntities = BlockNodeCreator.CircleBlockNodeEntities(acCurDb, new Point3d(0, 0, 0));
BlockNodeCreator.InsertBlockNodeToDb(bt, acDoc, acCurDb, "node", blockNodeEntities);
```

- `DrawBlockNodeToModel`: This function draws a block node into the model. It receives as arguments the block's name, label, and position. For example, the following code will draw a block named "node" with the label "B" at the (20, 100, 0) position.

```
DrawBlockNodeToModel(bt, acBlkTblRec, "node", "B", new Point3d(20, 100, 0));
```

We have solved how we are going to represent a graph through AutoCAD elements; now we have to add some functionality to draw an entire sample graph. But where are we going to read the information to draw that sample graph from? In other words, how are we going to tell the program the set of edges (polylines or lines) and nodes (blocks) to be drawn?

Here is where `CSV files` (tables) can help us do the job. The sample graph will be described with 2 separate CSV files, one for the nodes and another one for the edges. They will be structured as follows.

`nodes.csv`

`edges.csv`

Each row of `nodes.csv` defines a node, with its label and position, and each row of `edges.csv` has the information of a polyline vertex. These CSV files are embedded files in the Resource Folder. The next image corresponds to the 2 polylines defined in the above table:

With these 2 CSV files and the appropriate code to read them we can draw any sample graph into the current AutoCAD model. The code to do this is presented next.

This code can be summarized as follows:

- Function `GetNodes` reads the corresponding CSV file and returns a list of nodes.
- Function `GetEdges` reads the CSV file and returns a dictionary where the *keys* are the `polyline_id` and the *values* are `Polyline AutoCAD objects`.
- Function `InsertSampleGraph` draws into the model the sample graph defined by the CSV files, through the 2 functions defined above.
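The reading logic of the first two functions can be illustrated outside AutoCAD with a short Python sketch. It is not the plugin's C# code: the column names (`label`, `x`, `y`, `polyline_id`) and the sample rows are assumptions made for illustration, and plain tuples stand in for the AutoCAD `Polyline` objects.

```python
import csv
import io

# Hypothetical file contents; the real column names and values may differ.
NODES_CSV = """label,x,y
A,0,0
B,10,0
"""
EDGES_CSV = """polyline_id,x,y
1,0,0
1,5,5
1,10,0
"""


def get_nodes(text):
    """Return a list of (label, (x, y)) tuples, one per CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["label"], (float(row["x"]), float(row["y"]))) for row in reader]


def get_edges(text):
    """Return {polyline_id: [vertex, ...]}, keeping vertices in file order."""
    edges = {}
    for row in csv.DictReader(io.StringIO(text)):
        vertex = (float(row["x"]), float(row["y"]))
        edges.setdefault(row["polyline_id"], []).append(vertex)
    return edges


print(get_nodes(NODES_CSV))
print(get_edges(EDGES_CSV))
```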

So far we know how to represent a graph with AutoCAD, and how to plot a sample one. It's time to attack our main goal, which is, for a given graph, to get the Shortest Path Matrices (one for the shortest distance, and the other with the path to achieve that distance).

First we need to prompt the user to select a graph in the model. We do this through the following piece of code.

This code prompts the user to select the graph, then returns an array of `ObjectId` with all the selected elements. This function is pretty reusable for other AutoCAD plugins we want to build, because we'll often need the user to select something in the model.

Next is presented the code to perform Dijkstra and save the Shortest Path Matrices as CSV files.

The logic this code follows is:

1. Filter the `ObjectId` array that comes from the `GraphModelSelector` function presented above. Every polyline and line will be converted to an `Edge`, and every block node to a `Node`.
2. Generate the `Adjacency Matrix` from the list of edges and nodes. We create a dictionary from the nodes list, where the *key* is a Tuple with the point coordinates, and the *value* is a Tuple with the node's label and index. Then, if we iterate over every edge and check whether both its `start_point` and `end_point` are keys in the dictionary, we can update the adjacency matrix, because those points are connected at a distance equal to the edge length. The next piece of code explains this (see lines 117 to 140 from the previous gist). Below is presented the adjacency matrix for our sample graph.
3. Build a function to perform Dijkstra's algorithm, having as arguments the adjacency matrix, the starting point, and the node list. This function is called `PerformDijkstra`, as you can see in the above gist, and will return an array of the struct `DistanceAndRoute`. For example, if we invoke the function for the second node (labeled B), it will return an array with the shortest distances from node B to every other node, and another array with the routes associated with those paths. See picture below.
4. Finally we `PerformDijkstra` from every node in order to obtain the output we want. `GenerateShortestPathMatrix` does this job and outputs the Shortest Path Matrices as 2 CSV files (one for the distances, and the other for the routes).
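The adjacency matrix and Dijkstra steps above can be sketched in Python, independently of AutoCAD. This is a minimal illustration, not a port of the plugin's C# code: the function names, the tuple-based node and edge representation and the 3-node graph are all hypothetical.

```python
import heapq
import math


def build_adjacency_matrix(nodes, edges):
    """nodes: list of (label, point); edges: list of (start, end, length)."""
    index_by_point = {point: i for i, (_, point) in enumerate(nodes)}
    n = len(nodes)
    matrix = [[math.inf] * n for _ in range(n)]
    for start, end, length in edges:
        # Only edges whose two end points coincide with nodes are used
        if start in index_by_point and end in index_by_point:
            i, j = index_by_point[start], index_by_point[end]
            matrix[i][j] = matrix[j][i] = length  # undirected graph
    return matrix


def dijkstra(matrix, start):
    """Return (distances, routes) from start to every node index."""
    n = len(matrix)
    dist = [math.inf] * n
    prev = [None] * n
    dist[start] = 0.0
    queue = [(0.0, start)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in enumerate(matrix[u]):
            if d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(queue, (dist[v], v))
    routes = []
    for target in range(n):
        route, node = [], target
        while node is not None:  # walk back to the start node
            route.append(node)
            node = prev[node]
        routes.append(route[::-1])
    return dist, routes


# Hypothetical graph: A-B has length 1, B-C length 2, A-C length 5
nodes = [("A", (0, 0)), ("B", (1, 0)), ("C", (1, 2))]
edges = [((0, 0), (1, 0), 1.0), ((1, 0), (1, 2), 2.0), ((0, 0), (1, 2), 5.0)]

matrix = build_adjacency_matrix(nodes, edges)
dist, routes = dijkstra(matrix, 0)
print(dist)
print([[nodes[i][0] for i in route] for route in routes])
```

Calling `dijkstra` once per node index yields the rows of the two output matrices (distances and routes).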

And that's it! We've built the plugin and it does exactly what we wanted!

Once we are sure we have tested our program, it's time to move from `debug` mode to `release` mode. We can change this in the Build menu, under "Configuration manager".

In order to load the plugin, we open AutoCAD and type "netload" in the command bar. Then a menu will show up; we must search in our project files and, in the Release folder, select the DLL with the project's name, for example `ShortestPathMatrix.dll`.

Now we can type `shortestpath` in the command bar, and our form will appear! Our plugin is ready to be used!

I hope you liked reading this article. As I said in the beginning, being able to use AutoCAD by coding is a powerful tool we can use to solve many problems. In this GitHub repository you can find all the project files.

In this article we are going to use a dataset of employees as an example to find insights and relations between variables using regression, and to interpret the resulting reports.

Each employee's address is described as `latitude` and `longitude`, so we could check if there is a correlation between these variables and the employee wage. It's important to note that this is fictional data (employees live in the middle of the Pacific Ocean 🌎😂).

As we can see in the following graphs both variables seem to be correlated with wage.

- As latitude and longitude increase (north-east direction), wages grow
- As wages grow, employees seem to prefer to live in the north-east of the city

But... Is there causality? If yes, which is the dependent and which the independent variable? In other words, which variable is the cause of the other? This answer is not always easy to find, but in this example we can infer the company is located in a city where the better neighborhoods are in the north-east. So, as employees earn more, they prefer to move to the north-east of the city. Wage is the independent variable, and latitude and longitude are the dependent ones.

**Definitions**:

**Causal Effects for variable X**: Changes in outcomes due to changes in X, holding all the rest of the variables constant. Later we are going to make a model to predict employees' wage based on several variables like gender, age, location, etc. We can say there is a causal effect of gender on wage if, holding the rest of the variables constant, changing the gender causes a change in wage.

**Confounding variable**: A variable that influences both the dependent variable and the independent variable, causing a spurious association. Imagine we find that motorbike accidents are highly correlated with the sale of umbrellas. As umbrella sales go up, motorbike accidents increase. We could think that umbrella sales are the cause of motorbike accidents, but what really happens is that rain is affecting both variables (umbrellas and accidents).

Now we are going to interpret the linear regression report, taking as an example the `wage`-`latitude` regression. This is the equation for simple linear regression:
$$
y = \alpha + \beta \ x
$$

Next table shows the regression report:

**R-squared**: This number is the fraction of the variance explained by the model. In our case it's just 2%, a very low number, but still positive (better than a horizontal line at the mean value). $$ R^2 = 1 - \frac{\text{unexplained variation}}{\text{total variation}} = 1 - \frac{SS_r}{SS_t} = 1- \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\overline{y})^2} $$

**Adjusted R-squared**: R-squared comes with an inherent problem: if we add any independent variable to the regression (multilinear regression), even if the variable doesn't have any relation with the dependent one, R-squared will increase or stay the same. The adjusted R-squared "fixes" that problem. Adjusted R-squared is always less than or equal to R-squared. $$ R^2_{\text{adjusted}} = 1- \frac{(1 - R^2) \ (n-1)}{n-k-1} $$ Where \(n\) is the number of points in our data sample, and \(k\) the number of independent variables.

**F-statistic**: This test is used to see if we can reject the following null hypothesis: $$ H_0: \beta = 0 $$ $$ H_1: \beta \neq 0 $$ If we can't reject H0, our regression is useless, because our coefficient is not statistically significant. As we can see in the output for the `wage`-`latitude` regression, the p-value is less than 5%, so we can reject H0, meaning that our slope coefficient is statistically significant.

**Log-likelihood, AIC, BIC**: Without getting too far into the math, the log-likelihood (\(l\)) measures how well a model fits the data. As we add more parameters the log-likelihood will increase, but we don't want our model to over-fit; that's why we add the number of parameters (\(k\)) into the equation. $$ \text{AIC} = 2 \ k - 2 \ l $$ $$ \text{BIC} = \ln{(n)} \ k - 2 \ l \ $$ When comparing models, we should pick the one with the lowest AIC and BIC (low number of parameters and high log-likelihood). AIC and BIC differ in the first coefficient: BIC multiplies \(k\) by \(\ln{n}\), so its penalty for extra parameters grows with the number of samples \(n\).

**Variables section**: This is maybe the most important part of the regression output. It means that our equation would look as follows: $$ \text{latitude} = -10.73 + 1.67 \times 10^{-6} \ \text{wage} $$ The rest of the table (standard_error, t, p_value, confidence_interval) is really showing us one piece of information in different ways: the statistical significance of the coefficient. The constant term (const) doesn't tell us too much (theoretically it would be the latitude for wage = 0), but it has to be there to build our line equation. In our example, the wage term has p_value = 0.4% < 5%, so we can consider it statistically significant.
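The formulas for R-squared, adjusted R-squared, AIC and BIC are simple enough to write by hand. The following Python sketch does so with made-up numbers (the values of `y`, `y_hat` and the log-likelihood are illustrative, not taken from the employees dataset); in practice a regression library reports all of them for us.

```python
import math


def r_squared(y, y_hat):
    """1 - SS_r / SS_t for observed values y and predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_t = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_r / ss_t


def adjusted_r_squared(r2, n, k):
    """n = number of data points, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood


def bic(n, k, log_likelihood):
    return math.log(n) * k - 2 * log_likelihood


# Made-up observations and predictions
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=len(y), k=1))
print(aic(k=2, log_likelihood=-10.0), bic(n=100, k=2, log_likelihood=-10.0))
```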

Usually reality is too complex to explain one term with just one parameter, that's the reason why we want to add more variables in our regression: $$ y = \beta_0 + \beta_1 \ x_1 + \beta_2 \ x_2+ \beta_3 \ x_3 + \text{...} + \beta_i \ x_i $$

Following our employees' example dataset, now we're going to make a model to predict the wage based on `latitude` and `longitude` (the other way around from before). Later we'll make another model with more parameters and check if our regression improves.

This is our regression outcome:

As we can see, this regression doesn't have much value for the following reasons:

- F-statistic p_value is greater than 5%, meaning that we can't reject the null hypothesis: $$ H_0: \beta_1 = \beta_2 = 0 $$
- All coefficients p_value are also greater than 5%.
- R-squared is less than 2%.

Now we're going to try to improve the model adding the following variables:

- Gender
- Age
- Nationality
- Civil status
- Contract type (fixed or indefinite term)
- Management level (top-level, middle-level, low-level, laborer)

As you can see, almost all of these variables are categorical (except age), so, in order to apply regression, we have to convert them into dummy variables. We can do this easily with `pandas` as follows:

```
df = pd.get_dummies(df, columns=['gender', 'nationality_group', 'management_level', 'contract_type'], drop_first=True)
```

We use `drop_first` because, for a categorical variable with n categories, if we know (n-1) of the dummies we can infer the missing one (example: contract_type_indefinite_term = 0 means fixed_term contract).
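To see what the dummy encoding produces, here is a plain-Python sketch of the idea for a single column. It is only an illustration (the helper `dummies_drop_first` is hypothetical and is not how `pandas` is implemented); it builds one 0/1 column per category and skips the first category in sorted order.

```python
def dummies_drop_first(values, prefix):
    """One 0/1 column per category, skipping the first (sorted) category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [int(v == category) for v in values]
        for category in categories[1:]
    }


contract_type = ["fixed_term", "indefinite_term", "fixed_term"]
print(dummies_drop_first(contract_type, "contract_type"))
```

With only two categories a single column remains, and a 0 in `contract_type_indefinite_term` identifies a fixed-term contract.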

The regression output is shown next:

This model is much better than the one before:

- F-statistic p_value < 5%
- Many of the regression parameters are statistically significant. The highest t-values correspond to `management_level`, meaning that this variable is clearly affecting `wage`.
- R-squared shows the model explains 83% of the variance (a big improvement over the previous model).
- AIC and BIC are lower than the previous model.

Now we can measure model's accuracy through the following concepts:

- MAE: Mean Absolute Error = 3.912 USD $$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{actual_values} - \text{predicted_values}| $$
- RMSE: Root Mean Squared Error = 6.578 USD $$ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\text{actual_values} - \text{predicted_values})^2} $$
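Both metrics follow directly from their formulas. A minimal Python sketch, using made-up actual and predicted values rather than the article's dataset:

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Made-up values for illustration
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print(mae(actual, predicted), rmse(actual, predicted))
```

Because the errors are squared before averaging, RMSE is never smaller than MAE and gives more weight to large errors.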

We use logistic regression when the dependent variable is categorical. For example, using our employees' dataset, let's say we want to predict whether an employee is a laborer or not, based on wage. $$ p(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 \ x)}} $$

Next table is the logistic regression output:

The same way we can make a multilinear regression, we can build a multilogistic regression, using more than 1 independent variable. Following same example as before, we can try to predict whether an employee is a laborer or not, not only with wage, but also with other parameters like age, gender, nationality, etc.

$$ p(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 \ x_1 +\beta_2 \ x_2+ \text{...} + \beta_m \ x_m)}} $$
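The logistic function itself is easy to evaluate once the coefficients are known. The sketch below uses arbitrary coefficients chosen for illustration (they are not the fitted values from the regression output); the helper name `logistic` is also an assumption.

```python
import math


def logistic(x, betas):
    """p(x) for features x = [x_1, ..., x_m] and betas = [beta_0, ..., beta_m]."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return 1 / (1 + math.exp(-z))


# One feature: beta_0 = 0 and beta_1 = 1 are arbitrary example coefficients
print(logistic([0.0], [0.0, 1.0]))
# Two features with arbitrary coefficients
print(logistic([1.0, 2.0], [0.5, -1.0, 0.25]))
```

The output is always between 0 and 1, which is why it can be read as a probability and turned into a class with a threshold such as 0.5.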

Here you can find the Jupyter notebook and the dataset used to write this article.

Thank you for reading!

Then someone taught me about the Central Limit Theorem and how to compare means with the t-test, but at that time I didn't fully understand those concepts, and I realized that I had just learned how to solve problems and exercises about them.

I think it's important to understand the "whys" and the intuition behind the theory before learning the mechanics of solving exercises about it. Tools like Python now make that much easier, and that's the purpose of this article. We are going to use Python to illustrate the first steps towards inferential statistics, with key concepts like the Central Limit Theorem and t-tests. As I usually say: "When I code it, then I understand it."

The CLT states that if we have a population of some variable, take m samples of size n, and calculate some parameter in each sample (for example mean, median, standard deviation, etc.), the distribution of those m parameters will approach a normal distribution as n increases, and its variance will decrease as n increases (the distribution curve will narrow). This is true even if the original population doesn't follow a normal distribution.

As said before, I think the best way to understand the CLT is to practice with some data and obtain the expected results. We are going to apply the CLT to 3 distributions, computing the mean, the median and the standard deviation.

- Uniform distribution
- Normal distribution
- Binomial distribution

Now we are going to check the CLT by plotting histograms of sample parameters and showing how they change as we increase n (the sample size). We'll also plot the associated normal curve, and for that we'll need to know the standard error.

It's important to say that the CLT is usually applied to the mean, but actually, as we'll see, we can apply it to other parameters as well (median, variance, standard deviation). This is relevant because, for example, sometimes the median tells us much more than the mean.

What is the Standard Error? It's the standard deviation of the sampling distribution. It decreases as n (sample size) increases. Each parameter has a different standard error formula associated with it. Next we show the SE expressions for the mean, the median and the standard deviation.

We can check this using the previous function. We'll compute the standard deviation manually and compare it to the formula value; they should be similar.

As we can see, results are similar:

```
{
"binomialdist_example (mean, n=200)": {
"computed_se": 0.0504901583850754,
"formula_se": 0.05815619602703747
},
"normaldist_example (median, n=200)": {
"computed_se": 0.004247561325654817,
"formula_se": 0.004343732383749954
},
"uniformdist_example (std, n=200)": {
"computed_se": 0.013055146058296654,
"formula_se": 0.02204320895687334
}
}
```
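A standalone version of this kind of check can be written with the standard library only. The sketch below is not the article's original function: it covers just the mean of a uniform(0, 1) population, with an arbitrary seed and arbitrary values of `n` and `m`, and compares the standard deviation of the sample means with the formula \(\sigma / \sqrt{n}\).

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 200   # sample size
m = 2000  # number of samples

# Draw m samples of size n from a uniform(0, 1) population and keep each mean
sample_means = [
    statistics.fmean([random.random() for _ in range(n)]) for _ in range(m)
]

computed_se = statistics.stdev(sample_means)
sigma = math.sqrt(1 / 12)          # standard deviation of uniform(0, 1)
formula_se = sigma / math.sqrt(n)  # standard error of the mean

print({"computed_se": computed_se, "formula_se": formula_se})
```

Increasing `n` makes both numbers smaller, which is the narrowing of the sampling distribution described by the CLT.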

Now we are going to build a function to plot histograms and distribution curves as n (sample size) increases:

As we can see, as the CLT states, for any of these distributions, if we take m samples of size n and compute some parameter on them (for example mean, median or standard deviation), the distribution of those m parameters approaches a normal distribution, and its variance decreases as n increases (the distribution curve narrows).

Now that we have a notion of what the CLT is about, we can apply this knowledge to understand the t-test. The t-test is normally used for the following cases:

We want to check, based on a sample, whether the population mean is equal to or different from a hypothesized value 𝜇. Here we are directly using the CLT with this t statistic (we use the sample standard deviation s because we don't know the population variance): $$ t = \frac{\overline{x}-\mu}{{\frac{s}{\sqrt{n}}}} $$

We want to check whether, given 2 samples, their population means are equal or different. Assumptions:

**Homogeneity of Variance:**

Population variances are assumed to be equal. We can get an intuition for this assumption because the t-test is actually using the CLT to compare 2 means, so we have to apply the proper standard error. The standard error depends on the sample size and the population variance. Different sample sizes and variances will lead to different standard errors.

We can find an adjusted standard error if the sample sizes are different, but for different variances it's better to apply a whole different test (Welch's test). These are the cases and their corresponding t-values with their proper standard errors:

**Sample independence:**

It means that there are 2 different groups, not the same group measured twice. If the samples are paired (dependent), the t-statistic is very similar to the one-sample test, but our variable is the difference between samples.

$$ t = \frac{\overline{d}-\mu_d}{{\frac{s_d}{\sqrt{n}}}} $$

Note: In all t-tests (1 or 2 sample test) we assume the population follows a normal distribution, but as we have seen, the CLT states that as n increases, the sample mean (or other parameters) will follow a normal distribution.

Next we are going to do some examples for each test mentioned above, and we will also check that the t-statistics are correct plotting the histogram and the distribution curve (as done before with CLT).

We have the potato yield from 12 different farms. We know that the standard potato yield for the given variety is 𝜇=20. x = [21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5] Test if the potato yield from these farms is significantly better than the standard yield.

We found a p-value of about 42%: if the true mean were 𝜇=20, a sample mean at least as large as ours would appear about 42% of the time. So we can't reject H0; we can't conclude that the yield of these farms is better than the standard yield.
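The t statistic for this example can be reproduced with the standard library. This is a sketch of the calculation only: the critical value is an assumption copied from a standard t-table (one-sided, 5%, 11 degrees of freedom) rather than computed, and the p-value itself would need a statistics library.

```python
import math
import statistics

x = [21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5]
mu = 20

n = len(x)
x_bar = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation (n - 1 in the denominator)
t = (x_bar - mu) / (s / math.sqrt(n))

# One-sided critical value for df = 11 at 5%, taken from a standard t-table
t_critical = 1.796
print(round(t, 3), t > t_critical)
```

The statistic is far below the critical value, which agrees with not rejecting H0.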

One sample is extracted from a normally distributed population. The sample mean is x̄ = 50 and the standard deviation is 𝑠=5. There are 30 observations. Consider the following hypotheses:

H0: 𝜇=48

H1: 𝜇≠48

With a significance level of 5%, can we reject H0?

A 5% significance level means 2.5% per tail, as we can see in the following picture:

We can reject H0 at the 5% significance level, meaning that it is very likely that 𝜇≠48.
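The numbers behind this conclusion can be checked with a short Python sketch. The critical value is an assumption taken from a standard t-table (two-sided, 5%, 29 degrees of freedom), not computed here.

```python
import math

x_bar, mu, s, n = 50, 48, 5, 30

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
t = (x_bar - mu) / (s / math.sqrt(n))

# Two-sided critical value for df = 29 at 5% (2.5% per tail), from a t-table
t_critical = 2.045
print(round(t, 3), abs(t) > t_critical)
```

Since the absolute value of t is larger than the critical value, the statistic falls in the rejection region.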

As said before, we want to check whether, given 2 samples, their population means are equal or different. Here we're going to check that the following t-statistic is correct by plotting histograms for the general case: equal or unequal sample sizes, similar variances. After that we'll do an example exercise for each case.

In order to do this check we'll tweak the plot_histograms_sample_parameter function, generating m samples of the difference between their sample means (with different n sizes). Rather than describing it with words, it's easier to understand by reading the code:

As we can see, the blue normal curves fit well in the histograms, it means that our t-statistic is correct.

We can measure a person's fitness by measuring body fat percentage. The normal range for men is 15–20%, and the normal range for women is 20–25% body fat. We have sample data from a group of men and a group of women. The following dictionary shows the data.

```
example3_data = {
"men": [13.3, 6.0, 20.0, 8.0, 14.0, 19.0,
18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0],
"women": [22.0, 16.0, 21.7, 21.0, 30.0,
26.0, 12.0, 23.2, 28.0, 23.0]
}
```

Using a t-test, we want to know if there is a significant difference between the population means of the men and women groups. These will be our hypotheses.

H0: 𝜇_men = 𝜇_women

H1: 𝜇_men ≠ 𝜇_women

We can reject H0 at the 5% significance level, meaning that it is very likely that 𝜇_men ≠ 𝜇_women.
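The two-sample t statistic for this data can be reproduced with the standard library, assuming equal population variances (pooled variance). This is a sketch of the calculation, not the notebook's code; the critical value is an assumption taken from a standard t-table (two-sided, 5%, 21 degrees of freedom).

```python
import math
import statistics

men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0,
       18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0,
         26.0, 12.0, 23.2, 28.0, 23.0]

n1, n2 = len(men), len(women)
var1, var2 = statistics.variance(men), statistics.variance(women)

# Pooled variance, assuming both populations share the same variance
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (statistics.mean(men) - statistics.mean(women)) / standard_error

# Two-sided critical value for df = n1 + n2 - 2 = 21 at 5%, from a t-table
t_critical = 2.080
print(round(t, 2), abs(t) > t_critical)
```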

A study was conducted to investigate the effectiveness of hypnotism in reducing pain. Results are shown in the following dataframe. The "before" value is matched to an "after" value. Are the sensory measurements, on average, lower after hypnotism? Test at the 5% significance level.

As we see, we can reject H0 with a p-value of 0.009478. So, based on our data, it's very likely that hypnotism is reducing pain.

I hope this article helped you understand key concepts of inferential statistics, such as the Central Limit Theorem and how t-tests work, and gain confidence when applying them. Here you can find the full Jupyter Notebook used for writing this story.

[1] Ahn S., Fessler, J. (2003). Standard Errors of Mean, Variance, and Standard Deviation Estimators. The University of Michigan.

[2] machinelearningplus.com. One Sample T Test Clearly Explained with Examples | ML+. (2020, October 8). https://www.machinelearningplus.com/statistics/one-sample-t-test/

[3] bookdown.org. Practice 13 Conducting t-tests for Matched or Paired Samples in R. Retrieved April 9, 2022 from https://bookdown.org/logan_kelly/r_practice/p13.html

[4] jmp.com. The Two-Sample t-test. Retrieved April 9, 2022 from https://www.jmp.com/en_ch/statistics-knowledge-portal/t-test/two-sample-t-test.html

\begin{equation} k_a: \begin{cases} x^2+ \left( y+\frac{r}{2} \right)^2 = r^2 \newline z=0 \end{cases} \end{equation}

\begin{equation} k_b: \begin{cases} \left( y-\frac{r}{2} \right)^2 + z^2 = r^2 \newline x=0 \end{cases} \end{equation}

In this article we'll parameterize this beautiful surface, and show that its surface area is the same as that of a sphere of radius \(r\) (\(4 \ \pi \ r^2\)), along with some other properties.

As mentioned above, the oloid is a ruled surface, and it's formed by the segments AB, where A belongs to \(k_a\) and B to \(k_b\), respectively, along both circles.

\[ A = \left(\begin{array}{ccc}r\,\sin\left(\alpha\right), & -\dfrac{r}{2}-r\,\cos\left(\alpha\right), & 0 \end{array}\right) \]

\[ \beta = \pi/2 - \alpha \]

\[ \sin(\beta) = \sin(\pi/2 - \alpha) = \cos(\alpha) \]

\[ |\overrightarrow{\rm TM_A}|\, \sin(\beta) = r \implies |\overrightarrow{\rm TM_A}| = \left| \dfrac{r}{\cos(\alpha)}\right | \]

\[ T = \left(\begin{array}{ccc} 0, & -\dfrac{r}{2}-\dfrac{r}{\cos\left(\alpha\right)}, & 0 \end{array}\right) \]

\[ |\overrightarrow{\rm TM_B}|^2 = |\overrightarrow{\rm TB}|^2 + r^2 \]

\[ |\overrightarrow{\rm TM_B}|^2 = \left( \dfrac{r}{2}+\dfrac{r}{\cos\left(\alpha\right)}+\dfrac{r}{2} \right)^2 = \left( \dfrac{r + r\ \cos(\alpha)}{\cos\left(\alpha\right)} \right)^2 \]

\[ \cos(\gamma) = \dfrac{-r}{|\overrightarrow{\rm TM_B}|} = \dfrac{-\cos\left(\alpha\right)}{1 + \cos(\alpha)} \]

\[ B_y = \dfrac{r}{2}+r\ \cos(\gamma) = \dfrac{r}{2} - \dfrac{r\ \cos\left(\alpha\right)}{1 + \cos(\alpha)} \]

\[ B_z = r\ \sin(\gamma) \]

\[ \sin(\gamma)^2 = 1 - \cos(\gamma)^2 = 1 - \left( \dfrac{\cos\left(\alpha\right)}{1 + \cos(\alpha)} \right)^2 = \left( \dfrac{2\ \cos(\alpha) + 1}{(\cos(\alpha) + 1)^2} \right) \]

\[ B = \left(\begin{array}{ccc} 0, & \dfrac{r}{2} - \dfrac{r\ \cos\left(\alpha\right)}{1 + \cos(\alpha)}, & \dfrac{ \pm\ r\,\sqrt{2\,\cos\left(\alpha \right)+1}}{\cos\left(\alpha \right)+1} \end{array}\right) \]

The square root in the z coordinate of B creates the following restriction: \[ 2\ \cos(\alpha) + 1 \geq 0 \implies -\dfrac{2 \pi}{3} \leq \alpha \leq \dfrac{2 \pi}{3} \]

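The denominator \(1 + \cos(\alpha)\) in the coordinates of B vanishes only at \(\alpha = \pm\pi\), which already lies outside this interval. We still exclude the endpoints, because the derivative of B used below has \(\sqrt{2\,\cos(\alpha)+1}\) in its denominator, so the domain of \( \alpha \) becomes: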

\[ -\dfrac{2 \pi}{3} < \alpha < \dfrac{2 \pi}{3} \]

The oloid is the ruled surface swept by the segments AB, so each of its points can be written as follows, with \(v\) between 0 and 1:

\[ A + v\ \overrightarrow{\rm AB} \]

\[ \overrightarrow{\rm AB} = \left(\begin{array}{ccc} -r\,\sin\left(\alpha \right), & \dfrac{r}{2}+r\,\cos\left(\alpha \right)-\dfrac{r\,\left(\cos\left(\alpha \right)-1\right)}{2\,\left(\cos\left(\alpha \right)+1\right)}, & \dfrac{\pm\ r\,\sqrt{2\,\cos\left(\alpha \right)+1}}{\cos\left(\alpha \right)+1} \end{array}\right) \]

\[ \overrightarrow{\rm AB} = \left(\begin{array}{ccc} -r\,\sin\left(\alpha \right), & \dfrac{r\,\left({\cos\left(\alpha \right)}^2+\cos\left(\alpha \right)+1\right)}{\cos\left(\alpha \right)+1}, & \dfrac{ \pm\ r\,\sqrt{2\,\cos\left(\alpha \right)+1}}{\cos\left(\alpha \right)+1} \end{array}\right) \]

\[ A + v\ \overrightarrow{\rm AB} = \left(\begin{array}{ccc} -r\,\sin\left(\alpha \right)\,\left(v-1\right), & \dfrac{r\,\left(2\,v-3\,\cos\left(\alpha \right)-2\,{\cos\left(\alpha \right)}^2+2\,v\,\cos\left(\alpha \right)+2\,v\,{\cos\left(\alpha \right)}^2-1\right)}{2\,\left(\cos\left(\alpha \right)+1\right)}, & \dfrac{\pm\ r\,v\,\sqrt{2\,\cos\left(\alpha \right)+1}}{\cos\left(\alpha \right)+1} \end{array}\right) \]

\[ 0 \leq v \leq 1,\ -\dfrac{2 \pi}{3} < \alpha < \dfrac{2 \pi}{3} \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\ \sin(\alpha)^2 + \dfrac{r^2\,\left({\cos\left(\alpha \right)}^2+\cos\left(\alpha \right)+1\right)^2}{(\cos\left(\alpha \right)+1)^2} + \dfrac{r^2\,(2\,\cos\left(\alpha \right)+1)}{(\cos\left(\alpha \right)+1)^2} \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\,\left(1 -{\cos\left(\alpha \right)}^2+\frac{{\left({\cos\left(\alpha \right)}^2+\cos\left(\alpha \right)+1\right)}^2}{{\left(\cos\left(\alpha \right)+1\right)}^2} + \frac{2\,\cos\left(\alpha \right)+1}{{\left(\cos\left(\alpha \right)+1\right)}^2}\right) \]

\[ t = \cos(\alpha) \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\,\left(1-t^2+\frac{{\left(t^2+t+1\right)}^2}{{\left(t+1\right)}^2}+\frac{2\,t+1}{{\left(t+1\right)}^2}\right) \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\,\left( \dfrac{(1-t^2)(t+1)^2+(t^2+t+1)^2+(2t+1)}{(t+1)^2} \right) \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\,\left( \dfrac{3t^2 + 6t + 3}{(t+1)^2} \right) \]

\[ |\overrightarrow{\rm AB}|^2 = r^2\,\left( \dfrac{3\ (t+1)^2}{(t+1)^2} \right) \]

\[ |\overrightarrow{\rm AB}|^2 = 3r^2 \]

\[ |\overrightarrow{\rm AB}| = \sqrt3 \ r \]
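The constant segment length is easy to sanity-check numerically. The following is only a small sketch in Python (not part of the original derivation): the function names `point_a`, `point_b` and `segment_length` are made up for illustration, and the coordinates are copied from the expressions for A and B given earlier, taking the positive sign of the square root.

```python
import math

def point_a(alpha, r=1.0):
    """Point A on circle k_a (plane z = 0), as parameterized in the text."""
    return (r * math.sin(alpha), -r / 2 - r * math.cos(alpha), 0.0)

def point_b(alpha, r=1.0):
    """Point B on circle k_b (plane x = 0), positive z branch."""
    c = math.cos(alpha)
    # max(...) guards against tiny negative round-off near alpha = +-2*pi/3
    root = math.sqrt(max(2 * c + 1, 0.0))
    return (0.0, r / 2 - r * c / (1 + c), r * root / (1 + c))

def segment_length(alpha, r=1.0):
    """Euclidean length of the generating segment AB for a given alpha."""
    return math.dist(point_a(alpha, r), point_b(alpha, r))

# Compare against sqrt(3) * r for a few angles inside (-2*pi/3, 2*pi/3)
for alpha in (-2.0, -1.0, 0.0, 0.5, 1.5, 2.0):
    print(alpha, segment_length(alpha), math.sqrt(3))
```

Every printed length should agree with \(\sqrt3\) up to floating-point error, whatever value of \(\alpha\) in the domain is chosen.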

Because it is a ruled surface, its area can be computed with the following formula (see the publication by J. B. Reynolds):
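Written in the notation of this article, the formula takes the form below. This is a reconstruction inferred from the steps that follow, not a quotation from Reynolds, so treat the exact form as an assumption:

\[ \text{Area} = \int_{-2\pi/3}^{2\pi/3} \int_{0}^{1} \left| \overrightarrow{\rm AB} \times \left((1-v)\ \frac{d \ \overrightarrow{OB}}{d \ \alpha} + v\ \frac{d \ \overrightarrow{OA}}{d \ \alpha} \right) \right| \,dv \,d\alpha \]

The two derivatives it needs are: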

\[ \frac{d \ \overrightarrow{OB}}{d \ \alpha} = \left(\begin{array}{ccc} 0, & \dfrac{r\,\sin\left(\alpha \right)}{{\left(\cos\left(\alpha \right)+1\right)}^2}, & \dfrac{\pm\ r\,\sin\left(2\,\alpha \right)}{2\,{\left(\cos\left(\alpha \right)+1\right)}^2\,\sqrt{2\,\cos\left(\alpha \right)+1}} \end{array}\right) \]

\[ \frac{d \ \overrightarrow{OA}}{d \ \alpha} = \left(\begin{array}{ccc} r\,\cos\left(\alpha \right), & r\,\sin\left(\alpha \right), & 0 \end{array}\right) \]

Since we'll continue with the positive value of the third coordinate of the derivative of OB, we are computing the oloid's top surface. To get the total surface area, we'll have to multiply the result of the integral by 2.

\[ (1-v)\ \frac{d \ \overrightarrow{OB}}{d \ \alpha} + v\ \frac{d \ \overrightarrow{OA}}{d \ \alpha} = \left(\begin{array}{ccc} r\,v\,\cos\left(\alpha \right), & \dfrac{r\,\sin\left(\alpha \right)\,\left(v\,{\cos\left(\alpha \right)}^2+2\,v\,\cos\left(\alpha \right)+1\right)}{{\left(\cos\left(\alpha \right)+1\right)}^2}, & -\dfrac{r\,\sin\left(2\,\alpha \right)\,\left(v-1\right)}{2\,{\left(\cos\left(\alpha \right)+1\right)}^2\,\sqrt{2\,\cos\left(\alpha \right)+1}} \end{array}\right) \]

\[ \overrightarrow{\rm AB} \times \left((1-v)\ \frac{d \ \overrightarrow{OB}}{d \ \alpha} + v\ \frac{d \ \overrightarrow{OA}}{d \ \alpha} \right) = \] \[ = \left(\begin{array}{ccc} -\dfrac{r^2\,\sin\left(\alpha \right)\,\left(3\,v\,\cos\left(\alpha \right)-\cos\left(\alpha \right)+1\right)}{\left(\cos\left(\alpha \right)+1\right)\,\sqrt{2\,\cos\left(\alpha \right)+1}}, & \dfrac{r^2\,\cos\left(\alpha \right)\,\left(3\,v\,\cos\left(\alpha \right)-\cos\left(\alpha \right)+1\right)}{\left(\cos\left(\alpha \right)+1\right)\,\sqrt{2\,\cos\left(\alpha \right)+1}}, & -\dfrac{r^2\,\left(3\,v\,\cos\left(\alpha \right)-\cos\left(\alpha \right)+1\right)}{\cos\left(\alpha \right)+1} \end{array}\right) \]

\[ \left| \overrightarrow{\rm AB} \times \left((1-v)\ \frac{d \ \overrightarrow{OB}}{d \ \alpha} + v\ \frac{d \ \overrightarrow{OA}}{d \ \alpha} \right) \right|^2 = \frac{2\,r^4\,{\left(3\,v\,\cos\left(\alpha \right)-\cos\left(\alpha \right)+1\right)}^2}{2\,{\cos\left(\alpha \right)}^2+3\,\cos\left(\alpha \right)+1} \]
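The next step is to take the square root and integrate over \(v\). The factor \(3\,v\,\cos(\alpha)-\cos(\alpha)+1\) is linear in \(v\) and equals \(1-\cos(\alpha)\) at \(v=0\) and \(2\,\cos(\alpha)+1\) at \(v=1\), both non-negative on the domain of \(\alpha\), so the absolute value can be dropped. A short worked step, using only the quantities above:

\[ \int_{0}^{1} \left(3\,v\,\cos(\alpha)-\cos(\alpha)+1\right) dv = \frac{3}{2}\cos(\alpha) - \cos(\alpha) + 1 = \frac{1}{2}\cos(\alpha) + 1 \]

The square root contributes the prefactor \(\sqrt{2}\ r^2\), and the extra factor 2 accounts for the top and bottom halves of the surface.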

\[ A =2 \ \sqrt{2} \ r^2 \int_{-2\pi/3}^{2\pi/3} \frac{\frac{1}{2} \cos{\alpha} + 1}{\sqrt{2\,{\cos\left(\alpha \right)}^2+3\,\cos\left(\alpha \right)+1}} \,d\alpha\ \]

\[ A =\left. 2 \ \sqrt{2} \ r^2 \ \dfrac{\cos\left(\dfrac{\alpha}{2}\right) \sqrt{2\ \cos(\alpha) + 1} \left( \sin^{-1}\left( \dfrac{2 \sin\left( \dfrac{\alpha}{2}\right)}{\sqrt{3}}\right) + \tan^{-1} \left( \dfrac{\sin\left(\dfrac{\alpha}{2} \right)}{\sqrt{2\ \cos{\alpha} + 1} } \right)\right)}{\sqrt{2\,{\cos\left(\alpha \right)}^2+3\,\cos\left(\alpha \right)+1}} \right|_{-2\pi/3}^{2\pi/3} \]

\[ A =\left. 2 \ \sqrt{2} \ r^2 \ \dfrac{\cos\left(\dfrac{\alpha}{2}\right) \sqrt{2\ \cos(\alpha) + 1} \left( \sin^{-1}\left( \dfrac{2 \sin\left( \dfrac{\alpha}{2}\right)}{\sqrt{3}}\right) + \tan^{-1} \left( \dfrac{\sin\left(\dfrac{\alpha}{2} \right)}{\sqrt{2\ \cos{\alpha} + 1} } \right)\right)}{\sqrt{(2\ \cos(\alpha) + 1)(\cos(\alpha) + 1)}} \right|_{-2\pi/3}^{2\pi/3} \]

\[ A =\left. 2 \ \sqrt{2} \ r^2 \ \dfrac{\cos\left(\dfrac{\alpha}{2}\right) \left( \sin^{-1}\left( \dfrac{2 \sin\left( \dfrac{\alpha}{2}\right)}{\sqrt{3}}\right) + \tan^{-1} \left( \dfrac{\sin\left(\dfrac{\alpha}{2} \right)}{\sqrt{2\ \cos{\alpha} + 1} } \right)\right)}{\sqrt{\cos(\alpha) + 1}} \right|_{-2\pi/3}^{2\pi/3} \]

\[ A =2 \ \sqrt{2} \ r^2 \left( \dfrac{\sqrt{2} \pi}{2} + \dfrac{\sqrt{2} \pi}{2} \right) = 2 \ \sqrt{2} \ r^2 \left( \sqrt{2}\ \pi \right) = 4\ \pi \ r^2 \]

This is the same as the surface area of a sphere of radius \(r\).

You can find a parametrized oloid in this GeoGebra link.
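The closed-form result can also be cross-checked by integrating numerically. The sketch below is an illustration only, under stated assumptions: the function name `oloid_area` and the substitution \(\alpha = \frac{2\pi}{3}\sin(\theta)\) are my own choices (the substitution removes the integrable singularity of the integrand at \(\alpha = \pm 2\pi/3\)), and a plain midpoint rule is used instead of a library integrator.

```python
import math

def oloid_area(r=1.0, n=2000):
    """Midpoint-rule estimate of the oloid area integral from the text."""
    a_max = 2 * math.pi / 3
    h = math.pi / n                      # step in theta over [-pi/2, pi/2]
    total = 0.0
    for i in range(n):
        theta = -math.pi / 2 + (i + 0.5) * h
        alpha = a_max * math.sin(theta)  # substitution alpha = a_max * sin(theta)
        c = math.cos(alpha)
        integrand = (0.5 * c + 1.0) / math.sqrt(2 * c * c + 3 * c + 1)
        # d(alpha) = a_max * cos(theta) * d(theta)
        total += integrand * a_max * math.cos(theta) * h
    return 2 * math.sqrt(2) * r * r * total

print(oloid_area(1.0), 4 * math.pi)
```

The two printed numbers should be very close, which supports the value \(4\ \pi\ r^2\) obtained above.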

]]>On the other hand, being able to make API calls and process the response provides a new world of endless possibilities. Nowadays many companies give access to their data via certain endpoints.

Why not put these two tools together? In this article we'll explain how to do it.

There are 2 main ways in Excel to do it:

- Via Visual Basic script
- Via making a query from the data menu

The first step is to enable the Developer tab. This can be done in File > Options > Customize Ribbon:

Once this is done we have to open the VBA editor.

In order to process the JSON response of the API call, we need to add the JsonConverter module, which can be downloaded from https://github.com/VBA-tools/VBA-JSON/releases. Then import JsonConverter.bas into the project: in the VBA editor, go to File > Import File.

We also need to add two references to the project from the Tools > References menu:

- Microsoft XML, v6.0
- Microsoft Scripting Runtime

Next we have to create a new module to hold the code that makes the API call. Here I present two examples:

- Get the users from https://jsonplaceholder.typicode.com/

- Get the people from the Star Wars API (https://swapi.dev/).
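In both examples the macro does the same two things: it requests the URL, then turns the JSON response into worksheet rows. As a rough, language-neutral illustration of that second step, here is a sketch in Python rather than VBA. The sample payload is invented (only the field names `id`, `name`, `email` and `address.city` mirror the users endpoint), and `users_to_rows` is a hypothetical helper name.

```python
import json

# Invented sample shaped like a "users" response; the values are made up.
SAMPLE_RESPONSE = """
[
  {"id": 1, "name": "Ada Example", "email": "ada@example.com",
   "address": {"city": "Springfield"}},
  {"id": 2, "name": "Bob Example", "email": "bob@example.com",
   "address": {"city": "Shelbyville"}}
]
"""

def users_to_rows(response_text):
    """Parse a JSON array of users into a header row plus one row per user."""
    users = json.loads(response_text)
    rows = [["id", "name", "email", "city"]]
    for user in users:
        # Nested objects (address) are flattened into plain columns
        rows.append([user["id"], user["name"], user["email"],
                     user["address"]["city"]])
    return rows

rows = users_to_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

In the workbook, each inner list corresponds to one row of cells, with the first list acting as the header row.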

If you want to save the Excel file, remember to use the .xlsm extension, which allows macros.

Excel 2016 has a built-in feature for making API calls. Earlier versions can do it too, after installing the Power Query add-in. To make an API call, go to the Data tab and click New Query > From Other Sources > From Web.

Then we click on Advanced. Here we enter the URL, and if credentials are needed, they can be added as a request header.

Hope it was helpful!

]]>