Virtual machine with Debian – la caja loca de uborZz

These are my notes on the steps I followed to virtualize a Debian system and set up a Python environment on it by hand.

Topics covered:

Downloading and installing VMware
Downloading Debian and installing it in a virtual machine
Installing the VMware Tools
Installing PyCharm
Notes on setting up a virtualenv and installing Python libraries by hand with pip
Summary of Python tools for data science

Downloading Debian

https://www.debian.org/CD/http-ftp/#stable

Download the standard version with no frills: the netinst image.

VMware

We use Workstation 12 Player. It is free for non-commercial use.

Create the machine from the downloaded Debian image:
Player > File > New Virtual Machine…

Give it more RAM: click Customize Hardware…

Installing Debian

Installing Debian in our local virtual machine.

Choose the location (useful for setting the time zone a few steps later), then the system language and the keyboard layout. *English with a Spanish keyboard in this case.
No root password (leaving it empty disables the root account and gives the first user sudo rights).
User name and password: whatever we like.
Time zone
Default partitioning options.

Continue through the following screens. After the short base installation, choose the package manager mirror; by default it offers a nearby one if the country was set correctly. No proxy.

When choosing the software, select with the space bar and press Enter to install everything selected.

The main installation then begins.
When asked about installing GRUB, choose the default location.

Everything finishes and the OS boots.

Installing the VMware Tools

They allow dragging files into and out of the VM window in VMware Player, using Ctrl+C and Ctrl+V between the host system and the virtual machine, automatic screen resizing when the window is maximized or minimized…

The link is for a different version, but the instructions are more or less the same:
http://www.debianadmin.com/install-vmware-tools-on-debian-wheezy.html

On clicking, the VMware Tools image is loaded into the machine's virtual drive.

If it does not appear, simply go (in VMware Workstation) to Player > Manage > Install VMware Tools… This mounts the disc with the tools.

Open a terminal and list the mounted file systems to check whether the disc is already mounted:
$ mount

In this case we run the following (the tarball name reflects the version current at the time):
$ sudo mount /dev/sr0 /mnt
$ tar -C /tmp -zxvf /mnt/VMwareTools-10.1.5-5055693.tar.gz
$ cd /tmp/vmware-tools-distrib
$ sudo ./vmware-install.pl
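
Since the tarball name changes with each VMware Tools version, it can help to look it up instead of typing it. The following is only a sketch under my own assumptions: the disc is mounted on /mnt as above, and find_tools_tarball is a hypothetical helper name, not something VMware provides. It prints the tar command rather than running it.

```python
import glob
import os


def find_tools_tarball(mount_point="/mnt"):
    """Return the path of the VMware Tools tarball under mount_point, or None."""
    pattern = os.path.join(mount_point, "VMwareTools-*.tar.gz")
    matches = sorted(glob.glob(pattern))
    # The Tools disc is expected to carry a single tarball; if several match,
    # take the last one in sorted order.
    return matches[-1] if matches else None


if __name__ == "__main__":
    tarball = find_tools_tarball()
    if tarball is None:
        print("No VMwareTools-*.tar.gz under /mnt; is the disc mounted?")
    else:
        # Print the command instead of running it, so nothing is extracted by accident.
        print("tar -C /tmp -zxvf " + tarball)
```

It avoids f-strings on purpose so that it also runs on the Python 3.4 that ships with this Debian release.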

Answer Yes/OK or accept the default for every question. After rebooting the virtual machine, the tools are installed and working.

Installing PyCharm

Download PyCharm Community:
https://www.jetbrains.com/pycharm/download/#section=linux

Open a terminal in the download folder and follow the installation guide:
https://confluence.jetbrains.com/display/PYH/Installing+PyCharm+on+Linux+according+to+FHS

In short:
$ sudo mv pycharm-community-2017.1.tar.gz /opt/
$ cd /opt/
$ sudo tar -xzvf /opt/pycharm-community-2017.1.tar.gz
$ sudo rm pycharm-community-2017.1.tar.gz

Create a link so it can be launched with the pycharm command:
$ sudo ln -s /opt/pycharm-community-2017.1/bin/pycharm.sh /usr/bin/pycharm

For virtual environments the ideal would be to use Anaconda and keep all the environments well organized under the same path; it also handles installing packages together with their dependencies very well. In any case, here it was done by hand. See the link for Anaconda installation notes!

The virtual environment can be created from PyCharm or directly from the command line ($ virtualenv Name). *PyCharm bundles the package manager (pip) and the environment manager (virtualenv) for the environments it creates. If the virtualenv command is not found in the terminal, it probably needs to be installed separately.
In our case we do it directly from PyCharm, as shown in the following image. We pick Python 3.4.2, which is the latest version in the Debian repo and is already installed.

Virtualenv usage summary

Activate:
$ source /home/marcial/bigdata/venv/bin/activate
* On Windows simply: > D:\VenvDirectory\Scripts\activate

Deactivate:
$ deactivate

Once a virtualenv is active, pip installs packages into that environment. Packages and libraries can also be installed from the PyCharm settings, but that does not work as well.
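
Before installing anything with pip, it is worth confirming that the interpreter in use really belongs to the virtualenv. A minimal sketch, assuming it is run with the python of the activated environment; in_virtualenv is a name I made up for illustration:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Classic virtualenv sets sys.real_prefix; the stdlib venv module (and newer
    # virtualenv releases) leave sys.base_prefix pointing at the base install.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("environment prefix:", sys.prefix)
    print("inside a virtualenv:", in_virtualenv())
```

If the last line prints False after activating, pip would install into the system Python instead of the environment.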

Environment for big data

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

From the tutorials at the link above, a summary of the packages that may be needed (copy-paste):

NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python’s usage in the data science community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similar to the standard python library urllib2 but is much easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more convenient.
os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single webpage in a run.

We install them all in the environment (after selecting the virtualenv):

$ source /home/marcial/bigdata/venv/bin/activate
$ pip install numpy
$ pip install scipy
$ pip install scikit-learn
$ pip install beautifulsoup4

*Note: re and os should be available by default, since they are part of the Python standard library.
…
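
To check what actually landed in the environment, a small script can try to locate each module without importing it. This is only a sketch: the module list is my own pick from the packages above, and check_modules is a hypothetical helper name. Note that the import name can differ from the pip name.

```python
import importlib.util

# Import names, which can differ from the pip names:
# scikit-learn is imported as sklearn, beautifulsoup4 as bs4.
MODULES = ["numpy", "scipy", "sklearn", "bs4", "re", "os"]


def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in sorted(check_modules(MODULES).items()):
        print("{:<10} {}".format(name, "ok" if found else "MISSING"))
```

Run it with the virtualenv activated; re and os should always show as ok, while the others depend on the pip installs having succeeded.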

NOTE!! Although this works fine, everything installs OK using Anaconda on both Linux and Windows, so if there is no reason not to, better go that way! I summarize the Windows environment elsewhere: see Link.