Getting started with MLFlow with MinIO [In 2022]

5 minute read

Published: September 09, 2022

For my (academic) research work, I had been keeping things simple: a CSV file to which a Python function appended a row for each machine learning (ML) experiment. This worked quite nicely for a single user (me). The obvious downside was that I ended up with nested folders for artifacts such as models, graphs and latent features, and opening the ever-growing CSV file became daunting. Therefore, I decided to try out MLFlow for my future experiments. MLFlow, which is open source under the Apache License 2.0, is an ML lifecycle platform that unifies various aspects of ML work, including experimentation (trying out different architectures, model parameters, data preprocessing etc.). Additionally, it also acts as a central model registry, which further supports reproducibility and deployment. For the trials, the obvious choice is to use Docker, as usual. Many great repositories already exist (like this and this one). For my use case, I’d like to use:

MLFlow’s friendly UI for visualization and monitoring
A MinIO bucket to store the models, which can easily be swapped for AWS S3 if I get the money
SQLite for parameters, metrics etc., since I only have single-user requirements

Therefore, I created yet another GitHub repository which includes MLFlow and MinIO along with the SQLite.

Steps to run the setup locally or on a remote server

One can set up the MLFlow server on a single machine using this approach. It can be either a local machine from which you run the ML experiments or a remote machine used only for tracking. In my case, it is a local machine.

Clone the repository: git clone https://github.com/ikespand/docker-mlflow-minio.git.
Optional step: you can change the user ID and password for MinIO by editing the .env file to override the default settings. Don’t be confused by AWS in the variable names; the naming comes from MinIO, which offers S3-like storage locally, and you are not using AWS unless you set it up. If you have an AWS S3 account, you can configure it in a similar way.
Start the containers for MinIO and MLFlow (with SQLite) with docker-compose up. On Windows, use PowerShell, as volume mounting can cause problems with Git Bash.
As a result, you should be able to open localhost:5000 for the MLFlow server and localhost:9001 for MinIO.
You can log in to MinIO with the credentials from the .env file: the username is AWS_ACCESS_KEY_ID and the password is AWS_SECRET_ACCESS_KEY. In the MinIO dashboard, you will see that the mlflow bucket has been created as a result of the docker-compose run.
At this point, you can observe that the mlflow_data and minio_data folders have been created in the repository. You can also configure the location of these folders by modifying docker-compose.yml. Your overall setup will then look as follows:

How it works?

In the docker-compose services, we first build the image for MLFlow, which is pretty simple. We use the official Python image and install a few dependencies. I explicitly pinned the versions of the pip packages because I had problems with newer versions of the libraries. We then use this image and define all the key environment variables that MLFlow expects. Their source remains the .env file.
Then, we move to the MinIO service, where again the official image is used and the environment variables are defined in the same way as above.
Finally, the last step, createbuckets, creates our first bucket for mlflow. This step uses the official minio/mc image, which allows us to perform basic bucket operations such as copy and list.
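Once the services described above are up, a quick reachability check from Python can save a trip to the browser. The following is only a minimal sketch: the ports are the ones mentioned in this post (5000 for the MLFlow UI, 9001 for the MinIO console), while the helper names `is_port_open` and `check_services` are my own and not part of the repository.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False


def check_services(services: dict, host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


# Ports as used in this post; adjust them if you changed docker-compose.yml
SERVICES = {"mlflow": 5000, "minio_console": 9001}
```

Calling `print(check_services(SERVICES))` after docker-compose up should report True for both entries; a False usually means the container is still starting or the port mapping was changed.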

Test

Once things are running and you can see the dashboards for both MinIO and MLFlow, you can proceed to test the setup. The first step is to configure the local machine on which we will run the ML experiments: set the environment variables so that MLFlow’s Python library can pick them up to communicate with the server. Open the bash_profile as shown below in Windows and copy-paste the credentials from the .env file there.
Now, open a new terminal and start experimenting. It is necessary to restart the terminal so that these environment variables are recognized by the session.
Now, we’re all set to try it out. The repository contains a script called test_setup_with_scikitlearn.py, which runs a scikit-learn based machine learning task to classify the MNIST dataset. You will observe that we have configured MLFlow with the following lines.
```
URI = r"http://localhost:5000"
mlflow.set_tracking_uri(URI)
mlflow.set_experiment("MyMLTask")
```
In the above, we used the URI http://localhost:5000 because our MLFlow server is running on the same machine on which we want to try out the ML run. If you use a remote server exclusively for MLFlow, you need to replace the URI with the corresponding machine’s IP address; you might also need to allow communication on port 5000.
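As an alternative to editing the bash_profile, the credentials can also be set from within Python before the first MLFlow call. This is a hedged sketch and not what the repository script does: `parse_env_text` and `export_for_mlflow` are hypothetical helpers, the sample credentials are placeholders, and I am assuming that MinIO's S3 API is published on port 9000 (9001 being the console). `MLFLOW_S3_ENDPOINT_URL` is the variable the MLFlow client reads to find a non-AWS S3 endpoint.

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def export_for_mlflow(values: dict, s3_endpoint: str) -> None:
    """Expose the credentials to the MLFlow client in this process only."""
    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        os.environ[key] = values[key]
    # Assumption: the endpoint points at MinIO's S3 API, not at its console
    os.environ["MLFLOW_S3_ENDPOINT_URL"] = s3_endpoint


# Placeholder content; in practice, read the text of the repository's .env file
sample = """
# placeholder credentials, not the repository defaults
AWS_ACCESS_KEY_ID=example-user
AWS_SECRET_ACCESS_KEY="example-password"
"""
export_for_mlflow(parse_env_text(sample), "http://localhost:9000")
```

The variables set this way live only in the current Python process, which is convenient in notebooks but means every script has to repeat the step.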

Logging a parameter, a metric or a figure is pretty easy. The parameters and metrics go to SQLite, while images go to the MinIO bucket. E.g.

# Log parameters with which we want to experiment and record results
mlflow.log_param("gamma", gamma) 
mlflow.log_param("kernel", kernel)
mlflow.log_metric("mse", mse)
mlflow.log_metric("mae", mae)
# Log figure to visualize after the runs
mlflow.log_figure(fig, 'comparision.png')
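When many parameters and metrics are involved, the individual calls can be wrapped in a small helper. The sketch below is my own illustration, not code from the repository: `log_run` accepts any object that offers `log_param` and `log_metric` (the `mlflow` module itself does), and `RecordingTracker` is a hypothetical stand-in so that the helper can be tried without a running server. The parameter and metric values are made up.

```python
class RecordingTracker:
    """Stand-in that offers the same two calls as the mlflow module."""

    def __init__(self):
        self.params = {}
        self.metrics = {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = float(value)


def log_run(tracker, params: dict, metrics: dict) -> None:
    """Send every parameter and metric to the given tracker."""
    for key, value in params.items():
        tracker.log_param(key, value)
    for key, value in metrics.items():
        tracker.log_metric(key, value)


# Illustrative values only
tracker = RecordingTracker()
log_run(tracker, {"gamma": 0.001, "kernel": "rbf"}, {"mse": 0.12, "mae": 0.08})
```

With a server running, the same helper can be called as `log_run(mlflow, params, metrics)` inside a `with mlflow.start_run():` block.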

Finally, we can save the final deployable model as an artifact, which will also go to the MinIO bucket. E.g.

# Save the model (requires: from urllib.parse import urlparse)
tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
model_signature = mlflow.models.signature.infer_signature(X_train, y_train)
run_id = run.info.run_id  # run_uuid is a deprecated alias of run_id
experiment_id = run.info.experiment_id
if tracking_url_type_store != "file":
  mlflow.sklearn.log_model(clf, "clf")
else:
  mlflow.sklearn.log_model(clf, "clf", signature=model_signature)
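To fetch the logged model back later, MLFlow accepts URIs of the form runs:/<run_id>/<artifact_path>. A minimal sketch of building such a URI follows; `build_model_uri` is a hypothetical helper of mine, and the default artifact path "clf" simply mirrors the name used in the snippet above.

```python
def build_model_uri(run_id: str, artifact_path: str = "clf") -> str:
    """Build a runs:/ URI that points at a model logged within a run."""
    if not run_id:
        raise ValueError("run_id must not be empty")
    # Strip stray slashes so that the URI has exactly one separator
    return f"runs:/{run_id}/{artifact_path.strip('/')}"
```

With the tracking URI and the MinIO credentials configured as before, `mlflow.sklearn.load_model(build_model_uri(run_id))` should then download the artifact from the bucket and return the fitted estimator.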

After running the script, you can see the new experiment in the dashboard and browse through its parameters and artifacts, as follows:

Next steps?

Setting up complete pipeline for parameter tuning and logging.
Model deployment directly from MLFlow.
If the use case grows, maybe try out a more production-ready setup like this.

Reach out to me on Instagram for a faster reply!


Setting up an Overpass API server with Docker

3 minute read

Published: April 13, 2023

You might have already seen my blog post on OpenStreetMap from 2020. In that post, I briefly talked about the Overpass API server, with a pointer to a GitHub repo for setting it up locally. However, that setup might fail if we try to build it for the entire planet. Additionally, the source repository hasn’t been updated for 5 years. I’ve received some requests for help with setting up an Overpass server. Therefore, this short post illustrates the process of setting up a local Overpass server.

Starting your own Overpass API server

Currently, the process is very simple and straightforward, thanks to this docker image.

Building the server from raw OSM files

This approach is beneficial when our region of interest is small (compared to the entire world).

docker run -e OVERPASS_META=yes -e OVERPASS_MODE=init -e OVERPASS_PLANET_URL=http://download.geofabrik.de/europe/monaco-latest.osm.bz2 -e OVERPASS_DIFF_URL=http://download.openstreetmap.fr/replication/europe/monaco/minute/ -e OVERPASS_RULES_LOAD=10 -v /overpass_db/:/db/overpass_clone_db -p 8888:80 -it --name overpass_monaco wiktorn/overpass-api

This usually takes about 5 minutes on a normal computer, including downloading the image from Docker Hub, downloading the OSM file from Geofabrik and building the database for a 700 KB file. All the generated builds can be used with the server. At the end of this build, the docker container stops and needs to be started again. To do so, either assign a name to the container in the previous step or look up the auto-generated name with the docker ps --all command. After grabbing the name, simply start the container with docker start <CONTAINER NAME>. Alternatively, pass the -e OVERPASS_STOP_AFTER_INIT=false option so that the instance keeps running after flushing the database.

With that, we can query for a pizza shop.
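The original query is not reproduced here, so the following is only a sketch of how such a request could look from Python. The endpoint is the local server started above; the bounding box for Monaco is a rough assumption of mine, and `build_bbox_query` and `run_query` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Local server from the docker run command above (host port 8888)
OVERPASS_URL = "http://localhost:8888/api/interpreter"


def build_bbox_query(key: str, value: str, bbox: tuple) -> str:
    """Overpass QL for nodes tagged key=value inside (south, west, north, east)."""
    south, west, north, east = bbox
    return f'[out:json];node["{key}"="{value}"]({south},{west},{north},{east});out;'


def run_query(query: str, url: str = OVERPASS_URL, timeout: float = 30) -> dict:
    """POST the query to the Overpass server and decode the JSON answer."""
    data = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return json.load(resp)


# Rough bounding box around Monaco (assumption, not taken from the post)
MONACO_BBOX = (43.72, 7.40, 43.76, 7.45)
pizza_query = build_bbox_query("cuisine", "pizza", MONACO_BBOX)
```

With the container running, `run_query(pizza_query)["elements"]` should list the matching nodes together with their coordinates and tags.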

In the above, we didn’t use a custom region such as the one shown here, obtained from the OSM tool without tweaks. But the workaround should be simple: either provide the file as file:///, as mentioned here, or host the file with a local HTTP server.

Note: So far, I have not been able to circumvent an issue that occurs when I mount a local folder in the current directory: although the build is successful, the server returns an error at query time. This is specific to Windows.

Cloning for entire world

When we need to scale up to the entire world, cloning is a better option than building from the raw OSM file. In this case, we set OVERPASS_MODE to clone. The data will be cloned from the Overpass API server with the defined replication. You can check the available replication frequencies on the OSM wiki.

docker run -e OVERPASS_MODE=clone -e OVERPASS_DIFF_URL=https://planet.openstreetmap.org/replication/day/ -v /big/docker/overpass_clone_db/:/db -p 8888:80 -it --name overpass_world wiktorn/overpass-api

This process took approx. 2 hours with a good internet connection on a Linux machine, and it used approx. 204 GB of disk space. This number will certainly grow with more contributions. We can now query for nearby Indian restaurants from our POI.
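A "nearby" search like this maps onto the around filter of Overpass QL. The sketch below only builds the query text; the coordinates and the 1000 m radius are placeholder values of mine, and `build_around_query` is a hypothetical helper.

```python
def build_around_query(tags: dict, lat: float, lon: float, radius_m: int) -> str:
    """Overpass QL for nodes carrying all tags within radius_m metres of a point."""
    filters = "".join(f'["{key}"="{value}"]' for key, value in tags.items())
    return f"[out:json];node{filters}(around:{radius_m},{lat},{lon});out;"


# Placeholder point of interest and radius
indian_query = build_around_query(
    {"amenity": "restaurant", "cuisine": "indian"},
    lat=48.1374,
    lon=11.5755,
    radius_m=1000,
)
```

The resulting string can be sent to http://localhost:8888/api/interpreter as the data field of a POST request.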

Using in Python

I have a sample REST API built with Flask which finds the nearest toilets to the query point. Using the above server with it is easy: we just need to change the URL of the Overpass API server mentioned here to self.sever = 'http://localhost:8888/api/interpreter'.
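The core of such a nearest-amenity lookup is small. The following is a sketch under my own assumptions and not the code of the Flask API: it takes the elements list of an Overpass JSON response and picks the node closest to the query point using the haversine distance. The sample response is fabricated for illustration.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_node(elements: list, lat: float, lon: float):
    """Return the node element closest to (lat, lon), or None if there is none."""
    nodes = [e for e in elements if e.get("type") == "node"]
    if not nodes:
        return None
    return min(nodes, key=lambda e: haversine_km(lat, lon, e["lat"], e["lon"]))


# Fabricated response in the shape Overpass returns for [out:json]
sample_elements = [
    {"type": "node", "id": 1, "lat": 43.7400, "lon": 7.4300, "tags": {"amenity": "toilets"}},
    {"type": "node", "id": 2, "lat": 43.7310, "lon": 7.4200, "tags": {"amenity": "toilets"}},
]
closest = nearest_node(sample_elements, 43.7305, 7.4195)
```

For real data, pass the "elements" list from the server's JSON answer instead of sample_elements.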


Machine Learning Applied to Stock Market Prediction: A Comparison between LSTM and ESN

5 minute read

Published: October 03, 2022

I have been working with various neural networks for a while and have always found recurrent neural networks (RNNs) very special. Whenever there is a dependency on previous data points, these networks shine. Time-series problems fit especially well here, e.g. predicting weather patterns or the stock market. In the past, I compared various RNNs for such tasks, and my work concluded that ESNs work quite well compared to a simple deep neural network and to some RNNs like LSTM or GRU (paper-1, paper-2 and paper-3) (of course, this statement varies with the domain). Those works were mostly done with scientific data for turbulent flow in thermal plumes, which have a significant effect on weather (Wikipedia). Now, I was curious to see how these networks work for practical purposes.

Deep learning on HPC cluster with LSF queue

2 minute read

Published: September 28, 2022

This is a rather short piece of documentation on running TensorFlow jobs on a high performance computing (HPC) cluster that uses LSF (Load Sharing Facility). It can be extrapolated to other HPC systems with some tweaks. If you’re new to LSF and HPC jobs with DL, then this post nicely summarizes the jargon.

First impression with Firebase Realtime Database using Python [and also with Swift for iOS]

6 minute read

Published: May 14, 2022

[Updated on 05.08.2022]

Firebase is a Backend-as-a-Service (BaaS) app development platform backed by Google. It provides a variety of services like databases, cloud storage etc. The most prominent use case (at least for me) is having a realtime database in almost no time to enable quick prototyping. Recently, I’ve experimented with the Python and Swift (iOS) SDKs for Firebase, and this documentation summarizes the experience.

Firebase in Python

Adding streaming data to Firebase via Python

The idea of adding a continuous stream of data to Firebase is inspired by Internet of Things (IoT) sensors, where data keeps coming in and, based upon the data, we might want to trigger some action. There are a few posts already available, like this, this and this. As the simplest case, my target is to acquire some time-series data at a given interval and send it to Firebase continuously. As no external sensor is attached to my system, I decided to collect system information like RAM usage, CPU usage etc. at a given frequency and add it to our database.

Creating Firebase project

Creating a Firebase account and setting up a database is straightforward with a Google ID. Here, we will use the Firebase Realtime Database.

Go to the Firebase Console and click on Create a Project. Remember that each project is a separate container which can hold several databases as per our requirements, so it acts as a kind of namespace.
Set the desired name of the project and move forward.
Depending on the tracking and analysis use case, we can keep Google Analytics on and proceed. Depending on this choice, the next option will be the Google Analytics ID.
Now, our project is commissioned and we have our dashboard.
On the left sidebar, click on Build>Realtime Database. Then click on Create a Database and select the geographical location accordingly.
Select the default locked mode, which we can easily change afterwards.
Now, our realtime database is ready, and we can see the Data, Rules, Backups and Usage tabs. Within the Data tab, we can hover over the URL and add data by clicking +. However, we will do this via Python.
We need to change the access rules by clicking on the Rules tab and setting them to true. This will allow anyone to read/write data! For more details, see the official docs.
Then we need to register our application. To do so, go to the project overview dashboard and, from the various icons, select the one for the Web.
Register the app by defining its name. This generates a JS snippet; copy the firebaseConfig dictionary (i.e. the content inside the {}) and save it locally as credential.txt in the Python project folder.

Accessing via Python

After activating a virtual or conda environment, install the package with pip install pyrebase4. For me, the normal pyrebase was throwing an error, so this was the easiest solution I found at that time.
Copy the following script and paste it into a new Python file. Afterwards, try it out. Remember to have credential.txt in the same folder.

import pyrebase
import psutil
import platform
import time
import shutil
import datetime

def get_device_name():
    dn = platform.node()
    #dn = ("").join([i.replace('-','') for i in dn])
    return dn

def get_timestamp():
    ts = time.time()
    return datetime.datetime.fromtimestamp(ts).strftime('%d%m%YT%H%M%S')
    
def sys_info():
    disc_usage = shutil.disk_usage("/")                  
    info = {}
    # info["ts"] =  get_timestamp()
    info["ram_usage"] = psutil.virtual_memory()[2]
    info["cpu_usage"] = psutil.cpu_percent()
    info["disk_usage"] = disc_usage[1]/disc_usage[0]*100
    return info 

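def read_cred(filename:str)->dict:
    d = {}
    with open(filename) as f:
        for line in f:
            # Skip lines without a quoted value, e.g. braces or blank lines
            if ': "' not in line:
                continue
            (key, val) = line.split(': "', 1)
            # Strip indentation so that keys such as apiKey match exactly
            d[key.strip()] = val.split('"')[0]
    return d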

def initialize_firebase():
    config = read_cred(filename = "credential.txt")
    firebase = pyrebase.initialize_app(config)
    #_auth = firebase.auth()
    return firebase

def add_data_to_firebase(db):    
    data = sys_info()
    #resp = db.push(data)
    resp = db.child(get_device_name()).child(get_timestamp()).set(data)
    return resp


def print_all_data_in_db(db):
    # Take the database handle as an argument instead of relying on a global
    users = db.child().get()
    print(users.val())

# %%

if __name__ == "__main__":
    db = initialize_firebase().database()
    ctr = 0
    res = []
    while ctr < 3:
        print("Instance: ", ctr)
        res.append(add_data_to_firebase(db))
        time.sleep(2)
        ctr+=1
    print("Done!")

This script adds data in nested form, similar to sensor readings. Here, the key is the timestamp, and each timestamp contains 3 readings. The parent key is the name of our device.
A more sophisticated example can be found on my Github.
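For analysis, the nested device > timestamp > readings structure is often easier to handle as flat rows. Here is a minimal sketch of that conversion; `flatten_readings` is a hypothetical helper, the snapshot is fabricated in the shape the script above writes, and the timestamp format is the one used in get_timestamp.

```python
from datetime import datetime


def flatten_readings(snapshot: dict) -> list:
    """Turn {device: {timestamp: readings}} into a time-sorted list of row dicts."""
    rows = []
    for device, by_time in snapshot.items():
        for stamp, reading in by_time.items():
            # Same format string as in get_timestamp() of the script above
            when = datetime.strptime(stamp, "%d%m%YT%H%M%S")
            rows.append({"device": device, "time": when, **reading})
    return sorted(rows, key=lambda row: (row["device"], row["time"]))


# Fabricated snapshot with made-up readings
sample_snapshot = {
    "my-laptop": {
        "14052022T101504": {"ram_usage": 62.0, "cpu_usage": 9.5, "disk_usage": 71.2},
        "14052022T101502": {"ram_usage": 61.5, "cpu_usage": 12.0, "disk_usage": 71.2},
    }
}
rows = flatten_readings(sample_snapshot)
```

With the script above, the input would come from `db.get().val()`; the resulting rows can then be passed to a plotting or dataframe library of choice.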

Firebase in Swift

This part makes a similar attempt with the Firebase RT database using the Swift programming language, targeting iOS development. The steps are as similar and intuitive as above, but for the sake of completeness:

Create a project on Firebase and make it for iOS.
In the quick configuration steps, fill in the required fields and make sure to enter the correct Bundle ID for your iOS app.
Download the config file (GoogleService-Info.plist), then drag and drop it into Xcode’s file navigator as shown in the official SDK documentation.
Correspondingly, add the Firebase entries to the Podfile. My Podfile looks as follows. If you don’t have a Podfile, install CocoaPods and initialize one with pod init. With pod install, the dependencies will be installed. Remember, it’s a big download, so it will take some time. Also, I needed to close Xcode, otherwise I kept getting errors.

# Uncomment the next line to define a global platform for your project
 platform :ios, '12.0'

target 'ObjectDetection' do
  # Comment the next line if you're not using Swift and don't want to use dynamic frameworks
  use_frameworks!

  # Pods for ObjectDetection
  pod 'TensorFlowLiteSwift'

  # Pods for firebase
  pod 'FirebaseAuth'
  pod 'FirebaseFirestore'
  pod 'FirebaseDatabase'
end

As a next step, we need to initialize Firebase at the start of the application with all the configuration. Therefore, in AppDelegate.swift, do the following (again as suggested in the Firebase setup process):

import UIKit
import FirebaseCore


@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {

  var window: UIWindow?

  func application(_ application: UIApplication,
    didFinishLaunchingWithOptions launchOptions:
      [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
    FirebaseApp.configure()

    return true
  }
}

Now, we’re all set to use the Firebase SDK. Depending on the usage, we can write data to and read data from Firebase’s realtime database. In one of my modules, the implementation goes as follows:

import FirebaseDatabase
    func writeToFirebase(outputClass:String, outputClassScore: Float){
        let dateString = "SampleDateAndTime"
        let locationData = "SampleLocation"
        let ref = Database.database().reference().child("deviceID/\(deviceID)").child("\(dateString)")
        ref.updateChildValues(["location":locationData,
                               "outputClass":outputClass,
                               "outputClassScore": outputClassScore])
    }

// Get the deviceId
let deviceID = UIDevice.current.identifierForVendor!.uuidString
// Then we call this function with the arguments
writeToFirebase(outputClass:"myOutputClass", outputClassScore:1.0)

This will then write the desired results to Firebase. For me, the use case was to deploy an ML model on an iPhone and then send the relevant data to Firebase, so that I can query it and build a relevant dashboard.
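On the dashboard side, the records written by the app can be aggregated with a few lines of Python. This is a sketch under my own assumptions: the input is the content of the deviceID node, i.e. {device ID: {date string: record}}, with the field names used in writeToFirebase; `summarise_classes` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarise_classes(devices: dict) -> dict:
    """Count detections per output class and average their scores."""
    scores = defaultdict(list)
    for records in devices.values():      # one entry per device ID
        for record in records.values():   # one entry per date string
            scores[record["outputClass"]].append(record["outputClassScore"])
    return {
        cls: {"count": len(vals), "mean_score": sum(vals) / len(vals)}
        for cls, vals in scores.items()
    }


# Fabricated data in the shape produced by writeToFirebase
sample_devices = {
    "device-a": {
        "t1": {"location": "SampleLocation", "outputClass": "cat", "outputClassScore": 1.0},
        "t2": {"location": "SampleLocation", "outputClass": "dog", "outputClassScore": 0.25},
    },
    "device-b": {
        "t1": {"location": "SampleLocation", "outputClass": "cat", "outputClassScore": 0.5},
    },
}
summary = summarise_classes(sample_devices)
```

The input could be fetched with the Python setup from the first part of this post, e.g. via `db.child("deviceID").get().val()`.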

Next steps

Explore and examine the following and update the documentation:

Adding authentication layer in the iOS application.
Firestore integration for app data e.g. images.

