Mojo🔥 by Modular & its new AI ecosystem
It's exciting to see the world of AI evolving day by day, with ever more powerful tools being built. At the same time, the large hardware requirements (GPUs, TPUs, IPUs, etc.) for training and deploying models bring huge costs and latency issues that need to be addressed for production-quality AI systems. Different AI technologies like TensorFlow, PyTorch and others each come with their own workflows. To help scale this cluttered AI infrastructure, Modular launched a set of products aimed at bringing uniformity to AI development along with faster performance.
Mojo — the new AI programming language
Mojo is designed as a superset of Python, the giant of AI languages. Modular, the company founded by Chris Lattner (creator of the Swift programming language and LLVM) and Tim Davis (former Google ML product lead), claims Mojo can be up to 35,000x faster than Python on certain benchmarks. It aims to combine the power and speed of C with Python's syntax, and builds additional features on top of it.
The compiler has yet to be made public as it's still a work in progress. Preview access to Mojo in a Jupyter Notebook playground is available through an early signup. I tried a native Python implementation of 3×3 matrix multiplication against a Mojo implementation, and the difference in computation time was astonishing (a sketch of the Python baseline is shown below). Since Mojo is a superset of Python, it also supports the vast ecosystem of libraries and packages that makes Python famous.
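For reference, here is a minimal sketch of the kind of naive Python baseline used for such a comparison; the matmul_py helper and the timeit timing are my own illustration rather than Modular's benchmark code, and the Mojo port keeps the same triple loop but adds explicit types and an fn definition.

import timeit

def matmul_py(A, B):
    # Naive triple-loop matrix multiplication over plain Python lists.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]

# Time 100,000 runs of the pure-Python baseline.
print(timeit.timeit(lambda: matmul_py(A, B), number=100_000), "seconds")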
FUN FACT: Mojo allows the fire emoji (.🔥) as a file extension in addition to .mojo, e.g. hello.🔥 or hello.mojo.
What makes Mojo so Powerful?
Mojo is explicitly built on MLIR, a compiler infrastructure that is part of the LLVM project and was originally developed by the TensorFlow team at Google; it is used under the hood of many machine learning accelerators. MLIR provides a framework for representing and transforming computation at different levels of abstraction, enabling efficient and portable execution across various hardware targets. It plays a crucial role in bridging the gap between high-level machine learning frameworks and low-level hardware, providing a unified representation and optimization framework for efficient model execution.
Mojo adds some features on top of Python that enable it to run faster. These features include:
System programming features — let and var declarations with lexical scoping, instead of Python's implicit function-level scoping.
def your_function():
    # Declare immutable values with explicit integer types using let.
    let x: SI8 = 42
    let y: SI64 = 17

    # let also supports late initialization, checked by the compiler.
    let z: SI8
    if x != 0:
        z = 1
    else:
        z = foo()
    use(z)
Strong typing — variable types can be declared explicitly, overcoming Python's dynamic typing, where types are only resolved at runtime, which slows execution down.
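As a quick plain-Python illustration of the overhead those declarations avoid: a variable's type can change at any time, so the interpreter has to re-check types on every operation.

x = 42             # x holds an int here...
print(type(x))     # <class 'int'>
x = "forty-two"    # ...and a str here; the type is only known at runtime
print(type(x))     # <class 'str'>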
Use of structs — their layout is static and resolved at compile time, unlike Python's dynamic classes.
@value
struct MyPair:
    var first: Int
    var second: Int

    def __lt__(self, rhs: MyPair) -> Bool:
        return self.first < rhs.first or
              (self.first == rhs.first and
               self.second < rhs.second)
fn definitions — a stricter counterpart to Python's def that enforces type annotations and predictable value/memory semantics.
fn useStrings():
    var a: MyString = "hello"
    print(a)   # Should print "hello"
    var b = a  # ERROR: MyString doesn't implement __copyinit__

    a = "Goodbye"
    print(b)   # Should print "hello"
    print(a)   # Should print "Goodbye"
Metaprogramming and adaptive compilation/autotuning — Mojo sidesteps Python's GIL (Global Interpreter Lock) to support parallelism and multithreading. It splits Python's notion of function inputs into compile-time parameters and runtime arguments, enabling parameterized types and functions, and it exposes SIMD (Single Instruction, Multiple Data) operations directly.
fn funWithSIMD():
    # Make a vector of 4 floats.
    let small_vec = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)

    # Make a big vector containing 1.0 in float16 format.
    let big_vec = SIMD[DType.float16, 32].splat(1.0)

    # Do some math and convert the elements to float32.
    let bigger_vec = (big_vec + big_vec).cast[DType.float32]()

    # You can write types out explicitly if you want of course.
    let bigger_vec2: SIMD[DType.float32, 32] = bigger_vec
Autotuning lets the compiler empirically search a set of candidate parameter values (such as vector lengths) and pick the one that performs best on the target hardware. It benefits from the internal implementation details of the Mojo compiler stack, i.e. MLIR, and it will need continued development and iteration over time to improve its accuracy and efficiency.
# Apply a vector-length-agnostic algorithm to a buffer of data.
from Autotune import autotune

def exp_buffer_impl[dt: DType](data: ArraySlice[dt]):
    # Pick the vector length for this dtype and hardware
    alias vector_len = autotune(1, 4, 8, 16, 32)

    # Use it as the vectorization length
    vectorize[exp[dt, vector_len]](data)
To dig deeper into the Mojo world, head over to the official docs: https://docs.modular.com/mojo/programming-manual.html
Modular’s Inference Engine
The Modular AI Inference Engine is a unified AI execution engine that powers PyTorch and TensorFlow workloads while delivering significant usability, performance, and portability gains. Modular describes it as the world's fastest unified inference engine, and it can be deployed to any cloud or on-prem environment with minimal code changes.
The Modular AI Inference Engine has several benefits, including:
- Speed: The Modular AI Inference Engine is incredibly fast, delivering 3–4x latency and throughput gains out-of-the-box on state-of-the-art models across Intel, AMD and Graviton architectures.
- Portability: The Modular AI Inference Engine is portable across a wide range of hardware platforms, including CPUs, GPUs, and ASICs. This means that you can deploy your models to any environment without having to rewrite or convert them.
- Usability: The Modular AI Inference Engine is easy to use. It provides a simple API that can be used to deploy models in a variety of ways.
- Extensibility: The Modular AI Inference Engine is extensible. You can add new features and functionality to the engine as needed.
Inference Engine Python API
import numpy as np
from modular import engine

session = engine.InferenceSession()

# Load and initialize both the TensorFlow and PyTorch models
tf_bert_session = session.load(tf_bert_model)
pt_dlrm_session = session.load(pt_dlrm_model)

# Run the TensorFlow BERT model with a given question.
question = "When did Copenhagen become the capital of Denmark?"
attention_mask, input_ids, token_type_ids = convert_to_tokens(question)
bert_outputs = tf_bert_session.execute(attention_mask, input_ids, token_type_ids)

# Run the PyTorch DLRM model with a random one-hot row of suggested items and features.
recommended_items = np.random.rand(4, 8, 100).astype(np.int32)
dense_features = np.random.rand(4, 256).astype(np.float32)
dlrm_outputs = pt_dlrm_session.execute(dense_features, recommended_items)

# Inspect the BERT outputs
print("Number of output tensors:", len(bert_outputs))
print(bert_outputs[0].shape, bert_outputs[0].dtype)
print(bert_outputs[1].shape, bert_outputs[1].dtype)
print("Answer:", convert_to_string(input_ids, bert_outputs))
# Output
Number of output tensors: 2
(1, 192) float32
(1, 192) float32
Answer: Copenhagen became the capital of Denmark in the early 15th century
Conclusion
I'm excited to try the Mojo compiler on my local machine once it's released. It will be interesting to see whether it truly outperforms Python by the margins claimed; if it does, it will be a milestone in AI programming. It's commendable how Modular has drawn on the strengths of existing programming languages to build a performant compiler for Mojo, while learning from the drawbacks and pitfalls of existing systems to build its inference engine and cross-platform hardware ecosystem.
Additional References
- Modular’s Product Launch Keynote — https://youtu.be/-3Kf2ZZU-dg
- Lex Fridman’s Podcast with Chris Lattner — https://youtu.be/pdJQ8iVTwj8
- Modular Blog Posts — https://www.modular.com/blog
- Modular Inference Engine docs — https://docs.modular.com/engine/