litert-community/Qwen3-0.6B

Main Model Card: Qwen/Qwen3-0.6B

This model card provides a few variants of the Qwen3-0.6B model that are ready for deployment on Android and Desktop.

How to Use

Android (Google AI Edge Gallery)

You can install Google AI Edge Gallery either through the Open Beta in the Play Store or by downloading the APK from GitHub.

To build the demo app from source, please follow the instructions from the GitHub repository.

Android (LiteRT-LM)

1. Add the dependency

Make sure you have the necessary dependency in your Gradle file, replacing <LATEST_VERSION> with the latest published LiteRT-LM version.

dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm:<LATEST_VERSION>")
}

2. Inference with the LiteRT-LM API

import com.google.ai.edge.litertlm.*

suspend fun main() {
  Engine.setNativeMinLogSeverity(LogSeverity.ERROR) // Hide native logs for a TUI app.
  val engineConfig = EngineConfig(
      modelPath = "/path/to/your/model.litertlm", // Replace with your model path.
      backend = Backend.CPU, // Or Backend.GPU
  )

  Engine(engineConfig).use { engine ->
    engine.initialize()

    engine.createConversation().use { conversation ->
      while (true) {
        print("\n>>> ")
        // Message.of also accepts Content variants, e.g. Message.of(Content.Text("Tell me a joke.")).
        conversation.sendMessageAsync(Message.of(readln())).collect { print(it) }
      }
    }
  }
}

Try running this model on NPU by using this .litertlm file and setting the backend in your EngineConfig to Backend.NPU. To check whether your phone’s NPU is supported, see this guide.
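As a minimal sketch (the model path below is only an illustration; point it at wherever you downloaded the NPU .litertlm file):

val npuEngineConfig = EngineConfig(
    modelPath = "/path/to/qwen3-0.6b-npu.litertlm", // Hypothetical path to the downloaded NPU model file.
    backend = Backend.NPU, // Requires a supported NPU; see the guide linked above.
)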

Desktop

For desktop applications, C++ is currently the recommended path. See the following code sample.

// Create the engine.
auto engine_settings = EngineSettings::CreateDefault(
    model_assets,
    /*backend=*/litert::lm::Backend::CPU);

// Create the Engine and Conversation (see the LiteRT-LM repository for the full setup)...

// Send a text message to the LLM.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // The content field is an array of message parts.
          {{"type", "text"}, {"text", "Tell me a joke."}}
        }},
    });
CHECK_OK(model_message);

// Print the model's response.
std::cout << *model_message << std::endl;

Performance

Android

Benchmarked on Vivo X300 Pro.

Backend | Quantization scheme | Context length | Prefill (tokens/sec) | Decode (tokens/sec) | Model size (MB) | Model File
--- | --- | --- | --- | --- | --- | ---
CPU | dynamic_int8 | 4096 | 165 | 9 | 586 | 🔗
GPU | dynamic_int8 | 4096 | 580 | 21 | 586 | 🔗
NPU | a16w8 | 4096 | 1,472 | 36 | 992 | 🔗

Notes:

  • Model size: measured by the size of the file on disk.
  • CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads.
  • Benchmarks are run with the cache enabled and initialized; during the first run, latency and memory usage may differ.