AI & ML interests

None defined yet.

Recent Activity

hesamation 
posted an update 8 days ago
this is big... 50 AI researchers from ByteDance, Alibaba, Tencent, and other labs/universities just published a 300-page paper with surprising lessons about coding models and agents (data, pre- and post-training, etc.).

key highlights:

> small LLMs can beat proprietary giants
RL (RLVR specifically) gives small open-source models an edge over big models in reasoning. a 14B model trained with RLVR on high-quality verified problems can match the performance of OpenAI's o3.
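
the "verified" part is the key: rewards come from automatic checks, not a learned reward model. a minimal sketch of what a verifiable reward for code could look like (function and test strings are illustrative, not from the paper):

```python
import subprocess
import tempfile

def verifiable_reward(generated_code: str, test_code: str) -> float:
    """Return 1.0 if the generated code passes the tests, else 0.0.
    This binary, automatically checkable signal is what RLVR optimizes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# reward is 1.0 only if the asserts pass
print(verifiable_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```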

> models have a hard time learning Python.
mixing programming languages during pre-training is good, but Python behaves differently from statically typed languages. languages with similar syntax (Java and C#, or JavaScript and TypeScript) create high positive synergy. mixing Python heavily into the training of statically typed languages can actually hurt because of Python's dynamic typing.

> not all languages are equal (coding scaling laws)
the amount of data required to specialize a model on a language depends drastically on the language. the paper argues languages like C# and Java are easier to learn (less training data required), while languages like Python and JavaScript are actually trickier to learn, ironically (these are the languages AI gets used for most :)

> MoE vs Dense (ability vs stability)
MoE models offer higher capacity, but are much more fragile during SFT than dense models. training hyperparams have a more drastic effect on MoE models, while dense models are more stable. MoE models also require constant learning rate schedules to avoid routing instability.
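
for reference, a constant schedule is a one-line change in a Hugging Face transformers SFT setup (values below are placeholders, not the paper's recipe):

```python
from transformers import TrainingArguments

# constant learning rate for MoE SFT, instead of the usual cosine/linear decay
args = TrainingArguments(
    output_dir="moe-sft",
    learning_rate=1e-5,           # placeholder value
    lr_scheduler_type="constant", # avoids decay interacting badly with expert routing
)
```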

> code models are "insecure" by default (duh)
training on public repos makes models learn years of accumulated insecure coding patterns. safety fine-tuning often does little for code: a model might refuse to write a hate-speech email but will happily generate a SQL-injection-vulnerable function because it "works."
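
to make the SQL point concrete, this is the kind of pattern learned from old public repos next to the parameterized version (illustrative snippet, not from the paper):

```python
import sqlite3

def get_user_insecure(conn: sqlite3.Connection, username: str):
    # vulnerable: user input is interpolated straight into the SQL string
    return conn.execute(f"SELECT * FROM users WHERE name = '{username}'").fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # parameterized query: the driver escapes the value, so injection payloads are inert
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```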

read the full paper:
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence (2511.18538)
monsoon-nlp 
posted an update 10 days ago
anakin87 
posted an update 13 days ago
anakin87 
posted an update 28 days ago
LLMs can leak their post-training data (RL included) 💧

Interesting new paper on this topic from Google DeepMind: Extracting alignment data in open models (2510.18554)

It's known that Language Models memorize data that can be extracted via prompting.

In this paper, the authors investigate this aspect:
- using open models, where prompting can be fully customized by the user, including special tokens.
- focusing on open-source models like Olmo, where full training data is available.


📤 How do they extract data?

During post-training (like SFT), new tokens such as <|user|> are introduced.

The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).

For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
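
In code, the extraction step is essentially just sampling from that prompt. A minimal sketch with transformers (the checkpoint name is an example open post-trained model; the exact special tokens vary per model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-SFT"  # example open post-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# prompt with only the special tokens that open a user turn, exactly as given
inputs = tokenizer("<|endoftext|><|user|>", return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```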


📏 Evaluating memorization

The authors compare each sampled example with the original data using vector search with embedding similarity.

They find that many outputs are semantically very similar to the original data, even if the exact words differ.

Traditional string-matching algorithms underestimate memorization by 10x.
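
The comparison is basically nearest-neighbor search in embedding space. A rough sketch with sentence-transformers (embedding model and threshold are my choices, not the paper's):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

generated = "Sure! Here are three tips for writing a strong resume: ..."
training_data = [
    "Here are some tips to make your resume stand out: ...",
    "The capital of France is Paris.",
]

gen_emb = embedder.encode([generated], convert_to_tensor=True)
train_emb = embedder.encode(training_data, convert_to_tensor=True)

# cosine similarity against every training example; keep the best match
scores = util.cos_sim(gen_emb, train_emb)[0]
best_idx = int(scores.argmax())
if scores[best_idx] > 0.8:  # illustrative threshold
    print(f"likely memorized (sim={scores[best_idx]:.2f}): {training_data[best_idx]}")
```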


🔁 What about RL?

Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.

This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).

Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.


📈 Are the extracted data effective for post-training?

For both SFT and RL, the extracted data can be used to fine-tune models to performance similar to the originals.

The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.

Ihor 
posted an update about 1 month ago
Hey builders 👷‍♀️

We’re Knowledgator, the team behind open-source NLP models like GLiNER, GLiClass, and many others used for zero-shot text classification and information extraction.
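
(If you haven't tried them yet, a minimal GLiNER example looks like this; the checkpoint name is just one of the public ones.)

```python
from gliner import GLiNER

# zero-shot NER: the label set is passed at inference time, no task-specific fine-tuning
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

text = "Knowledgator released GLiClass on Hugging Face in 2024."
labels = ["organization", "model", "platform", "date"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```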

If you’ve explored them on Hugging Face or used our frameworks from GitHub, we’d love your input:
🧩 Which of our models, like GLiNER or zero-shot classifiers, do you find helpful in your practical workflows?
🧩 How’s the setup, performance, and accuracy been for you?
🧩 Anything confusing, buggy, or missing that would make your workflow smoother?

Your feedback helps us improve speed, clarity, and stability for everyone in the open-source community.

💬 Comment directly here or join the discussion. We read every one 😉:
GitHub: https://github.com/Knowledgator
Discord: https://discord.gg/GXRcAVJQ
HuggingFace: knowledgator

📝 Want to shape our next release?
Click here to complete this 2-min survey: https://docs.google.com/forms/d/e/1FAIpQLSdyz2UMHrMDX8S9stpBk0wyfngtKSYzwk-02mN1VNYDdTw8OQ/viewform
merve 
posted an update about 2 months ago
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient performance per vision token
> covers 100 languages
Sri-Vigneshwar-DJ 
posted an update 2 months ago
Do you think domain-specific embedding fine-tuning is needed?
I've been working with embeddings for marketing use cases and noticed something: most embedding models don't capture marketing concepts very well. They're trained for general-purpose use.
The Issue I'm Seeing
When I search marketing content with general embeddings:

"organic growth" returns farming articles
"conversion funnel" matches industrial equipment
"brand lift" doesn't connect to campaign effectiveness
Marketing jargon like CAC, ROAS, and CTR isn't properly understood

My Question
Do you think domain-specific embeddings are needed for marketing?
Some thoughts:

Marketing has its own vocabulary and concept relationships
General models trained on Wikipedia/web crawl miss these nuances
But is fine-tuning worth the effort vs just using more retrieval tricks?

Quick Example
I fine-tuned all-mpnet-base-v2 on ~1000 marketing concept pairs and saw 15-20% better retrieval accuracy (a rough sketch of the setup is at the end of this post). But I'm curious:

Has anyone else tried this for marketing or other domains?
When do you think domain-specific embeddings are actually necessary vs overkill?
Are there better approaches I'm missing?

https://huggingface.co/blog/Sri-Vigneshwar-DJ/why-your-marketing-rag-system-needs-domain-specifi
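
For anyone curious, here's a rough sketch of the kind of fine-tune I mean, using sentence-transformers with in-batch negatives (the pairs and hyperparameters below are illustrative, not my actual training set):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")

# marketing phrases that should land close together in embedding space
pairs = [
    ("organic growth", "growing customers without paid acquisition"),
    ("conversion funnel", "stages a prospect moves through before purchase"),
    ("brand lift", "improvement in brand perception driven by a campaign"),
    ("CAC", "customer acquisition cost"),
]
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

# each pair is a positive; the other pairs in the batch act as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("mpnet-marketing")
```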
Sri-Vigneshwar-DJ 
posted an update 2 months ago
🚀 Exciting News! We've released a Performance Marketing Expert Dataset from Hawky.ai [www.hawky.ai]


This dataset empowers AI models with cutting-edge strategies for Meta, Google Ads, and TikTok campaigns. It includes:
1. Multi-platform strategies for e-commerce, DTC, B2B, and more
2. Creative optimization and audience targeting insights
3. ROI-driven recommendations based on 2025 best practices

Sri-Vigneshwar-DJ/Performance-Marketing-Data
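
You can pull it with the datasets library in the usual way (split name assumed):

```python
from datasets import load_dataset

ds = load_dataset("Sri-Vigneshwar-DJ/Performance-Marketing-Data", split="train")
print(ds[0])
```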
Sri-Vigneshwar-DJ 
posted an update 2 months ago
🚀 Qwen3-Omni for Marketing: A Game-Changer

Just wanted to share something exciting I've been exploring—Qwen3-Omni and how it's transforming marketing workflows.

What makes it special? At Hawky.ai we recently started experimenting with Qwen3-Omni for analysis and optimization.

Unlike traditional tools that look at text, images, or audio separately, Qwen3-Omni analyzes everything together. It handles 119 languages, processes 40-minute audio sequences, and understands both images and videos—all at once.

The cool part? It's 2-3x faster than similar models thanks to its MoE architecture.

Real applications I'm seeing:
Ad Analysis: It scores video ads by combining visual elements, audio tone, and text—giving 25% better CTR predictions than single-mode tools.
Campaign Localization: Drop in one ad, get 10 localized versions with native voiceovers in under a minute. Perfect for testing across markets.

Market Research: Feed it competitor content, podcasts, or UGC videos. It extracts actionable insights like "3-second hooks boost retention by 15%" and saves about 70% of analysis time.

Quality Checks: Automatically catches lip-sync errors and audio-visual mismatches.

Full technical breakdown: https://huggingface.co/blog/Sri-Vigneshwar-DJ/hawky-aiqwen3-omni-advanced-architecture-and-marke

Has anyone else been experimenting with multimodal models for marketing? Would love to hear what you're building!

#MultimodalAI #MarTech #OpenSource
monsoon-nlp 
posted an update 3 months ago
Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬
monsoon-nlp/tomatotomato-gLM2-150M-v0.1
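
Very roughly, the recipe is like adding accession tags as special tokens so a single input can say which tomato each segment comes from; a loose sketch with a generic HF tokenizer (token names and loading details are made up for illustration, not the actual TomatoTomato setup):

```python
from transformers import AutoTokenizer, AutoModel

model_id = "monsoon-nlp/tomatotomato-gLM2-150M-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# hypothetical accession tokens marking domestic vs. wild tomato segments
tokenizer.add_tokens(["<heinz_1706>", "<pimpinellifolium>"])
model.resize_token_embeddings(len(tokenizer))

# one sequence interleaving segments from both genomes
print(tokenizer.tokenize("<heinz_1706> ATGGCTTCA <pimpinellifolium> ATGGCATCA"))
```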
merve 
posted an update 3 months ago
large AI labs open-sourced a ton of models last week 🔥
here are a few picks, find even more here merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (A2.0) 📝 ibm-granite/granite-docling-258M
> Xiaomi released 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan Longcat released thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
KnutJaegersberg 
posted an update 3 months ago
The Formative Mind: Theories of Consciousness as Practice

Instead of treating consciousness as a passive byproduct of a powerful unconscious engine, think of it as the engine itself: a process that builds rich representations (self-organizing), predicts and models its own processing (metarepresentation), and thereby brings an agent and its world into being (individuation). A brief synthesis.


https://huggingface.co/blog/KnutJaegersberg/formative-mind