Bartowski PRO

bartowski

908 8 176

AI & ML interests

Senior Machine Learning Engineer at RedHat

Recent Activity

new activity 6 days ago

bartowski/DeepSeek-V4-Flash-GGUF:Is Q8_0 lossy against original precision (FP8?)

new activity 7 days ago

bartowski/DeepSeek-V4-Flash-GGUF:Run on LLamaCPP

updated a model 8 days ago

bartowski/DeepSeek-V4-Flash-GGUF

View all activity

Organizations

replied to their post about 2 months ago

No support merged yet, will keep an eye on this draft PR :)

https://github.com/ggml-org/llama.cpp/pull/23112

posted an update about 2 months ago

Post

36311

You may have noticed that my upload of MiMo-V2.5 upload didn't have the author in the model name:

bartowski/MiMo-V2.5-GGUF

Going forward, I plan to upload models from major 1st party developers without the author name attached for cleanliness, I feel it results in a nicer and more expected user experience

I will continue to uploaded fine tunes with that author + "_" appended for clarity, I personally feel it's nice to know at a glance who's tune it is, but it's also for the reason I first started doing it, to avoid it being confused for a new version of the official release

I hope this change makes sense, it seemed most reasonable to me and a poll I did (forever ago, I move slow sometimes) made it seem likely others would find it reasonable as well (feel free to let me know if you disagree, may not change my mind but I do value knowing what others think)

Thanks for downloading :)

4 replies

reacted to marksverdhei's post with 🔥 6 months ago

Post

4324

Inspired by the heroes of day zero quants (@TheBloke @danielhanchen @shimmyshimmer @bartowski ), I decided to join the race by releasing the first FP8 quant of glm-4.7-flash! Not as easy as i expected, but I'm happy i was still able to have it working within a few hours after the original model was released! Interested in feedback if anyone wants to try it out!

marksverdhei/GLM-4.7-Flash-FP8

Note: If my PR to vLLM isn't merged yet you might have to use my fork. Cheers! 🤗

reacted to jsulz's post with 🔥🚀 about 1 year ago

Post

6954

It's been a bit since I took a step back and looked at

xet-team progress to migrate Hugging Face from Git LFS to Xet, but every time I do it boggles the mind.

A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today?
🤗 700,000 users/orgs
📈 350,000 repos
🚀 15PB

Meanwhile, our migrations have pushed throughput to numbers that are bonkers. In June, we hit upload speeds of 577Gb/s (crossing 500Gb/s for the first time).

These are hard numbers to put into context, but let's try:

The latest run of the Common Crawl from

commoncrawl was 471 TB.

We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.

We're moving to a new phase in the process, so stay tuned.

This shift in gears means it's also time to roll up our sleeves and look at all the bytes we have and the value we're adding to the community.

I already have some homework from @RichardErkhov to look at the dedupe across their uploads, and I'll be doing the same for other early adopters, big models/datasets, and frequent uploaders (looking at you @bartowski 👀)

Let me know if there's anything you're interested in; happy to dig in!

9 replies

posted an update about 1 year ago

Post

135459

Was going to post this on /r/LocalLLaMa, but apparently it's without moderation at this time :')

bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF

Was able to use previous mistral chat templates, some hints from Qwen templates, and Claude to piece together a seemingly working chat template, tested it with llama.cpp server and got perfect results, though lmstudio still seems to be struggling for some reason (don't know how to specify a jinja file there)

Outlined the details of the script and results in my llama.cpp PR to add the jinja template:

https://github.com/ggml-org/llama.cpp/pull/14349

Start server with a command like this:

./llama-server -m /models/mistralai_Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M.gguf --jinja --chat-template-file /models/Mistral-Small-3.2-24B-Instruct-2506.jinja

and it should be perfect! Hoping it'll work for ALL tools if lmstudio gets an update or something, not just llama.cpp, but very happy to see it works flawlessly in llama.cpp

In the meantime, will try to open a PR to minja to make the strftime work, but no promises :)

posted an update about 1 year ago

Post

39902

Access requests enabled for latest GLM models

While a fix is being implemented (https://github.com/ggml-org/llama.cpp/pull/12957) I want to leave the models up for visibility and continued discussion, but want to prevent accidental downloads of known broken models (even though there are settings that could fix it at runtime for now)

With this goal, I've enabled access requests. I don't really want your data, so I'm sorry that I don't think there's a way around that? But that's what I'm gonna do for now, and I'll remove the gate when a fix is up and verified and I have a chance to re-convert and quantize!

Hope you don't mind in the mean time :D

1 reply

reacted to jsulz's post with 🔥 about 1 year ago

Post

4284

What does it mean when models share the same bytes?

We've investigated some quants and have seen that a considerable portion of quantizations of the same model share the same bytes and can be deduplicated to save considerable upload time for quantizers on the Hub.

This space where we crack open a repo from @bartowski shows we can get significant dedupe xet-team/quantization-dedup

You can get a sense of why by reading this write-up: https://github.com/bartowski1182/llm-knowledge/blob/main/quantization/quantization.md

But what about finetuned models?

Since going into production the

xet-team has migrated hundreds of repositories on the Hub to our storage layer, including classic "pre-Hub" open-source models like FacebookAI/xlm-roberta-large (XLM-R) from

FacebookAI

XLM-R, introduced in 2019, set new benchmarks for multilingual NLP by learning shared representations across 100 languages. It was then fine-tuned on English, Spanish, Dutch, and German, generating language-specific derivations for each - check out the paper here Unsupervised Cross-lingual Representation Learning at Scale (1911.02116)

These finetunes share much of the same architecture and layout as XLM-R with similar training methods and goals. It makes sense that they would share bytes, but it's still fascinating to see.

We put together a similar space to explore these models to see where they overlap - check it out for yourself xet-team/finetune-dedupe

The darker each block in the heatmap, the more the bytes are shared. Clicking on a repos blocks shows all other repos that share blocks.

1 reply

reacted to fdaudens's post with 🔥❤️ over 1 year ago

Post

10142

Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:

- Original release: 8 models, 540K downloads. Just the beginning...

- The community turned those open-weight models into +550 NEW models on Hugging Face. Total downloads? 2.5M—nearly 5X the originals.

The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.

When you empower builders, innovation explodes. For everyone. 🚀

The most popular community model? @bartowski 's DeepSeek-R1-Distill-Qwen-32B-GGUF version — 1M downloads alone.

5 replies

reacted to ngxson's post with 🔥 over 1 year ago

Post

4294

Check out my collection of pre-made GGUF LoRA adapters!

This allow you to use both normal + abliterated version of popular models like llama, qwen, etc, without having to double to amount of VRAM usage.

ngxson/gguf_lora_collection

5 replies

reacted to ngxson's post with 🚀 over 1 year ago

Post

4135

I made this small tool that can be useful for debugging Ollama chat template: ngxson/ollama_template_test

CC @bartowski you may need this ;-)

2 replies

replied to their post over 1 year ago

I don't love the period in the name since I don't like using it for purposes other than the file extension

I don't love the underscore either for what it's worth, but period feels wrong haha

- is probably ideal but then those are used in both author and model names already so the distinction between the two becomes blurred

posted an update over 1 year ago

Post

73782

Switching to author_model-name

I posted a poll on twitter, and others have mentioned the interest in me using the convention of including the author name in the model path when I upload.

It has a couple advantages, first and foremost of course is ensuring clarity of who uploaded the original model (did Qwen upload Qwen2.6? Or did someone fine tune Qwen2.5 and named it 2.6 for fun?)

The second thing is that it avoids collisions, so if multiple people upload the same model and I try to quant them both, I would normally end up colliding and being unable to upload both

I'll be implementing the change next week, there are just two final details I'm unsure about:

First, should the files also inherit the author's name?

Second, what to do in the case that the author name + model name pushes us past the character limit?

Haven't yet decided how to handle either case, so feedback is welcome, but also just providing this as a "heads up"

5 replies

replied to their post over 1 year ago

No it does not include the XS, the reason Q4_0 and IQ4_NL work i think is because they don't do any clever packing of the scaling factors, that's why K quants and IQ4_XS (which is like NL but with some K quant logic) don't work yet

replied to their post over 1 year ago

oh, yeah, of course.. I added all the ARM quants but then not Q4_0 which is now the only one that would work haha..

I'll go any make a Q4_0 for it I suppose ! just this once

replied to their post over 1 year ago

Don't love adding more formats but if your results are accurate it does seem worth including

replied to their post over 1 year ago

I've updated it to "Legacy format, offers online repacking for ARM and AVX CPU inference.", it is still overall legacy but with the online repacking is worth considering for speed

I'm hoping that IQ4_NL gets a few more packing options in the near future

replied to their post over 1 year ago

hell yeah. wish we could still offline compile, i get why it's not sustainable in the future but also until there's better support and more options would be nice to keep it around

replied to their post over 1 year ago

oh right sorry, forgot to include that PR, i'll add it above but it's here:

https://github.com/ggerganov/llama.cpp/pull/10541

I think the inference engines will just need to update to the newer versions and they'll get the repacking logic for free, if that's what you meant then yes

Bartowski PRO

AI & ML interests

Recent Activity

Organizations

bartowski's activity