Apple has a video understanding model too. I can't wait to find out what accessibility stuff they'll do with the models. As a blind person, AI has changed my life.
> As a blind person, AI has changed my life.
Something one doesn't see in news headlines. Happy to see this comment.
+1 and I would be curious to read and learn more about it.
A blind comedian / TV personality in the UK has just done a TV show on this subject - I haven't seen it, but here's a recent article about it: https://www.theguardian.com/tv-and-radio/2025/nov/23/chris-m...
Hilariously, he beat the other teams in the “Say What You See” round (yes, really) of last year’s Big Fat Quiz. No AI involved.
https://youtu.be/i5NvNXz2TSE?t=4732
If you want to see more on this topic, check out (google) the podcast I co-host called Accessibility and Gen. AI.
What other accessibility features do you wish existed in video AI models? Real-time vs post-processing?
> Something one doesn't see in news headlines.
I hope this wasn't a terrible pun
One cool feature they added for deaf parents a few years ago was a notification when it detects a baby crying.
Finally good news about the AI doing something good for the people.
I’m not blind and AI has been great for me too.
Something else must be wrong with you then :)
The smiley at the end doesn’t hide how awful your comment is.
Can you share some ways AI has changed your life?
I guess that auto-generated audio descriptions for (almost?) any video you want is a very, very nice feature for a blind person.
My two cents, this seems like a case where it’s better to wait for the person’s response instead of guessing.
My two cents, this seems like a comment it should be up to the OP to make instead of virtue signaling.
Y'all could have gotten a serviceable answer about this topic out of ChatGPT. It's the 2025 version of "let me google that for you".
> Can you share some ways AI has changed your life?
A question directed to GP, directly asking about their life and pointing this out is somehow virtue signalling, OK.
You can safely assume that anyone who uses “virtue signaling” unironically has nothing substantive to say.
From the list of virtues, which one was this signaling?
https://www.virtuesforlife.com/virtues-list/
Fair enough. Anyway I wasn't trying to say what actually changed GP's life, I was just expressing my opinion on what video models could potentially bring as an improvement to a blind person.
>My two cents
Yikes...
I know some people will roll their eyes at this, but it's worth reflecting on how even small phrases like "my two cents" carry historical and cultural baggage. Language doesn't exist in a vacuum; it encodes assumptions about who gets to speak, whose perspectives are valued, and how we frame participation in public discourse.
The metaphor of assigning a literal monetary value to one's opinion reinforces the idea that contributions are transactional and that their "worth" is measured through an economic lens. That framing can be exclusionary, especially for people who have been historically marginalized by economic systems. It subtly normalizes a worldview where only those with enough "currency" - social, financial, or otherwise - deserve to be heard.
It's not about policing every idiom, but about being mindful of how our words might echo structures we're trying to move beyond. There are plenty of alternatives ("just my perspective" or "here's how I see it" etc.) that don't inherit that baggage. Language evolves, and we can choose to evolve it in more inclusive directions.
> The metaphor of assigning a literal monetary value to one's opinion reinforces the idea that contributions are transactional and that their "worth" is measured through an economic lens. That framing can be exclusionary, especially for people who have been historically marginalized by economic systems. It subtly normalizes a worldview where only those with enough "currency" - social, financial, or otherwise - deserve to be heard.
No. It’s acknowledging that perhaps one’s opinion may not be as useful as somebody else’s in that moment. Which is often true!
Your first and third paragraphs are true, but they don’t apply to every bloody phrase.
guessing that being able to hear a description of what the camera is seeing (basically a special case of a video) in any circumstances is indeed life changing if you're blind...? take a picture through the window and ask what's the commotion? door closed outside that's normally open - take a picture, tell me if there's a sign on it? etc.
Not the gp, but currently reading a web novel with a card game where the author didn't include alt text in the card images. I contacted them about it and they started, but in the meantime ai was a big help. all kinds of other images on the internet as well when they are significant to understanding the surrounding text. better search experience when Google, DDG, and the like make finding answers difficult. I might use smart glasses for better outdoor orientation, though a good solution might take some time. phone camera plus ai is also situationally useful.
As a (web app) developer I'm never quite sure what to put in alt. Figured you might have some advice here?
The question to ask is: what does a sighted person learn from looking at the image? The answer is the alt text. E.g., if the image is a floppy disk, maybe you communicate that this is the save button. If it shows a cat sleeping on the windowsill, the alt text is, yep: "my cat looking cute while sleeping on the windowsill".
I really like how you framed this as the takeaway or learning that needs to happen as what should be in the alt and not a recitation of the image. Where I've often had issues is more for things like business charts and illustrations and less cute cat photos.
It might be that you’re not perfectly clear on what exactly you’re trying to convey with the image and why it’s there.
sorry, snark does not help with my desire to improve accessibility in the wild.
"A meaningless image of a chart, from which nevertheless emanates a feeling of stonks going up"
The logic stays the same, though the answer is longer and not always easy. Just saying "business chart" is totally useless. You can choose what to focus on and say "a chart of the stock for the last five years with constant improvement and a clear increase by 17 percent in 2022" (if you are trying to make a simple point), or you can provide an HTML table with the data points if there is data that the user needs to explore on their own.
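To make those two options concrete, here is a minimal sketch in Python. The helper names (`chart_alt_text`, `chart_data_table`), the file name `chart.png`, and the sample prices are all made up for illustration; they are not taken from any library or from the thread.

```python
from html import escape


def chart_alt_text(subject: str, years: int, highlight: str) -> str:
    """Option 1: a short alt text that states the point the chart makes."""
    return f"A chart of {subject} for the last {years} years with {highlight}."


def chart_data_table(caption: str, rows: list) -> str:
    """Option 2: an HTML table so the reader can explore the data points.

    `rows` is a list of (label, value) pairs.
    """
    body = "\n".join(
        f'    <tr><th scope="row">{escape(label)}</th><td>{value:.2f}</td></tr>'
        for label, value in rows
    )
    return (
        "<table>\n"
        f"  <caption>{escape(caption)}</caption>\n"
        "  <thead>\n"
        '    <tr><th scope="col">Year</th><th scope="col">Price</th></tr>\n'
        "  </thead>\n"
        "  <tbody>\n"
        f"{body}\n"
        "  </tbody>\n"
        "</table>"
    )


# Hypothetical sample data, invented for this example.
alt = chart_alt_text(
    "the stock", 5, "constant improvement and a clear increase by 17 percent in 2022"
)
img_tag = f'<img src="chart.png" alt="{escape(alt)}">'
table = chart_data_table(
    "Stock price by year", [("2020", 96.0), ("2021", 100.0), ("2022", 117.0)]
)

print(img_tag)
print(table)
```

The short alt text fits when the chart makes one point; the table fits when the reader needs the underlying numbers. Both go through `html.escape` so that quotes or angle brackets in the text cannot break the markup.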
The license[0] seems quite restrictive, limiting its use to non-commercial research. It doesn't meet the open source definition, so it's more appropriate to call it weights-available.
[0]https://github.com/apple/ml-starflow/blob/main/LICENSE_MODEL
Looking at the text-to-video examples (https://starflow-v.github.io/#text-to-video), I'm not impressed. They gave me the feeling of the early Will Smith noodles videos.
Did I miss anything?
These are ~2 years behind state of the art from the looks of it. Still cool that they're releasing anything that's open for researchers to play with, but it's nothing groundbreaking.
No, it is not as good as Veo, but better than Grok, I would say. Definitely better than what was available 2 years ago. And it is only a 7B research model!
But 7B is rather small, no? Are other open weight video models also this small? Can this run on a single consumer card?
> But 7B is rather small, no?
Sure, it's smallish.
> Are other open weight video models also this small?
Apple's models are weights-available, not open weights. And yes: WAN 2.1 has 1.3B models as well as the 14B ones, and WAN 2.2 has a 5B model as well as the 14B ones (the WAN 2.2 VAE used by STARFlow-V is specifically the one used with the 5B model). Because the WAN models are largely genuine open-weights models (Apache 2.0 licensed), there are lots of downstream open-licensed derivatives.
> Can this run on a single consumer card?
Modern model runtimes like ComfyUI can run models that do not fit in VRAM on a single consumer card by swapping model layers between RAM and VRAM as needed, so even models bigger than this one can run on a single consumer card.
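A rough back-of-the-envelope calculation shows why that swapping matters. The sketch below counts weight storage only (no activations, text encoder, or VAE). The 12 GiB budget, the dtype sizes, and especially the block count of 32 are illustrative assumptions, not figures for STARFlow-V or for any particular runtime.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
GIB = 1024 ** 3


def weight_gib(n_params: float, dtype: str) -> float:
    """Size of the raw weights in GiB for a given parameter count and dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


def fits_in_vram(n_params: float, dtype: str, vram_gib: float) -> bool:
    """True if all weights fit in VRAM at once (ignores activations)."""
    return weight_gib(n_params, dtype) <= vram_gib


def per_block_gib(n_params: float, n_blocks: int, dtype: str) -> float:
    """Size of one block if the weights were split into equal blocks.

    With layer swapping, roughly this much has to be resident at a time.
    Equal-sized blocks are a simplifying assumption.
    """
    return weight_gib(n_params, dtype) / n_blocks


VRAM_GIB = 12   # assumed consumer card
N_BLOCKS = 32   # hypothetical block count, for illustration only

for name, n_params in [("1.3B", 1.3e9), ("5B", 5e9), ("7B", 7e9), ("14B", 14e9)]:
    size = weight_gib(n_params, "fp16")
    verdict = "fits" if fits_in_vram(n_params, "fp16", VRAM_GIB) else "needs offloading"
    block = per_block_gib(n_params, N_BLOCKS, "fp16")
    print(f"{name}: {size:.2f} GiB in fp16, {verdict}, ~{block:.2f} GiB per block")
```

Under these assumptions a 7B model in fp16 is about 13 GiB of weights, just over a 12 GiB card, while any single block is well under 1 GiB. That gap is what makes streaming layers in and out of VRAM workable, at the cost of speed.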
Wan 2.2: "This generation was run on an RTX 3060 (12 GB VRAM) and took 900 seconds to complete at 840 × 420 resolution, producing 81 frames." https://www.nextdiffusion.ai/tutorials/how-to-run-wan22-imag...
I think you need to go back and rewatch Will Smith eating spaghetti. These examples are far from perfect and probably not the best model right now, but they're far better than you're giving credit for.
As far as I know, this might be the most advanced text-to-video model that has been released? I'm not sure whether the license will qualify as open enough in everyone's eyes, though.
I wanted to write exactly the same thing, this reminded me of the Will Smith noodles. The juice glass keeps filling up after the liquid stopped pouring in.
> STARFlow-V is trained on 96 H100 GPUs using approximately 20 million videos.
They don’t say for how long.
> Model Release Timeline: Pretrained checkpoints will be released soon. Please check back or watch this repository for updates.
> The checkpoint files are not included in this repository due to size constraints.
So it's not actually open weights yet. Maybe eventually once they actually release the weights it will be. "Soon"
Looks good. I wonder what use case Apple has in mind though, or I suppose this is just what the researchers themselves were interested in, perhaps due to the current zeitgeist. I'm not really sure how it works at big tech companies with regard to research. Are there top-down mandates?
Where do they get the video training data?
From the paper:
> Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.
> in-house stock video dataset
Wonder if "iCloud backups" would be counted as "stock video" there? ;)
I have to delete as many videos as humanly possible before backing up to avoid blowing through my iCloud storage quota so I guess I’m safe
Turn on advanced data protection so they don't train on yours.
That has nothing to do with it, and Apple wouldn’t train on user content, they’re not Google. If they ever did there would be opt in at best. There’s a reason they’re walking and observing, not running and trying to be the forefront cloud AI leader, like some others.
"VAE: WAN2.2-VAE" so it's just a Wan2.2 edit, compressed to 7B.
This doesn't necessarily mean that it's Wan2.2. People often don't train their own VAEs and just reuse an existing one, because a VAE isn't really what's doing the image generation part.
A little bit more background for those who don't know what a VAE is (I'm simplifying here, so bear with me): it's essentially a model which turns raw RGB images into a representation in something called a "latent space". You can think of it as a fancy "color" space, but on steroids.
There are two main reasons for this: one is to make the model which does the actual useful work more computationally efficient. VAEs usually downscale the spatial dimensions of the images they ingest, so your model now instead of having to process a 1024x1024 image needs to work on only a 256x256 image. (However they often do increase the number of channels to compensate, but I digress.)
The other reason is that, unlike raw RGB space, the latent space is actually a higher level representation of the image.
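For the efficiency point, the arithmetic can be sketched in a few lines of Python. The 4x spatial factor matches the 1024x1024 to 256x256 example above; the 16 latent channels are an assumption picked for illustration, not a figure from the thread or from any specific VAE.

```python
def latent_shape(height: int, width: int, spatial_factor: int, latent_channels: int):
    """Shape (H, W, C) of the latent for an image downscaled by `spatial_factor`."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("image size must be divisible by the spatial factor")
    return (height // spatial_factor, width // spatial_factor, latent_channels)


def count(shape) -> int:
    """Total number of values in a tensor of the given (H, W, C) shape."""
    h, w, c = shape
    return h * w * c


rgb = (1024, 1024, 3)
latent = latent_shape(1024, 1024, spatial_factor=4, latent_channels=16)  # channels assumed

positions_ratio = (rgb[0] * rgb[1]) // (latent[0] * latent[1])
values_ratio = count(rgb) / count(latent)

print(f"RGB {rgb} has {count(rgb)} values")
print(f"latent {latent} has {count(latent)} values")
print(f"{positions_ratio}x fewer spatial positions, {values_ratio:.1f}x fewer values overall")
```

With these numbers the downstream model sees 16 times fewer spatial positions, which is the part that dominates compute for attention-style models, even though the extra channels mean the total number of values only drops by a factor of 3.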
Training a VAE isn't the most interesting part of image models, and while it is tricky, it's done in an entirely unsupervised manner. You give the VAE an RGB image, have it convert it to latent space, then have it convert it back to RGB. You take a diff between the input RGB image and the output RGB image, and that's the signal you use for training (in reality it's a little more complex, but, again, I'm simplifying here to make the explanation clearer). So it makes sense to reuse an existing VAE and concentrate on the actually interesting parts of an image generation model.
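The encode, decode, diff loop described above can be shown with a toy example. This is a plain linear autoencoder on 2-D points in pure Python, not an actual VAE: real VAEs are deep networks and add further loss terms (for example a KL regularizer on the latent). All names and hyperparameters here are made up for the sketch.

```python
# Toy data: 2-D points that all lie on the line y = 2x, so a single latent
# number per point is enough to reconstruct them.
DATA = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(enc, x):
    """Map a 2-D input to a single latent value."""
    return enc[0] * x[0] + enc[1] * x[1]


def decode(dec, z):
    """Map a latent value back to a 2-D point."""
    return (dec[0] * z, dec[1] * z)


def reconstruction_loss(enc, dec, data):
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(dec, encode(enc, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(data, steps=200, lr=0.05):
    """Full-batch gradient descent on the reconstruction loss."""
    enc, dec = [0.1, 0.1], [0.1, 0.1]
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(enc, x)
            x_hat = decode(dec, z)
            diff = (x_hat[0] - x[0], x_hat[1] - x[1])  # output minus input
            # Gradients of the squared error, derived by hand.
            dz = 2.0 * (diff[0] * dec[0] + diff[1] * dec[1])
            for i in range(2):
                g_dec[i] += 2.0 * diff[i] * z / len(data)
                g_enc[i] += dz * x[i] / len(data)
        for i in range(2):
            enc[i] -= lr * g_enc[i]
            dec[i] -= lr * g_dec[i]
    return enc, dec


initial_loss = reconstruction_loss([0.1, 0.1], [0.1, 0.1], DATA)
enc, dec = train(DATA)
final_loss = reconstruction_loss(enc, dec, DATA)
print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.2e}")
```

No labels are involved anywhere: the input is its own target, which is why this kind of training counts as unsupervised.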
> "VAE: WAN2.2-VAE" so it's just a Wan2.2 edit
No, using the WAN 2.2 VAE does not mean it is a WAN 2.2 edit.
> compressed to 7B.
No, if it was an edit of the WAN model that uses the 2.2 VAE, it would be expanded to 7B, not compressed (the 14B models of WAN 2.2 use the WAN 2.1 VAE, the WAN 2.2 VAE is used by the 5B WAN 2.2 model.)
They used the VAE of WAN like many other models do. For image models you see a lot of them using the Flux VAE. Which is perfectly fine: they are released as Apache 2.0 and save you time to focus on your transformer architecture...
Hopefully this will make it into some useful feature in the ecosystem and not contribute to just more terrible slop. Apple has saved itself from the destruction of quality and taste that these models enabled; I hope it stays that way.
[flagged]
you don't "appreciate" anything, you're just posting LLM comments
<joke> GGUF when? </joke>