In the latest AI breakthrough, Microsoft's Kosmos-1 has made a huge
leap towards artificial general intelligence, or AGI, the holy grail of
AI research. While ChatGPT is limited to being an expert with text and Midjourney is
limited to being an expert with images, Kosmos-1 takes things to the
next level by being an expert with both. By learning from multiple data sources
such as images and text, this multimodal AI model can build a richer
understanding of the world, combining knowledge from different modalities
to tackle complex tasks such as describing images in natural language.
Moreover, the researchers behind it argue that multimodal perception is a
crucial component of intelligence and therefore essential for achieving AGI.
Kosmos-1 is shaping up to be a powerful multimodal large language model
with the potential to propel the AI industry to new heights. In addition to
language and multimodal perception, a possible AGI would also need the ability
to both model and act on the world, and Kosmos-1 contributes to the first of
these by modeling the world through language and images. To achieve this,
the model was trained on partially related image and text data, including
image-text pairs. Additionally, large amounts of Internet text were used to
train the model, as is common practice with large language models like ChatGPT.
The result is that Kosmos-1 has
the remarkable ability to describe images in natural language,
recognize text on images, write captions for them and even
answer questions about them. What's more, the model can perform these tasks
on direct request or in a dialogue setting similar to ChatGPT.
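The exact data pipeline isn't spelled out here, but a common recipe for multimodal large language models is to encode each image with a vision encoder and splice the resulting embeddings directly into the text token stream between special boundary tokens. The sketch below is only an illustration of that idea, with made-up embedding sizes, token names and random stand-ins for the learned encoders, not Microsoft's actual implementation:

import numpy as np

# Illustrative only: build one interleaved training example in which image
# features sit inside the text token stream, the way a multimodal language
# model can be trained on mixed image-text data and plain web text.

EMBED_DIM = 64                      # hypothetical embedding width
BOI, EOI = "<image>", "</image>"    # hypothetical image boundary tokens

def embed_text(tokens):
    # Stand-in for a learned token-embedding lookup table.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), EMBED_DIM))

def embed_image(pixels):
    # Stand-in for a vision encoder that turns an image into a short
    # sequence of patch embeddings.
    rng = np.random.default_rng(1)
    return rng.normal(size=(4, EMBED_DIM))

caption = ["a", "dog", "catching", "a", "frisbee"]
image = np.zeros((224, 224, 3))     # dummy pixel data

# Interleave: text ... <image> [patch embeddings] </image> ... caption text.
sequence = np.concatenate([
    embed_text(["Here", "is", "a", "photo", ":", BOI]),
    embed_image(image),
    embed_text([EOI] + caption),
])
print(sequence.shape)               # one training sequence of embeddings

A model trained on such sequences can treat plain web text and image-text pairs under one and the same next-token objective, which is what lets a single system handle both modalities.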
Kosmos-1's language capabilities are on par with those of large language
models, allowing it to leverage methods like chain of thought prompting to achieve
even better results.
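Chain-of-thought prompting simply means asking the model to reason step by step before committing to an answer, often in two stages. The snippet below is a minimal, hypothetical sketch of that pattern; multimodal_lm is a placeholder for any Kosmos-1-style model call, not Microsoft's actual API:

# Minimal sketch of two-stage chain-of-thought prompting with a multimodal
# model. `multimodal_lm` is a hypothetical stand-in that takes an image and
# a text prompt and returns generated text.

def multimodal_lm(image, prompt):
    raise NotImplementedError("placeholder for a real model call")

def answer_with_chain_of_thought(image, question):
    # Stage 1: ask the model for intermediate reasoning rather than a
    # direct answer.
    rationale = multimodal_lm(image, f"{question}\nLet's think step by step.")
    # Stage 2: feed the reasoning back in and ask for the final answer.
    return multimodal_lm(image, f"{question}\n{rationale}\nTherefore, the answer is")

Splitting the task this way gives the model room to work through intermediate steps before answering, which is where the improved results tend to come from.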
When it comes to Kosmos-1's ability to understand and perceive visual
information, the model scored 5 to 9 percentage points above random chance
on a recent nonverbal visual IQ test of Raven-style abstract patterns,
demonstrating an ability to perceive abstract conceptual patterns in a
nonverbal context. It did so by combining nonverbal reasoning with linguistic
pattern recognition, something AI researchers had previously considered out
of reach. However, the research team acknowledges that there is still a
significant performance gap between Kosmos-1 and the average adult.
Multimodal AI models like Kosmos-1 are proving to be highly effective
at representing implicit connections between different concepts, as
demonstrated by OpenAI's CLIP neuron study.
With further development and fine-tuning, the potential real-world
applications of multimodal AI models appear to be vast, and their trajectory
points towards artificial general intelligence. But despite its success,
Kosmos-1's 1.6 billion parameters make it relatively small compared
to today's large language models, so in order to unlock the full potential
of multimodal artificial intelligence, the researchers hope to scale up the
model and add further modalities to its training, with the ability to process
speech planned next. In addition to expanding its capabilities, they believe
this will also help overcome many of the current
limitations of the model. With the continued development of multimodal artificial intelligence like
Kosmos-1, the possibilities for advanced AI systems appear to be endless,
with Microsoft asserting that multimodal large language models like this will offer
new capabilities and opportunities not available with current large
language models alone.
Amazingly, as the progress of artificial intelligence accelerates,
AI is also bridging the gap between man and machine: it can now effectively
read human minds, reconstructing what people see from their brain activity
with remarkable accuracy. In this latest AI breakthrough, the Graduate School
of Frontier Biosciences at Osaka University in Japan is using a Stable
Diffusion model to reconstruct visual experiences from fMRI data, which
eliminates the need to train complex AI models. Instead, the team only needs
to train simple linear models that map the fMRI signals from the lower and
higher visual brain regions to individual Stable Diffusion components.
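In other words, the heavy lifting is done by the frozen, pretrained diffusion model, and the only trained pieces are plain linear regressions from brain activity to the inputs that Stable Diffusion expects. A rough sketch of that setup might look like the following, using ordinary ridge regression from scikit-learn and made-up array sizes rather than the team's actual data or code:

import numpy as np
from sklearn.linear_model import Ridge

# Illustrative sketch: two simple linear models map fMRI activity to the two
# components a Stable Diffusion pipeline needs, so no complex AI model has to
# be trained. All shapes and data here are made up; in the real study the
# regression targets come from the images each subject actually viewed.

n_trials = 200                        # fMRI trials (stimulus presentations)
n_voxels_lower, n_voxels_higher = 300, 400
latent_dim, text_dim = 64, 128        # tiny stand-ins for the real feature sizes

rng = np.random.default_rng(0)
fmri_lower = rng.normal(size=(n_trials, n_voxels_lower))    # early visual cortex
fmri_higher = rng.normal(size=(n_trials, n_voxels_higher))  # higher visual cortex
z_targets = rng.normal(size=(n_trials, latent_dim))         # image-latent features
c_targets = rng.normal(size=(n_trials, text_dim))           # text-conditioning features

# Lower visual regions -> image latent z; higher visual regions -> conditioning c.
model_z = Ridge(alpha=100.0).fit(fmri_lower, z_targets)
model_c = Ridge(alpha=100.0).fit(fmri_higher, c_targets)

# At test time, predict both components from new fMRI signals and hand them to
# the frozen Stable Diffusion model to render the reconstructed image.
z_pred = model_z.predict(fmri_lower[:1])
c_pred = model_c.predict(fmri_higher[:1])

Because only these small linear maps are fit to the brain data, the approach sidesteps the cost and data requirements of training a deep network for each subject.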