Google Gemini AI: Multi-Modal Transformer Design using DALL-E 3

Google introduces its long-awaited Gemini AI suite of Large Language Models

Introduction

  • Context: Google introduces Gemini, a highly advanced and versatile AI model.
  • Design Philosophy: Built as a multimodal system, Gemini processes and integrates diverse data types, including text, code, audio, images, and video.
  • Versions: Available in three distinct forms – Gemini Ultra, Gemini Pro, and Gemini Nano, each tailored for specific performance and application needs.

Features

  • Multimodal Integration: Unlike traditional models that stitch together separate components for different data types, Gemini is inherently multimodal, trained across various data types from inception.
  • Flexible Deployment: Designed for efficient operation across a range of platforms, from large data centers to mobile devices.
  • Advanced Reasoning and Coding: Exhibits superior capabilities in complex reasoning and coding tasks across multiple programming languages; a minimal access sketch follows this list.
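
To make the access path concrete, here is a minimal sketch of querying Gemini Pro from Python. It assumes Google’s `google-generativeai` SDK and the `gemini-pro` model name, the developer path Google has announced; treat the setup details as illustrative rather than definitive.

```python
# Minimal sketch: prompting Gemini Pro via Google's google-generativeai SDK.
# Assumes `pip install google-generativeai` and an API key from Google AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; substitute your own key

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Explain in two sentences what makes a natively multimodal model different."
)
print(response.text)
```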

Comparative Performance: Gemini vs. GPT-4V and OpenAI Whisper

  • Gemini Ultra vs. GPT-4V (multimodal benchmarks):
    • MMMU (Multi-discipline College-Level Reasoning Problems): Gemini Ultra achieves a 59.4% 0-shot pass rate, surpassing GPT-4V’s 56.8%.
    • VQAv2 (Natural Image Understanding): Gemini Ultra scores 77.8% in 0-shot performance, slightly ahead of GPT-4V’s 77.2%.
    • TextVQA (OCR on Natural Images): Gemini Ultra leads with an 82.3% 0-shot performance, compared to GPT-4V’s 78%.
    • DocVQA (Document Understanding): Gemini Ultra scores 90.9% in 0-shot performance, outperforming GPT-4V’s 88.4%.
    • Infographic VQA (Infographic Understanding): Gemini Ultra achieves an 80.3% 0-shot performance, higher than GPT-4V’s 75.1%.
    • MathVista (Mathematical Reasoning in Visual Contexts): Gemini Ultra scores 53% in 0-shot performance, surpassing GPT-4V’s 49.9%.
  • Gemini Pro vs. OpenAI Whisper (speech benchmarks):
    • CoVoST 2 (Automatic Speech Translation): Gemini Pro achieves a BLEU score of 40.1, significantly outperforming Whisper v2’s 29.1.
    • FLEURS (Automatic Speech Recognition): Gemini Pro shows a word error rate of 7.6%, markedly better than Whisper v3’s 17.6%.

Benefits

  • Enhanced Problem-Solving: Superior performance in benchmarks indicates a higher capability in complex problem-solving and knowledge application.
  • Diverse Applications: Applicable in various fields, including scientific research, coding, and multimedia content understanding.
  • Efficiency and Scalability: Optimized for performance across different hardware, making it suitable for both high-end and constrained environments.
  • Superior Multimodal Understanding: Gemini’s performance in benchmarks like MMMU, VQAv2, and TextVQA demonstrates its advanced capabilities in understanding and integrating multimodal data.
  • Enhanced Language and Speech Processing: Gemini Pro’s performance in CoVoST 2 and FLEURS indicates its proficiency in language translation and speech recognition tasks.

Other Technical Specifications

  • Training Infrastructure: Utilizes Google’s latest TPUs (v4 and v5e), ensuring efficient and scalable training processes.
  • Safety and Responsibility: Undergoes extensive safety evaluations, including bias and toxicity assessments, adhering to Google’s AI Principles.

Conclusion

  • Innovation Milestone: Gemini represents a significant leap in AI development, particularly in multimodal understanding and processing.
  • Future Potential: Its advanced capabilities and flexible deployment options position it as a key driver for future AI innovations across various sectors.
  • Commitment to Safety: Google’s focus on safety and responsibility in Gemini’s development aligns with broader industry needs for ethical AI advancement.

Stability AI - Zephyr 3B: Multi-Modal Transformer Design using DALL-E 3

Stability AI introduces the StableLM Zephyr 3B LLM

Introduction

  • Model Overview: StableLM Zephyr 3B is a new chat model and the latest iteration in the series of lightweight Large Language Models (LLMs) by Stability AI.
  • Inspiration and Extension: It extends the pre-existing StableLM 3B-4E1T model and is inspired by Hugging Face’s Zephyr 7B model.
  • Parameter Size: With 3 billion parameters, it efficiently caters to a wide range of text generation needs on edge devices.

Features

  • Training Methodology: The model underwent supervised fine-tuning on multiple instruction datasets, including UltraChat, MetaMathQA, the Evol Wizard dataset, and the Capybara dataset.
  • Direct Preference Optimization (DPO): Uses the DPO algorithm with the UltraFeedback dataset to align generated text with human preferences; a sketch of the DPO objective follows this list.
  • Efficient Size: Despite having 3 billion parameters, it maintains efficiency suitable for edge devices.
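
DPO skips the separate reward model of classic RLHF and optimizes the policy directly on preference pairs. A minimal PyTorch sketch of the objective, assuming per-response log-probabilities have already been summed over tokens for both the trained policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective on a batch of preference pairs.

    Inputs are summed log-probabilities of whole responses, shape (batch,).
    The loss rewards the policy for preferring the chosen response over the
    rejected one more strongly than the frozen reference model does.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log(sigmoid(beta * margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```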

Model Performance

  • Benchmarking: Demonstrates strong capabilities in generating contextually relevant, coherent, and linguistically accurate text, as measured on MT-Bench and AlpacaEval.
  • Performance Superiority: Surpasses several larger models tailored for similar use cases, demonstrating notable capability relative to its size.

Enabling Diverse Applications

  • Versatility: Suitable for various complex applications, including creative content creation, instructional design, and content personalization.
  • Efficiency: Retains a compact size (60% smaller than 7B models), enabling use on devices with limited computational power; a minimal loading sketch follows this list.
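
As an illustration of how little ceremony a 3B chat model needs, here is a hedged sketch of running it with Hugging Face transformers. The repo id `stabilityai/stablelm-zephyr-3b` and the chat-template usage are assumptions based on Stability AI’s usual release pattern; check the model card for exact requirements.

```python
# Sketch: chatting with StableLM Zephyr 3B via Hugging Face transformers.
# Repo id and loading flags are assumptions; consult the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-zephyr-3b"  # assumed Hugging Face location
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a haiku about small models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```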

Commercial Applications

  • Usage: Available for use in commercial products by arrangement with Stability AI, with contact information provided for interested parties.
  • Community Engagement: Updates and progress can be followed through newsletters, social media, and the Discord community.

Conclusion

  • Innovation in LLMs: StableLM Zephyr 3B represents a significant advancement in the field of LLMs, particularly in terms of efficiency and application versatility.
  • Future Potential: Its design and capabilities position it as a powerful tool for a wide range of text generation tasks, especially in environments with computational constraints.

Other AI News

  • More than 50 leading organizations launch ‘The AI Alliance’ to support open, safe, and responsible AI

The AI Alliance, co-launched by IBM and Meta, represents a significant collaboration in the field of artificial intelligence. This international community brings together a diverse range of organizations, including technology developers, researchers, and adopters, all dedicated to advancing open, safe, and responsible AI. The Alliance encompasses creators of AI tooling and benchmarking, educational institutions, hardware builders, and software framework champions. It also includes creators of widely used open models, highlighting a commitment to open innovation and responsible AI development.

Key members and collaborators of the AI Alliance span various sectors, including academic institutions like Cornell University and Dartmouth College, technology companies such as AMD and Dell Technologies, and research organizations like the National Science Foundation and NASA. This broad membership underscores the Alliance’s commitment to a multi-faceted approach to AI, addressing aspects like education, research, development, deployment, and governance. The Alliance aims to foster an environment where AI advancements are made through collaborative efforts, ensuring that AI benefits are accessible and ethically aligned.

The AI Alliance plans to initiate its work through member-driven working groups, focusing on major AI-related topics. It will also establish a governing board and technical oversight committee to advance project areas, set standards, and provide guidelines. This structure is designed to promote open science and innovation in AI, ensuring that developments in the field are safe, responsible, and beneficial to society at large. The Alliance’s approach reflects a growing recognition of the importance of collaborative and open efforts in the rapidly evolving field of AI.

  • Apple introduces MLX and MLX Data, new ML frameworks and libraries

Apple Inc. has recently introduced new machine learning frameworks and model libraries, MLX and MLX Data, indicating a significant shift towards generative AI applications. This move is somewhat unexpected, considering Apple’s traditionally cautious stance in AI technology. MLX, optimized for Apple Silicon chips, enables developers to create AI models that function efficiently on Apple devices like MacBooks. MLX Data serves as a complementary data loading framework. These developments suggest Apple’s growing interest in the generative AI space, which encompasses the creation of new content such as text, images, and videos.

The MLX framework, inspired by existing frameworks like PyTorch and JAX, is tailored for Apple hardware, offering seamless integration and shared (unified) memory utilization for faster model training. Developers have observed that MLX is capable of training complex AI models, potentially on par with the likes of Meta’s Llama and Stable Diffusion. This advancement hints at Apple’s potential entry into the increasingly popular generative AI sector, which has seen rapid expansion by major tech companies like Microsoft and Google.
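
For a feel of the API, here is a minimal sketch assuming the `mlx` package (`pip install mlx`) on an Apple Silicon machine. It shows the two traits the framework advertises: lazy computation on unified memory and JAX-style function transformations.

```python
import mlx.core as mx

# Arrays live in unified memory, visible to both CPU and GPU on Apple Silicon.
x = mx.random.normal((4, 3))
w = mx.random.normal((3, 1))
y = mx.random.normal((4, 1))

def loss_fn(w):
    # Simple least-squares loss.
    return mx.mean((x @ w - y) ** 2)

# JAX-style transformation: mx.grad returns a new function computing gradients.
grad_fn = mx.grad(loss_fn)

g = grad_fn(w)   # lazily builds the computation graph
mx.eval(g)       # forces evaluation and materializes the result
print(g.shape)   # (3, 1)
```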

While Apple has incorporated AI technology in its products for years, it has typically refrained from explicitly labeling it as AI, preferring terms like “machine learning.” Recent efforts, particularly since September, indicate an acceleration in AI development at Apple, focusing on evaluating foundational models for broader application. Despite the lack of specific details from Apple regarding its plans for MLX, MLX Data, and generative AI, industry analysts speculate that the company may introduce innovative generative AI features across its services and devices. However, given Apple’s focus on privacy, it is expected to thoroughly consider the ethical implications of these technologies.

  • AMD launches MI300X + MI300A accelerator chips and Ryzen 8040 Series processors for AI-based laptops

Advanced Micro Devices (AMD) has unveiled its latest innovations in AI hardware, introducing the AMD Instinct MI300X and MI300A accelerator chips for data center AI processing, and the AMD Ryzen 8040 Series processors for AI-based laptops. The announcement marks a significant expansion of AMD’s AI hardware capabilities, targeting both data center and personal computing applications.

The AMD Instinct MI300X and MI300A chips are designed to deliver high performance for AI workloads, with the MI300X offering industry-leading memory bandwidth and the MI300A combining CPU and GPU functionalities for efficient high-performance computing and AI tasks. These accelerators are expected to be adopted in large-scale cloud and enterprise deployments, with Microsoft Azure already announcing the new Azure ND MI300x v5 Virtual Machine series optimized for AI workloads and powered by AMD Instinct MI300X accelerators.

In the realm of personal computing, the AMD Ryzen 8040 Series processors, previously code-named Hawk Point, are set to enhance AI compute capabilities in laptops. These processors are accompanied by the Ryzen AI 1.0 Software, a software stack that allows developers to deploy AI-enhanced applications for Windows. AMD’s next-gen “Strix Point” CPUs, which will include the XDNA 2 architecture and are slated for release in 2024, promise a significant increase in AI compute performance, enabling new generative AI experiences on Windows PCs. The Ryzen 8040 Series processors are expected to be widely available from major OEMs starting in Q1 2024, signaling AMD’s commitment to advancing AI technology across various computing platforms.

  • AI infrastructure company VAST Data raises $118 million in Series E funding

VAST Data, a company specializing in AI infrastructure, has raised $118 million in a Series E funding round, led by Fidelity Management & Research Company. This latest investment, with contributions from New Enterprise Associates (NEA), BOND Capital, and Drive Capital, has significantly increased VAST Data’s valuation to $9.1 billion, nearly tripling its value since the 2021 Series D round. The company plans to use this capital to accelerate the development of its infrastructure, which aims to centralize data as a core component of system operations, enhancing technology, economics, social dynamics, and scientific research.

Founded in 2016, VAST Data has evolved into a unified data platform, integrating storage, database, and containerized compute engine services into a single, scalable software stack. This platform simplifies data management and processing, supporting rapid data capture, synthesis, and learning, which is crucial for next-generation AI-driven applications. VAST Data’s platform is already being used by major enterprises like Zoom, Allen Institute, and Pixar Animation Studios. With the launch of this platform, VAST Data’s cumulative software bookings surpassed $1 billion in the financial quarter ending September, demonstrating significant growth and a strong market position in the AI infrastructure sector.

  • SuperDuperDB releases version 0.1 of its open-source framework for AI-enabled enterprise databases

SuperDuperDB, an Intel Ignite portfolio company, has released version 0.1 of its open-source framework, designed to integrate AI capabilities directly into enterprise databases. This Python package framework enables users to incorporate machine learning models and AI application programming interfaces (APIs) with existing databases, allowing the construction of AI applications on top of them. The framework supports popular AI models and databases and has secured $1.75 million in early funding from Hetz.vc, Session.vc, and MongoDB’s venture capital arm. Timo Hagenow, CEO of SuperDuperDB, emphasizes the framework’s potential to bridge the gap between data storage systems and AI, simplifying the building and management of AI applications.

SuperDuperDB addresses the challenge of deploying powerful machine learning models and proprietary data in enterprise operations. Traditionally, developers have to navigate complex processes to bring these models into production, involving intricate pipelines and tools from the MLOps and DevOps ecosystems. SuperDuperDB simplifies this by enabling AI models to be brought directly to the database used by enterprises. The framework allows for scalable deployment of AI models and APIs, transforming databases into AI development and deployment environments. This approach supports not only standard machine learning models but also the latest generative AI models for various applications, including vector search. The product, though only a few months old, has already gained traction among major data ecosystem players like MongoDB, PostgreSQL, MySQL, and Snowflake, indicating its potential to significantly ease the building and deployment of AI applications in various sectors.
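
To make the pattern concrete, here is a hand-rolled sketch of what such a framework automates: enriching database records with model outputs in place, rather than exporting data into a separate pipeline. It deliberately uses pymongo and sentence-transformers directly; this illustrates the pattern, not SuperDuperDB’s own API.

```python
# Hand-rolled sketch of the "bring the model to the database" pattern that
# frameworks like SuperDuperDB automate. NOT SuperDuperDB's API.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
products = client["shop"]["products"]              # hypothetical collection

model = SentenceTransformer("all-MiniLM-L6-v2")

# Enrich every record lacking an embedding, in place, for later vector search.
for doc in products.find({"embedding": {"$exists": False}}):
    vector = model.encode(doc["description"]).tolist()
    products.update_one({"_id": doc["_id"]}, {"$set": {"embedding": vector}})
```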

  • Meta advances research with Emu Video, introduces ‘Reimagine’ in group chats on Messenger & Instagram, and adds Reels to Meta AI chats

Meta has made significant advancements in AI over the past year, introducing new AI experiences across its apps and devices, and opening access to its Llama family of large language models. The company has also published research breakthroughs like Emu Video and Emu Edit, which are expected to unlock new capabilities in its products next year. Meta AI, the company’s virtual assistant, has been enhanced to provide more detailed responses on mobile, more accurate summaries of search results, and a broader range of helpful responses. This assistant can be accessed in various ways, including messaging platforms and Ray-Ban Meta smart glasses. Meta AI is also being used behind the scenes to improve experiences on Facebook and Instagram, offering AI-generated post comment suggestions, community chat topic suggestions, and enhancing product copy in Shops.

Meta AI has introduced a new feature called ‘Reimagine’ in group chat on Messenger and Instagram, which allows users to generate and share images based on text prompts. This feature enables users to create and modify images in a fun and social way, enhancing the interaction in group chats. Additionally, Meta is rolling out Reels in Meta AI chats, allowing users to discover new content and experiences through more than just text. This feature can be used, for example, to recommend places to visit on a trip and share Reels of top sites, helping users decide on must-see attractions.

On Facebook, Meta AI is being used to enhance everyday experiences, making expression and discovery easier. The AI is exploring ways to help users create personalized greetings, edit Feed posts, draft introductions for Facebook Dating profiles, and set up new Groups. Meta is also testing AI-generated images on Facebook, such as converting images from landscape to portrait orientation for easier sharing to Stories. Additionally, Meta AI is being used to surface relevant information in Groups, suggest topics for new chats, improve search capabilities, and assist in Marketplace transactions. These advancements demonstrate Meta’s commitment to integrating AI into its platforms, enhancing user experiences, and exploring new ways AI can be used in social media and communication.

  • xAI files with the US SEC to raise $1 billion through an equity offering

Elon Musk’s AI startup, xAI, is planning to raise up to $1 billion through an equity offering, as indicated in a recent filing with the U.S. Securities and Exchange Commission. So far, the company has secured $134.7 million from the targeted $1 billion. This fundraising initiative comes at a time when AI startups are gaining significant attention, partly fueled by the success of OpenAI’s ChatGPT and its substantial funding from Microsoft Corp. However, there are growing regulatory concerns about AI’s potential role in spreading misinformation.

Musk, known for his critical stance on Big Tech’s AI strategies, particularly regarding censorship, launched xAI in July. He envisions xAI as a “maximum truth-seeking AI,” positioning it as a competitor to Google’s Bard and Microsoft’s Bing AI. Musk’s approach to AI development focuses on fostering a “maximally curious” AI rather than programming explicit moral guidelines. His involvement in the AI sector dates back to 2015 when he co-founded OpenAI, the creator of ChatGPT, but he stepped down from its board in 2018.

Last month, xAI introduced “Grok,” a chatbot designed to rival ChatGPT. Musk plans to integrate this new AI technology into his social media platform X and also offer it as a standalone application, as per his announcement in November.

  • Microsoft will soon enhance Copilot services by offering OpenAI’s GPT-4 Turbo model

Microsoft is set to enhance its Copilot service with several new features, including the integration of OpenAI’s latest models. Copilot will soon support the GPT-4 Turbo model, which offers a larger context window of 128K tokens, allowing for a better understanding of queries and improved responses. This update is currently being tested with select users and is expected to be widely integrated into Copilot in the coming weeks. Additionally, Microsoft has updated the DALL-E 3 model in Bing Image Creator and Copilot, enabling the creation of higher quality and more accurate images in response to prompts.
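
Copilot’s integration is Microsoft’s own, but the same model is already reachable through OpenAI’s public API. A minimal sketch using the openai Python SDK and the GPT-4 Turbo preview model id (`gpt-4-1106-preview`), whose 128K-token window fits documents that would overflow the older 8K and 32K GPT-4 limits:

```python
# Sketch: long-document summarization with GPT-4 Turbo via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

long_document = open("report.txt").read()  # far larger inputs fit in 128K tokens

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # GPT-4 Turbo preview, 128K-token context
    messages=[
        {"role": "system", "content": "You summarize documents accurately."},
        {"role": "user", "content": long_document + "\n\nSummarize the key points."},
    ],
)
print(response.choices[0].message.content)
```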

The Microsoft Edge browser, which includes a Copilot sidebar, is also receiving updates. It will now have the capability to compose text within websites’ text inputs, allowing users to rewrite sentences inline. Furthermore, Copilot in Microsoft Edge can now summarize videos watched on YouTube, enhancing the user experience with AI-powered content interpretation.

For coders and developers, a new code interpreter feature is being introduced to Copilot. This feature will enable Copilot to write code in response to complex, natural-language requests, run the code in a sandboxed environment, and use the results to provide higher quality responses. Users will also have the ability to upload and download files to and from Copilot, facilitating work with personal data and code as well as Bing search results. On the Bing side, Microsoft is adding a “Deep Search” feature that leverages the power of GPT-4 to optimize search results for complex topics, expanding search queries into more comprehensive descriptions for more relevant results.

  • Leonardo.Ai announces $31 million USD capital raise

Leonardo.Ai, a Sydney, Australia-based generative AI startup, has announced a significant funding round of $31 million USD. The investment was led by a group of investors including Blackbird, Side Stage Ventures, Smash Capital, TIRTA Ventures, Gaorong Capital, and Samsung Next. Since its founding last year, Leonardo.Ai has rapidly grown, reaching seven million users who have generated over 700 million images. The company recently launched an enterprise version of its platform, which includes collaboration tools, private cloud hosting, and API access for building tech infrastructure on top of Leonardo.Ai’s platform.

The platform is designed for creative industries such as gaming, advertising, fashion, and architecture, allowing users to save, edit, and build multiple assets in the same style for reuse. Users can also build and train their own models for image generation. Leonardo.Ai’s versatility is showcased in various use cases, including storyboards for video production and mockups of gaming characters. CEO and co-founder J.J. Fiasson’s interest in generative AI was sparked by Google Deep Dream and further developed at his previous startup, Raini Studios.

Leonardo.Ai stands out from other generative AI art platforms by offering users a high degree of control over the content creation process. One of its key features, Live Canvas, allows users to input a text prompt and make a sketch, with the platform generating a photorealistic image based on both text and sketch prompts in real time. The new funding will enable Leonardo.Ai to expand its sales and marketing team, scale its enterprise product, and build its engineering team, focusing on enhancing the platform’s utility and control for various creative applications.

  • Meta AI releases new AI models to enable cross-lingual communication in real-time: SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2

Meta AI researchers have developed a new suite of artificial intelligence models called Seamless Communication, designed to facilitate more natural and authentic communication across languages. The flagship model, Seamless, combines the capabilities of three other models — SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2 — into a unified system. This model is the first publicly available system that enables expressive cross-lingual communication in real-time. SeamlessExpressive focuses on preserving the vocal style and emotional nuances of the speaker’s voice during translation, while SeamlessStreaming enables near real-time translation with minimal latency across nearly 100 spoken and written languages. The SeamlessM4T v2 model serves as the foundation for the other two, offering improved consistency between text and speech output.

The potential applications of these models are vast, ranging from real-time multilingual conversations using smart glasses to automatically dubbed videos and podcasts. They could significantly reduce language barriers, aiding immigrants and others who face communication challenges. However, the researchers also acknowledge the potential for misuse in voice phishing scams and deepfakes. To promote responsible use, they have implemented safety measures like audio watermarking and techniques to reduce toxic outputs. In line with Meta’s commitment to open research, the Seamless Communication models have been publicly released on Hugging Face and GitHub, allowing researchers and developers to build upon this work and further bridge multilingual connections in a globally interconnected world.
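
Since the models are public, trying them is a short script. Here is a hedged sketch of English-to-French speech generation via the Hugging Face transformers integration; the class name SeamlessM4Tv2Model and the checkpoint `facebook/seamless-m4t-v2-large` assume a recent transformers release that ships this architecture.

```python
# Sketch: English text to French speech with SeamlessM4T v2 via transformers.
# Assumes a transformers version that includes SeamlessM4Tv2Model.
from transformers import AutoProcessor, SeamlessM4Tv2Model

checkpoint = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4Tv2Model.from_pretrained(checkpoint)

inputs = processor(text="Hello, how are you today?", src_lang="eng",
                   return_tensors="pt")
# generate() returns a waveform when the target is speech; tgt_lang picks French.
audio = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
print(audio.shape)  # 1-D waveform at the model's output sampling rate
```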

  • Visual Electric takes on Midjourney and DALL-E 3 with an infinite virtual canvas approach to help enhance creativity

Visual Electric, a San Francisco-based startup, has launched a new approach to AI art generation, aiming to move beyond the chat interface common in services like Midjourney or OpenAI’s DALL-E. Founded by Colin Dunn, Visual Electric emerges from stealth mode with a seed round from Sequoia, BoxGroup, and Designer Fund. The platform targets professional users such as independent designers and in-house design teams at major brands. Unlike existing AI art generators that use a linear, chat-based interface, Visual Electric offers an infinite virtual canvas where users can generate, drag, and move images around, allowing side-by-side comparison and a more nonlinear creative process. The prompt bar, positioned at the top of the screen, also offers autocomplete suggestions, similar to Google search, to assist users in generating desired images.

Visual Electric differentiates itself by providing additional tools for modifying prompts and styles, including pre-set styles and options to specify image aspect ratios and dominant colors. Users can also “remix” or regenerate images based on initial prompts, or use a digital brush to “touch up” selected portions of an image, allowing for more precise control over AI-generated content. The platform includes a built-in upscaler for enhancing image resolution and detail. Visual Electric offers a free plan with limited daily generations, a standard plan with unlimited generations and commercial usage license, and a pro plan with higher resolution images and private storage options. The platform also features an “Inspiration” feed, similar to Pinterest, where users can view and remix AI-generated images made by others, fostering a community-driven approach to AI art creation.

  • Chinese startup DeepSeek launches conversational chatbot to compete with ChatGPT

DeepSeek AI, a Chinese startup, has launched DeepSeek Chat, a new conversational AI offering designed to compete with ChatGPT. DeepSeek Chat utilizes 7B and 67B-parameter DeepSeek Large Language Models (LLMs), trained on a dataset of 2 trillion tokens in both English and Chinese. These models have shown strong performance in various evaluations, including coding and mathematics, and in some cases, have outperformed Meta’s Llama 2-70B model. DeepSeek has open-sourced both the base and instruction-tuned versions of these models, aiming to encourage further research and commercial use within academic and commercial communities.

DeepSeek Chat is accessible via a web interface and currently only offers the 67B version. The models use an auto-regressive transformer decoder architecture, similar to Llama, but with different attention designs that affect inference: the 7B model employs multi-head attention, while the 67B model uses grouped-query attention. Benchmarks have demonstrated the 67B model’s superior capabilities in reasoning, coding, math, and Chinese comprehension. However, there have been concerns about censorship, as some users reported automatic redaction of responses related to China. The launch of the DeepSeek LLMs represents a significant step in China’s AI development, expanding the country’s offerings to cover all popular model sizes and serving a broad spectrum of end users.
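
The difference between those two attention variants is simply how many key/value heads are kept. A minimal PyTorch sketch of grouped-query attention, in which several query heads share one key/value head (multi-head attention is the special case where the KV head count equals the query head count), shrinking the KV cache at inference time:

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim).
    Each group of n_heads // n_kv_heads query heads shares one KV head."""
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # broadcast KV heads across groups
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 KV heads (a 4x smaller KV cache).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```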

  • Nous Research releases Nous Hermes 2 Vision, a vision-language model available now on Hugging Face

Nous Research, a private applied research group, has released Nous Hermes 2 Vision, a lightweight vision-language model available on Hugging Face. This open-source model, building on the company’s previous OpenHermes-2.5-Mistral-7B model, introduces vision capabilities, allowing users to prompt with images and extract text information from visual content. However, shortly after launch the model exhibited issues such as hallucinations, which led to glitches and prompted its renaming to Hermes 2 Vision Alpha. Nous Research plans to release a more stable version in the future.

Nous Hermes 2 Vision Alpha is designed to navigate complex human discourse, combining visual information with learned data to provide detailed answers in natural language. For example, it can analyze an image of a burger and determine its health implications. The model differentiates itself by using SigLIP-400M, a streamlined architecture that makes it more lightweight and enhances performance on vision-language tasks. It has also been trained on a custom dataset enriched with function calling, enabling users to extract written information from images. Despite its innovative approach, early usage has revealed significant issues with hallucinations and spamming EOS tokens, indicating that the model is still in its developmental stage and requires further refinement.

  • X starts rolling out xAI’s Grok chatbot to its Premium+ subscribers in the US

Elon Musk’s AI startup, xAI, is introducing a new AI chatbot named Grok, targeting Premium+ subscribers of the social media platform X. This announcement, made via a post on X, follows Musk’s previous statement about Grok’s availability post-early beta testing. Amidst a shift in advertiser focus away from the microblogging platform, Musk is emphasizing a strategic pivot towards subscription-based revenue. His vision includes transforming X into a multifunctional “super app” that encompasses messaging, social networking, and peer-to-peer payment services.

Musk founded xAI in July as a counter to the AI initiatives of major tech companies, voicing concerns over excessive censorship and inadequate safety measures in existing AI technologies. This move places xAI in direct competition with leading tech firms like Microsoft and Google, who are also rapidly developing AI-powered products in response to the widespread interest generated by OpenAI’s ChatGPT. Musk, a co-founder of OpenAI in 2015, distanced himself from the organization in 2018 by stepping down from its board.

About The Author

Bogdan Iancu

Bogdan Iancu is a seasoned entrepreneur and strategic leader with over 25 years of experience in diverse industrial and commercial fields. His passion for AI, Machine Learning, and Generative AI is underpinned by a deep understanding of advanced calculus, enabling him to leverage these technologies to drive innovation and growth. As a Non-Executive Director, Bogdan brings a wealth of experience and a unique perspective to the boardroom, contributing to robust strategic decisions. With a proven track record of assisting clients worldwide, Bogdan is committed to harnessing the power of AI to transform businesses and create sustainable growth in the digital age.