“AI” - Things You Might Not Know | LangChain & Hugging Face, eh?
In recent times, the global conversation around technology has been dominated by generative AI, ChatGPT, and how other AI models and platforms can be used for business cases and products.
AI indeed brings never-ending innovation, but it also introduces new engineering challenges. One of them is the efficient processing of data in applications that use large language models (LLMs), generative AI, and semantic search.
In this article, I will try to dig into some of the technologies that make AI what it is, rather than just scratching the surface.
I remember signing up for the OpenAI beta API way back in late 2020, when OpenAI was still unknown to most of us. My initial use case for OpenAI was language translation from Japanese to English and vice versa; however, I didn’t explore it much because of job and location changes.
Now the AI market has become overcrowded, and you will hear about technologies like ChatGPT, DALL·E, OpenAI, etc.
Some of the other hot open-source alternatives to OpenAI are LLaMA, Llama 2, Falcon, and H2O.
I have recently been working on a few ideas around OpenAI that I will soon publish as an open-source contribution. Also, stay tuned for my other articles around Llama 2 and H2O 😊
Let’s understand some of the important keywords in the AI world.
Huggingface
It’s a one-stop platform for all things machine learning and data science. It’s an amazing community-driven platform that guides users in creating, training, and deploying their very own machine learning models. Additionally, it offers all the necessary tools to showcase, run, and implement artificial intelligence in real-life applications.
Consider Hugging Face the GitHub of machine learning models, one that also enables users to test and deploy their AI models.
You can sign up and collaborate for free on the HF Hub; however, they do have a Pro account for better support and collaboration. You can also explore other plans for auto-training and deploying AI models.
It’s basically your one-stop-shop for all things AI!
Ideation — My idea for combining Infosec with AI
I have been thinking about an idea combining a few infosec use cases with OpenAI and stumbled upon something called “LangChain.”
I was able to successfully develop an application using Python, Flask, OpenAI, and LangChain that allows me to retrieve data shapes, along with other data analysis, simply using the application’s chat feature. In short, this application empowered me to get data insights by asking simple English questions from the chat interface.
New Product (Data + AI) = New Product Offering :-)
I will talk about my product idea in a different article; for now, let’s move on and understand more about LangChain and other cool concepts.
LangChain
It is an open-source Python and JS/TS framework that allows AI developers to combine LLMs like GPT with external sources of computation and with your own custom data.
LangChain allows you to use your own documents as a reference database for the available LLMs, and you can also programmatically define actions like sending an email.
It takes the documents that the LLM needs to reference and slices them into smaller chunks, which are then stored in a vector database (VectorStore: explained below).
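The slicing step can be sketched in plain Python. This is a toy re-implementation of the idea (fixed-size chunks with some overlap), not LangChain’s actual splitter; the function name and the chunk sizes below are my own assumptions purely for illustration.

```python
def split_into_chunks(text, chunk_size=120, overlap=30):
    """Slice a document into overlapping chunks, a toy version of what
    LangChain-style text splitters do before the chunks are embedded."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # overlap keeps context across boundaries
    return chunks

doc = "LangChain slices your documents into smaller pieces " * 10
pieces = split_into_chunks(doc)
print(len(pieces), "chunks, first chunk length:", len(pieces[0]))
```

Each chunk is later embedded and stored, so a question only needs to pull back the few chunks most relevant to it instead of the whole document.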
FYI, my application works flawlessly; the only challenge is that you can’t use more than 60,000 tokens when making OpenAI API calls.
In the process of building the LangChain application, I got curious about data processing and the way it feeds my data to the LLM. So, I started going through the LangChain documentation and came across something new called a “vector database.”
Upon further reading, I came across a company called “Pinecone” that specializes in providing vector databases tailored for machine learning applications. I also got to know that Google’s and other semantic search engines rely on a similar concept: vector embeddings stored in such vector databases.
It’s time to get some understanding of these technologies.
Vector Database
A “vector database” for machine learning and data retrieval is a special kind of storage space optimised for storing and querying the vectors and embeddings that represent things. Vectors are numerical representations of objects or data points, often used in machine learning for tasks like similarity search, recommendations, and natural language processing.
Vector databases are designed to efficiently handle operations like nearest neighbour search, similarity search, and indexing of high-dimensional data. They are crucial in applications where finding similar items or objects based on their numerical representation is a key requirement. (Paraphrased from https://www.pinecone.io)
Vector Embeddings?
It refers to the process of converting data points or objects into numerical vectors in a continuous space. These vectors are designed to capture meaningful relationships or similarities between the original data points.
A vector embedding involves representing an item, such as an orange, by assigning numerical values to its distinctive characteristics. These values are organized in a sequence, resulting in what is known as a vector.
This numerical representation encapsulates the essential attributes of the item in a format that can be processed and analyzed by mathematical algorithms.
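As a toy illustration of the orange example above: the feature names and numbers below are entirely made up, purely to show the idea of turning characteristics into a vector.

```python
# Hypothetical features: [sweetness, acidity, roundness, peel_thickness]
orange = [0.8, 0.6, 0.95, 0.7]
lemon  = [0.2, 0.9, 0.80, 0.5]

# Each position holds a numeric value for one characteristic, so
# mathematical algorithms can compare the two fruits position by position.
difference = [abs(a - b) for a, b in zip(orange, lemon)]
print(difference)
```

The biggest per-feature gap (sweetness) is exactly where our intuition says an orange and a lemon differ most; real embeddings capture the same idea with hundreds of learned, unlabelled dimensions.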
Some of the techniques used to turn text content into an embedding are the following:
- Word2Vec
- GloVe
- BERT
- GPT
- fastText
In order to store all of the above vector embeddings you need a specialized database, and to obtain results from these embeddings you would use “cosine similarity.”
Given the computational requirements, it’s not easy for normal databases to retrieve results from millions or even billions of records, and that’s where a “vector database” comes in, enabling faster search and optimized storage.
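A minimal sketch of that retrieval step: brute-force cosine similarity over a handful of stored vectors. A real vector database replaces this linear scan with approximate nearest-neighbour indexes; the document names and embedding numbers here are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A tiny in-memory "vector store": id -> embedding (made-up numbers).
store = {
    "doc_fruit":   [0.9, 0.1, 0.0],
    "doc_network": [0.1, 0.9, 0.2],
    "doc_random":  [0.3, 0.3, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend this embeds the user's question
best = max(store, key=lambda k: cosine_similarity(query, store[k]))
print("closest document:", best)
```

With millions of vectors this linear scan becomes the bottleneck, which is exactly the gap that vector databases fill with specialized indexing.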
Semantic Search?
It involves understanding the context and meaning behind a search query to provide more accurate and relevant results. It goes beyond simple keyword matching, taking into account the user’s intent and the relationships between words. It also uses the concept of embeddings, which provide the numerical representation of text, just as we discussed above.
Let’s search for “Blue Cat Cable” and then for “Blue Cat”. Keyword matching treats the two as nearly the same query, but semantic search picks up the intent: the first is likely looking for a networking product, the second for an animal.
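To make the “Blue Cat Cable” vs “Blue Cat” contrast concrete, here is a toy sketch. Both queries share keywords, but hand-made embeddings (my own invented numbers, with the two dimensions labelled roughly as “animal-ness” and “product-ness”) pull them toward different results.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings: [animal-ness, networking-product-ness]
docs = {
    "Ethernet patch cable, blue": [0.05, 0.95],
    "Blue-grey cat breeds":       [0.95, 0.05],
}
queries = {
    "Blue Cat Cable": [0.10, 0.90],  # intent: a product
    "Blue Cat":       [0.90, 0.10],  # intent: an animal
}

for q, qvec in queries.items():
    best_doc = max(docs, key=lambda d: cosine(qvec, docs[d]))
    print(q, "->", best_doc)
```

In a real system these vectors come from an embedding model rather than being hand-written, but the ranking step is the same cosine comparison.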
Hope you find it all useful!
` No AI-generated content here; AI was used just for a few edits and proofreading :-) `
I’m a bit tired of writing and thinking and need a break!
So, I will pause here and return with a second part containing some practical hands-on exercises.
Thank you for taking the time to read my article. If you have any suggestions or feedback, I would love to hear them.
Feel free to reach out!
~AshishSecDev