We are looking for a founding product designer to own the design of our front-end platform and public benchmark. We expect you to have high autonomy, a strong visual aesthetic, and the dynamism needed to design the UX/UI of an enterprise platform quickly.
You will be responsible for two primary areas.
(1) Designing our evaluation platform, which companies use to produce internal benchmarks. This is the interface subject matter experts use to perform their review and then automate the evaluation for subsequent runs.
(2) Fleshing out and improving the design of our public benchmark, available at https://www.vals.ai. The initial version of this site has garnered some attention. There are significant ways we can improve the public reports, making them the standard for evaluating LLM performance on enterprise tasks.
It’s worth highlighting that in this role you are a founding team member, not an employee. This means you have ownership in the company and product direction. You don’t take orders, you engage in discussions. We are growing quickly and you will help hire early team members to support your work.
Measuring model ability is the most challenging part of creating applications capable of automating any given part of the economy. There are no good techniques or benchmarks for evaluating LLM performance on business-relevant tasks, so adoption in enterprise production settings has been limited (see Wittgenstein’s ruler).
This problem materializes in each place where LLMs have potential: for companies building AI tools, in understanding whether their product will satisfy customer demand; for enterprises, in determining how suitable models and vendors are when making purchasing decisions; and for researchers, who need a north star toward which to expand model ability.
Today, answering these questions amounts to hiring a human review team to manually evaluate model outputs. This is prohibitively expensive and slow.
Vals AI is building the enterprise benchmark of LLMs and LLM apps on real-world business tasks. In doing so, we are creating the infrastructure and certification to automatically audit LLM applications, verifying they are ready for consumption.
See our benchmarks and launch announcement in Bloomberg. We aim to build the barometer for whether AI is useful, and in doing so, accelerate the automation of all knowledge work.
Our core technology enables us to review and automatically audit LLM applications in high-value industries (legal, insurance, finance, healthcare). With this and our own data, we maintain a public benchmark of the major LLMs on enterprise tasks. Our success will be based on three components:
To achieve each of these, we are looking for machine learning engineers (Head of AI, Members of Technical Staff) to develop novel evaluation techniques, strong designers and front-end engineers (Founding Product Engineer) to contribute to the platform, and a tenacious operator to write reports and maintain our social media (email rayan@vals.ai if this is of interest).
Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a 5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work.
Tech stack: Our front end is built in React with TSX. We use Django as our back-end framework. All of the infra is on AWS.
Know someone who would be a good fit? Connect them with rayan@vals.ai. If we hire them and they stay on for 90 days, you’ll get a $10,000 referral bonus and Vals AI merch!