Download the model files: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main

Download llama.cpp and the prebuilt binaries: https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/llama.cpp/releases

Run the conversion under Python 3.8:

python -m pip install -r llama.cpp/requirements/requirements-convert.txt
# Updated in https://github.com/ggerganov/llama.cpp/pull/6920
# python llama.cpp/convert.py llama-3-8B-instruct/ --outfile llama-3-8b.gguf --outtype q5_k_m --vocab-type bpe
(cd llama.cpp; python convert-hf-to-gguf.py models/llama-3-8B-instruct --outfile ../llama-3-8b.gguf)
llama-b2776-bin-win-clblast-x64/quantize.exe llama-3-8b.gguf .\llama-3-8b-q4_k_m.gguf q4_k_m
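
A quick header check on the converted or quantized file can confirm it is a valid GGUF before loading it; a minimal sketch (file name assumed from the steps above) that reads the standard GGUF header layout:

# sketch: read the GGUF header (magic "GGUF", version, tensor/metadata counts)
import struct, sys

def check_gguf(path="llama-3-8b-q4_k_m.gguf"):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            sys.exit(f"{path}: not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    print(f"{path}: GGUF v{version}, {tensor_count} tensors, {kv_count} metadata keys")

check_gguf()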

llama-b2776-bin-win-openblas-x64/main --model llama-3-8b-q4_k_m.gguf --color --interactive --multiline-input --reverse-prompt "User:" --file llama.cpp/prompts/dan.txt
 
llama-b2776-bin-win-openblas-x64/main --model llama-3-8b-q5_k_m.gguf --temp 0.2 --in-prefix '<|start_header_id|>user<|end_header_id|>
 
' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>
 
' --prompt "<|begin_of_text|><|start_header_id|>system<|end_header_id|>
 
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
 
Hello?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
 
" --color --interactive --multiline-input --ctx-size 2048
 
llama-b2776-bin-win-openblas-x64/server --model llama-3-8b-q4_k_m.gguf --embeddings --ctx-size 2048 --host 0.0.0.0 --port 9870 --chat-template llama3
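
Once the server is running, it can be exercised over HTTP. A minimal client sketch, assuming the port above, the OpenAI-compatible /v1/chat/completions route, and the embedding route enabled by --embeddings (exact response shapes have varied across builds, so treat the field access as illustrative):

# sketch: query the llama.cpp server started above (needs: pip install requests)
import requests

chat = requests.post(
    "http://localhost:9870/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello?"},
        ],
        "temperature": 0.2,
    },
)
print(chat.json()["choices"][0]["message"]["content"])

# --embeddings exposes an embedding endpoint; print the raw JSON to inspect its shape
emb = requests.post("http://localhost:9870/embedding", json={"content": "Hello?"})
print(emb.json())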
I am designing a RAG algorithm with an LLM, but the LLM always returns many more words than I need, which hurts the dataset metrics. For example, if asked "Were Tim and Jimmy of the same nationality?", the expected answer in the dataset is "Yes", but the LLM returns "Yes, they are all from America." This makes both the Exact Match and F1 scores low.
Following the OpenAI Prompt Engineering Guide, help me optimize the prompt so the LLM performs better.
Here's my original prompt:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Answers should be concise, up to five words. For example, if asked, "What type of district is southern California home to many of?", your answer can be "South Coast Metro". Do NOT repeat the question or provide any information unrelated to the question. Keep the answer concise.
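
For context, this is roughly how SQuAD-style Exact Match and token-level F1 are computed (my own sketch, not the dataset's official scorer); it shows why every extra word in the prediction drags both scores down:

# sketch: SQuAD-style EM / F1 to illustrate why verbose answers score poorly
import re, string
from collections import Counter

def normalize(s):
    s = "".join(c for c in s.lower() if c not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes, they are all from America.", "Yes"))  # 0.0
print(f1("Yes, they are all from America.", "Yes"))           # ~0.29: extra tokens hurt precision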
\
I have an Intel Core i5-12400 CPU. Here are its flags:
"""
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pdcm sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
"""
I have no NVIDIA or AMD GPU, but the i5-12400 does have an integrated GPU. To get the best performance, which version should I use under Windows, and why?
cudart-llama-bin-win-cu11.7.1-x64.zip
cudart-llama-bin-win-cu12.2.0-x64.zip
llama-b2776-bin-macos-arm64.zip
llama-b2776-bin-macos-x64.zip
llama-b2776-bin-ubuntu-x64.zip
llama-b2776-bin-win-arm64-x64.zip
llama-b2776-bin-win-avx-x64.zip
llama-b2776-bin-win-avx2-x64.zip
llama-b2776-bin-win-avx512-x64.zip
llama-b2776-bin-win-clblast-x64.zip
llama-b2776-bin-win-cuda-cu11.7.1-x64.zip
llama-b2776-bin-win-cuda-cu12.2.0-x64.zip
llama-b2776-bin-win-kompute-x64.zip
llama-b2776-bin-win-noavx-x64.zip
llama-b2776-bin-win-openblas-x64.zip
llama-b2776-bin-win-sycl-x64.zip
llama-b2776-bin-win-vulkan-x64.zip
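
As a rough way to map the flag list above onto the CPU-only builds in this release, a small heuristic sketch (my own helper, not an official selector; it ignores the iGPU-accelerated clblast/sycl/vulkan builds):

# sketch: pick a CPU-only Windows build from the CPU flag string
def pick_cpu_build(flags, tag="b2776"):
    f = set(flags.split())
    if "avx512f" in f:
        suffix = "avx512"
    elif "avx2" in f:
        suffix = "avx2"
    elif "avx" in f:
        suffix = "avx"
    else:
        suffix = "noavx"
    return f"llama-{tag}-bin-win-{suffix}-x64.zip"

# the i5-12400 flag list above contains avx and avx2 but no avx512f
print(pick_cpu_build("sse4_1 sse4_2 avx avx2 fma"))  # llama-b2776-bin-win-avx2-x64.zip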
\